HDBSCAN

The HdbScan step is another clustering step that can be used to assign a "topic" label to each chunk. It also performs outlier detection by assigning an outlier_score to each chunk: the higher the score, the more likely it is that the chunk does not belong to any of the identified clusters. This awareness of noise in the data makes HDBSCAN more robust to outliers.
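As a rough illustration of how the outlier_score might be used downstream, the sketch below separates chunks that fit a cluster well from likely noise. The record layout, the `topic` field name, the helper `split_by_outlier_score`, and the 0.8 threshold are all assumptions made for this example; only outlier_score is named by the step itself.

```python
# Hypothetical chunk records as they might look after the HdbScan step.
# Only `outlier_score` comes from the step description; the rest is assumed.
chunks = [
    {"text": "How do I request a refund?", "topic": 0, "outlier_score": 0.05},
    {"text": "Refunds take five days.", "topic": 0, "outlier_score": 0.12},
    {"text": "asdf qwerty", "topic": -1, "outlier_score": 0.97},
]


def split_by_outlier_score(chunks, threshold):
    """Return (kept, noisy): chunks at or below the threshold, and the rest."""
    kept = [c for c in chunks if c["outlier_score"] <= threshold]
    noisy = [c for c in chunks if c["outlier_score"] > threshold]
    return kept, noisy


# The threshold is a judgment call; tune it against your own data.
kept, noisy = split_by_outlier_score(chunks, threshold=0.8)
print(len(kept), "chunks kept,", len(noisy), "flagged as likely noise")
```

A suitable threshold depends on the data, so it is worth inspecting the highest-scoring chunks before discarding anything.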

The HdbScan step produces stable clustering results, and its parameters are more intuitive to set correctly for your data than those of many other clustering algorithms.

The key parameter for this step is min_cluster_size, which sets the smallest group of points that will be treated as a cluster. Higher values of min_cluster_size lead to fewer, larger clusters.

The other parameter is min_samples, which defaults to the value of min_cluster_size. min_samples determines how conservative the cluster selection is: the higher the value, the more points will be considered noise, i.e. a denser area of points will be required to identify a cluster.
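To make the effect of min_samples more concrete, here is a minimal sketch of the core distance and mutual reachability distance used by the general HDBSCAN algorithm. It is not this step's implementation. The helper names and sample points are made up for illustration, and conventions differ on whether a point counts as its own neighbor (here it does not).

```python
import math


def core_distance(points, i, min_samples):
    """Distance from points[i] to its min_samples-th nearest other point."""
    dists = sorted(math.dist(points[i], p) for j, p in enumerate(points) if j != i)
    return dists[min_samples - 1]


def mutual_reachability(points, i, j, min_samples):
    """Distance HDBSCAN clusters on: sparse points are pushed further apart."""
    return max(
        core_distance(points, i, min_samples),
        core_distance(points, j, min_samples),
        math.dist(points[i], points[j]),
    )


# Four tightly packed points and one isolated point (illustrative data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]

for k in (1, 2, 3):
    cores = [round(core_distance(points, i, k), 2) for i in range(len(points))]
    print(f"min_samples={k}: core distances = {cores}")
```

Raising min_samples increases every point's core distance, so a point needs more close neighbors before it looks dense enough to join a cluster. The isolated point keeps a far larger core distance than the packed points at every setting, which is why it ends up treated as noise.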

As in other clustering steps, you can use an OpenAI model to assign labels to the clusters.

Step Args
| Key | Value Type | Value Description |
| --- | --- | --- |
| min_cluster_size | int | The smallest number of points that can form a cluster. |
| min_sample_size | int | How conservative the clustering is; higher values mean more points are considered noise. |
| assign_labels | literal | Optional OpenAI model used to assign labels to the clusters. |