HDBSCAN
The `HdbScan` step is another clustering step that can be used to assign a "topic" label to each chunk. This step also performs outlier detection, assigning an `outlier_score` to each chunk: the higher the score, the more likely it is that the chunk does not belong to any of the identified clusters. This awareness of noise in the data makes HDBSCAN more robust to outliers.
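One way to act on the score is to set aside chunks above a cut-off before further processing. The snippet below is only a sketch: the record layout (dictionaries with `id`, `topic`, and `outlier_score` keys), the helper `split_by_outlier_score`, and the threshold of 0.8 are assumptions made for illustration and are not part of the step's documented output format.

```python
def split_by_outlier_score(chunks, threshold):
    """Separate chunks into (kept, flagged) using their outlier_score.

    Hypothetical helper: chunks at or above the threshold are flagged
    as probable noise, the rest are kept.
    """
    kept = [c for c in chunks if c["outlier_score"] < threshold]
    flagged = [c for c in chunks if c["outlier_score"] >= threshold]
    return kept, flagged


# Made-up records for illustration only; real chunks may be shaped differently.
chunks = [
    {"id": "a", "topic": 0, "outlier_score": 0.05},
    {"id": "b", "topic": 0, "outlier_score": 0.12},
    {"id": "c", "topic": 1, "outlier_score": 0.91},
]

kept, flagged = split_by_outlier_score(chunks, threshold=0.8)
print("kept:", [c["id"] for c in kept])
print("flagged:", [c["id"] for c in flagged])
```

A sensible threshold depends on your data, so it is worth inspecting the distribution of scores before choosing one.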
The `HdbScan` step produces stable clustering results, and its parameters are more intuitive to set correctly for your data than those of other clustering algorithms.
The key parameter for this step is `min_cluster_size`, which controls the smallest cluster that will be considered. Higher values of `min_cluster_size` lead to fewer, larger clusters.
The other parameter is `min_samples`, which defaults to the value of `min_cluster_size`. `min_samples` determines how conservative the cluster selection is: the higher the value, the more points will be considered noise, i.e. a denser area of points will be required to identify a cluster.
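The effect of `min_samples` can be seen through the "core distance" that the HDBSCAN algorithm computes for every point: the distance to its `min_samples`-th nearest neighbour. The sketch below is a standalone illustration, not this step's implementation. The helper `core_distance` and the sample points are made up, and it assumes the point itself is not counted as a neighbour (some implementations do count it).

```python
import math


def core_distance(points, index, min_samples):
    """Distance from points[index] to its min_samples-th nearest neighbour.

    Assumption: the point itself is excluded from its own neighbours.
    """
    origin = points[index]
    distances = sorted(
        math.dist(origin, other)
        for j, other in enumerate(points)
        if j != index
    )
    return distances[min_samples - 1]


# Five tightly packed points plus one isolated point (illustrative data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (5.0, 5.0)]

for k in (1, 2, 4):
    dense = core_distance(points, 4, k)     # centre of the packed group
    isolated = core_distance(points, 5, k)  # the far-away point
    print(f"min_samples={k}: dense={dense:.2f}, isolated={isolated:.2f}")
```

The point in the packed group keeps a small core distance as `min_samples` grows, while the isolated point's core distance is large from the start and keeps increasing. Points with large core distances are effectively pushed away from everything else, which is why higher values leave more points labelled as noise.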
As in other clustering steps, you can use an OpenAI model to assign labels to the clusters.
Step Args
Key | Value Type | Value Description |
---|---|---|
min_cluster_size | int | The smallest number of points that can form a cluster. |
min_sample_size | int | How conservative the clustering is; higher values mean more points are considered noise. |
assign_labels | literal | Optional OpenAI model used to assign labels to the clusters. |