Chunker
The `Chunker` step is used to split text into smaller 'chunks'.
Chunking is useful for two main reasons:

- Splitting your text content to match the input limits of your embedding model. Embedding models are trained on inputs up to a fixed maximum length; most sentence-transformers models, for example, accept at most a few hundred tokens at a time. These models perform best when passed chunks of a similar size to the inputs they were trained on.
- Searching over long-form documents. For example, you may want to search over a 2,000-page PDF document, which will not fit within your embedding model's input limit. You can use the `Chunker` step to automatically split the document into smaller chunks that fit into the embedding model of your choosing.
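To see this limit concretely, you can compare a document's token count against a model's maximum input length. The snippet below is only a sketch: the checkpoint name is a common sentence-transformers example and the file path is hypothetical; neither is prescribed by this page.

```python
from transformers import AutoTokenizer

# Example checkpoint only; any sentence-transformers model works the same way.
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

document = open("report.txt").read()  # hypothetical long document
n_tokens = len(tokenizer.encode(document))
print(f"{n_tokens} tokens vs. a model input limit of {tokenizer.model_max_length}")
```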
Step Args
Key | Value Type | Value Description |
---|---|---|
`chunk_size_words` | `int \| None` | the maximum number of words in each resulting chunk |
`chunk_size_tokens` | `int \| None` | the maximum number of tokens in each resulting chunk |
`chunk_overlap` | `int` | the overlap, in tokens or words, between adjacent chunks |
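To illustrate how these arguments relate, here is a minimal sketch. The dict-based configuration format is an assumption; only the three argument names come from the table above.

```python
# Minimal sketch of Chunker arguments (hypothetical configuration format).
chunker_args = {
    "chunk_size_words": 200,  # maximum words per chunk
    "chunk_overlap": 20,      # words shared between adjacent chunks
}

# chunk_size_words and chunk_size_tokens are mutually exclusive:
# exactly one of them should be set.
assert ("chunk_size_words" in chunker_args) != ("chunk_size_tokens" in chunker_args)
```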
The chunking strategy is recursive: it preserves as much signal from the natural partitions in the text as possible by packing each chunk with the largest semantic blocks that fit within the chunk size limit.
The semantic splitting levels, from lowest to highest, are as follows (a sketch of the packing logic follows the list):
- characters
- words
- sentences
- paragraphs: sets of sentences separated by one or more newlines. The number of consecutive newline characters defines a new splitting level, i.e. sets of sentences separated by two newlines form a higher level than sets separated by one newline, and so on.
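The following is a simplified sketch of this recursive packing logic, not the library's actual implementation. It tries the highest splitting level first and only descends a level when a block is too large on its own; overlap handling is omitted for brevity, and the separator list is an assumption.

```python
def recursive_chunk(text: str, max_words: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split text into chunks of at most max_words words, packing each chunk
    with the largest semantic blocks (paragraphs, then sentences, ...) that fit."""
    if len(text.split()) <= max_words:
        return [text] if text.strip() else []
    if not separators:
        # Lowest level in this sketch: hard-split on word boundaries.
        words = text.split()
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    sep, rest = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for block in text.split(sep):
        candidate = current + sep + block if current else block
        if len(candidate.split()) <= max_words:
            current = candidate                    # block still fits: keep packing
            continue
        if current:
            chunks.append(current)                 # emit the packed chunk
        if len(block.split()) <= max_words:
            current = block                        # start a new chunk with this block
        else:
            # This block alone is too big: descend one splitting level.
            chunks.extend(recursive_chunk(block, max_words, rest))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks
```

Called on a multi-paragraph document, this keeps whole paragraphs together wherever they fit and only falls back to sentence- or word-level splits for oversized blocks (the real chunker also descends to the character level).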
Note: you must set exactly one of `chunk_size_words` or `chunk_size_tokens`, not both.

The default chunking strategy is recursive.
Valid input steps: `Preprocessor`
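Putting it together, a pipeline runs a `Preprocessor` step followed by the `Chunker`. The list-of-dicts representation below is purely illustrative; only the step names, their order, and the argument names come from this page.

```python
# Illustrative pipeline sketch; the exact configuration format is an assumption.
pipeline = [
    {"step": "Preprocessor"},
    {"step": "Chunker", "args": {"chunk_size_tokens": 512, "chunk_overlap": 50}},
]
```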