SentenceTransformerEmbedder
The SentenceTransformerEmbedder step embeds your text chunks into a vector space.
Choice of model is critical
Your choice of model is very important! Some things to bear in mind:
- Different domains of training data. For example, some models are trained on Wikipedia, some on news articles, some on transcripts, and so on. Choose an embedding model trained on content from a domain similar to yours.
- Different languages. Obviously, be sure to choose an embedding model which has been trained on content in the same language as the content you are embedding.
- Different chunk sizes. Different embedding models are trained on different-sized chunks of text. Try to estimate the average size of your chunks in advance, and choose an embedding model accordingly.
    - N.B. Don't just choose the embedding model with the largest chunk size. Doing so will (a) waste computation, as your model will end up embedding a lot of padding tokens, (b) waste memory, as the embeddings stored in your vector DB will be much larger than they need to be, (c) waste inference time (read: a worse experience for your users), as you'll be searching in redundant dimensions, and (d) most importantly, hurt performance, because of the difference in distribution between the training data and your data.
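The chunk-size advice above can be sketched as a quick pre-check. This is a minimal illustration, not part of the step itself: the candidate names and token limits are hypothetical placeholders (not entries from the available models list), the whitespace word count is only a crude stand-in for a real tokenizer, and the `headroom` factor is an assumed allowance for subword tokenizers producing more tokens than words.

```python
from statistics import mean

# Hypothetical candidates: placeholder names and token limits,
# NOT entries from the available models list.
CANDIDATE_MAX_TOKENS = {
    "small-context-model": 128,
    "medium-context-model": 256,
    "large-context-model": 512,
}


def rough_token_count(text):
    """Whitespace word count, a crude stand-in for the model's tokenizer."""
    return len(text.split())


def pick_model(chunks, candidates=CANDIDATE_MAX_TOKENS, headroom=1.3):
    """Pick the candidate with the smallest limit that covers the average chunk.

    `headroom` (an assumed value) inflates the word-based average because
    subword tokenizers usually emit more tokens than there are words.
    """
    if not chunks:
        raise ValueError("need at least one chunk to estimate the average size")
    target = mean(rough_token_count(c) for c in chunks) * headroom
    fitting = {name: limit for name, limit in candidates.items() if limit >= target}
    if not fitting:
        # Nothing is large enough: fall back to the largest limit and
        # expect the longest chunks to be truncated.
        return max(candidates, key=candidates.get)
    # Smallest sufficient limit, rather than simply the largest available.
    return min(fitting, key=fitting.get)
```

For a more faithful estimate, replace `rough_token_count` with the token count produced by the tokenizer of the model you are considering.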
Step Args
Key | Value Type | Value Description |
---|---|---|
model | literal | Any model from our available models list. |
include_metadata | array | Metadata fields to include on the embedding. For example, to include the model and file_name metadata fields, set this to [model, file_name]. |
query | str or None | The query to embed. This argument is only required when the step is at the root of the pipeline, as in a query pipeline where it precedes the Retriever. |
Note: if the input chunks are longer than the maximum number of tokens of the selected model, they will be truncated to fit the required length.
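To make the effect of truncation concrete, here is a minimal sketch of what happens to an over-long chunk. It is an illustration only: it splits on whitespace instead of using the model's tokenizer, and `truncate_to_max_tokens` is a hypothetical helper, not a function exposed by this step.

```python
from typing import Tuple


def truncate_to_max_tokens(chunk: str, max_tokens: int) -> Tuple[str, int]:
    """Keep the first `max_tokens` tokens and report how many were dropped.

    Tokens here are whitespace-separated words, an assumption made for
    illustration; a real model counts subword tokens.
    """
    tokens = chunk.split()
    kept = tokens[:max_tokens]
    return " ".join(kept), len(tokens) - len(kept)


text, dropped = truncate_to_max_tokens("alpha beta gamma delta epsilon", 3)
print(text)     # alpha beta gamma
print(dropped)  # 2
```

Everything past the limit never reaches the embedding, so content at the end of a long chunk cannot be retrieved. Sizing your chunks to the model up front avoids this silent loss.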
Valid input steps: Preprocessor, Chunker