LexRank

The LexRank step computes a relevancy score for each chunk.

This is particularly useful for downstream text-summarisation tasks. Most text sources contain a lot of redundant information, and the Relevancy node is an efficient way to remove it.

For example, if you filter your end results to include only chunks with a relevancy score above 0.8, you can:

- achieve better downstream performance, as you only include the most relevant information.
- save roughly 80% on your downstream summarisation model's inference time and input-token cost! [^5]

How does it work?

Under the hood, we employ an optimised version of LexRank. We create a graph representation of the embeddings of your chunks, where the adjacency matrix of the graph is a connectivity matrix based on the pairwise cosine similarities between chunk embeddings. We then compute the "importance" of each chunk using eigenvector centrality in this graph.

We normalise the Relevancy score to lie between 0 and 1, so that (a) it is comparable across different embedding models and documents, and (b) users can filter on, for example, Relevancy > 0.8 and be confident that this still yields roughly 20% of their content.
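The steps above can be sketched in a few lines of NumPy. This is only an illustration of the general technique, not the optimised implementation used by the step: the function name `lexrank_scores`, the similarity `threshold`, the `damping` factor, and the way the percentile is computed are all assumptions made for the example.

```python
import numpy as np


def lexrank_scores(embeddings, threshold=0.1, damping=0.85, tol=1e-8, max_iter=500):
    """Illustrative LexRank over chunk embeddings (all parameters are assumptions)."""
    x = np.asarray(embeddings, dtype=float)
    n = x.shape[0]

    # Pairwise cosine similarity between chunk embeddings.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T

    # Adjacency matrix: keep only non-negative similarities at or above the threshold.
    adj = np.where(sim >= threshold, np.clip(sim, 0.0, None), 0.0)

    # Row-normalise into a stochastic matrix; isolated rows fall back to uniform.
    row_sums = adj.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    transition = np.where(row_sums > 0, adj / safe_sums, 1.0 / n)

    # Power iteration with damping approximates the eigenvector centrality.
    p = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        p_next = (1.0 - damping) / n + damping * (transition.T @ p)
        converged = np.abs(p_next - p).sum() < tol
        p = p_next
        if converged:
            break

    # Min-max normalise to [0, 1]; if all chunks tie, give every chunk 1.0.
    spread = p.max() - p.min()
    score = (p - p.min()) / spread if spread > 0 else np.ones(n)

    # Percentile: fraction of chunks whose centrality is at most this chunk's.
    percentile = (p[:, None] >= p[None, :]).mean(axis=1)
    return score, percentile
```

Calling `lexrank_scores` on an array of shape `(number_of_chunks, embedding_dimension)` returns two arrays of per-chunk values. Under the percentile definition assumed here, filtering on a percentile above 0.8 keeps about the top fifth of chunks regardless of how the raw centralities are distributed.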

We did not invent LexRank; for that, we have its original authors (Erkan and Radev, 2004) to thank. What we provide is a fast and efficient implementation of LexRank, which users can simply add as a node in their pipeline.

If you include a LexRank step, each Chunk is assigned a LexRank score and a LexRank percentile score, stored in the metadata_json field of the Chunk in the VectorDB. The two values sit in their own dictionary within metadata_json, under the keys lexrank_score and lexrank_percentile_score respectively, and that dictionary is keyed by the name of the step. For example, for a step named lexranker_file, each Chunk in the VectorDB will have {"lexranker_file" : { "lexrank_percentile_score": x , "lexrank_score" : y }} in its metadata_json field.
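As a rough sketch of how these fields might be used downstream, the snippet below filters chunks on their percentile score. It assumes the chunks have already been fetched from the VectorDB as Python dictionaries with a `metadata_json` entry (either a JSON string or an already-parsed dictionary), and that the percentile score is on a 0 to 1 scale. The helper name `filter_by_lexrank` and the chunk layout are hypothetical; only the key names come from the description above.

```python
import json


def filter_by_lexrank(chunks, step_name="lexranker_file", min_percentile=0.8):
    """Keep chunks whose LexRank percentile score is at least min_percentile.

    `chunks` is assumed to be a list of dicts with a "metadata_json" entry;
    this layout is an assumption for illustration, not a documented API.
    """
    kept = []
    for chunk in chunks:
        meta = chunk["metadata_json"]
        if isinstance(meta, str):
            # metadata_json may be stored as a JSON string; parse it if so.
            meta = json.loads(meta)
        # The scores live in a dictionary keyed by the step name.
        scores = meta.get(step_name, {})
        if scores.get("lexrank_percentile_score", 0.0) >= min_percentile:
            kept.append(chunk)
    return kept
```

Chunks that lack an entry for the given step name are treated as having a percentile of 0 and are dropped, which is a design choice of this sketch rather than documented behaviour.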