LexRank
The LexRank step computes a relevancy score for each chunk.
This is incredibly useful for downstream text-summarisation tasks. Most input sources of text contain a lot of redundant information, and the Relevancy node is a great way to efficiently remove this redundancy.
For example, if you filter your end results to only include chunks with a relevancy score above 0.8, you can:
- achieve better downstream performance as you are only including the most relevant information.
- save 80% on your downstream summarisation model's inference time and input-tokens cost! [^5]
How does it work?
Under the hood, we employ an optimised version of LexRank. We create a graph representation of the embeddings of your chunks, where the adjacency matrix of the graph is a connectivity matrix based on the pairwise cosine similarities between chunks. We then compute the "importance" of each chunk based on the concept of eigenvector centrality in this graph representation. (1)
We normalise the Relevancy score between 0 and 1, so that (a) it is comparable across different embedding models and documents, and (b) users can easily filter on Relevancy > 0.8, for example, and be confident this will still yield 20% of their content.
- We did not invent LexRank; for that, we have these guys to thank. We have built a very fast and efficient implementation of LexRank, which users can simply add as a node in their pipeline.
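To make the idea above concrete, here is a minimal sketch of textbook (continuous) LexRank over chunk embeddings, using NumPy. It is not the optimised implementation used by the LexRank step. The function names (`lexrank_scores`, `percentile_scores`), the `threshold` and `damping` defaults, and the rank-based way of mapping scores into the 0 to 1 range are all assumptions made for illustration.

```python
import numpy as np


def lexrank_scores(embeddings, threshold=0.1, damping=0.85, iters=100, tol=1e-6):
    """Illustrative continuous LexRank (hypothetical helper, not the product's code).

    Assumes every embedding is a non-zero vector.
    """
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length rows
    sim = X @ X.T                                     # pairwise cosine similarities
    # Keep only edges at or above the threshold; the diagonal (similarity 1)
    # always survives, so no row sums to zero.
    weights = np.where(sim >= threshold, sim, 0.0)
    P = weights / weights.sum(axis=1, keepdims=True)  # row-stochastic transitions
    n = len(X)
    scores = np.full(n, 1.0 / n)
    # Damped power iteration towards the dominant eigenvector (centrality).
    for _ in range(iters):
        new = (1.0 - damping) / n + damping * (P.T @ scores)
        converged = np.abs(new - scores).sum() < tol
        scores = new
        if converged:
            break
    return scores


def percentile_scores(scores):
    """Map raw scores to (0, 1] by rank; one possible normalisation (assumed)."""
    ranks = np.asarray(scores).argsort().argsort()    # 0 = lowest score
    return (ranks + 1) / len(ranks)


# Three similar chunks and one outlier (toy 2-D "embeddings").
toy = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
raw = lexrank_scores(toy)
pct = percentile_scores(raw)
```

In this toy input the last vector points in a different direction from the other three, so it ends up with the lowest centrality and the lowest percentile. With a rank-based normalisation like this one, a filter such as `pct > 0.8` keeps roughly the top 20% of chunks regardless of the embedding model.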
If you include a LexRank step, you will find the LexRank score and the LexRank percentile score assigned to each Chunk in the metadata_json field of that Chunk in the VectorDB. The two values are stored in their own dictionary in the metadata_json field, under the keys lexrank_score and lexrank_percentile_score respectively. The key of that dictionary is the name of the step. For a step named lexranker_file, each Chunk in the VectorDB will have
{"lexranker_file" : { "lexrank_percentile_score": x , "lexrank_score" : y }}
in its metadata_json field.
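As a rough illustration of how these fields could be used downstream, the sketch below filters a list of chunks on the percentile score. How chunks are fetched from the VectorDB is not shown here. The helper name `filter_by_lexrank`, the representation of a chunk as a plain dict, and the possibility that metadata_json arrives as a JSON string are all assumptions. Only the key names and the step name `lexranker_file` come from the layout described above.

```python
import json


def filter_by_lexrank(chunks, step_name="lexranker_file", min_percentile=0.8):
    """Keep chunks whose percentile score meets the cut-off (hypothetical helper)."""
    kept = []
    for chunk in chunks:
        meta = chunk["metadata_json"]
        if isinstance(meta, str):
            # Assumption: the field may be stored as a serialised JSON string.
            meta = json.loads(meta)
        step_scores = meta.get(step_name, {})
        # Chunks without a score for this step are treated as below the cut-off.
        if step_scores.get("lexrank_percentile_score", 0.0) >= min_percentile:
            kept.append(chunk)
    return kept


# Made-up example chunks, for illustration only.
example_chunks = [
    {"text": "a", "metadata_json": {"lexranker_file": {"lexrank_percentile_score": 0.95, "lexrank_score": 0.31}}},
    {"text": "b", "metadata_json": '{"lexranker_file": {"lexrank_percentile_score": 0.40, "lexrank_score": 0.12}}'},
    {"text": "c", "metadata_json": {}},
]
relevant = filter_by_lexrank(example_chunks)
```

Passing a different `step_name` lets the same helper work when the LexRank step in your pipeline has another name.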