MetadataGenerator
The MetadataGenerator step lets you use an OpenAI language model to generate arbitrary metadata for each chunk passed to the step. This can be useful for classification, intent detection, and extraction tasks.

You must register your OpenAI API key with OneContext to use this step.
Step Args
Key | Value Type | Value Description |
---|---|---|
`model` | literal | An OpenAI chat model. |
`template` | str | A Jinja2 template used to create the prompt. |
`variables` | dict[str, str] | Additional variables to fill in the template. Note these can be passed at runtime. |
`model_kwargs` | dict[str, Any] | Parameters to be passed to the OpenAI completion model (e.g. `temperature`, `stop`, ...). |
`parse_json` | bool | Whether to attempt to parse the LLM response into JSON. |
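To make these arguments concrete, here is a minimal sketch of the step in a pipeline definition, using the key names from the full example at the end of this page; the step name, model, variable, and prompt are illustrative placeholders:

```yaml
- step: MetadataGenerator
  name: tagger                  # results are stored under this step name
  step_args:
    model_name: gpt-4-turbo     # an OpenAI chat model
    parse_json: true            # attempt to parse the response into JSON
    variables: { "topic_list": "billing, support, sales" }
    model_kwargs:
      temperature: 0            # passed through to the OpenAI completion call
    template: |
      Classify the paragraph below into one of: {{ topic_list }}.
      {{ chunk.content }}
      Respond ONLY as JSON: {"topic": "..."}
  inputs: [ some_upstream_step ]  # e.g. a retriever or reranker step
```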
Results
Each chunk is passed through the generator separately. The results are stored in the chunk metadata under a key matching the user-defined step name.
Result Key | Value Type | Value Description |
---|---|---|
`llm_response` | str | The raw response from the LLM. |
`prompt` | str | The filled prompt passed to the LLM. |
`model_kwargs` | dict[str, Any] | Parameters passed to the OpenAI completion model (e.g. `temperature`, `stop`, ...). |
`json` | dict[str, Any] \| None | The parsed JSON if `parse_json` is true and the parse succeeded. |
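As an illustration (not actual output), after the `tagger` sketch above runs with `parse_json: true`, each chunk's metadata would gain an entry shaped roughly like this:

```yaml
tagger:                                  # key matches the user-defined step name
  llm_response: '{"topic": "billing"}'   # raw LLM output
  prompt: 'Classify the paragraph below into one of: billing, support, sales. ...'
  model_kwargs:
    temperature: 0
  json:                                  # present because parsing succeeded
    topic: billing
```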
Templates
Jinja2 templates are used to generate the prompts. You can learn more about the template language in the Jinja2 documentation.
Any values passed in `variables` will be passed to the template. This allows you to override the variables at runtime like any other pipeline field.
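For instance, here is a minimal sketch with an assumed variable named `audience` (any name works):

```yaml
step_args:
  variables: { "audience": "engineers" }
  template: |
    Write your answer for an audience of {{ audience }}.
```

At runtime you could supply a different value for `audience`, just like any other pipeline field.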
In addition to the variables passed by the user, the variable `chunk` is provided automatically for each prompt. You could therefore include the chunk content and metadata in the template string, as below:
```
This is the content of the chunk:
{{chunk.content}}
Here is a metadata tag:
{{chunk.metadata_json["some_tag"]}}
```
Example:
Here is an example of how you could use this step to further critique the relevance of each retrieved chunk in a retrieval pipeline. Note that we use the `{"response_format": {"type": "json_object"}}` OpenAI model parameter to constrain the output to JSON.
```yaml
steps:
  - step: SentenceTransformerEmbedder
    name: embedder
    step_args:
      model_name: BAAI/bge-base-en-v1.5
      query: placeholder
    inputs: [ ]
  - step: Retriever
    name: retriever
    step_args:
      vector_index_name: my_vector_index
      top_k: 100
      return_embeddings: false
    inputs: [ embedder ]
  - step: Reranker
    name: reranker
    step_args:
      query: placeholder
      model_name: BAAI/bge-reranker-base
      top_k: 10
    inputs: [ retriever ]
  - step: MetadataGenerator
    name: critic
    step_args:
      variables: { "query": "the user's query" }
      parse_json: true
      model_name: gpt-4-turbo
      model_kwargs:
        max_tokens: 10
        response_format:
          type: json_object
      template: |
        Determine if the candidate paragraph is specifically relevant
        to the user's query.
        Respond with a boolean true if it is specifically relevant,
        and false if it is not.
        ONLY respond true if you are CERTAIN that the paragraph of information
        is relevant to the user's query.
        The user's query is:
        {{ query }}
        and the candidate paragraph is:
        {{ chunk.content }}
        NOTE, your output should ONLY be in JSON format, using this schema:
        {"is_relevant": true or false}
    inputs: [ reranker ]
```
The resulting chunks will have the parsed JSON result stored in `chunk.metadata_json['critic']['json']['is_relevant']`.