MetadataGenerator

The MetadataGenerator step lets you use an OpenAI language model to generate arbitrary metadata for each chunk passed to the step. This can be useful for classification, intent detection and extraction tasks.

You must register your OpenAI API key with OneContext to use this step.

Step Args
| Key | Value Type | Value Description |
| --- | --- | --- |
| model | literal | An OpenAI chat model. |
| template | str | A Jinja2 template used to create the prompt. |
| variables | dict[str, str] | Additional variables to fill in the template. Note these can be passed at runtime. |
| model_kwargs | dict[str, Any] | Parameters to be passed to the OpenAI completion model (e.g. temperature, stop, ...). |
| parse_json | bool | Whether to attempt to parse the LLM response into JSON. |

Results

Each chunk is passed through the generator separately. The results are stored in the chunk metadata under a key matching the user-defined step name.

| Result Key | Value Type | Value Description |
| --- | --- | --- |
| llm_response | str | The raw response from the LLM. |
| prompt | str | The filled prompt passed to the LLM. |
| model_kwargs | dict[str, Any] | Parameters passed to the OpenAI completion model (e.g. temperature, stop, ...). |
| json | dict[str, Any] \| None | The parsed JSON if parse_json is true and the parse succeeded. |
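
For instance, if the step were named critic (as in the example below), you could read its results from a returned chunk roughly as follows. This is a sketch, assuming metadata_json behaves like a plain dict:

```python
result = chunk.metadata_json["critic"]

print(result["prompt"])        # the filled prompt passed to the LLM
print(result["llm_response"])  # the raw response from the LLM

if result["json"] is not None:  # populated only if parse_json succeeded
    print(result["json"])
```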

Templates

Jinja2 templates are used to generate the prompts. You can learn more about the template language in the Jinja2 documentation.

Any values supplied in variables are passed to the template. This allows you to override them at runtime like any other pipeline field.

In addition to the variables passed by the user, the variable chunk is provided automatically for each prompt. Therefore you could include the chunk content and metadata in the template string as below:

```jinja
This is the content of the chunk:

{{ chunk.content }}

Here is a metadata tag:

{{ chunk.metadata_json["some_tag"] }}
```
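
To see the substitution concretely, here is a minimal sketch that renders a template like the one above with plain Jinja2 (the same engine this step uses). The SimpleNamespace chunk is a stand-in for the real chunk object the step injects:

```python
from types import SimpleNamespace

from jinja2 import Template

# Stand-in for the chunk object that the step injects automatically.
chunk = SimpleNamespace(
    content="OneContext pipelines are defined in YAML.",
    metadata_json={"some_tag": "docs"},
)

template = Template(
    "This is the content of the chunk:\n\n"
    "{{ chunk.content }}\n\n"
    "Here is a metadata tag:\n\n"
    '{{ chunk.metadata_json["some_tag"] }}\n\n'
    "The user asked about: {{ query }}"
)

# `chunk` is provided automatically; `query` stands in for a user-supplied
# entry in `variables`, which can be overridden at runtime.
print(template.render(chunk=chunk, query="pipeline configuration"))
```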

Example

Here is an example of how you could use this step to critique the relevance of each retrieved chunk in a retrieval pipeline.

Note that we use the {"response_format": {"type": "json_object"}} OpenAI model parameter to constrain the output to JSON.

```yaml
steps:
  - step: SentenceTransformerEmbedder
    name: embedder
    step_args:
      model_name: BAAI/bge-base-en-v1.5
      query: placeholder
    inputs: [ ]

  - step: Retriever
    name: retriever
    step_args:
      vector_index_name: my_vector_index
      top_k: 100
      return_embeddings: false
    inputs: [ embedder ]

  - step: Reranker
    name: reranker
    step_args:
      query: placeholder
      model_name: BAAI/bge-reranker-base
      top_k: 10
    inputs: [ retriever ]

  - step: MetadataGenerator
    name: critic
    step_args:
      variables: { "query": "the user's query" }
      parse_json: true
      model_name: gpt-4-turbo
      model_kwargs:
        max_tokens: 10
        response_format:
          type: json_object
      template: |
        Determine if the candidate paragraph is specifically relevant
        to the user's query.

        Respond with a boolean true if it is specifically relevant,
        and false if it is not.

        ONLY respond true if you are CERTAIN that the paragraph of information
        is relevant to the user's query.

        The user's query is:

        {{ query }}

        and the candidate paragraph is:

        {{ chunk.content }}

        NOTE, your output should ONLY be in JSON format, using this schema:
        {"is_relevant": true or false}

    inputs: [ reranker ]
```
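
For reference, the call the critic step makes for each chunk corresponds roughly to the following request with the openai Python client. This is a sketch: the rendered template becomes the message content, and the model_kwargs are forwarded to the completion call.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stand-in for the filled Jinja2 template for one chunk.
rendered_prompt = 'Determine if the candidate paragraph is ... {"is_relevant": true or false}'

response = client.chat.completions.create(
    model="gpt-4-turbo",
    max_tokens=10,
    response_format={"type": "json_object"},  # constrains the output to JSON
    messages=[{"role": "user", "content": rendered_prompt}],
)
print(response.choices[0].message.content)  # e.g. {"is_relevant": false}
```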

The resulting chunks will have the parsed JSON results stored in chunk.metadata_json['critic']['json']['is_relevant'].
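
Downstream, you could then filter on that flag. A sketch, assuming the pipeline run returns chunk objects with the metadata layout described above and that metadata_json behaves like a dict:

```python
def is_relevant(chunk) -> bool:
    # Results are nested under the user-defined step name ("critic" here).
    critic = chunk.metadata_json.get("critic", {})
    parsed = critic.get("json") or {}  # `json` is None if parsing failed
    return bool(parsed.get("is_relevant", False))

# `chunks` is the list of chunk objects returned by the pipeline run.
relevant_chunks = [c for c in chunks if is_relevant(c)]
```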