Pipelines

A pipeline is a fully specified recipe for going from content to embeddings and back again.

Example ingestion pipeline

steps:
  - step: KnowledgeBaseFiles
    name: input
    step_args:
      # specify the source knowledgebases to sync
      knowledgebase_names: ["my_kb"]
      # specify the target index
      vector_index_name: my_vector_index
    inputs: []

  - step: Preprocessor
    name: preprocessor
    step_args: {}
    inputs: [input]

  - step: Chunker
    name: simple_chunker
    step_args:
      chunk_size_words: 320
      chunk_overlap: 30
    inputs: [preprocessor]

  - step: SentenceTransformerEmbedder
    name: sentence-transformers
    step_args:
      model_name: BAAI/bge-base-en-v1.5
      include_metadata: [ title, file_name ]
    inputs: [ simple_chunker ]

  - step: ChunkWriter
    name: save
    step_args:
      vector_index_name: my_vector_index
    inputs: [sentence-transformers]

Example query pipeline

steps:
  - step: SentenceTransformerEmbedder
    name: query_embedder
    step_args:
      model_name: BAAI/bge-base-en-v1.5
      include_metadata: [ title, file_name ]
      query: "placeholder"
    inputs: [ ]

  - step: Retriever
    name: retriever
    step_args:
      vector_index_name: my_vector_index
      top_k: 100
      metadata_filters: { }
    inputs: ["query_embedder"]

  - step: Reranker
    name: reranker
    step_args:
      query: "placeholder"
      model_name: BAAI/bge-reranker-base
      top_k: 5
      metadata_filters: { }
    inputs: [ retriever ]

Steps

Each step is an atomic component of a pipeline. Besides the `step` key, which selects the step type, each step is defined by three keys:

| Key | Value Type | Value Description |
| --- | --- | --- |
| `name` | `string` | A name for this step. Other steps use this name in their `inputs` arrays to refer to it. |
| `step_args` | `object` | The step-specific arguments passed to this step at runtime. See the documentation for each step for the arguments it accepts. |
| `inputs` | `array` | An array of the names of the steps this step depends on. A step runs once every step in this array has executed successfully. |
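The rules in the table can be checked before a pipeline is deployed. The following is a minimal sketch, not part of OneContext: `validate_steps` and `REQUIRED_KEYS` are hypothetical names, and the steps are assumed to be already loaded from YAML into a list of dictionaries.

```python
# Hypothetical helper, not a OneContext API: sanity-check a list of step
# definitions that has already been parsed from the pipeline YAML.
REQUIRED_KEYS = {"step", "name", "step_args", "inputs"}


def validate_steps(steps):
    """Return a list of problems; an empty list means the steps look consistent."""
    problems = []
    known_names = {s["name"] for s in steps if "name" in s}
    seen = set()
    for index, step in enumerate(steps):
        missing = REQUIRED_KEYS - step.keys()
        if missing:
            problems.append(f"step {index}: missing keys {sorted(missing)}")
        name = step.get("name")
        if name is not None:
            if name in seen:
                problems.append(f"step {index}: duplicate name {name!r}")
            seen.add(name)
        # Every entry in `inputs` must be the name of some step in the pipeline.
        for dep in step.get("inputs", []):
            if dep not in known_names:
                problems.append(f"step {name!r}: unknown input {dep!r}")
    return problems


query_steps = [
    {"step": "SentenceTransformerEmbedder", "name": "query_embedder",
     "step_args": {"query": "placeholder"}, "inputs": []},
    {"step": "Retriever", "name": "retriever",
     "step_args": {"top_k": 100}, "inputs": ["query_embedder"]},
]
print(validate_steps(query_steps))  # prints []
```

A check like this catches a misspelled name in an `inputs` array before the pipeline is created.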

Dependency Graphs

Behind the scenes, OneContext builds an execution graph from the `inputs` arrays of your steps. It uses this graph to decide the order in which steps run, so that each step starts only after all of its dependencies have completed.
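To illustrate how an execution order can be derived from dependency arrays, here is a minimal sketch using Python's standard `graphlib` module. It is not OneContext's actual scheduler; the step names (`load`, `clean`, `extract_titles`, `embed`) and the `execution_batches` function are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical steps: each key is a step name, each value is its `inputs` array.
inputs_by_step = {
    "load": [],
    "clean": ["load"],
    "extract_titles": ["load"],
    "embed": ["clean", "extract_titles"],
}


def execution_batches(inputs_by_step):
    """Group steps into batches; every step in a batch has all its inputs done."""
    sorter = TopologicalSorter(inputs_by_step)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies form a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


print(execution_batches(inputs_by_step))
# prints [['load'], ['clean', 'extract_titles'], ['embed']]
```

Steps in the same batch do not depend on each other, so a scheduler is free to run them in any order or at the same time.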

Deploy a new Pipeline

onecli pipeline create --pipeline-name=demoIndexingPipeline --pipeline-yaml=./index.yaml

where index.yaml refers to the pipeline configuration file in YAML format.

List all the Pipelines

onecli pipeline list

Delete a Pipeline

onecli pipeline delete --pipeline-name=demoIndexingPipeline

Run a Pipeline with Override Arguments

The override_args parameter lets you change the default arguments of individual steps for a single run, for example to supply the query text or to adjust top_k at retrieval time. It is passed as a dictionary whose keys are step names and whose values are objects containing the step arguments to override.
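The shape of the override dictionary can be sketched as follows. This assumes that overrides are merged key by key into each step's default step_args, which the text above implies but does not spell out; `apply_overrides` is a hypothetical function, not a OneContext API.

```python
import copy


def apply_overrides(steps, override_args):
    """Return a copy of `steps` with per-step argument overrides applied.

    Assumption: an override replaces individual keys in step_args and
    leaves every other default untouched.
    """
    unknown = set(override_args) - {step["name"] for step in steps}
    if unknown:
        raise KeyError(f"override_args names unknown steps: {sorted(unknown)}")
    merged = copy.deepcopy(steps)  # keep the pipeline defaults unchanged
    for step in merged:
        step["step_args"].update(override_args.get(step["name"], {}))
    return merged


defaults = [
    {"name": "query_embedder",
     "step_args": {"model_name": "BAAI/bge-base-en-v1.5", "query": "placeholder"}},
    {"name": "retriever",
     "step_args": {"vector_index_name": "my_vector_index", "top_k": 100}},
]
overrides = {
    "query_embedder": {"query": "the difference between ipv4 and ipv6"},
    "retriever": {"top_k": 1},
}
for step in apply_overrides(defaults, overrides):
    print(step["name"], step["step_args"])
```

Each top-level key must be the name of a step, not its step type, so an override for the embedder above is keyed by query_embedder rather than SentenceTransformerEmbedder.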

With the default arguments

onecli pipeline run sync --pipeline-name=retriever_pipeline

Passing overrides to the default arguments

onecli pipeline run sync --pipeline-name=retrieve_fast --override-args='{"query_embedder": {"query": "the difference between ipv4 and ipv6 and what it means for the internet"}, "retriever": {"top_k": 1}}'
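Hand-written JSON on the command line is easy to get wrong, so it can help to generate the argument from a dictionary. The sketch below only builds the command shown above and does not run it; `build_run_command` is a hypothetical helper, and the flags are the ones used in this section.

```python
import json
import shlex


def build_run_command(pipeline_name, override_args=None):
    """Build the argv list for `onecli pipeline run sync` (hypothetical helper)."""
    argv = ["onecli", "pipeline", "run", "sync", f"--pipeline-name={pipeline_name}"]
    if override_args:
        # json.dumps guarantees well-formed JSON with balanced braces.
        argv.append("--override-args=" + json.dumps(override_args))
    return argv


argv = build_run_command(
    "retrieve_fast",
    {
        "query_embedder": {"query": "the difference between ipv4 and ipv6"},
        "retriever": {"top_k": 1},
    },
)
# shlex.join quotes the JSON so the line can be pasted into a POSIX shell.
print(shlex.join(argv))
```

The list form can also be passed directly to `subprocess.run(argv)`, which avoids shell quoting altogether.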

List all pipeline runs (active and inactive)

onecli pipeline run status

You can also filter on a particular run ID

onecli pipeline run status --runid=<RUNID>

You can also filter on a particular status

onecli pipeline run status --status=SUCCESSFUL

Limit and skip are also provided for easy pagination

onecli pipeline run status --status=SUCCESSFUL --limit=10 --skip=10
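The relationship between a page number and the two flags can be sketched as below. This assumes that --skip counts runs rather than pages, which the command above suggests but does not confirm; `page_flags` is a hypothetical helper.

```python
def page_flags(page, page_size, status=None):
    """Return the status-listing flags for a 1-based page number.

    Assumption: --skip is the number of runs to skip, so page N starts
    after (N - 1) * page_size runs.
    """
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be at least 1")
    flags = []
    if status is not None:
        flags.append(f"--status={status}")
    flags.append(f"--limit={page_size}")
    flags.append(f"--skip={(page - 1) * page_size}")
    return flags


print(" ".join(["onecli", "pipeline", "run", "status"] + page_flags(2, 10, "SUCCESSFUL")))
# prints onecli pipeline run status --status=SUCCESSFUL --limit=10 --skip=10
```

Under that assumption, the command above requests the second page of ten successful runs.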

Sorting

Pipeline runs are sorted by time by default. You can change the sort key with the --sort flag:

onecli pipeline run status --sort=status

Show steps

onecli pipeline run status --runid=<RUNID> --show-steps

The --show-steps flag shows the status and output of each step in the pipeline run.

Show config

onecli pipeline run status --runid=<RUNID> --show-config

The --show-config flag also shows the pipeline configuration (the YAML file) for each run listed.