Pipelines
A pipeline is a fully specified recipe for going from content to embeddings and back again.
Example ingestion pipeline
steps:
  - step: KnowledgeBaseFiles
    name: input
    step_args:
      # specify the source knowledgebases to sync
      knowledgebase_names: ["my_kb"]
      # specify the target index
      vector_index_name: my_vector_index
    inputs: []
  - step: Preprocessor
    name: preprocessor
    step_args: {}
    inputs: [input]
  - step: Chunker
    name: simple_chunker
    step_args:
      chunk_size_words: 320
      chunk_overlap: 30
    inputs: [input]
  - step: SentenceTransformerEmbedder
    name: sentence-transformers
    step_args:
      model_name: BAAI/bge-base-en-v1.5
      include_metadata: [ title, file_name ]
    inputs: [ simple_chunker ]
  - step: ChunkWriter
    name: save
    step_args:
      vector_index_name: my_vector_index
    inputs: [sentence-transformers]
Example query pipeline
steps:
  - step: SentenceTransformerEmbedder
    name: query_embedder
    step_args:
      model_name: BAAI/bge-base-en-v1.5
      include_metadata: [ title, file_name ]
      query: "placeholder"
    inputs: [ ]
  - step: Retriever
    name: retriever
    step_args:
      vector_index_name: my_vector_index
      top_k: 100
      metadata_filters: { }
    inputs: ["query_embedder"]
  - step: Reranker
    name: reranker
    step_args:
      query: "placeholder"
      model_name: BAAI/bge-reranker-base
      top_k: 5
      metadata_filters: { }
    inputs: [ retriever ]
Steps
Each step is an atomic, constituent component of a pipeline. Each step is defined by three things:
Key | Value Type | Value Description |
---|---|---|
name | string | A name for this step (you can use this name to refer to this step in subsequent inputs sections). |
step_args | object | The step-specific arguments which are passed to this step at runtime. See these docs for the specific arguments you can pass for each step. |
inputs | array | An array of step names which this step depends on. A step's execution is triggered once all the steps in the dependency array have executed successfully. |
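To make the three keys concrete, here is a minimal Python sketch that checks a list of step definitions after the YAML has been loaded into dictionaries. The helper `validate_steps` and its messages are hypothetical, written for illustration only; they are not part of OneContext, and the real service may validate differently.

```python
def validate_steps(steps):
    """Return a list of problems found in a list of step dictionaries.

    Hypothetical helper for illustration; not part of OneContext.
    """
    problems = []
    # Collect every valid step name first so inputs can be checked against them.
    names = {s["name"] for s in steps if isinstance(s.get("name"), str)}
    for index, step in enumerate(steps):
        label = step.get("name", f"step #{index}")
        if not isinstance(step.get("name"), str):
            problems.append(f"{label}: 'name' must be a string")
        if not isinstance(step.get("step_args"), dict):
            problems.append(f"{label}: 'step_args' must be an object")
        inputs = step.get("inputs")
        if not isinstance(inputs, list):
            problems.append(f"{label}: 'inputs' must be an array")
            continue
        for dep in inputs:
            if dep not in names:
                problems.append(f"{label}: unknown input '{dep}'")
    return problems


# Two steps taken from the example query pipeline.
steps = [
    {"step": "SentenceTransformerEmbedder", "name": "query_embedder",
     "step_args": {"model_name": "BAAI/bge-base-en-v1.5", "query": "placeholder"},
     "inputs": []},
    {"step": "Retriever", "name": "retriever",
     "step_args": {"vector_index_name": "my_vector_index", "top_k": 100},
     "inputs": ["query_embedder"]},
]
problems = validate_steps(steps)

# A step whose inputs array misspells another step's name.
broken = validate_steps(
    [{"step": "Reranker", "name": "reranker", "step_args": {}, "inputs": ["retriver"]}]
)
print(problems, broken)
```

Checking the inputs arrays against the set of declared names catches the most common mistake, a misspelled step name, before the pipeline is deployed.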
Dependency Graphs
Behind the scenes, OneContext builds an execution graph of your steps from the dependencies declared in each step's inputs array. This graph is then used to execute your pipeline in the most efficient way possible.
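As an illustration of how such a graph can be ordered, the sketch below applies Python's standard `graphlib.TopologicalSorter` to the inputs arrays of the example ingestion pipeline. This is not OneContext's actual scheduler, and the function `execution_batches` is a hypothetical name. It only shows the general idea that steps whose dependencies have all finished can run together.

```python
from graphlib import TopologicalSorter

# Step name -> inputs array, copied from the example ingestion pipeline.
dependencies = {
    "input": [],
    "preprocessor": ["input"],
    "simple_chunker": ["input"],
    "sentence-transformers": ["simple_chunker"],
    "save": ["sentence-transformers"],
}


def execution_batches(dependencies):
    """Group steps into batches; every step in a batch has all its inputs done."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock dependents
    return batches


batches = execution_batches(dependencies)
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

In this example `preprocessor` and `simple_chunker` land in the same batch because both depend only on `input`, so a scheduler would be free to run them in parallel.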
Deploy a new Pipeline
Here, index.yaml refers to the pipeline configuration file in YAML format.
List all the Pipelines
Delete a Pipeline
Run a Pipeline with Override Arguments
Overriding specific step arguments in a pipeline allows for customized processing and retrieval:
The override_args parameter allows you to modify the default arguments of each step in the pipeline for a specific run. It is passed as a dictionary whose keys are step names and whose values are dictionaries of the step arguments to override.
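The shape of this dictionary can be sketched in Python as follows. The function `apply_overrides` is hypothetical and only models the behaviour described above. It assumes that an override replaces the matching default argument and leaves the step's other arguments untouched; the service's real merge rules may differ.

```python
import copy


def apply_overrides(steps, override_args):
    """Return a copy of steps with override_args merged into each step's step_args.

    Hypothetical model of the override behaviour; not OneContext's implementation.
    """
    unknown = set(override_args) - {step["name"] for step in steps}
    if unknown:
        raise KeyError(f"override_args refers to unknown steps: {sorted(unknown)}")
    merged = copy.deepcopy(steps)  # leave the deployed defaults untouched
    for step in merged:
        step["step_args"].update(override_args.get(step["name"], {}))
    return merged


# Defaults taken from the example query pipeline.
steps = [
    {"name": "query_embedder",
     "step_args": {"model_name": "BAAI/bge-base-en-v1.5", "query": "placeholder"}},
    {"name": "retriever",
     "step_args": {"vector_index_name": "my_vector_index", "top_k": 100}},
]

# Keys are step names; values are the arguments to override for this run.
override_args = {
    "query_embedder": {"query": "the difference between ipv4 and ipv6"},
    "retriever": {"top_k": 1},
}

merged = apply_overrides(steps, override_args)
print(merged[1]["step_args"])
```

Rejecting unknown step names up front is a design choice in this sketch: a typo in a step name would otherwise be ignored silently and the run would use the defaults.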
With the default arguments
Passing overrides to the default arguments
onecli pipeline run sync --pipeline-name=retrieve_fast --override-args='{"query_embedder": {"query": "the difference between ipv4 and ipv6 and what it means for the internet"}, "retriever": {"top_k": 1}}'
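Hand-written JSON on the command line is easy to get wrong, so one option is to build the --override-args value programmatically. The sketch below uses only Python's standard library and the command, pipeline name, and flags shown above. It only prints the command and does not execute it.

```python
import json
import shlex

# Keys are step names from the query pipeline; values are the arguments to override.
override_args = {
    "query_embedder": {
        "query": "the difference between ipv4 and ipv6 and what it means for the internet"
    },
    "retriever": {"top_k": 1},
}

# json.dumps always yields valid JSON; shlex.quote makes it safe for a POSIX shell.
payload = json.dumps(override_args)
command = (
    "onecli pipeline run sync --pipeline-name=retrieve_fast "
    f"--override-args={shlex.quote(payload)}"
)
print(command)
```

Because the payload is serialized from a dictionary, every step name is guaranteed to be a key of a single top-level JSON object.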
List all pipeline runs (active and inactive)
You can also filter on a particular run ID
You can also filter on a particular status
Limit and skip are also provided for easy pagination
Sorting
Pipeline runs are sorted by time by default, but you can override this via the sort flag.