Pipelines chain processors — convert, extract, segment, and custom — into a single reusable unit. Define a pipeline once, version it, and run it against any document with one API call.

Before you begin, make sure you have:
  1. A Datalab account with an API key (new accounts include $5 in free credits)
  2. Python 3.10+ installed
  3. The Datalab SDK: pip install datalab-python-sdk
  4. Your DATALAB_API_KEY environment variable set

Why Pipelines

Individual endpoints like /convert and /extract work well for one-off tasks. Pipelines are better when you need to:
  • Chain processors — Convert a document, then extract structured data, in one call
  • Version your configuration — Pin production integrations to a specific version while iterating on drafts
  • Standardize processing — Share pipeline configurations across your team
  • Track execution — Monitor each processor’s status as a pipeline runs
You can build pipelines visually in Forge or programmatically via the SDK and API.

How Pipelines Work

A pipeline is an ordered chain of processors. Each processor transforms the document and passes its output to the next via checkpoints.
convert → segment → extract
The convert processor always runs first. Downstream processors depend on it.
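The checkpoint hand-off described above can be sketched as a simple fold over the processor list. This is a conceptual illustration only — real execution happens on Datalab's servers, and the toy `convert` and `segment` functions here are stand-ins, not the actual processors:

```python
# Conceptual sketch of checkpoint passing between processors.
def run_chain(document, processors):
    checkpoint = document  # the initial input is the document itself
    for processor in processors:
        checkpoint = processor(checkpoint)  # each output feeds the next step
    return checkpoint

# Toy stand-ins for the convert and segment processors
def convert(doc):
    return {"markdown": f"# {doc}"}

def segment(checkpoint):
    return {"sections": [checkpoint["markdown"]]}

result = run_chain("invoice.pdf", [convert, segment])
assert result == {"sections": ["# invoice.pdf"]}
```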

Processor Types

| Processor | Description | Can Follow |
|-----------|-------------|------------|
| convert | Parse document to markdown/HTML/JSON | Must be first |
| segment | Split document into logical sections | convert |
| extract | Extract structured data using a JSON schema | convert, segment, custom |
| custom | Run a custom processor | convert |

Composition Rules

  • Every pipeline starts with a convert processor
  • extract is always terminal (nothing can follow it)
  • segment can feed into extract
  • custom can feed into extract
Common patterns:
| Pattern | Use Case |
|---------|----------|
| convert | Simple document parsing |
| convert → extract | Parse and extract structured fields |
| convert → segment | Parse and split into sections |
| convert → segment → extract | Split, then extract from each section |
| convert → custom → extract | Apply custom processing, then extract |
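The composition rules above can be expressed as a small validator over lists of step types. This is a hypothetical helper for illustration — the API enforces these rules server-side:

```python
# Which processor types each step is allowed to follow, per the
# composition rules (convert must be first, so it has no predecessors).
ALLOWED_PREDECESSORS = {
    "convert": set(),
    "segment": {"convert"},
    "extract": {"convert", "segment", "custom"},
    "custom": {"convert"},
}

def validate_chain(types):
    if not types or types[0] != "convert":
        return False  # every pipeline starts with convert
    for prev, cur in zip(types, types[1:]):
        if prev == "extract":
            return False  # extract is always terminal
        if prev not in ALLOWED_PREDECESSORS[cur]:
            return False
    return True

assert validate_chain(["convert", "segment", "extract"])
assert validate_chain(["convert", "custom", "extract"])
assert not validate_chain(["segment", "extract"])            # must start with convert
assert not validate_chain(["convert", "extract", "segment"]) # extract is terminal
```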

Pipeline Lifecycle

Pipelines have three states:
  1. Draft — Edits auto-save. Not versioned yet.
  2. Saved — Named and visible in your pipeline list.
  3. Published — An immutable version snapshot. Safe to use in production.
Create (draft) → Save (named) → Publish version (immutable)
         ↑                              |
         └──── Edit (new draft) ←───────┘
When you edit a published pipeline, your changes go into a draft. The published version remains unchanged until you publish a new version. You can discard the draft at any time to revert. See Pipeline Versioning for the full lifecycle.
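The lifecycle above can be sketched as a transition table. This is purely illustrative — Datalab tracks these states server-side:

```python
# Lifecycle transitions: draft -> saved -> published, and editing a
# published pipeline forks a new draft.
TRANSITIONS = {
    ("draft", "save"): "saved",
    ("saved", "publish"): "published",
    ("published", "edit"): "draft",
}

state = "draft"
for action in ("save", "publish", "edit"):
    state = TRANSITIONS[(state, action)]
assert state == "draft"  # editing a published pipeline opens a new draft
```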

Quick Example

Create a pipeline that converts a document and extracts invoice data:
```python
from datalab_sdk import DatalabClient, PipelineProcessor

client = DatalabClient()

# Define steps
steps = [
    PipelineProcessor(type="convert", settings={"mode": "balanced"}),
    PipelineProcessor(type="extract", settings={
        "page_schema": {
            "type": "object",
            "properties": {
                "invoice_number": {"type": "string"},
                "total_amount": {"type": "number"},
                "vendor_name": {"type": "string"}
            }
        }
    })
]

# Create and save
pipeline = client.create_pipeline(steps=steps)
pipeline = client.save_pipeline(pipeline.pipeline_id, name="Invoice Extractor")

# Run on a document
execution = client.run_pipeline(
    pipeline.pipeline_id,
    file_path="invoice.pdf"
)

# Poll until complete
execution = client.get_pipeline_execution(
    execution.execution_id,
    max_polls=300
)

# Get extraction result
result = client.get_step_result(execution.execution_id, step_index=1)
print(result)
```

Pipelines vs Individual Endpoints

| | Individual Endpoints | Pipelines |
|---|---------------------|-----------|
| Processors | One at a time | Chain multiple processors |
| Versioning | None | Draft, saved, published versions |
| Configuration | Pass options per request | Configure once, reuse |
| Forge UI | Playground | Full pipeline builder |
| Best for | Quick tests, simple tasks | Production integrations |
Individual endpoints (/convert, /extract, /segment) are not going away. Use them for simple, one-off processing. Use Pipelines when you need repeatability, versioning, or multi-processor chains.

Next Steps

Create a Pipeline

Build your first pipeline with Forge or the SDK.

Pipeline Versioning

Manage drafts, versions, and production deployments.

Run a Pipeline

Execute pipelines with overrides, polling, and webhooks.

SDK Reference

Full SDK reference for all pipeline methods.