Before you begin, make sure you have:
  1. A Datalab account with an API key (new accounts include $5 in free credits)
  2. Python 3.10+ installed
  3. The Datalab SDK: pip install datalab-python-sdk
  4. Your DATALAB_API_KEY environment variable set

Using Forge

Forge provides a visual pipeline builder where you can:
  1. Start from a template or create a blank pipeline
  2. Add processors — click to add convert, extract, segment, or custom processors
  3. Configure each processor — set processing mode, schemas, and options in the configuration panel
  4. Test with a document — run the pipeline and watch each processor complete in real-time
  5. Save and version — name your pipeline and publish versions for production use
Edits in Forge auto-save as a draft. Your published versions remain unchanged until you explicitly publish a new version.

Using the SDK

Create a Pipeline

Define processors using PipelineProcessor and create the pipeline:
from datalab_sdk import DatalabClient, PipelineProcessor

client = DatalabClient()

steps = [
    PipelineProcessor(type="convert", settings={
        "mode": "balanced",
        "output_format": "markdown"
    }),
    PipelineProcessor(type="extract", settings={
        "page_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string", "description": "Document title"},
                "date": {"type": "string", "description": "Document date"},
                "summary": {"type": "string", "description": "Brief summary"}
            }
        }
    })
]

pipeline = client.create_pipeline(steps=steps)
print(f"Created: {pipeline.pipeline_id}")  # pl_XXXXX
The pipeline starts as an unsaved draft.

Save the Pipeline

Name and save the pipeline so it appears in your pipeline list:
pipeline = client.save_pipeline(
    pipeline.pipeline_id,
    name="Document Summarizer"
)
print(f"Saved: {pipeline.name}")

Update Steps

Update a pipeline’s steps. This creates a draft if the pipeline has a published version:
updated_steps = [
    PipelineProcessor(type="convert", settings={
        "mode": "accurate",  # Changed from balanced
        "output_format": "markdown"
    }),
    PipelineProcessor(type="extract", settings={
        "page_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "date": {"type": "string"},
                "summary": {"type": "string"},
                "author": {"type": "string"}  # Added field
            }
        }
    })
]

pipeline = client.update_pipeline(pipeline.pipeline_id, steps=updated_steps)

Using the REST API

Create a pipeline by POSTing its steps to the pipelines endpoint. Note that the REST API expects page_schema as a JSON-encoded string rather than an object:

curl -X POST https://www.datalab.to/api/v1/pipelines \
  -H "X-API-Key: $DATALAB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "steps": [
      {"type": "convert", "settings": {"mode": "balanced"}},
      {"type": "extract", "settings": {
        "page_schema": "{\"type\":\"object\",\"properties\":{\"title\":{\"type\":\"string\"}}}"
      }}
    ]
  }'
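The same request can be issued from Python using only the standard library. This is a minimal sketch assuming the endpoint and header shown in the curl example above; the request is only sent when DATALAB_API_KEY is set:

```python
import json
import os
import urllib.request

# Build the payload. Note that page_schema is a JSON-encoded string in the REST API.
payload = {
    "steps": [
        {"type": "convert", "settings": {"mode": "balanced"}},
        {"type": "extract", "settings": {
            "page_schema": json.dumps({
                "type": "object",
                "properties": {"title": {"type": "string"}},
            })
        }},
    ]
}

api_key = os.environ.get("DATALAB_API_KEY")
if api_key:  # only send the request when a key is configured
    req = urllib.request.Request(
        "https://www.datalab.to/api/v1/pipelines",
        data=json.dumps(payload).encode(),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp))
```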

Processor Configuration Reference

Convert Processor

Controls how the document is parsed.
PipelineProcessor(type="convert", settings={
    "mode": "balanced",           # fast, balanced, accurate
    "output_format": "markdown",  # markdown, html, json, chunks
    "paginate": True,             # Add page delimiters
    "include_images": True,       # Extract images
    "include_image_captions": True,
    "add_block_ids": False,       # Block IDs for citations
})
| Setting | Type | Default | Description |
|---|---|---|---|
| mode | str | "fast" | Processing mode |
| output_format | str | "markdown" | Output format |
| paginate | bool | false | Add page delimiters |
| include_images | bool | true | Extract images from document |
| include_image_captions | bool | true | Generate image captions |
| include_headers_footers | bool | false | Include page headers/footers |
| add_block_ids | bool | false | Add block IDs for citation tracking |
| fence_synthetic_captions | bool | false | Fence synthetic image captions |
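To illustrate how explicit settings combine with the defaults in the table above, here is a hypothetical helper (not part of the SDK) that merges overrides onto the documented defaults:

```python
# Documented defaults for the convert processor, taken from the table above.
CONVERT_DEFAULTS = {
    "mode": "fast",
    "output_format": "markdown",
    "paginate": False,
    "include_images": True,
    "include_image_captions": True,
    "include_headers_footers": False,
    "add_block_ids": False,
    "fence_synthetic_captions": False,
}

def convert_settings(**overrides):
    """Return convert settings, with unspecified keys falling back to defaults."""
    unknown = set(overrides) - set(CONVERT_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown convert settings: {sorted(unknown)}")
    return {**CONVERT_DEFAULTS, **overrides}

settings = convert_settings(mode="accurate", add_block_ids=True)
```

Here settings["mode"] becomes "accurate" and add_block_ids is enabled, while every other key keeps its default.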

Extract Processor

Extracts structured data using a JSON schema. Requires a preceding convert processor (or segment / custom).
PipelineProcessor(type="extract", settings={
    "page_schema": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string", "description": "Invoice ID"},
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "amount": {"type": "number"}
                    }
                }
            }
        }
    }
})
| Setting | Type | Description |
|---|---|---|
| page_schema | dict | JSON schema defining fields to extract |
Use detailed description fields in your schema to improve extraction accuracy. Tell the model what to look for.
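The SDK accepts page_schema as a dict, while the REST API expects a JSON-encoded string (as in the curl example earlier). A short sketch of converting between the two:

```python
import json

page_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice ID"},
        "total": {"type": "number", "description": "Grand total, including tax"},
    },
}

# For the REST API, serialize the dict into a compact JSON string.
page_schema_str = json.dumps(page_schema, separators=(",", ":"))

# Round-tripping recovers the original dict.
assert json.loads(page_schema_str) == page_schema
```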

Segment Processor

Splits a document into logical sections. Requires a preceding convert processor.
PipelineProcessor(type="segment", settings={
    "segmentation_schema": {
        "Cover Letter": "The cover letter or introductory section",
        "Resume": "The applicant's resume or CV",
        "References": "Reference letters or contact information"
    }
})
| Setting | Type | Description |
|---|---|---|
| segmentation_schema | dict | Map of section names to descriptions |
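A segmentation_schema is a flat map of section names to natural-language descriptions. A hypothetical pre-flight check (not part of the SDK) that validates the map before creating a pipeline:

```python
def validate_segmentation_schema(schema):
    """Ensure the segmentation schema is a non-empty map of str -> str."""
    if not isinstance(schema, dict) or not schema:
        raise ValueError("segmentation_schema must be a non-empty dict")
    for name, description in schema.items():
        if not isinstance(name, str) or not isinstance(description, str):
            raise ValueError(
                f"section {name!r} must map a string name to a string description"
            )
    return schema

schema = validate_segmentation_schema({
    "Cover Letter": "The cover letter or introductory section",
    "Resume": "The applicant's resume or CV",
})
```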

Custom Processor

Applies use-case-specific customizations to convert output. Requires a preceding convert processor. See Custom Processors for details.
PipelineProcessor(
    type="custom",
    settings={},
    custom_processor_id="cp_abc123"  # Your custom processor ID
)
| Field | Type | Description |
|---|---|---|
| custom_processor_id | str | ID of the custom processor (cp_XXXXX) |
| eval_rubric_id | int | Optional evaluation rubric to apply |

List and Manage Pipelines

# List saved pipelines
result = client.list_pipelines(saved_only=True, limit=50)
for p in result["pipelines"]:
    print(f"{p.pipeline_id}: {p.name} (v{p.active_version})")

# Get a specific pipeline
pipeline = client.get_pipeline("pl_abc123")

# Archive (soft-delete)
client.archive_pipeline("pl_abc123")

# Restore
client.unarchive_pipeline("pl_abc123")

Next Steps

Pipeline Versioning

Manage drafts, publish versions, and pin production deployments.

Run a Pipeline

Execute pipelines with overrides and track results.

Structured Extraction

Deep dive on extraction schemas and confidence scoring.

SDK Reference

Full SDK reference for all pipeline methods.