Before you begin, make sure you have:
A Datalab account with an API key (new accounts include $5 in free credits)
Python 3.10+ installed
The Datalab SDK: pip install datalab-python-sdk
Your DATALAB_API_KEY environment variable set
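The install and environment setup above can be done from a shell; the key value below is a placeholder, replace it with your own:

```shell
# Install the Datalab SDK (one-time)
pip install datalab-python-sdk

# Make your API key available to the SDK (placeholder value)
export DATALAB_API_KEY="your_api_key_here"
```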
Using Forge
Forge provides a visual pipeline builder where you can:
Start from a template or create a blank pipeline
Add processors — click to add convert, extract, segment, or custom processors
Configure each processor — set processing mode, schemas, and options in the configuration panel
Test with a document — run the pipeline and watch each processor complete in real-time
Save and version — name your pipeline and publish versions for production use
Edits in Forge auto-save as a draft. Your published versions remain unchanged until you explicitly publish a new version.
Using the SDK
Create a Pipeline
Define processors using PipelineProcessor and create the pipeline:
from datalab_sdk import DatalabClient, PipelineProcessor

client = DatalabClient()

steps = [
    PipelineProcessor(type="convert", settings={
        "mode": "balanced",
        "output_format": "markdown"
    }),
    PipelineProcessor(type="extract", settings={
        "page_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string", "description": "Document title"},
                "date": {"type": "string", "description": "Document date"},
                "summary": {"type": "string", "description": "Brief summary"}
            }
        }
    })
]

pipeline = client.create_pipeline(steps=steps)
print(f"Created: {pipeline.pipeline_id}")  # pl_XXXXX
The pipeline starts as an unsaved draft.
Save the Pipeline
Name and save the pipeline so it appears in your pipeline list:
pipeline = client.save_pipeline(
    pipeline.pipeline_id,
    name="Document Summarizer"
)
print(f"Saved: {pipeline.name}")
Update Steps
Update a pipeline’s steps. This creates a draft if the pipeline has a published version:
updated_steps = [
    PipelineProcessor(type="convert", settings={
        "mode": "accurate",  # Changed from balanced
        "output_format": "markdown"
    }),
    PipelineProcessor(type="extract", settings={
        "page_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "date": {"type": "string"},
                "summary": {"type": "string"},
                "author": {"type": "string"}  # Added field
            }
        }
    })
]

pipeline = client.update_pipeline(pipeline.pipeline_id, steps=updated_steps)
Using the REST API
curl -X POST https://www.datalab.to/api/v1/pipelines \
  -H "X-API-Key: $DATALAB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "steps": [
      {"type": "convert", "settings": {"mode": "balanced"}},
      {"type": "extract", "settings": {
        "page_schema": "{\"type\":\"object\",\"properties\":{\"title\":{\"type\":\"string\"}}}"
      }}
    ]
  }'
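Note that in the REST API, unlike the SDK, `page_schema` is sent as a JSON-encoded string rather than a nested object. One way to build the same request body from Python (stdlib only; sending it with the HTTP client of your choice is omitted here):

```python
import json

# The schema to extract, as a plain dict
page_schema = {"type": "object", "properties": {"title": {"type": "string"}}}

# Build the same body as the curl example above.
# page_schema must be JSON-encoded into a string for the REST API.
body = {
    "steps": [
        {"type": "convert", "settings": {"mode": "balanced"}},
        {"type": "extract", "settings": {"page_schema": json.dumps(page_schema)}},
    ]
}

payload = json.dumps(body)
```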
Processor Configuration Reference
Convert Processor
Controls how the document is parsed.
PipelineProcessor(type="convert", settings={
    "mode": "balanced",             # fast, balanced, accurate
    "output_format": "markdown",    # markdown, html, json, chunks
    "paginate": True,               # Add page delimiters
    "include_images": True,         # Extract images
    "include_image_captions": True,
    "add_block_ids": False,         # Block IDs for citations
})
| Setting | Type | Default | Description |
|---|---|---|---|
| mode | str | "fast" | Processing mode |
| output_format | str | "markdown" | Output format |
| paginate | bool | false | Add page delimiters |
| include_images | bool | true | Extract images from document |
| include_image_captions | bool | true | Generate image captions |
| include_headers_footers | bool | false | Include page headers/footers |
| add_block_ids | bool | false | Add block IDs for citation tracking |
| fence_synthetic_captions | bool | false | Fence synthetic image captions |
Extract Processor
Extracts structured data using a JSON schema. Requires a preceding convert processor (or segment / custom).
PipelineProcessor(type="extract", settings={
    "page_schema": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string", "description": "Invoice ID"},
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "amount": {"type": "number"}
                    }
                }
            }
        }
    }
})
| Setting | Type | Description |
|---|---|---|
| page_schema | dict | JSON schema defining fields to extract |
Use detailed description fields in your schema to improve extraction accuracy. Tell the model what to look for.
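As a concrete illustration of that tip, a schema whose descriptions spell out where each value appears and what it looks like tends to extract more reliably than one with bare field names. The field names and descriptions below are illustrative, not part of the Datalab API:

```python
# Detailed descriptions tell the extraction model exactly what to look for.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": "string",
            "description": "The invoice ID, usually printed near the top, e.g. 'INV-2024-0042'",
        },
        "total_amount": {
            "type": "number",
            "description": "The grand total after tax, from the 'Total Due' line",
        },
    },
}
```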
Segment Processor
Splits a document into logical sections. Requires a preceding convert processor.
PipelineProcessor(type="segment", settings={
    "segmentation_schema": {
        "Cover Letter": "The cover letter or introductory section",
        "Resume": "The applicant's resume or CV",
        "References": "Reference letters or contact information"
    }
})
| Setting | Type | Description |
|---|---|---|
| segmentation_schema | dict | Map of section names to descriptions |
Custom Processor
Applies use-case-specific customizations to convert output. Requires a preceding convert processor. See Custom Processors for details.
PipelineProcessor(
    type="custom",
    settings={},
    custom_processor_id="cp_abc123"  # Your custom processor ID
)
| Field | Type | Description |
|---|---|---|
| custom_processor_id | str | ID of the custom processor (cp_XXXXX) |
| eval_rubric_id | int | Optional evaluation rubric to apply |
List and Manage Pipelines
# List saved pipelines
result = client.list_pipelines(saved_only=True, limit=50)
for p in result["pipelines"]:
    print(f"{p.pipeline_id}: {p.name} (v{p.active_version})")

# Get a specific pipeline
pipeline = client.get_pipeline("pl_abc123")

# Archive (soft-delete)
client.archive_pipeline("pl_abc123")

# Restore
client.unarchive_pipeline("pl_abc123")
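A small convenience sketch built on the listing call: assuming, as in the loop above, that `list_pipelines()` returns a dict with a `"pipelines"` list whose entries expose `.name` and `.pipeline_id`, you can look a pipeline up by name. This helper is not part of the SDK:

```python
def find_pipeline_by_name(result, name):
    """Return the first pipeline whose name matches, or None.

    `result` is the dict returned by client.list_pipelines(), whose
    "pipelines" entries expose .name and .pipeline_id as shown above.
    """
    return next((p for p in result["pipelines"] if p.name == name), None)
```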
Next Steps
Pipeline Versioning Manage drafts, publish versions, and pin production deployments.
Run a Pipeline Execute pipelines with overrides and track results.
Structured Extraction Deep dive on extraction schemas and confidence scoring.
SDK Reference Full SDK reference for all pipeline methods.