Automatically identify and split PDFs that contain multiple documents (like batch-scanned files) into their component parts.
Before you begin, make sure you have:
- A Datalab account with an API key (new accounts include $5 in free credits)
- Python 3.10+ installed
- The Datalab SDK: `pip install datalab-python-sdk`
- Your `DATALAB_API_KEY` environment variable set
Building for production? Use Pipelines to chain processors, version your configuration, and deploy with a single API call.
Quick Start
Python SDK
```python
import json
from datalab_sdk import DatalabClient, SegmentOptions

client = DatalabClient()

# Define segmentation schema
segmentation_schema = {
    "segments": []
}

options = SegmentOptions(
    segmentation_schema=json.dumps(segmentation_schema),
    mode="balanced"
)

result = client.segment("combined_documents.pdf", options=options)

# Access segmentation results
for segment in result.segmentation_results["segments"]:
    print(f"{segment['name']}: pages {segment['pages']}")
```
When to Use
Segmentation is useful when:
- Batch-scanned documents are combined into a single PDF
- Multiple document types are stapled together
- You need to apply different processing to different sections
Example response:

```json
{
  "segmentation_results": {
    "segments": [
      {
        "name": "Research Paper",
        "pages": [0, 1, 2],
        "confidence": "medium"
      },
      {
        "name": "Invoice",
        "pages": [3, 4],
        "confidence": "high"
      }
    ],
    "metadata": {
      "total_pages": 5,
      "segmentation_method": "auto_detected"
    }
  }
}
```
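The `confidence` field can drive downstream handling. A minimal sketch (not part of the SDK) that flags any segment below high confidence for manual review, using the example response above:

```python
# Example response shape, as returned under result.segmentation_results
response = {
    "segmentation_results": {
        "segments": [
            {"name": "Research Paper", "pages": [0, 1, 2], "confidence": "medium"},
            {"name": "Invoice", "pages": [3, 4], "confidence": "high"},
        ]
    }
}

# Collect names of segments that should be double-checked by a human
needs_review = [
    s["name"]
    for s in response["segmentation_results"]["segments"]
    if s["confidence"] != "high"
]
```

Here `needs_review` would contain `["Research Paper"]`, since its confidence is only `"medium"`.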
Process Each Segment
After segmentation, process each segment separately:
```python
import json
from datalab_sdk import DatalabClient, SegmentOptions, ExtractOptions

client = DatalabClient()

# First, get segments
seg_options = SegmentOptions(
    segmentation_schema=json.dumps({"segments": []}),
    mode="balanced"
)
result = client.segment("combined.pdf", options=seg_options)

# Process each segment with an appropriate schema using the Extract API
extraction_schemas = {
    "Invoice": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "total": {"type": "number"}
        }
    },
    "Contract": {
        "type": "object",
        "properties": {
            "parties": {"type": "array", "items": {"type": "string"}},
            "effective_date": {"type": "string"}
        }
    }
}

extracted_data = {}
for segment in result.segmentation_results["segments"]:
    segment_name = segment["name"]
    pages = segment["pages"]
    schema = extraction_schemas.get(segment_name)
    if schema:
        # Build page range string
        page_range = ",".join(str(p) for p in pages)
        options = ExtractOptions(
            page_schema=json.dumps(schema),
            page_range=page_range,
            mode="balanced"
        )
        seg_result = client.extract("combined.pdf", options=options)
        extracted_data[segment_name] = json.loads(seg_result.extraction_schema_json)

print(extracted_data)
```
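The loop above joins page numbers with commas ("0,1,2"). If the Extract API also accepts hyphenated spans (an assumption; the comma-joined form shown above is what this guide uses), long runs of consecutive pages can be collapsed with a small helper:

```python
def pages_to_range(pages):
    """Collapse a sorted page list like [0, 1, 2, 5] into "0-2,5"."""
    runs = []
    start = prev = pages[0]
    for p in pages[1:]:
        if p == prev + 1:
            # Page continues the current consecutive run
            prev = p
            continue
        # Run broken: record it and start a new one
        runs.append((start, prev))
        start = prev = p
    runs.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in runs)

print(pages_to_range([0, 1, 2]))  # → 0-2
print(pages_to_range([0, 2, 3]))  # → 0,2-3
```

If hyphenated spans are not supported, the comma-joined form in the example works as-is.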
Using Checkpoints
If you already converted a document with `save_checkpoint=True` using the Convert API, pass the `checkpoint_id` to `SegmentOptions` to skip re-parsing. This saves time and cost when running segmentation on a previously converted document.
```python
import json
from datalab_sdk import DatalabClient, ConvertOptions, SegmentOptions

client = DatalabClient()

# Step 1: Convert and save checkpoint
convert_result = client.convert("combined.pdf", options=ConvertOptions(save_checkpoint=True))
checkpoint_id = convert_result.checkpoint_id

# Step 2: Segment using checkpoint (no re-parsing needed)
options = SegmentOptions(
    segmentation_schema=json.dumps({"segments": []}),
    checkpoint_id=checkpoint_id
)
result = client.segment("combined.pdf", options=options)
```
Custom Segmentation Schema
Define expected segment types for better accuracy:
```python
segmentation_schema = {
    "segments": [
        {"type": "invoice", "description": "Invoice or billing document"},
        {"type": "contract", "description": "Legal contract or agreement"},
        {"type": "receipt", "description": "Payment receipt"}
    ]
}
```
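Before sending a custom schema, it can help to check its shape locally. A hypothetical helper (`serialize_schema` is not part of the SDK) that validates each entry and serializes the schema for `SegmentOptions(segmentation_schema=...)`:

```python
import json

def serialize_schema(schema: dict) -> str:
    """Validate a segmentation schema's shape, then serialize it to JSON."""
    if not isinstance(schema.get("segments"), list):
        raise ValueError("schema must contain a 'segments' list")
    for entry in schema["segments"]:
        # Each expected segment type should carry both keys
        if not {"type", "description"} <= entry.keys():
            raise ValueError(f"segment entry missing keys: {entry}")
    return json.dumps(schema)

segmentation_schema = {
    "segments": [
        {"type": "invoice", "description": "Invoice or billing document"},
    ]
}
payload = serialize_schema(segmentation_schema)
```

The resulting `payload` string is what you would pass as `segmentation_schema` when constructing `SegmentOptions`.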
Next Steps
Structured Extraction Extract structured data from document segments using JSON schemas.
Handling Long Documents Tips for TOC-based segmentation on documents with 50+ pages.
Document Conversion Convert documents to Markdown, HTML, JSON, or chunks.
Pipelines Chain processors into versioned, reusable pipelines.