Automatically identify and split PDFs that contain multiple documents (such as batch-scanned files) into their component parts.

Before you begin, make sure you have:
  1. A Datalab account with an API key (new accounts include $5 in free credits)
  2. Python 3.10+ installed
  3. The Datalab SDK: pip install datalab-python-sdk
  4. Your DATALAB_API_KEY environment variable set
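
A missing key is easier to diagnose before any client is created. The following is a minimal sketch, not part of the SDK: `get_api_key` is a hypothetical helper that only checks the `DATALAB_API_KEY` variable named in the list above.

```python
import os


def get_api_key(env=os.environ):
    """Return the DATALAB_API_KEY value, or raise a clear error if it is missing.

    Hypothetical helper for illustration; the SDK does not provide it.
    """
    key = env.get("DATALAB_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "DATALAB_API_KEY is not set; export it before creating a client."
        )
    return key


# Typical use, before constructing the client:
#   get_api_key()
```

Passing the environment as a parameter keeps the check easy to exercise without touching the real process environment.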

Quick Start

import json
from datalab_sdk import DatalabClient, SegmentOptions

client = DatalabClient()

# Define segmentation schema
segmentation_schema = {
    "segments": []
}

options = SegmentOptions(
    segmentation_schema=json.dumps(segmentation_schema),
    mode="balanced"
)

result = client.segment("combined_documents.pdf", options=options)

# Access segmentation results
for segment in result.segmentation_results["segments"]:
    print(f"{segment['name']}: pages {segment['pages']}")

When to Use

Segmentation is useful when:
  • Batch-scanned documents are combined into a single PDF
  • Multiple document types are stapled together
  • You need to apply different processing to different sections

Response Format

{
  "segmentation_results": {
    "segments": [
      {
        "name": "Research Paper",
        "pages": [0, 1, 2],
        "confidence": "medium"
      },
      {
        "name": "Invoice",
        "pages": [3, 4],
        "confidence": "high"
      }
    ],
    "metadata": {
      "total_pages": 5,
      "segmentation_method": "auto_detected"
    }
  }
}
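
The response above can be inspected with plain dictionary access. The sketch below is illustrative only: it filters segments by confidence and checks that every page was assigned to a segment. The helper names are hypothetical, the `"low"` level is an assumption (only `"medium"` and `"high"` appear in the example), and pages are taken to be zero-indexed because the example lists pages 0 through 4 for a five-page file.

```python
import json

# Sample payload mirroring the response format above.
SAMPLE = json.loads("""
{
  "segmentation_results": {
    "segments": [
      {"name": "Research Paper", "pages": [0, 1, 2], "confidence": "medium"},
      {"name": "Invoice", "pages": [3, 4], "confidence": "high"}
    ],
    "metadata": {"total_pages": 5, "segmentation_method": "auto_detected"}
  }
}
""")

# Assumed ordering; "low" is not shown in the example response.
CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}


def filter_segments(response, min_confidence="high"):
    """Return names of segments whose confidence is at or above min_confidence."""
    threshold = CONFIDENCE_RANK[min_confidence]
    segments = response["segmentation_results"]["segments"]
    return [
        s["name"]
        for s in segments
        # Unknown or missing confidence values are treated as the lowest rank.
        if CONFIDENCE_RANK.get(s.get("confidence"), 0) >= threshold
    ]


def unassigned_pages(response):
    """Return zero-indexed pages that no segment covers, based on total_pages."""
    results = response["segmentation_results"]
    covered = {p for s in results["segments"] for p in s["pages"]}
    total = results["metadata"]["total_pages"]
    return [p for p in range(total) if p not in covered]
```

Segments below the chosen threshold are good candidates for manual review before further processing.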

Process Each Segment

After segmentation, run each segment through the Extract API with a schema that matches its type. Segments without a matching schema are skipped:
import json
from datalab_sdk import DatalabClient, SegmentOptions, ExtractOptions

client = DatalabClient()

# First, get segments
seg_options = SegmentOptions(
    segmentation_schema=json.dumps({"segments": []}),
    mode="balanced"
)
result = client.segment("combined.pdf", options=seg_options)

# Process each segment with appropriate schema using the Extract API
extraction_schemas = {
    "Invoice": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "total": {"type": "number"}
        }
    },
    "Contract": {
        "type": "object",
        "properties": {
            "parties": {"type": "array", "items": {"type": "string"}},
            "effective_date": {"type": "string"}
        }
    }
}

extracted_data = {}

for segment in result.segmentation_results["segments"]:
    segment_name = segment["name"]
    pages = segment["pages"]

    schema = extraction_schemas.get(segment_name)
    if schema:
        # Build page range string
        page_range = ",".join(str(p) for p in pages)

        options = ExtractOptions(
            page_schema=json.dumps(schema),
            page_range=page_range,
            mode="balanced"
        )

        seg_result = client.extract("combined.pdf", options=options)
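        # Append rather than assign: a batch may contain several segments with the same name
        extracted_data.setdefault(segment_name, []).append(
            json.loads(seg_result.extraction_schema_json)
        )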

print(extracted_data)
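
Separating the planning step from the API calls makes the pipeline easier to check on its own. The sketch below builds one extraction job per segment from the segmentation output and the schema mapping. `build_extraction_jobs` is a hypothetical helper, and the comma-separated page string follows the format used in the loop above.

```python
def build_extraction_jobs(segments, schemas):
    """Return one (name, page_range, schema) tuple per segment that has a schema.

    Segments with the same name stay separate, so two invoices in one
    batch produce two jobs rather than one merged page range.
    """
    jobs = []
    for segment in segments:
        schema = schemas.get(segment["name"])
        if schema is None:
            continue  # no schema registered for this segment type
        page_range = ",".join(str(p) for p in segment["pages"])
        jobs.append((segment["name"], page_range, schema))
    return jobs


# Example input shaped like the segmentation response.
example_segments = [
    {"name": "Invoice", "pages": [0, 1]},
    {"name": "Research Paper", "pages": [2, 3, 4]},
    {"name": "Invoice", "pages": [5]},
]
example_schemas = {"Invoice": {"type": "object", "properties": {}}}

jobs = build_extraction_jobs(example_segments, example_schemas)
```

Each job can then be passed to `client.extract` as in the loop above, with results stored per job so that repeated segment names stay separate.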

Using Checkpoints

If you already converted a document with save_checkpoint=True using the Convert API, pass the checkpoint_id to SegmentOptions to skip re-parsing. This saves time and cost when running segmentation on a previously converted document.
from datalab_sdk import DatalabClient, ConvertOptions, SegmentOptions
import json

client = DatalabClient()

# Step 1: Convert and save checkpoint
convert_result = client.convert("combined.pdf", options=ConvertOptions(save_checkpoint=True))
checkpoint_id = convert_result.checkpoint_id

# Step 2: Segment using checkpoint (no re-parsing needed)
options = SegmentOptions(
    segmentation_schema=json.dumps({"segments": []}),
    checkpoint_id=checkpoint_id
)
result = client.segment("combined.pdf", options=options)
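
To reuse a checkpoint across separate runs, the `checkpoint_id` has to be stored somewhere. The sketch below keeps a small local JSON cache keyed by a hash of the file contents. This is an assumption-laden illustration, not an SDK feature: the helper names and the cache file are hypothetical, and the sketch does not account for how long a checkpoint remains valid on the server.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """SHA-256 of the file contents, used as a stable cache key."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def load_checkpoint_id(cache_path, pdf_path):
    """Return the cached checkpoint ID for this file, or None if absent."""
    cache_file = Path(cache_path)
    if not cache_file.exists():
        return None
    cache = json.loads(cache_file.read_text())
    return cache.get(file_digest(pdf_path))


def save_checkpoint_id(cache_path, pdf_path, checkpoint_id):
    """Record the checkpoint ID for this file in the local JSON cache."""
    cache_file = Path(cache_path)
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    cache[file_digest(pdf_path)] = checkpoint_id
    cache_file.write_text(json.dumps(cache, indent=2))
```

Keying on the content hash rather than the filename means a renamed copy of the same PDF still finds its checkpoint, while an edited file with the same name does not.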

Custom Segmentation Schema

Define expected segment types for better accuracy:
segmentation_schema = {
    "segments": [
        {"type": "invoice", "description": "Invoice or billing document"},
        {"type": "contract", "description": "Legal contract or agreement"},
        {"type": "receipt", "description": "Payment receipt"}
    ]
}
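
When the list of expected types lives in configuration, the schema can be generated instead of written by hand. A minimal sketch, assuming the `type` and `description` keys shown above are the only fields needed; `build_segmentation_schema` is a hypothetical helper, and its output is the JSON string that `SegmentOptions(segmentation_schema=...)` receives in the earlier examples.

```python
import json


def build_segmentation_schema(segment_types):
    """Turn a {type: description} mapping into the schema shape above, as JSON."""
    schema = {
        "segments": [
            {"type": seg_type, "description": description}
            for seg_type, description in segment_types.items()
        ]
    }
    return json.dumps(schema)


schema_json = build_segmentation_schema({
    "invoice": "Invoice or billing document",
    "contract": "Legal contract or agreement",
    "receipt": "Payment receipt",
})
```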

Next Steps

Structured Extraction

Extract structured data from document segments using JSON schemas.

Handling Long Documents

Tips for TOC-based segmentation on documents with 50+ pages.

Document Conversion

Convert documents to Markdown, HTML, JSON, or chunks.

Workflows

Create and execute document processing pipelines.