
# Document Segmentation

> Automatically split multi-document PDFs into separate segments.

Segmentation finds the boundaries between documents inside a single PDF (for example, a batch-scanned file) and returns a name and page list for each component document.

**Before you begin**, make sure you have:

1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits)
2. Python 3.10+ installed
3. The Datalab SDK: `pip install datalab-python-sdk`
4. Your `DATALAB_API_KEY` environment variable set

<Info>
  **Building for production?** Use [Pipelines](/docs/recipes/pipelines/pipeline-overview) to chain processors, version your configuration, and deploy with a single API call.
</Info>

## Quick Start

<CodeGroup>
  ```python Python SDK theme={null}
  import json
  from datalab_sdk import DatalabClient, SegmentOptions

  client = DatalabClient()

  # Define segmentation schema
  segmentation_schema = {
      "segments": []
  }

  options = SegmentOptions(
      segmentation_schema=json.dumps(segmentation_schema),
      mode="balanced"
  )

  result = client.segment("combined_documents.pdf", options=options)

  # Access segmentation results
  for segment in result.segmentation_results["segments"]:
      print(f"{segment['name']}: pages {segment['pages']}")
  ```

  ```bash cURL theme={null}
  curl -X POST https://www.datalab.to/api/v1/segment \
    -H "X-API-Key: YOUR_API_KEY" \
    -F "file=@combined_documents.pdf" \
    -F "output_format=markdown" \
    -F "mode=balanced" \
    -F 'segmentation_schema={"segments": []}'
  ```

  ```python Python (requests) theme={null}
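  import json
  import os
  import time

  import requests

  # Read the key from the DATALAB_API_KEY variable set in "Before you begin"
  API_KEY = os.environ["DATALAB_API_KEY"]
  headers = {"X-API-Key": API_KEY}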

  # Submit segmentation request
  with open("combined.pdf", "rb") as f:
      response = requests.post(
          "https://www.datalab.to/api/v1/segment",
          files={"file": ("combined.pdf", f, "application/pdf")},
          data={
              "output_format": "markdown",
              "mode": "balanced",
              "segmentation_schema": json.dumps({"segments": []})
          },
          headers=headers
      )

  # Stop early on HTTP errors instead of failing later on a missing key
  response.raise_for_status()
  check_url = response.json()["request_check_url"]

  # Poll for results
  while True:
      result = requests.get(check_url, headers=headers).json()

      if result["status"] == "complete":
          segments = result["segmentation_results"]["segments"]
          for seg in segments:
              print(f"{seg['name']}: pages {seg['pages']}")
          break
      elif result["status"] == "failed":
          print(f"Error: {result.get('error')}")
          break

      time.sleep(2)
  ```
</CodeGroup>
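
The polling loop in the `requests` example runs until the job reports `complete` or `failed`. A bounded variant might look like the sketch below. The helper name `poll_until_done`, its `fetch_status` callable, and the default interval and timeout are illustrative assumptions, not part of the Datalab SDK; only the `status` values come from the example above.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], dict],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status() until the job completes, fails, or the timeout passes.

    fetch_status is any zero-argument callable that returns the decoded JSON
    body of the check URL (hypothetical wiring shown in the note below).
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch_status()
        status = result.get("status")
        if status == "complete":
            return result
        if status == "failed":
            raise RuntimeError(f"Segmentation failed: {result.get('error')}")
        # Any other status is treated as "still processing"
        if time.monotonic() >= deadline:
            raise TimeoutError(f"No result after {timeout} seconds")
        sleep(interval)
```

With the variables from the `requests` example, this could be called as `poll_until_done(lambda: requests.get(check_url, headers=headers).json())`. Passing the fetch step as a callable keeps the loop testable without network access.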

## When to Use

Segmentation is useful when:

* Batch-scanned documents are combined into a single PDF
* Multiple document types are stapled together
* You need to apply different processing to different sections

## Response Format

```json theme={null}
{
  "segmentation_results": {
    "segments": [
      {
        "name": "Research Paper",
        "pages": [0, 1, 2],
        "confidence": "medium"
      },
      {
        "name": "Invoice",
        "pages": [3, 4],
        "confidence": "high"
      }
    ],
    "metadata": {
      "total_pages": 5,
      "segmentation_method": "auto_detected"
    }
  }
}
```
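
In the sample above, each segment lists its pages as zero-based indices. For logging or review, a small helper can collapse those lists into compact ranges. This is a sketch only: `pages_to_range` and `summarize` are hypothetical names, and the field names are taken from the sample response rather than from a formal schema.

```python
import json


def pages_to_range(pages: list[int]) -> str:
    """Collapse page indices into a compact string, e.g. [0, 1, 2, 5] -> "0-2,5"."""
    parts = []
    start = prev = None
    for page in sorted(pages):
        if start is None:
            start = prev = page
        elif page == prev + 1:
            prev = page  # Still inside a consecutive run
        else:
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = page
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


def summarize(response: dict) -> list[str]:
    """Return one line per segment: name, page range, and confidence."""
    segments = response["segmentation_results"]["segments"]
    return [
        f"{seg['name']}: pages {pages_to_range(seg['pages'])} "
        f"({seg.get('confidence', 'unknown')})"
        for seg in segments
    ]


# The sample response from above, trimmed to the fields used here
sample = json.loads("""
{"segmentation_results": {"segments": [
  {"name": "Research Paper", "pages": [0, 1, 2], "confidence": "medium"},
  {"name": "Invoice", "pages": [3, 4], "confidence": "high"}
]}}
""")

for line in summarize(sample):
    print(line)
```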

## Process Each Segment

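After segmentation, run extraction once per segment: choose a schema by segment name and restrict the request to that segment's pages with `page_range`: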

```python theme={null}
import json
from datalab_sdk import DatalabClient, SegmentOptions, ExtractOptions

client = DatalabClient()

# First, get segments
seg_options = SegmentOptions(
    segmentation_schema=json.dumps({"segments": []}),
    mode="balanced"
)
result = client.segment("combined.pdf", options=seg_options)

# Extraction schemas for the Extract API, keyed by segment name.
# Keys must match the names returned by segmentation.
extraction_schemas = {
    "Invoice": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "total": {"type": "number"}
        }
    },
    "Contract": {
        "type": "object",
        "properties": {
            "parties": {"type": "array", "items": {"type": "string"}},
            "effective_date": {"type": "string"}
        }
    }
}

extracted_data = {}

for segment in result.segmentation_results["segments"]:
    segment_name = segment["name"]
    pages = segment["pages"]

    schema = extraction_schemas.get(segment_name)
    if schema:
        # Build a comma-separated page_range string, e.g. [3, 4] -> "3,4"
        page_range = ",".join(str(p) for p in pages)

        options = ExtractOptions(
            page_schema=json.dumps(schema),
            page_range=page_range,
            mode="balanced"
        )

        seg_result = client.extract("combined.pdf", options=options)
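        # Collect results in a list so repeated segment names (e.g. two invoices) are not overwritten
        extracted_data.setdefault(segment_name, []).append(
            json.loads(seg_result.extraction_schema_json)
        )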

print(extracted_data)
```
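
Before running extraction on every segment, it can be worth checking that the segments account for each page exactly once. The sketch below assumes the fields shown under Response Format (`segments[].pages` and `metadata.total_pages`) and zero-based page indices; `check_coverage` is a hypothetical helper, and this page does not state whether the service guarantees non-overlapping segments.

```python
from collections import Counter


def check_coverage(segmentation_results: dict) -> dict:
    """Report pages that no segment claims and pages claimed more than once."""
    total = segmentation_results["metadata"]["total_pages"]
    counts = Counter(
        page
        for segment in segmentation_results["segments"]
        for page in segment["pages"]
    )
    return {
        "missing": [p for p in range(total) if counts[p] == 0],
        "duplicated": sorted(p for p, n in counts.items() if n > 1),
    }


# Example input shaped like the Response Format sample, with page 2 left out
example = {
    "segments": [
        {"name": "Research Paper", "pages": [0, 1]},
        {"name": "Invoice", "pages": [3, 4]},
    ],
    "metadata": {"total_pages": 5},
}
print(check_coverage(example))
```

With the SDK result from the example above, the call would be `check_coverage(result.segmentation_results)`.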

## Using Checkpoints

If you already converted a document with `save_checkpoint=True` using the [Convert API](/docs/recipes/conversion/conversion-api-overview), pass the `checkpoint_id` to `SegmentOptions` to skip re-parsing. This saves time and cost when running segmentation on a previously converted document.

```python theme={null}
from datalab_sdk import DatalabClient, ConvertOptions, SegmentOptions
import json

client = DatalabClient()

# Step 1: Convert and save checkpoint
convert_result = client.convert("combined.pdf", options=ConvertOptions(save_checkpoint=True))
checkpoint_id = convert_result.checkpoint_id

# Step 2: Segment using checkpoint (no re-parsing needed)
options = SegmentOptions(
    segmentation_schema=json.dumps({"segments": []}),
    checkpoint_id=checkpoint_id
)
result = client.segment("combined.pdf", options=options)
```
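
To reuse a checkpoint across separate runs, the `checkpoint_id` has to be stored somewhere. One option, sketched here under assumptions, is a local JSON file keyed by the SHA-256 of the PDF. The helper names and the cache file layout are hypothetical, and this page does not say how long a checkpoint remains valid, so treat a stored ID as a hint rather than a guarantee.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


def file_digest(pdf_path: str) -> str:
    """SHA-256 of the file contents, used as the cache key."""
    return hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()


def load_checkpoint(cache_file: str, pdf_path: str) -> Optional[str]:
    """Return the stored checkpoint ID for this file, or None if there is none."""
    cache = Path(cache_file)
    if not cache.exists():
        return None
    return json.loads(cache.read_text()).get(file_digest(pdf_path))


def save_checkpoint(cache_file: str, pdf_path: str, checkpoint_id: str) -> None:
    """Record the checkpoint ID under the file's digest."""
    cache = Path(cache_file)
    entries = json.loads(cache.read_text()) if cache.exists() else {}
    entries[file_digest(pdf_path)] = checkpoint_id
    cache.write_text(json.dumps(entries, indent=2))
```

A run would call `load_checkpoint` first, convert with `save_checkpoint=True` only when it returns `None`, and then store `convert_result.checkpoint_id` with `save_checkpoint`.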

## Custom Segmentation Schema

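To improve accuracy, list the document types you expect, each with a short description, and pass the schema as a JSON string in `segmentation_schema`: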

```python theme={null}
segmentation_schema = {
    "segments": [
        {"type": "invoice", "description": "Invoice or billing document"},
        {"type": "contract", "description": "Legal contract or agreement"},
        {"type": "receipt", "description": "Payment receipt"}
    ]
}
```
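
When the expected types live in configuration, the schema can be generated rather than written by hand. The sketch below builds the same structure from a plain mapping and serializes it, since the Quick Start passes the schema through `json.dumps`. `build_segmentation_schema` is a hypothetical helper; the `type` and `description` field names simply mirror the example above.

```python
import json


def build_segmentation_schema(segment_types: dict[str, str]) -> str:
    """Serialize {type: description} pairs into a JSON string for segmentation_schema."""
    schema = {
        "segments": [
            {"type": name, "description": description}
            for name, description in segment_types.items()
        ]
    }
    return json.dumps(schema)


schema_json = build_segmentation_schema({
    "invoice": "Invoice or billing document",
    "contract": "Legal contract or agreement",
    "receipt": "Payment receipt",
})
print(schema_json)
```

The resulting string can be passed as `SegmentOptions(segmentation_schema=schema_json, mode="balanced")`, following the Quick Start.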

## Next Steps

<CardGroup cols={2}>
  <Card title="Structured Extraction" icon="table" href="/docs/recipes/structured-extraction/api-overview">
    Extract structured data from document segments using JSON schemas.
  </Card>

  <Card title="Handling Long Documents" icon="file-lines" href="/docs/recipes/structured-extraction/handling-long-documents">
    Tips for TOC-based segmentation on documents with 50+ pages.
  </Card>

  <Card title="Document Conversion" icon="file-export" href="/docs/recipes/conversion/conversion-api-overview">
    Convert documents to Markdown, HTML, JSON, or chunks.
  </Card>

  <Card title="Pipelines" icon="workflow" href="/docs/recipes/pipelines/pipeline-overview">
    Chain processors into versioned, reusable pipelines.
  </Card>
</CardGroup>
