# Document Segmentation

> Segment documents into logical sections using the Datalab SDK.

## Basic Usage

```python
import json
from datalab_sdk import DatalabClient, SegmentOptions

client = DatalabClient()

# Define a segmentation schema with section names and descriptions
segmentation_schema = json.dumps({
    "sections": [
        {"name": "introduction", "description": "Introduction and overview"},
        {"name": "methodology", "description": "Methods and approach"},
        {"name": "results", "description": "Findings and results"},
        {"name": "conclusion", "description": "Summary and conclusions"},
        {"name": "references", "description": "Bibliography and references"}
    ]
})

options = SegmentOptions(segmentation_schema=segmentation_schema)
result = client.segment("research_paper.pdf", options=options)

# Access segmentation results
segments = result.segmentation_results
for segment in segments:
    print(f"{segment['name']}: pages {segment['page_range']}")
```
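
The schema above is a JSON string with a top-level `sections` list of `name`/`description` pairs. As a minimal sketch, the hypothetical helper below (`build_segmentation_schema` is not part of `datalab_sdk`) builds that string from tuples and catches empty or duplicate entries locally. Rejecting duplicate names is an assumption made here to keep results unambiguous; the source does not say how the API treats them.

```python
import json


def build_segmentation_schema(sections):
    """Serialize (name, description) pairs into a segmentation schema string.

    Hypothetical helper for illustration; not part of datalab_sdk.
    """
    seen = set()
    entries = []
    for name, description in sections:
        if not name or not description:
            raise ValueError("each section needs a non-empty name and description")
        if name in seen:
            # Assumption: duplicate names are treated as a mistake here.
            raise ValueError(f"duplicate section name: {name!r}")
        seen.add(name)
        entries.append({"name": name, "description": description})
    if not entries:
        raise ValueError("schema needs at least one section")
    return json.dumps({"sections": entries})


schema = build_segmentation_schema([
    ("introduction", "Introduction and overview"),
    ("results", "Findings and results"),
])
```

The returned string can then be passed as `SegmentOptions(segmentation_schema=schema)`.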

## Segment Options

Use `SegmentOptions` to configure segmentation behavior:

| Option                | Type | Default      | Description                                                                             |
| --------------------- | ---- | ------------ | --------------------------------------------------------------------------------------- |
| `segmentation_schema` | str  | **Required** | JSON schema defining segment names and descriptions                                     |
| `checkpoint_id`       | str  | None         | Checkpoint ID from a previous `convert()` call                                          |
| `mode`                | str  | `"fast"`     | Processing mode: `"fast"`, `"balanced"`, `"accurate"`                                   |
| `save_checkpoint`     | bool | `False`      | Save checkpoint for reuse with subsequent calls                                         |
| `max_pages`           | int  | None         | Maximum number of pages to process                                                      |
| `page_range`          | str  | None         | Specific pages to process (e.g., `"0-5,10"`). For spreadsheets, filters by sheet index. |
| `skip_cache`          | bool | `False`      | Skip cached results, force reprocessing                                                 |
| `webhook_url`         | str  | None         | Webhook URL for completion notification                                                 |

## Checkpoint Reuse

Use checkpoints to avoid re-parsing a document when running segmentation after conversion. First convert with `save_checkpoint=True`, then segment using the returned `checkpoint_id`:

```python
import json
from datalab_sdk import DatalabClient, ConvertOptions, SegmentOptions

client = DatalabClient()

# Step 1: Convert and save a checkpoint
convert_options = ConvertOptions(
    mode="accurate",
    save_checkpoint=True,
)
convert_result = client.convert("report.pdf", options=convert_options)
print(convert_result.markdown)

# Step 2: Segment using the checkpoint (no re-parsing needed)
segmentation_schema = json.dumps({
    "sections": [
        {"name": "executive_summary", "description": "Executive summary"},
        {"name": "financials", "description": "Financial data and analysis"},
        {"name": "outlook", "description": "Future outlook and projections"},
    ]
})

segment_options = SegmentOptions(
    segmentation_schema=segmentation_schema,
    checkpoint_id=convert_result.checkpoint_id,
)
segment_result = client.segment("report.pdf", options=segment_options)
print(segment_result.segmentation_results)
```

## Segmentation Result

The result object contains the segmentation data alongside standard conversion fields:

```python
# `client` and `options` are created as shown in Basic Usage
result = client.segment("document.pdf", options=options)

# Segmentation results (list of segments with names and page ranges)
segments = result.segmentation_results
for segment in segments:
    print(f"Section: {segment['name']}")
    print(f"  Pages: {segment['page_range']}")

# Standard conversion fields are also available
print(result.success)
print(result.markdown)
print(result.page_count)
print(result.cost_breakdown)
```
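
Since each segment is accessed by its `name` and `page_range` keys, it can be convenient to index the list by name and to check which schema sections came back. The following is a rough sketch with two hypothetical helpers. The sample data is illustrative only, and the exact value type of `page_range` in real results is an assumption.

```python
def pages_by_section(segments):
    """Map each segment name to its page range."""
    return {segment["name"]: segment["page_range"] for segment in segments}


def missing_sections(segments, expected_names):
    """Return the expected section names that do not appear in the results."""
    found = {segment["name"] for segment in segments}
    return [name for name in expected_names if name not in found]


# Illustrative data only; real values come from result.segmentation_results.
sample_segments = [
    {"name": "introduction", "page_range": "0-1"},
    {"name": "results", "page_range": "2-6"},
]

lookup = pages_by_section(sample_segments)
not_found = missing_sections(
    sample_segments, ["introduction", "results", "references"]
)
```

With real output, pass `result.segmentation_results` in place of `sample_segments`.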

## Async Usage

```python
import asyncio
import json
from datalab_sdk import AsyncDatalabClient, SegmentOptions

async def segment_document():
    async with AsyncDatalabClient() as client:
        segmentation_schema = json.dumps({
            "sections": [
                {"name": "introduction", "description": "Introduction"},
                {"name": "body", "description": "Main content"},
                {"name": "conclusion", "description": "Conclusion"},
            ]
        })
        options = SegmentOptions(segmentation_schema=segmentation_schema)
        result = await client.segment("document.pdf", options=options)
        return result.segmentation_results

segments = asyncio.run(segment_document())
print(segments)
```
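
The async client also makes it possible to segment several documents concurrently. The sketch below caps the number of in-flight requests with an `asyncio.Semaphore`. `segment_many` and `max_concurrency` are hypothetical names, and the function only assumes that `client.segment(path, options=options)` is awaitable as in the example above. Suitable concurrency limits for the service are not covered here.

```python
import asyncio


async def segment_many(client, paths, options, max_concurrency=4):
    """Segment several files concurrently and return {path: segments}.

    Hypothetical helper; `client` is expected to expose an awaitable
    segment(path, options=...) method, as AsyncDatalabClient does above.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def segment_one(path):
        # At most max_concurrency calls run at the same time.
        async with semaphore:
            result = await client.segment(path, options=options)
            return path, result.segmentation_results

    pairs = await asyncio.gather(*(segment_one(path) for path in paths))
    return dict(pairs)
```

Inside `async with AsyncDatalabClient() as client:`, this could be called as `await segment_many(client, ["a.pdf", "b.pdf"], options)`. Note that `asyncio.gather` raises the first exception it encounters by default, so one failed document fails the whole batch unless errors are handled inside `segment_one`.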

## Next Steps

<CardGroup cols={2}>
  <Card title="Segmentation Recipe" icon="scissors" href="/docs/recipes/document-segmentation/auto-segmentation">
    Learn more about document segmentation patterns and use cases.
  </Card>

  <Card title="Structured Extraction" icon="table" href="/docs/welcome/sdk/extraction">
    Extract structured data from documents using JSON schemas.
  </Card>

  <Card title="Document Conversion" icon="file-lines" href="/docs/welcome/sdk/conversion">
    Convert documents to Markdown, HTML, JSON, or chunks.
  </Card>
</CardGroup>
