# Structured Extraction

> Extract structured data from documents using JSON schemas with the Datalab SDK.

## Basic Usage

```python
import json
from datalab_sdk import DatalabClient, ExtractOptions

client = DatalabClient()

# Define a JSON schema for extraction
page_schema = json.dumps({
    "invoice_number": {"type": "string", "description": "Invoice number"},
    "total": {"type": "number", "description": "Total amount due"},
    "vendor": {"type": "string", "description": "Vendor or company name"},
    "items": {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "description": {"type": "string"},
                "amount": {"type": "number"}
            }
        }
    }
})

options = ExtractOptions(page_schema=page_schema)
result = client.extract("invoice.pdf", options=options)

# Access the extracted data
extracted = json.loads(result.extraction_schema_json)
print(extracted)
```

## Extract Options

Use `ExtractOptions` to configure extraction behavior:

| Option            | Type | Default      | Description                                                                                                                                                |
| ----------------- | ---- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `page_schema`     | str  | **Required** | JSON schema defining the fields to extract                                                                                                                 |
| `checkpoint_id`   | str  | None         | Checkpoint ID from a previous `convert()` call                                                                                                             |
| `mode`            | str  | `"fast"`     | Parse mode: `"fast"`, `"balanced"`, `"accurate"`. Controls document parsing.                                                                               |
| `extraction_mode` | str  | None         | Extraction mode: `"fast"` or `"balanced"`. When left unset (`None`), `"balanced"` is used. Controls the extraction pipeline. See [Balanced Mode](/docs/recipes/structured-extraction/balanced-mode). |
| `output_format`   | str  | `"markdown"` | Output format: `"markdown"`, `"html"`, `"json"`, `"chunks"`                                                                                                |
| `save_checkpoint` | bool | `False`      | Save checkpoint for reuse with subsequent calls                                                                                                            |
| `max_pages`       | int  | None         | Maximum number of pages to process                                                                                                                         |
| `page_range`      | str  | None         | Specific pages to process (e.g., `"0-5,10"`). For spreadsheets, filters by sheet index.                                                                    |
| `skip_cache`      | bool | `False`      | Skip cached results, force reprocessing                                                                                                                    |
| `webhook_url`     | str  | None         | Webhook URL for completion notification                                                                                                                    |

<Note>
  `mode` and `extraction_mode` are independent. `mode` controls how the document is parsed (OCR, layout analysis), while `extraction_mode` controls how structured data is extracted from the parsed document. You can combine them freely, for example `mode="fast"` with `extraction_mode="balanced"`.
</Note>
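
As a rough illustration of combining the two settings, the sketch below assembles keyword arguments for `ExtractOptions` and rejects values that are not listed in the table above. `build_extract_kwargs` is a hypothetical helper, not part of the SDK, and the allowed-value sets are copied from this page rather than from the API itself.

```python
import json

# Allowed values as listed in the options table; the API may accept others.
PARSE_MODES = {"fast", "balanced", "accurate"}
EXTRACTION_MODES = {"fast", "balanced"}


def build_extract_kwargs(schema, mode="fast", extraction_mode=None):
    """Hypothetical helper: validate both modes and return ExtractOptions kwargs."""
    if mode not in PARSE_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if extraction_mode is not None and extraction_mode not in EXTRACTION_MODES:
        raise ValueError(f"unknown extraction_mode: {extraction_mode!r}")
    kwargs = {"page_schema": json.dumps(schema), "mode": mode}
    if extraction_mode is not None:
        # Omit the key entirely so the default applies when nothing is requested.
        kwargs["extraction_mode"] = extraction_mode
    return kwargs


schema = {"title": {"type": "string", "description": "Document title"}}
kwargs = build_extract_kwargs(schema, mode="fast", extraction_mode="balanced")
# With the SDK installed: options = ExtractOptions(**kwargs)
```

Validating locally like this only catches typos early; the service remains the authority on which combinations are accepted.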

## Checkpoint Reuse

Use checkpoints to avoid re-parsing a document when running extraction after conversion. First convert with `save_checkpoint=True`, then extract using the returned `checkpoint_id`:

```python
import json
from datalab_sdk import DatalabClient, ConvertOptions, ExtractOptions

client = DatalabClient()

# Step 1: Convert and save a checkpoint
convert_options = ConvertOptions(
    mode="accurate",
    save_checkpoint=True,
)
convert_result = client.convert("report.pdf", options=convert_options)
print(convert_result.markdown)

# Step 2: Extract using the checkpoint (no re-parsing needed)
page_schema = json.dumps({
    "title": {"type": "string", "description": "Document title"},
    "author": {"type": "string", "description": "Author name"},
    "date": {"type": "string", "description": "Publication date"},
    "summary": {"type": "string", "description": "Brief summary of the document"},
})

extract_options = ExtractOptions(
    page_schema=page_schema,
    checkpoint_id=convert_result.checkpoint_id,
)
extract_result = client.extract("report.pdf", options=extract_options)
extracted = json.loads(extract_result.extraction_schema_json)
print(extracted)
```
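
Because a saved checkpoint is meant for reuse with subsequent calls, one conversion could plausibly feed several extractions with different schemas. The sketch below shows that pattern. `extract_with_schemas` is a hypothetical helper; the client and options class are passed in as arguments so the block does not import the SDK, and it assumes they behave like `DatalabClient` and `ExtractOptions` as used above.

```python
import json


def extract_with_schemas(client, options_cls, path, checkpoint_id, schemas):
    """Run one extraction per schema, reusing a single saved checkpoint.

    `client` is expected to behave like DatalabClient and `options_cls`
    like ExtractOptions (both are assumptions of this sketch).
    """
    results = {}
    for name, schema in schemas.items():
        options = options_cls(
            page_schema=json.dumps(schema),
            checkpoint_id=checkpoint_id,
        )
        result = client.extract(path, options=options)
        results[name] = json.loads(result.extraction_schema_json)
    return results


# Possible usage, continuing from the example above:
# results = extract_with_schemas(
#     client, ExtractOptions, "report.pdf", convert_result.checkpoint_id,
#     {"metadata": {"title": {"type": "string"}},
#      "people": {"author": {"type": "string"}}},
# )
```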

## Extraction Result

The result object contains the extracted data alongside standard conversion fields:

```python
# Continues from Basic Usage, where `json`, `client`, and `options` are defined
result = client.extract("invoice.pdf", options=options)

# Extracted structured data (JSON string)
extracted = json.loads(result.extraction_schema_json)
print(extracted["invoice_number"])
print(extracted["total"])

# Standard conversion fields are also available
print(result.success)
print(result.markdown)
print(result.page_count)
print(result.cost_breakdown)
```
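
Since the result carries both a `success` flag and the JSON string, it may be worth checking the flag before parsing. The following is a minimal sketch of that guard. `parse_extraction` is a hypothetical helper, the handling of an empty `extraction_schema_json` is an assumption, and the `SimpleNamespace` object is only a stand-in for a real result.

```python
import json
from types import SimpleNamespace


def parse_extraction(result):
    """Return the extracted fields as a dict, or raise if the call did not succeed."""
    if not result.success:
        raise RuntimeError("extraction did not succeed")
    raw = result.extraction_schema_json
    if not raw:
        # Assumption: the field may be empty or None when nothing was extracted.
        return {}
    return json.loads(raw)


# Stand-in for the object returned by client.extract(), for illustration only.
fake_result = SimpleNamespace(
    success=True,
    extraction_schema_json='{"invoice_number": "INV-001", "total": 125.5}',
)
data = parse_extraction(fake_result)
print(data["invoice_number"], data["total"])
```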

## Async Usage

```python
import asyncio
import json
from datalab_sdk import AsyncDatalabClient, ExtractOptions

async def extract_data():
    async with AsyncDatalabClient() as client:
        page_schema = json.dumps({
            "title": {"type": "string", "description": "Document title"},
            "author": {"type": "string", "description": "Author name"},
        })
        options = ExtractOptions(page_schema=page_schema)
        result = await client.extract("document.pdf", options=options)
        return json.loads(result.extraction_schema_json)

extracted = asyncio.run(extract_data())
print(extracted)
```
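
The async client also makes it possible to extract from several documents concurrently. The sketch below uses `asyncio.gather` with a semaphore to cap the number of requests in flight. `extract_all` and `max_concurrency` are hypothetical names, and the client is passed in as an argument on the assumption that it exposes the same `extract()` coroutine shown above.

```python
import asyncio
import json


async def extract_all(client, paths, options, max_concurrency=4):
    """Extract several documents concurrently, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def extract_one(path):
        # The semaphore limits how many extract() calls run at once.
        async with semaphore:
            result = await client.extract(path, options=options)
            return path, json.loads(result.extraction_schema_json)

    pairs = await asyncio.gather(*(extract_one(p) for p in paths))
    return dict(pairs)


# Possible usage inside a coroutine, with the SDK installed:
# async with AsyncDatalabClient() as client:
#     data = await extract_all(client, ["a.pdf", "b.pdf"], options)
```

The cap is a conservative choice for staying under any rate limits; a suitable value depends on your account and is not specified here.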

## Next Steps

<CardGroup cols={2}>
  <Card title="Extraction Recipe" icon="table" href="/docs/recipes/structured-extraction/api-overview">
    Learn more about structured extraction patterns and best practices.
  </Card>

  <Card title="Document Segmentation" icon="scissors" href="/docs/welcome/sdk/segmentation">
    Segment documents into logical sections using schemas.
  </Card>

  <Card title="Document Conversion" icon="file-lines" href="/docs/welcome/sdk/conversion">
    Convert documents to Markdown, HTML, JSON, or chunks.
  </Card>

  <Card title="Batch Processing" icon="layer-group" href="/docs/recipes/conversion/batch-documents">
    Process multiple documents efficiently in parallel.
  </Card>
</CardGroup>
