Skip to main content

Basic Usage

import json
from datalab_sdk import DatalabClient, ExtractOptions

client = DatalabClient()

# Describe the fields to pull out of the invoice as a JSON schema.
invoice_fields = {
    "invoice_number": {"type": "string", "description": "Invoice number"},
    "total": {"type": "number", "description": "Total amount due"},
    "vendor": {"type": "string", "description": "Vendor or company name"},
    "items": {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "description": {"type": "string"},
                "amount": {"type": "number"},
            },
        },
    },
}

result = client.extract(
    "invoice.pdf",
    options=ExtractOptions(page_schema=json.dumps(invoice_fields)),
)

# The extracted data comes back as a JSON string; decode it to a dict.
print(json.loads(result.extraction_schema_json))

Extract Options

Use ExtractOptions to configure extraction behavior:
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `page_schema` | `str` | Required | JSON schema defining the fields to extract |
| `checkpoint_id` | `str` | `None` | Checkpoint ID from a previous `convert()` call |
| `mode` | `str` | `"fast"` | Processing mode: `"fast"`, `"balanced"`, `"accurate"` |
| `output_format` | `str` | `"markdown"` | Output format: `"markdown"`, `"html"`, `"json"`, `"chunks"` |
| `save_checkpoint` | `bool` | `False` | Save checkpoint for reuse with subsequent calls |
| `max_pages` | `int` | `None` | Maximum number of pages to process |
| `page_range` | `str` | `None` | Specific pages to process (e.g., `"0-5,10"`). For spreadsheets, filters by sheet index. |
| `skip_cache` | `bool` | `False` | Skip cached results, force reprocessing |
| `webhook_url` | `str` | `None` | Webhook URL for completion notification |

Checkpoint Reuse

Use checkpoints to avoid re-parsing a document when running extraction after conversion. First convert with save_checkpoint=True, then extract using the returned checkpoint_id:
import json
from datalab_sdk import DatalabClient, ConvertOptions, ExtractOptions

client = DatalabClient()

# Step 1: Convert with save_checkpoint=True so the parsed document is kept
# server-side for reuse.
convert_result = client.convert(
    "report.pdf",
    options=ConvertOptions(mode="accurate", save_checkpoint=True),
)
print(convert_result.markdown)

# Step 2: Run extraction against the saved checkpoint — no re-parsing needed.
metadata_schema = json.dumps({
    "title": {"type": "string", "description": "Document title"},
    "author": {"type": "string", "description": "Author name"},
    "date": {"type": "string", "description": "Publication date"},
    "summary": {"type": "string", "description": "Brief summary of the document"},
})

extract_result = client.extract(
    "report.pdf",
    options=ExtractOptions(
        page_schema=metadata_schema,
        checkpoint_id=convert_result.checkpoint_id,
    ),
)
print(json.loads(extract_result.extraction_schema_json))

Extraction Result

The result object contains the extracted data alongside standard conversion fields:
result = client.extract("invoice.pdf", options=options)

# extraction_schema_json holds the structured data as a JSON string.
data = json.loads(result.extraction_schema_json)
print(data["invoice_number"])
print(data["total"])

# The standard conversion fields are populated alongside the extraction.
print(result.success)
print(result.markdown)
print(result.page_count)
print(result.cost_breakdown)

Async Usage

import asyncio
import json
from datalab_sdk import AsyncDatalabClient, ExtractOptions

async def extract_data():
    """Extract title and author from document.pdf with the async client."""
    schema = json.dumps({
        "title": {"type": "string", "description": "Document title"},
        "author": {"type": "string", "description": "Author name"},
    })
    async with AsyncDatalabClient() as client:
        result = await client.extract(
            "document.pdf",
            options=ExtractOptions(page_schema=schema),
        )
        return json.loads(result.extraction_schema_json)

print(asyncio.run(extract_data()))

Next Steps

Extraction Recipe

Learn more about structured extraction patterns and best practices.

Document Segmentation

Segment documents into logical sections using schemas.

Document Conversion

Convert documents to Markdown, HTML, JSON, or chunks.

Batch Processing

Process multiple documents efficiently in parallel.