Extract specific fields from documents by providing a JSON schema. Marker parses the document and fills in your schema with extracted values.

Before you begin, make sure you have:
  1. A Datalab account with an API key (new accounts include $5 in free credits)
  2. Python 3.10+ installed
  3. The Datalab SDK: pip install datalab-python-sdk
  4. Your DATALAB_API_KEY environment variable set

Quick Start

import json
from datalab_sdk import DatalabClient, ExtractOptions

client = DatalabClient()

schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice ID or number"},
        "total_amount": {"type": "number", "description": "Total amount due"},
        "vendor_name": {"type": "string", "description": "Company or vendor name"}
    },
    "required": ["invoice_number", "total_amount"]
}

options = ExtractOptions(
    page_schema=json.dumps(schema),
    mode="balanced"
)

result = client.extract("invoice.pdf", options=options)
extracted = json.loads(result.extraction_schema_json)
print(f"Invoice: {extracted['invoice_number']}")
print(f"Total: ${extracted['total_amount']}")

Schema Format

Use JSON Schema format to define what you want to extract:
{
  "type": "object",
  "properties": {
    "field_name": {
      "type": "string",
      "description": "Describe what this field contains"
    },
    "numeric_field": {
      "type": "number",
      "description": "A numeric value"
    },
    "list_field": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "nested_field": {"type": "string"}
        }
      }
    }
  },
  "required": ["field_name"]
}
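Because the extraction output is ordinary JSON, the same schema can be reused to sanity-check results on the client side. The sketch below is an illustration only: check_against_schema and JSON_TYPES are hypothetical names, not part of the Datalab SDK, and the check covers just the top-level required list and type keywords rather than full JSON Schema validation.

```python
# Hypothetical client-side helper; not part of the Datalab SDK.
# Maps JSON Schema type names to the Python types that json.loads produces.
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def check_against_schema(data, schema):
    """Return a list of problems found in the top-level fields of data."""
    problems = []
    for name in schema.get("required", []):
        if data.get(name) is None:
            problems.append(f"missing required field: {name}")
    for name, spec in schema.get("properties", {}).items():
        value = data.get(name)
        expected = JSON_TYPES.get(spec.get("type"))
        if value is None or expected is None:
            continue
        # bool is a subclass of int in Python, so flag it for numeric fields.
        is_numeric = spec["type"] in ("number", "integer")
        if not isinstance(value, expected) or (is_numeric and isinstance(value, bool)):
            problems.append(f"{name}: expected {spec['type']}, got {type(value).__name__}")
    return problems


schema = {
    "type": "object",
    "properties": {
        "field_name": {"type": "string"},
        "numeric_field": {"type": "number"},
    },
    "required": ["field_name"],
}
# A result that omits field_name and returns the number as text.
problems = check_against_schema({"numeric_field": "12"}, schema)
print(problems)
```

For production use, a full validator such as the jsonschema package is likely a better fit; this version only shows the idea with the standard library.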

Tips for Better Extraction

  1. Use descriptive field names - invoice_number is clearer than id
  2. Add descriptions - The description field helps the model understand context
  3. Specify types correctly - Use number for numeric values, string for text
  4. Use arrays for repeating data - Line items, table rows, etc.
Common schema pitfalls:
  • Using vague field names like data or info — be specific (e.g., invoice_number, total_amount)
  • Forgetting description fields — these help the model understand what to extract
  • Setting type: "string" for numeric values — use type: "number" for amounts, quantities, etc.
  • Deeply nested schemas — keep schemas as flat as possible for better extraction accuracy
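Some of these pitfalls can be caught mechanically before a schema is sent. The following is a minimal sketch, assuming a schema shaped like the examples on this page; lint_schema, VAGUE_NAMES, and the max_depth threshold are hypothetical and chosen for illustration. It cannot tell whether a string field should have been a number, so that pitfall still needs a manual review.

```python
# Hypothetical schema linter; the names and thresholds here are illustrative.
VAGUE_NAMES = {"data", "info", "id", "value", "field"}


def lint_schema(schema, max_depth=2):
    """Return warnings for vague names, missing descriptions, and deep nesting."""
    warnings = []

    def walk(node, path, depth):
        for name, spec in node.get("properties", {}).items():
            here = f"{path}.{name}" if path else name
            if name.lower() in VAGUE_NAMES:
                warnings.append(f"{here}: vague field name")
            if not spec.get("description"):
                warnings.append(f"{here}: missing description")
            # For arrays, inspect the item definition; otherwise the field itself.
            child = spec.get("items", {}) if spec.get("type") == "array" else spec
            if child.get("type") == "object":
                if depth + 1 > max_depth:
                    warnings.append(f"{here}: nested deeper than {max_depth} levels")
                walk(child, here, depth + 1)

    walk(schema, "", 1)
    return warnings


example = {
    "type": "object",
    "properties": {
        "data": {"type": "string"},
        "invoice_number": {"type": "string", "description": "Invoice ID or number"},
    },
}
lint_warnings = lint_schema(example)
print(lint_warnings)
```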

Response

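The extracted data is returned in extraction_schema_json as a JSON-encoded string, so decode it (for example with json.loads) before reading individual fields: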
{
  "status": "complete",
  "success": true,
  "json": {...},
  "extraction_schema_json": "{\"invoice_number\": \"INV-2024-001\", \"total_amount\": 1500.00, ...}",
  "page_count": 2
}
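If you work with the raw response rather than the SDK result object, decoding might look like the sketch below. It assumes the response has already been loaded into a Python dict with the keys shown above; parse_extraction is a hypothetical helper name, and the sample payload is abbreviated from the example.

```python
import json


def parse_extraction(response):
    """Decode extraction_schema_json from a completed response dict."""
    if not response.get("success") or response.get("status") != "complete":
        raise ValueError(f"extraction not complete: status={response.get('status')!r}")
    # The field holds a JSON string, not a nested object, so it needs a second decode.
    return json.loads(response["extraction_schema_json"])


# Sample payload mirroring the response shape shown above (values are illustrative).
response = {
    "status": "complete",
    "success": True,
    "extraction_schema_json": "{\"invoice_number\": \"INV-2024-001\", \"total_amount\": 1500.00}",
    "page_count": 2,
}
extracted = parse_extraction(response)
print(extracted["invoice_number"], extracted["total_amount"])
```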

Citation Tracking

Each extracted field is accompanied by a matching <field>_citations key that lists the IDs of the source blocks it was taken from:
{
  "invoice_number": "INV-2024-001",
  "invoice_number_citations": ["block_123", "block_124"],
  "total_amount": 1500.00,
  "total_amount_citations": ["block_456"]
}
Use these block IDs with the json output to trace extracted values back to the source document.
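To keep values and citations apart in application code, the decoded result can be split on the _citations suffix. This is a minimal sketch based only on the key naming shown above; split_citations is a hypothetical helper, and how block IDs are laid out inside the json output is not covered here.

```python
CITATION_SUFFIX = "_citations"


def split_citations(extracted):
    """Separate field values from their <field>_citations lists."""
    values, citations = {}, {}
    for key, value in extracted.items():
        field = key[: -len(CITATION_SUFFIX)]
        # Only treat the key as a citation list if its base field is present too.
        if key.endswith(CITATION_SUFFIX) and field in extracted:
            citations[field] = value
        else:
            values[key] = value
    return values, citations


extracted = {
    "invoice_number": "INV-2024-001",
    "invoice_number_citations": ["block_123", "block_124"],
    "total_amount": 1500.00,
    "total_amount_citations": ["block_456"],
}
values, citations = split_citations(extracted)
for field, block_ids in citations.items():
    print(f"{field} = {values[field]!r} (from {', '.join(block_ids)})")
```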

Schema Examples

Financial Document

schema = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string", "description": "Company name"},
        "fiscal_year": {"type": "string", "description": "Fiscal year"},
        "total_revenue": {"type": "number", "description": "Total revenue in dollars"},
        "net_income": {"type": "number", "description": "Net income in dollars"},
        "eps": {"type": "number", "description": "Earnings per share"}
    }
}

Scientific Paper

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "description": "Paper title"},
        "authors": {
            "type": "array",
            "items": {"type": "string"},
            "description": "List of author names"
        },
        "abstract": {"type": "string", "description": "Paper abstract"},
        "keywords": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Keywords or tags"
        }
    }
}

Contract

schema = {
    "type": "object",
    "properties": {
        "parties": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "role": {"type": "string"}
                }
            }
        },
        "effective_date": {"type": "string", "description": "Contract start date"},
        "termination_date": {"type": "string", "description": "Contract end date"},
        "total_value": {"type": "number", "description": "Total contract value"}
    }
}

Using Checkpoints

If you already converted a document with save_checkpoint=True using the Convert API, pass the checkpoint_id to ExtractOptions to skip re-parsing. This saves time and cost when running extraction on a previously converted document.
from datalab_sdk import DatalabClient, ConvertOptions, ExtractOptions
import json

client = DatalabClient()

# Step 1: Convert and save checkpoint
convert_result = client.convert("invoice.pdf", options=ConvertOptions(save_checkpoint=True))
checkpoint_id = convert_result.checkpoint_id

# Step 2: Extract using checkpoint (no re-parsing needed)
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice ID"},
        "total_amount": {"type": "number", "description": "Total due"}
    }
}

options = ExtractOptions(
    page_schema=json.dumps(schema),
    checkpoint_id=checkpoint_id
)
result = client.extract("invoice.pdf", options=options)
extracted = json.loads(result.extraction_schema_json)
The extract endpoint accepts the following parameters: file, page_schema (required), mode, max_pages, page_range, save_checkpoint, checkpoint_id, and webhook_url.

Auto-Generate Schemas

Don’t want to write schemas by hand? Use the schema generation endpoint to automatically suggest schemas for your document. This requires a checkpoint from a previous conversion:
import os
import time

import requests

headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")}

# Step 1: Convert with checkpoint
with open("invoice.pdf", "rb") as f:
    resp = requests.post(
        "https://www.datalab.to/api/v1/convert",
        files={"file": ("invoice.pdf", f, "application/pdf")},
        data={"save_checkpoint": "true", "output_format": "markdown"},
        headers=headers
    )
check_url = resp.json()["request_check_url"]

# Poll until complete
while True:
    result = requests.get(check_url, headers=headers).json()
    if result["status"] == "complete":
        checkpoint_id = result["checkpoint_id"]
        break
    time.sleep(2)

# Step 2: Generate schemas
resp = requests.post(
    "https://www.datalab.to/api/v1/marker/extraction/gen_schemas",
    json={"checkpoint_id": checkpoint_id},
    headers=headers
)
gen_check_url = resp.json()["request_check_url"]

while True:
    result = requests.get(gen_check_url, headers=headers).json()
    if result["status"] == "complete":
        suggestions = result["suggestions"]
        print("Simple schema:", suggestions["simple_schema"])
        print("Moderate schema:", suggestions["moderate_schema"])
        print("Complex schema:", suggestions["complex_schema"])
        break
    time.sleep(2)
The endpoint returns three schema options at different complexity levels — use the one that best matches your needs, then customize it.
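Both polling loops above run until the status becomes "complete", which means they never exit if a request stalls. One way to bound the wait is a small helper like the sketch below. poll_until_complete and its parameters are hypothetical, not part of the Datalab API; it only assumes that each poll returns a dict with a "status" key, as in the example above.

```python
import time


def poll_until_complete(fetch, interval=2.0, max_attempts=150, sleep=time.sleep):
    """Call fetch() until it reports status 'complete' or attempts run out.

    fetch is any zero-argument callable returning the decoded JSON response.
    """
    for _ in range(max_attempts):
        result = fetch()
        if result.get("status") == "complete":
            return result
        sleep(interval)
    raise TimeoutError(f"request not complete after {max_attempts} attempts")


# Stand-in for the HTTP call so the sketch runs without network access.
fake_responses = iter([
    {"status": "processing"},
    {"status": "processing"},
    {"status": "complete", "checkpoint_id": "example-checkpoint"},
])
final = poll_until_complete(lambda: next(fake_responses), sleep=lambda seconds: None)
print(final["checkpoint_id"])

# With the real endpoint, the fetch callable could wrap the GET request from above:
# result = poll_until_complete(lambda: requests.get(check_url, headers=headers).json())
```

Passing the fetch step in as a callable keeps the helper independent of the HTTP library and makes it easy to test with canned responses.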

Using Forge Playground

Create and test schemas visually in Forge Playground:
  1. Upload a sample document
  2. Define fields in the visual editor
  3. Switch to JSON Editor to copy the schema
  4. Test extraction before deploying

Next Steps