Extract specific fields from documents by providing a JSON schema. Marker parses the document and fills in your schema with extracted values.

SDK Usage

import json
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

# Define your extraction schema
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice ID or number"},
        "total_amount": {"type": "number", "description": "Total amount due"},
        "vendor_name": {"type": "string", "description": "Company or vendor name"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "unit_price": {"type": "number"},
                    "total": {"type": "number"}
                }
            }
        }
    },
    "required": ["invoice_number", "total_amount"]
}

options = ConvertOptions(
    page_schema=json.dumps(schema),
    mode="balanced"
)

result = client.convert("invoice.pdf", options=options)

# Access extracted data
extracted = json.loads(result.extraction_schema_json)
print(f"Invoice: {extracted['invoice_number']}")
print(f"Total: ${extracted['total_amount']}")

Schema Format

Use JSON Schema format to define what you want to extract:
{
  "type": "object",
  "properties": {
    "field_name": {
      "type": "string",
      "description": "Describe what this field contains"
    },
    "numeric_field": {
      "type": "number",
      "description": "A numeric value"
    },
    "list_field": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "nested_field": {"type": "string"}
        }
      }
    }
  },
  "required": ["field_name"]
}
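
Because the schema is ordinary JSON Schema, the decoded result can be sanity-checked against it before it is used downstream. The sketch below is a minimal, illustrative checker: check_against_schema is a hypothetical helper written for this page, not part of the Datalab SDK, and it only understands the type, properties, items, and required keywords shown above. For anything beyond a quick check, a dedicated JSON Schema validation library is likely a better fit.

```python
# Map JSON Schema type names to Python types.
_PY_TYPES = {
    "string": str,
    "number": (int, float),
    "array": list,
    "object": dict,
}


def check_against_schema(value, schema, path="$"):
    """Return a list of problems found; an empty list means none were found.

    Hypothetical helper: covers only type/properties/items/required.
    """
    expected = schema.get("type")
    py_type = _PY_TYPES.get(expected)
    if py_type is not None:
        # bool is a subclass of int in Python, so reject it for "number".
        ok = isinstance(value, py_type) and not (
            expected == "number" and isinstance(value, bool)
        )
        if not ok:
            return [f"{path}: expected {expected}, got {type(value).__name__}"]

    problems = []
    if expected == "object":
        for name in schema.get("required", []):
            if name not in value:
                problems.append(f"{path}.{name}: required field missing")
        for name, sub_schema in schema.get("properties", {}).items():
            if name in value:
                problems.extend(
                    check_against_schema(value[name], sub_schema, f"{path}.{name}")
                )
    elif expected == "array":
        item_schema = schema.get("items", {})
        for i, item in enumerate(value):
            problems.extend(check_against_schema(item, item_schema, f"{path}[{i}]"))
    return problems


schema = {
    "type": "object",
    "properties": {
        "field_name": {"type": "string"},
        "numeric_field": {"type": "number"},
        "list_field": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"nested_field": {"type": "string"}},
            },
        },
    },
    "required": ["field_name"],
}

# Made-up extraction result with three deliberate mistakes.
sample = {"numeric_field": "12", "list_field": [{"nested_field": 5}]}
for problem in check_against_schema(sample, schema):
    print(problem)
```

Keys that are not declared in the schema are ignored, so extra fields in the result do not trigger false alarms.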

Tips for Better Extraction

  1. Use descriptive field names - invoice_number is clearer than id
  2. Add descriptions - The description field helps the model understand context
  3. Specify types correctly - Use number for numeric values, string for text
  4. Use arrays for repeating data - Line items, table rows, etc.
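
The first two tips can be checked mechanically before a schema is submitted. The following is a rough sketch only: lint_schema and the GENERIC_NAMES set are hypothetical and chosen for illustration, and nothing here is provided by Datalab. It walks the schema and reports fields that have a generic name or no description.

```python
# Illustrative, deliberately short list of names that say little about content.
GENERIC_NAMES = {"id", "value", "data", "field"}


def lint_schema(schema, path=""):
    """Return warnings for generic field names and missing descriptions."""
    warnings = []
    for name, prop in schema.get("properties", {}).items():
        here = f"{path}.{name}" if path else name
        if name.lower() in GENERIC_NAMES:
            warnings.append(f"{here}: generic name, consider a more specific one")
        if not prop.get("description"):
            warnings.append(f"{here}: no description")
        # Recurse into nested objects and into the item schema of arrays.
        if prop.get("type") == "object":
            warnings.extend(lint_schema(prop, here))
        elif prop.get("type") == "array":
            warnings.extend(lint_schema(prop.get("items", {}), f"{here}[]"))
    return warnings


draft = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "total_amount": {"type": "number", "description": "Total amount due"},
        "line_items": {
            "type": "array",
            "description": "Invoice line items",
            "items": {
                "type": "object",
                "properties": {"quantity": {"type": "number"}},
            },
        },
    },
}

for warning in lint_schema(draft):
    print(warning)
```

Whether every nested field needs its own description is a judgment call; the check simply makes the omissions visible.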

REST API

curl -X POST https://www.datalab.to/api/v1/marker \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@invoice.pdf" \
  -F "output_format=json" \
  -F "mode=balanced" \
  -F 'page_schema={"type":"object","properties":{"invoice_number":{"type":"string"},"total":{"type":"number"}}}'

Python Example

import requests
import json
import time

API_KEY = "YOUR_API_KEY"
headers = {"X-API-Key": API_KEY}

schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice ID"},
        "total_amount": {"type": "number", "description": "Total due"}
    }
}

# Submit request
with open("invoice.pdf", "rb") as f:
    response = requests.post(
        "https://www.datalab.to/api/v1/marker",
        files={"file": ("invoice.pdf", f, "application/pdf")},
        data={
            "page_schema": json.dumps(schema),
            "output_format": "json",
            "mode": "balanced"
        },
        headers=headers
    )

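# Fail early on an HTTP error instead of hitting a KeyError on the next line
response.raise_for_status()
check_url = response.json()["request_check_url"]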

# Poll for results
while True:
    result = requests.get(check_url, headers=headers).json()

    if result["status"] == "complete":
        extracted = json.loads(result["extraction_schema_json"])
        print(extracted)
        break
    elif result["status"] == "failed":
        print(f"Error: {result.get('error')}")
        break

    time.sleep(2)
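
The loop above polls indefinitely if the request never reaches complete or failed. One possible way to bound it is sketched below. poll_until_done is a hypothetical helper, not an SDK function, and the 300-second default timeout is an arbitrary assumption. It takes the status fetch as a callable so the waiting logic stays independent of the HTTP client, and it treats any status other than complete or failed as still pending.

```python
import time


def poll_until_done(fetch_status, interval=2.0, timeout=300.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it reports completion, failure, or timeout.

    Hypothetical helper: fetch_status must return the decoded JSON body
    (a dict) of the check URL response.
    """
    deadline = clock() + timeout
    while True:
        result = fetch_status()
        status = result.get("status")
        if status == "complete":
            return result
        if status == "failed":
            raise RuntimeError(f"Extraction failed: {result.get('error')}")
        # Any other status is treated as "still processing".
        if clock() >= deadline:
            raise TimeoutError(f"No result after {timeout} seconds")
        sleep(interval)


# Usage with the variables from the example above (not executed here):
#   result = poll_until_done(
#       lambda: requests.get(check_url, headers=headers).json()
#   )
#   extracted = json.loads(result["extraction_schema_json"])
```

Raising exceptions instead of printing lets the caller decide how to handle a failed or stalled extraction.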

Response

The extracted data is returned in extraction_schema_json as a JSON-encoded string, so decode it (for example with json.loads) before reading fields:
{
  "status": "complete",
  "success": true,
  "json": {...},
  "extraction_schema_json": "{\"invoice_number\": \"INV-2024-001\", \"total_amount\": 1500.00, ...}",
  "page_count": 2
}

Citation Tracking

Each extracted field is accompanied by a sibling <field>_citations key that lists the IDs of the source blocks the value was drawn from:
{
  "invoice_number": "INV-2024-001",
  "invoice_number_citations": ["block_123", "block_124"],
  "total_amount": 1500.00,
  "total_amount_citations": ["block_456"]
}
Use these block IDs with the json output to trace extracted values back to the source document.
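
Since values and citations share one flat object, it can be convenient to separate them before further processing. The sketch below assumes only the <field>_citations naming shown in the sample above; split_citations is a hypothetical helper, not an SDK function, and it makes no assumption about the structure of the json output itself.

```python
def split_citations(extracted, suffix="_citations"):
    """Split a decoded extraction result into (values, citations).

    A key counts as a citation list only if it ends with the suffix AND the
    matching field exists, so a schema field that merely happens to end in
    "_citations" is left among the values.
    """
    values, citations = {}, {}
    for key, val in extracted.items():
        field = key[: -len(suffix)]
        if key.endswith(suffix) and field in extracted:
            citations[field] = val
        else:
            values[key] = val
    return values, citations


extracted = {
    "invoice_number": "INV-2024-001",
    "invoice_number_citations": ["block_123", "block_124"],
    "total_amount": 1500.00,
    "total_amount_citations": ["block_456"],
}

values, citations = split_citations(extracted)
for field, value in values.items():
    print(field, value, citations.get(field, []))
```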

Examples

Financial Document

schema = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string", "description": "Company name"},
        "fiscal_year": {"type": "string", "description": "Fiscal year"},
        "total_revenue": {"type": "number", "description": "Total revenue in dollars"},
        "net_income": {"type": "number", "description": "Net income in dollars"},
        "eps": {"type": "number", "description": "Earnings per share"}
    }
}

Scientific Paper

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "description": "Paper title"},
        "authors": {
            "type": "array",
            "items": {"type": "string"},
            "description": "List of author names"
        },
        "abstract": {"type": "string", "description": "Paper abstract"},
        "keywords": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Keywords or tags"
        }
    }
}

Contract

schema = {
    "type": "object",
    "properties": {
        "parties": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "role": {"type": "string"}
                }
            }
        },
        "effective_date": {"type": "string", "description": "Contract start date"},
        "termination_date": {"type": "string", "description": "Contract end date"},
        "total_value": {"type": "number", "description": "Total contract value"}
    }
}

Using Forge Playground

Create and test schemas visually in Forge Playground:
  1. Upload a sample document
  2. Define fields in the visual editor
  3. Switch to JSON Editor to copy the schema
  4. Test extraction before deploying
