Datalab provides REST APIs for document conversion, structured extraction, form filling, and file management. All APIs use the same authentication and follow similar patterns.
For the simplest integration, use the Python SDK. The SDK handles authentication and polling, and returns typed responses.

Authentication

All requests require an API key in the X-API-Key header:
curl -X POST https://www.datalab.to/api/v1/marker \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@document.pdf"
Get your API key from datalab.to/settings.
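For Python callers, one way to avoid hard-coding the key is to read it from the environment and build the header once. This is a minimal sketch: the helper name `auth_headers` and the `DATALAB_API_KEY` variable name are assumptions made for illustration, not something this page defines.

```python
import os
from typing import Dict, Mapping


def auth_headers(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build the X-API-Key header from an environment-style mapping."""
    # DATALAB_API_KEY is an assumed variable name; use whatever your deployment sets.
    key = env.get("DATALAB_API_KEY", "").strip()
    if not key:
        raise RuntimeError("set DATALAB_API_KEY to your Datalab API key")
    return {"X-API-Key": key}
```

The returned dict can be passed as `headers=` to any of the `requests` calls on this page.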

Request Pattern

All processing endpoints follow this pattern:
  1. Submit a document for processing (returns immediately with a request_id)
  2. Poll the status endpoint until processing completes
  3. Retrieve results from the completed response

Submit Request

POST /api/v1/{endpoint}
Response:
{
  "success": true,
  "request_id": "abc123",
  "request_check_url": "https://www.datalab.to/api/v1/{endpoint}/abc123"
}

Poll for Results

GET /api/v1/{endpoint}/{request_id}
Response while processing:
{
  "status": "processing"
}
Response when complete:
{
  "status": "complete",
  "success": true,
  ...results...
}
Results are deleted from Datalab servers one hour after processing completes. Retrieve your results promptly.
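The submit-then-poll pattern can be wrapped in a small helper that also gives up after a deadline. The sketch below is illustrative only: `poll_until_done` and its parameters are hypothetical names, and it assumes every endpoint reports the `failed` status that the Marker response fields list.

```python
import time
from typing import Any, Callable, Dict


def poll_until_done(
    fetch_status: Callable[[], Dict[str, Any]],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status() until it reports a terminal status or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch_status()
        status = result.get("status")
        if status == "complete":
            return result
        if status == "failed":
            # Assumes failures carry an "error" message, as the Marker response does.
            raise RuntimeError(result.get("error") or "processing failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no result after {timeout} seconds")
        sleep(interval)
```

With `requests`, `fetch_status` could be `lambda: requests.get(check_url, headers=headers).json()`, where `check_url` is the `request_check_url` from the submit response.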

Document Conversion (Marker)

Convert documents to Markdown, HTML, JSON, or chunks. Endpoint: POST /api/v1/marker

Request

import requests

url = "https://www.datalab.to/api/v1/marker"
headers = {"X-API-Key": "YOUR_API_KEY"}

with open("document.pdf", "rb") as f:
    response = requests.post(
        url,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={
            "output_format": "markdown",
            "mode": "balanced",
        },
        headers=headers
    )

data = response.json()
check_url = data["request_check_url"]

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| file | file | - | Document file (multipart upload) |
| file_url | string | - | URL to document (alternative to file upload) |
| output_format | string | markdown | Output format: markdown, html, json, chunks |
| mode | string | balanced | Processing mode: fast, balanced, accurate |
| max_pages | int | - | Maximum pages to process |
| page_range | string | - | Specific pages (e.g., "0-5,10", 0-indexed) |
| paginate | bool | false | Add page delimiters to output |
| skip_cache | bool | false | Skip cached results |
| disable_image_extraction | bool | false | Don’t extract images |
| disable_image_captions | bool | false | Don’t generate image captions |
| page_schema | string | - | JSON schema for structured extraction |
| segmentation_schema | string | - | JSON schema for document segmentation |
| save_checkpoint | bool | false | Save checkpoint for reuse |
| extras | string | - | Comma-separated: track_changes, chart_understanding, extract_links |
| add_block_ids | bool | false | Add block IDs to HTML for citations |
| additional_config | string | - | JSON with extra config options |
| webhook_url | string | - | Override webhook URL for this request |
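The `page_range` value is a compact string, so it can help to generate it from a list of page numbers rather than write it by hand. The helper below is a hypothetical sketch (`format_page_range` is not part of the API) and assumes that a range such as `0-5` includes both endpoints, which this page does not state explicitly.

```python
from typing import Iterable, List


def format_page_range(pages: Iterable[int]) -> str:
    """Collapse 0-indexed page numbers into a string such as "0-5,10"."""
    ordered = sorted(set(pages))
    if any(page < 0 for page in ordered):
        raise ValueError("page numbers are 0-indexed and cannot be negative")

    parts: List[str] = []
    start = prev = None
    for page in ordered:
        if start is None:
            start = prev = page
        elif page == prev + 1:
            prev = page  # still inside a consecutive run
        else:
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = page
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)
```

For example, `format_page_range([0, 1, 2, 3, 4, 5, 10])` returns `"0-5,10"`, matching the format shown in the table.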

Processing Modes

| Mode | Description |
| --- | --- |
| fast | Lowest latency, good for simple documents |
| balanced | Balance of speed and accuracy (default) |
| accurate | Highest accuracy, best for complex layouts |

Response

Poll request_check_url until status is complete:
import time

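while True:
    response = requests.get(check_url, headers=headers)
    result = response.json()

    # Stop on either terminal status; "failed" would otherwise loop forever.
    if result["status"] == "complete":
        break
    if result["status"] == "failed":
        raise RuntimeError(result.get("error", "conversion failed"))
    time.sleep(2)

print(result["markdown"])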
Response fields:
| Field | Type | Description |
| --- | --- | --- |
| status | string | processing, complete, or failed |
| success | bool | Whether conversion succeeded |
| markdown | string | Markdown output (if format is markdown) |
| html | string | HTML output (if format is html) |
| json | object | JSON output (if format is json) |
| chunks | object | Chunked output (if format is chunks) |
| images | object | Extracted images as {filename: base64} |
| metadata | object | Document metadata |
| page_count | int | Number of pages processed |
| parse_quality_score | float | Quality score (0-5) |
| cost_breakdown | object | Cost in cents |
| extraction_schema_json | string | Extracted data (if page_schema provided) |
| segmentation_results | object | Segmentation results (if schema provided) |
| error | string | Error message if failed |
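Because `images` arrives as a `{filename: base64}` mapping, the image bytes have to be decoded before they can be saved next to the Markdown output. A minimal sketch, where `save_images` is a hypothetical helper name and the output directory is up to the caller:

```python
import base64
from pathlib import Path
from typing import Dict, List


def save_images(images: Dict[str, str], out_dir: str) -> List[Path]:
    """Decode a {filename: base64} mapping and write each image into out_dir."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, encoded in images.items():
        # Keep only the final path component so a name cannot escape out_dir.
        path = target / Path(filename).name
        path.write_bytes(base64.b64decode(encoded))
        written.append(path)
    return written
```

A call such as `save_images(result.get("images", {}), "output")` would write whatever images the completed response contains.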

Structured Extraction

Extract structured data from documents using a JSON schema. This is a feature of the Marker endpoint.
import json

schema = {
    "invoice_number": {"type": "string", "description": "Invoice ID"},
    "total": {"type": "number", "description": "Total amount"},
    "line_items": {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "description": {"type": "string"},
                "amount": {"type": "number"}
            }
        }
    }
}

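# Reuses `requests` and `headers` from the conversion example above.
with open("invoice.pdf", "rb") as f:  # close the file once the upload finishes
    response = requests.post(
        "https://www.datalab.to/api/v1/marker",
        files={"file": ("invoice.pdf", f, "application/pdf")},
        data={
            "page_schema": json.dumps(schema),
            "mode": "balanced"
        },
        headers=headers
    )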
The extracted data is returned in extraction_schema_json. See Structured Extraction for detailed examples.
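Since the response table lists `extraction_schema_json` as a string, it likely needs to be parsed before use. The following sketch assumes the string holds JSON; `parse_extraction` is a hypothetical helper, and it also accepts an already-decoded object in case the field arrives that way.

```python
import json
from typing import Any, Dict


def parse_extraction(result: Dict[str, Any]) -> Any:
    """Return the extracted data from a completed Marker response."""
    raw = result.get("extraction_schema_json")
    if not raw:
        raise KeyError("response has no extraction_schema_json; was page_schema sent?")
    # Documented as a string; tolerate an already-decoded object just in case.
    return json.loads(raw) if isinstance(raw, str) else raw
```

Call it on the completed polling result, for example `data = parse_extraction(result)`.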

Form Filling

Fill forms in PDFs and images. Endpoint: POST /api/v1/fill

Request

import json

field_data = {
    "full_name": {"value": "John Doe", "description": "Full legal name"},
    "date": {"value": "2024-01-15", "description": "Today's date"},
    "signature": {"value": "John Doe", "description": "Signature field"}
}

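# Reuses `requests` and `headers` from the conversion example above.
with open("form.pdf", "rb") as f:  # close the file once the upload finishes
    response = requests.post(
        "https://www.datalab.to/api/v1/fill",
        files={"file": ("form.pdf", f, "application/pdf")},
        data={
            "field_data": json.dumps(field_data),
            "confidence_threshold": "0.5"
        },
        headers=headers
    )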

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| file | file | - | Form file (PDF or image) |
| file_url | string | - | URL to form |
| field_data | string | - | JSON mapping field names to values |
| context | string | - | Additional context for field matching |
| confidence_threshold | float | 0.5 | Minimum confidence for matching (0-1) |
| page_range | string | - | Specific pages to process |
| skip_cache | bool | false | Skip cached results |

Field Data Format

{
  "field_key": {
    "value": "The value to fill",
    "description": "Description to help match the field"
  }
}

Response

| Field | Type | Description |
| --- | --- | --- |
| status | string | Processing status |
| success | bool | Whether filling succeeded |
| output_format | string | pdf or png |
| output_base64 | string | Base64-encoded filled form |
| fields_filled | array | Successfully filled field names |
| fields_not_found | array | Unmatched field names |
| page_count | int | Pages processed |
| cost_breakdown | object | Cost details |
See Form Filling for more examples.
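The filled form comes back base64-encoded, so it has to be decoded and written with an extension that matches `output_format`. A minimal sketch, assuming the completed response has the fields listed above; `save_filled_form` and its `stem` parameter are hypothetical names.

```python
import base64
from pathlib import Path
from typing import Any, Dict


def save_filled_form(result: Dict[str, Any], stem: str = "filled_form") -> Path:
    """Write output_base64 to <stem>.<output_format> and return the path."""
    if not result.get("success"):
        raise RuntimeError("form filling did not succeed")
    extension = result["output_format"]  # documented as "pdf" or "png"
    if extension not in ("pdf", "png"):
        raise ValueError(f"unexpected output_format: {extension!r}")
    path = Path(f"{stem}.{extension}")
    path.write_bytes(base64.b64decode(result["output_base64"]))
    return path
```

Checking `fields_not_found` after saving is a reasonable way to spot keys in `field_data` that could not be matched.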

File Management

Upload and manage files for use in workflows.

Upload File

Step 1: Request an upload URL
POST /api/v1/files/upload
Content-Type: application/json

{
  "filename": "document.pdf",
  "content_type": "application/pdf"
}
Response:
{
  "file_id": 123,
  "upload_url": "https://...",
  "reference": "datalab://file-abc123"
}
Step 2: Upload directly to the presigned URL
PUT {upload_url}
Content-Type: application/pdf

<file contents>
Step 3: Confirm upload
GET /api/v1/files/{file_id}/confirm
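The three upload steps can be chained in one function. The sketch below is illustrative rather than an official client: `upload_file` is a hypothetical helper, `session` is expected to behave like a `requests.Session`, and it assumes the presigned URL in step 2 needs only the `Content-Type` header and no API key.

```python
import os
from typing import Any, Dict

BASE_URL = "https://www.datalab.to/api/v1"


def upload_file(session: Any, api_key: str, path: str, content_type: str) -> Dict[str, Any]:
    """Run the three upload steps and return the response body from step 1."""
    auth = {"X-API-Key": api_key}

    # Step 1: request a presigned upload URL.
    created = session.post(
        f"{BASE_URL}/files/upload",
        json={"filename": os.path.basename(path), "content_type": content_type},
        headers=auth,
    )
    created.raise_for_status()
    info = created.json()

    # Step 2: PUT the file bytes to the presigned URL (assumed to need no API key).
    with open(path, "rb") as f:
        uploaded = session.put(
            info["upload_url"], data=f, headers={"Content-Type": content_type}
        )
    uploaded.raise_for_status()

    # Step 3: confirm the upload so the file becomes usable.
    confirmed = session.get(f"{BASE_URL}/files/{info['file_id']}/confirm", headers=auth)
    confirmed.raise_for_status()
    return info
```

Passing `requests.Session()` as `session` reuses one connection pool across the calls; the returned dict includes the `reference` value from step 1.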

List Files

GET /api/v1/files?limit=50&offset=0

Get File Metadata

GET /api/v1/files/{file_id}

Get Download URL

GET /api/v1/files/{file_id}/download?expires_in=3600

Delete File

DELETE /api/v1/files/{file_id}
See File Management for detailed examples.

Webhooks

Configure webhooks to receive notifications when processing completes instead of polling. Set a default webhook URL in your account settings, or override per-request with the webhook_url parameter. See Webhooks for configuration details.

Rate Limits

Default rate limits apply per API key. If you exceed limits, you’ll receive a 429 response. See Rate Limits for details and how to request higher limits.
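A common way to cope with 429 responses is to retry with exponential backoff. The sketch below shows that generic technique and is not a documented Datalab recommendation: `send_with_backoff`, the delay schedule, and the attempt limit are all assumptions to tune for your workload.

```python
import time
from typing import Any, Callable


def send_with_backoff(
    send: Callable[[], Any],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call send() and retry with exponential backoff while it returns HTTP 429."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    return response  # still 429 after the final attempt; the caller decides what to do
```

`send` should build the request from scratch on each call, for example by reopening the file inside the callable, because an upload stream that was already consumed cannot be resent as-is.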
