Datalab provides REST APIs for document conversion, structured extraction, form filling, and file management. All APIs use the same authentication and follow similar patterns.
For the simplest integration, use the Python SDK. The SDK handles authentication and polling, and it returns typed responses.

Authentication

All requests require an API key in the X-API-Key header:
curl -X POST https://www.datalab.to/api/v1/convert \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@document.pdf"
Get your API key from the API Keys dashboard.
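To keep the key out of source code, one option is to read it from an environment variable and build the header once. The sketch below assumes a variable named `DATALAB_API_KEY`; that name and the `auth_headers` helper are illustrative and not something the API defines.

```python
import os


def auth_headers(env_var: str = "DATALAB_API_KEY") -> dict:
    """Build the X-API-Key header from an environment variable.

    The variable name is an assumption for this example, not part of the API.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set {env_var} to your Datalab API key")
    return {"X-API-Key": api_key}
```

The returned dictionary can be passed as `headers=` to any HTTP client call.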

Request Pattern

All processing endpoints follow this pattern:
  1. Submit a document for processing (returns immediately with a request_id)
  2. Poll the status endpoint until processing completes
  3. Retrieve results from the completed response

Submit Request

POST /api/v1/{endpoint}
Response:
{
  "success": true,
  "request_id": "abc123",
  "request_check_url": "https://www.datalab.to/api/v1/{endpoint}/abc123"
}

Poll for Results

GET /api/v1/{endpoint}/{request_id}
Response while processing:
{
  "status": "processing"
}
Response when complete:
{
  "status": "complete",
  "success": true,
  ...results...
}
Results are deleted from Datalab servers one hour after processing completes. Retrieve your results promptly.
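The submit-then-poll pattern can be wrapped in a small helper. This is a minimal sketch, not the SDK's implementation: `poll_until_done`, its `fetch` callable, and the default interval and timeout are assumptions for illustration. It treats `complete` and `failed` as terminal statuses.

```python
import time


def poll_until_done(fetch, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Call fetch() until the job reaches a terminal status.

    fetch is any zero-argument callable returning the decoded JSON body of
    GET /api/v1/{endpoint}/{request_id}. interval and timeout are in seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        status = result.get("status")
        if status == "complete":
            return result
        if status == "failed":
            raise RuntimeError(result.get("error") or "processing failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no result after {timeout} seconds")
        sleep(interval)
```

With the `requests` library, `fetch` could be `lambda: requests.get(check_url, headers=headers).json()`. Passing the fetch step in as a callable keeps the loop independent of any particular HTTP client.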

Document Conversion

Convert documents to Markdown, HTML, JSON, or chunks. Endpoint: POST /api/v1/convert

Request

import requests

url = "https://www.datalab.to/api/v1/convert"
headers = {"X-API-Key": "YOUR_API_KEY"}

with open("document.pdf", "rb") as f:
    response = requests.post(
        url,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={
            "output_format": "markdown",
            "mode": "balanced",
        },
        headers=headers
    )

data = response.json()
check_url = data["request_check_url"]

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | - | Document file (multipart upload) |
| file_url | string | - | URL to document (alternative to file upload) |
| output_format | string | markdown | Output format: markdown, html, json, chunks |
| mode | string | balanced | Processing mode: fast, balanced, accurate |
| max_pages | int | - | Maximum pages to process |
| page_range | string | - | Specific pages (e.g., "0-5,10", 0-indexed) |
| paginate | bool | false | Add page delimiters to output |
| skip_cache | bool | false | Skip cached results |
| disable_image_extraction | bool | false | Don’t extract images |
| disable_image_captions | bool | false | Don’t generate image captions |
| save_checkpoint | bool | false | Save checkpoint for reuse |
| extras | string | - | Comma-separated: track_changes, chart_understanding, extract_links, table_row_bboxes, infographic, new_block_types |
| add_block_ids | bool | false | Add block IDs to HTML for citations |
| additional_config | string | - | JSON with extra config options |
| webhook_url | string | - | Override webhook URL for this request |
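Because these options travel as form fields, values such as `extras` have to be flattened to strings before sending. The helper below is a sketch of one way to assemble the `data` dictionary; `build_convert_data` is a hypothetical name, and the allowed values are copied from the parameter list above.

```python
OUTPUT_FORMATS = {"markdown", "html", "json", "chunks"}
MODES = {"fast", "balanced", "accurate"}


def build_convert_data(output_format="markdown", mode="balanced",
                       extras=(), page_range=None, max_pages=None):
    """Assemble form fields for POST /api/v1/convert.

    Validates the enumerated options locally and joins extras with commas.
    Optional fields are left out when unset so the server defaults apply.
    """
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output_format: {output_format!r}")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")

    data = {"output_format": output_format, "mode": mode}
    if extras:
        data["extras"] = ",".join(extras)
    if page_range is not None:
        data["page_range"] = page_range
    if max_pages is not None:
        data["max_pages"] = str(max_pages)
    return data
```

Validating locally catches a misspelled mode or format before a request is spent on it.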

Processing Modes

| Mode | Description |
|---|---|
| fast | Lowest latency, good for simple documents |
| balanced | Balance of speed and accuracy (default) |
| accurate | Highest accuracy, best for complex layouts |

Response

Poll request_check_url until status is complete:
import time

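while True:
    response = requests.get(check_url, headers=headers)
    result = response.json()

    # Stop on either terminal status so a failed job doesn't loop forever
    if result["status"] in ("complete", "failed"):
        break
    time.sleep(2)

if result["status"] == "failed":
    raise RuntimeError(result.get("error", "conversion failed"))
print(result["markdown"])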
Response fields:
| Field | Type | Description |
|---|---|---|
| status | string | processing, complete, or failed |
| success | bool | Whether conversion succeeded |
| markdown | string | Markdown output (if format is markdown) |
| html | string | HTML output (if format is html) |
| json | object | JSON output (if format is json) |
| chunks | object | Chunked output (if format is chunks) |
| images | object | Extracted images as {filename: base64} |
| metadata | object | Document metadata |
| page_count | int | Number of pages processed |
| parse_quality_score | float | Quality score (0-5) |
| cost_breakdown | object | Cost in cents |
| error | string | Error message if failed |
For structured data extraction, see the Extract endpoint. For document segmentation, see the Segment endpoint.
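The `images` field in the conversion response maps filenames to base64 strings, so saving the images is a matter of decoding each value. This is a minimal sketch, assuming the values are plain base64 with no data-URI prefix (this page doesn't say either way); `save_images` is a hypothetical helper.

```python
import base64
from pathlib import Path


def save_images(images: dict, out_dir: str) -> list:
    """Decode a {filename: base64} mapping and write each image to out_dir.

    Returns the written paths. Only the final path component of each
    filename is used, so a name cannot escape out_dir.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for name, encoded in images.items():
        path = target / Path(name).name
        path.write_bytes(base64.b64decode(encoded))
        written.append(path)
    return written
```

A typical call would be `save_images(result.get("images") or {}, "images")` once polling has returned a completed result.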

Structured Extraction

Extract structured data from documents using a JSON schema. Endpoint: POST /api/v1/extract

Request

import requests
import json

headers = {"X-API-Key": "YOUR_API_KEY"}

schema = {
    "invoice_number": {"type": "string", "description": "Invoice ID"},
    "total": {"type": "number", "description": "Total amount"},
    "line_items": {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "description": {"type": "string"},
                "amount": {"type": "number"}
            }
        }
    }
}

# Open the file in a context manager so the handle is closed after upload
with open("invoice.pdf", "rb") as f:
    response = requests.post(
        "https://www.datalab.to/api/v1/extract",
        files={"file": ("invoice.pdf", f, "application/pdf")},
        data={
            "page_schema": json.dumps(schema),
            "mode": "balanced"
        },
        headers=headers
    )

data = response.json()
check_url = data["request_check_url"]

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | - | Document file (multipart upload) |
| file_url | string | - | URL to document (alternative to file upload) |
| page_schema | string | required | JSON schema defining the data to extract |
| checkpoint_id | string | - | Reuse a previously saved checkpoint |
| mode | string | balanced | Processing mode: fast, balanced, accurate |
| output_format | string | markdown | Output format: markdown, html, json |
| max_pages | int | - | Maximum pages to process |
| page_range | string | - | Specific pages (e.g., "0-5,10", 0-indexed) |
The extracted data is returned in extraction_schema_json in the poll response. See Structured Extraction for detailed examples.
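Reading the extracted data out of the poll response might look like the sketch below. Whether `extraction_schema_json` arrives as a JSON-encoded string or as an already-parsed object isn't stated here, so the hypothetical `get_extraction` helper accepts both.

```python
import json


def get_extraction(result: dict):
    """Return the extracted data from a completed extract response.

    Handles both a JSON-encoded string and an already-decoded object,
    since the exact encoding is an assumption in this example.
    """
    if result.get("status") != "complete":
        raise ValueError(f"result not ready: status={result.get('status')!r}")
    raw = result.get("extraction_schema_json")
    if raw is None:
        raise KeyError("extraction_schema_json missing from response")
    return json.loads(raw) if isinstance(raw, str) else raw
```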

Document Segmentation

Segment documents into structured sections using a JSON schema. Endpoint: POST /api/v1/segment

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | - | Document file (multipart upload) |
| file_url | string | - | URL to document (alternative to file upload) |
| segmentation_schema | string | required | JSON schema defining the segments to extract |
| checkpoint_id | string | - | Reuse a previously saved checkpoint |
| mode | string | balanced | Processing mode: fast, balanced, accurate |
See Document Segmentation for detailed examples.

Track Changes

Extract tracked changes (insertions and deletions) from DOCX files. Endpoint: POST /api/v1/track-changes
import requests

headers = {"X-API-Key": "YOUR_API_KEY"}

with open("document.docx", "rb") as f:
    response = requests.post(
        "https://www.datalab.to/api/v1/track-changes",
        files={"file": ("document.docx", f, "application/vnd.openxmlformats-officedocument.wordprocessingml.document")},
        headers=headers
    )
See Track Changes for detailed examples.

Custom Pipeline

This feature is currently in beta. The API may change.
Execute custom AI-powered pipelines generated from natural language descriptions. Endpoint: POST /api/v1/custom-pipeline
See Document Conversion for more details on processing modes and options.

Form Filling

Fill forms in PDFs and images. Endpoint: POST /api/v1/fill

Request

import json
import requests

headers = {"X-API-Key": "YOUR_API_KEY"}

field_data = {
    "full_name": {"value": "John Doe", "description": "Full legal name"},
    "date": {"value": "2024-01-15", "description": "Today's date"},
    "signature": {"value": "John Doe", "description": "Signature field"}
}

# Open the form in a context manager so the handle is closed after upload
with open("form.pdf", "rb") as f:
    response = requests.post(
        "https://www.datalab.to/api/v1/fill",
        files={"file": ("form.pdf", f, "application/pdf")},
        data={
            "field_data": json.dumps(field_data),
            "confidence_threshold": "0.5"
        },
        headers=headers
    )

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | - | Form file (PDF or image) |
| file_url | string | - | URL to form |
| field_data | string | - | JSON mapping field names to values |
| context | string | - | Additional context for field matching |
| confidence_threshold | float | 0.5 | Minimum confidence for matching (0-1) |
| page_range | string | - | Specific pages to process |
| skip_cache | bool | false | Skip cached results |

Field Data Format

{
  "field_key": {
    "value": "The value to fill",
    "description": "Description to help match the field"
  }
}
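When the values already live in a flat dictionary, the nested structure can be generated rather than written by hand. This is a sketch with a hypothetical `build_field_data` helper. It falls back to the field key (with underscores replaced by spaces) as the description when none is given, which is a choice made for this example rather than API behavior.

```python
import json


def build_field_data(values: dict, descriptions: dict = None) -> str:
    """Turn {key: value} into the JSON string expected by field_data.

    Each entry becomes {"value": ..., "description": ...}.
    """
    descriptions = descriptions or {}
    payload = {
        key: {
            "value": str(value),
            "description": descriptions.get(key, key.replace("_", " ")),
        }
        for key, value in values.items()
    }
    return json.dumps(payload)
```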

Response

| Field | Type | Description |
|---|---|---|
| status | string | Processing status |
| success | bool | Whether filling succeeded |
| output_format | string | pdf or png |
| output_base64 | string | Base64-encoded filled form |
| fields_filled | array | Successfully filled field names |
| fields_not_found | array | Unmatched field names |
| page_count | int | Pages processed |
| cost_breakdown | object | Cost details |
See Form Filling for more examples.
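Saving the filled form means decoding `output_base64` and choosing a file extension from `output_format`. This is a minimal sketch; `save_filled_form` and the default file stem are illustrative names.

```python
import base64
from pathlib import Path


def save_filled_form(result: dict, stem: str = "filled_form") -> Path:
    """Write the filled form from a completed fill response to disk.

    The extension comes from output_format, which the response fields
    list as either pdf or png.
    """
    if not result.get("success"):
        raise RuntimeError(result.get("error") or "form filling did not succeed")
    fmt = result["output_format"]
    if fmt not in ("pdf", "png"):
        raise ValueError(f"unexpected output_format: {fmt!r}")
    path = Path(f"{stem}.{fmt}")
    path.write_bytes(base64.b64decode(result["output_base64"]))
    return path
```

Checking `fields_not_found` before saving is a reasonable extra step when every field must be filled.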

File Management

Upload and manage files for use in workflows.

Upload File

Step 1: Request an upload URL
POST /api/v1/files/upload
Content-Type: application/json

{
  "filename": "document.pdf",
  "content_type": "application/pdf"
}
Response:
{
  "file_id": 123,
  "upload_url": "https://...",
  "reference": "datalab://file-abc123"
}
Step 2: Upload directly to the presigned URL
PUT {upload_url}
Content-Type: application/pdf

<file contents>
Step 3: Confirm upload
GET /api/v1/files/{file_id}/confirm
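The three steps can be chained in one function. This is a sketch under stated assumptions: `upload_file` is a hypothetical helper, `http` is any object with requests-style `post`, `put`, and `get` methods (the `requests` module itself fits), and the API key is deliberately left off the PUT on the assumption that the presigned URL carries its own authorization.

```python
import os

BASE_URL = "https://www.datalab.to"


def upload_file(http, api_key: str, path: str, content_type: str) -> dict:
    """Upload a local file via the presigned-URL flow.

    Returns the step 1 response (file_id, upload_url, reference).
    """
    headers = {"X-API-Key": api_key}

    # Step 1: ask the API for a presigned upload URL.
    resp = http.post(
        f"{BASE_URL}/api/v1/files/upload",
        json={"filename": os.path.basename(path), "content_type": content_type},
        headers=headers,
    )
    resp.raise_for_status()
    info = resp.json()

    # Step 2: send the bytes straight to the presigned URL (no API key).
    with open(path, "rb") as f:
        put = http.put(info["upload_url"], data=f.read(),
                       headers={"Content-Type": content_type})
    put.raise_for_status()

    # Step 3: confirm the upload.
    confirm = http.get(f"{BASE_URL}/api/v1/files/{info['file_id']}/confirm",
                       headers=headers)
    confirm.raise_for_status()
    return info
```

A call might look like `upload_file(requests, "YOUR_API_KEY", "document.pdf", "application/pdf")`. The whole file is read into memory here for simplicity, so large files may call for a streaming upload instead.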

List Files

GET /api/v1/files?limit=50&offset=0

Get File Metadata

GET /api/v1/files/{file_id}

Get Download URL

GET /api/v1/files/{file_id}/download?expires_in=3600

Delete File

DELETE /api/v1/files/{file_id}
See File Management for detailed examples.

Thumbnails

Generate page thumbnails from a previously processed document:
GET /api/v1/thumbnails/{lookup_key}?thumb_width=300&page_range=0-2
| Parameter | Type | Default | Description |
|---|---|---|---|
| lookup_key | string | Required | The request ID from a previous conversion |
| thumb_width | int | 300 | Thumbnail width in pixels |
| page_range | string | All pages | Pages to generate (e.g., "0,2-4") |
Response:
{
  "success": true,
  "thumbnails": ["base64_encoded_jpg_1", "base64_encoded_jpg_2"]
}
Thumbnails are returned as base64-encoded JPG images.
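Building the request URL and decoding the response could look like the sketch below. `thumbnail_url` and `decode_thumbnails` are hypothetical helper names; the query parameters are the ones documented above.

```python
import base64
from urllib.parse import quote, urlencode

BASE_URL = "https://www.datalab.to"


def thumbnail_url(lookup_key: str, thumb_width: int = 300, page_range: str = None) -> str:
    """Build the GET URL for /api/v1/thumbnails/{lookup_key}."""
    params = {"thumb_width": thumb_width}
    if page_range is not None:
        params["page_range"] = page_range  # omit to request all pages
    key = quote(lookup_key, safe="")
    return f"{BASE_URL}/api/v1/thumbnails/{key}?{urlencode(params)}"


def decode_thumbnails(response: dict) -> list:
    """Decode the base64 JPG strings in a thumbnails response to bytes."""
    if not response.get("success"):
        raise RuntimeError("thumbnail request did not succeed")
    return [base64.b64decode(item) for item in response["thumbnails"]]
```

Each decoded item can be written to a `.jpg` file or handed to an image library.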

Create Document

Generate DOCX files from markdown with track changes support:
POST /api/v1/create-document
Content-Type: application/json

{
  "markdown": "# Title\n\nThis is <ins data-revision-author=\"Editor\">newly added</ins> text.",
  "output_format": "docx"
}
See Create Document for detailed examples.
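The request body can also be assembled programmatically. In this sketch, `ins_markup` and `create_document_body` are hypothetical helpers that reproduce the `<ins data-revision-author="...">` markup from the example request. Only insertions are covered because that is the only revision markup shown on this page, and the HTML escaping is a precaution added for the example.

```python
import html
import json


def ins_markup(text: str, author: str) -> str:
    """Wrap text in the tracked-insertion markup used in the example request."""
    safe_author = html.escape(author, quote=True)
    return f'<ins data-revision-author="{safe_author}">{html.escape(text)}</ins>'


def create_document_body(markdown: str, output_format: str = "docx") -> str:
    """Serialize the JSON body for POST /api/v1/create-document."""
    return json.dumps({"markdown": markdown, "output_format": output_format})
```

For example, `create_document_body("# Title\n\nThis is " + ins_markup("newly added", "Editor") + " text.")` yields a body equivalent to the request shown above.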

Webhooks

Configure webhooks to receive notifications when processing completes instead of polling. Set a default webhook URL in your account settings, or override per-request with the webhook_url parameter. See Webhooks for configuration details.

Rate Limits

Default rate limits apply per API key. If you exceed limits, you’ll receive a 429 response. See Rate Limits for details and how to request higher limits.
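A common way to handle 429 responses is to retry with exponential backoff. The sketch below is generic client-side logic, not documented Datalab behavior: `send_with_backoff`, the retry count, and the delays are all assumptions, and `send` is any zero-argument callable returning a response object with a `status_code` attribute.

```python
import time


def send_with_backoff(send, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call send() and retry on HTTP 429 with exponentially growing delays.

    Waits base_delay, 2*base_delay, 4*base_delay, ... seconds between
    attempts and returns the last response once retries are exhausted.
    """
    for attempt in range(max_retries + 1):
        response = send()
        if response.status_code != 429 or attempt == max_retries:
            return response
        sleep(base_delay * (2 ** attempt))
```

With `requests`, a call could be `send_with_backoff(lambda: requests.get(check_url, headers=headers))`. If the upload is a file object, reopen or rewind it inside `send` so each retry sends the full body.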
