Datalab provides REST APIs for document conversion, structured extraction, form filling, and file management. All APIs use the same authentication and follow similar patterns.
For the simplest integration, use the Python SDK. The SDK handles authentication and polling, and returns typed responses.
Authentication
All requests require an API key in the X-API-Key header:
curl -X POST https://www.datalab.to/api/v1/convert \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@document.pdf"
Get your API key from the API Keys dashboard.
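To keep the key out of source code, the header can be built from an environment variable. The following is a minimal sketch; the variable name DATALAB_API_KEY and the helper build_headers are assumptions made for illustration, not part of the Datalab API.

```python
import os


def build_headers(env_var: str = "DATALAB_API_KEY") -> dict:
    """Return the auth header dict, reading the key from an environment variable.

    The variable name is an assumption for this sketch; use whatever your
    deployment provides.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key")
    return {"X-API-Key": api_key}
```

The returned dict can be passed as the headers argument of any HTTP client call.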
Request Pattern
All processing endpoints follow this pattern:
- Submit a document for processing (returns immediately with a request_id)
- Poll the status endpoint until processing completes
- Retrieve results from the completed response
Submit Request
POST /api/v1/{endpoint}
Response:
{
  "success": true,
  "request_id": "abc123",
  "request_check_url": "https://www.datalab.to/api/v1/{endpoint}/abc123"
}
Poll for Results
GET /api/v1/{endpoint}/{request_id}
Response while processing:
{
  "status": "processing"
}
Response when complete:
{
  "status": "complete",
  "success": true,
  ...results...
}
Results are deleted from Datalab servers one hour after processing completes. Retrieve your results promptly.
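The submit-then-poll pattern can be wrapped in a small helper with a timeout. This is a sketch under stated assumptions: poll_until_done and its parameters are hypothetical names, the caller supplies a function that performs the GET, and the helper treats a "failed" status (listed in the conversion response fields) as an error.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], dict],
    interval: float = 2.0,
    timeout: float = 300.0,
) -> dict:
    """Call fetch_status() until the response leaves the "processing" state."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch_status()
        status = result.get("status")
        if status == "complete":
            return result
        if status == "failed":
            # Surface the server's error message if one was returned.
            raise RuntimeError(result.get("error", "processing failed"))
        if time.monotonic() >= deadline:
            raise TimeoutError("gave up waiting for the request to complete")
        time.sleep(interval)


# Usage with the requests library (check_url comes from the submit response):
# result = poll_until_done(lambda: requests.get(check_url, headers=headers).json())
```

Passing the fetch step as a callable keeps the helper independent of any particular HTTP client.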
Document Conversion
Convert documents to Markdown, HTML, JSON, or chunks.
Endpoint: POST /api/v1/convert
Request
import requests

url = "https://www.datalab.to/api/v1/convert"
headers = {"X-API-Key": "YOUR_API_KEY"}

# Submit the document as a multipart upload; the call returns immediately.
with open("document.pdf", "rb") as f:
    response = requests.post(
        url,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={
            "output_format": "markdown",
            "mode": "balanced",
        },
        headers=headers,
    )

data = response.json()
check_url = data["request_check_url"]  # poll this URL for the result
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | - | Document file (multipart upload) |
| file_url | string | - | URL to document (alternative to file upload) |
| output_format | string | markdown | Output format: markdown, html, json, chunks |
| mode | string | balanced | Processing mode: fast, balanced, accurate |
| max_pages | int | - | Maximum pages to process |
| page_range | string | - | Specific pages (e.g., "0-5,10", 0-indexed) |
| paginate | bool | false | Add page delimiters to output |
| skip_cache | bool | false | Skip cached results |
| disable_image_extraction | bool | false | Don’t extract images |
| disable_image_captions | bool | false | Don’t generate image captions |
| save_checkpoint | bool | false | Save checkpoint for reuse |
| extras | string | - | Comma-separated: track_changes, chart_understanding, extract_links, table_row_bboxes, infographic, new_block_types |
| add_block_ids | bool | false | Add block IDs to HTML for citations |
| additional_config | string | - | JSON with extra config options |
| webhook_url | string | - | Override webhook URL for this request |
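To make the page_range format concrete, the sketch below expands a string such as "0-5,10" into the individual 0-indexed page numbers on the client side. The helper name expand_page_range is hypothetical, and treating both ends of a range as inclusive is an assumption that the parameter description does not confirm.

```python
def expand_page_range(page_range: str) -> list[int]:
    """Expand a string like "0-5,10" into a sorted list of 0-indexed pages.

    Assumes ranges are inclusive on both ends.
    """
    pages = set()
    for part in page_range.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)
```

A helper like this is mainly useful for validating user input before sending a request.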
Processing Modes
| Mode | Description |
|---|---|
| fast | Lowest latency, good for simple documents |
| balanced | Balance of speed and accuracy (default) |
| accurate | Highest accuracy, best for complex layouts |
Response
Poll request_check_url until status is complete:
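import time

# Poll until the request leaves the "processing" state.
while True:
    response = requests.get(check_url, headers=headers)
    result = response.json()
    if result["status"] == "complete":
        break
    if result["status"] == "failed":
        # Stop instead of looping forever on a failed request.
        raise RuntimeError(result.get("error", "conversion failed"))
    time.sleep(2)

print(result["markdown"])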
Response fields:
| Field | Type | Description |
|---|---|---|
| status | string | processing, complete, or failed |
| success | bool | Whether conversion succeeded |
| markdown | string | Markdown output (if format is markdown) |
| html | string | HTML output (if format is html) |
| json | object | JSON output (if format is json) |
| chunks | object | Chunked output (if format is chunks) |
| images | object | Extracted images as {filename: base64} |
| metadata | object | Document metadata |
| page_count | int | Number of pages processed |
| parse_quality_score | float | Quality score (0-5) |
| cost_breakdown | object | Cost in cents |
| error | string | Error message if failed |
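Because images arrive as a {filename: base64} mapping, they need to be decoded before they can be opened. The following sketch writes each image to a local directory; the function name save_images and the output directory are hypothetical.

```python
import base64
from pathlib import Path


def save_images(images: dict, out_dir: str) -> list:
    """Decode the {filename: base64} mapping and write each image to out_dir."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, encoded in images.items():
        # Keep only the final path component so a filename cannot escape out_dir.
        path = target / Path(filename).name
        path.write_bytes(base64.b64decode(encoded))
        written.append(path)
    return written


# Usage with a completed poll response:
# save_images(result.get("images", {}), "extracted_images")
```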
Structured Extraction
Extract structured data from documents using a JSON schema.
Endpoint: POST /api/v1/extract
Request
import json

import requests

headers = {"X-API-Key": "YOUR_API_KEY"}

# Fields to extract, described as a JSON schema.
schema = {
    "invoice_number": {"type": "string", "description": "Invoice ID"},
    "total": {"type": "number", "description": "Total amount"},
    "line_items": {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "description": {"type": "string"},
                "amount": {"type": "number"},
            },
        },
    },
}

# Open the file in a context manager so the handle is closed after the upload.
with open("invoice.pdf", "rb") as f:
    response = requests.post(
        "https://www.datalab.to/api/v1/extract",
        files={"file": ("invoice.pdf", f, "application/pdf")},
        data={
            "page_schema": json.dumps(schema),
            "mode": "balanced",
        },
        headers=headers,
    )

data = response.json()
check_url = data["request_check_url"]
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | - | Document file (multipart upload) |
| file_url | string | - | URL to document (alternative to file upload) |
| page_schema | string | required | JSON schema defining the data to extract |
| checkpoint_id | string | - | Reuse a previously saved checkpoint |
| mode | string | balanced | Processing mode: fast, balanced, accurate |
| output_format | string | markdown | Output format: markdown, html, json |
| max_pages | int | - | Maximum pages to process |
| page_range | string | - | Specific pages (e.g., "0-5,10", 0-indexed) |
The extracted data is returned in extraction_schema_json in the poll response.
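Reading the extracted data out of the poll response might look like the sketch below. Whether extraction_schema_json arrives as a JSON-encoded string or as an already parsed object is not stated here, so the hypothetical helper parse_extraction accepts both.

```python
import json


def parse_extraction(result: dict) -> dict:
    """Return the extracted data from a completed poll response.

    Handles both a JSON-encoded string and an already parsed object, since
    the exact type of the field is an assumption in this sketch.
    """
    raw = result.get("extraction_schema_json")
    if raw is None:
        raise KeyError("extraction_schema_json missing from response")
    return json.loads(raw) if isinstance(raw, str) else raw
```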
See Structured Extraction for detailed examples.
Document Segmentation
Segment documents into structured sections using a JSON schema.
Endpoint: POST /api/v1/segment
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | - | Document file (multipart upload) |
| file_url | string | - | URL to document (alternative to file upload) |
| segmentation_schema | string | required | JSON schema defining the segments to extract |
| checkpoint_id | string | - | Reuse a previously saved checkpoint |
| mode | string | balanced | Processing mode: fast, balanced, accurate |
See Document Segmentation for detailed examples.
Track Changes
Extract tracked changes (insertions and deletions) from DOCX files.
Endpoint: POST /api/v1/track-changes
docx_mime = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
with open("document.docx", "rb") as f:
    response = requests.post(
        "https://www.datalab.to/api/v1/track-changes",
        files={"file": ("document.docx", f, docx_mime)},
        headers=headers,  # same X-API-Key header as in the earlier examples
    )
See Track Changes for detailed examples.
Custom Pipeline
This feature is currently in beta. The API may change.
Execute custom AI-powered pipelines generated from natural language descriptions.
Endpoint: POST /api/v1/custom-pipeline
See Document Conversion for more details on processing modes and options.
Form Filling
Fill forms in PDFs and images.
Endpoint: POST /api/v1/fill
Request
import json

import requests

# Map each form field to the value to fill and a description that helps matching.
field_data = {
    "full_name": {"value": "John Doe", "description": "Full legal name"},
    "date": {"value": "2024-01-15", "description": "Today's date"},
    "signature": {"value": "John Doe", "description": "Signature field"},
}

with open("form.pdf", "rb") as f:
    response = requests.post(
        "https://www.datalab.to/api/v1/fill",
        files={"file": ("form.pdf", f, "application/pdf")},
        data={
            "field_data": json.dumps(field_data),
            "confidence_threshold": "0.5",
        },
        headers=headers,  # same X-API-Key header as in the earlier examples
    )
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | - | Form file (PDF or image) |
| file_url | string | - | URL to form |
| field_data | string | - | JSON mapping field names to values |
| context | string | - | Additional context for field matching |
| confidence_threshold | float | 0.5 | Minimum confidence for matching (0-1) |
| page_range | string | - | Specific pages to process |
| skip_cache | bool | false | Skip cached results |
Each entry in field_data has this shape:
{
  "field_key": {
    "value": "The value to fill",
    "description": "Description to help match the field"
  }
}
Response
| Field | Type | Description |
|---|---|---|
| status | string | Processing status |
| success | bool | Whether filling succeeded |
| output_format | string | pdf or png |
| output_base64 | string | Base64-encoded filled form |
| fields_filled | array | Successfully filled field names |
| fields_not_found | array | Unmatched field names |
| page_count | int | Pages processed |
| cost_breakdown | object | Cost details |
See Form Filling for more examples.
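The filled form comes back base64-encoded, so a completed response still has to be decoded and written to disk. The sketch below picks the file extension from output_format; the function name save_filled_form and the default file stem are hypothetical.

```python
import base64
from pathlib import Path


def save_filled_form(result: dict, stem: str = "filled_form") -> Path:
    """Decode output_base64 and write it to <stem>.<output_format>."""
    if not result.get("success"):
        raise RuntimeError("form filling did not succeed")
    extension = result["output_format"]  # "pdf" or "png" per the response table
    path = Path(f"{stem}.{extension}")
    path.write_bytes(base64.b64decode(result["output_base64"]))
    return path
```

Checking fields_not_found before saving is a reasonable extra step when every field is expected to match.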
File Management
Upload and manage files for use in workflows.
Upload File
Step 1: Request an upload URL
POST /api/v1/files/upload
Content-Type: application/json
{
  "filename": "document.pdf",
  "content_type": "application/pdf"
}
Response:
{
  "file_id": 123,
  "upload_url": "https://...",
  "reference": "datalab://file-abc123"
}
Step 2: Upload directly to the presigned URL
PUT {upload_url}
Content-Type: application/pdf
<file contents>
Step 3: Confirm upload
GET /api/v1/files/{file_id}/confirm
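The three upload steps can be chained as in the sketch below. The function upload_file and its http parameter are hypothetical: http is any object with requests-style post, put, and get methods (the requests module itself fits). The API key header is sent only to Datalab, not to the presigned URL.

```python
from pathlib import Path

BASE_URL = "https://www.datalab.to/api/v1"


def upload_file(http, headers: dict, path: str, content_type: str = "application/pdf") -> dict:
    """Run the three upload steps and return the step 1 response body."""
    file_path = Path(path)

    # Step 1: request a presigned upload URL.
    resp = http.post(
        f"{BASE_URL}/files/upload",
        json={"filename": file_path.name, "content_type": content_type},
        headers=headers,
    )
    resp.raise_for_status()
    slot = resp.json()

    # Step 2: PUT the raw bytes to the presigned URL (no API key header here).
    put = http.put(
        slot["upload_url"],
        data=file_path.read_bytes(),
        headers={"Content-Type": content_type},
    )
    put.raise_for_status()

    # Step 3: confirm the upload.
    confirm = http.get(f"{BASE_URL}/files/{slot['file_id']}/confirm", headers=headers)
    confirm.raise_for_status()
    return slot


# Usage with the requests library:
# import requests
# info = upload_file(requests, {"X-API-Key": "YOUR_API_KEY"}, "document.pdf")
```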
List Files
GET /api/v1/files?limit=50&offset=0
Get File
GET /api/v1/files/{file_id}
Get Download URL
GET /api/v1/files/{file_id}/download?expires_in=3600
Delete File
DELETE /api/v1/files/{file_id}
See File Management for detailed examples.
Thumbnails
Generate page thumbnails from a previously processed document:
GET /api/v1/thumbnails/{lookup_key}?thumb_width=300&page_range=0-2
| Parameter | Type | Default | Description |
|---|---|---|---|
| lookup_key | string | Required | The request ID from a previous conversion |
| thumb_width | int | 300 | Thumbnail width in pixels |
| page_range | string | All pages | Pages to generate (e.g., "0,2-4") |
Response:
{
  "success": true,
  "thumbnails": ["base64_encoded_jpg_1", "base64_encoded_jpg_2"]
}
Thumbnails are returned as base64-encoded JPG images.
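A client might build the thumbnail URL and decode the returned list as sketched below. The helper names thumbnail_url and decode_thumbnails are hypothetical; the query parameters are the ones documented for this endpoint.

```python
import base64
from typing import Optional
from urllib.parse import quote, urlencode

BASE_URL = "https://www.datalab.to/api/v1"


def thumbnail_url(lookup_key: str, thumb_width: int = 300, page_range: Optional[str] = None) -> str:
    """Build the GET URL for the thumbnails endpoint."""
    params = {"thumb_width": thumb_width}
    if page_range is not None:
        params["page_range"] = page_range
    return f"{BASE_URL}/thumbnails/{quote(lookup_key, safe='')}?{urlencode(params)}"


def decode_thumbnails(response_json: dict) -> list:
    """Return the thumbnails as a list of raw JPG bytes, in response order."""
    return [base64.b64decode(item) for item in response_json["thumbnails"]]
```

Each decoded item can then be written to a .jpg file or handed to an image library.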
Create Document
Generate DOCX files from markdown with track changes support:
POST /api/v1/create-document
Content-Type: application/json
{
  "markdown": "# Title\n\nThis is <ins data-revision-author=\"Editor\">newly added</ins> text.",
  "output_format": "docx"
}
See Create Document for detailed examples.
Webhooks
Configure webhooks to receive notifications when processing completes instead of polling.
Set a default webhook URL in your account settings, or override per-request with the webhook_url parameter.
See Webhooks for configuration details.
Rate Limits
Default rate limits apply per API key. If you exceed limits, you’ll receive a 429 response.
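One common way to handle a 429 is to retry with exponential backoff. The sketch below is a generic client-side pattern, not a documented Datalab recommendation; send_with_backoff, the attempt count, and the delays are assumptions chosen for illustration.

```python
import time
from typing import Callable


def send_with_backoff(
    send: Callable[[], object],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
):
    """Call send() and retry on HTTP 429, doubling the wait after each attempt.

    send must return an object with a status_code attribute, such as a
    requests.Response.
    """
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")


# Usage with the requests library:
# resp = send_with_backoff(lambda: requests.get(check_url, headers=headers))
```

When the request uploads a file, the send callable should reopen the file on each attempt so every retry sends the full contents.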
See Rate Limits for details and how to request higher limits.