Datalab provides REST APIs for document conversion, structured extraction, form filling, and file management. All APIs use the same authentication and follow similar patterns.
For the simplest integration, use the Python SDK. The SDK handles authentication and polling, and it returns typed responses.
Authentication
All requests require an API key in the X-API-Key header:
curl -X POST https://www.datalab.to/api/v1/convert \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@document.pdf"
Get your API key from the API Keys dashboard.
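As a minimal sketch of the same header in Python, the helper below builds (but does not send) an authenticated request using only the standard library. The function name `build_request` and the `API_BASE` constant are illustrative assumptions, not part of any Datalab SDK.

```python
import urllib.request

API_BASE = "https://www.datalab.to/api/v1"


def build_request(path: str, api_key: str, method: str = "GET") -> urllib.request.Request:
    """Return a Request that carries the X-API-Key header. Nothing is sent here."""
    url = f"{API_BASE}/{path.lstrip('/')}"
    return urllib.request.Request(url, headers={"X-API-Key": api_key}, method=method)


# Build a status request for a hypothetical request ID.
req = build_request("convert/abc123", "YOUR_API_KEY")
print(req.full_url)
```

Passing the result to `urllib.request.urlopen` would send it; keeping construction separate makes the header easy to inspect before any network call.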
Request Pattern
All processing endpoints follow this pattern:
1. Submit a document for processing (returns immediately with a request_id)
2. Poll the status endpoint until processing completes
3. Retrieve results from the completed response
Submit Request
POST /api/v1/{endpoint}
Response:
{
  "success": true,
  "request_id": "abc123",
  "request_check_url": "https://www.datalab.to/api/v1/{endpoint}/abc123"
}
Poll for Results
GET /api/v1/{endpoint}/{request_id}
Response while processing:
{
  "status": "processing"
}
Response when complete:
{
  "status": "complete",
  "success": true,
  ...results...
}
Results are deleted from Datalab servers one hour after processing completes. Retrieve your results promptly.
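The submit-then-poll pattern above can be sketched as a small reusable helper. This is an illustration under stated assumptions: `poll_until_done` is a hypothetical name, the `fetch` callable stands in for whatever performs the GET on the status URL, and the interval and timeout defaults are arbitrary choices rather than documented values.

```python
import time
from typing import Callable


def poll_until_done(
    fetch: Callable[[], dict],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until the status is no longer "processing", or time out."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        # Any status other than "processing" is treated as terminal.
        if result.get("status") != "processing":
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("processing did not finish before the timeout")
        sleep(interval)


# Demo with canned responses instead of real HTTP calls.
responses = iter([
    {"status": "processing"},
    {"status": "processing"},
    {"status": "complete", "success": True},
])
final = poll_until_done(lambda: next(responses), sleep=lambda _: None)
print(final["status"])
```

In real use, `fetch` would wrap a GET on `request_check_url` with the `X-API-Key` header. Injecting it keeps the loop testable without network access.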
Document Conversion
Convert documents to Markdown, HTML, JSON, or chunks.
Endpoint: POST /api/v1/convert
Request
import requests

url = "https://www.datalab.to/api/v1/convert"
headers = {"X-API-Key": "YOUR_API_KEY"}

# Submit the document; the response returns immediately with a polling URL.
with open("document.pdf", "rb") as f:
    response = requests.post(
        url,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={
            "output_format": "markdown",
            "mode": "balanced",
        },
        headers=headers,
    )

data = response.json()
check_url = data["request_check_url"]
Parameters
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| file | file | - | Document file (multipart upload) |
| file_url | string | - | URL to document (alternative to file upload) |
| output_format | string | markdown | Output format: markdown, html, json, chunks |
| mode | string | fast | Processing mode: fast, balanced, accurate |
| max_pages | int | - | Maximum pages to process |
| page_range | string | - | Specific pages (e.g., "0-5,10", 0-indexed). For spreadsheets, filters by sheet index. |
| paginate | bool | false | Add page delimiters to output |
| skip_cache | bool | false | Skip cached results |
| disable_image_extraction | bool | false | Don’t extract images |
| disable_image_captions | bool | false | Don’t generate image captions |
| save_checkpoint | bool | false | Save checkpoint for reuse |
| extras | string | - | Comma-separated: track_changes, chart_understanding, extract_links, table_row_bboxes, infographic, new_block_types |
| add_block_ids | bool | false | Add block IDs to HTML for citations |
| include_markdown_in_chunks | bool | false | Include markdown content in chunks output |
| token_efficient_markdown | bool | false | Optimize markdown for LLM token efficiency |
| fence_synthetic_captions | bool | false | Wrap synthetic image captions in HTML comments |
| additional_config | string | - | JSON with extra config options |
| webhook_url | string | - | Override webhook URL for this request |
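To make the page_range syntax concrete, here is a client-side sketch that expands a value such as "0-5,10" into explicit 0-indexed page numbers. `parse_page_range` is a hypothetical helper, and treating both ends of a range as inclusive is an assumption; the server's own parsing may differ.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a page_range string like "0-5,10" into sorted 0-indexed pages.

    Assumes ranges are inclusive on both ends.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if end < start:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_range("0-5,10"))
```

A check like this can catch malformed ranges before a request is submitted.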
Processing Modes
| Mode | Description |
| --- | --- |
| fast | Lowest latency, good for simple documents (default) |
| balanced | Balance of speed and accuracy |
| accurate | Highest accuracy, best for complex layouts |
Response
Poll request_check_url until status is complete:
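import time

# Poll until the request reaches a terminal status (complete or failed).
while True:
    response = requests.get(check_url, headers=headers)
    result = response.json()
    if result["status"] in ("complete", "failed"):
        break
    time.sleep(2)

if result["status"] == "failed":
    raise RuntimeError(result.get("error", "conversion failed"))
print(result["markdown"])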
Response fields:
| Field | Type | Description |
| --- | --- | --- |
| status | string | processing, complete, or failed |
| success | bool | Whether conversion succeeded |
| markdown | string | Markdown output (if format is markdown) |
| html | string | HTML output (if format is html) |
| json | object | JSON output (if format is json) |
| chunks | object | Chunked output (if format is chunks) |
| images | object | Extracted images as {filename: base64} |
| metadata | object | Document metadata |
| page_count | int | Number of pages processed |
| parse_quality_score | float | Quality score (0-5) |
| cost_breakdown | object | Cost in cents |
| error | string | Error message if failed |
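Because images arrive as a {filename: base64} mapping, writing them to disk takes one decoding step. The sketch below shows one way to do that; `save_images` is a hypothetical helper, and the sample payload is made up for the demo.

```python
import base64
import tempfile
from pathlib import Path


def save_images(result: dict, out_dir: str) -> list[Path]:
    """Decode each base64 image in result["images"] and write it under out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, encoded in (result.get("images") or {}).items():
        # Keep only the final path component so files stay inside out_dir.
        target = out / Path(filename).name
        target.write_bytes(base64.b64decode(encoded))
        written.append(target)
    return written


# Demo with a made-up payload written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    sample = {"images": {"page_0.png": base64.b64encode(b"demo").decode()}}
    paths = save_images(sample, tmp)
    saved_bytes = paths[0].read_bytes()
print(paths[0].name)
```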
Structured Extraction
Extract structured data from documents using a JSON schema.
Endpoint: POST /api/v1/extract
Request
import json
import requests

headers = {"X-API-Key": "YOUR_API_KEY"}

# Each key describes one field to extract.
schema = {
    "invoice_number": {"type": "string", "description": "Invoice ID"},
    "total": {"type": "number", "description": "Total amount"},
    "line_items": {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "description": {"type": "string"},
                "amount": {"type": "number"},
            },
        },
    },
}

# page_schema is sent as a JSON-encoded string; the file is closed after upload.
with open("invoice.pdf", "rb") as f:
    response = requests.post(
        "https://www.datalab.to/api/v1/extract",
        files={"file": ("invoice.pdf", f, "application/pdf")},
        data={
            "page_schema": json.dumps(schema),
            "mode": "balanced",
        },
        headers=headers,
    )

data = response.json()
check_url = data["request_check_url"]
Parameters
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| file | file | - | Document file (multipart upload) |
| file_url | string | - | URL to document (alternative to file upload) |
| page_schema | string | - | JSON schema defining the data to extract. Required unless schema_id is provided. |
| schema_id | string | - | ID of a saved extraction schema (e.g. sch_k8Hx9mP2nQ4v). Mutually exclusive with page_schema. |
| schema_version | int | - | Version of the saved schema to use. Only valid with schema_id; defaults to the latest version. |
| checkpoint_id | string | - | Checkpoint ID from a previous /convert call (with save_checkpoint=true). Skips re-parsing. |
| mode | string | fast | Processing mode: fast, balanced, accurate |
| output_format | string | markdown | Output format: markdown, html, json, chunks |
| max_pages | int | - | Maximum pages to process |
| page_range | string | - | Specific pages (e.g., "0-5,10", 0-indexed). For spreadsheets, filters by sheet index. |
| save_checkpoint | bool | false | Save a checkpoint after processing for reuse with subsequent calls |
| webhook_url | string | - | Override webhook URL for this request |
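The rules linking page_schema, schema_id, and schema_version can be checked on the client before a request is sent. The following is a sketch only: `build_extract_form` is a hypothetical helper, and sending schema_version as a string form value is an assumption about multipart encoding.

```python
import json
from typing import Optional

VALID_MODES = {"fast", "balanced", "accurate"}


def build_extract_form(
    page_schema: Optional[dict] = None,
    schema_id: Optional[str] = None,
    schema_version: Optional[int] = None,
    mode: str = "fast",
) -> dict:
    """Build form fields for /extract, enforcing the documented schema rules."""
    # Exactly one of page_schema / schema_id must be given.
    if (page_schema is None) == (schema_id is None):
        raise ValueError("provide exactly one of page_schema or schema_id")
    if schema_version is not None and schema_id is None:
        raise ValueError("schema_version is only valid with schema_id")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")

    form = {"mode": mode}
    if page_schema is not None:
        form["page_schema"] = json.dumps(page_schema)
    else:
        form["schema_id"] = schema_id
        if schema_version is not None:
            form["schema_version"] = str(schema_version)
    return form


form = build_extract_form(schema_id="sch_k8Hx9mP2nQ4v", schema_version=2)
print(form)
```

The returned dict could be passed as the `data` argument of a multipart POST alongside the file.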
The extracted data is returned in extraction_schema_json in the poll response.
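As a hedged sketch of reading that field: the text above does not say whether extraction_schema_json arrives as a JSON-encoded string or as an already-parsed object, so the hypothetical helper `read_extraction` below accepts either. The sample values are invented for the demo.

```python
import json


def read_extraction(result: dict) -> dict:
    """Return the extracted data from a completed poll response."""
    if result.get("status") != "complete" or not result.get("success", False):
        raise RuntimeError(result.get("error", "extraction did not complete"))
    raw = result["extraction_schema_json"]
    # Accept either a JSON string or an already-decoded object (assumption).
    return json.loads(raw) if isinstance(raw, str) else raw


sample = {
    "status": "complete",
    "success": True,
    "extraction_schema_json": '{"invoice_number": "INV-1", "total": 42.5}',
}
extracted = read_extraction(sample)
print(extracted["invoice_number"])
```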
See Structured Extraction for detailed examples.
Document Segmentation
Segment documents into structured sections using a JSON schema.
Endpoint: POST /api/v1/segment
Parameters
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| file | file | - | Document file (multipart upload) |
| file_url | string | - | URL to document (alternative to file upload) |
| segmentation_schema | string | required | JSON schema defining the segments to extract |
| checkpoint_id | string | - | Checkpoint ID from a previous /convert call (with save_checkpoint=true). Skips re-parsing. |
| mode | string | fast | Processing mode: fast, balanced, accurate |
See Document Segmentation for detailed examples.
Track Changes
Extract tracked changes (insertions and deletions) from DOCX files.
Endpoint: POST /api/v1/track-changes
import requests

headers = {"X-API-Key": "YOUR_API_KEY"}
docx_type = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

with open("document.docx", "rb") as f:
    response = requests.post(
        "https://www.datalab.to/api/v1/track-changes",
        files={"file": ("document.docx", f, docx_type)},
        headers=headers,
    )
See Track Changes for detailed examples.
Custom Processor
This feature is currently in beta. The API may change.
Execute custom AI-powered processors on documents.
Endpoint: POST /api/v1/custom-processor
POST /api/v1/custom-pipeline is deprecated (sunset: September 30, 2026). Migrate to POST /api/v1/custom-processor.
Parameters
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| file | file | - | Document file (multipart upload) |
| file_url | string | - | URL to document |
| pipeline_id | string | required | Custom processor ID (cp_XXXXX) |
| version | int | - | Processor version to run (default: active version) |
| run_eval | bool | false | Run evaluation rules defined for the processor |
| mode | string | fast | Processing mode: fast, balanced, accurate |
| output_format | string | markdown | Output format: markdown, html, json, chunks |
| webhook_url | string | - | URL to POST when complete |
Form Filling
Fill forms in PDFs and images.
Endpoint: POST /api/v1/fill
Request
import json
import requests

headers = {"X-API-Key": "YOUR_API_KEY"}

# Map each field key to the value to fill and a description that helps matching.
field_data = {
    "full_name": {"value": "John Doe", "description": "Full legal name"},
    "date": {"value": "2024-01-15", "description": "Today's date"},
    "signature": {"value": "John Doe", "description": "Signature field"},
}

with open("form.pdf", "rb") as f:
    response = requests.post(
        "https://www.datalab.to/api/v1/fill",
        files={"file": ("form.pdf", f, "application/pdf")},
        data={
            "field_data": json.dumps(field_data),
            "confidence_threshold": "0.5",
        },
        headers=headers,
    )
Parameters
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| file | file | - | Form file (PDF or image) |
| file_url | string | - | URL to form |
| field_data | string | - | JSON mapping field names to values |
| context | string | - | Additional context for field matching |
| confidence_threshold | float | 0.5 | Minimum confidence for matching (0-1) |
| page_range | string | - | Specific pages to process |
| skip_cache | bool | false | Skip cached results |
Each field_data entry maps a field key to a value and a description:
{
  "field_key": {
    "value": "The value to fill",
    "description": "Description to help match the field"
  }
}
Response
| Field | Type | Description |
| --- | --- | --- |
| status | string | Processing status |
| success | bool | Whether filling succeeded |
| output_format | string | pdf or png |
| output_base64 | string | Base64-encoded filled form |
| fields_filled | array | Successfully filled field names |
| fields_not_found | array | Unmatched field names |
| page_count | int | Pages processed |
| cost_breakdown | object | Cost details |
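Since the filled form comes back as base64 in output_base64, with output_format naming the file type, saving it could look like the sketch below. `save_filled_form` is a hypothetical helper, and the sample response is fabricated purely for the demo.

```python
import base64
import tempfile
from pathlib import Path


def save_filled_form(result: dict, stem: str) -> Path:
    """Decode output_base64 and write it to '<stem>.<output_format>'."""
    fmt = result["output_format"]
    if fmt not in {"pdf", "png"}:
        raise ValueError(f"unexpected output_format: {fmt!r}")
    target = Path(f"{stem}.{fmt}")
    target.write_bytes(base64.b64decode(result["output_base64"]))
    return target


# Demo with a made-up response written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    sample = {
        "output_format": "pdf",
        "output_base64": base64.b64encode(b"%PDF-demo").decode(),
    }
    path = save_filled_form(sample, str(Path(tmp) / "filled_form"))
    filled_bytes = path.read_bytes()
print(path.name)
```

Checking fields_not_found before saving is a reasonable extra step when every field must be matched.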
See Form Filling for more examples.
File Management
Upload and manage files for use in pipelines.
Upload File
Step 1: Request an upload URL
POST /api/v1/files/upload
Content-Type: application/json
{
  "filename": "document.pdf",
  "content_type": "application/pdf"
}
Response:
{
  "file_id": 123,
  "upload_url": "https://...",
  "reference": "datalab://file-abc123"
}
Step 2: Upload directly to the presigned URL
PUT {upload_url}
Content-Type: application/pdf
<file contents>
Step 3: Confirm upload
GET /api/v1/files/{file_id}/confirm
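The three upload steps can be sketched end to end as follows. To keep the example runnable without network access, HTTP is abstracted behind a `send` callable; that signature, the `upload_file` name, and the choice to omit the API key on the presigned PUT are all assumptions of this sketch rather than documented behavior.

```python
import json
from typing import Callable, Optional

API_BASE = "https://www.datalab.to/api/v1"

# Hypothetical transport: send(method, url, headers, body) -> parsed JSON dict.
Send = Callable[[str, str, dict, Optional[bytes]], dict]


def upload_file(send: Send, api_key: str, filename: str,
                content_type: str, content: bytes) -> str:
    """Run the three upload steps and return the file reference."""
    auth = {"X-API-Key": api_key}
    # Step 1: request an upload URL.
    slot = send(
        "POST",
        f"{API_BASE}/files/upload",
        {**auth, "Content-Type": "application/json"},
        json.dumps({"filename": filename, "content_type": content_type}).encode(),
    )
    # Step 2: upload the bytes to the presigned URL (no API key sent; assumption).
    send("PUT", slot["upload_url"], {"Content-Type": content_type}, content)
    # Step 3: confirm the upload.
    send("GET", f"{API_BASE}/files/{slot['file_id']}/confirm", auth, None)
    return slot["reference"]


# Demo with a fake transport that records calls instead of using the network.
calls = []


def fake_send(method, url, headers, body):
    calls.append((method, url))
    if url.endswith("/files/upload"):
        return {"file_id": 123, "upload_url": "https://storage.example/put",
                "reference": "datalab://file-abc123"}
    return {}


reference = upload_file(fake_send, "YOUR_API_KEY", "document.pdf",
                        "application/pdf", b"%PDF-")
print(reference)
```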
List Files
GET /api/v1/files?limit=50&offset=0
Get File
GET /api/v1/files/{file_id}
Get Download URL
GET /api/v1/files/{file_id}/download?expires_in=3600
Delete File
DELETE /api/v1/files/{file_id}
See File Management for detailed examples.
Thumbnails
Generate page thumbnails from a previously processed document:
GET /api/v1/thumbnails/{lookup_key}?thumb_width=300&page_range=0-2
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| lookup_key | string | Required | The request ID from a previous conversion |
| thumb_width | int | 300 | Thumbnail width in pixels |
| page_range | string | All pages | Pages to generate (e.g., "0,2-4") |
Response:
{
  "success": true,
  "thumbnails": ["base64_encoded_jpg_1", "base64_encoded_jpg_2"]
}
Thumbnails are returned as base64-encoded JPG images.
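For building the thumbnail request URL, a short sketch using the standard library's URL encoding is shown here. `thumbnail_url` is a hypothetical helper; note that `urlencode` percent-encodes the comma in a value like "0,2-4", which is the standard encoding for a query string.

```python
from typing import Optional
from urllib.parse import quote, urlencode

API_BASE = "https://www.datalab.to/api/v1"


def thumbnail_url(lookup_key: str, thumb_width: int = 300,
                  page_range: Optional[str] = None) -> str:
    """Build the GET URL for the thumbnails endpoint."""
    params = {"thumb_width": thumb_width}
    if page_range is not None:
        params["page_range"] = page_range
    # quote() guards against unexpected characters in the lookup key.
    return f"{API_BASE}/thumbnails/{quote(lookup_key, safe='')}?{urlencode(params)}"


print(thumbnail_url("abc123", page_range="0-2"))
```

Each entry of the returned thumbnails list can then be passed through `base64.b64decode` to get the JPG bytes.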
Create Document
Generate DOCX files from markdown with track changes support:
POST /api/v1/create-document
Content-Type: application/json
{
  "markdown": "# Title\n\nThis is <ins data-revision-author=\"Editor\">newly added</ins> text.",
  "output_format": "docx"
}
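Markdown that carries track-changes markup contains double quotes that must be escaped inside JSON. A minimal sketch of producing the request body with `json.dumps`, which handles that escaping, follows; `create_document_body` is a hypothetical helper name.

```python
import json


def create_document_body(markdown: str, output_format: str = "docx") -> str:
    """Serialize the create-document payload; json.dumps escapes quotes and newlines."""
    return json.dumps({"markdown": markdown, "output_format": output_format})


markdown = '# Title\n\nThis is <ins data-revision-author="Editor">newly added</ins> text.'
body = create_document_body(markdown)
print(body)
```

Building the body this way avoids hand-escaping the attribute quotes in the `<ins>` tag.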
See Create Document for detailed examples.
Webhooks
Configure webhooks to receive notifications when processing completes instead of polling.
Set a default webhook URL in your account settings, or override it per request with the webhook_url parameter.
See Webhooks for configuration details.
Rate Limits
Default rate limits apply per API key. If you exceed limits, you’ll receive a 429 response.
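One common way to handle a 429 is to retry with exponential backoff. The sketch below illustrates that general technique and is not a documented Datalab recommendation: `send_with_backoff`, the attempt count, and the delays are all assumptions, and `send` stands in for any callable that performs the request and returns a (status code, body) pair.

```python
import time
from typing import Callable, Tuple


def send_with_backoff(
    send: Callable[[], Tuple[int, dict]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Retry send() while it returns HTTP 429, doubling the wait each time."""
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return body
        # Wait only if another attempt remains.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("still rate limited after max_attempts tries")


# Demo with canned replies; delays are recorded instead of slept.
replies = iter([(429, {}), (429, {}), (200, {"success": True})])
delays = []
body = send_with_backoff(lambda: next(replies), sleep=delays.append)
print(delays)
```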
See Rate Limits for details and how to request higher limits.
Next Steps
SDK Reference: Use the Python SDK for a simpler integration with typed responses.
Webhooks: Receive notifications when processing completes instead of polling.
API Limits: Understand file size limits, page limits, and rate limiting.
Document Conversion: Detailed guide to converting documents to Markdown, HTML, or JSON.