Datalab provides REST APIs for document conversion, structured extraction, form filling, and file management. All APIs use the same authentication and follow similar patterns.
For the simplest integration, use the Python SDK. The SDK handles authentication and polling, and provides typed responses.
Authentication
All requests require an API key in the X-API-Key header:
curl -X POST https://www.datalab.to/api/v1/marker \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@document.pdf"
Get your API key from datalab.to/settings.
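To keep the key out of source code, the header can be built from an environment variable. This is a minimal sketch: the variable name DATALAB_API_KEY and the helper auth_headers are illustrative choices, not part of the Datalab API.

```python
import os


def auth_headers(api_key=None):
    """Return the X-API-Key header dict sent with every request.

    DATALAB_API_KEY is a hypothetical environment variable name chosen
    for this example; use whatever your deployment provides.
    """
    key = api_key or os.environ.get("DATALAB_API_KEY")
    if not key:
        raise ValueError("No API key: pass api_key or set DATALAB_API_KEY")
    return {"X-API-Key": key}
```

The returned dict can be passed as the headers argument in the request examples that follow.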
Request Pattern
All processing endpoints follow this pattern:
1. Submit a document for processing (returns immediately with a request_id)
2. Poll the status endpoint until processing completes
3. Retrieve results from the completed response
Submit Request
Response:
{
  "success": true,
  "request_id": "abc123",
  "request_check_url": "https://www.datalab.to/api/v1/{endpoint}/abc123"
}
Poll for Results
GET /api/v1/{endpoint}/{request_id}
Response while processing:
{
  "status": "processing"
}
Response when complete:
{
  "status": "complete",
  "success": true,
  ...results...
}
Results are deleted from Datalab servers one hour after processing completes. Retrieve your results promptly.
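The poll step can be wrapped in a small helper that gives up after a deadline instead of looping indefinitely. The sketch below assumes a requests-style client; the function name wait_for_result and the interval and timeout defaults are illustrative, not values recommended by Datalab.

```python
import time


def wait_for_result(http, check_url, headers, interval=2.0, timeout=300.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll check_url until status leaves "processing" or timeout elapses.

    http is anything with a requests-style get() method, for example the
    requests module or a requests.Session. The final response body is
    returned whether the job completed or failed; the caller inspects it.
    """
    deadline = clock() + timeout
    while True:
        response = http.get(check_url, headers=headers)
        response.raise_for_status()
        result = response.json()
        if result.get("status") != "processing":
            return result
        if clock() >= deadline:
            raise TimeoutError(f"No result from {check_url} after {timeout}s")
        sleep(interval)
```

With the requests library, a call might look like wait_for_result(requests, data["request_check_url"], headers). Check the success field of the returned body before using the output.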
Document Conversion (Marker)
Convert documents to Markdown, HTML, JSON, or chunks.
Endpoint: POST /api/v1/marker
Request
import requests

url = "https://www.datalab.to/api/v1/marker"
headers = {"X-API-Key": "YOUR_API_KEY"}

with open("document.pdf", "rb") as f:
    response = requests.post(
        url,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={
            "output_format": "markdown",
            "mode": "balanced",
        },
        headers=headers,
    )

data = response.json()
check_url = data["request_check_url"]
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | - | Document file (multipart upload) |
| file_url | string | - | URL to document (alternative to file upload) |
| output_format | string | markdown | Output format: markdown, html, json, chunks |
| mode | string | balanced | Processing mode: fast, balanced, accurate |
| max_pages | int | - | Maximum pages to process |
| page_range | string | - | Specific pages (e.g., "0-5,10", 0-indexed) |
| paginate | bool | false | Add page delimiters to output |
| skip_cache | bool | false | Skip cached results |
| disable_image_extraction | bool | false | Don’t extract images |
| disable_image_captions | bool | false | Don’t generate image captions |
| page_schema | string | - | JSON schema for structured extraction |
| segmentation_schema | string | - | JSON schema for document segmentation |
| save_checkpoint | bool | false | Save checkpoint for reuse |
| extras | string | - | Comma-separated: track_changes, chart_understanding, extract_links |
| add_block_ids | bool | false | Add block IDs to HTML for citations |
| additional_config | string | - | JSON with extra config options |
| webhook_url | string | - | Override webhook URL for this request |
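When requests are assembled in several places, it can help to validate the form fields against the documented choices before sending. The sketch below is one possible helper, not part of the SDK: the name marker_form is hypothetical, and sending booleans as lowercase "true"/"false" is an assumption that should be confirmed against the API reference.

```python
OUTPUT_FORMATS = {"markdown", "html", "json", "chunks"}
MODES = {"fast", "balanced", "accurate"}


def marker_form(output_format="markdown", mode="balanced", **options):
    """Build the form fields for POST /api/v1/marker.

    output_format and mode are checked against the values listed above.
    Booleans are encoded as "true"/"false" (an assumption of this sketch).
    """
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"Unknown output_format: {output_format!r}")
    if mode not in MODES:
        raise ValueError(f"Unknown mode: {mode!r}")
    form = {"output_format": output_format, "mode": mode}
    for name, value in options.items():
        if value is None:
            continue  # leave unset parameters at their server defaults
        if isinstance(value, bool):
            value = "true" if value else "false"
        form[name] = str(value)
    return form
```

For example, marker_form(mode="fast", page_range="0-5,10", paginate=True) produces a dict that can be passed as the data argument of requests.post.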
Processing Modes
| Mode | Description |
|---|---|
| fast | Lowest latency, good for simple documents |
| balanced | Balance of speed and accuracy (default) |
| accurate | Highest accuracy, best for complex layouts |
Response
Poll request_check_url until status is complete:
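import time

while True:
    response = requests.get(check_url, headers=headers)
    result = response.json()
    # Stop on either terminal status so a failed job does not loop forever
    if result["status"] in ("complete", "failed"):
        break
    time.sleep(2)

if result["status"] == "complete":
    print(result["markdown"])
else:
    print(result.get("error"))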
Response fields:
| Field | Type | Description |
|---|---|---|
| status | string | processing, complete, or failed |
| success | bool | Whether conversion succeeded |
| markdown | string | Markdown output (if format is markdown) |
| html | string | HTML output (if format is html) |
| json | object | JSON output (if format is json) |
| chunks | object | Chunked output (if format is chunks) |
| images | object | Extracted images as {filename: base64} |
| metadata | object | Document metadata |
| page_count | int | Number of pages processed |
| parse_quality_score | float | Quality score (0-5) |
| cost_breakdown | object | Cost in cents |
| extraction_schema_json | string | Extracted data (if page_schema provided) |
| segmentation_results | object | Segmentation results (if schema provided) |
| error | string | Error message if failed |
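Because images arrive as a {filename: base64} mapping, they need decoding before they can be viewed or stored. A minimal sketch, assuming the mapping shape described above; the helper name save_images is hypothetical.

```python
import base64
from pathlib import Path


def save_images(result, out_dir):
    """Decode result["images"] ({filename: base64}) into files in out_dir.

    Returns the list of paths written. Only the final path component of
    each filename is used, so a name cannot escape out_dir.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name, encoded in (result.get("images") or {}).items():
        target = out / Path(name).name
        target.write_bytes(base64.b64decode(encoded))
        written.append(target)
    return written
```

Calling save_images(result, "images") after a completed conversion writes each extracted image next to the Markdown output.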
Structured Extraction
Extract structured data from documents using a JSON schema. This is a feature of the Marker endpoint.
import json

schema = {
    "invoice_number": {"type": "string", "description": "Invoice ID"},
    "total": {"type": "number", "description": "Total amount"},
    "line_items": {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "description": {"type": "string"},
                "amount": {"type": "number"},
            },
        },
    },
}

# Open the file in a with block so the handle is closed after the upload
with open("invoice.pdf", "rb") as f:
    response = requests.post(
        "https://www.datalab.to/api/v1/marker",
        files={"file": ("invoice.pdf", f, "application/pdf")},
        data={
            "page_schema": json.dumps(schema),
            "mode": "balanced",
        },
        headers=headers,
    )
The extracted data is returned in extraction_schema_json.
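Since extraction_schema_json is documented as a string, it has to be decoded before the fields can be read. A short sketch, with parse_extraction as a hypothetical helper name:

```python
import json


def parse_extraction(result):
    """Return the extracted data from a completed Marker response.

    extraction_schema_json is documented as a string, so it is decoded
    here; a value that is already a dict is passed through unchanged.
    """
    raw = result.get("extraction_schema_json")
    if raw is None:
        raise KeyError("Response has no extraction_schema_json field")
    return json.loads(raw) if isinstance(raw, str) else raw
```

For the invoice schema above, parse_extraction(result)["total"] would then give the extracted total, provided the job completed successfully.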
See Structured Extraction for detailed examples.
Form Filling
Fill forms in PDFs and images.
Endpoint: POST /api/v1/fill
Request
import json

field_data = {
    "full_name": {"value": "John Doe", "description": "Full legal name"},
    "date": {"value": "2024-01-15", "description": "Today's date"},
    "signature": {"value": "John Doe", "description": "Signature field"},
}

# Open the file in a with block so the handle is closed after the upload
with open("form.pdf", "rb") as f:
    response = requests.post(
        "https://www.datalab.to/api/v1/fill",
        files={"file": ("form.pdf", f, "application/pdf")},
        data={
            "field_data": json.dumps(field_data),
            "confidence_threshold": "0.5",
        },
        headers=headers,
    )
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | - | Form file (PDF or image) |
| file_url | string | - | URL to form |
| field_data | string | - | JSON mapping field names to values |
| context | string | - | Additional context for field matching |
| confidence_threshold | float | 0.5 | Minimum confidence for matching (0-1) |
| page_range | string | - | Specific pages to process |
| skip_cache | bool | false | Skip cached results |
Format of each field_data entry:
{
  "field_key": {
    "value": "The value to fill",
    "description": "Description to help match the field"
  }
}
Response
| Field | Type | Description |
|---|---|---|
| status | string | Processing status |
| success | bool | Whether filling succeeded |
| output_format | string | pdf or png |
| output_base64 | string | Base64-encoded filled form |
| fields_filled | array | Successfully filled field names |
| fields_not_found | array | Unmatched field names |
| page_count | int | Pages processed |
| cost_breakdown | object | Cost details |
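The filled form comes back base64-encoded, so a final decoding step is needed to get a usable file. The sketch below assumes the response fields listed above; save_filled_form and its default stem are hypothetical names chosen for illustration.

```python
import base64
from pathlib import Path


def save_filled_form(result, stem="filled_form"):
    """Write output_base64 to "<stem>.<output_format>" and return the path.

    stem is a hypothetical default file name chosen for this example.
    """
    if not result.get("success"):
        raise RuntimeError("Form filling did not succeed")
    fmt = result["output_format"]
    if fmt not in ("pdf", "png"):
        raise ValueError(f"Unexpected output_format: {fmt!r}")
    path = Path(f"{stem}.{fmt}")
    path.write_bytes(base64.b64decode(result["output_base64"]))
    return path
```

It is also worth inspecting fields_not_found in the same response, since unmatched fields are reported there rather than raised as errors.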
See Form Filling for more examples.
File Management
Upload and manage files for use in workflows.
Upload File
Step 1: Request an upload URL
POST /api/v1/files/upload
Content-Type: application/json

{
  "filename": "document.pdf",
  "content_type": "application/pdf"
}
Response:
{
  "file_id": 123,
  "upload_url": "https://...",
  "reference": "datalab://file-abc123"
}
Step 2: Upload directly to the presigned URL
PUT {upload_url}
Content-Type: application/pdf

<file contents>
Step 3: Confirm upload
GET /api/v1/files/{file_id}/confirm
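The three upload steps can be chained in one function. This is a sketch under stated assumptions: the base URL is taken from the host used in the other examples, the helper name upload_file is hypothetical, and the API key is deliberately sent only to Datalab and not to the presigned URL.

```python
API_BASE = "https://www.datalab.to/api/v1"  # assumed from the examples above


def upload_file(http, headers, filename, content, content_type):
    """Run the three upload steps and return the file's reference string.

    http is anything with requests-style post/put/get methods, such as
    the requests module. No assumption is made about the confirm body.
    """
    # Step 1: request an upload URL.
    step1 = http.post(
        f"{API_BASE}/files/upload",
        json={"filename": filename, "content_type": content_type},
        headers=headers,
    )
    step1.raise_for_status()
    info = step1.json()

    # Step 2: upload the bytes directly to the presigned URL.
    step2 = http.put(
        info["upload_url"], data=content, headers={"Content-Type": content_type}
    )
    step2.raise_for_status()

    # Step 3: confirm the upload.
    step3 = http.get(f"{API_BASE}/files/{info['file_id']}/confirm", headers=headers)
    step3.raise_for_status()
    return info["reference"]
```

A call such as upload_file(requests, headers, "document.pdf", open("document.pdf", "rb").read(), "application/pdf") would return a reference like the datalab://file-abc123 value shown in Step 1.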
List Files
GET /api/v1/files?limit=50&offset=0
GET /api/v1/files/{file_id}
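The limit and offset parameters on the list endpoint suggest ordinary offset pagination. The generator below sketches that loop without assuming the response layout: fetch_page is a caller-supplied function (hypothetical) that performs the GET and returns the list of items for one page.

```python
def iter_pages(fetch_page, limit=50):
    """Yield items from a limit/offset endpoint until a short page arrives.

    fetch_page(limit, offset) must return a list of items. The response
    field that holds the list is not specified in this overview, so the
    extraction is left to the caller.
    """
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page
        if len(page) < limit:
            return
        offset += limit
```

Stopping on the first page shorter than limit costs one extra request when the total is an exact multiple of limit, which keeps the loop independent of any total-count field.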
Get Download URL
GET /api/v1/files/{file_id}/download?expires_in=3600
Delete File
DELETE /api/v1/files/{file_id}
See File Management for detailed examples.
Webhooks
Configure webhooks to receive notifications when processing completes instead of polling.
Set a default webhook URL in your account settings, or override it per request with the webhook_url parameter.
See Webhooks for configuration details.
Rate Limits
Default rate limits apply per API key. If you exceed limits, you’ll receive a 429 response.
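A client can absorb occasional 429 responses by retrying with a growing delay. The following is a generic sketch rather than documented Datalab behavior: the helper name, attempt count, and delays are illustrative, and the Retry-After branch only applies if the server sends that standard HTTP header.

```python
import time


def send_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-429 response or attempts run out.

    send is a zero-argument function returning a requests-style response.
    A numeric Retry-After header is honoured when present; otherwise the
    delay doubles on each attempt. The last response is returned as-is.
    """
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 429 or attempt == max_attempts - 1:
            return response
        retry_after = response.headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * 2 ** attempt
        sleep(delay)
```

Usage might look like send_with_retry(lambda: requests.get(check_url, headers=headers)), with the caller still checking the final status code.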
See Rate Limits for details and how to request higher limits.