Convert PDFs, Word documents, spreadsheets, and images to machine-readable formats. Marker handles complex layouts, tables, math, and images.
SDK Usage
The simplest way to convert documents:
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

# Basic conversion
result = client.convert("document.pdf")
print(result.markdown)

# With options
options = ConvertOptions(
    output_format="markdown",
    mode="balanced",
    paginate=True
)
result = client.convert("document.pdf", options=options)

# Save images
for filename, image_data in result.images.items():
    with open(filename, "wb") as f:
        f.write(image_data)
See SDK Conversion for complete SDK documentation.
REST API
Submit Request
curl -X POST https://www.datalab.to/api/v1/marker \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@document.pdf" \
  -F "output_format=markdown" \
  -F "mode=balanced"
Response:
{
  "success": true,
  "request_id": "abc123",
  "request_check_url": "https://www.datalab.to/api/v1/marker/abc123"
}
Poll for Results
curl https://www.datalab.to/api/v1/marker/abc123 \
  -H "X-API-Key: YOUR_API_KEY"
Response when complete:
{
  "status": "complete",
  "success": true,
  "output_format": "markdown",
  "markdown": "# Document Title\n\nContent here...",
  "images": {},
  "metadata": {},
  "page_count": 5,
  "parse_quality_score": 4.2,
  "cost_breakdown": {
    "total_cents": 1.5
  }
}
Parameters
Core Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | - | Document file (multipart upload) |
| file_url | string | - | URL to document (alternative to file) |
| output_format | string | markdown | Output format: markdown, html, json, chunks |
| mode | string | balanced | Processing mode (see below) |
Processing Modes
| Mode | Description |
|---|---|
| fast | Lowest latency, good for simple documents |
| balanced | Balance of speed and accuracy (default) |
| accurate | Highest accuracy, best for complex layouts |
Page Control
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_pages | int | - | Maximum pages to process |
| page_range | string | - | Specific pages (e.g., "0-5,10", 0-indexed) |
| paginate | bool | false | Add page delimiters to output |
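Because page_range is a 0-indexed string of ranges, it can be convenient to build it from a list of page numbers rather than by hand. The following is a minimal sketch; build_page_range is a hypothetical helper name and is not part of the Datalab SDK.

```python
def build_page_range(pages):
    """Collapse 0-indexed page numbers into a string such as "0-4,10,15-20".

    Hypothetical helper, not part of the Datalab SDK.
    """
    ordered = sorted(set(pages))
    if not ordered:
        return ""
    parts = []
    start = prev = ordered[0]
    for page in ordered[1:]:
        if page == prev + 1:
            # Still inside a consecutive run; extend it.
            prev = page
            continue
        # The run ended: emit it as "n" or "a-b", then start a new run.
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = page
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


# Page numbers taken from a viewer are usually 1-indexed; shift them first.
viewer_pages = [1, 2, 3, 11]
page_range = build_page_range(p - 1 for p in viewer_pages)
```

The resulting string can be passed as page_range in ConvertOptions or as a form field in a REST request.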
Image Handling
| Parameter | Type | Default | Description |
|---|---|---|---|
| disable_image_extraction | bool | false | Don’t extract images |
| disable_image_captions | bool | false | Don’t generate image captions |
Advanced Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| add_block_ids | bool | false | Add data-block-id attributes to HTML elements |
| skip_cache | bool | false | Skip cached results |
| save_checkpoint | bool | false | Save checkpoint for reuse |
| page_schema | string | - | JSON schema for structured extraction |
| segmentation_schema | string | - | JSON schema for document segmentation |
| extras | string | - | Comma-separated: track_changes, chart_understanding, extract_links |
| additional_config | string | - | JSON with extra config (see below) |
| webhook_url | string | - | Override webhook URL for this request |
Additional Config Options
Pass these keys as a JSON string in additional_config:

| Key | Type | Description |
|---|---|---|
| keep_spreadsheet_formatting | bool | Preserve spreadsheet formatting |
| keep_pageheader_in_output | bool | Include page headers |
| keep_pagefooter_in_output | bool | Include page footers |
Example:
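import json

from datalab_sdk import ConvertOptions

# additional_config must be a JSON string, so serialize the dict first
options = ConvertOptions(
    additional_config=json.dumps({
        "keep_spreadsheet_formatting": True,
        "keep_pageheader_in_output": False
    })
)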
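For REST calls, several of the parameters above are strings that wrap structured values: extras is comma-separated, and page_schema and additional_config are JSON. A small helper can do that serialization in one place. This is only a sketch: build_form_data is a hypothetical name, and sending booleans as lowercase "true"/"false" form values is an assumption, not something this page states.

```python
import json


def build_form_data(output_format="markdown", mode="balanced", extras=None,
                    page_schema=None, additional_config=None, **flags):
    """Build the form fields for a conversion request (hypothetical helper)."""
    data = {"output_format": output_format, "mode": mode}
    if extras:
        # extras is documented as a comma-separated string
        data["extras"] = ",".join(extras)
    if page_schema is not None:
        # page_schema is documented as a JSON schema passed as a string
        data["page_schema"] = json.dumps(page_schema)
    if additional_config is not None:
        data["additional_config"] = json.dumps(additional_config)
    for name, value in flags.items():
        # ASSUMPTION: booleans are sent as lowercase "true"/"false"
        data[name] = str(value).lower() if isinstance(value, bool) else str(value)
    return data


form = build_form_data(
    extras=["track_changes", "extract_links"],
    additional_config={"keep_pageheader_in_output": False},
    paginate=True,
    max_pages=10,
)
```

The returned dict is meant to be passed as the data argument of requests.post alongside the files upload.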
Response Fields
| Field | Type | Description |
|---|---|---|
| status | string | processing, complete, or failed |
| success | bool | Whether conversion succeeded |
| output_format | string | Requested output format |
| markdown | string | Markdown output (if format is markdown) |
| html | string | HTML output (if format is html) |
| json | object | JSON output (if format is json) |
| chunks | object | Chunked output (if format is chunks) |
| images | object | Extracted images as {filename: base64} |
| metadata | object | Document metadata |
| page_count | int | Number of pages processed |
| parse_quality_score | float | Quality score (0-5) |
| cost_breakdown | object | Cost in cents |
| checkpoint_id | string | Checkpoint ID (if save_checkpoint was true) |
| error | string | Error message if failed |
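Two details of the REST response are easy to miss: the converted content lives in the field named after output_format, and images arrive as base64 strings rather than raw bytes. The sketch below shows one way to read both from a parsed response dict. get_output and decode_images are hypothetical helper names, and the sample result is made up for illustration.

```python
import base64


def get_output(result):
    """Return the converted content from a completed REST result (hypothetical helper)."""
    if result.get("status") != "complete" or not result.get("success"):
        message = result.get("error") or f"not ready: status={result.get('status')}"
        raise RuntimeError(message)
    # The content field shares its name with the requested output format.
    return result[result["output_format"]]


def decode_images(result):
    """Return {filename: bytes} decoded from the base64 `images` map."""
    return {
        name: base64.b64decode(encoded)
        for name, encoded in (result.get("images") or {}).items()
    }


# Illustrative result only; real responses come from the polling endpoint.
sample = {
    "status": "complete",
    "success": True,
    "output_format": "html",
    "html": "<p>Hello</p>",
    "images": {"figure_1.png": base64.b64encode(b"\x89PNG").decode("ascii")},
}
content = get_output(sample)
images = decode_images(sample)
```

Each decoded value can then be written to disk in binary mode.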
Examples
Convert with High Accuracy
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

options = ConvertOptions(
    mode="accurate",
    output_format="markdown"
)
result = client.convert("complex_document.pdf", options=options)
print(f"Quality score: {result.parse_quality_score}")
print(result.markdown)
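Since each result carries a parse_quality_score, one possible pattern is to start with a faster mode and move to a more accurate one only when the score is low. The sketch below is an illustration, not a documented Datalab feature: convert_with_escalation is a hypothetical helper, the 3.5 threshold is an arbitrary assumption, and each attempt is a separate conversion request.

```python
def convert_with_escalation(convert, path, min_score=3.5,
                            modes=("fast", "balanced", "accurate")):
    """Try each mode in order until the quality score reaches min_score.

    `convert` is any callable taking (path, mode) and returning an object
    with a `parse_quality_score` attribute. Hypothetical helper; the default
    threshold is an arbitrary assumption.
    """
    if not modes:
        raise ValueError("modes must not be empty")
    result = None
    used_mode = None
    for used_mode in modes:
        result = convert(path, used_mode)
        score = getattr(result, "parse_quality_score", None)
        if score is not None and score >= min_score:
            break
    # If no mode reached the threshold, the last attempt is returned.
    return result, used_mode


# With the SDK, the callable could be wired up like this (untested sketch):
# client = DatalabClient()
# result, mode = convert_with_escalation(
#     lambda p, m: client.convert(p, options=ConvertOptions(mode=m)),
#     "complex_document.pdf",
# )
```

Passing the conversion step as a callable keeps the escalation logic independent of the SDK, which makes it easy to exercise with a stub.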
HTML with Block IDs for Citations
options = ConvertOptions(
    output_format="html",
    add_block_ids=True
)
result = client.convert("document.pdf", options=options)
# HTML elements have data-block-id attributes for citation tracking
Process Specific Pages
options = ConvertOptions(
    page_range="0-4,10,15-20",  # Pages 0-4, 10, and 15-20 (0-indexed)
    output_format="markdown"
)
result = client.convert("large_document.pdf", options=options)
Track Changes
options = ConvertOptions(
    extras="track_changes",
    output_format="json"
)
result = client.convert("document_with_changes.docx", options=options)
Full Python Example (REST API)
import os
import time

import requests

API_URL = "https://www.datalab.to/api/v1/marker"
API_KEY = os.getenv("DATALAB_API_KEY")


def convert_document(file_path, output_format="markdown", mode="balanced"):
    headers = {"X-API-Key": API_KEY}

    # Submit request
    with open(file_path, "rb") as f:
        response = requests.post(
            API_URL,
            files={"file": (file_path, f, "application/pdf")},
            data={
                "output_format": output_format,
                "mode": mode
            },
            headers=headers
        )
    data = response.json()
    check_url = data["request_check_url"]

    # Poll for completion (up to 300 attempts, 2 seconds apart)
    for _ in range(300):
        response = requests.get(check_url, headers=headers)
        result = response.json()
        if result["status"] == "complete":
            return result
        elif result["status"] == "failed":
            raise Exception(f"Conversion failed: {result.get('error')}")
        time.sleep(2)

    raise Exception("Timeout waiting for conversion")


# Usage
result = convert_document("document.pdf", mode="balanced")
print(result["markdown"])
Results are deleted from Datalab servers one hour after processing completes. Retrieve your results promptly.
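Given the one-hour retention window, it may be worth writing each result to local storage as soon as polling returns. The following is a minimal sketch that saves the full response as JSON and, when present, the markdown output as a separate file. save_result and the file naming are hypothetical choices, not part of the Datalab SDK.

```python
import json
from pathlib import Path


def save_result(result, out_dir, stem="result"):
    """Write a REST result dict to disk and return the paths written.

    Hypothetical helper; file names are an illustrative choice.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Keep the complete response so metadata and images are not lost.
    json_path = out / f"{stem}.json"
    json_path.write_text(json.dumps(result, indent=2), encoding="utf-8")
    written = [json_path]

    # Also store the markdown on its own for easy reading, if it exists.
    markdown = result.get("markdown")
    if isinstance(markdown, str):
        md_path = out / f"{stem}.md"
        md_path.write_text(markdown, encoding="utf-8")
        written.append(md_path)
    return written
```

Calling save_result(result, "output") directly after a successful poll keeps a local copy that no longer depends on the server-side result.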