Convert PDFs, Word documents, spreadsheets, and images to machine-readable formats. Marker handles complex layouts, tables, math, and images.
Before you begin, make sure you have:
A Datalab account with an API key (new accounts include $5 in free credits)
Python 3.10+ installed
The Datalab SDK: pip install datalab-python-sdk
Your DATALAB_API_KEY environment variable set
Quick Start
```python
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

# Basic conversion
result = client.convert("document.pdf")
print(result.markdown)

# With options
options = ConvertOptions(
    output_format="markdown",
    mode="balanced",
    paginate=True,
)
result = client.convert("document.pdf", options=options)
```
The SDK handles polling automatically. For the REST API, you submit a request and poll the request_check_url until the status is complete.
See SDK Conversion for complete SDK documentation.
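If you are calling the REST API directly rather than using the SDK, the poll loop described above can be sketched as a small helper. This is a sketch, not part of the SDK: `check` stands in for whatever zero-argument callable performs the GET against your `request_check_url` and returns the parsed JSON, and the `status` values are taken from the response fields documented below.

```python
import time


def poll_until_complete(check, interval=2.0, max_polls=300):
    """Poll a status-returning callable until the job finishes.

    `check` is any zero-argument callable returning a dict with a
    "status" key ("processing", "complete", or "failed") -- in practice
    a GET against the request_check_url from the submit response.
    """
    for _ in range(max_polls):
        data = check()
        if data.get("status") == "complete":
            return data
        if data.get("status") == "failed":
            raise RuntimeError(data.get("error", "conversion failed"))
        time.sleep(interval)
    raise TimeoutError("polling did not complete within max_polls")
```

Keeping the HTTP call behind a callable makes the loop easy to test and lets you swap in retry or backoff logic without touching the polling code.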
File limits: Maximum file size is 200 MB, with up to 7,000 pages per request. See API Limits for the full list.
Parameters
Core Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | - | Document file (multipart upload) |
| file_url | string | - | URL to document (alternative to file) |
| output_format | string | markdown | Output format: markdown, html, json, chunks |
| mode | string | balanced | Processing mode (see below) |
Which output format should I use?
LLM/RAG pipelines → markdown (default, most compatible)
Web display → html (preserves visual structure)
Programmatic access to blocks → json (includes bounding boxes and block types)
Embedding and search → chunks (pre-chunked for vector databases)
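When you request json output, you get a tree of typed blocks rather than a flat string, which is what makes programmatic access possible. The traversal below is a sketch assuming a nested block shape in which each block carries a `block_type` and an optional `children` list; verify the exact field names against the JSON your documents actually return.

```python
def collect_blocks(blocks, block_type):
    """Recursively collect all blocks of a given type from a nested
    block tree (assumed shape: each block may have "block_type" and
    "children" keys)."""
    matches = []
    for block in blocks:
        if block.get("block_type") == block_type:
            matches.append(block)
        matches.extend(collect_blocks(block.get("children") or [], block_type))
    return matches
```

For example, `collect_blocks(result.json["children"], "Table")` would pull every table block out of a converted document for downstream processing.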
Processing Modes
| Mode | Description | Best For |
|---|---|---|
| fast | Lowest latency, good for simple documents | High-throughput pipelines, simple layouts |
| balanced | Balance of speed and accuracy (recommended) | Most use cases |
| accurate | Highest accuracy, best for complex layouts | Complex tables, dense layouts, scanned documents |
Which mode should I use?
Most use cases → balanced (recommended default)
Simple, clean PDFs at high throughput → fast
Scanned documents, complex tables, or dense layouts → accurate
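If you route many heterogeneous documents through one pipeline, the guidance above can be encoded as a small heuristic. This is an illustrative sketch, not part of the SDK; the input flags and the page-count threshold are assumptions you should tune for your own workload.

```python
def choose_mode(page_count, is_scanned, has_complex_tables):
    """Pick a processing mode following the guidance above:
    accurate for scans and complex tables, fast for large simple
    documents, balanced otherwise."""
    if is_scanned or has_complex_tables:
        return "accurate"
    if page_count > 500:  # arbitrary high-throughput cutoff; tune as needed
        return "fast"
    return "balanced"
```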
Page Control
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_pages | int | - | Maximum pages to process |
| page_range | string | - | Specific pages (e.g., "0-5,10", 0-indexed) |
| paginate | bool | false | Add page delimiters to output |
Image Handling
| Parameter | Type | Default | Description |
|---|---|---|---|
| disable_image_extraction | bool | false | Don't extract images |
| disable_image_captions | bool | false | Don't generate image captions |
Advanced Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| add_block_ids | bool | false | Add data-block-id attributes to HTML elements |
| skip_cache | bool | false | Skip cached results |
| save_checkpoint | bool | false | Save checkpoint for reuse |
| extras | string | - | Comma-separated: track_changes, chart_understanding, extract_links, table_row_bboxes, infographic, new_block_types |
| additional_config | string | - | JSON with extra config (see below) |
| webhook_url | string | - | Override webhook URL for this request |
Additional Config Options
Pass as JSON string in additional_config:
| Key | Type | Description |
|---|---|---|
| keep_spreadsheet_formatting | bool | Preserve spreadsheet formatting |
| keep_pageheader_in_output | bool | Include page headers |
| keep_pagefooter_in_output | bool | Include page footers |
Example:
```python
import json

options = ConvertOptions(
    additional_config=json.dumps({
        "keep_spreadsheet_formatting": True,
        "keep_pageheader_in_output": False,
    })
)
```
Response Fields
| Field | Type | Description |
|---|---|---|
| status | string | processing, complete, or failed |
| success | bool | Whether conversion succeeded |
| output_format | string | Requested output format |
| markdown | string | Markdown output (if format is markdown) |
| html | string | HTML output (if format is html) |
| json | object | JSON output (if format is json) |
| chunks | object | Chunked output (if format is chunks) |
| images | object | Extracted images as {filename: base64} |
| metadata | object | Document metadata |
| page_count | int | Number of pages processed |
| parse_quality_score | float | Quality score (0-5) |
| cost_breakdown | object | Cost in cents |
| checkpoint_id | string | Checkpoint ID (if save_checkpoint was true) |
| error | string | Error message if failed |
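Since the images field is a {filename: base64} mapping, writing the extracted images to disk is a short decode loop. A minimal sketch:

```python
import base64
from pathlib import Path


def save_images(images, out_dir="extracted_images"):
    """Write extracted images to disk.

    `images` is the `images` response field: a mapping of filename
    to base64-encoded image bytes.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, encoded in images.items():
        (out / name).write_bytes(base64.b64decode(encoded))
    return sorted(p.name for p in out.iterdir())
```

After a conversion, `save_images(result.images)` would leave the decoded files next to your script.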
Examples
Convert with High Accuracy
```python
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()
options = ConvertOptions(
    mode="accurate",
    output_format="markdown",
)
result = client.convert("complex_document.pdf", options=options)
print(f"Quality score: {result.parse_quality_score}")
print(result.markdown)
```
HTML with Block IDs for Citations
```python
options = ConvertOptions(
    output_format="html",
    add_block_ids=True,
)
result = client.convert("document.pdf", options=options)
# HTML elements have data-block-id attributes for citation tracking
```
Process Specific Pages
```python
options = ConvertOptions(
    page_range="0-4,10,15-20",  # Pages 0-4, 10, and 15-20
    output_format="markdown",
)
result = client.convert("large_document.pdf", options=options)
```
Track Changes in Word Documents
```python
options = ConvertOptions(
    extras="track_changes",
    output_format="json",
)
result = client.convert("document_with_changes.docx", options=options)
```
Parse Quality Score
Every conversion response includes a parse_quality_score (0-5) that indicates how well the document was parsed:
| Score Range | Quality | Recommended Action |
|---|---|---|
| 4.0 - 5.0 | Excellent | Use the output directly |
| 3.0 - 3.9 | Good | Review for minor issues |
| 2.0 - 2.9 | Fair | Consider retrying with accurate mode |
| 0.0 - 1.9 | Poor | Retry with accurate mode or check the input file |
Use quality scores to build automated quality gates:
```python
result = client.convert("document.pdf", options=ConvertOptions(mode="balanced"))
if result.parse_quality_score < 3.0:
    # Retry with higher accuracy
    result = client.convert("document.pdf", options=ConvertOptions(mode="accurate"))
```
See Conditional Routing for quality-based workflow routing.
Checkpoints
Save a processing checkpoint to reuse parsed results for extraction or segmentation without re-processing:
```python
from datalab_sdk import DatalabClient, ConvertOptions, ExtractOptions
import json

client = DatalabClient()

# Step 1: Convert and save checkpoint
options = ConvertOptions(
    save_checkpoint=True,
    output_format="markdown",
)
result = client.convert("document.pdf", options=options)
checkpoint_id = result.checkpoint_id

# Step 2: Use checkpoint for extraction (no re-processing needed)
extraction_options = ExtractOptions(
    page_schema=json.dumps({"type": "object", "properties": {"title": {"type": "string"}}}),
    checkpoint_id=checkpoint_id,
)
extract_result = client.extract("document.pdf", options=extraction_options)
```
Checkpoints save time and cost when you need to run multiple operations (extraction, segmentation) on the same document.
Results are deleted from Datalab servers one hour after processing completes. Retrieve your results promptly.
Next Steps