Convert PDFs, Word documents, spreadsheets, and images to machine-readable formats. Marker handles complex layouts, tables, math, and images.

Before you begin, make sure you have:
  1. A Datalab account with an API key (new accounts include $5 in free credits)
  2. Python 3.10+ installed
  3. The Datalab SDK: pip install datalab-python-sdk
  4. Your DATALAB_API_KEY environment variable set
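
A quick pre-flight check can confirm the last prerequisite before any client is created. This is a minimal sketch; `require_api_key` is a hypothetical helper and not part of the SDK.

```python
import os


def require_api_key(env=None):
    """Return the Datalab API key, or raise a clear error if it is missing."""
    # `env` defaults to the real environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    key = env.get("DATALAB_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "DATALAB_API_KEY is not set; export it before creating a client."
        )
    return key
```

Calling `require_api_key()` at startup turns a missing key into an explicit error up front.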

Quick Start

from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

# Basic conversion
result = client.convert("document.pdf")
print(result.markdown)

# With options
options = ConvertOptions(
    output_format="markdown",
    mode="balanced",
    paginate=True
)
result = client.convert("document.pdf", options=options)
The SDK handles polling automatically. For the REST API, you submit a request and poll the request_check_url until the status is complete. See SDK Conversion for complete SDK documentation.
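
For REST users, the polling loop described above might look like the sketch below. It assumes the check URL returns a JSON body with a `status` field of `processing`, `complete`, or `failed`. The HTTP call itself is left to a caller-supplied `fetch_json` function, a hypothetical parameter, so no endpoint or header names are assumed here.

```python
import time


def poll_until_done(check_url, fetch_json, interval=2.0, max_polls=300, sleep=time.sleep):
    """Poll check_url until the job reaches a final status.

    fetch_json: any callable that GETs a URL and returns the decoded JSON body as a dict.
    """
    for _ in range(max_polls):
        body = fetch_json(check_url)
        status = body.get("status")
        if status == "complete":
            return body
        if status == "failed":
            raise RuntimeError(body.get("error") or "conversion failed")
        # Still processing (or an unexpected status): wait, then ask again.
        sleep(interval)
    raise TimeoutError(f"no final status after {max_polls} polls")
```

The `interval` and `max_polls` defaults are illustrative, not documented values; tune them to your document sizes.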
File limits: Maximum file size is 200 MB, with up to 7,000 pages per request. See API Limits for the full list.
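
Checking these limits locally avoids uploading a file that would be rejected. The sketch below is one way to do it, assuming "200 MB" means 200 × 1024 × 1024 bytes (the exact byte threshold is not stated here). `check_limits` is a hypothetical helper; the page count must come from the caller, since reading it needs a PDF library.

```python
import os

MAX_FILE_BYTES = 200 * 1024 * 1024  # assumption: "200 MB" read as 200 MiB
MAX_PAGES = 7000


def check_limits(path, page_count=None, max_bytes=MAX_FILE_BYTES, max_pages=MAX_PAGES):
    """Raise ValueError if the file is over the size or page limit; return its size."""
    size = os.path.getsize(path)
    if size > max_bytes:
        raise ValueError(f"{path} is {size} bytes; the limit is {max_bytes}")
    if page_count is not None and page_count > max_pages:
        raise ValueError(f"{page_count} pages exceeds the limit of {max_pages}")
    return size
```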

Parameters

Core Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| file | file | - | Document file (multipart upload) |
| file_url | string | - | URL to document (alternative to file) |
| output_format | string | markdown | Output format: markdown, html, json, chunks |
| mode | string | balanced | Processing mode (see below) |
Which output format should I use?
  • LLM/RAG pipelines → markdown (default, most compatible)
  • Web display → html (preserves visual structure)
  • Programmatic access to blocks → json (includes bounding boxes and block types)
  • Embedding and search → chunks (pre-chunked for vector databases)

Processing Modes

| Mode | Description | Best For |
| --- | --- | --- |
| fast | Lowest latency, good for simple documents | High-throughput pipelines, simple layouts |
| balanced | Balance of speed and accuracy (recommended) | Most use cases |
| accurate | Highest accuracy, best for complex layouts | Complex tables, dense layouts, scanned documents |
Which mode should I use?
  • Most use cases → balanced (recommended default)
  • Simple, clean PDFs at high throughput → fast
  • Scanned documents, complex tables, or dense layouts → accurate

Page Control

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_pages | int | - | Maximum pages to process |
| page_range | string | - | Specific pages (e.g., "0-5,10", 0-indexed) |
| paginate | bool | false | Add page delimiters to output |
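
When page numbers come from code rather than being typed by hand, a small helper can produce the `page_range` string. This sketch assumes ranges are inclusive at both ends, as the "0-5,10" example suggests. `build_page_range` is a hypothetical helper, not an SDK function.

```python
def build_page_range(pages):
    """Collapse 0-indexed page numbers into a string such as "0-4,10,15-20"."""
    ordered = sorted(set(pages))
    if ordered and ordered[0] < 0:
        raise ValueError("page numbers are 0-indexed and cannot be negative")

    def fmt(start, end):
        return str(start) if start == end else f"{start}-{end}"

    parts = []
    start = prev = None
    for page in ordered:
        if start is None:
            start = prev = page
        elif page == prev + 1:
            prev = page  # extend the current run
        else:
            parts.append(fmt(start, prev))  # close the run and start a new one
            start = prev = page
    if start is not None:
        parts.append(fmt(start, prev))
    return ",".join(parts)
```

For example, `build_page_range([0, 1, 2, 3, 4, 10])` returns `"0-4,10"`.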

Image Handling

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| disable_image_extraction | bool | false | Don’t extract images |
| disable_image_captions | bool | false | Don’t generate image captions |

Advanced Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| add_block_ids | bool | false | Add data-block-id attributes to HTML elements |
| skip_cache | bool | false | Skip cached results |
| save_checkpoint | bool | false | Save checkpoint for reuse |
| extras | string | - | Comma-separated: track_changes, chart_understanding, extract_links, table_row_bboxes, infographic, new_block_types |
| additional_config | string | - | JSON with extra config (see below) |
| webhook_url | string | - | Override webhook URL for this request |
For structured extraction, use the Extract API. For document segmentation, use the Segment API.
The track_changes extra is supported on this endpoint. You can also use the dedicated Track Changes endpoint.
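
Because `extras` is a single comma-separated string, a typo in one name is easy to miss. The sketch below builds the string from the names listed in the table above and rejects anything else. `build_extras` is a hypothetical helper, and joining without spaces is an assumption about the expected format.

```python
KNOWN_EXTRAS = (
    "track_changes",
    "chart_understanding",
    "extract_links",
    "table_row_bboxes",
    "infographic",
    "new_block_types",
)


def build_extras(*names):
    """Validate extra names and join them into one comma-separated string."""
    unknown = [name for name in names if name not in KNOWN_EXTRAS]
    if unknown:
        raise ValueError(f"unknown extras: {', '.join(unknown)}")
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return ",".join(dict.fromkeys(names))
```

The result can be passed as `ConvertOptions(extras=build_extras("track_changes", "extract_links"))`.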

Additional Config Options

Pass these keys as a JSON string in additional_config:
| Key | Type | Description |
| --- | --- | --- |
| keep_spreadsheet_formatting | bool | Preserve spreadsheet formatting |
| keep_pageheader_in_output | bool | Include page headers |
| keep_pagefooter_in_output | bool | Include page footers |
Example:
import json

options = ConvertOptions(
    additional_config=json.dumps({
        "keep_spreadsheet_formatting": True,
        "keep_pageheader_in_output": False
    })
)

Response Fields

| Field | Type | Description |
| --- | --- | --- |
| status | string | processing, complete, or failed |
| success | bool | Whether conversion succeeded |
| output_format | string | Requested output format |
| markdown | string | Markdown output (if format is markdown) |
| html | string | HTML output (if format is html) |
| json | object | JSON output (if format is json) |
| chunks | object | Chunked output (if format is chunks) |
| images | object | Extracted images as {filename: base64} |
| metadata | object | Document metadata |
| page_count | int | Number of pages processed |
| parse_quality_score | float | Quality score (0-5) |
| cost_breakdown | object | Cost in cents |
| checkpoint_id | string | Checkpoint ID (if save_checkpoint was true) |
| error | string | Error message if failed |
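
The `images` field maps filenames to base64 strings, so writing the images to disk takes a decode step. The following is a minimal sketch assuming the mapping arrives exactly in that {filename: base64} shape. `save_images` is a hypothetical helper, not part of the SDK.

```python
import base64
from pathlib import Path


def save_images(images, out_dir):
    """Decode a {filename: base64} mapping and write each image into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, encoded in (images or {}).items():
        # Keep only the final path component so a filename cannot escape out_dir.
        target = out / Path(filename).name
        target.write_bytes(base64.b64decode(encoded))
        written.append(target)
    return written
```

Saving the images next to the Markdown output keeps relative image links resolvable, provided the output references them by the same filenames.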

Examples

Convert with High Accuracy

from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

options = ConvertOptions(
    mode="accurate",
    output_format="markdown"
)

result = client.convert("complex_document.pdf", options=options)
print(f"Quality score: {result.parse_quality_score}")
print(result.markdown)

HTML with Block IDs for Citations

options = ConvertOptions(
    output_format="html",
    add_block_ids=True
)

result = client.convert("document.pdf", options=options)
# HTML elements have data-block-id attributes for citation tracking
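
To use block IDs for citations, the IDs first have to be pulled out of the returned HTML. This sketch does so with the standard-library parser. The sample markup and ID values in the usage note are made up for illustration; only the `data-block-id` attribute name comes from this page.

```python
from html.parser import HTMLParser


class BlockIdCollector(HTMLParser):
    """Collect (block_id, tag) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.block_ids = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-block-id" and value is not None:
                self.block_ids.append((value, tag))


def list_block_ids(html):
    """Return every data-block-id in the HTML together with its tag name."""
    collector = BlockIdCollector()
    collector.feed(html)
    collector.close()
    return collector.block_ids
```

For the hypothetical input `'<h1 data-block-id="b0">Title</h1><p>text</p>'`, `list_block_ids` returns `[("b0", "h1")]`.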

Process Specific Pages

options = ConvertOptions(
    page_range="0-4,10,15-20",  # Pages 0-4, 10, and 15-20
    output_format="markdown"
)

result = client.convert("large_document.pdf", options=options)

Extract Track Changes from Word Documents

options = ConvertOptions(
    extras="track_changes",
    output_format="json"
)

result = client.convert("document_with_changes.docx", options=options)

Parse Quality Score

Every conversion response includes a parse_quality_score (0-5) that indicates how well the document was parsed:
| Score Range | Quality | Recommended Action |
| --- | --- | --- |
| 4.0 - 5.0 | Excellent | Use the output directly |
| 3.0 - 3.9 | Good | Review for minor issues |
| 2.0 - 2.9 | Fair | Consider retrying with accurate mode |
| 0.0 - 1.9 | Poor | Retry with accurate mode or check the input file |
Use quality scores to build automated quality gates:
result = client.convert("document.pdf", options=ConvertOptions(mode="balanced"))

if result.parse_quality_score < 3.0:
    # Retry with higher accuracy
    result = client.convert("document.pdf", options=ConvertOptions(mode="accurate"))
See Conditional Routing for quality-based workflow routing.
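
The score table can also be encoded as a lookup for logging or routing. This is a sketch only; `recommend_action` is a hypothetical helper. It treats each band as starting at its lower bound (so 3.95 counts as "good"), which is an assumption, since the table leaves the gaps between bands unspecified.

```python
def recommend_action(score):
    """Map a parse_quality_score (0-5) to a (quality, action) pair."""
    if not 0.0 <= score <= 5.0:
        raise ValueError(f"score {score} is outside the 0-5 range")
    if score >= 4.0:
        return "excellent", "use the output directly"
    if score >= 3.0:
        return "good", "review for minor issues"
    if score >= 2.0:
        return "fair", "consider retrying with accurate mode"
    return "poor", "retry with accurate mode or check the input file"
```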

Checkpoints

Save a processing checkpoint to reuse parsed results for extraction or segmentation without re-processing:
from datalab_sdk import DatalabClient, ConvertOptions, ExtractOptions
import json

client = DatalabClient()

# Step 1: Convert and save checkpoint
options = ConvertOptions(
    save_checkpoint=True,
    output_format="markdown"
)
result = client.convert("document.pdf", options=options)
checkpoint_id = result.checkpoint_id

# Step 2: Use checkpoint for extraction (no re-processing needed)
extraction_options = ExtractOptions(
    page_schema=json.dumps({"type": "object", "properties": {"title": {"type": "string"}}}),
    checkpoint_id=checkpoint_id
)
extract_result = client.extract("document.pdf", options=extraction_options)
Checkpoints save time and cost when you need to run multiple operations (extraction, segmentation) on the same document.
Results are deleted from Datalab servers one hour after processing completes. Retrieve your results promptly.

Next Steps