Convert PDFs, Word documents, spreadsheets, and images to machine-readable formats. Marker handles complex layouts, tables, math, and images.

SDK Usage

The simplest way to convert documents:
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

# Basic conversion
result = client.convert("document.pdf")
print(result.markdown)

# With options
options = ConvertOptions(
    output_format="markdown",
    mode="balanced",
    paginate=True
)
result = client.convert("document.pdf", options=options)

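# Save images (the images field maps filename -> base64 string; see Response Fields)
import base64

for filename, image_data in result.images.items():
    with open(filename, "wb") as f:
        f.write(base64.b64decode(image_data))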
See SDK Conversion for complete SDK documentation.

REST API

Submit Request

curl -X POST https://www.datalab.to/api/v1/marker \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@document.pdf" \
  -F "output_format=markdown" \
  -F "mode=balanced"
Response:
{
  "success": true,
  "request_id": "abc123",
  "request_check_url": "https://www.datalab.to/api/v1/marker/abc123"
}

Poll for Results

curl https://www.datalab.to/api/v1/marker/abc123 \
  -H "X-API-Key: YOUR_API_KEY"
Response when complete:
{
  "status": "complete",
  "success": true,
  "output_format": "markdown",
  "markdown": "# Document Title\n\nContent here...",
  "images": {},
  "metadata": {},
  "page_count": 5,
  "parse_quality_score": 4.2,
  "cost_breakdown": {
    "total_cents": 1.5
  }
}

Parameters

Core Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| file | file | - | Document file (multipart upload) |
| file_url | string | - | URL to document (alternative to file) |
| output_format | string | markdown | Output format: markdown, html, json, chunks |
| mode | string | balanced | Processing mode (see below) |

Processing Modes

| Mode | Description |
| --- | --- |
| fast | Lowest latency, good for simple documents |
| balanced | Balance of speed and accuracy (default) |
| accurate | Highest accuracy, best for complex layouts |

Page Control

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_pages | int | - | Maximum pages to process |
| page_range | string | - | Specific pages (e.g., "0-5,10", 0-indexed) |
| paginate | bool | false | Add page delimiters to output |
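
When the pages you need come from code (search hits, a table of contents), it can help to build the page_range string programmatically. The helper below is a minimal sketch, not part of the SDK: build_page_range is a hypothetical name, and it assumes ranges are inclusive on both ends, as the "0-4,10,15-20" example elsewhere on this page suggests.

```python
def build_page_range(pages):
    """Collapse 0-indexed page numbers into a compact range string.

    Hypothetical helper; assumes ranges such as "0-4" include both endpoints.
    """
    ordered = sorted(set(pages))
    if not ordered:
        return ""
    if ordered[0] < 0:
        raise ValueError("page numbers are 0-indexed and cannot be negative")

    parts = []
    start = prev = ordered[0]
    for page in ordered[1:]:
        if page == prev + 1:
            # Still inside a consecutive run; extend it.
            prev = page
            continue
        # The run ended; emit it as "n" or "start-end".
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = page
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)
```

The result can be passed as page_range in ConvertOptions or as a form field in a REST request.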

Image Handling

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| disable_image_extraction | bool | false | Don’t extract images |
| disable_image_captions | bool | false | Don’t generate image captions |

Advanced Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| add_block_ids | bool | false | Add data-block-id attributes to HTML elements |
| skip_cache | bool | false | Skip cached results |
| save_checkpoint | bool | false | Save checkpoint for reuse |
| page_schema | string | - | JSON schema for structured extraction |
| segmentation_schema | string | - | JSON schema for document segmentation |
| extras | string | - | Comma-separated: track_changes, chart_understanding, extract_links |
| additional_config | string | - | JSON with extra config (see below) |
| webhook_url | string | - | Override webhook URL for this request |
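
Several of these parameters are strings that wrap structured values: JSON for page_schema, segmentation_schema, and additional_config, and a comma-separated list for extras. For REST requests, a small encoder can keep that conversion in one place. This is a sketch under assumptions: to_form_fields is a hypothetical helper, and the lowercase "true"/"false" encoding for booleans is a guess that this page does not confirm.

```python
import json


def to_form_fields(params):
    """Flatten Python values into string form fields for a multipart request.

    Hypothetical helper. The boolean encoding ("true"/"false") is an assumption.
    """
    fields = {}
    for key, value in params.items():
        if value is None:
            continue  # omit unset parameters so server defaults apply
        if isinstance(value, bool):
            # Check bool before anything else: bool is a subclass of int.
            fields[key] = "true" if value else "false"
        elif isinstance(value, dict):
            fields[key] = json.dumps(value)  # schemas and additional_config
        elif isinstance(value, (list, tuple)):
            fields[key] = ",".join(str(item) for item in value)  # e.g. extras
        else:
            fields[key] = str(value)
    return fields
```

The returned dict can be passed as the data argument of requests.post alongside the file upload.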

Additional Config Options

Pass these keys as a JSON string in additional_config:

| Key | Type | Description |
| --- | --- | --- |
| keep_spreadsheet_formatting | bool | Preserve spreadsheet formatting |
| keep_pageheader_in_output | bool | Include page headers |
| keep_pagefooter_in_output | bool | Include page footers |

Example:
import json

options = ConvertOptions(
    additional_config=json.dumps({
        "keep_spreadsheet_formatting": True,
        "keep_pageheader_in_output": False
    })
)

Response Fields

| Field | Type | Description |
| --- | --- | --- |
| status | string | processing, complete, or failed |
| success | bool | Whether conversion succeeded |
| output_format | string | Requested output format |
| markdown | string | Markdown output (if format is markdown) |
| html | string | HTML output (if format is html) |
| json | object | JSON output (if format is json) |
| chunks | object | Chunked output (if format is chunks) |
| images | object | Extracted images as {filename: base64} |
| metadata | object | Document metadata |
| page_count | int | Number of pages processed |
| parse_quality_score | float | Quality score (0-5) |
| cost_breakdown | object | Cost in cents |
| checkpoint_id | string | Checkpoint ID (if save_checkpoint was true) |
| error | string | Error message if failed |
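
Because the images field maps filenames to base64 strings, a REST response has to be decoded before the images can be written to disk. The following is a minimal sketch: save_images is a hypothetical helper, and it assumes the values are plain base64 with no data-URI prefix.

```python
import base64
from pathlib import Path


def save_images(images, out_dir="images"):
    """Decode {filename: base64} entries and write them under out_dir.

    Hypothetical helper; assumes values are plain base64 strings.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, encoded in images.items():
        # Keep only the final path component so a filename cannot escape out_dir.
        path = target / Path(filename).name
        path.write_bytes(base64.b64decode(encoded))
        written.append(path)
    return written
```

Called as save_images(result["images"]) on a completed response, it returns the list of paths it wrote.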

Examples

Convert with High Accuracy

from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

options = ConvertOptions(
    mode="accurate",
    output_format="markdown"
)

result = client.convert("complex_document.pdf", options=options)
print(f"Quality score: {result.parse_quality_score}")
print(result.markdown)
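
One way to combine the processing modes with parse_quality_score is to start with a cheaper mode and escalate only when the score is low. The sketch below is an illustration, not a documented SDK feature: convert_with_escalation and the 3.5 threshold are assumptions, and each attempt is a separate conversion, so it may add latency and cost.

```python
MODES = ("fast", "balanced", "accurate")


def convert_with_escalation(convert, path, min_score=3.5, start_mode="fast"):
    """Try each mode from start_mode upward until the score is acceptable.

    `convert` is any callable (path, mode) -> result exposing a
    `parse_quality_score` attribute. The default threshold is arbitrary.
    """
    result = None
    for mode in MODES[MODES.index(start_mode):]:
        result = convert(path, mode)
        score = result.parse_quality_score
        if score is not None and score >= min_score:
            break  # good enough; stop escalating
    # If no mode reached the threshold, this is the last (most accurate) attempt.
    return mode, result


# Usage with the SDK client shown above:
# client = DatalabClient()
# mode, result = convert_with_escalation(
#     lambda p, m: client.convert(p, options=ConvertOptions(mode=m)),
#     "complex_document.pdf",
# )
```

Injecting the convert callable keeps the escalation logic independent of the client, which also makes it easy to test with a stub.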

HTML with Block IDs for Citations

options = ConvertOptions(
    output_format="html",
    add_block_ids=True
)

result = client.convert("document.pdf", options=options)
# HTML elements have data-block-id attributes for citation tracking

Process Specific Pages

options = ConvertOptions(
    page_range="0-4,10,15-20",  # Pages 0-4, 10, and 15-20
    output_format="markdown"
)

result = client.convert("large_document.pdf", options=options)

Extract Track Changes from Word Documents

options = ConvertOptions(
    extras="track_changes",
    output_format="json"
)

result = client.convert("document_with_changes.docx", options=options)

Full Python Example (REST API)

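import mimetypes
import os
import time
import requests

API_URL = "https://www.datalab.to/api/v1/marker"
API_KEY = os.getenv("DATALAB_API_KEY")

def convert_document(file_path, output_format="markdown", mode="balanced"):
    headers = {"X-API-Key": API_KEY}

    # Guess the content type from the file extension instead of assuming PDF
    content_type = mimetypes.guess_type(file_path)[0] or "application/octet-stream"

    # Submit request
    with open(file_path, "rb") as f:
        response = requests.post(
            API_URL,
            files={"file": (os.path.basename(file_path), f, content_type)},
            data={
                "output_format": output_format,
                "mode": mode
            },
            headers=headers
        )

    # Fail early on HTTP errors or a rejected submission
    response.raise_for_status()
    data = response.json()
    if not data.get("success"):
        raise RuntimeError(f"Submission failed: {data.get('error')}")
    check_url = data["request_check_url"]

    # Poll for completion: up to 300 attempts, 2 seconds apart (about 10 minutes)
    for _ in range(300):
        response = requests.get(check_url, headers=headers)
        response.raise_for_status()
        result = response.json()

        if result["status"] == "complete":
            return result
        elif result["status"] == "failed":
            raise RuntimeError(f"Conversion failed: {result.get('error')}")

        time.sleep(2)

    raise TimeoutError("Timed out waiting for conversion")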

# Usage
result = convert_document("document.pdf", mode="balanced")
print(result["markdown"])
Results are deleted from Datalab servers one hour after processing completes. Retrieve your results promptly.
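
Given the one-hour retention window, it is worth writing each completed response to disk as soon as polling returns. The following is a minimal sketch: persist_result and the file names result.json and output.md / output.html are illustrative choices, not part of the API.

```python
import json
from pathlib import Path


def persist_result(result, out_dir):
    """Write the raw response and its main text output to out_dir.

    Hypothetical helper; file names are illustrative.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)

    # Keep the full response so metadata and base64 images are not lost.
    raw_path = target / "result.json"
    raw_path.write_text(json.dumps(result, indent=2), encoding="utf-8")

    # Also write the text output on its own when the format is text-based.
    fmt = result.get("output_format", "markdown")
    extension = {"markdown": "md", "html": "html"}.get(fmt)
    body = result.get(fmt)
    if extension and isinstance(body, str):
        output_path = target / f"output.{extension}"
        output_path.write_text(body, encoding="utf-8")
        return output_path
    return raw_path
```

For json and chunks output, the function returns the path to result.json, since the structured payload is already stored there.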
