Document Conversion

Convert PDFs, Word documents, spreadsheets, and images to machine-readable formats. Marker handles complex layouts, tables, math, and images.

Before you begin, make sure you have:

A Datalab account with an API key (new accounts include $5 in free credits)
Python 3.10+ installed
The Datalab SDK: pip install datalab-python-sdk
Your DATALAB_API_KEY environment variable set

Quick Start

from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

# Basic conversion
result = client.convert("document.pdf")
print(result.markdown)

# With options
options = ConvertOptions(
    output_format="markdown",
    mode="balanced",
    paginate=True
)
result = client.convert("document.pdf", options=options)

The SDK handles polling automatically. With the REST API, you submit a request and then poll the request_check_url until the status is complete or failed. See SDK Conversion for complete SDK documentation.
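For REST users, the polling loop described above might look like the following sketch. It is not the SDK's implementation: `fetch_status` is a hypothetical callable standing in for whatever HTTP GET you issue against the request_check_url, and the interval and timeout defaults are illustrative assumptions. Only the status values (`processing`, `complete`, `failed`) and the `error` field come from this page.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], dict],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
) -> dict:
    """Call fetch_status() until its "status" is "complete" or "failed".

    fetch_status is a hypothetical stand-in for a GET on request_check_url
    that returns the parsed JSON body. The interval and timeout defaults
    are assumptions, not documented values.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        body = fetch_status()
        status = body.get("status")
        if status == "complete":
            return body
        if status == "failed":
            # Surface the API's error message when one is present.
            raise RuntimeError(body.get("error") or "conversion failed")
        if time.monotonic() >= deadline:
            raise TimeoutError("conversion did not finish in time")
        time.sleep(interval_s)
```

Passing the HTTP call in as a function keeps the loop independent of any particular HTTP library and makes it easy to exercise with canned responses.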

File limits: Maximum file size is 200 MB, with up to 7,000 pages per request. See API Limits for the full list.
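To fail fast before uploading, you could check the file size locally. This is a minimal sketch: `check_file_size` is a hypothetical helper, not part of the SDK, and it reads "200 MB" conservatively as 200,000,000 bytes, which is an assumption about how the limit is measured. Page counts are not checked here.

```python
import os

# 200 MB limit from the text above; the decimal interpretation is an assumption.
MAX_FILE_BYTES = 200 * 1000 * 1000


def check_file_size(path: str, limit: int = MAX_FILE_BYTES) -> int:
    """Return the file size in bytes, or raise ValueError if it exceeds limit."""
    size = os.path.getsize(path)
    if size > limit:
        raise ValueError(f"{path} is {size} bytes; the limit is {limit} bytes")
    return size
```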

Parameters

Core Parameters

Parameter	Type	Default	Description
`file`	file	-	Document file (multipart upload)
`file_url`	string	-	URL to document (alternative to file)
`output_format`	string	`markdown`	Output format: `markdown`, `html`, `json`, `chunks`
`mode`	string	`balanced`	Processing mode (see below)

Which output format should I use?

LLM/RAG pipelines → markdown (default, most compatible)
Web display → html (preserves visual structure)
Programmatic access to blocks → json (includes bounding boxes and block types)
Embedding and search → chunks (pre-chunked for vector databases)

Processing Modes

Mode	Description	Best For
`fast`	Lowest latency, good for simple documents	High-throughput pipelines, simple layouts
`balanced`	Balance of speed and accuracy (recommended)	Most use cases
`accurate`	Highest accuracy, best for complex layouts	Complex tables, dense layouts, scanned documents

Which mode should I use?

Most use cases → balanced (recommended default)
Simple, clean PDFs at high throughput → fast
Scanned documents, complex tables, or dense layouts → accurate

Page Control

Parameter	Type	Default	Description
`max_pages`	int	-	Maximum pages to process
`page_range`	string	-	Specific pages (e.g., `"0-5,10"`, 0-indexed)
`paginate`	bool	`false`	Add page delimiters to output
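Because `page_range` is 0-indexed, page numbers as shown in a PDF viewer (which usually start at 1) need to be shifted by one. The sketch below builds a `page_range` string from 1-based page numbers. `to_page_range` is a hypothetical helper, not an SDK function, and it assumes ranges such as `0-5` are inclusive on both ends, as the examples on this page suggest.

```python
from typing import Iterable


def to_page_range(pages_1_based: Iterable[int]) -> str:
    """Build a 0-indexed page_range string from 1-based page numbers."""
    zero_based = sorted({p - 1 for p in pages_1_based})
    if zero_based and zero_based[0] < 0:
        raise ValueError("page numbers must be >= 1")
    parts = []
    i = 0
    while i < len(zero_based):
        start = end = zero_based[i]
        # Extend the run while the next page is consecutive.
        while i + 1 < len(zero_based) and zero_based[i + 1] == end + 1:
            i += 1
            end = zero_based[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)


# Viewer pages 1-6 and 11 become the 0-indexed range "0-5,10".
print(to_page_range([1, 2, 3, 4, 5, 6, 11]))
```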

Image Handling

Parameter	Type	Default	Description
`disable_image_extraction`	bool	`false`	Don’t extract images
`disable_image_captions`	bool	`false`	Don’t generate image captions

Advanced Options

Parameter	Type	Default	Description
`add_block_ids`	bool	`false`	Add `data-block-id` attributes to HTML elements
`skip_cache`	bool	`false`	Skip cached results
`save_checkpoint`	bool	`false`	Save checkpoint for reuse
`extras`	string	-	Comma-separated: `track_changes`, `chart_understanding`, `extract_links`, `table_row_bboxes`, `infographic`, `new_block_types`
`additional_config`	string	-	JSON with extra config (see below)
`webhook_url`	string	-	Override webhook URL for this request

For structured extraction, use the Extract API. For document segmentation, use the Segment API.

The track_changes extra is supported on this endpoint. You can also use the dedicated Track Changes endpoint.
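Since `extras` is a single comma-separated string, a typo in one name is easy to miss. The sketch below validates names against the values listed in the table above before joining them. `build_extras` is a hypothetical helper, not part of the SDK, and the set of known names is copied from this page, so it may lag behind the API.

```python
# Names copied from the Advanced Options table; this list is an assumption
# about what the API currently accepts.
KNOWN_EXTRAS = {
    "track_changes",
    "chart_understanding",
    "extract_links",
    "table_row_bboxes",
    "infographic",
    "new_block_types",
}


def build_extras(*names: str) -> str:
    """Join extras into the comma-separated string the parameter expects."""
    unknown = [n for n in names if n not in KNOWN_EXTRAS]
    if unknown:
        raise ValueError(f"unknown extras: {', '.join(unknown)}")
    # dict.fromkeys drops duplicates while keeping the caller's order.
    return ",".join(dict.fromkeys(names))


print(build_extras("track_changes", "extract_links"))
```

The resulting string can be passed as the `extras` value, in the same way the Track Changes example on this page passes `"track_changes"`.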

Additional Config Options

Pass these keys as a JSON string in additional_config:

Key	Type	Description
`keep_spreadsheet_formatting`	bool	Preserve spreadsheet formatting
`keep_pageheader_in_output`	bool	Include page headers
`keep_pagefooter_in_output`	bool	Include page footers

Example:

import json

options = ConvertOptions(
    additional_config=json.dumps({
        "keep_spreadsheet_formatting": True,
        "keep_pageheader_in_output": False
    })
)

Response Fields

Field	Type	Description
`status`	string	`processing`, `complete`, or `failed`
`success`	bool	Whether conversion succeeded
`output_format`	string	Requested output format
`markdown`	string	Markdown output (if format is markdown)
`html`	string	HTML output (if format is html)
`json`	object	JSON output (if format is json)
`chunks`	object	Chunked output (if format is chunks)
`images`	object	Extracted images as `{filename: base64}`
`metadata`	object	Document metadata
`page_count`	int	Number of pages processed
`parse_quality_score`	float	Quality score (0-5)
`cost_breakdown`	object	Cost in cents
`checkpoint_id`	string	Checkpoint ID (if `save_checkpoint` was true)
`error`	string	Error message if failed
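The `images` field maps filenames to base64-encoded data, so writing the images to disk is a matter of decoding each value. A minimal sketch follows. `save_images` is a hypothetical helper, not an SDK function, and it assumes each value is plain base64 without a data-URI prefix.

```python
import base64
from pathlib import Path


def save_images(images: dict[str, str], out_dir: str) -> list[Path]:
    """Decode a {filename: base64} mapping and write each image to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, b64_data in images.items():
        # Keep only the final path component so a name cannot escape out_dir.
        target = out / Path(filename).name
        target.write_bytes(base64.b64decode(b64_data))
        written.append(target)
    return written
```

With the SDK, this would be called with something like `save_images(result.images, "images")`, assuming the result object exposes the `images` field as an attribute in the same way it exposes `markdown`.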

Examples

Convert with High Accuracy

from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

options = ConvertOptions(
    mode="accurate",
    output_format="markdown"
)

result = client.convert("complex_document.pdf", options=options)
print(f"Quality score: {result.parse_quality_score}")
print(result.markdown)

HTML with Block IDs for Citations

options = ConvertOptions(
    output_format="html",
    add_block_ids=True
)

result = client.convert("document.pdf", options=options)
# HTML elements have data-block-id attributes for citation tracking

Process Specific Pages

options = ConvertOptions(
    page_range="0-4,10,15-20",  # 0-indexed: pages 0-4, 10, and 15-20
    output_format="markdown"
)

result = client.convert("large_document.pdf", options=options)

Extract Track Changes from Word Documents

options = ConvertOptions(
    extras="track_changes",
    output_format="json"
)

result = client.convert("document_with_changes.docx", options=options)

Parse Quality Score

Every conversion response includes a parse_quality_score (0-5) that indicates how well the document was parsed:

Score Range	Quality	Recommended Action
4.0 - 5.0	Excellent	Use the output directly
3.0 - 3.9	Good	Review for minor issues
2.0 - 2.9	Fair	Consider retrying with `accurate` mode
0.0 - 1.9	Poor	Retry with `accurate` mode or check the input file
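The table above can be expressed as a small lookup for use in pipelines. This is a sketch: `recommended_action` and its return labels are hypothetical names, and scores that fall between the listed bands (for example 3.95) are assigned to the lower band, which is an assumption.

```python
def recommended_action(score: float) -> str:
    """Map a parse_quality_score (0-5) to the action suggested in the table."""
    if not 0.0 <= score <= 5.0:
        raise ValueError(f"score out of range: {score}")
    if score >= 4.0:
        return "use_directly"
    if score >= 3.0:
        return "review"
    if score >= 2.0:
        return "retry_accurate"
    return "retry_accurate_or_check_input"
```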

Use quality scores to build automated quality gates:

result = client.convert("document.pdf", options=ConvertOptions(mode="balanced"))

if result.parse_quality_score < 3.0:
    # Retry with higher accuracy
    result = client.convert("document.pdf", options=ConvertOptions(mode="accurate"))

See Conditional Routing for quality-based workflow routing.

Checkpoints

Save a processing checkpoint to reuse parsed results for extraction or segmentation without re-processing:

from datalab_sdk import DatalabClient, ConvertOptions, ExtractOptions
import json

client = DatalabClient()

# Step 1: Convert and save checkpoint
options = ConvertOptions(
    save_checkpoint=True,
    output_format="markdown"
)
result = client.convert("document.pdf", options=options)
checkpoint_id = result.checkpoint_id

# Step 2: Use checkpoint for extraction (no re-processing needed)
extraction_options = ExtractOptions(
    page_schema=json.dumps({"type": "object", "properties": {"title": {"type": "string"}}}),
    checkpoint_id=checkpoint_id
)
extract_result = client.extract("document.pdf", options=extraction_options)

Checkpoints save time and cost when you need to run multiple operations (extraction, segmentation) on the same document.

Results are deleted from Datalab servers one hour after processing completes. Retrieve your results promptly.

Next Steps

Structured Extraction: Extract structured data from documents using JSON schemas
Batch Processing: Process multiple documents concurrently
Document Segmentation: Split multi-document PDFs into segments
Webhooks: Get notified when conversions complete
