- A Datalab account with an API key (new accounts include $5 in free credits)
- Python 3.10+ installed
- The Datalab SDK: pip install datalab-python-sdk
- Your DATALAB_API_KEY environment variable set
Building for production? Use Pipelines to chain processors, version your configuration, and deploy with a single API call.
Quick Start
Submit a conversion request, then poll request_check_url until the status is complete.
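The polling step can be sketched as follows. This is a minimal sketch, not the SDK's implementation: `fetch_status` is a hypothetical callable standing in for an HTTP GET against `request_check_url` that returns the parsed JSON body, and the interval and attempt limit are arbitrary choices.

```python
import time
from typing import Any, Callable


def poll_until_done(
    fetch_status: Callable[[], dict[str, Any]],  # hypothetical: wraps a GET on request_check_url
    interval_s: float = 2.0,
    max_polls: int = 300,
) -> dict[str, Any]:
    """Call fetch_status until the response leaves the "processing" state."""
    for _ in range(max_polls):
        body = fetch_status()
        status = body.get("status")
        if status == "complete":
            return body
        if status == "failed":
            # Surface the error field from the response when it is present.
            raise RuntimeError(body.get("error") or "conversion failed")
        time.sleep(interval_s)
    raise TimeoutError(f"still processing after {max_polls} polls")
```

Passing the fetch step in as a callable keeps the loop independent of any particular HTTP client and makes it easy to exercise with canned responses.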
See SDK Conversion for complete SDK documentation.
File limits: Maximum file size is 200 MB, with up to 7,000 pages per request. See API Limits for the full list.
Parameters
Core Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | - | Document file (multipart upload) |
| file_url | string | - | URL to document (alternative to file) |
| output_format | string | markdown | Output format: markdown, html, json, chunks |
| mode | string | fast | Processing mode (see below) |
Processing Modes
| Mode | Description | Best For |
|---|---|---|
| fast | Lowest latency, good for simple documents | High-throughput pipelines, simple layouts |
| balanced | Balance of speed and accuracy (recommended) | Most use cases |
| accurate | Highest accuracy, best for complex layouts | Complex tables, dense layouts, scanned documents |
Page Control
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_pages | int | - | Maximum pages to process |
| page_range | string | - | Specific pages (e.g., "0-5,10", 0-indexed). For spreadsheets, filters by sheet index. |
| paginate | bool | false | Add page delimiters to output |
Image Handling
| Parameter | Type | Default | Description |
|---|---|---|---|
| disable_image_extraction | bool | false | Don’t extract images |
| disable_image_captions | bool | false | Don’t generate image captions |
Advanced Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| add_block_ids | bool | false | Add data-block-id attributes to HTML elements |
| skip_cache | bool | false | Skip cached results |
| save_checkpoint | bool | false | Save checkpoint for reuse |
| extras | string | - | Comma-separated: track_changes, chart_understanding, extract_links, table_row_bboxes, infographic, new_block_types |
| include_markdown_in_chunks | bool | false | Include markdown content in chunks/JSON output |
| token_efficient_markdown | bool | false | Optimize markdown for LLM token efficiency |
| fence_synthetic_captions | bool | false | Wrap synthetic image captions in HTML comments |
| additional_config | string | - | JSON with extra config (see below) |
| webhook_url | string | - | Override webhook URL for this request |
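Because extras is a single comma-separated string, it can help to build it from a list and reject typos before sending the request. The sketch below assumes nothing beyond the extras names listed in the table; `build_extras` is a hypothetical helper, not an SDK function.

```python
# Extras names copied from the Advanced Options table.
KNOWN_EXTRAS = (
    "track_changes",
    "chart_understanding",
    "extract_links",
    "table_row_bboxes",
    "infographic",
    "new_block_types",
)


def build_extras(names: list[str]) -> str:
    """Join extras into the comma-separated form, rejecting unknown names."""
    unknown = [name for name in names if name not in KNOWN_EXTRAS]
    if unknown:
        raise ValueError(f"unknown extras: {', '.join(unknown)}")
    # dict.fromkeys removes duplicates while keeping the caller's order.
    return ",".join(dict.fromkeys(names))


extras = build_extras(["track_changes", "extract_links"])
```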
For structured extraction, use the Extract API. For document segmentation, use the Segment API.
The track_changes extra is supported on this endpoint. You can also use the dedicated Track Changes endpoint.
Additional Config Options
Pass as a JSON string in additional_config:
| Key | Type | Description |
|---|---|---|
| keep_spreadsheet_formatting | bool | Preserve spreadsheet formatting |
| keep_pageheader_in_output | bool | Include page headers |
| keep_pagefooter_in_output | bool | Include page footers |
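Since additional_config is a string parameter, the keys above need to be serialized before they are sent. A minimal sketch using the standard library follows; the `fields` dict is only an illustration of request form fields, and how it is attached to the request depends on your HTTP client or the SDK.

```python
import json

# Keys come from the table above; the chosen values are illustrative.
additional_config = json.dumps(
    {
        "keep_spreadsheet_formatting": True,
        "keep_pageheader_in_output": True,
        "keep_pagefooter_in_output": False,
    }
)

# additional_config travels as one string field next to the other parameters.
fields = {
    "output_format": "markdown",
    "additional_config": additional_config,
}
```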
Response Fields
| Field | Type | Description |
|---|---|---|
| status | string | processing, complete, or failed |
| success | bool | Whether conversion succeeded |
| output_format | string | Requested output format |
| markdown | string | Markdown output (if format is markdown) |
| html | string | HTML output (if format is html) |
| json | object | JSON output (if format is json) |
| chunks | object | Chunked output (if format is chunks) |
| images | object | Extracted images as {filename: base64} |
| metadata | object | Document metadata |
| page_count | int | Number of pages processed |
| parse_quality_score | float | Quality score (0-5) |
| cost_breakdown | object | Cost in cents |
| checkpoint_id | string | Checkpoint ID (if save_checkpoint was true) |
| error | string | Error message if failed |
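A completed response can be unpacked by reading the field named after output_format and decoding the base64 images. The sketch below works on a plain dict shaped like the table above; `unpack_response` is a hypothetical helper and the sample response is made-up data, not real API output.

```python
import base64
from typing import Any


def unpack_response(response: dict[str, Any]) -> tuple[Any, dict[str, bytes]]:
    """Return (content, images) from a completed conversion response."""
    if response.get("status") != "complete" or not response.get("success"):
        raise RuntimeError(response.get("error") or "conversion did not succeed")
    # The content lives in the field named after the requested format.
    content = response[response["output_format"]]
    # images maps filename -> base64 string; decode each entry to raw bytes.
    images = {
        name: base64.b64decode(data)
        for name, data in (response.get("images") or {}).items()
    }
    return content, images


# Made-up sample response for illustration only.
sample = {
    "status": "complete",
    "success": True,
    "output_format": "markdown",
    "markdown": "# Title",
    "images": {"figure_1.png": "YWJj"},
}
content, images = unpack_response(sample)
```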
Examples
Convert with High Accuracy
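A minimal sketch of the form fields for a high-accuracy request. The helper name `convert_fields` is hypothetical and not part of the SDK; it only validates the mode and output_format values listed in the parameter tables and leaves the upload itself (file or file_url) to whichever HTTP client or SDK call you use.

```python
MODES = ("fast", "balanced", "accurate")
OUTPUT_FORMATS = ("markdown", "html", "json", "chunks")


def convert_fields(mode: str = "fast", output_format: str = "markdown") -> dict[str, str]:
    """Build the mode/output_format form fields, rejecting unknown values."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"output_format must be one of {OUTPUT_FORMATS}, got {output_format!r}")
    return {"mode": mode, "output_format": output_format}


# Highest accuracy, Markdown output.
fields = convert_fields(mode="accurate")
```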
HTML with Block IDs for Citations
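With add_block_ids enabled and HTML output, each element carries a data-block-id attribute that a citation can point back to. The sketch below collects block text by ID using only the standard library. The sample HTML and the ID values are invented for illustration, and the collector assumes blocks are not nested inside one another.

```python
from html.parser import HTMLParser
from typing import Optional


class BlockIdCollector(HTMLParser):
    """Map each data-block-id to the text that follows it, up to the next block."""

    def __init__(self) -> None:
        super().__init__()
        self.blocks: dict[str, str] = {}
        self._current: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        block_id = dict(attrs).get("data-block-id")
        if block_id is not None:
            # A new block starts; later text is attributed to it.
            self._current = block_id
            self.blocks[block_id] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.blocks[self._current] += data


def collect_blocks(html: str) -> dict[str, str]:
    parser = BlockIdCollector()
    parser.feed(html)
    parser.close()
    return {block_id: text.strip() for block_id, text in parser.blocks.items()}


# Invented sample; real IDs and markup may differ.
sample_html = (
    '<h1 data-block-id="b0">Title</h1>'
    '<p data-block-id="b1">First <b>bold</b> paragraph.</p>'
)
blocks = collect_blocks(sample_html)
```

A citation can then store the block ID alongside the quoted text and resolve it back to the source element later.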
Process Specific Pages
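The page_range parameter takes a string such as "0-5,10" with 0-indexed pages. The sketch below compresses a list of page numbers into that form. `to_page_range` is a hypothetical helper, and it assumes ranges are inclusive on both ends, which the parameter description does not state explicitly.

```python
def to_page_range(pages: list[int]) -> str:
    """Compress 0-indexed page numbers into a string such as "0-5,10".

    Assumption: "a-b" covers both a and b.
    """
    if not pages:
        raise ValueError("at least one page is required")
    if min(pages) < 0:
        raise ValueError("pages are 0-indexed and cannot be negative")
    ordered = sorted(set(pages))
    parts: list[str] = []
    start = prev = ordered[0]
    for page in ordered[1:]:
        if page == prev + 1:
            prev = page  # still inside a consecutive run
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = page
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


fields = {"page_range": to_page_range([0, 1, 2, 3, 4, 5, 10])}
```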
Process Specific Sheets from a Spreadsheet
For spreadsheet files, page_range filters by sheet index (0-based):
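A sketch of turning sheet names into a page_range value. It assumes that sheet indices follow the workbook's tab order, which is not confirmed here; `sheet_page_range` and the sheet names are hypothetical.

```python
def sheet_page_range(sheet_names: list[str], wanted: list[str]) -> str:
    """Translate sheet names into a 0-based, comma-separated page_range.

    Assumption: sheet index N is the Nth tab in the workbook, starting at 0.
    list.index raises ValueError if a wanted sheet does not exist.
    """
    indices = sorted({sheet_names.index(name) for name in wanted})
    return ",".join(str(index) for index in indices)


# Hypothetical workbook with four tabs.
tabs = ["Summary", "Q1", "Q2", "Notes"]
fields = {"page_range": sheet_page_range(tabs, ["Q1", "Q2"])}
```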
Extract Track Changes from Word Documents
Parse Quality Score
Every conversion response includes a parse_quality_score (0-5) that indicates how well the document was parsed:
| Score Range | Quality | Recommended Action |
|---|---|---|
| 4.0 - 5.0 | Excellent | Use the output directly |
| 3.0 - 3.9 | Good | Review for minor issues |
| 2.0 - 2.9 | Fair | Consider retrying with accurate mode |
| 0.0 - 1.9 | Poor | Retry with accurate mode or check the input file |
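The table can be applied in code to decide what to do with a result. In this sketch the bands are treated as half-open intervals so that a score such as 3.95 falls under "Good"; that boundary handling and the returned labels are assumptions of the sketch, not documented behavior.

```python
def recommended_action(score: float) -> str:
    """Map a parse_quality_score (0-5) to the action suggested in the table."""
    if not 0.0 <= score <= 5.0:
        raise ValueError(f"score must be between 0 and 5, got {score}")
    if score >= 4.0:
        return "use_directly"            # Excellent
    if score >= 3.0:
        return "review_minor_issues"     # Good
    if score >= 2.0:
        return "retry_accurate"          # Fair
    return "retry_accurate_or_check_input"  # Poor


action = recommended_action(2.4)
```

A caller could, for example, resubmit with mode set to accurate whenever the returned label starts with "retry".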
Checkpoints
Set save_checkpoint to true to save a processing checkpoint, then reuse the parsed results for extraction or segmentation without re-processing. The response returns the checkpoint_id.
Next Steps
- Structured Extraction: Extract structured data from documents using JSON schemas
- Batch Processing: Process multiple documents concurrently
- Document Segmentation: Split multi-document PDFs into segments
- Webhooks: Get notified when conversions complete