# Document Conversion

> Convert documents to Markdown, HTML, JSON, or chunks using the Convert API.

Convert PDFs, Word documents, spreadsheets, and images to machine-readable formats. Marker handles complex layouts, tables, math, and images.

**Before you begin**, make sure you have:

1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits)
2. Python 3.10+ installed
3. The Datalab SDK: `pip install datalab-python-sdk`
4. Your `DATALAB_API_KEY` environment variable set

**Building for production?** Use [Pipelines](/docs/recipes/pipelines/pipeline-overview) to chain processors, version your configuration, and deploy with a single API call.

## Quick Start

```python Python SDK
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

# Basic conversion
result = client.convert("document.pdf")
print(result.markdown)

# With options
options = ConvertOptions(
    output_format="markdown",
    mode="balanced",
    paginate=True
)
result = client.convert("document.pdf", options=options)
```

```bash cURL
# Submit request
curl -X POST https://www.datalab.to/api/v1/convert \
  -H "X-API-Key: $DATALAB_API_KEY" \
  -F "file=@document.pdf" \
  -F "output_format=markdown" \
  -F "mode=balanced"

# Poll for results (use request_check_url from response)
curl https://www.datalab.to/api/v1/convert/REQUEST_ID \
  -H "X-API-Key: $DATALAB_API_KEY"
```

```python Python (requests)
import os
import time

import requests

API_URL = "https://www.datalab.to/api/v1/convert"
headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")}

# Submit request
with open("document.pdf", "rb") as f:
    response = requests.post(
        API_URL,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown", "mode": "balanced"},
        headers=headers
    )
response.raise_for_status()
check_url = response.json()["request_check_url"]

# Poll for completion (up to 300 attempts, 2 seconds apart)
for _ in range(300):
    result = requests.get(check_url, headers=headers).json()
    if result["status"] == "complete":
        print(result["markdown"])
        break
    if result["status"] == "failed":
        # Stop polling on failure instead of waiting out the loop
        raise RuntimeError(result.get("error"))
    time.sleep(2)
```

The SDK handles polling automatically. For the REST API, you submit a request and poll the `request_check_url` until the status is `complete`. See [SDK Conversion](/docs/welcome/sdk/conversion) for complete SDK documentation.

**File limits:** Maximum file size is 200 MB, with up to 7,000 pages per request. See [API Limits](/docs/common/limits) for the full list.

## Parameters

### Core Parameters

| Parameter       | Type   | Default    | Description                                         |
| --------------- | ------ | ---------- | --------------------------------------------------- |
| `file`          | file   | -          | Document file (multipart upload)                    |
| `file_url`      | string | -          | URL to document (alternative to file)               |
| `output_format` | string | `markdown` | Output format: `markdown`, `html`, `json`, `chunks` |
| `mode`          | string | `fast`     | Processing mode (see below)                         |

**Which output format should I use?**

* **LLM/RAG pipelines** → `markdown` (default, most compatible)
* **Web display** → `html` (preserves visual structure)
* **Programmatic access to blocks** → `json` (includes bounding boxes and block types)
* **Embedding and search** → `chunks` (pre-chunked for vector databases)

### Processing Modes

| Mode       | Description                                     | Best For                                         |
| ---------- | ----------------------------------------------- | ------------------------------------------------ |
| `fast`     | Lowest latency, good for simple documents       | High-throughput pipelines, simple layouts        |
| `balanced` | Balance of speed and accuracy **(recommended)** | Most use cases                                   |
| `accurate` | Highest accuracy, best for complex layouts      | Complex tables, dense layouts, scanned documents |

**Which mode should I use?**

* **Most use cases** → `balanced` (recommended; set it explicitly, because the API default is `fast`)
* **Simple, clean PDFs** at high throughput → `fast`
* **Scanned documents, complex tables, or dense layouts** → `accurate`

### Page Control

| Parameter    | Type   | Default | Description                                                                             |
| ------------ | ------ | ------- | --------------------------------------------------------------------------------------- |
| `max_pages`  | int    | -       | Maximum pages to process                                                                |
| `page_range` | string | -       | Specific pages (e.g., `"0-5,10"`, 0-indexed). For spreadsheets, filters by sheet index. |
| `paginate`   | bool   | `false` | Add page delimiters to output                                                           |

### Image Handling

| Parameter                  | Type | Default | Description                   |
| -------------------------- | ---- | ------- | ----------------------------- |
| `disable_image_extraction` | bool | `false` | Don't extract images          |
| `disable_image_captions`   | bool | `false` | Don't generate image captions |

### Advanced Options

| Parameter                    | Type   | Default | Description |
| ---------------------------- | ------ | ------- | ----------- |
| `add_block_ids`              | bool   | `false` | Add `data-block-id` attributes to HTML elements |
| `skip_cache`                 | bool   | `false` | Skip cached results |
| `save_checkpoint`            | bool   | `false` | Save checkpoint for reuse |
| `word_bboxes`                | bool   | `false` | Predict per-word bounding boxes with confidence scores. Each word is inlined into HTML output as a `<span>` element (markdown output strips these). Billed at \$0.30 per 1K pages. |
| `extras`                     | string | -       | Comma-separated: `track_changes`, `chart_understanding`, `extract_links`, `table_cell_bboxes`, `list_item_bboxes`, `infographic`, `new_block_types`. (`table_row_bboxes` is deprecated; use `table_cell_bboxes` instead.) |
| `include_markdown_in_chunks` | bool   | `false` | Include markdown content in chunks/JSON output |
| `token_efficient_markdown`   | bool   | `false` | Optimize markdown for LLM token efficiency |
| `fence_synthetic_captions`   | bool   | `false` | Wrap synthetic image captions in HTML comments |
| `additional_config`          | string | -       | JSON with extra config (see below) |
| `webhook_url`                | string | -       | Override webhook URL for this request |
| `processing_location`        | string | -       | Data residency region override: `"eu"` or `"us"`. When set, use `file_url` or a pre-uploaded `datalab://` reference; multipart uploads are not supported. EU processing carries a regional pricing premium. |

For structured extraction, use the [Extract API](/docs/recipes/structured-extraction/api-overview). For document segmentation, use the [Segment API](/docs/recipes/document-segmentation/auto-segmentation).

The `track_changes` extra is supported on this endpoint. You can also use the dedicated [Track Changes endpoint](/docs/recipes/extract-redlines-and-comments/track-changes-from-word-documents).

### Bounding Box Add-ons

Three add-ons annotate HTML output with spatial coordinates and confidence scores. Each is billed at **\$0.30 per 1K pages** (additive on top of the base conversion rate) and requires the `html` output format to expose the attributes.

| Add-on            | How to enable                | What it annotates |
| ----------------- | ---------------------------- | ----------------- |
| Word bboxes       | `word_bboxes=True`           | Every word in the document gets a `data-bbox` and `data-confidence` span in HTML |
| Table cell bboxes | `extras="table_cell_bboxes"` | `<table>`, `<tr>`, and `<td>`/`<th>` elements get `data-bbox`/`data-confidence`; also enables `word_bboxes` |
| List item bboxes  | `extras="list_item_bboxes"`  | Each `<li>` element gets `data-bbox`/`data-confidence`; also enables `word_bboxes` |

```python
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

# Request table cell and list item bboxes (both also enable word bboxes)
options = ConvertOptions(
    output_format="html",
    extras="table_cell_bboxes,list_item_bboxes",
)
result = client.convert("document.pdf", options=options)
# HTML contains data-bbox and data-confidence on table cells, list items, and words
```

### Additional Config Options

Pass these keys as a JSON string in the `additional_config` request field. The SDK example below passes them as a Python dict.

| Key                           | Type | Description                     |
| ----------------------------- | ---- | ------------------------------- |
| `keep_spreadsheet_formatting` | bool | Preserve spreadsheet formatting |
| `keep_pageheader_in_output`   | bool | Include page headers            |
| `keep_pagefooter_in_output`   | bool | Include page footers            |

Example:

```python
from datalab_sdk import ConvertOptions

options = ConvertOptions(
    additional_config={
        "keep_spreadsheet_formatting": True,
        "keep_pageheader_in_output": False
    }
)
```

## Response Fields

| Field                 | Type   | Description                                   |
| --------------------- | ------ | --------------------------------------------- |
| `status`              | string | `processing`, `complete`, or `failed`         |
| `success`             | bool   | Whether conversion succeeded                  |
| `output_format`       | string | Requested output format                       |
| `markdown`            | string | Markdown output (if format is markdown)       |
| `html`                | string | HTML output (if format is html)               |
| `json`                | object | JSON output (if format is json)               |
| `chunks`              | object | Chunked output (if format is chunks)          |
| `images`              | object | Extracted images as `{filename: base64}`      |
| `metadata`            | object | Document metadata                             |
| `page_count`          | int    | Number of pages processed                     |
| `parse_quality_score` | float  | Quality score (0-5)                           |
| `cost_breakdown`      | object | Cost in cents                                 |
| `checkpoint_id`       | string | Checkpoint ID (if `save_checkpoint` was true) |
| `error`               | string | Error message if failed                       |

## Examples

The snippets after the first one reuse its imports and `client`.

### Convert with High Accuracy

```python
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

options = ConvertOptions(
    mode="accurate",
    output_format="markdown"
)
result = client.convert("complex_document.pdf", options=options)

print(f"Quality score: {result.parse_quality_score}")
print(result.markdown)
```

### HTML with Block IDs for Citations

```python
options = ConvertOptions(
    output_format="html",
    add_block_ids=True
)
result = client.convert("document.pdf", options=options)
# HTML elements have data-block-id attributes for citation tracking
```

### Process Specific Pages

```python
options = ConvertOptions(
    page_range="0-4,10,15-20",  # Pages 0-4, 10, and 15-20
    output_format="markdown"
)
result = client.convert("large_document.pdf", options=options)
```

### Process Specific Sheets from a Spreadsheet

For spreadsheet files, `page_range` filters by sheet index (0-based):

```python
options = ConvertOptions(
    page_range="0,2",  # First and third sheets only
    output_format="markdown"
)
result = client.convert("workbook.xlsx", options=options)
```

### Extract Track Changes from Word Documents

```python
options = ConvertOptions(
    extras="track_changes",
    output_format="json"
)
result = client.convert("document_with_changes.docx", options=options)
```

## Parse Quality Score

Every conversion response includes a `parse_quality_score` (0-5) that indicates how well the document was parsed:

| Score Range | Quality   | Recommended Action                                 |
| ----------- | --------- | -------------------------------------------------- |
| 4.0 - 5.0   | Excellent | Use the output directly                            |
| 3.0 - 3.9   | Good      | Review for minor issues                            |
| 2.0 - 2.9   | Fair      | Consider retrying with `accurate` mode             |
| 0.0 - 1.9   | Poor      | Retry with `accurate` mode or check the input file |

Use quality scores to build automated quality gates:

```python
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

result = client.convert("document.pdf", options=ConvertOptions(mode="balanced"))

if result.parse_quality_score < 3.0:
    # Retry with higher accuracy
    result = client.convert("document.pdf", options=ConvertOptions(mode="accurate"))
```

The same check can gate pipeline execution or route documents to different processing configurations.

## Checkpoints

Save a processing checkpoint to reuse parsed results for extraction or segmentation without re-processing:

```python
from datalab_sdk import DatalabClient, ConvertOptions, ExtractOptions
import json

client = DatalabClient()

# Step 1: Convert and save checkpoint
options = ConvertOptions(
    save_checkpoint=True,
    output_format="markdown"
)
result = client.convert("document.pdf", options=options)
checkpoint_id = result.checkpoint_id

# Step 2: Use checkpoint for extraction (no re-processing needed)
extraction_options = ExtractOptions(
    page_schema=json.dumps({"type": "object", "properties": {"title": {"type": "string"}}}),
    checkpoint_id=checkpoint_id
)
extract_result = client.extract("document.pdf", options=extraction_options)
```

Checkpoints save time and cost when you need to run multiple operations (extraction, segmentation) on the same document.

Results are deleted from Datalab servers one hour after processing completes. Retrieve your results promptly.

## Next Steps

* Extract structured data from documents using JSON schemas
* Process multiple documents concurrently
* Split multi-document PDFs into segments
* Get notified when conversions complete
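
The REST flow from Quick Start (submit, then poll `request_check_url` until `status` is `complete`) can be wrapped in a small helper that also stops on `failed` and gives up after a fixed number of attempts. The following is a minimal sketch, not part of the Datalab SDK: `poll_until_complete`, `fetch_status`, and the injected `sleep` parameter are hypothetical names introduced for illustration, and the simulated responses stand in for real API calls. It assumes only the `status` and `error` response fields documented above.

```python
import time
from typing import Any, Callable, Dict


def poll_until_complete(
    fetch_status: Callable[[], Dict[str, Any]],
    interval_seconds: float = 2.0,
    max_attempts: int = 300,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status (hypothetical callable) until a terminal status appears.

    Returns the response dict when status is "complete", raises RuntimeError
    when status is "failed", and raises TimeoutError after max_attempts polls.
    """
    for _ in range(max_attempts):
        result = fetch_status()
        status = result.get("status")
        if status == "complete":
            return result
        if status == "failed":
            # Surface the documented `error` field when it is present
            raise RuntimeError(result.get("error") or "conversion failed")
        sleep(interval_seconds)
    raise TimeoutError(f"no terminal status after {max_attempts} attempts")


# Simulated responses in place of real HTTP calls (illustrative data only)
simulated = iter([
    {"status": "processing"},
    {"status": "complete", "markdown": "# Title"},
])
final = poll_until_complete(lambda: next(simulated), sleep=lambda _: None)
print(final["markdown"])
```

With the `requests` example from Quick Start, `fetch_status` could be `lambda: requests.get(check_url, headers=headers).json()`. Passing the fetch and sleep callables in as arguments keeps the loop testable without network access.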