
# Document Conversion

> Convert documents to Markdown, HTML, JSON, or chunks using the Convert API.

Convert PDFs, Word documents, spreadsheets, and images to machine-readable formats. Marker handles complex layouts, tables, math, and images.

**Before you begin**, make sure you have:

1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits)
2. Python 3.10+ installed
3. The Datalab SDK: `pip install datalab-python-sdk`
4. Your `DATALAB_API_KEY` environment variable set

<Info>
  **Building for production?** Use [Pipelines](/docs/recipes/pipelines/pipeline-overview) to chain processors, version your configuration, and deploy with a single API call.
</Info>

## Quick Start

<CodeGroup>
  ```python Python SDK theme={null}
  from datalab_sdk import DatalabClient, ConvertOptions

  client = DatalabClient()

  # Basic conversion
  result = client.convert("document.pdf")
  print(result.markdown)

  # With options
  options = ConvertOptions(
      output_format="markdown",
      mode="balanced",
      paginate=True
  )
  result = client.convert("document.pdf", options=options)
  ```

  ```bash cURL theme={null}
  # Submit request
  curl -X POST https://www.datalab.to/api/v1/convert \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -F "file=@document.pdf" \
    -F "output_format=markdown" \
    -F "mode=balanced"

  # Poll for results (use request_check_url from response)
  curl https://www.datalab.to/api/v1/convert/REQUEST_ID \
    -H "X-API-Key: $DATALAB_API_KEY"
  ```

  ```python Python (requests) theme={null}
  import os, time, requests

  API_URL = "https://www.datalab.to/api/v1/convert"
  headers = {"X-API-Key": os.environ["DATALAB_API_KEY"]}  # raises KeyError if the key is unset

  # Submit request
  with open("document.pdf", "rb") as f:
      response = requests.post(
          API_URL,
          files={"file": ("document.pdf", f, "application/pdf")},
          data={"output_format": "markdown", "mode": "balanced"},
          headers=headers
      )

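  response.raise_for_status()
  check_url = response.json()["request_check_url"]

  # Poll for completion (up to 300 attempts, 2 seconds apart)
  for _ in range(300):
      result = requests.get(check_url, headers=headers).json()
      if result["status"] == "complete":
          print(result["markdown"])
          break
      if result["status"] == "failed":
          # Stop polling and surface the API's error message
          raise RuntimeError(result.get("error"))
      time.sleep(2)
  else:
      raise TimeoutError("Conversion did not finish within the polling window")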
  ```
</CodeGroup>

The SDK handles polling automatically. For the REST API, you submit a request and poll the `request_check_url` until the status is `complete`.
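
The submit-then-poll flow can be factored into a small helper. This is a minimal sketch, not part of the Datalab SDK: `poll_until_done` and its `fetch` parameter are hypothetical names, and the loop relies only on the `status` values (`processing`, `complete`, `failed`) and the `error` field listed under Response Fields. Passing the HTTP call in as a function keeps the loop easy to test without network access.

```python
import time
from typing import Callable


def poll_until_done(
    fetch: Callable[[], dict],
    max_polls: int = 300,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until the response reports a terminal status.

    Hypothetical helper: fetch is any zero-argument callable that returns
    the decoded JSON body of a GET on request_check_url.
    """
    for _ in range(max_polls):
        result = fetch()
        status = result.get("status")
        if status == "complete":
            return result
        if status == "failed":
            # Surface the API's error message instead of polling forever.
            raise RuntimeError(f"Conversion failed: {result.get('error')}")
        sleep(interval)
    raise TimeoutError(f"No terminal status after {max_polls} polls")
```

With `requests`, `fetch` could be `lambda: requests.get(check_url, headers=headers).json()`.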

See [SDK Conversion](/docs/welcome/sdk/conversion) for complete SDK documentation.

<Info>
  **File limits:** Maximum file size is 200 MB, with up to 7,000 pages per request. See [API Limits](/docs/common/limits) for the full list.
</Info>
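
A local pre-flight check can catch oversized files before an upload starts. The sketch below is an illustration only: `check_file_size` is a hypothetical helper, and it assumes "200 MB" means 200 × 1024 × 1024 bytes, which this page does not specify. The API's own validation remains authoritative.

```python
import os

# Assumption: "200 MB" is read as 200 * 1024 * 1024 bytes. The API may count
# megabytes differently, so treat this as an early warning only.
MAX_FILE_BYTES = 200 * 1024 * 1024


def check_file_size(path: str, limit: int = MAX_FILE_BYTES) -> int:
    """Return the file size in bytes, or raise ValueError if it exceeds limit."""
    size = os.path.getsize(path)
    if size > limit:
        raise ValueError(f"{path} is {size} bytes; limit is {limit} bytes")
    return size
```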

## Parameters

### Core Parameters

| Parameter       | Type   | Default    | Description                                         |
| --------------- | ------ | ---------- | --------------------------------------------------- |
| `file`          | file   | -          | Document file (multipart upload)                    |
| `file_url`      | string | -          | URL to document (alternative to file)               |
| `output_format` | string | `markdown` | Output format: `markdown`, `html`, `json`, `chunks` |
| `mode`          | string | `fast`     | Processing mode (see below)                         |

<Tip>
  **Which output format should I use?**

  * **LLM/RAG pipelines** → `markdown` (default, most compatible)
  * **Web display** → `html` (preserves visual structure)
  * **Programmatic access to blocks** → `json` (includes bounding boxes and block types)
  * **Embedding and search** → `chunks` (pre-chunked for vector databases)
</Tip>

### Processing Modes

| Mode       | Description                                     | Best For                                         |
| ---------- | ----------------------------------------------- | ------------------------------------------------ |
| `fast`     | Lowest latency, good for simple documents       | High-throughput pipelines, simple layouts        |
| `balanced` | Balance of speed and accuracy **(recommended)** | Most use cases                                   |
| `accurate` | Highest accuracy, best for complex layouts      | Complex tables, dense layouts, scanned documents |

<Tip>
  **Which mode should I use?**

  * **Most use cases** → `balanced` (recommended; set it explicitly, because the parameter table lists `fast` as the default)
  * **Simple, clean PDFs** at high throughput → `fast`
  * **Scanned documents, complex tables, or dense layouts** → `accurate`
</Tip>

### Page Control

| Parameter    | Type   | Default | Description                                                                             |
| ------------ | ------ | ------- | --------------------------------------------------------------------------------------- |
| `max_pages`  | int    | -       | Maximum pages to process                                                                |
| `page_range` | string | -       | Specific pages (e.g., `"0-5,10"`, 0-indexed). For spreadsheets, filters by sheet index. |
| `paginate`   | bool   | `false` | Add page delimiters to output                                                           |
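
To check a `page_range` string before sending it, a small parser can expand it into explicit indices. This is a sketch under stated assumptions: `parse_page_range` is a hypothetical helper, and it treats ranges as inclusive on both ends, matching the "Pages 0-4, 10, and 15-20" reading used in the examples on this page. The server-side parser is authoritative and may accept or reject inputs differently.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a page_range string such as "0-5,10" into sorted 0-based indices.

    Assumption: ranges are inclusive on both ends.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"Empty segment in page_range: {spec!r}")
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Descending range {part!r} in {spec!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)
```

For spreadsheets the same indices would refer to sheets rather than pages.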

### Image Handling

| Parameter                  | Type | Default | Description                   |
| -------------------------- | ---- | ------- | ----------------------------- |
| `disable_image_extraction` | bool | `false` | Don't extract images          |
| `disable_image_captions`   | bool | `false` | Don't generate image captions |

### Advanced Options

| Parameter                    | Type   | Default | Description                                                                                                                                                                                                  |
| ---------------------------- | ------ | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `add_block_ids`              | bool   | `false` | Add `data-block-id` attributes to HTML elements                                                                                                                                                              |
| `skip_cache`                 | bool   | `false` | Skip cached results                                                                                                                                                                                          |
| `save_checkpoint`            | bool   | `false` | Save checkpoint for reuse                                                                                                                                                                                    |
| `extras`                     | string | -       | Comma-separated: `track_changes`, `chart_understanding`, `extract_links`, `table_row_bboxes`, `infographic`, `new_block_types`                                                                               |
| `include_markdown_in_chunks` | bool   | `false` | Include markdown content in chunks/JSON output                                                                                                                                                               |
| `token_efficient_markdown`   | bool   | `false` | Optimize markdown for LLM token efficiency                                                                                                                                                                   |
| `fence_synthetic_captions`   | bool   | `false` | Wrap synthetic image captions in HTML comments                                                                                                                                                               |
| `additional_config`          | string | -       | JSON with extra config (see below)                                                                                                                                                                           |
| `webhook_url`                | string | -       | Override webhook URL for this request                                                                                                                                                                        |
| `processing_location`        | string | -       | Data residency region override: `"eu"` or `"us"`. When set, use `file_url` or a pre-uploaded `datalab://` reference — multipart uploads are not supported. EU processing carries a regional pricing premium. |

<Note>
  For structured extraction, use the [Extract API](/docs/recipes/structured-extraction/api-overview). For document segmentation, use the [Segment API](/docs/recipes/document-segmentation/auto-segmentation).
</Note>

<Note>
  The `track_changes` extra is supported on this endpoint. You can also use the dedicated [Track Changes endpoint](/docs/recipes/extract-redlines-and-comments/track-changes-from-word-documents).
</Note>

### Additional Config Options

With the REST API, pass these keys as a JSON string in `additional_config`. The SDK example below passes a Python dict:

| Key                           | Type | Description                     |
| ----------------------------- | ---- | ------------------------------- |
| `keep_spreadsheet_formatting` | bool | Preserve spreadsheet formatting |
| `keep_pageheader_in_output`   | bool | Include page headers            |
| `keep_pagefooter_in_output`   | bool | Include page footers            |

Example:

```python theme={null}
options = ConvertOptions(
    additional_config={
        "keep_spreadsheet_formatting": True,
        "keep_pageheader_in_output": False
    }
)
```
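
For the REST API, the parameter tables describe `extras` as a comma-separated string and `additional_config` as a JSON string. The sketch below builds the non-file form fields in that shape, suitable for the `data=` argument of `requests.post`. `build_form_fields` is a hypothetical helper, not an SDK function, and its `balanced` default reflects the recommendation on this page rather than the API default.

```python
import json
from typing import Optional


def build_form_fields(
    output_format: str = "markdown",
    mode: str = "balanced",
    extras: Optional[list[str]] = None,
    additional_config: Optional[dict] = None,
) -> dict[str, str]:
    """Build the non-file form fields for a POST to the convert endpoint."""
    fields = {"output_format": output_format, "mode": mode}
    if extras:
        # The parameter table documents extras as a comma-separated string.
        fields["extras"] = ",".join(extras)
    if additional_config:
        # additional_config is documented as a JSON string.
        fields["additional_config"] = json.dumps(additional_config)
    return fields
```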

## Response Fields

| Field                 | Type   | Description                                   |
| --------------------- | ------ | --------------------------------------------- |
| `status`              | string | `processing`, `complete`, or `failed`         |
| `success`             | bool   | Whether conversion succeeded                  |
| `output_format`       | string | Requested output format                       |
| `markdown`            | string | Markdown output (if format is markdown)       |
| `html`                | string | HTML output (if format is html)               |
| `json`                | object | JSON output (if format is json)               |
| `chunks`              | object | Chunked output (if format is chunks)          |
| `images`              | object | Extracted images as `{filename: base64}`      |
| `metadata`            | object | Document metadata                             |
| `page_count`          | int    | Number of pages processed                     |
| `parse_quality_score` | float  | Quality score (0-5)                           |
| `cost_breakdown`      | object | Cost in cents                                 |
| `checkpoint_id`       | string | Checkpoint ID (if `save_checkpoint` was true) |
| `error`               | string | Error message if failed                       |
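
Since `images` is returned as `{filename: base64}`, the entries need decoding before they can be written to disk. A minimal sketch, assuming the values are plain base64 strings without a `data:` URI prefix (this page does not say either way); `save_images` is a hypothetical helper.

```python
import base64
from pathlib import Path


def save_images(images: dict[str, str], out_dir: str) -> list[Path]:
    """Decode a {filename: base64} mapping and write each image to out_dir."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, encoded in images.items():
        # Keep only the final path component so a key cannot escape out_dir.
        path = target / Path(filename).name
        path.write_bytes(base64.b64decode(encoded))
        written.append(path)
    return written
```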

## Examples

### Convert with High Accuracy

```python theme={null}
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

options = ConvertOptions(
    mode="accurate",
    output_format="markdown"
)

result = client.convert("complex_document.pdf", options=options)
print(f"Quality score: {result.parse_quality_score}")
print(result.markdown)
```

### HTML with Block IDs for Citations

```python theme={null}
options = ConvertOptions(
    output_format="html",
    add_block_ids=True
)

result = client.convert("document.pdf", options=options)
# HTML elements have data-block-id attributes for citation tracking
```
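
Once the HTML carries `data-block-id` attributes, the IDs can be listed with the standard library's `html.parser`. The sketch below is illustrative: `list_block_ids` is a hypothetical helper, and the sample ID values in the usage note are invented, because the real ID format is not documented on this page.

```python
from html.parser import HTMLParser


class BlockIdCollector(HTMLParser):
    """Record (block_id, tag) for every element with a data-block-id attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.blocks: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-block-id" and value is not None:
                self.blocks.append((value, tag))


def list_block_ids(html: str) -> list[tuple[str, str]]:
    """Return (block_id, tag) pairs in document order."""
    parser = BlockIdCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

For example, `list_block_ids('<p data-block-id="b2">Body</p>')` returns `[("b2", "p")]`.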

### Process Specific Pages

```python theme={null}
options = ConvertOptions(
    page_range="0-4,10,15-20",  # Pages 0-4, 10, and 15-20
    output_format="markdown"
)

result = client.convert("large_document.pdf", options=options)
```

### Process Specific Sheets from a Spreadsheet

For spreadsheet files, `page_range` filters by sheet index (0-based):

```python theme={null}
options = ConvertOptions(
    page_range="0,2",  # First and third sheets only
    output_format="markdown"
)

result = client.convert("workbook.xlsx", options=options)
```

### Extract Track Changes from Word Documents

```python theme={null}
options = ConvertOptions(
    extras="track_changes",
    output_format="json"
)

result = client.convert("document_with_changes.docx", options=options)
```

## Parse Quality Score

Every conversion response includes a `parse_quality_score` (0-5) that indicates how well the document was parsed:

| Score Range | Quality   | Recommended Action                                 |
| ----------- | --------- | -------------------------------------------------- |
| 4.0 - 5.0   | Excellent | Use the output directly                            |
| 3.0 - 3.9   | Good      | Review for minor issues                            |
| 2.0 - 2.9   | Fair      | Consider retrying with `accurate` mode             |
| 0.0 - 1.9   | Poor      | Retry with `accurate` mode or check the input file |

Use quality scores to build automated quality gates:

```python theme={null}
result = client.convert("document.pdf", options=ConvertOptions(mode="balanced"))

if result.parse_quality_score < 3.0:
    # Retry with higher accuracy
    result = client.convert("document.pdf", options=ConvertOptions(mode="accurate"))
```

The same check can route documents to different processing configurations or gate later pipeline steps.
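
The score table can be turned into a lookup for routing decisions. This is a sketch, not SDK behavior: `quality_band` is a hypothetical helper, and because the table leaves small gaps between ranges (for example 3.95), it assumes each band's lower bound acts as the threshold.

```python
def quality_band(score: float) -> str:
    """Map a parse_quality_score (0-5) to the band names in the score table.

    Assumption: each band starts at its lower bound, so 3.95 counts as "good".
    """
    if not 0.0 <= score <= 5.0:
        raise ValueError(f"Score out of range: {score}")
    if score >= 4.0:
        return "excellent"
    if score >= 3.0:
        return "good"
    if score >= 2.0:
        return "fair"
    return "poor"
```

A pipeline might retry in `accurate` mode whenever the band is `fair` or `poor`, following the recommended actions in the table.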

## Checkpoints

Save a processing checkpoint to reuse parsed results for extraction or segmentation without re-processing:

```python theme={null}
from datalab_sdk import DatalabClient, ConvertOptions, ExtractOptions
import json

client = DatalabClient()

# Step 1: Convert and save checkpoint
options = ConvertOptions(
    save_checkpoint=True,
    output_format="markdown"
)
result = client.convert("document.pdf", options=options)
checkpoint_id = result.checkpoint_id

# Step 2: Use checkpoint for extraction (no re-processing needed)
extraction_options = ExtractOptions(
    page_schema=json.dumps({"type": "object", "properties": {"title": {"type": "string"}}}),
    checkpoint_id=checkpoint_id
)
extract_result = client.extract("document.pdf", options=extraction_options)
```

Checkpoints save time and cost when you need to run multiple operations (extraction, segmentation) on the same document.

<Warning>
  Results are deleted from Datalab servers one hour after processing completes. Retrieve your results promptly.
</Warning>

## Next Steps

<CardGroup cols={2}>
  <Card title="Structured Extraction" icon="table" href="/docs/recipes/structured-extraction/api-overview">
    Extract structured data from documents using JSON schemas
  </Card>

  <Card title="Batch Processing" icon="layer-group" href="/docs/recipes/conversion/batch-documents">
    Process multiple documents concurrently
  </Card>

  <Card title="Document Segmentation" icon="scissors" href="/docs/recipes/document-segmentation/auto-segmentation">
    Split multi-document PDFs into segments
  </Card>

  <Card title="Webhooks" icon="bell" href="/platform/webhooks">
    Get notified when conversions complete
  </Card>
</CardGroup>
