Basic Usage
Conversion Options
Use ConvertOptions to control the conversion:
All Options
| Option | Type | Default | Description |
|---|---|---|---|
| output_format | str | "markdown" | Output format: "markdown", "html", "json", "chunks" |
| mode | str | "balanced" | Processing mode: "fast", "balanced", "accurate" |
| paginate | bool | False | Add page delimiters to output |
| max_pages | int | None | Maximum number of pages to process |
| page_range | str | None | Specific pages to process (e.g., "0-5,10,15-20") |
| skip_cache | bool | False | Skip cached results, force reprocessing |
| disable_image_extraction | bool | False | Don’t extract images from document |
| disable_image_captions | bool | False | Don’t generate captions for images |
| page_schema | dict | None | JSON schema for structured data extraction |
| segmentation_schema | str | None | Schema for document segmentation |
| save_checkpoint | bool | False | Save intermediate checkpoint for reuse |
| extras | str | None | Comma-separated features: "track_changes", "chart_understanding", "extract_links" |
| add_block_ids | bool | False | Add block IDs to HTML output for citations |
| keep_spreadsheet_formatting | bool | False | Preserve spreadsheet styling in HTML output |
| webhook_url | str | None | Override account webhook URL for this request |
| additional_config | dict | None | Additional configuration options |
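Two of the options above, page_range and extras, are plain strings with a small grammar of their own. The following is a minimal sketch of building them from Python lists. The helper names `format_page_range` and `join_extras` are hypothetical and not part of the SDK, and the sketch assumes that ranges such as "0-5" are inclusive on both ends, which the table does not state.

```python
from typing import Iterable

# Feature names listed in the options table for the `extras` option.
KNOWN_EXTRAS = {"track_changes", "chart_understanding", "extract_links"}


def format_page_range(pages: Iterable[int]) -> str:
    """Collapse page indices into a compact string such as "0-5,10,15-20".

    Assumption: ranges are inclusive on both ends.
    """
    ordered = sorted(set(pages))
    if not ordered:
        raise ValueError("at least one page is required")
    if ordered[0] < 0:
        raise ValueError("page indices must be non-negative")

    parts = []
    start = prev = ordered[0]
    for page in ordered[1:]:
        if page == prev + 1:
            # Still inside a consecutive run; extend it.
            prev = page
            continue
        # The run ended; emit it as "n" or "a-b" and start a new one.
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = page
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


def join_extras(features: Iterable[str]) -> str:
    """Validate feature names and join them into a comma-separated string."""
    features = list(features)
    unknown = [name for name in features if name not in KNOWN_EXTRAS]
    if unknown:
        raise ValueError(f"unknown extras: {unknown}")
    return ",".join(features)


# Plain keyword arguments named after rows of the options table.
option_kwargs = {
    "output_format": "markdown",
    "mode": "accurate",
    "page_range": format_page_range([0, 1, 2, 3, 4, 5, 10, 15, 16, 17, 18, 19, 20]),
    "extras": join_extras(["track_changes", "extract_links"]),
}
print(option_kwargs["page_range"])
print(option_kwargs["extras"])
```

The resulting dictionary uses only option names from the table, so it could presumably be unpacked into ConvertOptions, though that call is not shown here because its exact signature is not given in this section.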
Processing Modes
| Mode | Description | Use Case |
|---|---|---|
| fast | Lowest latency | Simple documents, real-time applications |
| balanced | Balance of speed and accuracy | General use (default) |
| accurate | Highest accuracy | Complex layouts, tables, figures |
Output Formats
| Format | Description |
|---|---|
| markdown | Clean markdown with headers, lists, tables |
| html | Structured HTML preserving layout |
| json | Block-level structure with bounding boxes |
| chunks | Pre-chunked output for RAG applications |
Conversion Result
The ConversionResult object contains the converted content and metadata:
Result Fields
| Field | Type | Description |
|---|---|---|
| success | bool | Whether conversion succeeded |
| markdown | str | Markdown output (if format is markdown) |
| html | str | HTML output (if format is html) |
| json | dict | JSON output (if format is json) |
| chunks | dict | Chunked output (if format is chunks) |
| images | dict | Extracted images as {filename: base64_data} |
| metadata | dict | Document metadata |
| page_count | int | Number of pages processed |
| parse_quality_score | float | Quality score from 0-5 |
| cost_breakdown | dict | Cost details (list_cost_cents, final_cost_cents) |
| extraction_schema_json | str | Extracted data if page_schema was provided |
| segmentation_results | dict | Segmentation results if segmentation_schema was provided |
| checkpoint_id | str | Checkpoint ID if save_checkpoint was True |
| error | str | Error message if conversion failed |
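Because only the field matching the requested format is populated, calling code typically checks success first and then reads that one field. The sketch below illustrates the pattern against a stand-in object built with `SimpleNamespace`. The helper `get_output` is hypothetical, the sample values are placeholders, and a real result would come from the SDK rather than being constructed by hand.

```python
from types import SimpleNamespace

# Output formats from the table above; each one has a same-named result field.
OUTPUT_FIELDS = ("markdown", "html", "json", "chunks")


def get_output(result, output_format="markdown"):
    """Return the field matching the requested format, or raise on failure."""
    if output_format not in OUTPUT_FIELDS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    if not result.success:
        # `error` carries the message when conversion failed.
        raise RuntimeError(f"conversion failed: {result.error}")
    return getattr(result, output_format)


# Stand-in with a subset of the documented fields (placeholder values only).
fake_result = SimpleNamespace(
    success=True,
    markdown="# Title\n\nBody text.",
    html=None,
    json=None,
    chunks=None,
    page_count=2,
    parse_quality_score=4.5,
    cost_breakdown={"list_cost_cents": 2, "final_cost_cents": 1},
    error=None,
)

text = get_output(fake_result, "markdown")
summary = (
    f"{fake_result.page_count} pages, "
    f"quality {fake_result.parse_quality_score}/5, "
    f"cost {fake_result.cost_breakdown['final_cost_cents']} cents"
)
print(text)
print(summary)
```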
Saving Output
Save the conversion result to files:
- document.md (or .html, .json based on format)
- document_images/ directory with extracted images (if save_images=True)
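For readers who want to see what that file layout involves, here is a hand-rolled sketch that writes the same two artifacts from the documented markdown and images fields. It is not the SDK's own save method. The function name `save_result` and its parameters are hypothetical, and it handles only the markdown case.

```python
import base64
import tempfile
from pathlib import Path


def save_result(markdown, images, output_path, save_images=True):
    """Write <output_path>.md and, optionally, <output_path>_images/."""
    base = Path(output_path)
    base.parent.mkdir(parents=True, exist_ok=True)

    md_path = base.with_name(base.name + ".md")
    md_path.write_text(markdown, encoding="utf-8")
    written = [md_path]

    if save_images and images:
        image_dir = base.with_name(base.name + "_images")
        image_dir.mkdir(exist_ok=True)
        for filename, b64_data in images.items():
            # Keep only the final path component so files stay in image_dir.
            target = image_dir / Path(filename).name
            # `images` maps filename -> base64 string, per the result table.
            target.write_bytes(base64.b64decode(b64_data))
            written.append(target)
    return written


# Demo in a temporary directory with placeholder bytes instead of a real image.
with tempfile.TemporaryDirectory() as tmp:
    sample_images = {
        "figure_1.png": base64.b64encode(b"not a real image").decode("ascii")
    }
    paths = save_result("# Title", sample_images, Path(tmp) / "document")
    saved_names = [p.name for p in paths]
    image_parent = paths[1].parent.name

print(saved_names)
print(image_parent)
```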
Async Usage
For high-throughput applications:
Polling Configuration
Control polling behavior for long-running conversions:
Special Features
Track Changes (Word Documents)
Extract tracked changes and comments from DOCX files:
Chart Understanding
Extract data from charts and graphs:
Block IDs for Citations
Add block IDs for tracking content back to source locations:
Structured Extraction
Extract structured data using a JSON schema:
Extract structured data using a JSON schema:
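As an illustration of structured extraction, the sketch below defines a small JSON schema of the kind that could be passed as page_schema and decodes a string of the kind returned in extraction_schema_json. The schema, the sample payload, and the helper `parse_extraction` are all hypothetical. The sketch also assumes that the returned string decodes to an object keyed by the schema's property names, which this section does not confirm.

```python
import json

# Hypothetical schema; the property names are illustrative only.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["invoice_number"],
}


def parse_extraction(extraction_schema_json, schema):
    """Decode the extraction string and check that required keys are present.

    Assumption: the payload is a JSON object keyed by schema property names.
    """
    data = json.loads(extraction_schema_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in schema.get("required", []) if key not in data]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return data


# Placeholder payload standing in for a real extraction_schema_json value.
sample_payload = '{"invoice_number": "INV-001", "total": 125.5}'
parsed = parse_extraction(sample_payload, invoice_schema)
print(parsed["invoice_number"], parsed["total"])
```

Note that extraction_schema_json is typed as str in the result table, so it needs a `json.loads` step before use, unlike the json and chunks fields, which are already dictionaries.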