Basic Usage

from datalab_sdk import DatalabClient

client = DatalabClient()

# Convert to markdown (default)
result = client.convert("document.pdf")
print(result.markdown)

# Convert from URL
result = client.convert(file_url="https://example.com/document.pdf")
print(result.markdown)
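When the input may be either a local path or a URL, a small dispatcher can choose between the two calling forms shown above. The sketch below is illustrative only: `convert_args` is a hypothetical helper, not part of the SDK, and it assumes that remote documents always use an `http` or `https` URL.

```python
from urllib.parse import urlparse


def convert_args(source: str) -> tuple[tuple, dict]:
    """Return (args, kwargs) for client.convert() based on the source type.

    Hypothetical helper; not part of datalab_sdk.
    """
    scheme = urlparse(source).scheme.lower()
    if scheme in ("http", "https"):
        # Remote documents are passed through the file_url keyword.
        return (), {"file_url": source}
    # Anything else is treated as a local file path (positional argument).
    return (source,), {}


args, kwargs = convert_args("https://example.com/document.pdf")
print(args, kwargs)
```

The returned pair can then be unpacked into the call, as in `client.convert(*args, **kwargs)`.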

Conversion Options

Use ConvertOptions to control the conversion:
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

options = ConvertOptions(
    output_format="markdown",     # Output format
    mode="balanced",              # Processing mode
    paginate=True,                # Add page delimiters
    max_pages=10,                 # Limit pages processed
    page_range="0-5,10",          # Specific pages (0-indexed)
)

result = client.convert("document.pdf", options=options)

All Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| output_format | str | "markdown" | Output format: "markdown", "html", "json", "chunks" |
| mode | str | "balanced" | Processing mode: "fast", "balanced", "accurate" |
| paginate | bool | False | Add page delimiters to output |
| max_pages | int | None | Maximum number of pages to process |
| page_range | str | None | Specific pages to process (e.g., "0-5,10,15-20") |
| skip_cache | bool | False | Skip cached results, force reprocessing |
| disable_image_extraction | bool | False | Don’t extract images from document |
| disable_image_captions | bool | False | Don’t generate captions for images |
| page_schema | dict | None | JSON schema for structured data extraction |
| segmentation_schema | str | None | Schema for document segmentation |
| save_checkpoint | bool | False | Save intermediate checkpoint for reuse |
| extras | str | None | Comma-separated features: "track_changes", "chart_understanding", "extract_links" |
| add_block_ids | bool | False | Add block IDs to HTML output for citations |
| keep_spreadsheet_formatting | bool | False | Preserve spreadsheet styling in HTML output |
| webhook_url | str | None | Override account webhook URL for this request |
| additional_config | dict | None | Additional configuration options |
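To check which pages a `page_range` string such as "0-5,10,15-20" selects before sending a request, the string can be expanded locally. This is a minimal sketch, not the SDK's own parser: `parse_page_range` is a hypothetical helper, and it assumes ranges are inclusive at both ends, which the option description does not state explicitly.

```python
def parse_page_range(page_range: str) -> list[int]:
    """Expand a string like "0-5,10,15-20" into sorted 0-indexed page numbers.

    Assumption: "a-b" includes both a and b.
    """
    pages: set[int] = set()
    for part in page_range.split(","):
        part = part.strip()
        if not part:
            continue  # Tolerate stray commas.
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_range("0-5,10"))
```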

Processing Modes

| Mode | Description | Use Case |
| --- | --- | --- |
| fast | Lowest latency | Simple documents, real-time applications |
| balanced | Balance of speed and accuracy | General use (default) |
| accurate | Highest accuracy | Complex layouts, tables, figures |

Output Formats

| Format | Description |
| --- | --- |
| markdown | Clean markdown with headers, lists, tables |
| html | Structured HTML preserving layout |
| json | Block-level structure with bounding boxes |
| chunks | Pre-chunked output for RAG applications |

Conversion Result

The ConversionResult object contains the converted content and metadata:
result = client.convert("document.pdf")

# Access content based on output format
print(result.markdown)        # Markdown output
print(result.html)            # HTML output
print(result.json)            # JSON structure
print(result.chunks)          # Chunked output

# Metadata
print(result.success)         # True if conversion succeeded
print(result.page_count)      # Number of pages processed
print(result.images)          # Dict of extracted images (filename -> base64)
print(result.metadata)        # Document metadata
print(result.parse_quality_score)  # Quality score (0-5)
print(result.cost_breakdown)  # Cost details dict (amounts in cents)

# Structured extraction results
print(result.extraction_schema_json)  # If page_schema was provided
print(result.segmentation_results)    # If segmentation_schema was provided
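Because a result carries both a `success` flag and an `error` message, callers may want to turn a failed conversion into an exception before reading the content. The sketch below shows one way to do that. `ConversionFailed` and `markdown_or_raise` are hypothetical names, not part of the SDK, and the `SimpleNamespace` objects are stand-ins exposing only the fields used here. Whether the SDK raises its own exceptions on failure is not covered by this page.

```python
from types import SimpleNamespace


class ConversionFailed(RuntimeError):
    """Hypothetical exception; not defined by datalab_sdk."""


def markdown_or_raise(result) -> str:
    """Return result.markdown, or raise with result.error if success is False."""
    if not result.success:
        raise ConversionFailed(result.error or "conversion failed without an error message")
    return result.markdown


# Stand-in objects that mimic the documented fields.
ok = SimpleNamespace(success=True, markdown="# Title", error=None)
failed = SimpleNamespace(success=False, markdown=None, error="unsupported file type")

print(markdown_or_raise(ok))
try:
    markdown_or_raise(failed)
except ConversionFailed as exc:
    print(f"failed: {exc}")
```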

Result Fields

| Field | Type | Description |
| --- | --- | --- |
| success | bool | Whether conversion succeeded |
| markdown | str | Markdown output (if format is markdown) |
| html | str | HTML output (if format is html) |
| json | dict | JSON output (if format is json) |
| chunks | dict | Chunked output (if format is chunks) |
| images | dict | Extracted images as {filename: base64_data} |
| metadata | dict | Document metadata |
| page_count | int | Number of pages processed |
| parse_quality_score | float | Quality score from 0-5 |
| cost_breakdown | dict | Cost details (list_cost_cents, final_cost_cents) |
| extraction_schema_json | str | Extracted data if page_schema was provided |
| segmentation_results | dict | Segmentation results if segmentation_schema was provided |
| checkpoint_id | str | Checkpoint ID if save_checkpoint was True |
| error | str | Error message if conversion failed |
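Since `images` maps filenames to base64 data, the entries can be decoded and written to disk by hand when custom handling is needed. This is a minimal sketch under two assumptions: the values are plain base64 strings without a `data:` URI prefix, and the sample filename is invented for illustration. `write_images` is a hypothetical helper, not an SDK function.

```python
import base64
import tempfile
from pathlib import Path


def write_images(images: dict[str, str], out_dir: str) -> list[Path]:
    """Decode {filename: base64_data} and write each image under out_dir."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, b64_data in images.items():
        # Keep only the final path component so a name cannot escape out_dir.
        path = target / Path(filename).name
        path.write_bytes(base64.b64decode(b64_data))
        written.append(path)
    return written


# Invented sample data standing in for result.images.
sample = {"figure_1.png": base64.b64encode(b"not a real image").decode("ascii")}
with tempfile.TemporaryDirectory() as tmp:
    paths = write_images(sample, tmp)
    print(paths[0].name, len(paths[0].read_bytes()))
```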

Saving Output

Save the conversion result to files:
# Save during conversion
result = client.convert("document.pdf", save_output="output/document")

# Or save afterward
result.save_output("output/document", save_images=True)
This creates:
  • document.md (or .html, .json based on format)
  • document_images/ directory with extracted images (if save_images=True)

Async Usage

For high-throughput applications, use AsyncDatalabClient as an async context manager:
import asyncio
from datalab_sdk import AsyncDatalabClient, ConvertOptions

async def convert_documents():
    async with AsyncDatalabClient() as client:
        options = ConvertOptions(mode="fast", max_pages=5)
        result = await client.convert("document.pdf", options=options)
        return result.markdown

markdown = asyncio.run(convert_documents())
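The async client pays off when several documents are converted concurrently. The sketch below shows one common pattern, bounding concurrency with a semaphore and collecting results with `asyncio.gather`. It is not an SDK feature: `convert_many` is a hypothetical helper, `fake_convert` stands in for a coroutine that would call `client.convert`, and the limit of 4 is an arbitrary choice.

```python
import asyncio
from typing import Awaitable, Callable


async def convert_many(
    paths: list[str],
    convert_one: Callable[[str], Awaitable[str]],
    max_concurrency: int = 4,
) -> list[str]:
    """Run convert_one over paths with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(path: str) -> str:
        async with semaphore:
            return await convert_one(path)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(p) for p in paths))


async def fake_convert(path: str) -> str:
    # Stand-in for: (await client.convert(path, options=options)).markdown
    await asyncio.sleep(0)
    return f"# {path}"


print(asyncio.run(convert_many(["a.pdf", "b.pdf"], fake_convert)))
```

In real use, `convert_one` would close over a single `AsyncDatalabClient` opened once for the whole batch.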

Polling Configuration

Control polling behavior for long-running conversions:
result = client.convert(
    "large_document.pdf",
    max_polls=600,        # Maximum polling attempts (default: 300)
    poll_interval=2,      # Seconds between polls (default: 1)
)
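The two polling settings together bound how long a call can wait. As a rough sketch, assuming the client sleeps `poll_interval` seconds between attempts and ignoring the time spent on each request, the budget is simply their product. Both helper names are hypothetical.

```python
import math


def max_wait_seconds(max_polls: int = 300, poll_interval: float = 1) -> float:
    """Approximate upper bound on time spent waiting between polls."""
    return max_polls * poll_interval


def polls_needed(timeout_seconds: float, poll_interval: float = 1) -> int:
    """Smallest max_polls value that covers the given timeout."""
    return math.ceil(timeout_seconds / poll_interval)


print(max_wait_seconds())        # documented defaults
print(max_wait_seconds(600, 2))  # settings from the example above
print(polls_needed(1200, 2))
```

Under these assumptions the defaults allow about 5 minutes of waiting, and the settings in the example allow about 20 minutes.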

Special Features

Track Changes (Word Documents)

Extract tracked changes and comments from DOCX files:
options = ConvertOptions(
    output_format="html",
    extras="track_changes",
)
result = client.convert("contract.docx", options=options)
# HTML contains <ins>, <del>, and <comment> tags
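Once the HTML contains `<ins>`, `<del>`, and `<comment>` tags, the tracked changes can be pulled out with the standard library. The following is a minimal sketch: the sample markup is invented, the real output may carry attributes or nesting that this parser ignores, and `TrackedChangeParser` is a hypothetical name.

```python
from html.parser import HTMLParser


class TrackedChangeParser(HTMLParser):
    """Collect (tag, text) pairs for <ins>, <del>, and <comment> elements.

    Nested tracked elements are not handled; only the outermost is recorded.
    """

    TRACKED = {"ins", "del", "comment"}

    def __init__(self):
        super().__init__()
        self.changes: list[tuple[str, str]] = []
        self._current = None
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED and self._current is None:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            self.changes.append((tag, "".join(self._buffer).strip()))
            self._current = None


# Invented sample standing in for result.html.
sample_html = (
    "<p>Payment due in <del>30</del><ins>45</ins> days."
    "<comment>Confirm with legal</comment></p>"
)
parser = TrackedChangeParser()
parser.feed(sample_html)
print(parser.changes)
```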

Chart Understanding

Extract data from charts and graphs:
options = ConvertOptions(
    extras="chart_understanding",
)
result = client.convert("report.pdf", options=options)

Block IDs for Citations

Add block IDs for tracking content back to source locations:
options = ConvertOptions(
    output_format="html",
    add_block_ids=True,
)
result = client.convert("document.pdf", options=options)
# HTML elements include data-block-id attributes
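To resolve a citation back to its text, the `data-block-id` attributes can be indexed with the standard library. This is a minimal sketch rather than the SDK's own tooling: `BlockIdIndex` is a hypothetical name, and the ID values in the sample (`b0`, `b1`) are invented, since the real ID format is not shown here.

```python
from html.parser import HTMLParser


class BlockIdIndex(HTMLParser):
    """Map each data-block-id value to the text that follows its start tag.

    Simplification: text is attributed to the most recently opened block,
    so this works best when blocks are siblings rather than nested.
    """

    def __init__(self):
        super().__init__()
        self.blocks: dict[str, str] = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        block_id = dict(attrs).get("data-block-id")
        if block_id is not None:
            self._current = block_id
            self.blocks[block_id] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.blocks[self._current] += data


# Invented sample standing in for result.html.
sample_html = (
    '<h1 data-block-id="b0">Report</h1>'
    '<p data-block-id="b1">Revenue <b>grew</b>.</p>'
)
index = BlockIdIndex()
index.feed(sample_html)
print(index.blocks)
```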

Structured Extraction

Extract structured data using a JSON schema:
options = ConvertOptions(
    page_schema={
        "invoice_number": {"type": "string", "description": "Invoice number"},
        "total": {"type": "number", "description": "Total amount"},
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"}
                }
            }
        }
    }
)
result = client.convert("invoice.pdf", options=options)
extracted = result.extraction_schema_json
See Structured Extraction for more details.
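Because `extraction_schema_json` is documented as a string, it usually needs to be decoded before use. The sketch below assumes the string holds a JSON object whose keys match the schema. The sample payload is invented for illustration and `parse_extraction` is a hypothetical helper.

```python
import json
from typing import Optional


def parse_extraction(raw: Optional[str]) -> dict:
    """Decode the extraction string into a dict; empty input gives {}."""
    if not raw:
        return {}
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise TypeError(f"expected a JSON object, got {type(data).__name__}")
    return data


# Invented payload standing in for result.extraction_schema_json.
sample = (
    '{"invoice_number": "INV-001", "total": 120.5, '
    '"items": [{"description": "Widget", "amount": 120.5}]}'
)
extracted = parse_extraction(sample)
print(extracted["invoice_number"], extracted["total"])
```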
