
Installation

The datalab CLI is installed with the Python SDK:
pip install datalab-python-sdk

Authentication

Set your API key as an environment variable:
export DATALAB_API_KEY=your_api_key_here
Or pass it with each command:
datalab convert document.pdf --api_key YOUR_API_KEY
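When scripting against the CLI, it can help to fail fast if the key is missing rather than discover it mid-batch. A minimal sketch (the helper function is ours, not part of the CLI):

```shell
# Return non-zero if DATALAB_API_KEY is unset or empty,
# so batch scripts can stop before queuing any work.
require_api_key() {
  if [ -z "${DATALAB_API_KEY:-}" ]; then
    echo "DATALAB_API_KEY is not set" >&2
    return 1
  fi
}

# Usage: require_api_key && datalab convert document.pdf
```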

Convert Documents

Convert documents to markdown, HTML, JSON, or chunks.

Basic Usage

# Convert a single file
datalab convert document.pdf

# Convert to specific format
datalab convert document.pdf --format html

# Convert with processing mode
datalab convert document.pdf --mode accurate

Output Options

# Save to specific directory
datalab convert document.pdf --output_dir ./output/

# Output formats
datalab convert document.pdf --format markdown
datalab convert document.pdf --format html
datalab convert document.pdf --format json
datalab convert document.pdf --format chunks

Processing Options

# Processing modes
datalab convert document.pdf --mode fast       # Lowest latency
datalab convert document.pdf --mode balanced   # Default
datalab convert document.pdf --mode accurate   # Highest accuracy

# Limit pages
datalab convert document.pdf --max_pages 10

# Specific page range (0-indexed)
datalab convert document.pdf --page_range "0-5,10,15-20"

# For spreadsheets, page_range filters by sheet index
datalab convert workbook.xlsx --page_range "0,2"

# Add page delimiters
datalab convert document.pdf --paginate
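A page_range string combines single pages and dashed spans. To preview which 0-indexed pages a given string selects before running a conversion, a small helper (ours, not a CLI feature) can expand it:

```shell
# Expand a page_range string such as "0-5,10,15-20" into the
# individual 0-indexed page numbers it selects, one per line.
expand_pages() {
  echo "$1" | tr ',' '\n' | while IFS='-' read -r start end; do
    seq "$start" "${end:-$start}"
  done
}

expand_pages "0-2,5"
```

For a spreadsheet input, the same string would select sheet indices rather than pages.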

Advanced Options

# Add block IDs for citations (HTML only)
datalab convert document.pdf --format html --add_block_ids

# Disable image extraction
datalab convert document.pdf --disable_image_extraction

# Disable image captions
datalab convert document.pdf --disable_image_captions

# Skip cached results
datalab convert document.pdf --skip_cache

Directory Processing

Convert all documents in a directory:
# Convert all supported files
datalab convert ./documents/ --output_dir ./output/

# Filter by extension
datalab convert ./documents/ --extensions pdf,docx

# Control concurrency
datalab convert ./documents/ --max_concurrent 5
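Before starting a long directory run, it can be worth previewing which files an extension filter will pick up. This sketch mirrors the --extensions behaviour under the assumption that the flag matches on file suffix:

```shell
# List files under a directory whose suffix appears in a
# comma-separated extension list, mirroring --extensions pdf,docx.
list_matching() {
  dir=$1
  echo "$2" | tr ',' '\n' | while read -r ext; do
    find "$dir" -type f -name "*.${ext}"
  done
}

# Usage: list_matching ./documents pdf,docx
```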

Convert Command Reference

Option                        Description
--format                      Output format: markdown, html, json, chunks
--mode                        Processing mode: fast, balanced, accurate
--output_dir, -o              Output directory
--max_pages                   Maximum pages to process
--page_range                  Specific pages (e.g., "0-5,10")
--paginate                    Add page delimiters
--add_block_ids               Add block IDs to HTML output
--disable_image_extraction    Don't extract images
--disable_image_captions      Don't generate image captions
--skip_cache                  Force reprocessing
--extensions                  File extensions to process (for directories)
--max_concurrent              Maximum concurrent requests
--max_polls                   Maximum polling attempts
--poll_interval               Seconds between polls
--api_key                     Datalab API key
--base_url                    API base URL

Extract Structured Data

Extract structured data from documents using a JSON schema.

Basic Usage

# Extract data using a page schema
datalab extract invoice.pdf \
  --page_schema '{"invoice_number": {"type": "string"}, "total": {"type": "number"}}'

# Extract with a specific mode
datalab extract invoice.pdf \
  --page_schema '{"title": {"type": "string"}}' \
  --mode accurate

# Extract using a checkpoint from a previous conversion
datalab extract invoice.pdf \
  --page_schema '{"total": {"type": "number"}}' \
  --checkpoint_id "ckpt_abc123"
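Inline JSON is easy to mis-quote in a shell. One option is to keep the schema in a file and validate it before passing it along (a sketch; the file name is arbitrary):

```shell
# Write the page schema to a file, check that it parses as JSON,
# then pass its contents to --page_schema.
cat > invoice_schema.json <<'EOF'
{"invoice_number": {"type": "string"}, "total": {"type": "number"}}
EOF

python3 -m json.tool invoice_schema.json >/dev/null && echo "schema OK"
# datalab extract invoice.pdf --page_schema "$(cat invoice_schema.json)"
```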

Extract Command Reference

Option              Description
--page_schema       (Required) JSON schema defining fields to extract
--checkpoint_id     Checkpoint ID from a previous conversion
--format            Output format: markdown, html, json, chunks
--mode              Processing mode: fast, balanced, accurate
--output_dir, -o    Output directory
--max_pages         Maximum pages to process
--page_range        Specific pages (e.g., "0-5,10")
--skip_cache        Force reprocessing
--api_key           Datalab API key
--base_url          API base URL

Segment Documents

Segment documents into logical sections using a schema.

Basic Usage

# Segment a document
datalab segment report.pdf \
  --segmentation_schema '{"sections": [{"name": "intro", "description": "Introduction"}, {"name": "body", "description": "Main content"}]}'

# Segment with a checkpoint
datalab segment report.pdf \
  --segmentation_schema '{"sections": [{"name": "summary", "description": "Executive summary"}]}' \
  --checkpoint_id "ckpt_abc123"
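The segmentation schema is a JSON object with a sections list of name/description pairs. A quick pre-flight check of that shape (our helper, based on the examples above, not part of the CLI):

```shell
# Verify that a segmentation schema has a "sections" list whose
# entries each carry a name and a description.
check_segmentation_schema() {
  python3 -c '
import json, sys
schema = json.loads(sys.argv[1])
sections = schema.get("sections")
assert isinstance(sections, list) and sections, "need a non-empty sections list"
for s in sections:
    assert "name" in s and "description" in s, "each section needs name and description"
print("segmentation schema OK")
' "$1"
}

check_segmentation_schema '{"sections": [{"name": "intro", "description": "Introduction"}]}'
```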

Segment Command Reference

Option                   Description
--segmentation_schema    (Required) JSON schema defining segment names and descriptions
--checkpoint_id          Checkpoint ID from a previous conversion
--mode                   Processing mode: fast, balanced, accurate
--output_dir, -o         Output directory
--max_pages              Maximum pages to process
--page_range             Specific pages (e.g., "0-5,10")
--skip_cache             Force reprocessing
--api_key                Datalab API key
--base_url               API base URL

Track Changes

Extract tracked changes from DOCX documents.

Basic Usage

# Extract tracked changes from a Word document
datalab track-changes contract.docx

# Specify output format
datalab track-changes contract.docx --format html

# With pagination
datalab track-changes contract.docx --format html --paginate

Track Changes Command Reference

Option              Description
--format            Comma-separated output formats: markdown, html, chunks (default: all three)
--paginate          Add page delimiters to output
--output_dir, -o    Output directory
--api_key           Datalab API key
--base_url          API base URL

Custom Processor

The custom-pipeline CLI command is deprecated. It continues to work and calls the new /api/v1/custom-processor endpoint internally, but the command name itself will be updated in a future SDK release.
Execute a custom processor on a document.

Basic Usage

# Run a custom processor
datalab custom-pipeline document.pdf --pipeline_id "cp_XXXXX"

# Run with evaluation
datalab custom-pipeline document.pdf \
  --pipeline_id "cp_XXXXX" \
  --run_eval

# Specify format and mode
datalab custom-pipeline document.pdf \
  --pipeline_id "cp_XXXXX" \
  --format json \
  --mode accurate

Custom Processor Command Reference

Option              Description
--pipeline_id       (Required) Custom processor ID (cp_XXXXX)
--run_eval          Run evaluation rules for the processor
--format            Output format: markdown, html, json, chunks
--mode              Processing mode: fast, balanced, accurate
--output_dir, -o    Output directory
--api_key           Datalab API key
--base_url          API base URL

Create Document

Create a DOCX document from markdown with track changes.

Basic Usage

# Create a document from a markdown file
datalab create-document --markdown input.md --output output.docx

# Create a document from inline markdown content
datalab create-document \
  --markdown "# Title\n\nDocument content here." \
  --output document.docx
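Note that in most shells, \n inside double quotes is passed through as a literal backslash-n. If you want real newlines in inline markdown, printf can produce them (a hedged sketch; check how your shell and CLI version handle escapes):

```shell
# Build multi-line markdown with real newlines, then pass it inline.
md=$(printf '# Title\n\nDocument content here.')

# datalab create-document --markdown "$md" --output document.docx
printf '%s\n' "$md"
```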

Create Document Command Reference

Option          Description
--markdown      (Required) Markdown content or path to a markdown file
--output, -o    (Required) Output file path for the generated DOCX
--api_key       Datalab API key
--base_url      API base URL

Examples

Batch Convert PDFs

# Convert all PDFs in a directory with accurate mode
datalab convert ./invoices/ \
  --extensions pdf \
  --mode accurate \
  --format json \
  --output_dir ./processed/

Extract Data from Documents

# Extract structured data using a schema
datalab extract invoice.pdf \
  --page_schema '{
    "invoice_number": {"type": "string", "description": "Invoice ID"},
    "total": {"type": "number", "description": "Total amount"},
    "vendor": {"type": "string", "description": "Vendor name"}
  }' \
  --mode balanced \
  --output_dir ./extracted/

High-Throughput Processing

# Process many files with high concurrency
datalab convert ./documents/ \
  --max_concurrent 10 \
  --mode fast \
  --output_dir ./output/

Getting Help

# General help
datalab --help

# Command-specific help
datalab convert --help
datalab extract --help
datalab segment --help
datalab track-changes --help
datalab custom-pipeline --help
datalab create-document --help

Next Steps

Quickstart

Get up and running with Datalab in minutes.

Batch Processing

Process multiple documents efficiently in parallel.

SDK Reference

Explore the full Python SDK for advanced usage.

Supported File Types

See all document formats supported by Datalab.