Installation
The CLI is included with the SDK:
pip install datalab-python-sdk
Authentication
Set your API key as an environment variable:
export DATALAB_API_KEY=your_api_key_here
Or pass it with each command:
datalab convert document.pdf --api_key YOUR_API_KEY
Convert Documents
Convert documents to markdown, HTML, JSON, or chunks.
Basic Usage
# Convert a single file
datalab convert document.pdf
# Convert to specific format
datalab convert document.pdf --format html
# Convert with processing mode
datalab convert document.pdf --mode accurate
Output Options
# Save to specific directory
datalab convert document.pdf --output_dir ./output/
# Output formats
datalab convert document.pdf --format markdown
datalab convert document.pdf --format html
datalab convert document.pdf --format json
datalab convert document.pdf --format chunks
Processing Options
# Processing modes
datalab convert document.pdf --mode fast # Lowest latency
datalab convert document.pdf --mode balanced # Default
datalab convert document.pdf --mode accurate # Highest accuracy
# Limit pages
datalab convert document.pdf --max_pages 10
# Specific page range (0-indexed)
datalab convert document.pdf --page_range "0-5,10,15-20"
# For spreadsheets, page_range filters by sheet index
datalab convert workbook.xlsx --page_range "0,2"
# Add page delimiters
datalab convert document.pdf --paginate
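The `--page_range` syntax (comma-separated indices and ranges) can be sketched with a small parser. This helper is illustrative only, not part of the SDK, and it assumes ranges are inclusive of both endpoints, consistent with the `"0-5,10,15-20"` example above:

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a --page_range string like "0-5,10,15-20" into 0-indexed pages."""
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-")
            pages.extend(range(int(start), int(end) + 1))  # assume inclusive ranges
        else:
            pages.append(int(part))
    return pages

print(parse_page_range("0-5,10,15-20"))
# → [0, 1, 2, 3, 4, 5, 10, 15, 16, 17, 18, 19, 20]
```

For spreadsheets, the same syntax selects sheets rather than pages, so `"0,2"` picks the first and third sheets.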
Advanced Options
# Add block IDs for citations (HTML only)
datalab convert document.pdf --format html --add_block_ids
# Disable image extraction
datalab convert document.pdf --disable_image_extraction
# Disable image captions
datalab convert document.pdf --disable_image_captions
# Skip cached results
datalab convert document.pdf --skip_cache
Directory Processing
Convert all documents in a directory:
# Convert all supported files
datalab convert ./documents/ --output_dir ./output/
# Filter by extension
datalab convert ./documents/ --extensions pdf,docx
# Control concurrency
datalab convert ./documents/ --max_concurrent 5
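The effect of `--extensions` can be sketched in Python. `select_files` is a hypothetical helper that mimics the filtering, assuming suffixes are matched case-insensitively (an assumption, not documented above):

```python
from pathlib import Path

def select_files(directory: str, extensions: str) -> list[Path]:
    """Mimic --extensions filtering: keep files whose suffix is listed."""
    allowed = {"." + ext.strip().lstrip(".").lower() for ext in extensions.split(",")}
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in allowed
    )
```

For example, `select_files("./documents", "pdf,docx")` returns the same set of files a `datalab convert ./documents/ --extensions pdf,docx` run would pick up, under the case-insensitivity assumption above.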
Convert Command Reference
| Option | Description |
| --- | --- |
| `--format` | Output format: markdown, html, json, chunks |
| `--mode` | Processing mode: fast, balanced, accurate |
| `--output_dir`, `-o` | Output directory |
| `--max_pages` | Maximum pages to process |
| `--page_range` | Specific pages (e.g., "0-5,10") |
| `--paginate` | Add page delimiters |
| `--add_block_ids` | Add block IDs to HTML output |
| `--disable_image_extraction` | Don't extract images |
| `--disable_image_captions` | Don't generate image captions |
| `--skip_cache` | Force reprocessing |
| `--extensions` | File extensions to process (for directories) |
| `--max_concurrent` | Maximum concurrent requests |
| `--max_polls` | Maximum polling attempts |
| `--poll_interval` | Seconds between polls |
| `--api_key` | Datalab API key |
| `--base_url` | API base URL |
Extract Data
Extract structured data from documents using a JSON schema.
Basic Usage
# Extract data using a page schema
datalab extract invoice.pdf \
  --page_schema '{"invoice_number": {"type": "string"}, "total": {"type": "number"}}'
# Extract with a specific mode
datalab extract invoice.pdf \
  --page_schema '{"title": {"type": "string"}}' \
  --mode accurate
# Extract using a checkpoint from a previous conversion
datalab extract invoice.pdf \
  --page_schema '{"total": {"type": "number"}}' \
  --checkpoint_id "ckpt_abc123"
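Inline JSON schemas are easy to mis-quote in a shell. One approach is to build the schema in Python with `json.dumps`, which guarantees valid JSON, and hand the result to the CLI. The sketch below uses only flags documented above; the commented-out run line requires the `datalab` CLI on your PATH:

```python
import json

# Build the schema as a Python dict, then serialize it; json.dumps
# produces valid JSON and sidesteps shell-quoting mistakes.
schema = {
    "invoice_number": {"type": "string"},
    "total": {"type": "number"},
}

cmd = [
    "datalab", "extract", "invoice.pdf",
    "--page_schema", json.dumps(schema),
    "--mode", "accurate",
]

# import subprocess; subprocess.run(cmd, check=True)  # requires the datalab CLI
print(cmd[4])
# → {"invoice_number": {"type": "string"}, "total": {"type": "number"}}
```

Passing the command as a list (rather than a single shell string) avoids a second layer of quoting entirely.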
Extract Command Reference
| Option | Description |
| --- | --- |
| `--page_schema` | (Required) JSON schema defining fields to extract |
| `--checkpoint_id` | Checkpoint ID from a previous conversion |
| `--format` | Output format: markdown, html, json, chunks |
| `--mode` | Processing mode: fast, balanced, accurate |
| `--output_dir`, `-o` | Output directory |
| `--max_pages` | Maximum pages to process |
| `--page_range` | Specific pages (e.g., "0-5,10") |
| `--skip_cache` | Force reprocessing |
| `--api_key` | Datalab API key |
| `--base_url` | API base URL |
Segment Documents
Segment documents into logical sections using a schema.
Basic Usage
# Segment a document
datalab segment report.pdf \
  --segmentation_schema '{"sections": [{"name": "intro", "description": "Introduction"}, {"name": "body", "description": "Main content"}]}'
# Segment with a checkpoint
datalab segment report.pdf \
  --segmentation_schema '{"sections": [{"name": "summary", "description": "Executive summary"}]}' \
  --checkpoint_id "ckpt_abc123"
Segment Command Reference
| Option | Description |
| --- | --- |
| `--segmentation_schema` | (Required) JSON schema defining segment names and descriptions |
| `--checkpoint_id` | Checkpoint ID from a previous conversion |
| `--mode` | Processing mode: fast, balanced, accurate |
| `--output_dir`, `-o` | Output directory |
| `--max_pages` | Maximum pages to process |
| `--page_range` | Specific pages (e.g., "0-5,10") |
| `--skip_cache` | Force reprocessing |
| `--api_key` | Datalab API key |
| `--base_url` | API base URL |
Track Changes
Extract tracked changes from DOCX documents.
Basic Usage
# Extract tracked changes from a Word document
datalab track-changes contract.docx
# Specify output format
datalab track-changes contract.docx --format html
# With pagination
datalab track-changes contract.docx --format html --paginate
Track Changes Command Reference
| Option | Description |
| --- | --- |
| `--format` | Comma-separated output formats: markdown, html, chunks (default: all three) |
| `--paginate` | Add page delimiters to output |
| `--output_dir`, `-o` | Output directory |
| `--api_key` | Datalab API key |
| `--base_url` | API base URL |
Custom Processor
The custom-pipeline CLI command is deprecated. It continues to work and calls the new /api/v1/custom-processor endpoint internally, but the command name itself will be updated in a future SDK release.
Execute a custom processor on a document.
Basic Usage
# Run a custom processor
datalab custom-pipeline document.pdf --pipeline_id "cp_XXXXX"
# Run with evaluation
datalab custom-pipeline document.pdf \
  --pipeline_id "cp_XXXXX" \
  --run_eval
# Specify format and mode
datalab custom-pipeline document.pdf \
  --pipeline_id "cp_XXXXX" \
  --format json \
  --mode accurate
Custom Processor Command Reference
| Option | Description |
| --- | --- |
| `--pipeline_id` | (Required) Custom processor ID (cp_XXXXX) |
| `--run_eval` | Run evaluation rules for the processor |
| `--format` | Output format: markdown, html, json, chunks |
| `--mode` | Processing mode: fast, balanced, accurate |
| `--output_dir`, `-o` | Output directory |
| `--api_key` | Datalab API key |
| `--base_url` | API base URL |
Create Document
Create a DOCX document from markdown with track changes.
Basic Usage
# Create a document from a markdown file
datalab create-document --markdown input.md --output output.docx
# Create a document from inline markdown content
# ($'...' quoting makes \n a real newline in bash/zsh)
datalab create-document \
  --markdown $'# Title\n\nDocument content here.' \
  --output document.docx
Create Document Command Reference
| Option | Description |
| --- | --- |
| `--markdown` | (Required) Markdown content or path to a markdown file |
| `--output`, `-o` | (Required) Output file path for the generated DOCX |
| `--api_key` | Datalab API key |
| `--base_url` | API base URL |
Examples
Batch Convert PDFs
# Convert all PDFs in a directory with accurate mode
datalab convert ./invoices/ \
  --extensions pdf \
  --mode accurate \
  --format json \
  --output_dir ./processed/
# Extract structured data using a schema
datalab extract invoice.pdf \
  --page_schema '{
    "invoice_number": {"type": "string", "description": "Invoice ID"},
    "total": {"type": "number", "description": "Total amount"},
    "vendor": {"type": "string", "description": "Vendor name"}
  }' \
  --mode balanced \
  --output_dir ./extracted/
High-Throughput Processing
# Process many files with high concurrency
datalab convert ./documents/ \
  --max_concurrent 10 \
  --mode fast \
  --output_dir ./output/
Getting Help
# General help
datalab --help
# Command-specific help
datalab convert --help
datalab extract --help
datalab segment --help
datalab track-changes --help
datalab custom-pipeline --help
datalab create-document --help
Next Steps
Quickstart Get up and running with Datalab in minutes.
Batch Processing Process multiple documents efficiently in parallel.
SDK Reference Explore the full Python SDK for advanced usage.
Supported File Types See all document formats supported by Datalab.