
Installation

The CLI is included with the SDK:
pip install datalab-python-sdk

Authentication

Set your API key as an environment variable:
export DATALAB_API_KEY=your_api_key_here
Or pass it with each command:
datalab convert document.pdf --api_key YOUR_API_KEY
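In scripts, it can help to fail early when the key is missing rather than partway through a batch. The sketch below is a hypothetical helper (`require_datalab_key` is not part of the CLI) that only checks whether the `DATALAB_API_KEY` environment variable is set and non-empty; it does not verify that the key is valid.

```shell
# Hypothetical helper, not part of the datalab CLI.
# Returns 0 if DATALAB_API_KEY is set and non-empty; otherwise prints a hint and returns 1.
require_datalab_key() {
  if [ -z "${DATALAB_API_KEY:-}" ]; then
    echo "DATALAB_API_KEY is not set; export it or pass --api_key" >&2
    return 1
  fi
}

# Usage (commented out so the snippet has no side effects):
# require_datalab_key && datalab convert document.pdf
```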

Convert Documents

Convert documents to markdown, HTML, JSON, or chunks.

Basic Usage

# Convert a single file
datalab convert document.pdf

# Convert to specific format
datalab convert document.pdf --format html

# Convert with processing mode
datalab convert document.pdf --mode accurate

Output Options

# Save to specific directory
datalab convert document.pdf --output_dir ./output/

# Output formats
datalab convert document.pdf --format markdown
datalab convert document.pdf --format html
datalab convert document.pdf --format json
datalab convert document.pdf --format chunks

Processing Options

# Processing modes
datalab convert document.pdf --mode fast       # Lowest latency
datalab convert document.pdf --mode balanced   # Default
datalab convert document.pdf --mode accurate   # Highest accuracy

# Limit pages
datalab convert document.pdf --max_pages 10

# Specific page range (0-indexed)
datalab convert document.pdf --page_range "0-5,10,15-20"

# Add page delimiters
datalab convert document.pdf --paginate
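Because `--page_range` is 0-indexed, page numbers read from a PDF viewer (which usually start at 1) need to be shifted down by one. The following is a minimal sketch of that conversion; `to_zero_indexed` is a hypothetical helper name, and it assumes the input contains only comma-separated positive integers and `start-end` ranges.

```shell
# Hypothetical helper: convert a 1-indexed page list such as "1-6,11"
# into the 0-indexed form expected by --page_range ("0-5,10").
to_zero_indexed() {
  local input="$1" out="" part start end
  local IFS=','            # split the list on commas
  for part in $input; do
    case "$part" in
      *-*)                 # a range like "16-21"
        start="${part%-*}"
        end="${part#*-}"
        part="$((start - 1))-$((end - 1))"
        ;;
      *)                   # a single page like "11"
        part="$((part - 1))"
        ;;
    esac
    out="${out:+$out,}$part"
  done
  printf '%s\n' "$out"
}

# Usage (commented out):
# datalab convert document.pdf --page_range "$(to_zero_indexed "1-6,11,16-21")"
```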

Advanced Options

# Add block IDs for citations (HTML only)
datalab convert document.pdf --format html --add_block_ids

# Disable image extraction
datalab convert document.pdf --disable_image_extraction

# Disable image captions
datalab convert document.pdf --disable_image_captions

# Skip cached results
datalab convert document.pdf --skip_cache

# Structured extraction with schema
datalab convert invoice.pdf --page_schema '{"invoice_number": {"type": "string"}}'
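Inline JSON is easy to break with shell quoting once a schema grows. One possible approach, sketched here, is to keep the schema in a file and read it into a variable before passing it to `--page_schema`. The temporary file and the `page_schema` variable name are illustrative; the `total` field is borrowed from the workflow example later on this page.

```shell
# Write the schema to a file; the quoted 'EOF' prevents any shell expansion.
schema_file="$(mktemp)"
cat > "$schema_file" <<'EOF'
{"invoice_number": {"type": "string"}, "total": {"type": "number"}}
EOF

# Read the schema back into a variable and clean up the temporary file.
page_schema="$(cat "$schema_file")"
rm -f "$schema_file"

# Usage (commented out); the double quotes keep the JSON as a single argument:
# datalab convert invoice.pdf --page_schema "$page_schema"
```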

Directory Processing

Convert all documents in a directory:
# Convert all supported files
datalab convert ./documents/ --output_dir ./output/

# Filter by extension
datalab convert ./documents/ --extensions pdf,docx

# Control concurrency
datalab convert ./documents/ --max_concurrent 5
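Before starting a large directory run, it may be useful to preview which files an extension filter is likely to pick up. The sketch below uses `find` for this; `list_candidates` is a hypothetical helper, and it assumes a recursive, case-sensitive match on the file suffix, which may differ from how the CLI itself selects files.

```shell
# Hypothetical dry-run helper: list files under a directory whose names end
# in one of the comma-separated extensions (same list format as --extensions).
# Assumption: recursive and case-sensitive; the CLI's own matching may differ.
list_candidates() {
  local dir="$1" exts="$2" ext
  local IFS=','            # split the extension list on commas
  for ext in $exts; do
    find "$dir" -type f -name "*.${ext}"
  done | sort
}

# Usage (commented out):
# list_candidates ./documents/ pdf,docx
```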

Convert Command Reference

Option                      Description
--format                    Output format: markdown, html, json, chunks
--mode                      Processing mode: fast, balanced, accurate
--output_dir, -o            Output directory
--max_pages                 Maximum pages to process
--page_range                Specific pages (e.g., "0-5,10")
--paginate                  Add page delimiters
--add_block_ids             Add block IDs to HTML output
--disable_image_extraction  Don’t extract images
--disable_image_captions    Don’t generate image captions
--page_schema               JSON schema for structured extraction
--skip_cache                Force reprocessing
--extensions                File extensions to process (for directories)
--max_concurrent            Maximum concurrent requests
--max_polls                 Maximum polling attempts
--poll_interval             Seconds between polls
--api_key                   Datalab API key
--base_url                  API base URL

Workflow Commands

List Workflows

datalab list-workflows

Get Workflow Details

datalab get-workflow --workflow_id 42

Get Step Types

List available workflow step types:
datalab get-step-types

Create Workflow

Create a workflow from a JSON definition file:
datalab create-workflow --definition workflow.json
Example workflow.json:
{
  "name": "Invoice Processor",
  "steps": [
    {
      "unique_name": "parse",
      "step_key": "marker_parse",
      "settings": {"max_pages": 10},
      "depends_on": []
    },
    {
      "unique_name": "extract",
      "step_key": "marker_extract",
      "settings": {
        "page_schema": {
          "invoice_number": {"type": "string"},
          "total": {"type": "number"}
        }
      },
      "depends_on": ["parse"]
    }
  ]
}
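A stray comma or missing brace in the definition file is easier to catch locally than after submitting it. As a minimal sketch, the file's JSON syntax can be checked with Python's built-in `json.tool` module, assuming `python3` is on the PATH (likely, since the SDK is installed with pip). `validate_workflow_json` is a hypothetical helper; it checks syntax only, not whether the step keys or `depends_on` names are valid.

```shell
# Hypothetical helper: return 0 if the file parses as JSON, non-zero otherwise.
# Only syntax is checked; step keys and dependencies are not validated.
validate_workflow_json() {
  python3 -m json.tool "$1" > /dev/null 2>&1
}

# Usage (commented out):
# validate_workflow_json workflow.json && datalab create-workflow --definition workflow.json
```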

Execute Workflow

datalab execute-workflow --workflow_id 42 --file_urls "https://example.com/doc.pdf"

Check Execution Status

datalab get-execution-status --execution_id 123
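Workflow executions may take a while, so a script will typically check the status repeatedly. The sketch below is a generic polling loop; `poll_until` is a hypothetical helper that retries any command until it succeeds or the attempt limit is reached. How to detect completion is left to the caller, because the output format of `get-execution-status` is not described on this page.

```shell
# Hypothetical helper: run a command up to max_polls times, sleeping
# `interval` seconds between attempts. Returns 0 on the first success,
# 1 if every attempt fails.
poll_until() {
  local max_polls="$1" interval="$2"
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_polls" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only if another attempt will follow.
    if [ "$attempt" -lt "$max_polls" ]; then
      sleep "$interval"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Usage (commented out). is_done is a placeholder you would write yourself,
# for example by inspecting the output of:
#   datalab get-execution-status --execution_id 123
# poll_until 30 5 is_done
```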

Visualize Workflow

Generate a visual representation of a workflow:
datalab visualize-workflow --definition workflow.json

Examples

Batch Convert PDFs

# Convert all PDFs in a directory with accurate mode
datalab convert ./invoices/ \
  --extensions pdf \
  --mode accurate \
  --format json \
  --output_dir ./processed/

Extract Data from Documents

# Extract structured data using a schema
datalab convert invoice.pdf \
  --mode balanced \
  --page_schema '{
    "invoice_number": {"type": "string", "description": "Invoice ID"},
    "total": {"type": "number", "description": "Total amount"},
    "vendor": {"type": "string", "description": "Vendor name"}
  }' \
  --output_dir ./extracted/

High-Throughput Processing

# Process many files with high concurrency
datalab convert ./documents/ \
  --max_concurrent 10 \
  --mode fast \
  --output_dir ./output/

Getting Help

# General help
datalab --help

# Command-specific help
datalab convert --help
datalab create-workflow --help
