# Command Line Interface

> Use the Datalab CLI to convert, segment, and extract data from documents on the command line.

## Installation

The CLI is included with the SDK:

```bash theme={null}
pip install datalab-python-sdk
```

## Authentication

Set your API key as an environment variable:

```bash theme={null}
export DATALAB_API_KEY=your_api_key_here
```

Or pass it with each command:

```bash theme={null}
datalab convert document.pdf --api_key YOUR_API_KEY
```
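
If a script relies on the environment variable, it can help to stop before any command runs when the key is missing. The sketch below is only an illustration: `require_api_key` is a hypothetical helper name, not part of the Datalab CLI, and it assumes nothing beyond the `DATALAB_API_KEY` variable described above.

```bash
# Hypothetical helper (not part of the Datalab CLI): return non-zero
# when DATALAB_API_KEY is unset or empty.
require_api_key() {
  if [ -z "${DATALAB_API_KEY:-}" ]; then
    echo "DATALAB_API_KEY is not set; export it or pass --api_key" >&2
    return 1
  fi
}

# Usage: require_api_key && datalab convert document.pdf
```

Chaining with `&&` means the conversion only starts when the check passes.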

## Convert Documents

Convert documents to markdown, HTML, JSON, or chunks.

### Basic Usage

```bash theme={null}
# Convert a single file
datalab convert document.pdf

# Convert to specific format
datalab convert document.pdf --format html

# Convert with processing mode
datalab convert document.pdf --mode accurate
```

### Output Options

```bash theme={null}
# Save to specific directory
datalab convert document.pdf --output_dir ./output/

# Output formats
datalab convert document.pdf --format markdown
datalab convert document.pdf --format html
datalab convert document.pdf --format json
datalab convert document.pdf --format chunks
```

### Processing Options

```bash theme={null}
# Processing modes
datalab convert document.pdf --mode fast       # Lowest latency (default)
datalab convert document.pdf --mode balanced   # Balance of speed and accuracy
datalab convert document.pdf --mode accurate   # Highest accuracy

# Limit pages
datalab convert document.pdf --max_pages 10

# Specific page range (0-indexed)
datalab convert document.pdf --page_range "0-5,10,15-20"

# For spreadsheets, page_range filters by sheet index
datalab convert workbook.xlsx --page_range "0,2"

# Add page delimiters
datalab convert document.pdf --paginate
```
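
Because `--page_range` is 0-indexed, page numbers read from a PDF viewer (which usually start at 1) need to be shifted down by one. The following is a minimal sketch of that conversion; `to_zero_indexed` is a hypothetical helper name, and it assumes the input contains only plain decimal numbers, commas, and hyphens.

```bash
# Hypothetical helper (not part of the Datalab CLI): convert a 1-indexed
# page spec such as "1-6,11" into the 0-indexed form --page_range expects.
to_zero_indexed() {
  local spec="$1" out="" part start end
  local IFS=','                       # split the spec on commas
  for part in $spec; do
    case "$part" in
      *-*)                            # a range such as 16-21
        start="${part%-*}"
        end="${part#*-}"
        part="$((start - 1))-$((end - 1))"
        ;;
      *)                              # a single page such as 11
        part="$((part - 1))"
        ;;
    esac
    out="${out:+$out,}$part"          # re-join with commas
  done
  printf '%s\n' "$out"
}

to_zero_indexed "1-6,11,16-21"   # prints 0-5,10,15-20

# Usage: datalab convert document.pdf --page_range "$(to_zero_indexed "1-6,11")"
```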

### Advanced Options

```bash theme={null}
# Add block IDs for citations (HTML only)
datalab convert document.pdf --format html --add_block_ids

# Disable image extraction
datalab convert document.pdf --disable_image_extraction

# Disable image captions
datalab convert document.pdf --disable_image_captions

# Skip cached results
datalab convert document.pdf --skip_cache
```

### Directory Processing

Convert all documents in a directory:

```bash theme={null}
# Convert all supported files
datalab convert ./documents/ --output_dir ./output/

# Filter by extension
datalab convert ./documents/ --extensions pdf,docx

# Control concurrency
datalab convert ./documents/ --max_concurrent 5
```
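
Before converting a large directory, it can be useful to preview roughly which files an `--extensions` filter would select. The sketch below approximates that with `find`. `list_matching` is a hypothetical helper name, and its matching rules (recursive, case-sensitive) are assumptions; the CLI's own rules may differ.

```bash
# Hypothetical helper (not part of the Datalab CLI): list files under a
# directory whose extension appears in a comma-separated list.
# Assumption: recursive, case-sensitive matching; the CLI may behave differently.
list_matching() {
  local dir="$1" exts="$2" ext
  local IFS=','                       # split the extension list on commas
  for ext in $exts; do
    find "$dir" -type f -name "*.${ext}"
  done | sort
}

# Usage: list_matching ./documents pdf,docx
```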

### Convert Command Reference

| Option                       | Description                                         |
| ---------------------------- | --------------------------------------------------- |
| `--format`                   | Output format: `markdown`, `html`, `json`, `chunks` |
| `--mode`                     | Mode: `fast` (default), `balanced`, `accurate`      |
| `--output_dir`, `-o`         | Output directory                                    |
| `--max_pages`                | Maximum pages to process                            |
| `--page_range`               | 0-indexed pages or sheet indexes (e.g., `"0-5,10"`) |
| `--paginate`                 | Add page delimiters                                 |
| `--add_block_ids`            | Add block IDs to HTML output                        |
| `--disable_image_extraction` | Don't extract images                                |
| `--disable_image_captions`   | Don't generate image captions                       |
| `--skip_cache`               | Force reprocessing                                  |
| `--extensions`               | File extensions to process (for directories)        |
| `--max_concurrent`           | Maximum concurrent requests                         |
| `--max_polls`                | Maximum polling attempts                            |
| `--poll_interval`            | Seconds between polls                               |
| `--api_key`                  | Datalab API key                                     |
| `--base_url`                 | API base URL                                        |
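
When tuning `--max_polls` and `--poll_interval`, a rough upper bound on how long the CLI waits for a result is their product. This is a sketch under the assumption that the CLI sleeps `--poll_interval` seconds between each of `--max_polls` attempts. The values are illustrative whole numbers, not the CLI's defaults, and `max_wait_seconds` is a hypothetical helper name.

```bash
# Hypothetical helper (not part of the Datalab CLI): estimate the longest
# wait in seconds. Assumes one sleep of poll_interval per polling attempt.
max_wait_seconds() {
  local max_polls="$1" poll_interval="$2"
  echo "$(( max_polls * poll_interval ))"
}

max_wait_seconds 120 5   # prints 600

# Usage: datalab convert document.pdf --max_polls 120 --poll_interval 5
```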

## Extract Structured Data

Extract structured data from documents using a JSON schema.

### Basic Usage

```bash theme={null}
# Extract data using a page schema
datalab extract invoice.pdf \
  --page_schema '{"invoice_number": {"type": "string"}, "total": {"type": "number"}}'

# Extract with a specific mode
datalab extract invoice.pdf \
  --page_schema '{"title": {"type": "string"}}' \
  --mode accurate

# Extract using a checkpoint from a previous conversion
datalab extract invoice.pdf \
  --page_schema '{"total": {"type": "number"}}' \
  --checkpoint_id "ckpt_abc123"
```
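
Longer schemas are awkward to quote inline. One option, sketched below, is to keep the schema in a file and pass its contents to `--page_schema`. The temporary file, the optional `python3` check, and the dry-run printing are choices made for this example; nothing here assumes the CLI accepts a file path for `--page_schema`.

```bash
# Write the schema with a quoted heredoc so the shell leaves the JSON
# untouched (no variable expansion, no quote escaping).
schema_file="$(mktemp)"
cat > "$schema_file" <<'JSON'
{
  "invoice_number": {"type": "string"},
  "total": {"type": "number"}
}
JSON

# Optional sanity check: only runs when python3 is available.
if command -v python3 >/dev/null 2>&1; then
  python3 -m json.tool "$schema_file" >/dev/null || echo "invalid JSON" >&2
fi

# Build the command as an array so the schema stays a single argument.
cmd=(datalab extract invoice.pdf --page_schema "$(cat "$schema_file")")

# Dry run: print the command. Run it for real with: "${cmd[@]}"
printf '%q ' "${cmd[@]}"; echo
```

Building the command as an array avoids word splitting on the spaces and newlines inside the schema.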

### Extract Command Reference

| Option               | Description                                           |
| -------------------- | ----------------------------------------------------- |
| `--page_schema`      | **(Required)** JSON schema defining fields to extract |
| `--checkpoint_id`    | Checkpoint ID from a previous conversion              |
| `--format`           | Output format: `markdown`, `html`, `json`, `chunks`   |
| `--mode`             | Processing mode: `fast`, `balanced`, `accurate`       |
| `--output_dir`, `-o` | Output directory                                      |
| `--max_pages`        | Maximum pages to process                              |
| `--page_range`       | Specific pages (e.g., `"0-5,10"`)                     |
| `--skip_cache`       | Force reprocessing                                    |
| `--api_key`          | Datalab API key                                       |
| `--base_url`         | API base URL                                          |

## Segment Documents

Segment documents into logical sections using a schema.

### Basic Usage

```bash theme={null}
# Segment a document
datalab segment report.pdf \
  --segmentation_schema '{"sections": [{"name": "intro", "description": "Introduction"}, {"name": "body", "description": "Main content"}]}'

# Segment with a checkpoint
datalab segment report.pdf \
  --segmentation_schema '{"sections": [{"name": "summary", "description": "Executive summary"}]}' \
  --checkpoint_id "ckpt_abc123"
```

### Segment Command Reference

| Option                  | Description                                                        |
| ----------------------- | ------------------------------------------------------------------ |
| `--segmentation_schema` | **(Required)** JSON schema defining segment names and descriptions |
| `--checkpoint_id`       | Checkpoint ID from a previous conversion                           |
| `--mode`                | Processing mode: `fast`, `balanced`, `accurate`                    |
| `--output_dir`, `-o`    | Output directory                                                   |
| `--max_pages`           | Maximum pages to process                                           |
| `--page_range`          | Specific pages (e.g., `"0-5,10"`)                                  |
| `--skip_cache`          | Force reprocessing                                                 |
| `--api_key`             | Datalab API key                                                    |
| `--base_url`            | API base URL                                                       |

## Track Changes

Extract tracked changes from DOCX documents.

### Basic Usage

```bash theme={null}
# Extract tracked changes from a Word document
datalab track-changes contract.docx

# Specify output format
datalab track-changes contract.docx --format html

# With pagination
datalab track-changes contract.docx --format html --paginate
```

### Track Changes Command Reference

| Option               | Description                                                                       |
| -------------------- | --------------------------------------------------------------------------------- |
| `--format`           | Comma-separated output formats: `markdown`, `html`, `chunks` (default: all three) |
| `--paginate`         | Add page delimiters to output                                                     |
| `--output_dir`, `-o` | Output directory                                                                  |
| `--api_key`          | Datalab API key                                                                   |
| `--base_url`         | API base URL                                                                      |

## Custom Processor

<Warning>
  The `custom-pipeline` command name is deprecated and will change in a future SDK release. The command still works and calls the new `/api/v1/custom-processor` endpoint internally.
</Warning>

Execute a custom processor on a document.

### Basic Usage

```bash theme={null}
# Run a custom processor
datalab custom-pipeline document.pdf --pipeline_id "cp_XXXXX"

# Run with evaluation
datalab custom-pipeline document.pdf \
  --pipeline_id "cp_XXXXX" \
  --run_eval

# Specify format and mode
datalab custom-pipeline document.pdf \
  --pipeline_id "cp_XXXXX" \
  --format json \
  --mode accurate
```

### Custom Processor Command Reference

| Option               | Description                                         |
| -------------------- | --------------------------------------------------- |
| `--pipeline_id`      | **(Required)** Custom processor ID (`cp_XXXXX`)     |
| `--run_eval`         | Run evaluation rules for the processor              |
| `--format`           | Output format: `markdown`, `html`, `json`, `chunks` |
| `--mode`             | Processing mode: `fast`, `balanced`, `accurate`     |
| `--output_dir`, `-o` | Output directory                                    |
| `--api_key`          | Datalab API key                                     |
| `--base_url`         | API base URL                                        |

## Create Document

Create a DOCX document with tracked changes from markdown.

### Basic Usage

```bash theme={null}
# Create a document from a markdown file
datalab create-document --markdown input.md --output output.docx

# Create a document from inline markdown content
# ($'...' makes bash turn \n into real newlines; plain "..." passes a literal \n)
datalab create-document \
  --markdown $'# Title\n\nDocument content here.' \
  --output document.docx
```

### Create Document Command Reference

| Option           | Description                                                |
| ---------------- | ---------------------------------------------------------- |
| `--markdown`     | **(Required)** Markdown content or path to a markdown file |
| `--output`, `-o` | **(Required)** Output file path for the generated DOCX     |
| `--api_key`      | Datalab API key                                            |
| `--base_url`     | API base URL                                               |

## Examples

### Batch Convert PDFs

```bash theme={null}
# Convert all PDFs in a directory with accurate mode
datalab convert ./invoices/ \
  --extensions pdf \
  --mode accurate \
  --format json \
  --output_dir ./processed/
```

### Extract Data from Documents

```bash theme={null}
# Extract structured data using a schema
datalab extract invoice.pdf \
  --page_schema '{
    "invoice_number": {"type": "string", "description": "Invoice ID"},
    "total": {"type": "number", "description": "Total amount"},
    "vendor": {"type": "string", "description": "Vendor name"}
  }' \
  --mode balanced \
  --output_dir ./extracted/
```

### High-Throughput Processing

```bash theme={null}
# Process many files with high concurrency
datalab convert ./documents/ \
  --max_concurrent 10 \
  --mode fast \
  --output_dir ./output/
```
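
When it matters which individual files failed, one alternative to directory mode is to loop over the files yourself and log failures. The sketch below assumes `datalab convert` exits with a non-zero status when a conversion fails, which is not confirmed here. `convert_each` and `failed.txt` are hypothetical names chosen for the example.

```bash
# Hypothetical wrapper (not part of the Datalab CLI): convert PDFs one at
# a time and record the ones that fail.
# Assumption: `datalab convert` exits non-zero when a conversion fails.
convert_each() {
  local in_dir="$1" out_dir="$2" failed=0 f
  mkdir -p "$out_dir"
  for f in "$in_dir"/*.pdf; do
    [ -e "$f" ] || continue           # skip the literal pattern if nothing matches
    if ! datalab convert "$f" --output_dir "$out_dir"; then
      echo "$f" >> "$out_dir/failed.txt"
      failed=$((failed + 1))
    fi
  done
  echo "failed: $failed"
  [ "$failed" -eq 0 ]                 # non-zero status if anything failed
}

# Usage: convert_each ./documents ./output
```

This trades the concurrency of directory mode for a per-file record, so it suits small batches or retry passes.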

## Getting Help

```bash theme={null}
# General help
datalab --help

# Command-specific help
datalab convert --help
datalab extract --help
datalab segment --help
datalab track-changes --help
datalab custom-pipeline --help
datalab create-document --help
```

## Next Steps

<CardGroup cols={2}>
  <Card title="Quickstart" icon="bolt" href="/docs/welcome/quickstart">
    Get up and running with Datalab in minutes.
  </Card>

  <Card title="Batch Processing" icon="layer-group" href="/docs/recipes/conversion/batch-documents">
    Process multiple documents efficiently in parallel.
  </Card>

  <Card title="SDK Reference" icon="code" href="/docs/welcome/sdk">
    Explore the full Python SDK for advanced usage.
  </Card>

  <Card title="Supported File Types" icon="file-circle-check" href="/docs/common/supportedfiletypes">
    See all document formats supported by Datalab.
  </Card>
</CardGroup>
