Datalab SDK Documentation

The Datalab SDK helps you get started converting documents quickly. It supports:
  • Marker (document conversion) and OCR endpoints
  • Sync and async modes
  • Usage from Python or the CLI
  • Single-file or directory conversion
  • Automatic saving of output
This guide covers installation, authentication, Python usage, and CLI usage.

Installation

pip install datalab-python-sdk

Authentication

Set your API key as an environment variable:
export DATALAB_API_KEY="your_api_key_here"
Or pass it directly to the client:
from datalab_sdk import DatalabClient
client = DatalabClient(api_key="your_api_key_here")
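
Since the key can come from either place, it may be convenient to resolve it once and fail early when it is missing. The sketch below is illustrative only: resolve_api_key is a hypothetical helper, not part of the SDK, and it assumes that an explicitly passed key should take precedence over DATALAB_API_KEY.

```python
import os
from typing import Optional


def resolve_api_key(explicit_key: Optional[str] = None) -> str:
    """Return the API key to use (hypothetical helper, not part of the SDK)."""
    # Assumption: an explicitly passed key wins over the environment variable.
    key = explicit_key or os.environ.get("DATALAB_API_KEY")
    if not key:
        raise RuntimeError(
            "No API key found: set DATALAB_API_KEY or pass api_key explicitly."
        )
    return key
```

The returned value can then be passed on, for example as DatalabClient(api_key=resolve_api_key()).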

Python Usage

Convert

The convert method uses the Marker endpoint to convert documents to Markdown, HTML, or JSON.

Basic Usage

from datalab_sdk import DatalabClient
from datalab_sdk.models import ConvertOptions

# Synchronous client
client = DatalabClient()

# Basic conversion to markdown
result = client.convert("document.pdf")
print(result.markdown)

# With options
options = ConvertOptions(
    output_format="html",
    max_pages=10,
    force_ocr=True,
    use_llm=True
)
result = client.convert("document.pdf", options=options)
print(result.html)

Async Usage

import asyncio
from datalab_sdk import AsyncDatalabClient
from datalab_sdk.models import ConvertOptions

async def convert_document():
    async with AsyncDatalabClient() as client:
        options = ConvertOptions(
            output_format="json",
            paginate=True,
            max_pages=5
        )
        result = await client.convert("document.pdf", options=options)
        return result.json

# Run async function
result = asyncio.run(convert_document())
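
Async mode is most useful when several documents are converted at once. The following is a minimal sketch, not an SDK feature: convert_many is a hypothetical helper that only assumes the client exposes an awaitable convert(path) method, as AsyncDatalabClient does in the example above. The default limit of 5 concurrent requests is an arbitrary assumption.

```python
import asyncio
from typing import Any, Iterable, List


async def convert_many(
    client: Any, paths: Iterable[str], max_concurrent: int = 5
) -> List[Any]:
    """Convert several files concurrently with one shared async client.

    Hypothetical helper: `client` only needs an awaitable `convert(path)`.
    """
    # Cap the number of in-flight requests (the default of 5 is an assumption).
    semaphore = asyncio.Semaphore(max_concurrent)

    async def convert_one(path: str) -> Any:
        async with semaphore:
            return await client.convert(path)

    # gather() returns results in the same order as the input paths.
    return await asyncio.gather(*(convert_one(p) for p in paths))
```

Inside an "async with AsyncDatalabClient() as client:" block this could be called as results = await convert_many(client, ["a.pdf", "b.pdf"]).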

Convert Options

The ConvertOptions class supports all Marker endpoint parameters. See the Marker endpoint documentation for details on each parameter.

Conversion Result

The ConversionResult object contains the converted content and metadata. See the Marker endpoint documentation for details on the return fields.
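
The examples in this guide read the converted content from result.markdown, result.html, or result.json, depending on the requested output format. A small sketch of selecting the matching attribute is shown below. content_for_format is a hypothetical helper, and it assumes that only those three attributes are relevant and that Markdown is the default format, as the basic example suggests.

```python
from typing import Any

# Formats and attribute names as they appear in the examples in this guide.
SUPPORTED_FORMATS = ("markdown", "html", "json")


def content_for_format(result: Any, output_format: str = "markdown") -> Any:
    """Return the attribute of `result` that matches `output_format`.

    Hypothetical helper: assumes the attribute is named after the format.
    """
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported output format: {output_format!r}")
    return getattr(result, output_format)
```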

Output Saving

You can save conversion results either during the conversion or afterward:
# Save during conversion
result = client.convert("document.pdf", save_output="output_folder/document")

# Or save afterward
result.save_output("output_folder/document", save_images=True)

OCR

The OCR functionality extracts text with detailed positional information from documents.

Basic Usage

from datalab_sdk import DatalabClient
from datalab_sdk.models import OCROptions

# Synchronous client
client = DatalabClient()

# Basic OCR
result = client.ocr("document.pdf")
text = result.get_text()
print(text)

# With options
options = OCROptions(max_pages=5)
result = client.ocr("document.pdf", options=options)

Async Usage

import asyncio
from datalab_sdk import AsyncDatalabClient
from datalab_sdk.models import OCROptions

async def ocr_document():
    async with AsyncDatalabClient() as client:
        options = OCROptions(max_pages=3)
        result = await client.ocr("document.pdf", options=options)
        return result.get_text()

# Run async function
text = asyncio.run(ocr_document())

OCR Options

The OCROptions class supports OCR-specific parameters. See the OCR endpoint documentation for details on each parameter.

OCR Result

The OCRResult object contains detailed text and positional information. See the OCR endpoint documentation for details on the return fields.

Output Saving

# Save during OCR
result = client.ocr("document.pdf", save_output="output_folder/ocr_result")

# Or save afterward
result.save_output("output_folder/ocr_result")
# Creates: ocr_result.txt (plain text) and ocr_result.ocr.json (detailed data)
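
Saved OCR output can be read back with the standard library alone. The sketch below assumes the two files named in the comment above are written next to the base path passed to save_output, and that both are UTF-8 encoded. load_saved_ocr is a hypothetical helper, and the structure of the JSON data is not assumed here.

```python
import json
from pathlib import Path
from typing import Any, Tuple


def load_saved_ocr(base_path: str) -> Tuple[str, Any]:
    """Load <base>.txt and <base>.ocr.json (hypothetical helper)."""
    base = Path(base_path)
    # Assumption: file names follow the "Creates:" comment in this guide.
    text_path = base.parent / (base.name + ".txt")
    json_path = base.parent / (base.name + ".ocr.json")
    text = text_path.read_text(encoding="utf-8")
    data = json.loads(json_path.read_text(encoding="utf-8"))
    return text, data
```

For the example above, this would be called as text, data = load_saved_ocr("output_folder/ocr_result").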

CLI Usage

Authentication

Pass the --api_key option or set the DATALAB_API_KEY environment variable.

Convert

Examples

# Convert to JSON format with LLM enhancement
datalab-sdk convert document.pdf --format json --use_llm

# Process only PDFs and Word docs in a directory
datalab-sdk convert /docs/ --extensions pdf,docx --max_concurrent 10

# Force OCR with custom page range
datalab-sdk convert document.pdf --force_ocr --page_range "0-5"

# Save to specific output directory
datalab-sdk convert document.pdf --output_dir /output/ --paginate
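
To illustrate what an extension filter such as --extensions pdf,docx selects, here is a rough Python equivalent. It is a sketch under stated assumptions, not the CLI's actual implementation: find_documents is a hypothetical helper, matching is assumed to be case-insensitive, and subdirectories are not searched.

```python
from pathlib import Path
from typing import List


def find_documents(directory: str, extensions: str = "pdf,docx") -> List[Path]:
    """List files in `directory` whose suffix is in a comma-separated list.

    Hypothetical helper; the CLI may filter differently.
    """
    # Normalize "pdf", ".PDF", " pdf " and similar to ".pdf".
    wanted = {
        "." + ext.strip().lower().lstrip(".")
        for ext in extensions.split(",")
        if ext.strip()
    }
    # Assumption: only the top level of the directory is scanned.
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in wanted
    )
```

Each returned path could then be passed to client.convert in turn.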

OCR

Examples

# OCR specific pages
datalab-sdk ocr document.pdf --page_range "0-2,5,7"

# Process images with custom extensions
datalab-sdk ocr /images/ --extensions png,jpg --max_concurrent 5

# OCR with custom output directory
datalab-sdk ocr document.pdf --output_dir /ocr_results/
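
The --page_range values used above, such as "0-2,5,7", combine ranges and single pages. The sketch below shows one way such a string could be expanded into page numbers. It assumes that ranges are inclusive and that pages are zero-indexed, which this guide does not state explicitly. parse_page_range is a hypothetical helper, not part of the SDK.

```python
from typing import List


def parse_page_range(spec: str) -> List[int]:
    """Expand a spec like "0-2,5,7" into a sorted list of page numbers.

    Hypothetical helper. Assumption: ranges are inclusive on both ends.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)
```

Under those assumptions, parse_page_range("0-2,5,7") returns [0, 1, 2, 5, 7].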