Custom Processors are currently in beta. Contact support@datalab.to for access.
Custom processors customize the output of the convert processor. When standard conversion doesn’t produce exactly what you need (edge-case layouts, domain-specific formatting, or use-case-specific output transformations), custom processors let you fine-tune the result.

Before you begin, make sure you have:
  1. A Datalab account with an API key (new accounts include $5 in free credits)
  2. Python 3.10+ installed
  3. The Datalab SDK: pip install datalab-python-sdk
  4. Your DATALAB_API_KEY environment variable set
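
The prerequisites above can be checked before any SDK call is made. The following is a minimal sketch using only the standard library; the helper name check_prerequisites is hypothetical and not part of the Datalab SDK, and it checks only the Python version and the environment variable, not whether the key is valid or the SDK is installed.

```python
import os
import sys


def check_prerequisites(env=None, version_info=None):
    """Return a list of problems; an empty list means the basics are in place.

    Hypothetical helper, not part of the Datalab SDK.
    """
    env = os.environ if env is None else env
    version_info = sys.version_info if version_info is None else version_info

    problems = []
    # The docs ask for Python 3.10 or newer.
    if tuple(version_info[:2]) < (3, 10):
        problems.append("Python 3.10+ is required")
    # An unset or empty DATALAB_API_KEY is treated as missing.
    if not env.get("DATALAB_API_KEY"):
        problems.append("DATALAB_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(f"Missing prerequisite: {problem}")
```

The env and version_info parameters exist only so the check can be exercised without touching the real environment.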

How Custom Processors Work

A custom processor applies modifications on top of document conversion. The flow is:
  1. The convert processor parses your document into structured output
  2. The custom processor applies your modifications to refine that output
Modifications can operate at different levels:
  • Block-level — Modify individual blocks (e.g., rewrite table captions, summarize content)
  • Page-level — Modify entire pages with full structural control (e.g., reorder blocks, add/remove elements)
  • Classification — Classify pages into categories for downstream routing
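
To make the three levels concrete, here is a purely conceptual sketch in plain Python. It is not how Datalab implements custom processors, and the Block type and function names are hypothetical stand-ins. It only illustrates the difference in scope: a block-level change sees one block at a time, a page-level change sees the whole page, and classification returns a label.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Block:
    """Hypothetical stand-in for one parsed block; not an SDK type."""
    kind: str
    text: str


def block_level(page, fn):
    """Block-level: transform each block independently; order is preserved."""
    return [fn(block) for block in page]


def page_level(page, fn):
    """Page-level: fn receives the whole page and may reorder or drop blocks."""
    return fn(list(page))


def classify(page):
    """Classification: label a page so later steps can route it."""
    return "tabular" if any(block.kind == "table" for block in page) else "text"


page = [Block("table", "Q1 | Q2"), Block("caption", "table 1: revenue")]

# Block-level example: tidy caption text, leave other blocks untouched.
tidy = block_level(
    page,
    lambda b: replace(b, text=b.text.capitalize()) if b.kind == "caption" else b,
)

# Page-level example: move captions ahead of the other blocks (stable sort).
reordered = page_level(
    tidy, lambda blocks: sorted(blocks, key=lambda b: b.kind != "caption")
)
```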

Creating a Custom Processor

The recommended way to create a custom processor is through Forge. The creation flow is a 3-step guided wizard:
  1. Describe — Use the chat-driven builder to articulate what your processor should do. Describe your goal in natural language (e.g., “Summarize all tables into bullet points” or “Extract only the financial data sections”) and the AI assistant will help you refine and confirm the specification before generating the processor.
  2. Documents — Upload example documents that represent your use case. These are used to generate and validate the processor configuration.
  3. Review — See the generated processor run on your examples. If the results aren’t right, use the Improve tab in the sidebar to describe what to change and generate a new version. The History tab shows all past versions and lets you revert to any of them; Details shows the active configuration.
Each custom processor gets an ID in the format cp_XXXXX.

Using a Custom Processor

Standalone

Run a custom processor directly on a document:
from datalab_sdk import DatalabClient, CustomProcessorOptions

client = DatalabClient()

options = CustomProcessorOptions(
    pipeline_id="cp_abc123",    # Your custom processor ID
    mode="balanced",
    output_format="markdown",
)

result = client.run_custom_processor("document.pdf", options=options)
print(result.markdown)

In a Pipeline

Use a custom processor as a step in a pipeline by adding a step of type "custom" that references its ID:
from datalab_sdk import DatalabClient, PipelineProcessor

client = DatalabClient()

pipeline = client.create_pipeline(steps=[
    PipelineProcessor(type="convert", settings={"mode": "balanced"}),
    PipelineProcessor(type="custom", settings={}, custom_processor_id="cp_abc123"),
    PipelineProcessor(type="extract", settings={
        "page_schema": {
            "type": "object",
            "properties": {
                "summary": {"type": "string"}
            }
        }
    })
])
This chains convert → custom → extract: the document is parsed, your custom modifications are applied, then structured data is extracted from the customized output.
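
Because a custom processor refines convert output, it can help to sanity-check step order before creating a pipeline. The sketch below works on plain step-type strings and does not call the SDK. The function name is hypothetical, and the rule it encodes (a "custom" step needs an earlier "convert" step) is inferred from the flow described here; the authoritative composition rules are in the Pipeline Overview.

```python
def custom_steps_follow_convert(step_types):
    """Return True if every "custom" step has a "convert" step somewhere before it.

    Assumption inferred from the described flow, not an official rule.
    """
    seen_convert = False
    for step_type in step_types:
        if step_type == "convert":
            seen_convert = True
        elif step_type == "custom" and not seen_convert:
            return False
    return True


chain = ["convert", "custom", "extract"]
status = "ok" if custom_steps_follow_convert(chain) else "custom step has no convert before it"
print(" -> ".join(chain), status)
```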

CustomProcessorOptions

| Option | Type | Default | Description |
|---|---|---|---|
| pipeline_id | str | Required | Custom processor ID (cp_XXXXX) |
| version | int | Active version | Specific processor version to run |
| run_eval | bool | False | Run evaluation rules after processing |
| mode | str | "fast" | Processing mode: "fast", "balanced", "accurate" |
| output_format | str | "markdown" | Output format: "markdown", "html", "json", "chunks" |
| paginate | bool | False | Add page delimiters |
| add_block_ids | bool | False | Add block IDs for citation tracking |
| disable_image_extraction | bool | False | Don’t extract images |
| disable_image_captions | bool | False | Don’t generate image captions |
| webhook_url | str | - | Webhook URL for completion notification |
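
If option values come from user input or configuration files, a quick local check can catch typos before a request is sent. This is a minimal sketch: validate_options is a hypothetical helper, the allowed values are copied from the table above, and the ID check covers only the documented cp_ prefix because the exact character set of the suffix is not specified here. The SDK or API may apply its own validation as well.

```python
# Allowed values as listed in the options table.
ALLOWED_MODES = {"fast", "balanced", "accurate"}
ALLOWED_OUTPUT_FORMATS = {"markdown", "html", "json", "chunks"}


def validate_options(pipeline_id, mode="fast", output_format="markdown"):
    """Return a list of error messages; an empty list means the values look plausible.

    Hypothetical helper, not part of the Datalab SDK.
    """
    errors = []
    # Only the "cp_" prefix is documented, so nothing more is checked about the ID.
    if not (
        isinstance(pipeline_id, str)
        and pipeline_id.startswith("cp_")
        and len(pipeline_id) > len("cp_")
    ):
        errors.append("pipeline_id should look like cp_XXXXX")
    if mode not in ALLOWED_MODES:
        errors.append(f"mode must be one of {sorted(ALLOWED_MODES)}")
    if output_format not in ALLOWED_OUTPUT_FORMATS:
        errors.append(f"output_format must be one of {sorted(ALLOWED_OUTPUT_FORMATS)}")
    return errors


if __name__ == "__main__":
    for error in validate_options("cp_abc123", mode="balanced"):
        print(error)
```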

Versioning

Custom processors support versioning. Each iteration creates a new version, letting you refine behavior over time:
# List versions
versions = client.list_custom_processor_versions("cp_abc123")
for v in versions["versions"]:
    print(f"v{v.version}: {v.description}")

# Switch active version
client.set_active_processor_version("cp_abc123", version=2)
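
When deciding which version to activate, you may want to pick the most recent one from the listing. The sketch below assumes each version record exposes an integer version attribute, as in the loop above, and that a higher number means a newer version; both are assumptions. It uses SimpleNamespace objects as stand-ins so it runs without the SDK, and latest_version is a hypothetical helper.

```python
from types import SimpleNamespace


def latest_version(versions):
    """Return the record with the highest version number, or None if the list is empty.

    Assumes records have an integer `version` attribute (hypothetical shape).
    """
    return max(versions, key=lambda v: v.version, default=None)


# Stand-in records shaped like the entries printed in the listing loop.
sample = [
    SimpleNamespace(version=1, description="Initial version"),
    SimpleNamespace(version=3, description="Tighter table summaries"),
    SimpleNamespace(version=2, description="Fix caption handling"),
]

newest = latest_version(sample)
print(f"Newest: v{newest.version} ({newest.description})")
```

The resulting number could then be passed as the version argument when switching the active version.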

Managing Custom Processors

# List your custom processors
result = client.list_custom_processors(limit=50)
for p in result["processors"]:
    print(f"{p.processor_id}: {p.name} (v{p.active_version})")

# Archive
client.archive_custom_processor("cp_abc123")

Next Steps

  • Pipeline Overview: Processor types, composition rules, and when to use pipelines.
  • Create a Pipeline: Build pipelines that include custom processors.
  • Document Conversion: Understand the convert processor that custom processors build on.
  • Contact Support: Request beta access to Custom Processors.