# Custom Processors

> Fine-tune document conversion output with AI-generated custom processors.

Custom processors refine the output of the `convert` processor. When standard conversion doesn't produce exactly what you need (edge-case layouts, domain-specific formatting, or output transformations specific to your use case), a custom processor lets you fine-tune the result.

**Before you begin**, make sure you have:

1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits)
2. Python 3.10+ installed
3. The Datalab SDK: `pip install datalab-python-sdk`
4. Your `DATALAB_API_KEY` environment variable set

## How Custom Processors Work

A custom processor applies modifications on top of document conversion. The flow is:

1. The `convert` processor parses your document into structured output
2. The custom processor applies your modifications to refine that output

Modifications can operate at different levels:

* **Block-level** — Modify individual blocks (e.g., rewrite table captions, summarize content)
* **Page-level** — Modify entire pages with full structural control (e.g., reorder blocks, add/remove elements)
* **Classification** — Classify pages into categories for downstream routing
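
To make the three levels concrete, here is a minimal, illustrative sketch in plain Python. It does not use the Datalab SDK: `Block`, `Page`, and the three functions are hypothetical stand-ins for converted output and for the kind of modification each level performs.

```python
from dataclasses import dataclass, replace

# Hypothetical stand-ins for converted output; these are NOT Datalab SDK types.
@dataclass(frozen=True)
class Block:
    block_type: str
    text: str

Page = list[Block]

def uppercase_captions(block: Block) -> Block:
    """Block-level: rewrite a single block and leave all others untouched."""
    if block.block_type == "caption":
        return replace(block, text=block.text.upper())
    return block

def tables_first(page: Page) -> Page:
    """Page-level: restructure a whole page (here, move tables to the top)."""
    # sorted() is stable, so non-table blocks keep their relative order.
    return sorted(page, key=lambda b: b.block_type != "table")

def classify(page: Page) -> str:
    """Classification: label a page so later steps can route it."""
    return "tabular" if any(b.block_type == "table" for b in page) else "text"

page = [Block("text", "Intro"), Block("table", "| a | b |"), Block("caption", "Table 1")]
block_level = [uppercase_captions(b) for b in page]
page_level = tables_first(page)
label = classify(page)
```

A block-level change sees one block at a time, a page-level change sees the whole page and may reorder or drop blocks, and classification returns a label instead of changing content.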

## Creating a Custom Processor

The recommended way to create a custom processor is through [Forge](https://www.datalab.to/app/playground). The creation flow is a 3-step guided wizard:

1. **Describe** — Use the chat-driven builder to articulate what your processor should do. Describe your goal in natural language (e.g., "Summarize all tables into bullet points" or "Extract only the financial data sections") and the AI assistant will help you refine and confirm the specification before generating the processor.
2. **Documents** — Upload example documents that represent your use case. These are used to generate and validate the processor configuration.
3. **Review** — See the generated processor run on your examples. If the results aren't right, use the **Improve** tab in the sidebar to describe what to change and generate a new version. The **History** tab shows all past versions and lets you revert to any of them; **Details** shows the active configuration.

Each custom processor gets an ID in the format `cp_XXXXX`.

## Using a Custom Processor

### Standalone

Run a custom processor directly on a document:

<CodeGroup>
  ```python Python SDK
  from datalab_sdk import DatalabClient, CustomProcessorOptions

  client = DatalabClient()

  options = CustomProcessorOptions(
      pipeline_id="cp_abc123",    # Your custom processor ID
      mode="balanced",
      output_format="markdown",
  )

  result = client.run_custom_processor("document.pdf", options=options)
  print(result.markdown)
  ```

  ```bash cURL
  curl -X POST https://www.datalab.to/api/v1/custom-processor \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -F "file=@document.pdf" \
    -F "pipeline_id=cp_abc123" \
    -F "mode=balanced" \
    -F "output_format=markdown"
  ```

  ```python Python (requests)
  import os
  import time

  import requests

  url = "https://www.datalab.to/api/v1/custom-processor"
  headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")}

  # Submit the document; the response contains a URL to poll for the result
  with open("document.pdf", "rb") as f:
      resp = requests.post(url, headers=headers,
          files={"file": ("document.pdf", f, "application/pdf")},
          data={
              "pipeline_id": "cp_abc123",
              "mode": "balanced",
              "output_format": "markdown"
          })
  resp.raise_for_status()  # Stop early on HTTP errors such as a bad API key

  check_url = resp.json()["request_check_url"]

  # Poll every 2 seconds, up to 300 times (about 10 minutes)
  for _ in range(300):
      result = requests.get(check_url, headers=headers).json()
      if result["status"] == "complete":
          print(result["markdown"])
          break
      time.sleep(2)
  else:
      raise TimeoutError("Processing did not complete within the polling window")
  ```
</CodeGroup>
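
The polling loop in the `requests` example above can be factored into a small reusable helper. This is a minimal sketch, not part of the Datalab SDK: `poll_until_complete` and its parameters are hypothetical names, and it assumes only what the example shows, namely that the check URL returns JSON whose `status` field becomes `"complete"`.

```python
import time
from typing import Callable

def poll_until_complete(
    fetch: Callable[[], dict],
    max_polls: int = 300,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until the payload reports status == "complete".

    fetch is any zero-argument callable returning the decoded JSON payload.
    Raises TimeoutError if max_polls attempts pass without completion.
    """
    for _ in range(max_polls):
        payload = fetch()
        if payload.get("status") == "complete":
            return payload
        sleep(interval)  # Wait before asking again
    raise TimeoutError(f"not complete after {max_polls} polls")
```

With the variables from the `requests` example, a call might look like `poll_until_complete(lambda: requests.get(check_url, headers=headers).json())`. Passing `fetch` and `sleep` as arguments keeps the helper independent of any HTTP library and easy to test without network access.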

### In a Pipeline

To use a custom processor inside a pipeline, add a step of type `custom` and pass the processor's ID:

```python
from datalab_sdk import DatalabClient, PipelineProcessor

client = DatalabClient()

pipeline = client.create_pipeline(steps=[
    PipelineProcessor(type="convert", settings={"mode": "balanced"}),
    PipelineProcessor(type="custom", settings={}, custom_processor_id="cp_abc123"),
    PipelineProcessor(type="extract", settings={
        "page_schema": {
            "type": "object",
            "properties": {
                "summary": {"type": "string"}
            }
        }
    })
])
```

This chains convert → custom → extract: the document is parsed, your custom modifications are applied, then structured data is extracted from the customized output.

## CustomProcessorOptions

| Option                     | Type | Default        | Description                                                 |
| -------------------------- | ---- | -------------- | ----------------------------------------------------------- |
| `pipeline_id`              | str  | Required       | Custom processor ID (`cp_XXXXX`)                            |
| `version`                  | int  | Active version | Specific processor version to run                           |
| `run_eval`                 | bool | `False`        | Run evaluation rules after processing                       |
| `mode`                     | str  | `"fast"`       | Processing mode: `"fast"`, `"balanced"`, `"accurate"`       |
| `output_format`            | str  | `"markdown"`   | Output format: `"markdown"`, `"html"`, `"json"`, `"chunks"` |
| `paginate`                 | bool | `False`        | Add page delimiters                                         |
| `add_block_ids`            | bool | `False`        | Add block IDs for citation tracking                         |
| `disable_image_extraction` | bool | `False`        | Don't extract images                                        |
| `disable_image_captions`   | bool | `False`        | Don't generate image captions                               |
| `webhook_url`              | str  | -              | Webhook URL for completion notification                     |
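
When calling the REST endpoint directly, it can help to validate option values before uploading a document. The sketch below is illustrative only: `build_form_fields` is a hypothetical helper, not an SDK function. It covers just the three form fields shown in the cURL example and takes its allowed values and defaults from the table above.

```python
# Allowed values copied from the options table above.
MODES = {"fast", "balanced", "accurate"}
OUTPUT_FORMATS = {"markdown", "html", "json", "chunks"}

def build_form_fields(
    pipeline_id: str,
    mode: str = "fast",
    output_format: str = "markdown",
) -> dict:
    """Validate options and return a dict suitable for requests.post(data=...)."""
    if not pipeline_id.startswith("cp_"):
        raise ValueError(f"pipeline_id should look like cp_XXXXX, got {pipeline_id!r}")
    if mode not in MODES:
        raise ValueError(f"mode must be one of {sorted(MODES)}, got {mode!r}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(
            f"output_format must be one of {sorted(OUTPUT_FORMATS)}, got {output_format!r}"
        )
    return {"pipeline_id": pipeline_id, "mode": mode, "output_format": output_format}

fields = build_form_fields("cp_abc123", mode="balanced")
```

Catching a misspelled `mode` or `output_format` locally avoids a round trip to the API for a request that would be rejected or behave unexpectedly.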

## Versioning

Custom processors support versioning. Each iteration creates a new version, letting you refine behavior over time:

```python
from datalab_sdk import DatalabClient

client = DatalabClient()

# List all versions of a processor
versions = client.list_custom_processor_versions("cp_abc123")
for v in versions["versions"]:
    print(f"v{v.version}: {v.description}")

# Switch the active version (used when no explicit `version` is passed)
client.set_active_processor_version("cp_abc123", version=2)
```

## Managing Custom Processors

```python
from datalab_sdk import DatalabClient

client = DatalabClient()

# List your custom processors
result = client.list_custom_processors(limit=50)
for p in result["processors"]:
    print(f"{p.processor_id}: {p.name} (v{p.active_version})")

# Archive a processor you no longer need
client.archive_custom_processor("cp_abc123")
```

## Next Steps

<CardGroup cols={2}>
  <Card title="Pipeline Overview" icon="sitemap" href="/docs/recipes/pipelines/pipeline-overview">
    Processor types, composition rules, and when to use pipelines.
  </Card>

  <Card title="Create a Pipeline" icon="hammer" href="/docs/recipes/pipelines/create-pipeline">
    Build pipelines that include custom processors.
  </Card>

  <Card title="Document Conversion" icon="file-lines" href="/docs/recipes/conversion/conversion-api-overview">
    Understand the convert processor that custom processors build on.
  </Card>

  <Card title="Forge Evals" icon="vials" href="/docs/recipes/forge-evals/overview">
    Evaluate and compare processor configurations across your document collection.
  </Card>
</CardGroup>