# [DEPRECATED] Marker
Source: https://documentation.datalab.to/api-reference/[deprecated]-marker

https://www.datalab.to/openapi.json post /api/v1/marker
**DEPRECATED**: Use the new endpoints instead:
- `/convert` for document conversion
- `/extract` for structured data extraction
- `/segment` for document segmentation
- `/custom-pipeline` for custom pipeline execution

This endpoint will be removed in a future version.


# [DEPRECATED] OCR
Source: https://documentation.datalab.to/api-reference/[deprecated]-ocr

https://www.datalab.to/openapi.json post /api/v1/ocr
[DEPRECATED] This endpoint is deprecated and will be removed in the future.
This endpoint is used to submit a PDF or image for OCR. The OCR text lines are returned along with their bounding-box (bbox) and polygon coordinates.


# [DEPRECATED] Table Recognition
Source: https://documentation.datalab.to/api-reference/[deprecated]-table-recognition

https://www.datalab.to/openapi.json post /api/v1/table_rec
[DEPRECATED] This endpoint is deprecated and will be removed in the future.
This endpoint is used to submit a request for table recognition. The detected tables are returned along with their parsed structure.


# Api Health
Source: https://documentation.datalab.to/api-reference/api-health

https://www.datalab.to/openapi.json get /api/v1/user_health
This endpoint is used to check the health of the API, given an API key.
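
A minimal sketch of assembling this request with Python's standard library. It assumes the API key is sent in the `X-API-Key` header (the header named in the Execute Workflow entry of this reference) and that the `/api/v1` paths are served from `https://www.datalab.to`; neither is confirmed for this endpoint here. `YOUR_API_KEY` is a placeholder.

```python
import urllib.request

BASE_URL = "https://www.datalab.to"  # assumption: host that serves the /api/v1 paths


def build_health_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the keyed health check."""
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/user_health",
        headers={"X-API-Key": api_key},  # assumption: same header as workflow execution
        method="GET",
    )


req = build_health_request("YOUR_API_KEY")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; the response body format is not described in this entry.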


# Add Files To Collection
Source: https://documentation.datalab.to/api-reference/collections/add-files-to-collection

https://www.datalab.to/openapi.json post /api/v1/collections/{collection_id}/files
Link existing uploaded files to a collection.


# Create Collection
Source: https://documentation.datalab.to/api-reference/collections/create-collection

https://www.datalab.to/openapi.json post /api/v1/collections
Create a new collection.


# Delete Collection
Source: https://documentation.datalab.to/api-reference/collections/delete-collection

https://www.datalab.to/openapi.json delete /api/v1/collections/{collection_id}
Soft-delete (archive) collection.


# Get Batch Run
Source: https://documentation.datalab.to/api-reference/collections/get-batch-run

https://www.datalab.to/openapi.json get /api/v1/eval_batch_runs/{run_id}
Get batch run status and progress.


# Get Batch Run Results
Source: https://documentation.datalab.to/api-reference/collections/get-batch-run-results

https://www.datalab.to/openapi.json get /api/v1/eval_batch_runs/{run_id}/results
Get per-file results for a batch run.


# Get Collection
Source: https://documentation.datalab.to/api-reference/collections/get-collection

https://www.datalab.to/openapi.json get /api/v1/collections/{collection_id}
Get collection with file list.


# List Batch Runs
Source: https://documentation.datalab.to/api-reference/collections/list-batch-runs

https://www.datalab.to/openapi.json get /api/v1/eval_batch_runs
List batch runs for the team, optionally filtered by collection, eval rubric, and/or pipeline.


# List Collections
Source: https://documentation.datalab.to/api-reference/collections/list-collections

https://www.datalab.to/openapi.json get /api/v1/collections
List collections for the team.


# Remove File From Collection
Source: https://documentation.datalab.to/api-reference/collections/remove-file-from-collection

https://www.datalab.to/openapi.json delete /api/v1/collections/{collection_id}/files/{uploaded_file_id}
Unlink a file from a collection (does NOT delete the uploaded file).


# Start Batch Run
Source: https://documentation.datalab.to/api-reference/collections/start-batch-run

https://www.datalab.to/openapi.json post /api/v1/eval_batch_runs
Start a batch evaluation run on all files in the collection.


# Update Collection
Source: https://documentation.datalab.to/api-reference/collections/update-collection

https://www.datalab.to/openapi.json put /api/v1/collections/{collection_id}
Update collection name/description.


# Convert Document
Source: https://documentation.datalab.to/api-reference/convert-document

https://www.datalab.to/openapi.json post /api/v1/convert
Convert a PDF, image, or document to markdown, HTML, JSON, or chunks. Use save_checkpoint=true to save parsed state for later /extract or /segment calls.


# Convert Result Check
Source: https://documentation.datalab.to/api-reference/convert-result-check

https://www.datalab.to/openapi.json get /api/v1/convert/{request_id}
Poll this endpoint to check the status of a Convert request and retrieve the converted document.


# Create Document
Source: https://documentation.datalab.to/api-reference/create-document

https://www.datalab.to/openapi.json post /api/v1/create-document
Create a DOCX document from markdown with track changes support. Supports <ins>, <del>, and <comment> tags.
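
A minimal sketch of marking up Markdown with the insertion and deletion tags before submitting it. The helper names are hypothetical, and the attributes each tag accepts (including anything `<comment>` may need) are not described here, so the tags are emitted bare.

```python
def ins(text: str) -> str:
    """Wrap text in an <ins> tag."""
    return f"<ins>{text}</ins>"


def delete(text: str) -> str:
    """Wrap text in a <del> tag."""
    return f"<del>{text}</del>"


def replace(old: str, new: str) -> str:
    """Express a replacement as a deletion followed by an insertion."""
    return delete(old) + ins(new)


markdown = f"# Service Agreement\n\nThe term is {replace('12', '24')} months."
print(markdown)
```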


# Create Document Result Check
Source: https://documentation.datalab.to/api-reference/create-document-result-check

https://www.datalab.to/openapi.json get /api/v1/create-document/{request_id}
Poll this endpoint to check the status of a Create Document request and retrieve the generated document.


# Create Workflow
Source: https://documentation.datalab.to/api-reference/create-workflow

https://www.datalab.to/openapi.json post /api/v1/workflows/workflows
Create a new workflow definition.

Example:
```json
{
  "name": "PDF Processing Pipeline",
  "team_id": 1,
  "steps": [
    {
      "step_key": "marker_parse",
      "unique_name": "parse",
      "settings": {"extract_images": true}
    },
    {
      "step_key": "marker_extract",
      "unique_name": "extract",
      "version": "1.0.0",
      "settings": {},
      "depends_on": ["parse"]
    },
    {
      "step_key": "marker_segment",
      "unique_name": "segment",
      "settings": {"method": "auto"},
      "depends_on": ["parse"]
    }
  ]
}
```

This creates a template that can be executed multiple times.
Note:
- version is optional and defaults to the latest active version
- unique_name is required and must be unique within the workflow
- depends_on references other steps by their unique_name
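
The naming rules in the notes above can be checked on the client before a definition is submitted. The following is a sketch of such a check, not the server's own validation; `validate_workflow` is a hypothetical helper.

```python
def validate_workflow(definition: dict) -> list[str]:
    """Return a list of problems found in a workflow definition (empty if none)."""
    problems = []
    steps = definition.get("steps", [])
    names = [step.get("unique_name") for step in steps]

    # unique_name is required and must be unique within the workflow.
    for i, name in enumerate(names):
        if not name:
            problems.append(f"step {i} is missing unique_name")
        elif names.index(name) != i:
            problems.append(f"duplicate unique_name: {name}")

    # depends_on must reference other steps by their unique_name.
    known = {name for name in names if name}
    for step in steps:
        for dep in step.get("depends_on", []):
            if dep not in known:
                problems.append(f"{step.get('unique_name')} depends on unknown step: {dep}")
    return problems


workflow = {
    "name": "PDF Processing Pipeline",
    "steps": [
        {"step_key": "marker_parse", "unique_name": "parse", "settings": {}},
        {"step_key": "marker_extract", "unique_name": "extract",
         "settings": {}, "depends_on": ["parse"]},
    ],
}
print(validate_workflow(workflow))
```

The check omits `version` on purpose, since the notes state it is optional.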


# Archive Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/archive-custom-pipeline

https://www.datalab.to/openapi.json post /api/v1/custom_pipelines/{processor_id}/archive
Archive a custom processor (soft-delete).
Available to any team member with pipeline access.


# Check Pipeline Access
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/check-pipeline-access

https://www.datalab.to/openapi.json get /api/v1/custom_pipelines/access
Check if the current user's team has access to custom processors.

Custom processors are generally available: every team has access and can
create/iterate. Kept (always-true) for backwards compatibility with deployed
frontends and API integrations that still poll this endpoint; creation volume
is governed by the per-plan creation allowance, not by an access gate.


# Delete Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/delete-custom-pipeline

https://www.datalab.to/openapi.json delete /api/v1/custom_pipelines/{processor_id}
Permanently delete a custom processor and all its versions. Admin-only.


# Describe Customizer
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/describe-customizer

https://www.datalab.to/openapi.json post /api/v1/custom_pipelines/describe
Conversational endpoint for building a custom processor description.
Accepts the chat history, returns the next assistant message.
When the system has enough context, includes a proposed_description.


# Export Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/export-custom-pipeline

https://www.datalab.to/openapi.json get /api/v1/custom_pipelines/{processor_id}/export
Export a custom processor with all versions. Admin-only.

Note: This endpoint allows admins to export ANY processor across all teams,
not just processors belonging to the admin's team.

Returns the full processor record and all version data, suitable for
use as training data or re-importing via the seed endpoint.


# Get Creation Allowance
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/get-creation-allowance

https://www.datalab.to/openapi.json get /api/v1/custom_pipelines/creation-allowance
2026 custom-processor CREATION allowance preview for the current team (§5).

The frontend reads this BEFORE creating a NEW processor to show the at-cap block
(Free / developer hard cap) or the one-time $5 confirmation (Team).

NOTE: the path is registered BEFORE /custom_pipelines/{lookup_key} so the literal
"creation-allowance" segment is matched by this route, not captured as a lookup_key.

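The ordering concern in the note above can be illustrated with a toy first-match router. This is a sketch of the general behavior only, not the service's routing code; `compile_route`, `match`, and the handler names are hypothetical.

```python
import re


def compile_route(pattern: str) -> re.Pattern:
    """Turn '/a/{param}' into a regex where each {param} matches one path segment."""
    regex = re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", pattern)
    return re.compile(f"^{regex}$")


def match(routes: list[tuple[str, str]], path: str) -> tuple[str, dict]:
    """Return the first (handler, params) whose pattern matches, in registration order."""
    for pattern, handler in routes:
        m = compile_route(pattern).match(path)
        if m:
            return handler, m.groupdict()
    raise LookupError(path)


literal_first = [
    ("/custom_pipelines/creation-allowance", "get_creation_allowance"),
    ("/custom_pipelines/{lookup_key}", "get_status"),
]
param_first = list(reversed(literal_first))

path = "/custom_pipelines/creation-allowance"
print(match(literal_first, path))  # literal route wins
print(match(param_first, path))    # parameter route captures the literal segment
```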

# Get Custom Pipeline Status
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/get-custom-pipeline-status

https://www.datalab.to/openapi.json get /api/v1/custom_pipelines/{lookup_key}
Check the status of a custom processor generation request using the request_check_url from the initial submission.


# Get Pipeline Eval Definition
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/get-pipeline-eval-definition

https://www.datalab.to/openapi.json get /api/v1/custom_pipelines/{processor_id}/eval_definition
Get the eval_definition from a custom processor's active version.


# Get Pipelines Using Processor
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/get-pipelines-using-processor

https://www.datalab.to/openapi.json get /api/v1/custom_pipelines/{processor_id}/pipelines
List pipelines (from the Pipeline table) that reference this custom processor
in their steps JSON.


# Get Processor Version Detail
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/get-processor-version-detail

https://www.datalab.to/openapi.json get /api/v1/custom_pipelines/{processor_id}/versions/{version}
Get detailed data for a specific processor version, including pipeline_params and eval_definition.


# Iterate Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/iterate-custom-pipeline

https://www.datalab.to/openapi.json post /api/v1/custom_pipelines/{processor_id}/iterate
Iterate on an existing custom processor.

Provides feedback to the agent which resumes from the previous session,
creating a new version of the processor parameters.


# List Custom Pipelines
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/list-custom-pipelines

https://www.datalab.to/openapi.json get /api/v1/custom_pipelines
List all custom processors for a team.
Returns processors ordered by creation date (newest first).


# List Pipeline Versions
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/list-pipeline-versions

https://www.datalab.to/openapi.json get /api/v1/custom_pipelines/{processor_id}/versions
List all versions of a custom processor, ordered by version descending.


# Restore Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/restore-custom-pipeline

https://www.datalab.to/openapi.json post /api/v1/custom_pipelines/{processor_id}/restore
Restore an archived custom processor. Admin-only.


# Seed Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/seed-custom-pipeline

https://www.datalab.to/openapi.json post /api/v1/custom_pipelines/seed
Directly create a completed custom processor from JSON. Admin-only.

Skips the agent entirely -- useful for seeding test data, local development,
and populating demos. The processor is immediately usable via POST /api/v1/marker.

USE WITH CAUTION


# Set Active Pipeline Version
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/set-active-pipeline-version

https://www.datalab.to/openapi.json post /api/v1/custom_pipelines/{processor_id}/set_active
Set the active version of a custom processor.
Changes the active_version pointer to any existing version.


# Submit Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/submit-custom-pipeline

https://www.datalab.to/openapi.json post /api/v1/custom_pipelines
Submit a custom processor generation request.


# Transfer Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/transfer-custom-pipeline

https://www.datalab.to/openapi.json post /api/v1/custom_pipelines/{processor_id}/transfer
Transfer a custom processor to another team.
This endpoint allows admins to transfer ownership of a custom processor from one team
to another. This is useful for:
1. Beta testing: Create and test processors internally, then transfer to customers
2. Sharing: Move successful processor configurations between teams
3. Updating: Push iterated versions to an existing customer processor (via to_processor_id)

Superusers can transfer any processor regardless of team ownership.


# Update Pipeline Eval Definition
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/update-pipeline-eval-definition

https://www.datalab.to/openapi.json put /api/v1/custom_pipelines/{processor_id}/eval_definition
Update the eval_definition on a custom processor's active version.


# Custom Processor Result Check
Source: https://documentation.datalab.to/api-reference/custom-processor-result-check

https://www.datalab.to/openapi.json get /api/v1/custom-processor/{request_id}
Poll this endpoint to check the status of a Custom Processor request and retrieve the results.


# Archive Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-processors/archive-custom-pipeline

https://www.datalab.to/openapi.json post /api/v1/custom_processors/{processor_id}/archive
Archive a custom processor (soft-delete).
Available to any team member with pipeline access.


# Check Pipeline Access
Source: https://documentation.datalab.to/api-reference/custom-processors/check-pipeline-access

https://www.datalab.to/openapi.json get /api/v1/custom_processors/access
Check if the current user's team has access to custom processors.

Custom processors are generally available: every team has access and can
create/iterate. Kept (always-true) for backwards compatibility with deployed
frontends and API integrations that still poll this endpoint; creation volume
is governed by the per-plan creation allowance, not by an access gate.


# Delete Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-processors/delete-custom-pipeline

https://www.datalab.to/openapi.json delete /api/v1/custom_processors/{processor_id}
Permanently delete a custom processor and all its versions. Admin-only.


# Describe Customizer
Source: https://documentation.datalab.to/api-reference/custom-processors/describe-customizer

https://www.datalab.to/openapi.json post /api/v1/custom_processors/describe
Conversational endpoint for building a custom processor description.
Accepts the chat history, returns the next assistant message.
When the system has enough context, includes a proposed_description.


# Export Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-processors/export-custom-pipeline

https://www.datalab.to/openapi.json get /api/v1/custom_processors/{processor_id}/export
Export a custom processor with all versions. Admin-only.

Note: This endpoint allows admins to export ANY processor across all teams,
not just processors belonging to the admin's team.

Returns the full processor record and all version data, suitable for
use as training data or re-importing via the seed endpoint.


# Get Creation Allowance
Source: https://documentation.datalab.to/api-reference/custom-processors/get-creation-allowance

https://www.datalab.to/openapi.json get /api/v1/custom_processors/creation-allowance
2026 custom-processor CREATION allowance preview for the current team (§5).

The frontend reads this BEFORE creating a NEW processor to show the at-cap block
(Free / developer hard cap) or the one-time $5 confirmation (Team).

NOTE: the path is registered BEFORE /custom_processors/{lookup_key} so the literal
"creation-allowance" segment is matched by this route, not captured as a lookup_key.


# Get Custom Pipeline Status
Source: https://documentation.datalab.to/api-reference/custom-processors/get-custom-pipeline-status

https://www.datalab.to/openapi.json get /api/v1/custom_processors/{lookup_key}
Check the status of a custom processor generation request using the request_check_url from the initial submission.


# Get Pipeline Eval Definition
Source: https://documentation.datalab.to/api-reference/custom-processors/get-pipeline-eval-definition

https://www.datalab.to/openapi.json get /api/v1/custom_processors/{processor_id}/eval_definition
Get the eval_definition from a custom processor's active version.


# Get Pipelines Using Processor
Source: https://documentation.datalab.to/api-reference/custom-processors/get-pipelines-using-processor

https://www.datalab.to/openapi.json get /api/v1/custom_processors/{processor_id}/pipelines
List pipelines (from the Pipeline table) that reference this custom processor
in their steps JSON.


# Get Processor Version Detail
Source: https://documentation.datalab.to/api-reference/custom-processors/get-processor-version-detail

https://www.datalab.to/openapi.json get /api/v1/custom_processors/{processor_id}/versions/{version}
Get detailed data for a specific processor version, including pipeline_params and eval_definition.


# Iterate Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-processors/iterate-custom-pipeline

https://www.datalab.to/openapi.json post /api/v1/custom_processors/{processor_id}/iterate
Iterate on an existing custom processor.

Provides feedback to the agent which resumes from the previous session,
creating a new version of the processor parameters.


# List Custom Pipelines
Source: https://documentation.datalab.to/api-reference/custom-processors/list-custom-pipelines

https://www.datalab.to/openapi.json get /api/v1/custom_processors
List all custom processors for a team.
Returns processors ordered by creation date (newest first).


# List Pipeline Versions
Source: https://documentation.datalab.to/api-reference/custom-processors/list-pipeline-versions

https://www.datalab.to/openapi.json get /api/v1/custom_processors/{processor_id}/versions
List all versions of a custom processor, ordered by version descending.


# Restore Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-processors/restore-custom-pipeline

https://www.datalab.to/openapi.json post /api/v1/custom_processors/{processor_id}/restore
Restore an archived custom processor. Admin-only.


# Seed Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-processors/seed-custom-pipeline

https://www.datalab.to/openapi.json post /api/v1/custom_processors/seed
Directly create a completed custom processor from JSON. Admin-only.

Skips the agent entirely -- useful for seeding test data, local development,
and populating demos. The processor is immediately usable via POST /api/v1/marker.

USE WITH CAUTION


# Set Active Pipeline Version
Source: https://documentation.datalab.to/api-reference/custom-processors/set-active-pipeline-version

https://www.datalab.to/openapi.json post /api/v1/custom_processors/{processor_id}/set_active
Set the active version of a custom processor.
Changes the active_version pointer to any existing version.


# Submit Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-processors/submit-custom-pipeline

https://www.datalab.to/openapi.json post /api/v1/custom_processors
Submit a custom processor generation request.


# Transfer Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-processors/transfer-custom-pipeline

https://www.datalab.to/openapi.json post /api/v1/custom_processors/{processor_id}/transfer
Transfer a custom processor to another team.
This endpoint allows admins to transfer ownership of a custom processor from one team
to another. This is useful for:
1. Beta testing: Create and test processors internally, then transfer to customers
2. Sharing: Move successful processor configurations between teams
3. Updating: Push iterated versions to an existing customer processor (via to_processor_id)

Superusers can transfer any processor regardless of team ownership.


# Update Pipeline Eval Definition
Source: https://documentation.datalab.to/api-reference/custom-processors/update-pipeline-eval-definition

https://www.datalab.to/openapi.json put /api/v1/custom_processors/{processor_id}/eval_definition
Update the eval_definition on a custom processor's active version.


# Delete Workflow
Source: https://documentation.datalab.to/api-reference/delete-workflow

https://www.datalab.to/openapi.json delete /api/v1/workflows/workflows/{workflow_id}
Delete a workflow definition.


# Create Eval Rubric
Source: https://documentation.datalab.to/api-reference/eval_rubrics/create-eval-rubric

https://www.datalab.to/openapi.json post /api/v1/eval_rubrics
Create new eval rubric for the team.


# Create From Feedback
Source: https://documentation.datalab.to/api-reference/eval_rubrics/create-from-feedback

https://www.datalab.to/openapi.json post /api/v1/eval_rubrics/from_feedback
Convert user feedback items into structured eval rubric using LLM rewrite.


# Delete Eval Rubric
Source: https://documentation.datalab.to/api-reference/eval_rubrics/delete-eval-rubric

https://www.datalab.to/openapi.json delete /api/v1/eval_rubrics/{rubric_id}
Soft-delete (archive) eval rubric.


# Generate From Feedback
Source: https://documentation.datalab.to/api-reference/eval_rubrics/generate-from-feedback

https://www.datalab.to/openapi.json post /api/v1/eval_rubrics/generate_from_feedback
Generate eval rubric from feedback items using LLM rewrite (no DB save).


# Get Eval Rubric
Source: https://documentation.datalab.to/api-reference/eval_rubrics/get-eval-rubric

https://www.datalab.to/openapi.json get /api/v1/eval_rubrics/{rubric_id}
Get eval rubric by ID.


# Import From Pipeline
Source: https://documentation.datalab.to/api-reference/eval_rubrics/import-from-pipeline

https://www.datalab.to/openapi.json post /api/v1/eval_rubrics/import_from_pipeline
Import eval_definition from a custom pipeline's active version.


# List Eval Rubrics
Source: https://documentation.datalab.to/api-reference/eval_rubrics/list-eval-rubrics

https://www.datalab.to/openapi.json get /api/v1/eval_rubrics
List eval rubrics for the team.


# Update Eval Rubric
Source: https://documentation.datalab.to/api-reference/eval_rubrics/update-eval-rubric

https://www.datalab.to/openapi.json put /api/v1/eval_rubrics/{rubric_id}
Update eval rubric.


# Execute Workflow
Source: https://documentation.datalab.to/api-reference/execute-workflow

https://www.datalab.to/openapi.json post /api/v1/workflows/workflows/{workflow_id}/execute
Execute a workflow definition.

This creates a WorkflowExecution and starts a Temporal workflow
that will dynamically load the steps and execute them.

Requires: X-API-Key header for authentication

Body (optional):
```json
{
    "input_config": {
        "type": "single_file",
        "file_url": "https://example.com/file.pdf"
    }
}
```
or
```json
{
    "input_config": {
        "type": "file_list",
        "file_urls": ["https://example.com/file1.pdf", "https://example.com/file2.pdf"]
    }
}
```
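
A sketch of building this request with the standard library, choosing between the two `input_config` shapes shown above. The base URL and the JSON `Content-Type` header are assumptions, and the helper names are hypothetical. Because the body is optional, callers with no input files could omit it instead of using these helpers.

```python
import json
import urllib.request

BASE_URL = "https://www.datalab.to"  # assumption: host that serves the /api/v1 paths


def make_input_config(file_urls: list[str]) -> dict:
    """Pick the single_file or file_list shape depending on how many URLs are given."""
    if not file_urls:
        raise ValueError("at least one file URL is required")
    if len(file_urls) == 1:
        return {"type": "single_file", "file_url": file_urls[0]}
    return {"type": "file_list", "file_urls": list(file_urls)}


def build_execute_request(workflow_id: int, api_key: str,
                          file_urls: list[str]) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts an execution."""
    body = json.dumps({"input_config": make_input_config(file_urls)}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/workflows/workflows/{workflow_id}/execute",
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


req = build_execute_request(
    456, "YOUR_API_KEY",
    ["https://example.com/file1.pdf", "https://example.com/file2.pdf"],
)
print(req.get_method(), req.full_url)
```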


# Extract Result Check
Source: https://documentation.datalab.to/api-reference/extract-result-check

https://www.datalab.to/openapi.json get /api/v1/extract/{request_id}
Poll this endpoint to check the status of an Extract request and retrieve the extracted structured data.


# Extract Structured Data
Source: https://documentation.datalab.to/api-reference/extract-structured-data

https://www.datalab.to/openapi.json post /api/v1/extract
Extract structured data from a document using a JSON schema. Provide a file for end-to-end processing, or a checkpoint_id from a previous /convert call to skip re-parsing.
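
For illustration, a JSON schema describing the fields to extract might look like the sketch below. The invoice fields are invented for the example. This entry does not state which JSON Schema keywords the endpoint honors or which request field carries the schema, so only the schema itself is shown.

```python
import json

# Hypothetical schema: the field names are illustrative, not part of the API.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
            },
        },
    },
    "required": ["invoice_number", "total"],
}

# Serialize once so the same string can be reused across requests.
schema_json = json.dumps(invoice_schema)
print(schema_json)
```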


# Extraction Schema Generation Result Check
Source: https://documentation.datalab.to/api-reference/extraction-schema-generation-result-check

https://www.datalab.to/openapi.json get /api/v1/marker/extraction/gen_schemas/{request_id}
Poll this endpoint to check the status of an Extraction Schema Generation request and retrieve the final results.


# Create Extraction Schema
Source: https://documentation.datalab.to/api-reference/extraction_schemas/create-extraction-schema

https://www.datalab.to/openapi.json post /api/v1/extraction_schemas
Create a new extraction schema for the team.


# Delete Extraction Schema
Source: https://documentation.datalab.to/api-reference/extraction_schemas/delete-extraction-schema

https://www.datalab.to/openapi.json delete /api/v1/extraction_schemas/{schema_id}
Soft-delete (archive) extraction schema.


# Get Extraction Schema
Source: https://documentation.datalab.to/api-reference/extraction_schemas/get-extraction-schema

https://www.datalab.to/openapi.json get /api/v1/extraction_schemas/{schema_id}
Get extraction schema by ID.


# List Extraction Schemas
Source: https://documentation.datalab.to/api-reference/extraction_schemas/list-extraction-schemas

https://www.datalab.to/openapi.json get /api/v1/extraction_schemas
List extraction schemas for the team.


# Update Extraction Schema
Source: https://documentation.datalab.to/api-reference/extraction_schemas/update-extraction-schema

https://www.datalab.to/openapi.json put /api/v1/extraction_schemas/{schema_id}
Update extraction schema. Optionally create a new version.


# Confirm Upload
Source: https://documentation.datalab.to/api-reference/files/confirm-upload

https://www.datalab.to/openapi.json get /api/v1/files/{file_id_or_hashid}/confirm
Confirm that a file was successfully uploaded to storage.

Call this endpoint after successfully uploading a file using the presigned URL
from /upload. This will verify the file exists, get the actual file size,
and mark it as completed.

Accepts either integer file_id (e.g., "4") or hashid (e.g., "npl94jxy").

This makes the file available for use in workflows.


# Delete File
Source: https://documentation.datalab.to/api-reference/files/delete-file

https://www.datalab.to/openapi.json delete /api/v1/files/{file_id}
Delete an uploaded file.

Removes the file from both storage and the database.


# Get File Download Url
Source: https://documentation.datalab.to/api-reference/files/get-file-download-url

https://www.datalab.to/openapi.json get /api/v1/files/{file_id}/download
Generate presigned URL for downloading a file.

The URL is valid for the specified expiry time (default: 1 hour).

Parameters:
- file_id: File ID
- expires_in: URL expiry time in seconds (default: 3600, max: 86400)
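
A sketch of building the request path with the documented default and maximum. It assumes `expires_in` is passed as a query parameter, which this entry does not state explicitly. How the server treats values above the maximum is not described, so the hypothetical helper clamps on the client side.

```python
from urllib.parse import urlencode

DEFAULT_EXPIRY = 3600  # documented default, in seconds
MAX_EXPIRY = 86400     # documented maximum, in seconds


def download_path(file_id, expires_in: int = DEFAULT_EXPIRY) -> str:
    """Build the path and query string for a presigned-download request."""
    if expires_in <= 0:
        raise ValueError("expires_in must be positive")
    expires_in = min(expires_in, MAX_EXPIRY)  # client-side clamp to the documented maximum
    return f"/api/v1/files/{file_id}/download?" + urlencode({"expires_in": expires_in})


print(download_path(4))
print(download_path(4, expires_in=7 * 24 * 3600))
```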


# Get File Metadata
Source: https://documentation.datalab.to/api-reference/files/get-file-metadata

https://www.datalab.to/openapi.json get /api/v1/files/{file_id}
Get metadata for an uploaded file.

Returns file information including size, content type, and upload timestamp.


# List Files
Source: https://documentation.datalab.to/api-reference/files/list-files

https://www.datalab.to/openapi.json get /api/v1/files
List all uploaded files for the team.

Supports pagination with limit and offset parameters.

Parameters:
- limit: Maximum number of files to return (default: 50, max: 100)
- offset: Number of files to skip (default: 0)
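
A sketch of walking all pages with `limit` and `offset`. The HTTP call is abstracted behind a caller-supplied `fetch_page` function (hypothetical), and the sketch assumes each page can be reduced to a list of file records; the actual response shape is not described here.

```python
from typing import Callable, Iterator

MAX_LIMIT = 100  # documented maximum page size


def iter_files(fetch_page: Callable[[int, int], list], limit: int = 50) -> Iterator[dict]:
    """Yield every file by advancing offset until a short or empty page is returned."""
    limit = max(1, min(limit, MAX_LIMIT))
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page
        if len(page) < limit:
            return
        offset += limit


# Stand-in for the real HTTP call: 120 fake records sliced by limit/offset.
fake_store = [{"id": i} for i in range(120)]


def fake_fetch(limit: int, offset: int) -> list:
    return fake_store[offset:offset + limit]


files = list(iter_files(fake_fetch))
print(len(files))
```

Stopping on a short page avoids one extra request when the total is not a multiple of `limit`; when it is, the loop ends on the following empty page.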


# Request Upload Url
Source: https://documentation.datalab.to/api-reference/files/request-upload-url

https://www.datalab.to/openapi.json post /api/v1/files/upload
Request a presigned upload URL for direct client-side upload to storage.

This is the recommended upload flow:
1. Client calls this endpoint with filename and content_type
2. Backend creates a pending file record and returns presigned PUT URL
3. Client uploads directly to storage using the presigned URL
4. Client calls /confirm to verify upload and get actual file size
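
The four steps above can be sketched as a single function that takes its transport as arguments, which keeps the flow testable without network access. The response keys `upload_url` and `file_id` are assumptions, as are all function names; consult the OpenAPI schema for the real field names.

```python
from typing import Callable


def upload_file(
    filename: str,
    content_type: str,
    data: bytes,
    request_upload: Callable[[str, str], dict],
    put_bytes: Callable[[str, bytes, str], None],
    confirm: Callable[[str], dict],
) -> dict:
    """Run the presigned upload flow using caller-supplied transport functions."""
    # Steps 1-2: ask the API for a pending file record and a presigned PUT URL.
    pending = request_upload(filename, content_type)
    # Step 3: send the bytes straight to storage using the presigned URL.
    put_bytes(pending["upload_url"], data, content_type)  # key name is an assumption
    # Step 4: confirm so the upload is verified and the real size recorded.
    return confirm(str(pending["file_id"]))  # key name is an assumption


# Fake transport that records the order of calls.
calls = []


def fake_request_upload(filename, content_type):
    calls.append("request")
    return {"file_id": 4, "upload_url": "https://storage.example/put"}


def fake_put(url, data, content_type):
    calls.append("put")


def fake_confirm(file_id):
    calls.append("confirm")
    return {"file_id": file_id, "status": "completed"}


result = upload_file("report.pdf", "application/pdf", b"%PDF-",
                     fake_request_upload, fake_put, fake_confirm)
print(calls)
```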


# Form Filling
Source: https://documentation.datalab.to/api-reference/form-filling

https://www.datalab.to/openapi.json post /api/v1/fill
Fill PDF or image forms with provided field data. Supports PDFs with and without native form fields.


# Form Filling Result Check
Source: https://documentation.datalab.to/api-reference/form-filling-result-check

https://www.datalab.to/openapi.json get /api/v1/fill/{request_id}
Poll this endpoint to check the status of a Form Filling request and retrieve the filled form.


# Generate Extraction Schemas
Source: https://documentation.datalab.to/api-reference/generate-extraction-schemas

https://www.datalab.to/openapi.json post /api/v1/marker/extraction/gen_schemas
For a given file, generate potential extraction schemas.


# Get Execution Status
Source: https://documentation.datalab.to/api-reference/get-execution-status

https://www.datalab.to/openapi.json get /api/v1/workflows/executions/{execution_id}
Get the status and results of a workflow execution.

Returns execution status and step data keyed by unique_name.
For completed or failed steps, output data is provided as presigned URLs
since outputs can be large/complex.

Users can poll this endpoint until status is COMPLETED or FAILED.

Response:
```
{
    "execution_id": 123,
    "workflow_id": 456,
    "status": "IN_PROGRESS" | "COMPLETED" | "FAILED" | "QUEUED" | "PENDING",
    "created": "2025-10-22T10:00:00",
    "updated": "2025-10-22T10:05:00",
    "steps": {
        "parse": {
            "status": "COMPLETED",
            "started_at": "2025-10-22T10:00:00",
            "completed_at": "2025-10-22T10:02:00",
            "output_url": "https://presigned-url-to-output.json"
        },
        "extract": {
            "status": "IN_PROGRESS",
            "started_at": "2025-10-22T10:02:00"
        },
        "segment": {
            "status": "PENDING"
        }
    }
}
```
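
A sketch of the polling loop described above, using the status values and step fields from the response shape. The HTTP call is abstracted as a caller-supplied `get_status` function (hypothetical), and the interval and timeout defaults are arbitrary choices, not documented limits.

```python
import time
from typing import Callable

TERMINAL = {"COMPLETED", "FAILED"}


def wait_for_execution(
    get_status: Callable[[], dict],
    interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the execution reaches COMPLETED or FAILED, or the timeout elapses."""
    waited = 0.0
    while True:
        execution = get_status()
        if execution["status"] in TERMINAL:
            return execution
        if waited >= timeout:
            raise TimeoutError(f"execution still {execution['status']} after {waited:.0f}s")
        sleep(interval)
        waited += interval


def completed_outputs(execution: dict) -> dict:
    """Map each completed step's unique_name to its presigned output URL."""
    return {
        name: step["output_url"]
        for name, step in execution["steps"].items()
        if step.get("status") == "COMPLETED" and "output_url" in step
    }


# Canned responses standing in for successive GET calls.
responses = iter([
    {"status": "QUEUED", "steps": {}},
    {"status": "IN_PROGRESS", "steps": {"parse": {"status": "IN_PROGRESS"}}},
    {"status": "COMPLETED", "steps": {"parse": {
        "status": "COMPLETED",
        "output_url": "https://presigned-url-to-output.json"}}},
])
final = wait_for_execution(lambda: next(responses), sleep=lambda seconds: None)
print(completed_outputs(final))
```

Injecting `sleep` lets tests run instantly, while production callers keep the default `time.sleep`.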


# Get Workflow
Source: https://documentation.datalab.to/api-reference/get-workflow

https://www.datalab.to/openapi.json get /api/v1/workflows/workflows/{workflow_id}
Get workflow definition with all steps.


# Health
Source: https://documentation.datalab.to/api-reference/health

https://www.datalab.to/openapi.json get /api/v1/health
This endpoint is used to check the health of the API. It returns a JSON object with the key "status" set to "ok".


# List Step Types
Source: https://documentation.datalab.to/api-reference/list-step-types

https://www.datalab.to/openapi.json get /api/v1/workflows/step-types
List all available step types that can be used in workflows.

These are the building blocks users can compose into workflows.


# List Workflows
Source: https://documentation.datalab.to/api-reference/list-workflows

https://www.datalab.to/openapi.json get /api/v1/workflows/workflows
List all workflow definitions with their steps.


# Marker Result Check
Source: https://documentation.datalab.to/api-reference/marker-result-check

https://www.datalab.to/openapi.json get /api/v1/marker/{request_id}
Poll this endpoint to check the status of a Marker request and retrieve the final results.


# OCR Result Check
Source: https://documentation.datalab.to/api-reference/ocr-result-check

https://www.datalab.to/openapi.json get /api/v1/ocr/{request_id}
Poll this endpoint to check the status of an OCR request and retrieve the final results.


# Add Template Examples
Source: https://documentation.datalab.to/api-reference/pipeline-templates-deprecated/add-template-examples

https://www.datalab.to/openapi.json post /api/v1/pipeline_templates/{slug}/examples
Upload example files for a template. Admin-only.


# Clone Template
Source: https://documentation.datalab.to/api-reference/pipeline-templates-deprecated/clone-template

https://www.datalab.to/openapi.json post /api/v1/pipeline_templates/{slug}/clone
Clone a template to the user's team as a new custom processor.


# Download Template Example
Source: https://documentation.datalab.to/api-reference/pipeline-templates-deprecated/download-template-example

https://www.datalab.to/openapi.json get /api/v1/pipeline_templates/{slug}/examples/{filename}
Fetch example file from R2 and return content directly.


# Download Template Example Thumbnail
Source: https://documentation.datalab.to/api-reference/pipeline-templates-deprecated/download-template-example-thumbnail

https://www.datalab.to/openapi.json get /api/v1/pipeline_templates/{slug}/examples/{filename}/thumbnail
Stream thumbnail image for an example file.


# Get Template
Source: https://documentation.datalab.to/api-reference/pipeline-templates-deprecated/get-template

https://www.datalab.to/openapi.json get /api/v1/pipeline_templates/{slug}
Get detailed info for a pipeline template.


# List Templates
Source: https://documentation.datalab.to/api-reference/pipeline-templates-deprecated/list-templates

https://www.datalab.to/openapi.json get /api/v1/pipeline_templates
List all published pipeline templates.


# Promote To Template
Source: https://documentation.datalab.to/api-reference/pipeline-templates-deprecated/promote-to-template

https://www.datalab.to/openapi.json post /api/v1/pipeline_templates
Create a template by copying an existing completed processor. Admin-only.

Creates an independent copy so the admin can iterate on the source processor
without affecting the template. Only the active version is copied, and
agent session/checkpoint data is stripped so cloned copies don't share
Claude sessions.


# Remove Template
Source: https://documentation.datalab.to/api-reference/pipeline-templates-deprecated/remove-template

https://www.datalab.to/openapi.json delete /api/v1/pipeline_templates/{slug}
Un-template a pipeline (sets is_template=False). Admin-only.


# Remove Template Example
Source: https://documentation.datalab.to/api-reference/pipeline-templates-deprecated/remove-template-example

https://www.datalab.to/openapi.json delete /api/v1/pipeline_templates/{slug}/examples/{filename}
Remove an example file from a template. Admin-only.


# Update Template
Source: https://documentation.datalab.to/api-reference/pipeline-templates-deprecated/update-template

https://www.datalab.to/openapi.json put /api/v1/pipeline_templates/{slug}
Update template metadata. Admin-only.


# Archive Pipeline
Source: https://documentation.datalab.to/api-reference/pipelines/archive-pipeline

https://www.datalab.to/openapi.json post /api/v1/pipelines/{pipeline_id}/archive
Archive a pipeline, hiding it from the default list.


# Create Pipeline
Source: https://documentation.datalab.to/api-reference/pipelines/create-pipeline

https://www.datalab.to/openapi.json post /api/v1/pipelines
Create a new pipeline for the team.


# Create Pipeline Version
Source: https://documentation.datalab.to/api-reference/pipelines/create-pipeline-version

https://www.datalab.to/openapi.json post /api/v1/pipelines/{pipeline_id}/versions
Create a new version snapshot of the pipeline's current steps.


# Discard Draft
Source: https://documentation.datalab.to/api-reference/pipelines/discard-draft

https://www.datalab.to/openapi.json post /api/v1/pipelines/{pipeline_id}/discard
Discard draft changes and reset Pipeline.steps to a published version's steps.


# Get Pipeline
Source: https://documentation.datalab.to/api-reference/pipelines/get-pipeline

https://www.datalab.to/openapi.json get /api/v1/pipelines/{pipeline_id}
Get pipeline by pipeline_id.


# Get Pipeline Execution
Source: https://documentation.datalab.to/api-reference/pipelines/get-pipeline-execution

https://www.datalab.to/openapi.json get /api/v1/pipelines/executions/{execution_id}
Poll execution status. Returns per-step status with lookup keys for partial results.

Decision rule: check `PipelineExecution.status` in PostgreSQL first.
- If terminal (completed/failed): return from PostgreSQL (post-sync, complete data).
- If running/pending: read Firestore for real-time step status.
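
A client-side polling loop against this endpoint might look like the sketch below. The `fetch_status` callable is a hypothetical stand-in for a GET to `/api/v1/pipelines/executions/{execution_id}`, and the top-level `status` field in the response is an assumption based on the statuses named above, not a confirmed response schema.

```python
import time
from typing import Callable

# Terminal statuses as named in the description above.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_execution(
    fetch_status: Callable[[], dict],
    interval_seconds: float = 2.0,
    max_polls: int = 150,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the execution reaches a terminal status.

    fetch_status is a hypothetical callable that performs the GET request
    and returns the parsed JSON body; a top-level "status" key is assumed.
    """
    for _ in range(max_polls):
        body = fetch_status()
        if body.get("status") in TERMINAL_STATUSES:
            return body
        # Still pending or running: wait before asking again.
        sleep(interval_seconds)
    raise TimeoutError("Execution did not reach a terminal status in time")
```

Passing the fetch and sleep functions in as arguments keeps the loop independent of any particular HTTP client and makes it easy to exercise without network access.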


# Get Pipeline Rate
Source: https://documentation.datalab.to/api-reference/pipelines/get-pipeline-rate

https://www.datalab.to/openapi.json get /api/v1/pipelines/{pipeline_id}/rate
Get the pipeline rate based on plan and effective processing region.


# Get Step Result
Source: https://documentation.datalab.to/api-reference/pipelines/get-step-result

https://www.datalab.to/openapi.json get /api/v1/pipelines/executions/{execution_id}/steps/{step_index}/result
Fetch intermediate result for a specific pipeline execution step.


# List Pipeline Executions
Source: https://documentation.datalab.to/api-reference/pipelines/list-pipeline-executions

https://www.datalab.to/openapi.json get /api/v1/pipelines/{pipeline_id}/executions
List recent executions for a pipeline.


# List Pipeline Versions
Source: https://documentation.datalab.to/api-reference/pipelines/list-pipeline-versions

https://www.datalab.to/openapi.json get /api/v1/pipelines/{pipeline_id}/versions
List all versions of a pipeline, newest first.


# List Pipelines
Source: https://documentation.datalab.to/api-reference/pipelines/list-pipelines

https://www.datalab.to/openapi.json get /api/v1/pipelines
List pipelines for the team.


# Run Pipeline
Source: https://documentation.datalab.to/api-reference/pipelines/run-pipeline

https://www.datalab.to/openapi.json post /api/v1/pipelines/{pipeline_id}/run
Execute a pipeline on a file, creating an execution DAG with per-step tracking and billing.


# Save Pipeline
Source: https://documentation.datalab.to/api-reference/pipelines/save-pipeline

https://www.datalab.to/openapi.json put /api/v1/pipelines/{pipeline_id}/save
Name and promote a pipeline to saved status.


# Unarchive Pipeline
Source: https://documentation.datalab.to/api-reference/pipelines/unarchive-pipeline

https://www.datalab.to/openapi.json post /api/v1/pipelines/{pipeline_id}/unarchive
Unarchive a pipeline, restoring it to the default list.


# Update Pipeline
Source: https://documentation.datalab.to/api-reference/pipelines/update-pipeline

https://www.datalab.to/openapi.json put /api/v1/pipelines/{pipeline_id}
Update pipeline steps. This is the auto-save path.


# Add Template Examples
Source: https://documentation.datalab.to/api-reference/processor-templates/add-template-examples

https://www.datalab.to/openapi.json post /api/v1/processor_templates/{slug}/examples
Upload example files for a template. Admin-only.


# Clone Template
Source: https://documentation.datalab.to/api-reference/processor-templates/clone-template

https://www.datalab.to/openapi.json post /api/v1/processor_templates/{slug}/clone
Clone a template to the user's team as a new custom processor.


# Download Template Example
Source: https://documentation.datalab.to/api-reference/processor-templates/download-template-example

https://www.datalab.to/openapi.json get /api/v1/processor_templates/{slug}/examples/{filename}
Fetch example file from R2 and return content directly.


# Download Template Example Thumbnail
Source: https://documentation.datalab.to/api-reference/processor-templates/download-template-example-thumbnail

https://www.datalab.to/openapi.json get /api/v1/processor_templates/{slug}/examples/{filename}/thumbnail
Stream thumbnail image for an example file.


# Get Template
Source: https://documentation.datalab.to/api-reference/processor-templates/get-template

https://www.datalab.to/openapi.json get /api/v1/processor_templates/{slug}
Get detailed info for a pipeline template.


# List Templates
Source: https://documentation.datalab.to/api-reference/processor-templates/list-templates

https://www.datalab.to/openapi.json get /api/v1/processor_templates
List all published pipeline templates.


# Promote To Template
Source: https://documentation.datalab.to/api-reference/processor-templates/promote-to-template

https://www.datalab.to/openapi.json post /api/v1/processor_templates
Create a template by copying an existing completed processor. Admin-only.

Creates an independent copy so the admin can iterate on the source processor
without affecting the template. Only the active version is copied, and
agent session/checkpoint data is stripped so cloned copies don't share
Claude sessions.


# Remove Template
Source: https://documentation.datalab.to/api-reference/processor-templates/remove-template

https://www.datalab.to/openapi.json delete /api/v1/processor_templates/{slug}
Un-template a pipeline (sets is_template=False). Admin-only.


# Remove Template Example
Source: https://documentation.datalab.to/api-reference/processor-templates/remove-template-example

https://www.datalab.to/openapi.json delete /api/v1/processor_templates/{slug}/examples/{filename}
Remove an example file from a template. Admin-only.


# Update Template
Source: https://documentation.datalab.to/api-reference/processor-templates/update-template

https://www.datalab.to/openapi.json put /api/v1/processor_templates/{slug}
Update template metadata. Admin-only.


# Run Custom Pipeline
Source: https://documentation.datalab.to/api-reference/run-custom-pipeline

https://www.datalab.to/openapi.json post /api/v1/custom-pipeline
Execute a custom pipeline configuration. The pipeline_id must reference a completed custom pipeline ID or a template ID.


# Run Custom Processor
Source: https://documentation.datalab.to/api-reference/run-custom-processor

https://www.datalab.to/openapi.json post /api/v1/custom-processor
Execute a custom processor configuration. The pipeline_id must reference a completed custom processor ID or a template ID.


# Segment Document
Source: https://documentation.datalab.to/api-reference/segment-document

https://www.datalab.to/openapi.json post /api/v1/segment
Segment a document into sections using a schema. Returns page ranges for each identified segment. Provide a file for end-to-end processing, or a checkpoint_id from a previous /convert call.


# Segment Result Check
Source: https://documentation.datalab.to/api-reference/segment-result-check

https://www.datalab.to/openapi.json get /api/v1/segment/{request_id}
Poll this endpoint to check the status of a Segment request and retrieve the segmentation results.


# Table Rec Result Check
Source: https://documentation.datalab.to/api-reference/table-rec-result-check

https://www.datalab.to/openapi.json get /api/v1/table_rec/{request_id}
Poll this endpoint to check the status of a table recognition request and retrieve the final results.


# Thumbnails
Source: https://documentation.datalab.to/api-reference/thumbnails

https://www.datalab.to/openapi.json get /api/v1/thumbnails/{lookup_key}


# Track Changes
Source: https://documentation.datalab.to/api-reference/track-changes

https://www.datalab.to/openapi.json post /api/v1/track-changes
Extract and display tracked changes from DOCX documents. Returns markdown, HTML, and/or chunks with change annotations.


# Track Changes Result Check
Source: https://documentation.datalab.to/api-reference/track-changes-result-check

https://www.datalab.to/openapi.json get /api/v1/track-changes/{request_id}
Poll this endpoint to check the status of a Track Changes request and retrieve the results.


# API Limits & Rate Limiting
Source: https://documentation.datalab.to/docs/common/limits


Datalab implements limits to ensure fair usage and maintain service quality. This guide covers file size limits, page limits, and rate limiting.

## File Size Limits

| File Type        | Maximum Size |
| ---------------- | ------------ |
| PDF Documents    | 200 MB       |
| Images           | 200 MB       |
| Office Documents | 200 MB       |
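
Checking the file size locally avoids uploading a file that will be rejected. The following is a minimal sketch; the helper name is hypothetical, and because the table does not say whether "MB" means 10^6 or 2^20 bytes, the stricter decimal interpretation is assumed.

```python
import os

# 200 MB limit from the table above. Whether "MB" means 10**6 or 2**20 bytes
# is not stated, so the stricter decimal interpretation is assumed here.
MAX_FILE_BYTES = 200 * 1000 * 1000


def exceeds_size_limit(path: str, limit_bytes: int = MAX_FILE_BYTES) -> bool:
    """Return True if the file at `path` is larger than the assumed limit."""
    return os.path.getsize(path) > limit_bytes
```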

## Page Limits

| Limit                     | Value |
| ------------------------- | ----- |
| Maximum pages per request | 7,000 |

For documents exceeding these limits, use the `page_range` parameter to process in segments:

```python theme={null}
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

# Process a large document in segments
options = ConvertOptions(page_range="0-999")
result1 = client.convert("large_document.pdf", options=options)

options = ConvertOptions(page_range="1000-1999")
result2 = client.convert("large_document.pdf", options=options)
```
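
For documents with many segments, the range strings can be generated rather than written by hand. This sketch assumes `page_range` is zero-indexed and inclusive, as the `"0-999"` and `"1000-1999"` values above suggest; the helper name and the 1,000-page default are illustrative choices, not part of the SDK.

```python
def page_ranges(total_pages: int, chunk_size: int = 1000) -> list[str]:
    """Split a document into inclusive, zero-indexed page_range strings."""
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")

    ranges = []
    for start in range(0, total_pages, chunk_size):
        # The last chunk may be shorter than chunk_size.
        end = min(start + chunk_size, total_pages) - 1
        ranges.append(f"{start}-{end}")
    return ranges
```

Each returned string can then be passed as `ConvertOptions(page_range=...)` in a loop, one `client.convert` call per segment.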

## Rate Limits

### Request Rate Limit

| Plan      | Requests per minute | Concurrent requests |
| --------- | ------------------- | ------------------- |
| Free tier | 10                  | 5                   |
| Team      | 200                 | 400                 |

When you exceed request rate limits, you'll receive a `429` response. The SDK handles retries automatically. For raw API calls, implement retry logic:

```python theme={null}
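import time
import requests

def api_call_with_retry(url, headers, files, data, max_retries=3):
    """POST to the API, waiting and retrying whenever a 429 is returned."""
    for attempt in range(max_retries):
        # If `files` holds open file objects, rewind them with seek(0)
        # before each retry so the full body is sent again.
        response = requests.post(url, headers=headers, files=files, data=data)

        if response.status_code == 429:
            # Rate limited: wait for the per-minute window to reset, then retry.
            time.sleep(60)
            continue

        # Any other status (success or error) is returned to the caller.
        return response

    raise Exception("Max retries exceeded")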
```
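
A fixed 60-second wait is simple but can be slower than necessary. One common alternative is exponential backoff with jitter, sketched below. The `send` callable is a hypothetical stand-in for the request (for example, a `requests.post` call) and only needs to return an object with a `status_code` attribute; the base delay and cap are illustrative values, not documented recommendations.

```python
import random
import time
from typing import Callable


def backoff_delays(max_retries: int, base: float = 2.0, cap: float = 60.0) -> list[float]:
    """Delays in seconds for each attempt: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(max_retries)]


def call_with_backoff(
    send: Callable[[], object],
    max_retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
    jitter: Callable[[], float] = random.random,
):
    """Call send() until it returns a non-429 response or retries run out."""
    for delay in backoff_delays(max_retries):
        response = send()
        if response.status_code != 429:
            return response
        # Add up to one second of jitter so parallel clients do not retry in lockstep.
        sleep(delay + jitter())
    raise RuntimeError("Max retries exceeded")
```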

### Page Concurrency Limit

In addition to request rate limits, Datalab enforces a limit on the total number of pages being processed concurrently across all your requests.

| Limit                      | Value |
| -------------------------- | ----- |
| Concurrent pages in flight | 5,000 |

Most workloads will not hit this limit. It primarily affects high-volume workloads with longer-running requests — for example, large or complex documents processed in accurate mode with additional features enabled — or extremely high-volume workloads. Such sustained workloads would benefit from an enterprise agreement or a batch job that we orchestrate for you. Contact [support@datalab.to](mailto:support@datalab.to) to discuss your requirements.

This limit differs from request rate limits in two important ways:

1. **It is not time-bound.** It limits the number of pages actively being processed at any given moment, not the number of requests per minute.
2. **It is enforced during processing, not at submission.** You will not receive a `429` response when submitting a document. Instead, the result will return with `success` set to `false` and an error message:

```json theme={null}
{
  "success": false,
  "error": "Page rate limit exceeded. Your team has {current_pages} pages in flight and this request adds {page_count} more ({total} total, limit: 5,000). Please wait for some requests to complete before submitting more, or contact support@datalab.to for a higher limit."
}
```

<Warning>
  Because this limit is not enforced at submission time, you won't get an HTTP error when submitting. Always check the `success` field in your results. If you're polling for results, back off and wait for in-flight requests to complete before submitting more.
</Warning>
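
Client code can separate this failure from other errors so that only page-limit failures trigger a wait-and-resubmit path. The sketch below matches on the opening words of the error message shown above; matching on message text is an assumption, since the wording may change, and the helper name is hypothetical.

```python
# Opening words of the error message shown above. The API may change this
# wording, so treat the match as a best-effort heuristic.
PAGE_LIMIT_PREFIX = "Page rate limit exceeded"


def is_page_limit_error(result: dict) -> bool:
    """Return True when a result failed with the page concurrency error."""
    if result.get("success", False):
        return False
    error = result.get("error") or ""
    return error.startswith(PAGE_LIMIT_PREFIX)
```

When this returns `True`, wait for some in-flight requests to finish before resubmitting the document.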

## Enterprise Limits

Custom limits are available for enterprise plans:

* Higher file size limits
* Increased rate limits
* Priority processing

See [pricing](https://www.datalab.to/pricing) for details, or contact support to discuss your requirements.

## Next Steps

<CardGroup>
  <Card title="Batch Processing" icon="layer-group" href="/docs/recipes/conversion/batch-documents">
    Process multiple documents efficiently in batch.
  </Card>

  <Card title="Error Codes" icon="circle-exclamation" href="/platform/errors">
    Understand HTTP error codes and subscription errors.
  </Card>

  <Card title="Billing" icon="credit-card" href="/platform/billing">
    Learn about per-page pricing and usage monitoring.
  </Card>

  <Card title="Webhooks" icon="bell" href="/platform/webhooks">
    Receive notifications when processing completes instead of polling.
  </Card>
</CardGroup>


# Supported File Types
Source: https://documentation.datalab.to/docs/common/supportedfiletypes


Datalab supports the following file types for document conversion:

## PDF

| Extension | MIME Type         |
| --------- | ----------------- |
| `.pdf`    | `application/pdf` |

## Spreadsheets

| Extension | MIME Type                                                              |
| --------- | ---------------------------------------------------------------------- |
| `.xls`    | `application/vnd.ms-excel`                                             |
| `.xlsx`   | `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`    |
| `.xlsm`   | `application/vnd.ms-excel.sheet.macroEnabled.12`                       |
| `.xltx`   | `application/vnd.openxmlformats-officedocument.spreadsheetml.template` |
| `.csv`    | `text/csv`                                                             |
| `.ods`    | `application/vnd.oasis.opendocument.spreadsheet`                       |

## Word Documents

| Extension | MIME Type                                                                 |
| --------- | ------------------------------------------------------------------------- |
| `.doc`    | `application/msword`                                                      |
| `.docx`   | `application/vnd.openxmlformats-officedocument.wordprocessingml.document` |
| `.odt`    | `application/vnd.oasis.opendocument.text`                                 |

## Presentations

| Extension | MIME Type                                                                   |
| --------- | --------------------------------------------------------------------------- |
| `.ppt`    | `application/vnd.ms-powerpoint`                                             |
| `.pptx`   | `application/vnd.openxmlformats-officedocument.presentationml.presentation` |
| `.odp`    | `application/vnd.oasis.opendocument.presentation`                           |

## HTML

| Extension | MIME Type   |
| --------- | ----------- |
| `.html`   | `text/html` |

## Ebooks

| Extension | MIME Type              |
| --------- | ---------------------- |
| `.epub`   | `application/epub+zip` |

## Images

| Extension | MIME Type    |
| --------- | ------------ |
| `.png`    | `image/png`  |
| `.jpg`    | `image/jpeg` |
| `.jpeg`   | `image/jpeg` |
| `.webp`   | `image/webp` |
| `.gif`    | `image/gif`  |
| `.tiff`   | `image/tiff` |

## Detecting MIME Types

To automatically detect a file's MIME type in Python:

```python theme={null}
import filetype

mime = filetype.guess("document.pdf")
if mime:
    print(mime.mime)  # application/pdf
```

Install with `pip install filetype`.
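
Content-based detection can be paired with a quick extension check against the tables above. The following sketch uses only the standard library; the mapping is copied from this page, and the helper name is hypothetical.

```python
import os
from typing import Optional

# Extension-to-MIME mapping copied from the tables on this page.
SUPPORTED_MIME_TYPES = {
    ".pdf": "application/pdf",
    ".xls": "application/vnd.ms-excel",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".xlsm": "application/vnd.ms-excel.sheet.macroEnabled.12",
    ".xltx": "application/vnd.openxmlformats-officedocument.spreadsheetml.template",
    ".csv": "text/csv",
    ".ods": "application/vnd.oasis.opendocument.spreadsheet",
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".odt": "application/vnd.oasis.opendocument.text",
    ".ppt": "application/vnd.ms-powerpoint",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".odp": "application/vnd.oasis.opendocument.presentation",
    ".html": "text/html",
    ".epub": "application/epub+zip",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".webp": "image/webp",
    ".gif": "image/gif",
    ".tiff": "image/tiff",
}


def expected_mime(path: str) -> Optional[str]:
    """Return the listed MIME type for a file name, or None if unsupported."""
    extension = os.path.splitext(path)[1].lower()
    return SUPPORTED_MIME_TYPES.get(extension)
```

An extension check only inspects the file name, so it complements rather than replaces content-based detection.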

## Size Limits

See [API Limits](/docs/common/limits) for file size and page limits.

## Next Steps

<CardGroup>
  <Card title="Quickstart" icon="rocket" href="/docs/welcome/quickstart">
    Get started converting documents in minutes.
  </Card>

  <Card title="Document Conversion" icon="file-lines" href="/docs/recipes/conversion/conversion-api-overview">
    Detailed guide to converting documents to Markdown, HTML, or JSON.
  </Card>

  <Card title="API Limits" icon="gauge" href="/docs/common/limits">
    Understand file size limits, page limits, and rate limiting.
  </Card>

  <Card title="File Upload" icon="upload" href="/docs/recipes/file-management/file-upload-api">
    Upload files to Datalab storage for use in pipelines.
  </Card>
</CardGroup>


# API
Source: https://documentation.datalab.to/docs/on-prem/api

Our on-prem container's API mimics Datalab's API.

Our cloud-hosted API documentation can be found [here](https://documentation.datalab.to/docs/welcome/api). With caveats and exceptions detailed below, the container image shares the same API.

# Supported endpoints

The container currently supports:

* `/api/v1/marker` documented [here](https://documentation.datalab.to/docs/welcome/api#marker).  This uses both the Marker and Chandra models.
* `/api/v1/ocr` documented [here](https://documentation.datalab.to/docs/welcome/api#ocr).
* `/api/v1/extract` — structured extraction via JSON schema. Supports `fast` and `turbo` extraction modes. Requires the Chandra model with the Lift model enabled; `balanced` mode is not available on-prem.
* `/api/v1/usage` documented [here](/docs/on-prem/usage-analytics) — provides usage analytics and performance metrics for your on-prem deployment.

# Authentication

API authentication is not supported in the container. We assume customers will be running our image on their own infrastructure in private networks.

You may send the `X-API-Key` header detailed [here](https://documentation.datalab.to/docs/welcome/api#authentication), but it will be ignored and any value works.

# PDFs and images are supported; other file types are not yet supported

Datalab's API supports [many file types](https://documentation.datalab.to/docs/common/supportedfiletypes).

The container currently supports PDFs and image file types. Other file types are not yet supported, but will be supported in an upcoming release.

## Feature Parity

| Feature                                       | Cloud API | On-Premises                                                                              |
| --------------------------------------------- | --------- | ---------------------------------------------------------------------------------------- |
| Document conversion (`/marker`)               | Yes       | Yes                                                                                      |
| OCR (`/ocr`)                                  | Yes       | Yes                                                                                      |
| Output formats (markdown, html, json, chunks) | Yes       | Yes                                                                                      |
| Parse quality scoring                         | Yes       | Yes                                                                                      |
| Chart understanding                           | Yes       | Yes (Chandra containers only)                                                            |
| Page range selection                          | Yes       | Yes                                                                                      |
| Block IDs                                     | Yes       | Yes                                                                                      |
| Token-efficient markdown                      | Yes       | Yes                                                                                      |
| Form filling (`/fill`)                        | Yes       | **No**                                                                                   |
| Create document (`/create-document`)          | Yes       | **No**                                                                                   |
| Thumbnails                                    | Yes       | **No**                                                                                   |
| Accurate mode                                 | Yes       | **No**                                                                                   |
| Fast mode                                     | Yes       | **No**                                                                                   |
| Link extraction                               | Yes       | **No**                                                                                   |
| Checkpoints                                   | Yes       | **No**                                                                                   |
| File URL download                             | Yes       | **No**                                                                                   |
| Structured extraction (`/extract`)            | Yes       | Yes — `fast` and `turbo` modes only (requires Lift model); `balanced` mode not available |
| Document segmentation                         | Yes       | **No**                                                                                   |

<Info>
  On-premises containers do not require API key authentication. Implement access control at the network or reverse proxy level.
</Info>

## Next Steps

<CardGroup>
  <Card title="Usage Analytics" icon="chart-line" href="/docs/on-prem/usage-analytics">
    Monitor request volumes, performance metrics, and system status.
  </Card>

  <Card title="Running the Container" icon="server" href="/docs/on-prem/running-the-container">
    Get the on-prem container up and running in minutes.
  </Card>

  <Card title="Cloud API Reference" icon="book" href="/docs/welcome/api">
    Full REST API reference that the on-prem container mirrors.
  </Card>

  <Card title="On-Prem Overview" icon="building" href="/docs/on-prem/overview">
    Compare open-source and paid on-prem options.
  </Card>
</CardGroup>


# Overview
Source: https://documentation.datalab.to/docs/on-prem/overview

Run inference on your own infrastructure

**Customers can run our models on infrastructure they control with an Enterprise contract.** To get started, please [**fill out this form**](https://www.datalab.to/contact).

# **What’s the difference between Open Source and Datalab's paid On-Prem options?**

Our free open-source options ([Chandra](https://github.com/datalab-to/chandra), [Marker](https://github.com/datalab-to/marker), and [Surya](https://github.com/datalab-to/surya)) are ideal for research, personal use, and early-stage startups.

Our paid on-prem options are for teams that need a commercial license to run our models and have one or more of the following requirements:

* Data privacy or operation in highly regulated environments
* Extremely high volume
* Model training or customization
* White-glove support and SLAs

Here’s a more detailed breakdown.

|                           | **Free (Open Source)**                               | **Datalab On-Prem**                                                                                                                                                           |
| ------------------------- | ---------------------------------------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Intended Use**          | Research, personal use, startups \< \$2M ARR/funding | Commercial workloads requiring data privacy, easy deployments, and white-glove support                                                                                        |
| **Models**                | Open source model weights                            | Access to newer, more accurate models not available in open source (e.g., latest Chandra versions)                                                                            |
| **License**               | GPL + custom RAILs                                   | Commercial license to use our models on-prem (Marker/Surya/Chandra) without sublicensing. Custom rights as needed.                                                            |
| **Deployment**            | Self-install from OSS repos                          | Custom deployment topologies and Datalab-assisted rollout options as needed. For small deployments, we have a Docker image that is simple to use and upgrade.                 |
| **Support**               | Community only                                       | Premium support with SLAs                                                                                                                                                     |
| **Contracting**           | None                                                 | Custom agreements and security reviews. For POCs and small deployments we offer no-redline agreements to get started, fast.                                                   |
| **Billing**               | Free                                                 | Invoice/PO, custom terms. For small deployments we offer credit card checkout.                                                                                                |
| **Scale and Performance** | Self-effort                                          | Throughput/latency/accuracy tuning; custom page counts; custom rate limits. For small deployments we offer optimized single-GPU support that is simple to deploy and operate. |

For higher page volume and GPU concurrency, fully-airgapped deployments, white-glove support, and other custom needs, [fill out this form](https://www.datalab.to/contact).

# **Try before you buy**

We have two easy ways for customers to try our models:

* Our open-source projects, [Chandra](https://github.com/datalab-to/chandra), [Marker](https://github.com/datalab-to/marker), and [Surya](https://github.com/datalab-to/surya).
* Datalab's [Cloud API](https://www.datalab.to).

Our container image mimics the cloud-hosted API for a simple transition:

* [Cloud API Docs](https://documentation.datalab.to/docs/welcome/api)
* [On Prem Container API Docs](./api)

## Next Steps

<CardGroup>
  <Card title="Running the Container" icon="server" href="/docs/on-prem/running-the-container">
    Get the on-prem container up and running in minutes.
  </Card>

  <Card title="On-Prem API" icon="code" href="/docs/on-prem/api">
    API reference for the on-prem container image.
  </Card>

  <Card title="Cloud Quickstart" icon="rocket" href="/docs/welcome/quickstart">
    Try the cloud-hosted API to evaluate before deploying on-prem.
  </Card>

  <Card title="Billing" icon="credit-card" href="/platform/billing">
    Understand pricing for on-prem and cloud plans.
  </Card>
</CardGroup>


# Running the Container
Source: https://documentation.datalab.to/docs/on-prem/running-the-container

Getting our container up and running takes minutes.

Running Datalab's containers requires **a Google Cloud service account key** to pull the container image.

If the terms of your agreement require a license, we'll also provide **a license key.**

# License-enabled containers

Copy your license key, download the service account key, and [run the script in this GitHub repository to get up and running](https://github.com/datalab-to/datalab-on-prem):

```shellscript theme={null}
export DATALAB_LICENSE_KEY=your-license-key
export SERVICE_ACCOUNT_KEY_FILE=path/to/key.json
./run-datalab-inference-container.sh
```

# Kubernetes deployment (Helm)

A Helm chart is available for deploying the container on Kubernetes clusters. Contact [support@datalab.to](mailto:support@datalab.to) to receive the chart and values reference for your deployment.

# Fully-airgapped containers

A license key is not required to run a fully-airgapped container. If the terms of your agreement require a fully-airgapped container, we will provide:

* Access to private registries that contain those images.
* A Google Cloud service account key to pull images.
* Directions for how to run the container.

# `www.datalab.to` must be reachable

Our on-prem license requires that [https://www.datalab.to](https://www.datalab.to) is reachable in order to:

1. Activate and register your license with our servers.
2. Send usage metrics.

# Usage data sent to Datalab

License activation and usage heartbeats **do not send private data to Datalab.**

Our intent is to ensure compliance with our license and to easily support customers when they run into problems.

The container sends the following to our servers:

1. On container startup we activate your license. In that request, we send information about your hardware and OS available in `/proc` and `/sys` (in the container, not on your host).
2. At regular intervals we send usage heartbeats that contain:
   1. The number of successful and failed inference requests completed since the last heartbeat
   2. The number of inference requests submitted to the container over a recent time window

# I need a fully-airgapped deployment

We also support fully-airgapped deployments that do not require a license. [Get started by filling out this form.](https://www.datalab.to/contact)

Please reach out to us at [support@datalab.to](mailto:support@datalab.to) if you have questions.

## Hardware Requirements

| Container Type  | GPU Required | Minimum VRAM | Recommended Use                             |
| --------------- | ------------ | ------------ | ------------------------------------------- |
| `marker`        | Yes (CUDA)   | 24 GB        | Standard document conversion with Surya OCR |
| `chandra`       | Yes (CUDA)   | 80 GB        | Full Chandra VLM for highest accuracy       |
| `chandra-small` | Yes (CUDA)   | 16 GB        | Smaller Chandra variants (2B/4B models)     |

## Health Check

Verify your container is running:

```bash theme={null}
curl http://localhost:8000/health_check
```

Expected response:

```json theme={null}
{"status": "healthy"}
```

## Next Steps

<CardGroup>
  <Card title="On-Prem API" icon="code" href="/docs/on-prem/api">
    API reference for the on-prem container image.
  </Card>

  <Card title="On-Prem Overview" icon="building" href="/docs/on-prem/overview">
    Compare open-source and paid on-prem deployment options.
  </Card>

  <Card title="Cloud Quickstart" icon="rocket" href="/docs/welcome/quickstart">
    Try the cloud-hosted API for quick evaluation.
  </Card>

  <Card title="Error Codes" icon="circle-exclamation" href="/platform/errors">
    Understand HTTP error codes and troubleshooting steps.
  </Card>
</CardGroup>


# Usage Analytics
Source: https://documentation.datalab.to/docs/on-prem/usage-analytics

Monitor inference request analytics and performance metrics in your on-prem deployment.

The usage analytics endpoint provides comprehensive metrics about inference requests processed by your on-premises container. Use this endpoint to monitor request volumes, success rates, performance statistics, and current system status.

<Info>
  This endpoint is only available in on-premises deployments and requires a valid license.
</Info>

## Endpoint

```bash theme={null}
GET /api/v1/usage
```

## Query Parameters

| Parameter    | Type              | Default      | Description                       |
| ------------ | ----------------- | ------------ | --------------------------------- |
| `start_date` | string (ISO 8601) | 24 hours ago | Start of time range for analytics |
| `end_date`   | string (ISO 8601) | Now          | End of time range for analytics   |

<Warning>
  The time range cannot exceed 7 days. Requests with larger ranges will return a 400 error.
</Warning>
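
The 7-day limit can be checked client-side before a request is sent. The sketch below is a hypothetical helper (`build_usage_params` is an illustrative name, not part of any Datalab SDK). It treats naive datetimes as UTC, as the endpoint is documented to do, and raises if the range is inverted or longer than 7 days.

```python
from datetime import datetime, timedelta, timezone

MAX_RANGE = timedelta(days=7)  # documented limit for /api/v1/usage


def build_usage_params(start: datetime, end: datetime) -> dict:
    """Return query params for the usage endpoint, or raise ValueError.

    Hypothetical helper: mirrors the documented 400 conditions client-side.
    """
    # Treat naive datetimes as UTC, matching the endpoint's documented behavior.
    if start.tzinfo is None:
        start = start.replace(tzinfo=timezone.utc)
    if end.tzinfo is None:
        end = end.replace(tzinfo=timezone.utc)

    if start >= end:
        raise ValueError("start_date must be before end_date")
    if end - start > MAX_RANGE:
        raise ValueError("time range must not exceed 7 days")

    return {"start_date": start.isoformat(), "end_date": end.isoformat()}


params = build_usage_params(datetime(2024, 6, 1), datetime(2024, 6, 7, 23, 59, 59))
print(params)
```

The returned dictionary can be passed as `params=` to `requests.get`.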

## Authentication

This endpoint requires a valid on-premises license. If your license is invalid or expired, the endpoint returns a 423 (Locked) status code.

## Response Structure

The endpoint returns a JSON object with five sections:

### Period

The effective time range for the query (normalized to UTC):

```json theme={null}
{
  "period": {
    "start_date": "2024-06-01T00:00:00+00:00",
    "end_date": "2024-06-01T23:59:59+00:00"
  }
}
```

### Summary

Aggregate statistics across all request types:

```json theme={null}
{
  "summary": {
    "total_requests": 1250,
    "successful_requests": 1200,
    "failed_requests": 50,
    "successful_pages_processed": 15000,
    "failed_pages_processed": 500,
    "success_rate": 0.96
  }
}
```

| Field                        | Type  | Description                                 |
| ---------------------------- | ----- | ------------------------------------------- |
| `total_requests`             | int   | Total completed requests in time range      |
| `successful_requests`        | int   | Requests completed without errors           |
| `failed_requests`            | int   | Requests that failed with errors            |
| `successful_pages_processed` | int   | Total pages from successful requests        |
| `failed_pages_processed`     | int   | Total pages from failed requests            |
| `success_rate`               | float | Ratio of successful to total requests (0-1) |

### By Request Type

Per-type breakdown of the same metrics:

```json theme={null}
{
  "by_request_type": {
    "marker": {
      "total_requests": 1000,
      "successful_requests": 980,
      "failed_requests": 20,
      "successful_pages_processed": 12000,
      "failed_pages_processed": 200
    },
    "ocr": {
      "total_requests": 250,
      "successful_requests": 220,
      "failed_requests": 30,
      "successful_pages_processed": 3000,
      "failed_pages_processed": 300
    }
  }
}
```
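
In the sample payloads, the per-type counts add up to the figures in `summary`, and `success_rate` equals `successful_requests / total_requests`. The sketch below recomputes both from a `by_request_type` object. It assumes, based only on the sample data, that every request belongs to exactly one type; `aggregate` is an illustrative helper name.

```python
# Sample counters copied from the documentation examples.
by_request_type = {
    "marker": {"total_requests": 1000, "successful_requests": 980, "failed_requests": 20},
    "ocr": {"total_requests": 250, "successful_requests": 220, "failed_requests": 30},
}


def aggregate(by_type: dict) -> dict:
    """Sum per-type counters and derive the overall success rate."""
    total = sum(m["total_requests"] for m in by_type.values())
    successful = sum(m["successful_requests"] for m in by_type.values())
    failed = sum(m["failed_requests"] for m in by_type.values())
    # Guard against division by zero when no requests completed.
    rate = successful / total if total else None
    return {
        "total_requests": total,
        "successful_requests": successful,
        "failed_requests": failed,
        "success_rate": rate,
    }


totals = aggregate(by_request_type)
print(totals)
```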

### Performance

Processing time and queue wait statistics (only includes successful requests):

```json theme={null}
{
  "performance": {
    "average_processing_time_secs": 12.5,
    "median_processing_time_secs": 10.2,
    "p95_processing_time_secs": 25.8,
    "p99_processing_time_secs": 35.4,
    "average_queue_wait_secs": 2.3
  }
}
```

| Field                          | Type  | Description                        |
| ------------------------------ | ----- | ---------------------------------- |
| `average_processing_time_secs` | float | Mean time from start to completion |
| `median_processing_time_secs`  | float | 50th percentile processing time    |
| `p95_processing_time_secs`     | float | 95th percentile processing time    |
| `p99_processing_time_secs`     | float | 99th percentile processing time    |
| `average_queue_wait_secs`      | float | Mean time from submission to start |

<Info>
  Performance metrics are `null` when there are no successful requests in the time range. Failed requests are excluded from performance calculations.
</Info>

### Current Status

Live snapshot of in-progress and queued requests (not filtered by time range):

```json theme={null}
{
  "current_status": {
    "requests_in_progress": 5,
    "requests_queued": 12
  }
}
```

| Field                  | Type | Description                        |
| ---------------------- | ---- | ---------------------------------- |
| `requests_in_progress` | int  | Requests currently being processed |
| `requests_queued`      | int  | Requests waiting to be processed   |

## Examples

### Basic Usage (Default 24-Hour Window)

<CodeGroup>
  ```python Python SDK theme={null}
  # The Python SDK does not yet support the usage endpoint
  # Use the requests library directly
  import requests

  response = requests.get(
      "http://localhost:8000/api/v1/usage",
      headers={"X-API-Key": "any-value"}  # Not validated in on-prem
  )

  data = response.json()
  print(f"Total requests: {data['summary']['total_requests']}")
  print(f"Success rate: {data['summary']['success_rate']:.2%}")
  ```

  ```bash cURL theme={null}
  curl -X GET http://localhost:8000/api/v1/usage \
    -H "X-API-Key: any-value"
  ```

  ```python Python (requests) theme={null}
  import requests

  response = requests.get(
      "http://localhost:8000/api/v1/usage",
      headers={"X-API-Key": "any-value"}
  )

  data = response.json()

  # Print summary
  summary = data["summary"]
  print(f"Total: {summary['total_requests']}")
  print(f"Success: {summary['successful_requests']}")
  print(f"Failed: {summary['failed_requests']}")
  print(f"Success rate: {summary['success_rate']:.2%}")

  # Print performance metrics
  perf = data["performance"]
  if perf["average_processing_time_secs"] is not None:
      print(f"\nAvg processing time: {perf['average_processing_time_secs']:.2f}s")
      print(f"P95 processing time: {perf['p95_processing_time_secs']:.2f}s")
  ```
</CodeGroup>

### Custom Time Range

<CodeGroup>
  ```python Python SDK theme={null}
  import requests
  from datetime import datetime, timedelta, timezone

  # Query last 7 days
  end_date = datetime.now(timezone.utc)
  start_date = end_date - timedelta(days=7)

  response = requests.get(
      "http://localhost:8000/api/v1/usage",
      params={
          "start_date": start_date.isoformat(),
          "end_date": end_date.isoformat()
      },
      headers={"X-API-Key": "any-value"}
  )

  data = response.json()
  ```

  ```bash cURL theme={null}
  # Query specific date range
  curl -X GET "http://localhost:8000/api/v1/usage?start_date=2024-06-01T00:00:00Z&end_date=2024-06-07T23:59:59Z" \
    -H "X-API-Key: any-value"
  ```

  ```python Python (requests) theme={null}
  import requests
  from datetime import datetime, timedelta, timezone

  # Query last 3 days
  end_date = datetime.now(timezone.utc)
  start_date = end_date - timedelta(days=3)

  response = requests.get(
      "http://localhost:8000/api/v1/usage",
      params={
          "start_date": start_date.isoformat(),
          "end_date": end_date.isoformat()
      },
      headers={"X-API-Key": "any-value"}
  )

  data = response.json()
  print(f"Period: {data['period']['start_date']} to {data['period']['end_date']}")
  ```
</CodeGroup>

### Monitoring Dashboard Example

<CodeGroup>
  ```python Python SDK theme={null}
  import requests

  def get_usage_metrics():
      """Fetch current usage metrics for monitoring dashboard."""
      response = requests.get(
          "http://localhost:8000/api/v1/usage",
          headers={"X-API-Key": "any-value"}
      )
      
      if response.status_code != 200:
          raise Exception(f"Failed to fetch metrics: {response.status_code}")
      
      return response.json()

  def print_dashboard():
      """Print a simple monitoring dashboard."""
      data = get_usage_metrics()
      
      print("=" * 60)
      print("DATALAB ON-PREM USAGE DASHBOARD")
      print("=" * 60)
      
      # Summary
      summary = data["summary"]
      print(f"\n📊 SUMMARY (Last 24 Hours)")
      print(f"  Total Requests:     {summary['total_requests']:,}")
      print(f"  Successful:         {summary['successful_requests']:,}")
      print(f"  Failed:             {summary['failed_requests']:,}")
      print(f"  Success Rate:       {summary['success_rate']:.2%}")
      print(f"  Pages Processed:    {summary['successful_pages_processed']:,}")
      
      # By type
      print(f"\n📈 BY REQUEST TYPE")
      for req_type, metrics in data["by_request_type"].items():
          print(f"  {req_type.upper()}:")
          print(f"    Requests: {metrics['total_requests']:,} ({metrics['successful_requests']:,} successful)")
          print(f"    Pages: {metrics['successful_pages_processed']:,}")
      
      # Performance
      perf = data["performance"]
      if perf["average_processing_time_secs"] is not None:
          print(f"\n⚡ PERFORMANCE")
          print(f"  Avg Processing:     {perf['average_processing_time_secs']:.2f}s")
          print(f"  Median Processing:  {perf['median_processing_time_secs']:.2f}s")
          print(f"  P95 Processing:     {perf['p95_processing_time_secs']:.2f}s")
          print(f"  P99 Processing:     {perf['p99_processing_time_secs']:.2f}s")
          print(f"  Avg Queue Wait:     {perf['average_queue_wait_secs']:.2f}s")
      
      # Current status
      status = data["current_status"]
      print(f"\n🔄 CURRENT STATUS")
      print(f"  In Progress:        {status['requests_in_progress']}")
      print(f"  Queued:             {status['requests_queued']}")
      
      print("=" * 60)

  if __name__ == "__main__":
      print_dashboard()
  ```

  ```bash cURL theme={null}
  # Simple monitoring script
  curl -s http://localhost:8000/api/v1/usage \
    -H "X-API-Key: any-value" | \
    jq '{
      total: .summary.total_requests,
      success_rate: .summary.success_rate,
      in_progress: .current_status.requests_in_progress,
      queued: .current_status.requests_queued
    }'
  ```

  ```python Python (requests) theme={null}
  import requests

  def monitor_system_health():
      """Check system health based on usage metrics."""
      response = requests.get(
          "http://localhost:8000/api/v1/usage",
          headers={"X-API-Key": "any-value"}
      )
      
      data = response.json()
      summary = data["summary"]
      status = data["current_status"]
      perf = data["performance"]
      
      # Check success rate
      if summary["success_rate"] < 0.95:
          print(f"⚠️  WARNING: Success rate is {summary['success_rate']:.2%}")
      
      # Check queue depth
      if status["requests_queued"] > 100:
          print(f"⚠️  WARNING: {status['requests_queued']} requests queued")
      
      # Check processing time
      if perf["p95_processing_time_secs"] is not None and perf["p95_processing_time_secs"] > 60:
          print(f"⚠️  WARNING: P95 processing time is {perf['p95_processing_time_secs']:.1f}s")
      
      print("✅ System health check complete")

  monitor_system_health()
  ```
</CodeGroup>

## Error Responses

### 400 Bad Request

Invalid query parameters:

```json theme={null}
{
  "detail": "start_date must be before end_date."
}
```

```json theme={null}
{
  "detail": "Time range must not exceed 7 days."
}
```

### 423 Locked

License validation failed:

```json theme={null}
{
  "detail": "License validation failed"
}
```
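
A client can branch on these status codes before reading the metrics. The sketch below is a hypothetical pure function (`describe_usage_error` is an illustrative name) that maps a status code and parsed JSON body to a message. It assumes error bodies carry a `detail` field, as in the examples in this section.

```python
from typing import Optional


def describe_usage_error(status_code: int, body: dict) -> Optional[str]:
    """Return a readable message for a failed usage call, or None on success."""
    if status_code == 200:
        return None
    detail = body.get("detail", "no detail provided")
    if status_code == 400:
        return f"Invalid query parameters: {detail}"
    if status_code == 423:
        return f"License problem: {detail}"
    return f"Unexpected status {status_code}: {detail}"


print(describe_usage_error(400, {"detail": "Time range must not exceed 7 days."}))
```

With `requests`, this could be called as `describe_usage_error(response.status_code, response.json())`.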

## Implementation Notes

* Only **completed requests** (with `end_time` set) are included in summary statistics
* Failed requests are counted in totals but excluded from performance metrics
* Performance percentiles use linear interpolation for accurate calculation
* Queue wait time is calculated as `start_time - submission_time`
* Processing time is calculated as `end_time - start_time`
* Naive datetimes (without timezone) are treated as UTC
* The `current_status` section provides a live snapshot and is not filtered by the time range
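
To illustrate the percentile note, the sketch below shows one common variant of linear interpolation between closest ranks (the same scheme as NumPy's default `linear` method). The container's exact formula is not documented here, so treat this as an illustration rather than the implementation.

```python
def percentile(values: list[float], q: float) -> float:
    """Percentile with linear interpolation between closest ranks (q in 0..100)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Fractional index into the sorted list.
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    # Interpolate between the two neighboring samples.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Made-up processing times in seconds.
times = [8.0, 10.0, 12.0, 20.0, 40.0]
p50 = percentile(times, 50)
p95 = percentile(times, 95)
print(p50, p95)
```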

## Use Cases

### Capacity Planning

Monitor request volumes and processing times to plan infrastructure scaling:

```python theme={null}
import requests
from datetime import datetime, timedelta, timezone

# Get last 7 days of data
end = datetime.now(timezone.utc)
start = end - timedelta(days=7)

response = requests.get(
    "http://localhost:8000/api/v1/usage",
    params={"start_date": start.isoformat(), "end_date": end.isoformat()},
    headers={"X-API-Key": "any-value"}
)

data = response.json()
avg_daily_requests = data["summary"]["total_requests"] / 7
avg_daily_pages = data["summary"]["successful_pages_processed"] / 7

print(f"Average daily requests: {avg_daily_requests:.0f}")
print(f"Average daily pages: {avg_daily_pages:.0f}")
```

### Performance Monitoring

Track processing times to identify performance degradation:

```python theme={null}
import requests

response = requests.get(
    "http://localhost:8000/api/v1/usage",
    headers={"X-API-Key": "any-value"}
)

perf = response.json()["performance"]

# Alert if P95 exceeds threshold
if perf["p95_processing_time_secs"] is not None and perf["p95_processing_time_secs"] > 30:
    print(f"ALERT: P95 processing time is {perf['p95_processing_time_secs']:.1f}s")
```

### Queue Monitoring

Monitor queue depth to detect bottlenecks:

```python theme={null}
import requests

response = requests.get(
    "http://localhost:8000/api/v1/usage",
    headers={"X-API-Key": "any-value"}
)

status = response.json()["current_status"]

if status["requests_queued"] > 50:
    print(f"WARNING: {status['requests_queued']} requests in queue")
```

## Next Steps

<CardGroup>
  <Card title="On-Prem API" icon="server" href="/docs/on-prem/api">
    Full API reference for the on-prem container.
  </Card>

  <Card title="Running the Container" icon="play" href="/docs/on-prem/running-the-container">
    Get the on-prem container up and running.
  </Card>

  <Card title="Error Codes" icon="circle-exclamation" href="/platform/errors">
    Understand HTTP error codes and troubleshooting.
  </Card>

  <Card title="On-Prem Overview" icon="building" href="/docs/on-prem/overview">
    Compare open-source and paid on-prem options.
  </Card>
</CardGroup>


# Batch Processing
Source: https://documentation.datalab.to/docs/recipes/conversion/batch-documents

Convert multiple documents efficiently with parallel processing.

Process directories of documents with the SDK or CLI. Both handle rate limiting and retries automatically.

## SDK Batch Processing

Process multiple files using Python's async capabilities:

### Async Batch Processing

For higher throughput, submit all conversions at once and await them with `asyncio.gather`:

```python theme={null}
import asyncio
from pathlib import Path
from datalab_sdk import AsyncDatalabClient, ConvertOptions

async def process_directory(input_dir: str, output_dir: str):
    async with AsyncDatalabClient() as client:
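        pdf_files = list(Path(input_dir).glob("*.pdf"))
        # Create the output directory up front so write_text does not fail on a missing path
        Path(output_dir).mkdir(parents=True, exist_ok=True)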

        # Process all files concurrently
        tasks = [
            client.convert(str(pdf), options=ConvertOptions(mode="balanced"))
            for pdf in pdf_files
        ]

        results = await asyncio.gather(*tasks, return_exceptions=True)

        for pdf, result in zip(pdf_files, results):
            if isinstance(result, Exception):
                print(f"Error processing {pdf.name}: {result}")
            else:
                output_path = Path(output_dir) / f"{pdf.stem}.md"
                output_path.write_text(result.markdown)
                print(f"Saved: {output_path}")

asyncio.run(process_directory("./documents/", "./output/"))
```

## CLI Batch Processing

The CLI handles directory processing automatically:

```bash theme={null}
# Convert all PDFs in a directory
datalab convert ./documents/ --output_dir ./output/

# Filter by extension
datalab convert ./documents/ --extensions pdf,docx

# Control concurrency
datalab convert ./documents/ --max_concurrent 10

# With processing options
datalab convert ./documents/ \
  --mode balanced \
  --format markdown \
  --output_dir ./output/
```

See [CLI Reference](/docs/welcome/sdk/cli) for all options.

## REST API Batch Processing

For raw API usage, implement parallel requests with retry handling:

```python theme={null}
import os
import time
import requests
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, as_completed
from requests.adapters import HTTPAdapter, Retry

API_URL = "https://www.datalab.to/api/v1/convert"
API_KEY = os.getenv("DATALAB_API_KEY")

# Configure session with retries
session = requests.Session()
retries = Retry(
    total=20,
    backoff_factor=4,
    status_forcelist=[429],
    allowed_methods=["GET", "POST"],
    raise_on_status=False,
)
session.mount("https://", HTTPAdapter(max_retries=retries))


def convert_document(pdf_path: Path, output_format="markdown", mode="balanced"):
    """Convert a single document with polling."""
    headers = {"X-API-Key": API_KEY}

    # Submit request
    with open(pdf_path, "rb") as f:
        response = session.post(
            API_URL,
            files={"file": (pdf_path.name, f, "application/pdf")},
            data={"output_format": output_format, "mode": mode},
            headers=headers
        )

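    # Surface HTTP errors (including a 429 that survived all retries) before parsing
    response.raise_for_status()
    data = response.json()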
    check_url = data["request_check_url"]

    # Poll for completion
    for _ in range(300):
        result = session.get(check_url, headers=headers).json()

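        if result["status"] == "complete":
            # A completed request can still report success: false,
            # for example when the page concurrency limit is exceeded
            if result.get("success") is False:
                raise Exception(f"Failed: {result.get('error')}")
            return result
        elif result["status"] == "failed":
            raise Exception(f"Failed: {result.get('error')}")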

        time.sleep(2)

    raise Exception("Timeout")


def batch_convert(directory: str, max_workers: int = 5):
    """Process all PDFs in a directory."""
    doc_dir = Path(directory)
    pdfs = list(doc_dir.glob("*.pdf"))
    print(f"Found {len(pdfs)} PDFs")

    results = {}

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {
            executor.submit(convert_document, pdf): pdf.name
            for pdf in pdfs
        }

        for future in as_completed(futures):
            filename = futures[future]
            try:
                result = future.result()
                results[filename] = result
                print(f"Converted: {filename}")
            except Exception as e:
                print(f"Error processing {filename}: {e}")

    return results


# Usage
results = batch_convert("./documents/", max_workers=5)
```

## Rate Limits

* **Request rate limit:** 200 requests per minute per account (returns 429 when exceeded)
* **Concurrent request limit:** 400 concurrent requests (returns 429 when exceeded)
* **Page concurrency limit:** 5,000 pages in flight across all requests. This limit is enforced during processing, not at submission, so results return with `success: false` if it is exceeded. Always check the `success` field when polling for results.
* The SDK and CLI handle request rate limiting and retries automatically
* For higher limits, contact [support@datalab.to](mailto:support@datalab.to)

See [API Limits](/docs/common/limits) for details.

## Tips

1. **Use async for high throughput** - Async processing handles many concurrent requests efficiently
2. **Limit concurrency** - Start with 5-10 concurrent requests and adjust based on your rate limits
3. **Handle failures gracefully** - Use `return_exceptions=True` with `asyncio.gather` to continue processing on errors
4. **Save progress** - Write results incrementally to avoid losing work on long batches
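
Tip 2 can be implemented by wrapping each task in an `asyncio.Semaphore`. The sketch below is generic: `run_limited` is a hypothetical helper name, and `convert_one` is a placeholder coroutine standing in for a real call such as `client.convert(...)`.

```python
import asyncio


async def run_limited(items, worker, limit: int = 5):
    """Run worker(item) for every item with at most `limit` running at once.

    Results come back in input order; exceptions are returned, not raised.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        # Only `limit` coroutines can hold the semaphore at the same time.
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(guarded(i) for i in items), return_exceptions=True)


async def convert_one(name: str) -> str:
    # Placeholder for a real conversion call; sleeps instead of hitting the API.
    await asyncio.sleep(0.01)
    if name.endswith(".bad"):
        raise ValueError(f"cannot convert {name}")
    return f"{name}: ok"


results = asyncio.run(run_limited(["a.pdf", "b.pdf", "c.bad"], convert_one, limit=2))
for item in results:
    print(item)
```

Because `return_exceptions=True` is set, one failing file does not cancel the others, which matches tip 3.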

## Next Steps

<CardGroup>
  <Card title="Document Conversion" icon="file-lines" href="/docs/recipes/conversion/conversion-api-overview">
    Learn more about Marker's conversion API and output formats.
  </Card>

  <Card title="API Limits" icon="gauge" href="/docs/common/limits">
    Understand rate limits and how to optimize throughput.
  </Card>

  <Card title="Pipelines" icon="workflow" href="/docs/recipes/pipelines/pipeline-overview">
    Chain processors into versioned, reusable pipelines.
  </Card>

  <Card title="Webhooks" icon="bell" href="/platform/webhooks">
    Get notified when batch conversions complete via webhooks.
  </Card>
</CardGroup>


# Document Conversion
Source: https://documentation.datalab.to/docs/recipes/conversion/conversion-api-overview

Convert documents to Markdown, HTML, JSON, or chunks using the Convert API.

Convert PDFs, Word documents, spreadsheets, and images to machine-readable formats. Marker handles complex layouts, tables, math, and images.

**Before you begin**, make sure you have:

1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits)
2. Python 3.10+ installed
3. The Datalab SDK: `pip install datalab-python-sdk`
4. Your `DATALAB_API_KEY` environment variable set

<Info>
  **Building for production?** Use [Pipelines](/docs/recipes/pipelines/pipeline-overview) to chain processors, version your configuration, and deploy with a single API call.
</Info>

## Quick Start

<CodeGroup>
  ```python Python SDK theme={null}
  from datalab_sdk import DatalabClient, ConvertOptions

  client = DatalabClient()

  # Basic conversion
  result = client.convert("document.pdf")
  print(result.markdown)

  # With options
  options = ConvertOptions(
      output_format="markdown",
      mode="balanced",
      paginate=True
  )
  result = client.convert("document.pdf", options=options)
  ```

  ```bash cURL theme={null}
  # Submit request
  curl -X POST https://www.datalab.to/api/v1/convert \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -F "file=@document.pdf" \
    -F "output_format=markdown" \
    -F "mode=balanced"

  # Poll for results (use request_check_url from response)
  curl https://www.datalab.to/api/v1/convert/REQUEST_ID \
    -H "X-API-Key: $DATALAB_API_KEY"
  ```

  ```python Python (requests) theme={null}
  import os, time, requests

  API_URL = "https://www.datalab.to/api/v1/convert"
  headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")}

  # Submit request
  with open("document.pdf", "rb") as f:
      response = requests.post(
          API_URL,
          files={"file": ("document.pdf", f, "application/pdf")},
          data={"output_format": "markdown", "mode": "balanced"},
          headers=headers
      )

  check_url = response.json()["request_check_url"]

  # Poll for completion
  for _ in range(300):
      result = requests.get(check_url, headers=headers).json()
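      if result["status"] == "complete":
          print(result["markdown"])
          break
      # Stop polling as soon as the request fails instead of waiting for the timeout
      if result["status"] == "failed":
          raise RuntimeError(f"Conversion failed: {result.get('error')}")
      time.sleep(2)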
  ```
</CodeGroup>

The SDK handles polling automatically. For the REST API, you submit a request and poll the `request_check_url` until the status is `complete`.

See [SDK Conversion](/docs/welcome/sdk/conversion) for complete SDK documentation.

<Info>
  **File limits:** Maximum file size is 200 MB, with up to 7,000 pages per request. See [API Limits](/docs/common/limits) for the full list.
</Info>

## Parameters

### Core Parameters

| Parameter       | Type   | Default    | Description                                         |
| --------------- | ------ | ---------- | --------------------------------------------------- |
| `file`          | file   | -          | Document file (multipart upload)                    |
| `file_url`      | string | -          | URL to document (alternative to file)               |
| `output_format` | string | `markdown` | Output format: `markdown`, `html`, `json`, `chunks` |
| `mode`          | string | `fast`     | Processing mode (see below)                         |

<Tip>
  **Which output format should I use?**

  * **LLM/RAG pipelines** → `markdown` (default, most compatible)
  * **Web display** → `html` (preserves visual structure)
  * **Programmatic access to blocks** → `json` (includes bounding boxes and block types)
  * **Embedding and search** → `chunks` (pre-chunked for vector databases)
</Tip>

### Processing Modes

| Mode       | Description                                     | Best For                                         |
| ---------- | ----------------------------------------------- | ------------------------------------------------ |
| `fast`     | Lowest latency, good for simple documents       | High-throughput pipelines, simple layouts        |
| `balanced` | Balance of speed and accuracy **(recommended)** | Most use cases                                   |
| `accurate` | Highest accuracy, best for complex layouts      | Complex tables, dense layouts, scanned documents |

<Tip>
  **Which mode should I use?**

  * **Most use cases** → `balanced` (recommended; set it explicitly, because the API default is `fast`)
  * **Simple, clean PDFs** at high throughput → `fast`
  * **Scanned documents, complex tables, or dense layouts** → `accurate`
</Tip>

### Page Control

| Parameter    | Type   | Default | Description                                                                             |
| ------------ | ------ | ------- | --------------------------------------------------------------------------------------- |
| `max_pages`  | int    | -       | Maximum pages to process                                                                |
| `page_range` | string | -       | Specific pages (e.g., `"0-5,10"`, 0-indexed). For spreadsheets, filters by sheet index. |
| `paginate`   | bool   | `false` | Add page delimiters to output                                                           |
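
To illustrate the `page_range` syntax, the sketch below expands a string such as `"0-5,10"` into the 0-indexed pages it selects. `expand_page_range` is a hypothetical helper for client-side checks, and it assumes ranges are inclusive on both ends, which the table does not state explicitly.

```python
def expand_page_range(page_range: str) -> list[int]:
    """Expand a page_range string like "0-5,10" into sorted 0-indexed pages.

    Assumption: "a-b" includes both a and b.
    """
    pages = set()
    for part in page_range.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid range: {part}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


selected = expand_page_range("0-5,10")
print(selected)  # [0, 1, 2, 3, 4, 5, 10] under the inclusive-range assumption
```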

### Image Handling

| Parameter                  | Type | Default | Description                   |
| -------------------------- | ---- | ------- | ----------------------------- |
| `disable_image_extraction` | bool | `false` | Don't extract images          |
| `disable_image_captions`   | bool | `false` | Don't generate image captions |

### Advanced Options

| Parameter                    | Type   | Default | Description                                                                                                                                                                                                                |
| ---------------------------- | ------ | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `add_block_ids`              | bool   | `false` | Add `data-block-id` attributes to HTML elements                                                                                                                                                                            |
| `skip_cache`                 | bool   | `false` | Skip cached results                                                                                                                                                                                                        |
| `save_checkpoint`            | bool   | `false` | Save checkpoint for reuse                                                                                                                                                                                                  |
| `word_bboxes`                | bool   | `false` | Predict per-word bounding boxes with confidence scores. Each word is inlined into HTML output as a `<span data-bbox="..." data-confidence="...">` element (markdown output strips these). Billed at \$0.30 per 1K pages.   |
| `extras`                     | string | -       | Comma-separated: `track_changes`, `chart_understanding`, `extract_links`, `table_cell_bboxes`, `list_item_bboxes`, `infographic`, `new_block_types`. (`table_row_bboxes` is deprecated — use `table_cell_bboxes` instead.) |
| `include_markdown_in_chunks` | bool   | `false` | Include markdown content in chunks/JSON output                                                                                                                                                                             |
| `token_efficient_markdown`   | bool   | `false` | Optimize markdown for LLM token efficiency                                                                                                                                                                                 |
| `fence_synthetic_captions`   | bool   | `false` | Wrap synthetic image captions in HTML comments                                                                                                                                                                             |
| `additional_config`          | string | -       | JSON with extra config (see below)                                                                                                                                                                                         |
| `webhook_url`                | string | -       | Override webhook URL for this request                                                                                                                                                                                      |
| `processing_location`        | string | -       | Data residency region override: `"eu"` or `"us"`. When set, use `file_url` or a pre-uploaded `datalab://` reference — multipart uploads are not supported. EU processing carries a regional pricing premium.               |

<Note>
  For structured extraction, use the [Extract API](/docs/recipes/structured-extraction/api-overview). For document segmentation, use the [Segment API](/docs/recipes/document-segmentation/auto-segmentation).
</Note>

<Note>
  The `track_changes` extra is supported on this endpoint. You can also use the dedicated [Track Changes endpoint](/docs/recipes/extract-redlines-and-comments/track-changes-from-word-documents).
</Note>

### Bounding Box Add-ons

Three add-ons annotate HTML output with spatial coordinates and confidence scores. All are billed at **\$0.30 per 1K pages** each (additive on top of the base conversion rate) and require the `html` output format to expose the attributes.

| Add-on            | How to enable                | What it annotates                                                                                                   |
| ----------------- | ---------------------------- | ------------------------------------------------------------------------------------------------------------------- |
| Word bboxes       | `word_bboxes=True`           | Every word in the document gets a `data-bbox` and `data-confidence` span in HTML                                    |
| Table cell bboxes | `extras="table_cell_bboxes"` | `<colgroup><col>`, `<tr>`, and `<td>`/`<th>` elements get `data-bbox`/`data-confidence`; also enables `word_bboxes` |
| List item bboxes  | `extras="list_item_bboxes"`  | Each `<li>` element gets `data-bbox`/`data-confidence`; also enables `word_bboxes`                                  |

```python theme={null}
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

# Get table cell and list item bboxes (both also enable word bboxes)
options = ConvertOptions(
    output_format="html",
    extras="table_cell_bboxes,list_item_bboxes",
)
result = client.convert("document.pdf", options=options)
# HTML contains data-bbox and data-confidence on table cells, list items, and words
```
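
Once the HTML is returned, the annotations can be read with any HTML parser. The sketch below uses Python's standard-library `html.parser` to collect every element that carries a `data-bbox` attribute. The `BBoxCollector` class, the `collect_bboxes` helper, and the sample HTML are illustrative only. Attribute values are kept as raw strings because the coordinate format is not specified in this section.

```python
from html.parser import HTMLParser


class BBoxCollector(HTMLParser):
    """Collect (tag, data-bbox, data-confidence) for annotated elements."""

    def __init__(self):
        super().__init__()
        self.annotations = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if "data-bbox" in attr_map:
            self.annotations.append(
                (tag, attr_map["data-bbox"], attr_map.get("data-confidence"))
            )


def collect_bboxes(html):
    """Return every annotated element found in an HTML string."""
    collector = BBoxCollector()
    collector.feed(html)
    collector.close()
    return collector.annotations


# Hypothetical sample; in practice pass the HTML returned by the conversion
sample_html = (
    '<ul><li data-bbox="10 20 110 40" data-confidence="0.98">First item</li></ul>'
)
annotations = collect_bboxes(sample_html)
```

Filtering `annotations` by tag name (for example `td` or `li`) separates table cell boxes from list item boxes.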

### Additional Config Options

Pass these keys in `additional_config` (a JSON string in raw API requests):

| Key                           | Type | Description                     |
| ----------------------------- | ---- | ------------------------------- |
| `keep_spreadsheet_formatting` | bool | Preserve spreadsheet formatting |
| `keep_pageheader_in_output`   | bool | Include page headers            |
| `keep_pagefooter_in_output`   | bool | Include page footers            |

Example:

```python theme={null}
options = ConvertOptions(
    additional_config={
        "keep_spreadsheet_formatting": True,
        "keep_pageheader_in_output": False
    }
)
```
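
When calling the REST endpoint directly rather than through the SDK, the same keys would be serialized with `json.dumps` before sending. A minimal sketch, assuming `additional_config` travels as an ordinary form field next to `output_format` (the `form_fields` name is illustrative, not part of the API):

```python
import json

# Options from the table above, serialized to a JSON string
additional_config = json.dumps({
    "keep_spreadsheet_formatting": True,
    "keep_pageheader_in_output": False,
})

# Assumed form layout for a raw multipart request
form_fields = {
    "output_format": "markdown",
    "additional_config": additional_config,
}
```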

## Response Fields

| Field                 | Type   | Description                                   |
| --------------------- | ------ | --------------------------------------------- |
| `status`              | string | `processing`, `complete`, or `failed`         |
| `success`             | bool   | Whether conversion succeeded                  |
| `output_format`       | string | Requested output format                       |
| `markdown`            | string | Markdown output (if format is markdown)       |
| `html`                | string | HTML output (if format is html)               |
| `json`                | object | JSON output (if format is json)               |
| `chunks`              | object | Chunked output (if format is chunks)          |
| `images`              | object | Extracted images as `{filename: base64}`      |
| `metadata`            | object | Document metadata                             |
| `page_count`          | int    | Number of pages processed                     |
| `parse_quality_score` | float  | Quality score (0-5)                           |
| `cost_breakdown`      | object | Cost in cents                                 |
| `checkpoint_id`       | string | Checkpoint ID (if `save_checkpoint` was true) |
| `error`               | string | Error message if failed                       |
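
Because `images` arrives as a `{filename: base64}` mapping, each entry has to be decoded before it can be written to disk. The helper below is a minimal sketch using only the standard library. `save_images` is a hypothetical name, and it assumes the mapping values are plain base64 strings without a data-URI prefix.

```python
import base64
from pathlib import Path


def save_images(images, output_dir):
    """Decode a {filename: base64} mapping and write each image to output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, encoded in images.items():
        # Keep only the final path component so a filename cannot escape output_dir
        target = out / Path(filename).name
        target.write_bytes(base64.b64decode(encoded))
        written.append(target)
    return written
```

With a parsed response dict, this would be called as `save_images(response["images"], "images")`.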

## Examples

### Convert with High Accuracy

```python theme={null}
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

options = ConvertOptions(
    mode="accurate",
    output_format="markdown"
)

result = client.convert("complex_document.pdf", options=options)
print(f"Quality score: {result.parse_quality_score}")
print(result.markdown)
```

### HTML with Block IDs for Citations

```python theme={null}
options = ConvertOptions(
    output_format="html",
    add_block_ids=True
)

result = client.convert("document.pdf", options=options)
# HTML elements have data-block-id attributes for citation tracking
```

### Process Specific Pages

```python theme={null}
options = ConvertOptions(
    page_range="0-4,10,15-20",  # Pages 0-4, 10, and 15-20
    output_format="markdown"
)

result = client.convert("large_document.pdf", options=options)
```
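
To check which pages a range string selects before submitting it, the string can be expanded locally. This is a sketch, not the API's own parser: `parse_page_range` is a hypothetical helper, and it assumes ranges such as `0-4` are inclusive on both ends, as the comment in the example above suggests.

```python
def parse_page_range(page_range):
    """Expand a string like '0-4,10,15-20' into a sorted list of page indices."""
    pages = set()
    for part in page_range.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(value) for value in part.split("-", 1))
            if end < start:
                raise ValueError(f"Invalid range: {part!r}")
            # Assumption: both ends of the range are included
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


selected = parse_page_range("0-4,10,15-20")
```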

### Process Specific Sheets from a Spreadsheet

For spreadsheet files, `page_range` filters by sheet index (0-based):

```python theme={null}
options = ConvertOptions(
    page_range="0,2",  # First and third sheets only
    output_format="markdown"
)

result = client.convert("workbook.xlsx", options=options)
```

### Extract Track Changes from Word Documents

```python theme={null}
options = ConvertOptions(
    extras="track_changes",
    output_format="json"
)

result = client.convert("document_with_changes.docx", options=options)
```

## Parse Quality Score

Every conversion response includes a `parse_quality_score` (0-5) that indicates how well the document was parsed:

| Score Range | Quality   | Recommended Action                                 |
| ----------- | --------- | -------------------------------------------------- |
| 4.0 - 5.0   | Excellent | Use the output directly                            |
| 3.0 - 3.9   | Good      | Review for minor issues                            |
| 2.0 - 2.9   | Fair      | Consider retrying with `accurate` mode             |
| 0.0 - 1.9   | Poor      | Retry with `accurate` mode or check the input file |

Use quality scores to build automated quality gates:

```python theme={null}
result = client.convert("document.pdf", options=ConvertOptions(mode="balanced"))

if result.parse_quality_score < 3.0:
    # Retry with higher accuracy
    result = client.convert("document.pdf", options=ConvertOptions(mode="accurate"))
```

The same check can gate pipeline execution or route documents to different processing configurations.
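
The score table translates directly into a small routing function. The sketch below is one way to encode it. `quality_action` is a hypothetical helper, and the thresholds are taken from the table in this section, with scores between tiers (for example 3.95) assigned to the lower tier.

```python
def quality_action(score):
    """Map a parse_quality_score (0-5) to a (quality, action) pair."""
    if score >= 4.0:
        return "excellent", "use the output directly"
    if score >= 3.0:
        return "good", "review for minor issues"
    if score >= 2.0:
        return "fair", "consider retrying with accurate mode"
    return "poor", "retry with accurate mode or check the input file"
```

A pipeline could call `quality_action(result.parse_quality_score)` and branch on the returned quality label.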

## Checkpoints

Save a processing checkpoint to reuse parsed results for extraction or segmentation without re-processing:

```python theme={null}
from datalab_sdk import DatalabClient, ConvertOptions, ExtractOptions
import json

client = DatalabClient()

# Step 1: Convert and save checkpoint
options = ConvertOptions(
    save_checkpoint=True,
    output_format="markdown"
)
result = client.convert("document.pdf", options=options)
checkpoint_id = result.checkpoint_id

# Step 2: Use checkpoint for extraction (no re-processing needed)
extraction_options = ExtractOptions(
    page_schema=json.dumps({"type": "object", "properties": {"title": {"type": "string"}}}),
    checkpoint_id=checkpoint_id
)
extract_result = client.extract("document.pdf", options=extraction_options)
```

Checkpoints save time and cost when you need to run multiple operations (extraction, segmentation) on the same document.

<Warning>
  Results are deleted from Datalab servers one hour after processing completes. Retrieve your results promptly.
</Warning>

## Next Steps

<CardGroup>
  <Card title="Structured Extraction" icon="table" href="/docs/recipes/structured-extraction/api-overview">
    Extract structured data from documents using JSON schemas
  </Card>

  <Card title="Batch Processing" icon="layer-group" href="/docs/recipes/conversion/batch-documents">
    Process multiple documents concurrently
  </Card>

  <Card title="Document Segmentation" icon="scissors" href="/docs/recipes/document-segmentation/auto-segmentation">
    Split multi-document PDFs into segments
  </Card>

  <Card title="Webhooks" icon="bell" href="/platform/webhooks">
    Get notified when conversions complete
  </Card>
</CardGroup>


# Create Document
Source: https://documentation.datalab.to/docs/recipes/create-document/create-document-api-overview

Generate DOCX files from markdown with track changes support.

Convert markdown to Word documents (DOCX) with support for track changes, insertions, deletions, and comments. This is useful for generating legal documents, contracts with redlines, and collaborative review documents.

## Quick Start

<CodeGroup>
  ```python Python SDK theme={null}
  from datalab_sdk import DatalabClient

  client = DatalabClient()

  markdown = (
      "# Contract\n\n"
      "This agreement is between "
      '<ins data-revision-author="Editor" data-revision-datetime="2024-01-15T10:00:00Z">'
      "Acme Corp</ins> and the client."
  )

  result = client.create_document(markdown=markdown)
  result.save_output("contract")  # saves contract.docx
  print(f"Document created: {result.page_count} page(s)")
  ```

  ```bash cURL theme={null}
  # Submit request
  curl -X POST https://www.datalab.to/api/v1/create-document \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "markdown": "# Contract\n\nThis agreement is between <ins data-revision-author=\"Editor\" data-revision-datetime=\"2024-01-15T10:00:00Z\">Acme Corp</ins> and the client.",
      "output_format": "docx"
    }'

  # Poll for results (use request_check_url from response)
  curl https://www.datalab.to/api/v1/create-document/REQUEST_ID \
    -H "X-API-Key: $DATALAB_API_KEY"
  ```

  ```python Python (requests) theme={null}
  import requests, json, time, base64, os

  headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")}

  # Submit request
  response = requests.post(
      "https://www.datalab.to/api/v1/create-document",
      json={
          "markdown": "# Contract\n\nThis agreement is between "
                      '<ins data-revision-author="Editor" '
                      'data-revision-datetime="2024-01-15T10:00:00Z">'
                      "Acme Corp</ins> and the client.",
          "output_format": "docx"
      },
      headers=headers
  )

  check_url = response.json()["request_check_url"]

  # Poll for results
  while True:
      result = requests.get(check_url, headers=headers).json()
      if result["status"] == "complete":
          docx_bytes = base64.b64decode(result["output_base64"])
          with open("contract.docx", "wb") as f:
              f.write(docx_bytes)
          print("Document saved as contract.docx")
          break
      elif result.get("error"):
          print(f"Error: {result['error']}")
          break
      time.sleep(2)
  ```
</CodeGroup>

## SDK Usage

Use `client.create_document()` to create a DOCX from markdown:

```python theme={null}
from datalab_sdk import DatalabClient

client = DatalabClient()

result = client.create_document(
    markdown="# Title\n\nDocument content here.",
    output_format="docx",       # Only 'docx' is supported
    webhook_url=None,           # Optional completion webhook
    save_output="output/doc",   # Optional: saves output/doc.docx automatically
)

print(result.success)         # True if creation succeeded
print(result.page_count)      # Number of pages
print(result.cost_breakdown)  # Cost details
result.save_output("output/contract")  # Saves output/contract.docx
```

### SDK Method Parameters

| Parameter       | Type     | Default      | Description                                         |
| --------------- | -------- | ------------ | --------------------------------------------------- |
| `markdown`      | str      | **Required** | Markdown content with optional track changes markup |
| `output_format` | str      | `"docx"`     | Output format (only `"docx"` is supported)          |
| `webhook_url`   | str      | None         | Optional webhook URL for completion notification    |
| `save_output`   | str/Path | None         | File path to save the output DOCX                   |
| `max_polls`     | int      | `300`        | Maximum polling attempts                            |
| `poll_interval` | int      | `1`          | Seconds between polls                               |

### SDK Result Fields

| Field            | Type  | Description                         |
| ---------------- | ----- | ----------------------------------- |
| `success`        | bool  | Whether document creation succeeded |
| `status`         | str   | `"complete"` when done              |
| `output_format`  | str   | `"docx"`                            |
| `output_base64`  | str   | Base64-encoded DOCX file            |
| `runtime`        | float | Processing time in seconds          |
| `page_count`     | int   | Pages in the generated document     |
| `cost_breakdown` | dict  | Cost details                        |
| `error`          | str   | Error message if creation failed    |

## How It Works

1. Send markdown content with optional track changes markup
2. The API converts it to a DOCX file with proper Word formatting
3. Track changes tags become native Word revision marks
4. The DOCX file is returned as a base64-encoded string

## Track Changes Markup

### Insertions

Mark inserted text with `<ins>` tags:

```html theme={null}
<ins data-revision-author="John Doe" data-revision-datetime="2024-01-15T10:00:00Z">newly added text</ins>
```

| Attribute                | Required | Description                                       |
| ------------------------ | -------- | ------------------------------------------------- |
| `data-revision-author`   | Yes      | Author name for the insertion                     |
| `data-revision-datetime` | Yes      | ISO 8601 timestamp (e.g., `2024-01-15T10:00:00Z`) |

### Deletions

Mark deleted text with `<del>` tags:

```html theme={null}
<del data-revision-author="Jane Smith" data-revision-datetime="2024-01-15T11:00:00Z">removed text</del>
```

| Attribute                | Required | Description                  |
| ------------------------ | -------- | ---------------------------- |
| `data-revision-author`   | Yes      | Author name for the deletion |
| `data-revision-datetime` | Yes      | ISO 8601 timestamp           |

### Comments

Add comments with `<comment>` tags:

```html theme={null}
<comment data-comment-author="Reviewer" data-comment-datetime="2024-01-15T12:00:00Z" text="Please verify this clause">annotated text</comment>
```

| Attribute               | Required | Description                                           |
| ----------------------- | -------- | ----------------------------------------------------- |
| `data-comment-author`   | Yes      | Author/reviewer name                                  |
| `text`                  | Yes      | The comment text                                      |
| `data-comment-datetime` | No       | ISO 8601 timestamp (defaults to current time)         |
| `data-comment-initial`  | No       | Author initials (auto-generated from name if omitted) |
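
Writing these tags by hand is error-prone once author names or comment text contain quotes. The helpers below are a minimal sketch for generating the markup described above. The function names (`ins`, `delete`, `comment`) are hypothetical, and escaping attribute values with `html.escape` assumes the API decodes standard HTML entities in attributes.

```python
from datetime import datetime, timezone
from html import escape


def _timestamp(when=None):
    """Format a timezone-aware datetime as ISO 8601 UTC, e.g. 2024-01-15T10:00:00Z."""
    when = when or datetime.now(timezone.utc)
    return when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def _revision(tag, text, author, when):
    return (
        f'<{tag} data-revision-author="{escape(author)}" '
        f'data-revision-datetime="{_timestamp(when)}">{text}</{tag}>'
    )


def ins(text, author, when=None):
    """Wrap text in an insertion tag."""
    return _revision("ins", text, author, when)


def delete(text, author, when=None):
    """Wrap text in a deletion tag."""
    return _revision("del", text, author, when)


def comment(text, author, note, when=None):
    """Attach a reviewer note to the given text."""
    return (
        f'<comment data-comment-author="{escape(author)}" '
        f'data-comment-datetime="{_timestamp(when)}" '
        f'text="{escape(note)}">{text}</comment>'
    )
```

For example, `delete("12", "Legal") + ins("24", "Legal")` produces a replacement that Word shows as one deletion followed by one insertion.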

## Parameters

| Parameter       | Type   | Required | Description                                         |
| --------------- | ------ | -------- | --------------------------------------------------- |
| `markdown`      | string | Yes      | Markdown content with optional track changes markup |
| `output_format` | string | No       | Output format (currently only `docx` is supported)  |
| `webhook_url`   | string | No       | Webhook URL to notify when processing completes     |

## Response

The response follows the standard async pattern — submit, then poll:

**Initial response:**

```json theme={null}
{
  "success": true,
  "request_id": "abc123",
  "request_check_url": "https://www.datalab.to/api/v1/create-document/abc123"
}
```

**Final response (when polling):**

| Field            | Type   | Description                         |
| ---------------- | ------ | ----------------------------------- |
| `status`         | string | `processing` or `complete`          |
| `success`        | bool   | Whether document creation succeeded |
| `output_format`  | string | `docx`                              |
| `output_base64`  | string | Base64-encoded DOCX file            |
| `runtime`        | float  | Processing time in seconds          |
| `page_count`     | int    | Pages in the generated document     |
| `cost_breakdown` | object | Cost details                        |
| `error`          | string | Error message if creation failed    |
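
The submit-then-poll pattern can be wrapped in a reusable helper with an upper bound on attempts, so a stuck request does not loop forever. This is a sketch under stated assumptions: `poll_until_done` and its `fetch` argument are hypothetical names, and a response is treated as failed when it reports a `failed` status or a non-empty `error` field.

```python
import time


def poll_until_done(fetch, max_polls=300, poll_interval=2.0):
    """Call fetch() until the request completes, fails, or max_polls is reached.

    fetch is any zero-argument callable that returns the parsed JSON response,
    for example: lambda: requests.get(check_url, headers=headers).json()
    """
    for _ in range(max_polls):
        result = fetch()
        if result.get("status") == "complete":
            return result
        if result.get("status") == "failed" or result.get("error"):
            raise RuntimeError(f"Request failed: {result.get('error')}")
        time.sleep(poll_interval)
    raise TimeoutError(f"No result after {max_polls} polls")
```

Passing the fetch step as a callable keeps the helper independent of the HTTP library and makes it easy to exercise with canned responses.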

## Full Example

A contract with insertions, deletions, and reviewer comments:

```python theme={null}
import requests, json, time, base64, os

headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")}

markdown = """# Service Agreement

## Parties

This agreement is between <ins data-revision-author="Legal" data-revision-datetime="2024-06-01T09:00:00Z">Acme Corporation ("Provider")</ins> and <del data-revision-author="Legal" data-revision-datetime="2024-06-01T09:00:00Z">the Client</del><ins data-revision-author="Legal" data-revision-datetime="2024-06-01T09:00:00Z">GlobalTech Inc. ("Client")</ins>.

## Terms

The service period begins on <comment data-comment-author="Reviewer" text="Confirm start date with finance">January 1, 2025</comment> and continues for <del data-revision-author="Legal" data-revision-datetime="2024-06-01T10:00:00Z">12</del><ins data-revision-author="Legal" data-revision-datetime="2024-06-01T10:00:00Z">24</ins> months.

## Payment

The total contract value is <ins data-revision-author="Finance" data-revision-datetime="2024-06-02T14:00:00Z">$150,000</ins> payable in quarterly installments.
"""

response = requests.post(
    "https://www.datalab.to/api/v1/create-document",
    json={"markdown": markdown, "output_format": "docx"},
    headers=headers
)

check_url = response.json()["request_check_url"]

while True:
    result = requests.get(check_url, headers=headers).json()
    if result["status"] == "complete":
        docx_bytes = base64.b64decode(result["output_base64"])
        with open("service_agreement.docx", "wb") as f:
            f.write(docx_bytes)
        print(f"Document saved ({result['page_count']} pages)")
        break
    elif result.get("error"):
        # Stop polling if document creation failed
        print(f"Error: {result['error']}")
        break
    time.sleep(2)
```

The generated DOCX file opens in Microsoft Word with native track changes visible, allowing reviewers to accept or reject each change.

## Use Cases

* **Legal document generation** — create contracts with tracked revisions
* **Contract redlining** — mark up agreements with insertions and deletions
* **Collaborative review** — add reviewer comments to documents
* **Document automation** — generate Word documents from templates with dynamic content

## Next Steps

<CardGroup>
  <Card title="Track Changes Extraction" icon="file-diff" href="/docs/recipes/extract-redlines-and-comments/track-changes-from-word-documents">
    Extract track changes from existing Word documents
  </Card>

  <Card title="Document Conversion" icon="file-text" href="/docs/recipes/conversion/conversion-api-overview">
    Convert documents to markdown, HTML, or JSON
  </Card>

  <Card title="Webhooks" icon="bell" href="/platform/webhooks">
    Get notified when document creation completes
  </Card>

  <Card title="Pipelines" icon="workflow" href="/docs/recipes/pipelines/pipeline-overview">
    Chain processors into versioned, reusable pipelines.
  </Card>
</CardGroup>


# Document Segmentation
Source: https://documentation.datalab.to/docs/recipes/document-segmentation/auto-segmentation

Automatically split multi-document PDFs into separate segments.

Automatically identify and split PDFs that contain multiple documents (like batch-scanned files) into their component parts.

**Before you begin**, make sure you have:

1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits)
2. Python 3.10+ installed
3. The Datalab SDK: `pip install datalab-python-sdk`
4. Your `DATALAB_API_KEY` environment variable set

<Info>
  **Building for production?** Use [Pipelines](/docs/recipes/pipelines/pipeline-overview) to chain processors, version your configuration, and deploy with a single API call.
</Info>

## Quick Start

<CodeGroup>
  ```python Python SDK theme={null}
  import json
  from datalab_sdk import DatalabClient, SegmentOptions

  client = DatalabClient()

  # Define segmentation schema
  segmentation_schema = {
      "segments": []
  }

  options = SegmentOptions(
      segmentation_schema=json.dumps(segmentation_schema),
      mode="balanced"
  )

  result = client.segment("combined_documents.pdf", options=options)

  # Access segmentation results
  for segment in result.segmentation_results["segments"]:
      print(f"{segment['name']}: pages {segment['pages']}")
  ```

  ```bash cURL theme={null}
  curl -X POST https://www.datalab.to/api/v1/segment \
    -H "X-API-Key: YOUR_API_KEY" \
    -F "file=@combined_documents.pdf" \
    -F "output_format=markdown" \
    -F "mode=balanced" \
    -F 'segmentation_schema={"segments": []}'
  ```

  ```python Python (requests) theme={null}
  import requests
  import json
  import time

  API_KEY = "YOUR_API_KEY"
  headers = {"X-API-Key": API_KEY}

  # Submit segmentation request
  with open("combined.pdf", "rb") as f:
      response = requests.post(
          "https://www.datalab.to/api/v1/segment",
          files={"file": ("combined.pdf", f, "application/pdf")},
          data={
              "output_format": "markdown",
              "mode": "balanced",
              "segmentation_schema": json.dumps({"segments": []})
          },
          headers=headers
      )

  check_url = response.json()["request_check_url"]

  # Poll for results
  while True:
      result = requests.get(check_url, headers=headers).json()

      if result["status"] == "complete":
          segments = result["segmentation_results"]["segments"]
          for seg in segments:
              print(f"{seg['name']}: pages {seg['pages']}")
          break
      elif result["status"] == "failed":
          print(f"Error: {result.get('error')}")
          break

      time.sleep(2)
  ```
</CodeGroup>

## When to Use

Segmentation is useful when:

* Batch-scanned documents are combined into a single PDF
* Multiple document types are stapled together
* You need to apply different processing to different sections

## Response Format

```json theme={null}
{
  "segmentation_results": {
    "segments": [
      {
        "name": "Research Paper",
        "pages": [0, 1, 2],
        "confidence": "medium"
      },
      {
        "name": "Invoice",
        "pages": [3, 4],
        "confidence": "high"
      }
    ],
    "metadata": {
      "total_pages": 5,
      "segmentation_method": "auto_detected"
    }
  }
}
```
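
A response of this shape is often easier to work with as a page-to-segment lookup, with less certain segments flagged for review. The sketch below uses hypothetical helper names (`page_to_segment`, `needs_review`) and treats only `"high"` as trusted by default. Confidence labels other than the `"medium"` and `"high"` shown above are not assumed.

```python
def page_to_segment(segmentation_results):
    """Return {page_index: segment_name} for every page assigned to a segment."""
    lookup = {}
    for segment in segmentation_results["segments"]:
        for page in segment["pages"]:
            lookup[page] = segment["name"]
    return lookup


def needs_review(segmentation_results, trusted=("high",)):
    """Names of segments whose confidence label is not in trusted."""
    return [
        segment["name"]
        for segment in segmentation_results["segments"]
        if segment.get("confidence") not in trusted
    ]


# Sample taken from the response format above
sample = {
    "segments": [
        {"name": "Research Paper", "pages": [0, 1, 2], "confidence": "medium"},
        {"name": "Invoice", "pages": [3, 4], "confidence": "high"},
    ],
    "metadata": {"total_pages": 5, "segmentation_method": "auto_detected"},
}
lookup = page_to_segment(sample)
flagged = needs_review(sample)
```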

## Process Each Segment

After segmentation, process each segment separately:

```python theme={null}
import json
from datalab_sdk import DatalabClient, SegmentOptions, ExtractOptions

client = DatalabClient()

# First, get segments
seg_options = SegmentOptions(
    segmentation_schema=json.dumps({"segments": []}),
    mode="balanced"
)
result = client.segment("combined.pdf", options=seg_options)

# Process each segment with appropriate schema using the Extract API
extraction_schemas = {
    "Invoice": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "total": {"type": "number"}
        }
    },
    "Contract": {
        "type": "object",
        "properties": {
            "parties": {"type": "array", "items": {"type": "string"}},
            "effective_date": {"type": "string"}
        }
    }
}

extracted_data = {}

for segment in result.segmentation_results["segments"]:
    segment_name = segment["name"]
    pages = segment["pages"]

    schema = extraction_schemas.get(segment_name)
    if schema:
        # Build page range string
        page_range = ",".join(str(p) for p in pages)

        options = ExtractOptions(
            page_schema=json.dumps(schema),
            page_range=page_range,
            mode="balanced"
        )

        seg_result = client.extract("combined.pdf", options=options)
        extracted_data[segment_name] = json.loads(seg_result.extraction_schema_json)

print(extracted_data)
```
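
For segments that span many pages, the comma-joined list can become long. Consecutive pages can instead be collapsed into ranges. The helper below is a sketch: `to_page_range` is a hypothetical name, and it assumes `page_range` accepts inclusive ranges such as `0-4` alongside single page indices.

```python
def to_page_range(pages):
    """Collapse page indices into a compact string like '0-2,5,7-8'."""
    ordered = sorted(set(pages))
    if not ordered:
        return ""
    parts = []
    start = prev = ordered[0]
    for page in ordered[1:]:
        if page == prev + 1:
            # Still inside a consecutive run
            prev = page
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = page
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)
```

In the loop above, `page_range = to_page_range(pages)` would replace the `",".join(...)` line.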

## Using Checkpoints

If you already converted a document with `save_checkpoint=True` using the [Convert API](/docs/recipes/conversion/conversion-api-overview), pass the `checkpoint_id` to `SegmentOptions` to skip re-parsing. This saves time and cost when running segmentation on a previously converted document.

```python theme={null}
from datalab_sdk import DatalabClient, ConvertOptions, SegmentOptions
import json

client = DatalabClient()

# Step 1: Convert and save checkpoint
convert_result = client.convert("combined.pdf", options=ConvertOptions(save_checkpoint=True))
checkpoint_id = convert_result.checkpoint_id

# Step 2: Segment using checkpoint (no re-parsing needed)
options = SegmentOptions(
    segmentation_schema=json.dumps({"segments": []}),
    checkpoint_id=checkpoint_id
)
result = client.segment("combined.pdf", options=options)
```

## Custom Segmentation Schema

Define expected segment types for better accuracy:

```python theme={null}
segmentation_schema = {
    "segments": [
        {"type": "invoice", "description": "Invoice or billing document"},
        {"type": "contract", "description": "Legal contract or agreement"},
        {"type": "receipt", "description": "Payment receipt"}
    ]
}
```

## Next Steps

<CardGroup>
  <Card title="Structured Extraction" icon="table" href="/docs/recipes/structured-extraction/api-overview">
    Extract structured data from document segments using JSON schemas.
  </Card>

  <Card title="Handling Long Documents" icon="file-lines" href="/docs/recipes/structured-extraction/handling-long-documents">
    Tips for TOC-based segmentation on documents with 50+ pages.
  </Card>

  <Card title="Document Conversion" icon="file-export" href="/docs/recipes/conversion/conversion-api-overview">
    Convert documents to Markdown, HTML, JSON, or chunks.
  </Card>

  <Card title="Pipelines" icon="workflow" href="/docs/recipes/pipelines/pipeline-overview">
    Chain processors into versioned, reusable pipelines.
  </Card>
</CardGroup>


# Track Changes in Word Docs
Source: https://documentation.datalab.to/docs/recipes/extract-redlines-and-comments/track-changes-from-word-documents

Pull tracked changes and comments from Word documents for review workflows

If you're working with legal documents, contracts, or any collaborative review process, you know how painful it is to manually track all the changes, comments, and revisions in Word documents.

This guide shows you how to extract all that markup programmatically using the Track Changes API.

# Overview

The Track Changes API extracts:

* Tracked changes: insertions and deletions with author names and timestamps
* Comments: all margin comments with author details

This lets you pull a full revision history from your Word docs into clean HTML and Markdown.

`track_changes` is perfect for legal workflows where you need to:

* Generate redline summaries for clients
* Identify all changes made by specific parties
* Extract action items from comments
* Analyze negotiation patterns across contract versions
* Create audit trails of document revisions

Submit your Word document to the dedicated Track Changes endpoint. The output will be provided in Markdown and HTML format by default, with all tracked changes and comments preserved in the markup.

# Quick Start (SDK)

The simplest way to extract tracked changes is with the Python SDK:

```python theme={null}
from datalab_sdk import DatalabClient, TrackChangesOptions

client = DatalabClient()
options = TrackChangesOptions(output_format="markdown,html,chunks")
result = client.track_changes("contract.docx", options=options)
print(result.markdown)
```

# Making the API Request

Here's how to submit a Word document and extract its tracked changes using the REST API:

```python theme={null}
import requests
import time
import os

API_URL = "https://www.datalab.to/api/v1/track-changes"
API_KEY = os.getenv("DATALAB_API_KEY")

def extract_tracked_changes(docx_path, output_format='html,markdown'):
    """
    Extract tracked changes and comments from a Word document.

    Args:
        docx_path: Path to the .docx file
        output_format: 'html', 'markdown', or 'html,markdown'

    Returns:
        Dictionary with the converted content including tracked changes
    """
    with open(docx_path, 'rb') as f:
        form_data = {
            'file': (docx_path, f, 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'),
            'output_format': (None, output_format),
            'paginate': (None, False)  # Set to True if you want page breaks
        }

        headers = {"X-API-Key": API_KEY}
        response = requests.post(API_URL, files=form_data, headers=headers)
        data = response.json()

    # Poll for completion
    check_url = data["request_check_url"]
    max_polls = 300  # Increase if needed

    for i in range(max_polls):
        time.sleep(2)
        response = requests.get(check_url, headers=headers)
        result = response.json()

        if result["status"] == "complete":
            return result
        elif result["status"] == "failed":
            raise Exception(f"Conversion failed: {result.get('error')}")

    raise TimeoutError("Conversion did not complete in time")
```

The response will contain your document with all tracked changes preserved. Here's what the markup looks like:

* Insertions: `<ins data-revision-author="Sandy Kwon" data-revision-datetime="2025-11-11T11:24:00">new text</ins>`
* Deletions: `<del data-revision-author="Vikram Oberoi" data-revision-datetime="2025-11-11T10:34:00">old text</del>`
* Comments: `<comment data-comment-author="Vikram Oberoi" data-comment-datetime="2025-11-11T10:32:00" data-comment-initial="VO" text="comment text">marked text</comment>`

This markup will appear in both HTML and Markdown output.
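
Because the markup is tag-based, changes can be pulled out and grouped by author without an LLM. The sketch below uses Python's standard-library `html.parser`. `RevisionParser` and `changes_by_author` are hypothetical names, and the sample string reuses the markup shown above.

```python
from collections import defaultdict
from html.parser import HTMLParser


class RevisionParser(HTMLParser):
    """Collect <ins>, <del>, and <comment> spans from Track Changes output."""

    TRACKED = ("ins", "del", "comment")

    def __init__(self):
        super().__init__()
        self.revisions = []
        self._open = []  # revisions whose closing tag has not been seen yet

    def handle_starttag(self, tag, attrs):
        if tag not in self.TRACKED:
            return
        attr_map = dict(attrs)
        prefix = "data-comment" if tag == "comment" else "data-revision"
        self._open.append({
            "tag": tag,
            "author": attr_map.get(f"{prefix}-author"),
            "datetime": attr_map.get(f"{prefix}-datetime"),
            "note": attr_map.get("text"),  # only present on <comment>
            "text": "",
        })

    def handle_data(self, data):
        # Text belongs to every tracked span that is currently open
        for revision in self._open:
            revision["text"] += data

    def handle_endtag(self, tag):
        if self._open and self._open[-1]["tag"] == tag:
            self.revisions.append(self._open.pop())


def changes_by_author(markup):
    """Group extracted revisions by author name."""
    parser = RevisionParser()
    parser.feed(markup)
    parser.close()
    grouped = defaultdict(list)
    for revision in parser.revisions:
        grouped[revision["author"]].append(revision)
    return dict(grouped)


sample = (
    '<p>Payment is due in '
    '<del data-revision-author="Vikram Oberoi" data-revision-datetime="2025-11-11T10:34:00">old text</del>'
    '<ins data-revision-author="Sandy Kwon" data-revision-datetime="2025-11-11T11:24:00">new text</ins>'
    '</p>'
)
grouped = changes_by_author(sample)
```

Each entry keeps the tag, author, timestamp, and affected text, which is enough to build a per-party change list or an audit trail.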

# Analyzing Changes with LLMs

Once you have the extracted markup, you can use an LLM to analyze the changes.

Here's an example using OpenRouter to generate a redline summary:

```python theme={null}
import requests
import os

OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
OPENROUTER_MODEL = os.getenv("OPENROUTER_MODEL")


def analyze_changes_with_llm(marked_up_content, analysis_type='summary'):
    """
    Use an LLM via OpenRouter to analyze tracked changes.
    
    Args:
        marked_up_content: The HTML or Markdown with tracked changes
        analysis_type: Type of analysis ('summary', 'risks', 'by_author', etc.)
    
    Returns:
        LLM analysis of the changes
    """
    
    prompts = {
        'summary': """Analyze this contract with tracked changes and provide:
1. A concise summary of all changes made
2. Key changes that materially affect the agreement
3. Any changes that shift risk or obligations between parties
4. Recommended action items for legal review

Document with tracked changes:
{content}""",
        
        'by_author': """Review this document with tracked changes and create a report organized by author:
- List each author's changes
- Categorize changes as substantive vs. stylistic
- Highlight any conflicting changes between authors

Document:
{content}""",
        
        'risks': """Analyze this contract's tracked changes for potential legal risks:
- Identify changes that increase liability or obligations
- Flag any deletions of protective language
- Note additions that could be problematic
- Assess the overall risk profile of the revisions

Document:
{content}"""
    }
    
    prompt = prompts.get(analysis_type, prompts['summary']).format(content=marked_up_content)
    
    response = requests.post(
        url="https://openrouter.ai/api/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {OPENROUTER_API_KEY}",
            "Content-Type": "application/json"
        },
        json={
            "model": OPENROUTER_MODEL,
            "messages": [
                {
                    "role": "user",
                    "content": prompt
                }
            ]
        }
    )
    
    return response.json()['choices'][0]['message']['content']

# Example usage
result = extract_tracked_changes('nda_draft_v3.docx', output_format='html')
marked_up_doc = result['html']

# Generate different types of analysis
summary = analyze_changes_with_llm(marked_up_doc, 'summary')
risk_analysis = analyze_changes_with_llm(marked_up_doc, 'risks')
by_author = analyze_changes_with_llm(marked_up_doc, 'by_author')

print("Change Summary:")
print(summary)
print("\n" + "="*80 + "\n")
print("Risk Analysis:")
print(risk_analysis)
```

# Pagination

For longer documents, you may want to preserve page breaks in the output so you can split it by page. Set `paginate` to `True` in your request:

```python theme={null}
form_data = {
    'file': (docx_path, f, 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'),
    'output_format': (None, 'html'),
    'paginate': (None, True)  # Enable pagination
}
```

**For Markdown output**, each page will be preceded by a horizontal rule containing the page number:

```
{0}------------------------------------------------

<Page 0 Content>

{1}------------------------------------------------

<Page 1 Content>
```

**For HTML output**, each page will be wrapped in a div with the page number:

```html theme={null}
<div class="page" data-page-id="0">
  <!-- Page 0 content -->
</div>
<div class="page" data-page-id="1">
  <!-- Page 1 content -->
</div>
```

This makes it easy to process documents page-by-page or display them with proper pagination in your UI.
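
Paginated Markdown can be split on the page markers with a regular expression. The sketch below assumes each marker sits on its own line in the form `{N}` followed by a run of hyphens, as in the example above. `split_pages` is a hypothetical helper.

```python
import re

# Matches lines such as "{0}------------------------------------------------"
PAGE_MARKER = re.compile(r"^\{(\d+)\}-{3,}[ \t]*$", re.MULTILINE)


def split_pages(markdown):
    """Return {page_number: page_text} from paginated Markdown output."""
    pages = {}
    markers = list(PAGE_MARKER.finditer(markdown))
    for index, marker in enumerate(markers):
        # A page runs from the end of its marker to the start of the next one
        end = markers[index + 1].start() if index + 1 < len(markers) else len(markdown)
        pages[int(marker.group(1))] = markdown[marker.end():end].strip()
    return pages


sample = (
    "{0}------------------------------------------------\n\n"
    "First page\n\n"
    "{1}------------------------------------------------\n\n"
    "Second page\n"
)
pages = split_pages(sample)
```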

# Full Code Sample

Here's a complete example that extracts tracked changes and generates a legal review summary:

```python theme={null}
import os
import requests
import time

API_URL = "https://www.datalab.to/api/v1/track-changes"
DATALAB_API_KEY = os.getenv("DATALAB_API_KEY")
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")

def extract_tracked_changes(docx_path, output_format='html', paginate=False):
    """Extract tracked changes from a Word document."""
    with open(docx_path, 'rb') as f:
        form_data = {
            'file': (os.path.basename(docx_path), f,
                    'application/vnd.openxmlformats-officedocument.wordprocessingml.document'),
            'output_format': (None, output_format),
            'paginate': (None, paginate)
        }
        
        headers = {"X-API-Key": DATALAB_API_KEY}
        response = requests.post(API_URL, files=form_data, headers=headers)
        response.raise_for_status()  # Fail fast on authentication or validation errors
        data = response.json()
    
    # Poll for completion
    check_url = data["request_check_url"]
    max_polls = 300
    
    for i in range(max_polls):
        time.sleep(2)
        response = requests.get(check_url, headers=headers)
        result = response.json()
        
        if result["status"] == "complete":
            return result
        elif result["status"] == "failed":
            raise Exception(f"Conversion failed: {result.get('error')}")
    
    raise TimeoutError("Conversion did not complete in time")


def analyze_with_llm(content, prompt_template):
    """Send content to LLM for analysis via OpenRouter."""
    response = requests.post(
        url="https://openrouter.ai/api/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {OPENROUTER_API_KEY}",
            "Content-Type": "application/json"
        },
        json={
            "model": "anthropic/claude-3.5-sonnet",
            "messages": [
                {
                    "role": "user",
                    "content": prompt_template.format(content=content)
                }
            ]
        }
    )
    
    response.raise_for_status()  # Surface HTTP errors before reading the body
    return response.json()['choices'][0]['message']['content']


def generate_legal_review(docx_path):
    """
    Complete workflow: extract tracked changes and generate legal review.
    """
    print(f"Processing {docx_path}...")
    
    # Extract tracked changes
    result = extract_tracked_changes(docx_path, output_format='html', paginate=True)
    marked_up_doc = result['html']
    
    print("Document converted with tracked changes preserved.")
    
    # Generate comprehensive legal review
    review_prompt = """You are a legal reviewer analyzing a contract with tracked changes.

Please provide:

1. **Executive Summary**: Brief overview of the document and key changes
2. **Material Changes**: List substantive changes that affect rights, obligations, or liabilities
3. **Risk Assessment**: Identify any changes that increase risk exposure
4. **Comments Analysis**: Summarize unresolved comments and action items
5. **Recommendations**: Specific next steps for legal review

Document with tracked changes:
{content}"""

    print("\nGenerating legal review with LLM...")
    review = analyze_with_llm(marked_up_doc, review_prompt)
    
    # Also generate author-specific analysis
    author_prompt = """Analyze this document's tracked changes by author.

For each author who made changes:
- Total number of insertions and deletions
- Types of changes (substantive vs. editorial)
- Key themes in their revisions
- Any patterns in their negotiation strategy

Document:
{content}"""
    
    print("Generating per-author analysis...")
    author_analysis = analyze_with_llm(marked_up_doc, author_prompt)
    
    return {
        'marked_up_document': marked_up_doc,
        'legal_review': review,
        'author_analysis': author_analysis
    }


if __name__ == "__main__":
    # Process a contract with tracked changes
    results = generate_legal_review('contract_redline_v3.docx')
    
    # Save results
    with open('legal_review.txt', 'w') as f:
        f.write("LEGAL REVIEW\n")
        f.write("="*80 + "\n\n")
        f.write(results['legal_review'])
        f.write("\n\n" + "="*80 + "\n\n")
        f.write("AUTHOR ANALYSIS\n")
        f.write("="*80 + "\n\n")
        f.write(results['author_analysis'])
    
    with open('marked_up_document.html', 'w') as f:
        f.write(results['marked_up_document'])
    
    print("\nReview complete! Results saved to:")
    print("  - legal_review.txt")
    print("  - marked_up_document.html")
```

## Next Steps

<CardGroup>
  <Card title="Document Conversion" icon="file-lines" href="/docs/recipes/conversion/conversion-api-overview">
    Explore the full conversion API and output format options.
  </Card>

  <Card title="Structured Extraction" icon="table" href="/docs/recipes/structured-extraction/api-overview">
    Extract structured data from documents using JSON schemas.
  </Card>

  <Card title="Pipelines" icon="workflow" href="/docs/recipes/pipelines/pipeline-overview">
    Chain processors into versioned, reusable pipelines.
  </Card>

  <Card title="SDK Conversion" icon="code" href="/docs/welcome/sdk/conversion">
    Use the Python SDK for simpler document conversion workflows.
  </Card>
</CardGroup>


# File Upload
Source: https://documentation.datalab.to/docs/recipes/file-management/file-upload-api

Upload and manage files for use in pipelines and document processing.

Upload files to Datalab storage and reference them across API calls and pipelines.

## SDK Usage

The SDK handles the upload flow automatically:

```python theme={null}
from datalab_sdk import DatalabClient

client = DatalabClient()

# Upload a single file
file = client.upload_files("document.pdf")
print(f"Uploaded: {file.reference}")  # datalab://file-abc123

# Upload multiple files
files = client.upload_files(["doc1.pdf", "doc2.pdf", "doc3.pdf"])
for f in files:
    print(f"{f.original_filename}: {f.reference}")
```

### Use in Pipelines

```python theme={null}
# Upload files
files = client.upload_files(["invoice1.pdf", "invoice2.pdf"])

# Use in pipeline
for f in files:
    execution = client.run_pipeline("pl_abc123", file_url=f.reference)
```

### File Management

```python theme={null}
# List files
result = client.list_files(limit=50)
for file in result['files']:
    print(f"{file.original_filename}: {file.file_size} bytes")

# Get metadata
file = client.get_file_metadata(123)

# Get download URL
download = client.get_file_download_url(file_id=123, expires_in=3600)
print(download['download_url'])

# Delete file
client.delete_file(123)
```

See [SDK File Management](/docs/welcome/sdk/file-management) for complete documentation.

## REST API

The upload flow has three steps:

### 1. Request Upload URL

```bash theme={null}
curl -X POST https://www.datalab.to/api/v1/files/upload \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"filename": "document.pdf", "content_type": "application/pdf"}'
```

To store the file in EU infrastructure, add `"processing_location": "eu"` to the request body:

```bash theme={null}
curl -X POST https://www.datalab.to/api/v1/files/upload \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"filename": "document.pdf", "content_type": "application/pdf", "processing_location": "eu"}'
```

Response:

```json theme={null}
{
  "file_id": 123,
  "upload_url": "https://presigned-url...",
  "expires_in": 3600,
  "reference": "datalab://file-abc123"
}
```

### 2. Upload File

```bash theme={null}
curl -X PUT "{upload_url}" \
  -H "Content-Type: application/pdf" \
  --data-binary @document.pdf
```

### 3. Confirm Upload

```bash theme={null}
curl https://www.datalab.to/api/v1/files/123/confirm \
  -H "X-API-Key: YOUR_API_KEY"
```

### Complete Python Example

```python theme={null}
import requests

API_KEY = "YOUR_API_KEY"
headers = {"X-API-Key": API_KEY}

# Step 1: Request upload URL
response = requests.post(
    "https://www.datalab.to/api/v1/files/upload",
    json={"filename": "document.pdf", "content_type": "application/pdf"},
    headers=headers
)
data = response.json()
file_id = data["file_id"]
upload_url = data["upload_url"]
reference = data["reference"]

# Step 2: Upload file
with open("document.pdf", "rb") as f:
    requests.put(upload_url, data=f, headers={"Content-Type": "application/pdf"})

# Step 3: Confirm upload
requests.get(f"https://www.datalab.to/api/v1/files/{file_id}/confirm", headers=headers)

print(f"File ready: {reference}")
```

## File Management API

### List Files

```bash theme={null}
GET /api/v1/files?limit=50&offset=0
```

### Get File Metadata

```bash theme={null}
GET /api/v1/files/{file_id}
```

### Get Download URL

```bash theme={null}
GET /api/v1/files/{file_id}/download?expires_in=3600
```

### Delete File

```bash theme={null}
DELETE /api/v1/files/{file_id}
```
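
The endpoints above can be called from Python along these lines. This is a minimal sketch using only the standard library: the helper name `build_files_request` is hypothetical, the file ID `123` is a placeholder, and the response bodies are not documented in this section, so the example only builds the requests and leaves sending them to you.

```python
import os
import urllib.parse
import urllib.request

FILES_URL = "https://www.datalab.to/api/v1/files"


def build_files_request(method, path="", params=None, api_key=None):
    """Build (but do not send) a request for one of the file endpoints."""
    url = FILES_URL + path
    if params:
        url += "?" + urllib.parse.urlencode(params)
    key = api_key or os.getenv("DATALAB_API_KEY", "")
    return urllib.request.Request(url, method=method, headers={"X-API-Key": key})


file_id = 123  # placeholder ID

list_files = build_files_request("GET", params={"limit": 50, "offset": 0})
get_metadata = build_files_request("GET", f"/{file_id}")
get_download_url = build_files_request("GET", f"/{file_id}/download", {"expires_in": 3600})
delete_file = build_files_request("DELETE", f"/{file_id}")

# To send a request and read the raw response body:
# with urllib.request.urlopen(list_files) as resp:
#     print(resp.read().decode())
```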

## Using File References

Once uploaded, use `datalab://file-{id}` references in any API call:

```python theme={null}
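import json
import requests

headers = {"X-API-Key": "YOUR_API_KEY"}
# Example field data; see the Form Filling guide for the format
field_data = {"name": {"value": "John Doe", "description": "Full name"}}

# In Convert API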
response = requests.post(
    "https://www.datalab.to/api/v1/convert",
    data={
        "file_url": "datalab://file-abc123",
        "output_format": "markdown",
        "mode": "balanced"
    },
    headers=headers
)

# In Form Filling API
response = requests.post(
    "https://www.datalab.to/api/v1/fill",
    data={
        "file_url": "datalab://file-abc123",
        "field_data": json.dumps(field_data)
    },
    headers=headers
)
```

## Limits

| Limit               | Value                |
| ------------------- | -------------------- |
| Maximum file size   | 200 MB               |
| Upload URL expiry   | 1 hour               |
| Download URL expiry | 1 minute to 24 hours |

See [API Limits](/docs/common/limits) for complete details.
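
If you want to catch limit violations before making a request, a client-side check along these lines may help. This is a sketch under assumptions: the helper names are hypothetical, and "200 MB" is interpreted here as 200 × 1024 × 1024 bytes, which the table does not specify.

```python
import os

MAX_FILE_SIZE_BYTES = 200 * 1024 * 1024  # assumption: "200 MB" read as 200 MiB
MIN_EXPIRES_IN = 60                      # 1 minute, in seconds
MAX_EXPIRES_IN = 24 * 60 * 60            # 24 hours, in seconds


def check_file_size(size_bytes: int) -> None:
    """Raise ValueError if a file of this size exceeds the upload limit."""
    if size_bytes > MAX_FILE_SIZE_BYTES:
        raise ValueError(
            f"File is {size_bytes} bytes; the limit is {MAX_FILE_SIZE_BYTES} bytes"
        )


def check_upload(path: str) -> None:
    """Check a file on disk against the upload size limit."""
    check_file_size(os.path.getsize(path))


def clamp_expires_in(seconds: int) -> int:
    """Clamp a download URL expiry to the documented 1 minute to 24 hour range."""
    return max(MIN_EXPIRES_IN, min(seconds, MAX_EXPIRES_IN))
```

The server remains the source of truth for these limits; a local check only saves a round trip.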

<Card title="Try Datalab" icon="rocket" href="https://www.datalab.to/auth/sign_up">
  Get started with our API in less than a minute. We include free credits.
</Card>


# Forge Evals
Source: https://documentation.datalab.to/docs/recipes/forge-evals/overview

Compare parsing configurations across multiple documents to find the best settings for your use case

Forge Evals evaluates and compares parsing configurations across multiple documents. Use it to determine which settings work best for your document types and use cases.

## What is Forge Evals?

Forge Evals allows you to:

* Upload up to 10 documents at once
* Test up to 5 different parsing configurations simultaneously
* Compare results side-by-side with visual diff highlighting
* Identify the optimal parsing settings for your document types

This is particularly useful when you need to:

* Determine which parsing mode (Fast, Balanced, or Accurate) works best for your documents
* Evaluate special features like Track Changes or Chart Understanding
* Compare parsing results across different document types
* Optimize for speed vs. accuracy trade-offs

## Getting started

Access Forge Evals at [https://www.datalab.to/app/evals](https://www.datalab.to/app/evals)

### Step 1: Upload documents

Upload the documents you want to evaluate. You can:

* Drag and drop files directly into the upload zone
* Click to browse and select files
* Upload up to 10 documents per evaluation session

**Supported formats:** PDF, DOCX, XLSX, PPTX, images, and more. See [supported file types](/docs/common/supportedfiletypes) for the complete list.

<Note>
  Spreadsheet files (XLS, XLSX, CSV, ODS) are processed automatically without additional configuration options.
</Note>

### Step 2: Select configurations

Choose which parsing configurations to test. Configurations are organized into three tabs:

#### Datalab tab

Select from Datalab's preset configurations or create custom ones:

**Preset configurations:**

* **Fast Mode**: Lowest latency, great for real-time use cases
* **Balanced Mode**: Balanced accuracy and latency, works well with most documents
* **Accurate Mode**: Highest accuracy at the highest latency, good for complex documents
* **Track Changes**: Extract tracked changes from DOCX files (DOCX only)
* **Chart Understanding**: Extract data from charts and graphs

**Custom configurations:**

Create custom configurations to test specific combinations of:

* Processing mode (Fast, Balanced, or Accurate)
* Page range selection
* Special features (Track Changes, Chart Understanding)
* Output options (pagination, headers, footers)
* Run count (1-3×): Run the same configuration multiple times to test consistency

<Note>
  Track Changes only works with DOCX files. The grid will show "N/A" for incompatible document/configuration combinations.
</Note>

#### Other Models tab

Compare Datalab against other open source models hosted on our infrastructure:

* **OlmoOCR**
* **RolmoOCR**
* **DotsOCR**
* **DeepSeekOCR**

These models are hosted by Datalab and don't require any API credentials. Datalab models benefit from additional optimizations on our managed API, which makes a fair speed comparison difficult, so timing numbers are omitted for the other hosted models. If you'd like to see additional models or want help with custom evals or timings, contact us at [support@datalab.to](mailto:support@datalab.to).

#### External Providers tab

<Note>
  Access to external providers is currently limited to select users. If you're actively evaluating Datalab against other providers, [contact us](mailto:support@datalab.to) to request access.
</Note>

You can also use Evals to compare Datalab outputs to other proprietary document processing providers. Get in touch to enable this.

### Step 3: Run evaluation

Click "Start Evaluation" to begin processing. The system will:

1. Process each document with each selected configuration
2. Display progress in a grid view
3. Show completion status and processing time for each run

You can:

* Monitor progress in real-time
* Cancel all runs if needed
* Retry failed runs

### Step 4: Compare results

Once runs complete, click any two cells in the grid to compare their results side-by-side.

The comparison view shows:

* **Parallel view**: Full documents side-by-side with inline diff highlighting
* **Multiple output formats**: Switch between Markdown, HTML, JSON, and Chunks
* **Rendered output**: Toggle between raw and rendered views for HTML, Markdown, and JSON formats
* **Visual diffs**: When enabled with rendered output, see word-level highlighting of changes
* **JSON visualization**: View JSON output with document thumbnails and bounding boxes overlaid
* **Processing metrics**: Duration and configuration details for each run
* **Diff statistics**: Lines added, removed, and changed

#### Viewing modes

* **Raw view**: See the original output text with line numbers
* **Rendered view**: View formatted HTML/Markdown or visualized JSON with thumbnails
* **Diff view**: Compare outputs with line-by-line or word-level highlighting
* **Rendered diff**: Combine rendered output with word-level diff highlighting (HTML/Markdown only)

<Note>
  Rendered diff view is only available for HTML and Markdown formats. JSON rendered view shows bounding boxes but does not support diff highlighting.
</Note>

Use the "Switch Runs" button to select different runs for comparison without leaving the comparison view.

## Visualization features

### Rendered output

Toggle the "Render" button to view formatted output instead of raw text:

* **HTML/Markdown**: See the fully rendered document with proper formatting, including math equations rendered with MathJax
* **JSON**: View document thumbnails with bounding boxes overlaid on detected blocks (text, tables, figures, etc.)

### Diff highlighting

When comparing two runs, enable "Show Diff" to see differences:

* **Raw diff**: Line-by-line comparison with added/removed lines highlighted
* **Rendered diff**: Word-level highlighting within rendered HTML/Markdown output, preserving formatting and math rendering

The rendered diff view intelligently highlights:

* Changed paragraphs with block-level highlighting
* Specific changed words within modified paragraphs
* Preserved math equations with accurate semantic comparison

<Note>
  Rendered diff is not available for JSON format. Use raw diff view to compare JSON outputs.
</Note>

### Multiple iterations

When a configuration is set to run multiple times (2× or 3×), each iteration appears as a separate column in the grid (e.g., "Accurate #1", "Accurate #2"). This allows you to:

* Compare consistency across multiple runs of the same configuration
* Identify variability in parsing results
* Validate that your configuration produces stable outputs

## Excluding runs

Right-click any cell in the grid to exclude that specific document/configuration combination from running. This is useful when:

* You know certain configurations won't work for specific documents
* You want to reduce the total number of runs
* You need to focus on specific comparisons

Excluded cells appear with a yellow background and can be re-included by clicking them again.

## Best practices

### Choosing configurations

* Start with the three preset modes (Fast, Balanced, Accurate) to establish a baseline
* Add Track Changes if you're working with DOCX files that contain revisions
* Add Chart Understanding if your documents contain charts or graphs
* Create custom configurations to test specific parameter combinations

### Document selection

* Include representative samples of your document types
* Test edge cases (complex layouts, mixed content, etc.)
* Keep document count manageable (3-5 documents is often sufficient)

### Interpreting results

* Compare processing times to understand speed/accuracy trade-offs
* Use the diff view to identify where configurations produce different outputs
* Toggle between raw and rendered views to see formatted output
* Use rendered diff view for word-level highlighting of changes in HTML/Markdown
* Visualize JSON output with bounding boxes to see document structure
* Pay attention to "N/A" cells indicating incompatible combinations
* Look for patterns across similar document types
* Run configurations multiple times (using run count) to test consistency and identify variability

## Limitations

* Maximum 10 documents per evaluation session
* Maximum 5 run configurations per session
* Maximum 3 iterations per configuration
* Track Changes feature only works with DOCX files
* Spreadsheet files use automatic configuration (no mode selection)
* Rendered diff view only available for HTML and Markdown formats
* External provider access is limited to select users (contact us for access)
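
To estimate how many runs a session will produce under these limits, a local planning sketch like the one below can help. It is not part of Forge or the Datalab SDK. The function name `plan_runs` and the configuration labels are illustrative, and the DOCX check only mirrors the Track Changes rule described above.

```python
from pathlib import Path

MAX_DOCUMENTS = 10
MAX_CONFIGURATIONS = 5
MAX_ITERATIONS = 3


def plan_runs(documents, configurations):
    """List the (document, column label) cells a session would run.

    documents: list of filenames.
    configurations: mapping of configuration name to run count (1-3).
    """
    if len(documents) > MAX_DOCUMENTS:
        raise ValueError(f"At most {MAX_DOCUMENTS} documents per session")
    if len(configurations) > MAX_CONFIGURATIONS:
        raise ValueError(f"At most {MAX_CONFIGURATIONS} configurations per session")
    if any(not 1 <= count <= MAX_ITERATIONS for count in configurations.values()):
        raise ValueError(f"Run count must be between 1 and {MAX_ITERATIONS}")

    runs = []
    for doc in documents:
        for name, count in configurations.items():
            if name == "Track Changes" and Path(doc).suffix.lower() != ".docx":
                continue  # the grid shows "N/A" for this combination
            for i in range(1, count + 1):
                label = f"{name} #{i}" if count > 1 else name
                runs.append((doc, label))
    return runs


if __name__ == "__main__":
    cells = plan_runs(
        ["report.pdf", "contract.docx"],
        {"Fast": 1, "Accurate": 2, "Track Changes": 1},
    )
    print(len(cells), "runs planned")
```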

## Custom evaluations

For larger document sets or custom evaluation needs, [contact us](https://www.datalab.to/contact) to discuss enterprise evaluation options.

## Next Steps

<CardGroup>
  <Card title="Document Conversion" icon="file-lines" href="/docs/recipes/conversion/conversion-api-overview">
    Dive into Marker's conversion API to configure the settings you evaluated.
  </Card>

  <Card title="Structured Extraction" icon="table" href="/docs/recipes/structured-extraction/api-overview">
    Extract structured data from documents using JSON schemas.
  </Card>

  <Card title="Form Filling" icon="pen-to-square" href="/docs/recipes/form-filling/form-filling-api-overview">
    Automatically fill PDF forms with extracted data.
  </Card>

  <Card title="Quickstart" icon="bolt" href="/docs/welcome/quickstart">
    Get up and running with the Datalab API in minutes.
  </Card>
</CardGroup>


# Form Filling
Source: https://documentation.datalab.to/docs/recipes/form-filling/form-filling-api-overview

Automatically fill PDF and image forms with structured data.

The form filling API fills PDF and image forms with your structured data. It works with native PDF form fields and scanned/image forms.

**Before you begin**, make sure you have:

1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits)
2. Python 3.10+ installed
3. The Datalab SDK: `pip install datalab-python-sdk`
4. Your `DATALAB_API_KEY` environment variable set

## Quick Start

<CodeGroup>
  ```python Python SDK theme={null}
  from datalab_sdk import DatalabClient, FormFillingOptions

  client = DatalabClient()

  options = FormFillingOptions(
      field_data={
          "name": {"value": "John Doe", "description": "Full name"},
          "email": {"value": "john@example.com", "description": "Email address"},
          "date": {"value": "12/15/2024", "description": "Today's date"},
      }
  )

  result = client.fill("form.pdf", options=options)
  result.save_output("filled_form.pdf")

  print(f"Fields filled: {result.fields_filled}")
  print(f"Fields not found: {result.fields_not_found}")
  ```

  ```bash cURL theme={null}
  curl -X POST https://www.datalab.to/api/v1/fill \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -F "file=@form.pdf" \
    -F 'field_data={"name": {"value": "John Doe", "description": "Full name"}, "email": {"value": "john@example.com", "description": "Email address"}}'

  # Poll request_check_url from response until status is "complete"
  # Response includes output_base64 with the filled form
  ```

  ```python Python (requests) theme={null}
  import requests, json, time, base64, os

  headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")}

  field_data = {
      "name": {"value": "John Doe", "description": "Full name"},
      "email": {"value": "john@example.com", "description": "Email"},
      "date": {"value": "12/15/2024", "description": "Date"}
  }

  with open("form.pdf", "rb") as f:
      response = requests.post(
          "https://www.datalab.to/api/v1/fill",
          files={"file": ("form.pdf", f, "application/pdf")},
          data={"field_data": json.dumps(field_data), "confidence_threshold": "0.5"},
          headers=headers
      )

  check_url = response.json()["request_check_url"]

  while True:
      result = requests.get(check_url, headers=headers).json()
      if result["status"] == "complete":
          pdf_bytes = base64.b64decode(result["output_base64"])
          with open("filled_form.pdf", "wb") as f:
              f.write(pdf_bytes)
          print(f"Fields filled: {result['fields_filled']}")
          break
      elif result["status"] == "failed":
          print(f"Error: {result.get('error')}")
          break
      time.sleep(2)
  ```
</CodeGroup>

See [SDK Form Filling](/docs/welcome/sdk/form-filling) for complete SDK documentation.

## How It Works

1. Upload your form (PDF or image) with field data
2. The API detects form fields and matches them to your data
3. Fields are filled and the form is returned as PDF or PNG

## Field Data Format

Provide field names with values and descriptions:

```python theme={null}
field_data = {
    "field_key": {
        "value": "The value to fill",
        "description": "Description to help match the field"
    }
}
```

### Examples

**Basic fields:**

```python theme={null}
field_data = {
    "first_name": {"value": "John", "description": "First name"},
    "last_name": {"value": "Doe", "description": "Last name"},
    "ssn": {"value": "123-45-6789", "description": "Social Security Number"}
}
```

**Checkboxes:**

```python theme={null}
field_data = {
    "is_citizen": {"value": "yes", "description": "US citizenship status"},
    "agree_terms": {"value": "checked", "description": "Terms agreement"}
}
```

Values like `"yes"`, `"true"`, `"1"`, `"checked"`, `"x"` will check boxes.

**Compound data:**

```python theme={null}
field_data = {
    "full_address": {
        "value": "123 Main St, New York, NY, 10001",
        "description": "Complete address"
    }
}
```

The API can split compound data across multiple form fields.

## Options

| Option                 | Type   | Default  | Description                                                                                                                                                                              |
| ---------------------- | ------ | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `field_data`           | dict   | Required | Field names mapped to values and descriptions                                                                                                                                            |
| `context`              | str    | None     | Additional context to help match fields                                                                                                                                                  |
| `confidence_threshold` | float  | `0.5`    | Minimum confidence for field matching (0.0-1.0)                                                                                                                                          |
| `max_pages`            | int    | None     | Maximum pages to process                                                                                                                                                                 |
| `page_range`           | str    | None     | Specific pages to process                                                                                                                                                                |
| `skip_cache`           | bool   | `False`  | Skip cached results                                                                                                                                                                      |
| `processing_location`  | str    | -        | Data residency region: `"eu"` or `"us"`. When set, use `file_url` or a pre-uploaded `datalab://` reference; multipart uploads are not supported. EU carries a regional pricing premium.  |
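
As a sketch of how these options might be combined in a REST request, the helper below assembles form fields for a `file_url`-based call, which is the route the table requires when `processing_location` is set. The function name `build_fill_payload` is hypothetical, and sending every value as a string form field is an assumption based on the `requests` example in the Quick Start.

```python
import json


def build_fill_payload(field_data, file_url, *, context=None,
                       confidence_threshold=0.5, page_range=None,
                       processing_location=None):
    """Assemble form fields for a fill request that references a file by URL."""
    if not 0.0 <= confidence_threshold <= 1.0:
        raise ValueError("confidence_threshold must be between 0.0 and 1.0")
    if processing_location not in (None, "eu", "us"):
        raise ValueError('processing_location must be "eu" or "us"')

    payload = {
        "file_url": file_url,
        "field_data": json.dumps(field_data),
        "confidence_threshold": str(confidence_threshold),
    }
    optional = {
        "context": context,
        "page_range": page_range,
        "processing_location": processing_location,
    }
    # Only send optional fields that were actually provided.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


payload = build_fill_payload(
    {"name": {"value": "John Doe", "description": "Full name"}},
    "datalab://file-abc123",
    context="W-4 Employee's Withholding Certificate for new hire",
    processing_location="eu",
)
# Then, for example:
# requests.post("https://www.datalab.to/api/v1/fill", data=payload, headers=headers)
```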

### Context Parameter

Use `context` to improve matching for specific form types:

```python theme={null}
options = FormFillingOptions(
    field_data={...},
    context="W-4 Employee's Withholding Certificate for new hire"
)
```

## Response

| Field              | Type  | Description                           |
| ------------------ | ----- | ------------------------------------- |
| `status`           | str   | `processing`, `complete`, or `failed` |
| `success`          | bool  | Whether filling succeeded             |
| `output_format`    | str   | `pdf` or `png`                        |
| `output_base64`    | str   | Base64-encoded filled form            |
| `fields_filled`    | list  | Field names that were filled          |
| `fields_not_found` | list  | Field names that couldn't be matched  |
| `page_count`       | int   | Pages processed                       |
| `runtime`          | float | Processing time in seconds            |
| `cost_breakdown`   | dict  | Cost details                          |
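
A polled result with these fields might be handled as follows. This is a minimal sketch: `decode_filled_form` is a hypothetical helper, the output filename is illustrative, and the `error` key is taken from the Quick Start example rather than from the table.

```python
import base64


def decode_filled_form(result: dict) -> tuple[str, bytes]:
    """Return a suggested filename and the decoded bytes of a completed result."""
    status = result.get("status")
    if status == "failed":
        raise RuntimeError(f"Form filling failed: {result.get('error')}")
    if status != "complete":
        raise RuntimeError(f"Result not ready (status: {status})")

    extension = result.get("output_format", "pdf")  # "pdf" or "png" per the table
    data = base64.b64decode(result["output_base64"])
    return f"filled_form.{extension}", data


def unmatched_fields(result: dict) -> list:
    """Field names the API reported it could not match to the form."""
    return list(result.get("fields_not_found", []))
```

Writing the returned bytes to disk with `open(filename, "wb")` gives the filled form. Checking `unmatched_fields` afterwards shows which entries may need a clearer `description` or additional `context`.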

## Supported Form Types

* **PDF with native AcroForm fields** - Uses PDF form fields directly
* **PDF with visual fields** - Detects field locations and adds text overlays
* **Images** (PNG, JPG) - Detects field locations and draws text on image

The API automatically detects the input type and uses the appropriate method.

<Warning>
  Results are deleted from Datalab servers one hour after processing completes.
</Warning>

## Next Steps

<CardGroup>
  <Card title="SDK Form Filling" icon="code" href="/docs/welcome/sdk/form-filling">
    Complete SDK reference for form filling
  </Card>

  <Card title="File Upload" icon="upload" href="/docs/recipes/file-management/file-upload-api">
    Upload forms for reuse across requests
  </Card>

  <Card title="Pipelines" icon="workflow" href="/docs/recipes/pipelines/pipeline-overview">
    Chain processors into versioned, reusable pipelines.
  </Card>

  <Card title="Webhooks" icon="bell" href="/platform/webhooks">
    Get notified when form filling completes
  </Card>
</CardGroup>


# Recipes Overview
Source: https://documentation.datalab.to/docs/recipes/overview

End-to-end guides for common document processing workflows.

Recipes are detailed, end-to-end guides with fully working code samples. Pick a recipe based on what you're trying to accomplish.

## By Use Case

<CardGroup>
  <Card title="Build Production Pipelines" icon="workflow" href="/docs/recipes/pipelines/pipeline-overview">
    Chain processors into versioned pipelines for production use
  </Card>

  <Card title="Process PDFs for RAG/LLM" icon="robot" href="/docs/recipes/conversion/conversion-api-overview">
    Convert documents to markdown or chunks for retrieval-augmented generation
  </Card>

  <Card title="Extract Invoice/Receipt Data" icon="receipt" href="/docs/recipes/structured-extraction/api-overview">
    Pull structured fields (amounts, dates, line items) from financial documents
  </Card>

  <Card title="Analyze Contracts" icon="file-contract" href="/docs/recipes/structured-extraction/api-overview">
    Extract parties, dates, and clauses from legal documents
  </Card>

  <Card title="Process Research Papers" icon="flask" href="/docs/recipes/structured-extraction/api-overview">
    Extract titles, authors, abstracts, and citations from academic papers
  </Card>

  <Card title="Fill Out Forms" icon="pen-line" href="/docs/recipes/form-filling/form-filling-api-overview">
    Automatically fill PDF and image forms with structured data
  </Card>

  <Card title="Generate Documents" icon="file-export" href="/docs/recipes/create-document/create-document-api-overview">
    Create Word documents from markdown with track changes
  </Card>

  <Card title="Split Batch-Scanned PDFs" icon="scissors" href="/docs/recipes/document-segmentation/auto-segmentation">
    Separate multi-document PDFs into individual documents
  </Card>

  <Card title="Review Track Changes" icon="file-diff" href="/docs/recipes/extract-redlines-and-comments/track-changes-from-word-documents">
    Extract redlines, insertions, deletions, and comments from Word documents
  </Card>
</CardGroup>

## By Feature

| Feature                | Description                                                 | Guide                                                                                  |
| ---------------------- | ----------------------------------------------------------- | -------------------------------------------------------------------------------------- |
| Document Conversion    | Convert PDFs, images, and office docs to markdown/HTML/JSON | [Guide](/docs/recipes/conversion/conversion-api-overview)                              |
| Batch Processing       | Process multiple documents concurrently                     | [Guide](/docs/recipes/conversion/batch-documents)                                      |
| Structured Extraction  | Extract fields using JSON schemas                           | [Guide](/docs/recipes/structured-extraction/api-overview)                              |
| Long Document Handling | Strategies for 100+ page documents                          | [Guide](/docs/recipes/structured-extraction/handling-long-documents)                   |
| Document Segmentation  | Split multi-document PDFs by section                        | [Guide](/docs/recipes/document-segmentation/auto-segmentation)                         |
| Form Filling           | Fill PDF and image forms programmatically                   | [Guide](/docs/recipes/form-filling/form-filling-api-overview)                          |
| Create Document        | Generate DOCX files from markdown                           | [Guide](/docs/recipes/create-document/create-document-api-overview)                    |
| File Upload            | Upload and manage files for reuse                           | [Guide](/docs/recipes/file-management/file-upload-api)                                 |
| Pipelines              | Chain processors into versioned, reusable configurations    | [Guide](/docs/recipes/pipelines/pipeline-overview)                                     |
| Pipeline Versioning    | Manage drafts, publish versions, pin production deployments | [Guide](/docs/recipes/pipelines/pipeline-versioning)                                   |
| Track Changes          | Extract redlines and comments from Word docs                | [Guide](/docs/recipes/extract-redlines-and-comments/track-changes-from-word-documents) |
| Forge Evals            | Compare parsing configurations side-by-side                 | [Guide](/docs/recipes/forge-evals/overview)                                            |

## Self-Hosted

All cloud API recipes work with our [on-premises containers](/docs/on-prem/overview) for sensitive documents. See the [feature parity table](/docs/on-prem/api#feature-parity) for available features.


# Create a Pipeline
Source: https://documentation.datalab.to/docs/recipes/pipelines/create-pipeline

Build pipelines using Forge or the SDK to chain document processors.

**Before you begin**, make sure you have:

1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits)
2. Python 3.10+ installed
3. The Datalab SDK: `pip install datalab-python-sdk`
4. Your `DATALAB_API_KEY` environment variable set

## Using Forge

[Forge](https://www.datalab.to/app/playground) provides a visual pipeline builder where you can:

1. **Start from a template** or create a blank pipeline
2. **Add processors** — click to add convert, extract, segment, custom, or fill processors
3. **Configure each processor** — set processing mode, schemas, field data, and options in the configuration panel
4. **Test with a document** — run the pipeline and watch each processor complete in real-time
5. **Save and version** — name your pipeline and publish versions for production use

Edits in Forge auto-save as a draft. Your published versions remain unchanged until you explicitly publish a new version.

## Using the SDK

### Create a Pipeline

Define processors using `PipelineProcessor` and create the pipeline:

```python theme={null}
from datalab_sdk import DatalabClient, PipelineProcessor

client = DatalabClient()

steps = [
    PipelineProcessor(type="convert", settings={
        "mode": "balanced",
        "output_format": "markdown"
    }),
    PipelineProcessor(type="extract", settings={
        "page_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string", "description": "Document title"},
                "date": {"type": "string", "description": "Document date"},
                "summary": {"type": "string", "description": "Brief summary"}
            }
        }
    })
]

pipeline = client.create_pipeline(steps=steps)
print(f"Created: {pipeline.pipeline_id}")  # pl_XXXXX
```

The pipeline starts as an unsaved draft.

### Save the Pipeline

Name and save the pipeline so it appears in your pipeline list:

```python theme={null}
pipeline = client.save_pipeline(
    pipeline.pipeline_id,
    name="Document Summarizer"
)
print(f"Saved: {pipeline.name}")
```

### Update Steps

Update a pipeline's steps. This creates a draft if the pipeline has a published version:

```python theme={null}
updated_steps = [
    PipelineProcessor(type="convert", settings={
        "mode": "accurate",  # Changed from balanced
        "output_format": "markdown"
    }),
    PipelineProcessor(type="extract", settings={
        "page_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "date": {"type": "string"},
                "summary": {"type": "string"},
                "author": {"type": "string"}  # Added field
            }
        }
    })
]

pipeline = client.update_pipeline(pipeline.pipeline_id, steps=updated_steps)
```

## Using the REST API

<CodeGroup>
  ```bash Create theme={null}
  curl -X POST https://www.datalab.to/api/v1/pipelines \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "steps": [
        {"type": "convert", "settings": {"mode": "balanced"}},
        {"type": "extract", "settings": {
          "page_schema": "{\"type\":\"object\",\"properties\":{\"title\":{\"type\":\"string\"}}}"
        }}
      ]
    }'
  ```

  ```bash Save theme={null}
  curl -X PUT https://www.datalab.to/api/v1/pipelines/PIPELINE_ID/save \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"name": "Document Summarizer"}'
  ```

  ```bash Update steps theme={null}
  curl -X PUT https://www.datalab.to/api/v1/pipelines/PIPELINE_ID \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "steps": [
        {"type": "convert", "settings": {"mode": "accurate"}},
        {"type": "extract", "settings": {
          "page_schema": "{\"type\":\"object\",\"properties\":{\"title\":{\"type\":\"string\"}}}"
        }}
      ]
    }'
  ```
</CodeGroup>
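
The cURL examples above send `page_schema` as a JSON-encoded string inside the JSON body, which is easy to get wrong when escaping by hand. The sketch below builds the same request body in Python. `build_create_payload` is a hypothetical helper, not part of the SDK, and whether the API also accepts a nested object for `page_schema` is not confirmed here.

```python
import json


def build_create_payload(mode, schema):
    """Build a create-pipeline body with page_schema encoded as a JSON string.

    Hypothetical helper that mirrors the shape of the cURL "Create" example.
    """
    return {
        "steps": [
            {"type": "convert", "settings": {"mode": mode}},
            # json.dumps produces the escaped string form used in the cURL example
            {"type": "extract", "settings": {"page_schema": json.dumps(schema)}},
        ]
    }


schema = {"type": "object", "properties": {"title": {"type": "string"}}}
payload = build_create_payload("balanced", schema)

# Serialize the whole body; this is the text that would follow -d in cURL.
body = json.dumps(payload, indent=2)
print(body)
```

Letting `json.dumps` handle the inner schema avoids hand-written `\"` escapes.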

## Processor Configuration Reference

### Convert Processor

Controls how the document is parsed.

```python theme={null}
PipelineProcessor(type="convert", settings={
    "mode": "balanced",           # fast, balanced, accurate
    "output_format": "markdown",  # markdown, html, json, chunks
    "paginate": True,             # Add page delimiters
    "include_images": True,       # Extract images
    "include_image_captions": True,
    "add_block_ids": False,       # Block IDs for citations
})
```

| Setting                    | Type | Default      | Description                         |
| -------------------------- | ---- | ------------ | ----------------------------------- |
| `mode`                     | str  | `"fast"`     | Processing mode                     |
| `output_format`            | str  | `"markdown"` | Output format                       |
| `paginate`                 | bool | `false`      | Add page delimiters                 |
| `include_images`           | bool | `true`       | Extract images from document        |
| `include_image_captions`   | bool | `true`       | Generate image captions             |
| `include_headers_footers`  | bool | `false`      | Include page headers/footers        |
| `add_block_ids`            | bool | `false`      | Add block IDs for citation tracking |
| `fence_synthetic_captions` | bool | `false`      | Fence synthetic image captions      |

### Extract Processor

Extracts structured data using a JSON schema. Must follow a `convert`, `segment`, or `custom` processor.

```python theme={null}
PipelineProcessor(type="extract", settings={
    "page_schema": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string", "description": "Invoice ID"},
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "amount": {"type": "number"}
                    }
                }
            }
        }
    }
})
```

| Setting       | Type | Description                            |
| ------------- | ---- | -------------------------------------- |
| `page_schema` | dict | JSON schema defining fields to extract |

<Tip>
  Use detailed `description` fields in your schema to improve extraction accuracy. Tell the model what to look for.
</Tip>
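
One way to act on this tip is to scan a schema for properties that lack a `description` before sending it. The following is a minimal sketch; `missing_descriptions` is a hypothetical helper and only handles the `properties` and array `items` constructs used in the example above.

```python
def missing_descriptions(schema, path=""):
    """Return dotted paths of schema properties that have no description."""
    missing = []
    for name, prop in schema.get("properties", {}).items():
        here = f"{path}.{name}" if path else name
        if not prop.get("description"):
            missing.append(here)
        # Recurse into nested objects and into the item schema of arrays.
        if prop.get("type") == "object":
            missing.extend(missing_descriptions(prop, here))
        elif prop.get("type") == "array" and isinstance(prop.get("items"), dict):
            missing.extend(missing_descriptions(prop["items"], here + "[]"))
    return missing


invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice ID"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
            },
        },
    },
}

for field in missing_descriptions(invoice_schema):
    print(f"No description: {field}")
```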

### Segment Processor

Splits a document into logical sections. Requires a preceding `convert` processor.

```python theme={null}
PipelineProcessor(type="segment", settings={
    "segmentation_schema": {
        "Cover Letter": "The cover letter or introductory section",
        "Resume": "The applicant's resume or CV",
        "References": "Reference letters or contact information"
    }
})
```

| Setting               | Type | Description                          |
| --------------------- | ---- | ------------------------------------ |
| `segmentation_schema` | dict | Map of section names to descriptions |

### Custom Processor

Applies use-case-specific customizations to convert output. Requires a preceding `convert` processor. See [Custom Processors](/docs/recipes/pipelines/custom-processors) for details.

```python theme={null}
PipelineProcessor(
    type="custom",
    settings={},
    custom_processor_id="cp_abc123"  # Your custom processor ID
)
```

| Field                 | Type | Description                             |
| --------------------- | ---- | --------------------------------------- |
| `custom_processor_id` | str  | ID of the custom processor (`cp_XXXXX`) |
| `eval_rubric_id`      | int  | Optional evaluation rubric to apply     |

### Fill Processor

Fills form fields in a PDF or image. `fill` is always the only step in a pipeline — it cannot be chained with `convert`, `extract`, or `segment`. Use it to apply versioning and execution tracking to your form-filling workflows.

```python theme={null}
PipelineProcessor(type="fill", settings={
    "field_data": {
        "full_name": {"value": "John Doe", "description": "Full legal name"},
        "date": {"value": "2024-01-15", "description": "Today's date"},
    },
    "context": "Employee onboarding form",  # Optional
    "confidence_threshold": 0.5,             # Optional, default 0.5
})
```

| Setting                | Type  | Required | Description                                                    |
| ---------------------- | ----- | -------- | -------------------------------------------------------------- |
| `field_data`           | dict  | Yes      | Map of field keys to `{value, description}` objects            |
| `context`              | str   | No       | Additional context to improve field matching                   |
| `confidence_threshold` | float | No       | Minimum confidence for field matching (0.0–1.0, default `0.5`) |
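
If your form values come from another system, a small builder can assemble the `settings` dict and check the documented constraints before the pipeline is created. This is a sketch under assumptions: `build_fill_settings` is a hypothetical helper, and it requires a description for every field because the table describes entries as `{value, description}` objects.

```python
def build_fill_settings(fields, context=None, confidence_threshold=0.5):
    """Build fill-processor settings from {key: (value, description)} pairs."""
    if not fields:
        raise ValueError("field_data is required and cannot be empty")
    if not 0.0 <= confidence_threshold <= 1.0:
        raise ValueError("confidence_threshold must be between 0.0 and 1.0")

    field_data = {
        key: {"value": value, "description": description}
        for key, (value, description) in fields.items()
    }
    settings = {
        "field_data": field_data,
        "confidence_threshold": confidence_threshold,
    }
    # context is optional, so only include it when provided
    if context is not None:
        settings["context"] = context
    return settings


fill_settings = build_fill_settings(
    {
        "full_name": ("John Doe", "Full legal name"),
        "date": ("2024-01-15", "Today's date"),
    },
    context="Employee onboarding form",
)
print(fill_settings)
```

The resulting dict has the same shape as the `settings` argument in the example above.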

## List and Manage Pipelines

```python theme={null}
# List saved pipelines
result = client.list_pipelines(saved_only=True, limit=50)
for p in result["pipelines"]:
    print(f"{p.pipeline_id}: {p.name} (v{p.active_version})")

# Get a specific pipeline
pipeline = client.get_pipeline("pl_abc123")

# Archive (soft-delete)
client.archive_pipeline("pl_abc123")

# Restore
client.unarchive_pipeline("pl_abc123")
```

## Next Steps

<CardGroup>
  <Card title="Pipeline Versioning" icon="code-branch" href="/docs/recipes/pipelines/pipeline-versioning">
    Manage drafts, publish versions, and pin production deployments.
  </Card>

  <Card title="Run a Pipeline" icon="play" href="/docs/recipes/pipelines/run-pipeline">
    Execute pipelines with overrides and track results.
  </Card>

  <Card title="Structured Extraction" icon="table" href="/docs/recipes/structured-extraction/api-overview">
    Deep dive on extraction schemas and confidence scoring.
  </Card>

  <Card title="SDK Reference" icon="code" href="/docs/welcome/sdk/pipelines">
    Full SDK reference for all pipeline methods.
  </Card>
</CardGroup>


# Custom Processors
Source: https://documentation.datalab.to/docs/recipes/pipelines/custom-processors

Fine-tune document conversion output with AI-generated custom processors.

Custom processors customize the output of the `convert` processor. When standard conversion doesn't produce exactly what you need — edge-case layouts, domain-specific formatting, or use-case-specific output transformations — custom processors let you fine-tune the result.

**Before you begin**, make sure you have:

1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits)
2. Python 3.10+ installed
3. The Datalab SDK: `pip install datalab-python-sdk`
4. Your `DATALAB_API_KEY` environment variable set

## How Custom Processors Work

A custom processor applies modifications on top of document conversion. The flow is:

1. The `convert` processor parses your document into structured output
2. The custom processor applies your modifications to refine that output

Modifications can operate at different levels:

* **Block-level** — Modify individual blocks (e.g., rewrite table captions, summarize content)
* **Page-level** — Modify entire pages with full structural control (e.g., reorder blocks, add/remove elements)
* **Classification** — Classify pages into categories for downstream routing

## Creating a Custom Processor

The recommended way to create a custom processor is through [Forge](https://www.datalab.to/app/playground). The creation flow is a 3-step guided wizard:

1. **Describe** — Use the chat-driven builder to articulate what your processor should do. Describe your goal in natural language (e.g., "Summarize all tables into bullet points" or "Extract only the financial data sections") and the AI assistant will help you refine and confirm the specification before generating the processor.
2. **Documents** — Upload example documents that represent your use case. These are used to generate and validate the processor configuration.
3. **Review** — See the generated processor run on your examples. If the results aren't right, use the **Improve** tab in the sidebar to describe what to change and generate a new version. The **History** tab shows all past versions and lets you revert to any of them; **Details** shows the active configuration.

Each custom processor gets an ID in the format `cp_XXXXX`.

## Using a Custom Processor

### Standalone

Run a custom processor directly on a document:

<CodeGroup>
  ```python Python SDK theme={null}
  from datalab_sdk import DatalabClient, CustomProcessorOptions

  client = DatalabClient()

  options = CustomProcessorOptions(
      pipeline_id="cp_abc123",    # Your custom processor ID
      mode="balanced",
      output_format="markdown",
  )

  result = client.run_custom_processor("document.pdf", options=options)
  print(result.markdown)
  ```

  ```bash cURL theme={null}
  curl -X POST https://www.datalab.to/api/v1/custom-processor \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -F "file=@document.pdf" \
    -F "pipeline_id=cp_abc123" \
    -F "mode=balanced" \
    -F "output_format=markdown"
  ```

  ```python Python (requests) theme={null}
  import os, time, requests

  url = "https://www.datalab.to/api/v1/custom-processor"
  headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")}

  with open("document.pdf", "rb") as f:
      resp = requests.post(url, headers=headers,
          files={"file": ("document.pdf", f, "application/pdf")},
          data={
              "pipeline_id": "cp_abc123",
              "mode": "balanced",
              "output_format": "markdown"
          })

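  resp.raise_for_status()  # Fail fast on HTTP errors such as an invalid API key
  check_url = resp.json()["request_check_url"]

  # Poll every 2 seconds, up to 300 times (about 10 minutes)
  for _ in range(300):
      result = requests.get(check_url, headers=headers).json()
      if result["status"] == "complete":
          print(result["markdown"])
          break
      time.sleep(2)
  else:
      # The loop ended without a break, so the job never reported "complete"
      raise TimeoutError("Custom processor run did not complete in time")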
  ```
</CodeGroup>
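
The polling loop in the `requests` example can be factored into a reusable helper that raises on timeout instead of falling through silently. The sketch below is generic Python, not an SDK function: `poll_until_done` and its parameters are hypothetical, and the `fetch` callable stands in for the `requests.get(check_url, ...).json()` call so the logic can be exercised without network access.

```python
import time


def poll_until_done(fetch, done_statuses=("complete",), max_polls=300,
                    interval=2.0, sleep=time.sleep):
    """Call fetch() until its "status" is in done_statuses, then return the result.

    Raises TimeoutError if no terminal status is seen within max_polls calls.
    """
    for attempt in range(max_polls):
        result = fetch()
        if result.get("status") in done_statuses:
            return result
        # Skip the final sleep: there is no further poll to wait for.
        if attempt < max_polls - 1:
            sleep(interval)
    raise TimeoutError(f"not finished after {max_polls} polls")


# Simulated responses standing in for the status endpoint.
responses = iter([
    {"status": "processing"},
    {"status": "processing"},
    {"status": "complete", "markdown": "# Title"},
])
result = poll_until_done(lambda: next(responses), sleep=lambda seconds: None)
print(result["markdown"])
```

With a real request, `fetch` would be something like `lambda: requests.get(check_url, headers=headers).json()`.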

### In a Pipeline

Use a custom processor as part of a pipeline by adding it as a `custom` processor:

```python theme={null}
from datalab_sdk import DatalabClient, PipelineProcessor

client = DatalabClient()

pipeline = client.create_pipeline(steps=[
    PipelineProcessor(type="convert", settings={"mode": "balanced"}),
    PipelineProcessor(type="custom", settings={}, custom_processor_id="cp_abc123"),
    PipelineProcessor(type="extract", settings={
        "page_schema": {
            "type": "object",
            "properties": {
                "summary": {"type": "string"}
            }
        }
    })
])
```

This chains convert → custom → extract: the document is parsed, your custom modifications are applied, then structured data is extracted from the customized output.

## CustomProcessorOptions

| Option                     | Type | Default        | Description                                                 |
| -------------------------- | ---- | -------------- | ----------------------------------------------------------- |
| `pipeline_id`              | str  | Required       | Custom processor ID (`cp_XXXXX`)                            |
| `version`                  | int  | Active version | Specific processor version to run                           |
| `run_eval`                 | bool | `False`        | Run evaluation rules after processing                       |
| `mode`                     | str  | `"fast"`       | Processing mode: `"fast"`, `"balanced"`, `"accurate"`       |
| `output_format`            | str  | `"markdown"`   | Output format: `"markdown"`, `"html"`, `"json"`, `"chunks"` |
| `paginate`                 | bool | `False`        | Add page delimiters                                         |
| `add_block_ids`            | bool | `False`        | Add block IDs for citation tracking                         |
| `disable_image_extraction` | bool | `False`        | Don't extract images                                        |
| `disable_image_captions`   | bool | `False`        | Don't generate image captions                               |
| `webhook_url`              | str  | -              | Webhook URL for completion notification                     |

## Versioning

Custom processors support versioning. Each iteration creates a new version, letting you refine behavior over time:

```python theme={null}
# List versions
versions = client.list_custom_processor_versions("cp_abc123")
for v in versions["versions"]:
    print(f"v{v.version}: {v.description}")

# Switch active version
client.set_active_processor_version("cp_abc123", version=2)
```

## Managing Custom Processors

```python theme={null}
# List your custom processors
result = client.list_custom_processors(limit=50)
for p in result["processors"]:
    print(f"{p.processor_id}: {p.name} (v{p.active_version})")

# Archive
client.archive_custom_processor("cp_abc123")
```

## Next Steps

<CardGroup>
  <Card title="Pipeline Overview" icon="sitemap" href="/docs/recipes/pipelines/pipeline-overview">
    Processor types, composition rules, and when to use pipelines.
  </Card>

  <Card title="Create a Pipeline" icon="hammer" href="/docs/recipes/pipelines/create-pipeline">
    Build pipelines that include custom processors.
  </Card>

  <Card title="Document Conversion" icon="file-lines" href="/docs/recipes/conversion/conversion-api-overview">
    Understand the convert processor that custom processors build on.
  </Card>

  <Card title="Forge Evals" icon="vials" href="/docs/recipes/forge-evals/overview">
    Evaluate and compare processor configurations across your document collection.
  </Card>
</CardGroup>


# Pipelines
Source: https://documentation.datalab.to/docs/recipes/pipelines/pipeline-overview

Build versioned document processing pipelines by chaining processors together.

Pipelines chain processors — convert, extract, segment, and custom — into a single reusable unit. Define a pipeline once, version it, and run it against any document with one API call.

**Before you begin**, make sure you have:

1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits)
2. Python 3.10+ installed
3. The Datalab SDK: `pip install datalab-python-sdk`
4. Your `DATALAB_API_KEY` environment variable set

## Why Pipelines

Individual endpoints like `/convert` and `/extract` work well for one-off tasks. Pipelines are better when you need to:

* **Chain processors** — Convert a document, then extract structured data, in one call
* **Version your configuration** — Pin production integrations to a specific version while iterating on drafts
* **Standardize processing** — Share pipeline configurations across your team
* **Track execution** — Monitor each processor's status as a pipeline runs

<Info>
  You can build pipelines visually in [Forge](https://www.datalab.to/app/playground) or programmatically via the SDK and API.
</Info>

## How Pipelines Work

A pipeline is an ordered chain of processors. Each processor runs on the document and passes its output to the next processor via checkpoints.

```
convert → segment → extract
```

Most pipelines start with `convert`. The `fill` processor is the exception — it runs as a standalone step and cannot be chained.

### Processor Types

| Processor | Description                                                         | Can Follow                     |
| --------- | ------------------------------------------------------------------- | ------------------------------ |
| `convert` | Parse document to markdown/HTML/JSON                                | Must be first                  |
| `segment` | Split document into logical sections                                | `convert`                      |
| `extract` | Extract structured data using a JSON schema                         | `convert`, `segment`, `custom` |
| `custom`  | Run a [custom processor](/docs/recipes/pipelines/custom-processors) | `convert`                      |
| `fill`    | Fill form fields in a PDF or image                                  | Standalone only                |

### Composition Rules

* Every pipeline starts with a `convert` or `fill` processor
* `extract` is always terminal (nothing can follow it)
* `segment` can feed into `extract`
* `custom` can feed into `extract`
* `fill` is always standalone — it cannot follow or precede other processors

Common patterns:

| Pattern                       | Use Case                                 |
| ----------------------------- | ---------------------------------------- |
| `convert`                     | Simple document parsing                  |
| `convert → extract`           | Parse and extract structured fields      |
| `convert → segment`           | Parse and split into sections            |
| `convert → segment → extract` | Split, then extract from each section    |
| `convert → custom → extract`  | Apply custom processing, then extract    |
| `fill`                        | Version and track form-filling workflows |
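
The composition rules can be checked locally before calling the API. The sketch below encodes the "Can Follow" column and the rules above as data. `validate_pipeline` is a hypothetical helper, not part of the SDK, and the service remains the authority on what it accepts.

```python
KNOWN_TYPES = {"convert", "segment", "extract", "custom", "fill"}

# Allowed predecessors for each processor type, from the "Can Follow" column.
ALLOWED_PREDECESSORS = {
    "segment": {"convert"},
    "extract": {"convert", "segment", "custom"},
    "custom": {"convert"},
}


def validate_pipeline(step_types):
    """Return a list of rule violations; an empty list means the chain is valid."""
    if not step_types:
        return ["pipeline has no steps"]
    errors = [f"unknown processor type: {t}" for t in step_types if t not in KNOWN_TYPES]
    if errors:
        return errors
    # fill is standalone: it cannot follow or precede other processors.
    if "fill" in step_types:
        return [] if len(step_types) == 1 else ["fill must be the only step"]
    if step_types[0] != "convert":
        errors.append("pipeline must start with convert or fill")
    for prev, cur in zip(step_types, step_types[1:]):
        if cur == "convert":
            errors.append("convert must be the first step")
        elif prev not in ALLOWED_PREDECESSORS[cur]:
            # Also covers "extract is terminal": nothing lists extract as a predecessor.
            errors.append(f"{cur} cannot follow {prev}")
    return errors


for chain in (["convert", "segment", "extract"], ["convert", "extract", "segment"]):
    print(chain, validate_pipeline(chain) or "ok")
```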

## Pipeline Lifecycle

Pipelines have three states:

1. **Draft** — Edits auto-save. Not versioned yet.
2. **Saved** — Named and visible in your pipeline list.
3. **Published** — An immutable version snapshot. Safe to use in production.

```
Create (draft) → Save (named) → Publish version (immutable)
         ↑                              |
         └──── Edit (new draft) ←───────┘
```

When you edit a published pipeline, your changes go into a draft. The published version remains unchanged until you publish a new version. You can discard the draft at any time to revert.

See [Pipeline Versioning](/docs/recipes/pipelines/pipeline-versioning) for the full lifecycle.
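
As a mental model, the lifecycle can be pictured as a mutable draft plus a list of immutable snapshots. The toy class below is purely illustrative and is not how the service or SDK is implemented: `PipelineModel` and its methods are hypothetical, and it assumes that publishing makes the new version the active one.

```python
import copy


class PipelineModel:
    """Toy in-memory model of the draft/publish lifecycle (not the Datalab SDK)."""

    def __init__(self, steps):
        self.draft = copy.deepcopy(steps)
        self.versions = []        # versions[0] is v1, versions[1] is v2, ...
        self.active_version = 0   # 0 means nothing has been published yet

    def update(self, steps):
        # Edits only ever touch the draft, never a published snapshot.
        self.draft = copy.deepcopy(steps)

    def publish(self):
        # Snapshot the draft; assumed here to become the active version.
        self.versions.append(copy.deepcopy(self.draft))
        self.active_version = len(self.versions)
        return self.active_version

    def discard_draft(self):
        # Revert the draft to the steps of the active published version.
        if self.active_version == 0:
            raise ValueError("nothing published to revert to")
        self.draft = copy.deepcopy(self.versions[self.active_version - 1])


pipeline = PipelineModel([{"type": "convert", "settings": {"mode": "balanced"}}])
v1 = pipeline.publish()
pipeline.update([{"type": "convert", "settings": {"mode": "accurate"}}])

# The published snapshot keeps its original settings while the draft changes.
print("published:", pipeline.versions[v1 - 1][0]["settings"]["mode"])
print("draft:", pipeline.draft[0]["settings"]["mode"])

pipeline.discard_draft()
print("draft after discard:", pipeline.draft[0]["settings"]["mode"])
```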

## Quick Example

Create a pipeline that converts a document and extracts invoice data:

<CodeGroup>
  ```python Python SDK theme={null}
  from datalab_sdk import DatalabClient, PipelineProcessor

  client = DatalabClient()

  # Define steps
  steps = [
      PipelineProcessor(type="convert", settings={"mode": "balanced"}),
      PipelineProcessor(type="extract", settings={
          "page_schema": {
              "type": "object",
              "properties": {
                  "invoice_number": {"type": "string"},
                  "total_amount": {"type": "number"},
                  "vendor_name": {"type": "string"}
              }
          }
      })
  ]

  # Create and save
  pipeline = client.create_pipeline(steps=steps)
  pipeline = client.save_pipeline(pipeline.pipeline_id, name="Invoice Extractor")

  # Run on a document
  execution = client.run_pipeline(
      pipeline.pipeline_id,
      file_path="invoice.pdf"
  )

  # Poll until complete
  execution = client.get_pipeline_execution(
      execution.execution_id,
      max_polls=300
  )

  # Get extraction result
  result = client.get_step_result(execution.execution_id, step_index=1)
  print(result)
  ```

  ```bash cURL theme={null}
  # Create pipeline
  curl -X POST https://www.datalab.to/api/v1/pipelines \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "steps": [
        {"type": "convert", "settings": {"mode": "balanced"}},
        {"type": "extract", "settings": {
          "page_schema": {
            "type": "object",
            "properties": {
              "invoice_number": {"type": "string"},
              "total_amount": {"type": "number"}
            }
          }
        }}
      ]
    }'

  # Save pipeline (use pipeline_id from response)
  curl -X PUT https://www.datalab.to/api/v1/pipelines/PIPELINE_ID/save \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"name": "Invoice Extractor"}'

  # Run pipeline
  curl -X POST https://www.datalab.to/api/v1/pipelines/PIPELINE_ID/run \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -F "file=@invoice.pdf"

  # Poll execution (use execution_id from response)
  curl https://www.datalab.to/api/v1/pipelines/executions/EXECUTION_ID \
    -H "X-API-Key: $DATALAB_API_KEY"
  ```

  ```python Python (requests) theme={null}
  import os, time, requests, json

  BASE = "https://www.datalab.to/api/v1"
  headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")}

  # Create pipeline
  resp = requests.post(f"{BASE}/pipelines", headers={
      **headers, "Content-Type": "application/json"
  }, json={
      "steps": [
          {"type": "convert", "settings": {"mode": "balanced"}},
          {"type": "extract", "settings": {
              "page_schema": json.dumps({
                  "type": "object",
                  "properties": {
                      "invoice_number": {"type": "string"},
                      "total_amount": {"type": "number"}
                  }
              })
          }}
      ]
  })
  pipeline_id = resp.json()["pipeline_id"]

  # Save
  requests.put(f"{BASE}/pipelines/{pipeline_id}/save",
      headers={**headers, "Content-Type": "application/json"},
      json={"name": "Invoice Extractor"})

  # Run
  with open("invoice.pdf", "rb") as f:
      resp = requests.post(f"{BASE}/pipelines/{pipeline_id}/run",
          headers=headers,
          files={"file": ("invoice.pdf", f, "application/pdf")})
  execution_id = resp.json()["execution_id"]

  # Poll
  for _ in range(300):
      resp = requests.get(f"{BASE}/pipelines/executions/{execution_id}",
          headers=headers)
      data = resp.json()
      if data["status"] in ("completed", "failed"):
          break
      time.sleep(2)

  # Get step result
  resp = requests.get(
      f"{BASE}/pipelines/executions/{execution_id}/steps/1/result",
      headers=headers)
  print(resp.json())
  ```
</CodeGroup>

## Pipelines vs Individual Endpoints

|                   | Individual Endpoints      | Pipelines                        |
| ----------------- | ------------------------- | -------------------------------- |
| **Processors**    | One at a time             | Chain multiple processors        |
| **Versioning**    | None                      | Draft, saved, published versions |
| **Configuration** | Pass options per request  | Configure once, reuse            |
| **Forge UI**      | Playground                | Full pipeline builder            |
| **Best for**      | Quick tests, simple tasks | Production integrations          |

Individual endpoints (`/convert`, `/extract`, `/segment`) are not going away. Use them for simple, one-off processing. Use Pipelines when you need repeatability, versioning, or multi-processor chains.

## Next Steps

<CardGroup>
  <Card title="Create a Pipeline" icon="hammer" href="/docs/recipes/pipelines/create-pipeline">
    Build your first pipeline with Forge or the SDK.
  </Card>

  <Card title="Pipeline Versioning" icon="code-branch" href="/docs/recipes/pipelines/pipeline-versioning">
    Manage drafts, versions, and production deployments.
  </Card>

  <Card title="Run a Pipeline" icon="play" href="/docs/recipes/pipelines/run-pipeline">
    Execute pipelines with overrides, polling, and webhooks.
  </Card>

  <Card title="SDK Reference" icon="code" href="/docs/welcome/sdk/pipelines">
    Full SDK reference for all pipeline methods.
  </Card>
</CardGroup>


# Pipeline Versioning
Source: https://documentation.datalab.to/docs/recipes/pipelines/pipeline-versioning

Manage pipeline drafts, publish immutable versions, and pin production deployments.

**Before you begin**, make sure you have:

1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits)
2. Python 3.10+ installed
3. The Datalab SDK: `pip install datalab-python-sdk`
4. Your `DATALAB_API_KEY` environment variable set

## Version Lifecycle

Every pipeline goes through a predictable lifecycle:

| State     | `active_version` | Description                                 |
| --------- | ---------------- | ------------------------------------------- |
| Draft     | `0`              | Edits auto-save. No published version yet.  |
| Saved     | `0`              | Named pipeline, still no published version. |
| Published | `1`, `2`, ...    | Immutable version snapshots exist.          |

When you edit a published pipeline, your changes go into a draft. The published version is untouched until you explicitly publish again.

## Publish a Version

Create an immutable snapshot of the current pipeline steps:

<CodeGroup>
  ```python Python SDK theme={null}
  from datalab_sdk import DatalabClient

  client = DatalabClient()

  # Publish version 1
  version = client.create_pipeline_version(
      "pl_abc123",
      description="Initial production release"
  )
  print(f"Published v{version.version}")  # v1
  ```

  ```bash cURL theme={null}
  curl -X POST https://www.datalab.to/api/v1/pipelines/PIPELINE_ID/versions \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"description": "Initial production release"}'
  ```

  ```python Python (requests) theme={null}
  import os, requests

  BASE = "https://www.datalab.to/api/v1"
  headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")}

  resp = requests.post(f"{BASE}/pipelines/pl_abc123/versions",
      headers={**headers, "Content-Type": "application/json"},
      json={"description": "Initial production release"})
  print(resp.json())
  ```
</CodeGroup>

Each call increments the version number. Published versions are immutable — their steps cannot be changed.

## Edit and Iterate

After publishing, any edits create a draft that is separate from the published version:

```python theme={null}
from datalab_sdk import PipelineProcessor

# Edit steps — this creates a draft
client.update_pipeline("pl_abc123", steps=[
    PipelineProcessor(type="convert", settings={"mode": "accurate"}),  # Changed
    PipelineProcessor(type="extract", settings={
        "page_schema": {"type": "object", "properties": {
            "title": {"type": "string"},
            "author": {"type": "string"}  # Added field
        }}
    })
])

# Test the draft
execution = client.run_pipeline("pl_abc123", file_path="test.pdf", version=0)

# Happy with changes? Publish a new version
version = client.create_pipeline_version("pl_abc123", description="Added author field")
print(f"Published v{version.version}")  # v2
```

<Warning>
  `version=0` explicitly runs the draft. Omitting `version` runs the active published version. See [Run a Pipeline](/docs/recipes/pipelines/run-pipeline) for version parameter details.
</Warning>
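
The version-selection rule can be summarized as a small function. This is a local illustration of the behavior described above, not SDK code: `resolve_run_version` is a hypothetical name, and what happens when `version` is omitted on a pipeline with no published version is not specified here, so the sketch simply returns `active_version` in that case.

```python
def resolve_run_version(requested, active_version, published_versions):
    """Return the version a run would use: 0 for the draft, N for published vN."""
    if requested is None:
        # Omitting version runs the active published version.
        return active_version
    if requested == 0:
        # version=0 explicitly selects the draft.
        return 0
    if requested not in published_versions:
        raise ValueError(f"version {requested} has not been published")
    return requested


published = [1, 2]
for requested in (None, 0, 1):
    chosen = resolve_run_version(requested, active_version=2, published_versions=published)
    label = "draft" if chosen == 0 else f"v{chosen}"
    print(f"version={requested!r} runs {label}")
```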

## Discard a Draft

Discard the draft's unpublished changes and restore a published version's steps:

<CodeGroup>
  ```python Python SDK theme={null}
  # Discard draft, revert to active version
  pipeline = client.discard_pipeline_draft("pl_abc123")

  # Or revert to a specific version
  pipeline = client.discard_pipeline_draft("pl_abc123", version=1)
  ```

  ```bash cURL theme={null}
  # Discard draft, revert to active version
  curl -X POST https://www.datalab.to/api/v1/pipelines/PIPELINE_ID/discard \
    -H "X-API-Key: $DATALAB_API_KEY"

  # Revert to specific version
  curl -X POST https://www.datalab.to/api/v1/pipelines/PIPELINE_ID/discard \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"version": 1}'
  ```

  ```python Python (requests) theme={null}
  import os, requests

  BASE = "https://www.datalab.to/api/v1"
  headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")}

  resp = requests.post(f"{BASE}/pipelines/pl_abc123/discard", headers=headers)
  print(resp.json())
  ```
</CodeGroup>

## Browse Version History

List all published versions for a pipeline:

```python theme={null}
result = client.list_pipeline_versions("pl_abc123")

for v in result["versions"]:
    print(f"v{v.version}: {v.description} (created {v.created})")
    print(f"  Steps: {[s['type'] for s in v.steps]}")
```

Versions are returned newest-first.

## Best Practices

**Pin production integrations to a specific version.** When calling `run_pipeline()` from production code, pass an explicit `version` number. This protects you from accidental changes:

```python theme={null}
# Production code — pinned to v2
execution = client.run_pipeline(
    "pl_abc123",
    file_path="document.pdf",
    version=2  # Always runs v2, even if v3 is published later
)
```

**Test drafts before publishing.** Use `version=0` to run the draft version against test documents:

```python theme={null}
# Test draft changes
execution = client.run_pipeline(
    "pl_abc123",
    file_path="test_document.pdf",
    version=0  # Runs draft
)
```

**Use descriptions.** Include a meaningful description when publishing so your team can understand what changed:

```python theme={null}
client.create_pipeline_version(
    "pl_abc123",
    description="Switch to accurate mode, add line_items extraction"
)
```

**Archive unused pipelines.** Keep your pipeline list clean:

```python theme={null}
client.archive_pipeline("pl_old123")

# Include archived pipelines in the list when you need them
result = client.list_pipelines(include_archived=True)
```

## Next Steps

<CardGroup>
  <Card title="Run a Pipeline" icon="play" href="/docs/recipes/pipelines/run-pipeline">
    Execute pipelines with version selection, overrides, and polling.
  </Card>

  <Card title="Create a Pipeline" icon="hammer" href="/docs/recipes/pipelines/create-pipeline">
    Build pipelines with Forge or the SDK.
  </Card>

  <Card title="Pipeline Overview" icon="sitemap" href="/docs/recipes/pipelines/pipeline-overview">
    Processor types, composition rules, and when to use pipelines.
  </Card>

  <Card title="SDK Reference" icon="code" href="/docs/welcome/sdk/pipelines">
    Full SDK reference for all pipeline methods.
  </Card>
</CardGroup>


# Run a Pipeline
Source: https://documentation.datalab.to/docs/recipes/pipelines/run-pipeline

Execute pipelines with version selection, overrides, polling, and per-processor result retrieval.

**Before you begin**, make sure you have:

1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits)
2. Python 3.10+ installed
3. The Datalab SDK: `pip install datalab-python-sdk`
4. Your `DATALAB_API_KEY` environment variable set

## Basic Execution

Run a pipeline on a document:

<CodeGroup>
  ```python Python SDK theme={null}
  from datalab_sdk import DatalabClient

  client = DatalabClient()

  execution = client.run_pipeline(
      "pl_abc123",
      file_path="document.pdf"
  )

  # Poll until complete
  execution = client.get_pipeline_execution(
      execution.execution_id,
      max_polls=300,
      poll_interval=2
  )

  print(f"Status: {execution.status}")
  ```

  ```bash cURL theme={null}
  # Start execution
  curl -X POST https://www.datalab.to/api/v1/pipelines/PIPELINE_ID/run \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -F "file=@document.pdf"

  # Poll for completion (use execution_id from response)
  curl https://www.datalab.to/api/v1/pipelines/executions/EXECUTION_ID \
    -H "X-API-Key: $DATALAB_API_KEY"
  ```

  ```python Python (requests) theme={null}
  import os, time, requests

  BASE = "https://www.datalab.to/api/v1"
  headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")}

  # Start execution
  with open("document.pdf", "rb") as f:
      resp = requests.post(f"{BASE}/pipelines/pl_abc123/run",
          headers=headers,
          files={"file": ("document.pdf", f, "application/pdf")})

  execution_id = resp.json()["execution_id"]

  # Poll
  for _ in range(300):
      resp = requests.get(f"{BASE}/pipelines/executions/{execution_id}",
          headers=headers)
      data = resp.json()
      if data["status"] in ("completed", "completed_with_errors", "failed"):
          break
      time.sleep(2)

  print(f"Status: {data['status']}")
  ```
</CodeGroup>

You can also pass a URL instead of a file:

```python theme={null}
execution = client.run_pipeline(
    "pl_abc123",
    file_url="https://example.com/document.pdf"
)
```

## Version Selection

The `version` parameter controls which pipeline configuration runs:

| Value            | Behavior                                                                           |
| ---------------- | ---------------------------------------------------------------------------------- |
| Omitted / `None` | Runs the **active published version**. If no version is published, runs the draft. |
| `0`              | Explicitly runs the **draft** (current unpublished edits).                         |
| `1`, `2`, ...    | Runs a **specific published version**.                                             |

```python theme={null}
# Run the active published version
execution = client.run_pipeline("pl_abc123", file_path="doc.pdf")

# Run draft for testing
execution = client.run_pipeline("pl_abc123", file_path="doc.pdf", version=0)

# Pin to a specific version (recommended for production)
execution = client.run_pipeline("pl_abc123", file_path="doc.pdf", version=2)
```

<Warning>
  If you omit `version` and no version has been published, the draft runs. Publish a version before using a pipeline in production to avoid running unfinished drafts.
</Warning>
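
The selection rules in the table can be sketched as a small pure function. This only illustrates the documented behavior and is not SDK code; `resolve_version` and its `active_version` argument are hypothetical names.

```python
from typing import Optional


def resolve_version(requested: Optional[int], active_version: Optional[int]) -> str:
    """Describe which configuration runs, following the version table.

    `active_version` is the currently active published version, or None
    if nothing has been published yet (hypothetical input for illustration).
    """
    if requested is None:
        # Omitted: active published version, falling back to the draft.
        return f"v{active_version}" if active_version is not None else "draft"
    if requested == 0:
        # 0 always means the draft.
        return "draft"
    if requested < 0:
        raise ValueError("version must be 0 or a positive integer")
    return f"v{requested}"


print(resolve_version(None, active_version=2))     # v2
print(resolve_version(None, active_version=None))  # draft
print(resolve_version(0, active_version=2))        # draft
print(resolve_version(1, active_version=2))        # v1
```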

## Run-Level Overrides

Override pipeline behavior per execution without changing the pipeline configuration:

```python theme={null}
execution = client.run_pipeline(
    "pl_abc123",
    file_path="document.pdf",
    page_range="0-5",          # Process specific pages
    output_format="json",      # Override output format
    skip_cache=True,           # Force reprocessing (skip cached results)
    run_evals=True,            # Run evaluation rubrics defined on steps
    webhook_url="https://example.com/webhook",  # Notify on completion
    version=2,                 # Pin to version 2
)
```

### Override Reference

| Parameter       | Type | Default | Description                                                  |
| --------------- | ---- | ------- | ------------------------------------------------------------ |
| `file_path`     | str  | -       | Local file to process (mutually exclusive with `file_url`)   |
| `file_url`      | str  | -       | URL to document                                              |
| `page_range`    | str  | -       | Pages to process (e.g., `"0-5,10"`, 0-indexed)               |
| `output_format` | str  | -       | Override output format: `markdown`, `html`, `json`, `chunks` |
| `skip_cache`    | bool | `False` | Skip cached results, reprocess from scratch                  |
| `run_evals`     | bool | `False` | Run evaluation rubrics configured on steps                   |
| `webhook_url`   | str  | -       | URL to POST when execution completes                         |
| `version`       | int  | -       | Pipeline version to run (see above)                          |
| `max_polls`     | int  | `1`     | Polling attempts after submission                            |
| `poll_interval` | int  | `1`     | Seconds between polls                                        |
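
To make the `page_range` format concrete, the sketch below expands a string such as `"0-5,10"` into page indices. It assumes ranges are inclusive on both ends, which the table does not state; `expand_page_range` is a hypothetical helper, not part of the SDK.

```python
def expand_page_range(page_range: str) -> list[int]:
    """Expand a 0-indexed page range string into a sorted list of page indices.

    Assumption: "a-b" includes both a and b.
    """
    pages: set[int] = set()
    for part in page_range.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(expand_page_range("0-5,10"))  # [0, 1, 2, 3, 4, 5, 10]
```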

## Execution Status

Poll for status using `get_pipeline_execution()`:

```python theme={null}
execution = client.get_pipeline_execution(
    execution.execution_id,
    max_polls=300,      # Keep polling until complete
    poll_interval=2     # Check every 2 seconds
)

print(f"Status: {execution.status}")
print(f"Version: {execution.pipeline_version}")
print(f"Started: {execution.started_at}")
print(f"Completed: {execution.completed_at}")
```

### Status Values

| Status                  | Description                       |
| ----------------------- | --------------------------------- |
| `pending`               | Queued, not started               |
| `running`               | Processors are executing          |
| `completed`             | All steps finished successfully   |
| `completed_with_errors` | Some steps completed, some failed |
| `failed`                | Execution failed                  |
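
When polling the REST API yourself, the loop only needs to stop on the three terminal statuses in the table. A minimal sketch, with the status fetch injected as a callable so the logic can be exercised without network access; `wait_for_terminal` and `fetch_status` are hypothetical names.

```python
import time
from typing import Callable

# Terminal statuses from the status table; `pending` and `running` keep polling.
TERMINAL_STATUSES = {"completed", "completed_with_errors", "failed"}


def wait_for_terminal(
    fetch_status: Callable[[], str],
    max_polls: int = 300,
    poll_interval: float = 2.0,
) -> str:
    """Poll until a terminal status is seen or max_polls is exhausted."""
    for attempt in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if attempt < max_polls - 1:
            time.sleep(poll_interval)
    raise TimeoutError(f"no terminal status after {max_polls} polls")


# Simulated status sequence standing in for real API responses.
fake_statuses = iter(["pending", "running", "completed"])
print(wait_for_terminal(lambda: next(fake_statuses), poll_interval=0))  # completed
```

Raising on timeout, rather than returning the last status, keeps a still-running execution from being mistaken for a finished one.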

### Per-Processor Tracking

Each processor in the execution reports its own status:

```python theme={null}
for step in execution.steps:
    print(f"Step {step.step_index} ({step.step_type}): {step.status}")
    if step.error_message:
        print(f"  Error: {step.error_message}")
    if step.result_url:
        print("  Result available")
```

Step status values: `pending`, `dispatched`, `running`, `completed`, `failed`, `skipped`.

## Retrieve Processor Results

Fetch the output of a specific processor:

```python theme={null}
# Get result for step at index 1 (e.g., extract step)
result = client.get_step_result(execution.execution_id, step_index=1)
print(result)
```

<CodeGroup>
  ```bash cURL theme={null}
  curl https://www.datalab.to/api/v1/pipelines/executions/EXECUTION_ID/steps/1/result \
    -H "X-API-Key: $DATALAB_API_KEY"
  ```

  ```python Python (requests) theme={null}
  # Reuses BASE, headers, and execution_id from the Basic Execution example
  resp = requests.get(
      f"{BASE}/pipelines/executions/{execution_id}/steps/1/result",
      headers=headers)
  print(resp.json())
  ```
</CodeGroup>

<Warning>
  Results are deleted from Datalab servers one hour after processing completes. Retrieve your results promptly.
</Warning>

## Webhooks

Get notified when a pipeline execution completes instead of polling:

```python theme={null}
execution = client.run_pipeline(
    "pl_abc123",
    file_path="document.pdf",
    webhook_url="https://your-server.com/pipeline-webhook"
)
```

Datalab sends a POST request to your webhook URL when the execution reaches a terminal status. See [Webhooks](/platform/webhooks) for payload details.
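
A webhook receiver can be as small as an HTTP handler that accepts the POST and parses its JSON body. The sketch below uses only the Python standard library. It makes no assumptions about payload field names (see the Webhooks page for those), and the port, the `serve` function, and the `parse_webhook_body` helper are illustrative choices.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_webhook_body(raw: bytes) -> dict:
    """Decode a webhook request body as a JSON object."""
    payload = json.loads(raw.decode("utf-8"))
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    return payload


class PipelineWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = parse_webhook_body(self.rfile.read(length))
        except ValueError:
            # Malformed JSON also lands here (JSONDecodeError is a ValueError).
            self.send_response(400)
            self.end_headers()
            return
        print("Webhook received:", payload)
        # Acknowledge quickly; do heavy work outside the request handler.
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Run the receiver until interrupted (the port is arbitrary)."""
    HTTPServer(("0.0.0.0", port), PipelineWebhookHandler).serve_forever()
```

Call `serve()` to start listening locally, and expose it through a public URL before passing that URL as `webhook_url`.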

## List Executions

View recent executions for a pipeline:

```python theme={null}
result = client.list_pipeline_executions("pl_abc123", limit=20)

for ex in result["executions"]:
    print(f"{ex.execution_id}: {ex.status} (v{ex.pipeline_version})")
```

## Billing

Pipeline execution is billed per page, with rates additive across processors. Each processor type has its own per-page rate. Check a pipeline's rate before running:

```python theme={null}
rate = client.get_pipeline_rate("pl_abc123")
print(f"Rate per 1000 pages: {rate['rate_per_1000_pages_cents']} cents")
print(f"Breakdown: {rate['rate_breakdown']}")
```
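
Because the rate is quoted per 1,000 pages, estimating a run's cost is a single multiplication. The sketch below assumes `rate_per_1000_pages_cents` is a plain number, as the print statement above suggests; `estimate_cost_cents` and the example per-processor rates are made up for illustration.

```python
def estimate_cost_cents(rate_per_1000_pages_cents: float, page_count: int) -> float:
    """Estimate the cost in cents for a document with `page_count` pages."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return rate_per_1000_pages_cents * page_count / 1000


# Rates are additive across processors. These per-processor numbers are
# placeholders, not real Datalab prices.
example_breakdown = {"convert": 400, "extract": 600}
total_rate = sum(example_breakdown.values())  # 1000 cents per 1000 pages

cost = estimate_cost_cents(total_rate, page_count=250)
print(f"Estimated cost: {cost:.0f} cents")  # Estimated cost: 250 cents
```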

## End-to-End Example

Create a pipeline, publish it, and run it in production:

```python theme={null}
from datalab_sdk import DatalabClient, PipelineProcessor

client = DatalabClient()

# 1. Create and save
pipeline = client.create_pipeline(steps=[
    PipelineProcessor(type="convert", settings={"mode": "balanced"}),
    PipelineProcessor(type="extract", settings={
        "page_schema": {
            "type": "object",
            "properties": {
                "vendor": {"type": "string", "description": "Vendor name"},
                "amount": {"type": "number", "description": "Total amount"},
                "date": {"type": "string", "description": "Invoice date"}
            }
        }
    })
])
pipeline = client.save_pipeline(pipeline.pipeline_id, name="Invoice Parser")

# 2. Test the draft
test_exec = client.run_pipeline(
    pipeline.pipeline_id,
    file_path="test_invoice.pdf",
    version=0
)
test_exec = client.get_pipeline_execution(test_exec.execution_id, max_polls=300)
test_result = client.get_step_result(test_exec.execution_id, step_index=1)
print(f"Test result: {test_result}")

# 3. Publish
version = client.create_pipeline_version(
    pipeline.pipeline_id,
    description="Initial release — balanced mode, basic fields"
)

# 4. Run in production (pinned to version)
execution = client.run_pipeline(
    pipeline.pipeline_id,
    file_path="real_invoice.pdf",
    version=version.version
)
execution = client.get_pipeline_execution(execution.execution_id, max_polls=300)

if execution.status == "completed":
    result = client.get_step_result(execution.execution_id, step_index=1)
    print(f"Extracted: {result}")
else:
    for step in execution.steps:
        if step.error_message:
            print(f"Step {step.step_index} failed: {step.error_message}")
```

## Next Steps

<CardGroup>
  <Card title="Pipeline Overview" icon="sitemap" href="/docs/recipes/pipelines/pipeline-overview">
    Processor types, composition rules, and when to use pipelines.
  </Card>

  <Card title="Pipeline Versioning" icon="code-branch" href="/docs/recipes/pipelines/pipeline-versioning">
    Manage drafts, versions, and production pinning.
  </Card>

  <Card title="Webhooks" icon="bell" href="/platform/webhooks">
    Configure webhook notifications for pipeline executions.
  </Card>

  <Card title="SDK Reference" icon="code" href="/docs/welcome/sdk/pipelines">
    Full SDK reference for all pipeline methods.
  </Card>
</CardGroup>


# Structured Extraction
Source: https://documentation.datalab.to/docs/recipes/structured-extraction/api-overview

Extract structured data from documents using JSON schemas.

Extract specific fields from documents by providing a JSON schema. Marker parses the document and fills in your schema with extracted values.

**Before you begin**, make sure you have:

1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits)
2. Python 3.10+ installed
3. The Datalab SDK: `pip install datalab-python-sdk`
4. Your `DATALAB_API_KEY` environment variable set

<Info>
  **Building for production?** Use [Pipelines](/docs/recipes/pipelines/pipeline-overview) to chain processors, version your configuration, and deploy with a single API call.
</Info>

## Quick Start

<CodeGroup>
  ```python Python SDK theme={null}
  import json
  from datalab_sdk import DatalabClient, ExtractOptions

  client = DatalabClient()

  schema = {
      "type": "object",
      "properties": {
          "invoice_number": {"type": "string", "description": "Invoice ID or number"},
          "total_amount": {"type": "number", "description": "Total amount due"},
          "vendor_name": {"type": "string", "description": "Company or vendor name"}
      },
      "required": ["invoice_number", "total_amount"]
  }

  options = ExtractOptions(
      page_schema=json.dumps(schema),
      mode="balanced"
  )

  result = client.extract("invoice.pdf", options=options)
  extracted = json.loads(result.extraction_schema_json)
  print(f"Invoice: {extracted['invoice_number']}")
  print(f"Total: ${extracted['total_amount']}")
  ```

  ```bash cURL theme={null}
  curl -X POST https://www.datalab.to/api/v1/extract \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -F "file=@invoice.pdf" \
    -F "mode=balanced" \
    -F 'page_schema={"type":"object","properties":{"invoice_number":{"type":"string","description":"Invoice ID"},"total_amount":{"type":"number","description":"Total due"}}}'

  # Poll request_check_url from response until status is "complete"
  ```

  ```python Python (requests) theme={null}
  import requests, json, time, os

  headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")}

  schema = {
      "type": "object",
      "properties": {
          "invoice_number": {"type": "string", "description": "Invoice ID"},
          "total_amount": {"type": "number", "description": "Total due"}
      }
  }

  with open("invoice.pdf", "rb") as f:
      response = requests.post(
          "https://www.datalab.to/api/v1/extract",
          files={"file": ("invoice.pdf", f, "application/pdf")},
          data={"page_schema": json.dumps(schema), "mode": "balanced"},
          headers=headers
      )

  check_url = response.json()["request_check_url"]

  while True:
      result = requests.get(check_url, headers=headers).json()
      if result["status"] == "complete":
          extracted = json.loads(result["extraction_schema_json"])
          print(extracted)
          break
      elif result["status"] == "failed":
          print(f"Error: {result.get('error')}")
          break
      time.sleep(2)
  ```
</CodeGroup>

## Extraction Modes

The `extraction_mode` form parameter controls how extraction runs. This is separate from `mode`, which controls document parsing quality.

| Mode                   | Description                                                                  | Price           | Latency                                   |
| ---------------------- | ---------------------------------------------------------------------------- | --------------- | ----------------------------------------- |
| **fast**               | Extraction with per-field citations                                          | \$6 / 1K pages  | Lowest                                    |
| **balanced** (default) | Extraction with independent verification, per-field reasoning, and citations | \$25 / 1K pages | Slower — trades speed for higher accuracy |

Both modes return citations for every extracted field. Balanced mode additionally returns `_meta` per field with `extraction_status`, `reasoning`, and `verification` results.

<Note>
  `balanced` is the default. Teams that made an extraction request in the 30 days before June 4, 2026 default to `fast` instead. Pass `extraction_mode` explicitly to override the default in either case.
</Note>

```bash cURL theme={null}
# Fast extraction mode
curl -X POST https://www.datalab.to/api/v1/extract \
  -H "X-API-Key: $DATALAB_API_KEY" \
  -F "file=@invoice.pdf" \
  -F 'page_schema={"type":"object","properties":{"invoice_number":{"type":"string"}}}' \
  -F "extraction_mode=fast"
```

```python theme={null}
# The SDK's ExtractOptions controls document parse mode via `mode`.
# To set extraction_mode, use the REST API directly (see the cURL example above)
# or pass it as a raw form field via requests.
options = ExtractOptions(page_schema=json.dumps(schema))  # uses your account's default extraction mode
```
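
Passing `extraction_mode` as a raw form field with `requests` could look like the sketch below. It only builds the form payload, so it runs without network access; `build_extract_form` is a hypothetical helper, and the commented call mirrors the Quick Start request.

```python
import json
from typing import Optional


def build_extract_form(schema: dict, extraction_mode: Optional[str] = None) -> dict:
    """Build the form fields for POST /api/v1/extract."""
    if extraction_mode not in (None, "fast", "balanced"):
        raise ValueError("extraction_mode must be 'fast' or 'balanced'")
    form = {"page_schema": json.dumps(schema)}
    if extraction_mode is not None:
        # Omitting the field leaves the account default in effect.
        form["extraction_mode"] = extraction_mode
    return form


schema = {"type": "object", "properties": {"invoice_number": {"type": "string"}}}
form = build_extract_form(schema, extraction_mode="fast")
print(form["extraction_mode"])  # fast

# Then send it as in the Quick Start (requires `requests` and an API key):
# requests.post("https://www.datalab.to/api/v1/extract",
#               files={"file": ("invoice.pdf", f, "application/pdf")},
#               data=form, headers=headers)
```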

See [Balanced Extraction Mode](/docs/recipes/structured-extraction/balanced-mode) for a full guide on the balanced mode response format and building workflows with verification metadata.

## Schema Format

Use JSON Schema format to define what you want to extract:

```json theme={null}
{
  "type": "object",
  "properties": {
    "field_name": {
      "type": "string",
      "description": "Describe what this field contains"
    },
    "numeric_field": {
      "type": "number",
      "description": "A numeric value"
    },
    "list_field": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "nested_field": {"type": "string"}
        }
      }
    }
  },
  "required": ["field_name"]
}
```

### Tips for Better Extraction

1. **Use descriptive field names** - `invoice_number` is clearer than `id`
2. **Add descriptions** - The `description` field helps the model understand context
3. **Specify types correctly** - Use `number` for numeric values, `string` for text
4. **Use arrays for repeating data** - Line items, table rows, etc.

<Warning>
  **Common schema pitfalls:**

  * Using vague field names like `data` or `info` — be specific (e.g., `invoice_number`, `total_amount`)
  * Forgetting `description` fields — these help the model understand what to extract
  * Setting `type: "string"` for numeric values — use `type: "number"` for amounts, quantities, etc.
  * Deeply nested schemas — keep schemas as flat as possible for better extraction accuracy
</Warning>
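
Some of these pitfalls can be caught mechanically before sending a request. The following is a rough lint sketch, not a Datalab feature: the set of vague names and the `lint_schema` helper are assumptions chosen for illustration.

```python
# Names treated as too vague; extend this set to suit your own conventions.
VAGUE_NAMES = {"data", "info", "value", "field", "id"}


def lint_schema(schema: dict, path: str = "") -> list[str]:
    """Return warnings for vague names and missing descriptions."""
    warnings: list[str] = []
    for name, spec in schema.get("properties", {}).items():
        full_name = f"{path}{name}"
        if name.lower() in VAGUE_NAMES:
            warnings.append(f"{full_name}: vague field name")
        if not spec.get("description"):
            warnings.append(f"{full_name}: missing description")
        # Recurse into nested objects and arrays of objects.
        if spec.get("type") == "object":
            warnings.extend(lint_schema(spec, f"{full_name}."))
        elif spec.get("type") == "array" and isinstance(spec.get("items"), dict):
            warnings.extend(lint_schema(spec["items"], f"{full_name}[]."))
    return warnings


schema = {
    "type": "object",
    "properties": {
        "info": {"type": "string"},
        "total_amount": {"type": "number", "description": "Total amount due"},
    },
}
for warning in lint_schema(schema):
    print(warning)
```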

## Response

The extracted data is returned in `extraction_schema_json`:

```json theme={null}
{
  "status": "complete",
  "success": true,
  "json": {...},
  "extraction_schema_json": "{\"invoice_number\": \"INV-2024-001\", \"total_amount\": 1500.00, ...}",
  "page_count": 2
}
```

### Citation Tracking

Each extracted field includes citations to the source blocks:

```json theme={null}
{
  "invoice_number": "INV-2024-001",
  "invoice_number_citations": ["block_123", "block_124"],
  "total_amount": 1500.00,
  "total_amount_citations": ["block_456"]
}
```

Use these block IDs with the `json` output to trace extracted values back to the source document.
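
Since citations sit next to the values under a `_citations` suffix, separating the two is a simple key filter. A minimal sketch based on the sample above; `split_citations` is a hypothetical helper, and it assumes none of your own field names end in `_citations`.

```python
import json

CITATION_SUFFIX = "_citations"


def split_citations(extracted: dict) -> tuple[dict, dict]:
    """Split extraction output into (values, citations keyed by field name)."""
    values, citations = {}, {}
    for key, value in extracted.items():
        if key.endswith(CITATION_SUFFIX):
            citations[key[: -len(CITATION_SUFFIX)]] = value
        else:
            values[key] = value
    return values, citations


raw = (
    '{"invoice_number": "INV-2024-001", '
    '"invoice_number_citations": ["block_123", "block_124"], '
    '"total_amount": 1500.00, "total_amount_citations": ["block_456"]}'
)
values, citations = split_citations(json.loads(raw))
print(values)     # {'invoice_number': 'INV-2024-001', 'total_amount': 1500.0}
print(citations)  # {'invoice_number': ['block_123', 'block_124'], 'total_amount': ['block_456']}
```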

## Schema Examples

### Financial Document

```python theme={null}
schema = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string", "description": "Company name"},
        "fiscal_year": {"type": "string", "description": "Fiscal year"},
        "total_revenue": {"type": "number", "description": "Total revenue in dollars"},
        "net_income": {"type": "number", "description": "Net income in dollars"},
        "eps": {"type": "number", "description": "Earnings per share"}
    }
}
```

### Scientific Paper

```python theme={null}
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "description": "Paper title"},
        "authors": {
            "type": "array",
            "items": {"type": "string"},
            "description": "List of author names"
        },
        "abstract": {"type": "string", "description": "Paper abstract"},
        "keywords": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Keywords or tags"
        }
    }
}
```

### Contract

```python theme={null}
schema = {
    "type": "object",
    "properties": {
        "parties": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "role": {"type": "string"}
                }
            }
        },
        "effective_date": {"type": "string", "description": "Contract start date"},
        "termination_date": {"type": "string", "description": "Contract end date"},
        "total_value": {"type": "number", "description": "Total contract value"}
    }
}
```

## Using Checkpoints

If you already converted a document with `save_checkpoint=True` using the [Convert API](/docs/recipes/conversion/conversion-api-overview), pass the `checkpoint_id` to `ExtractOptions` to skip re-parsing. This saves time and cost when running extraction on a previously converted document.

```python theme={null}
from datalab_sdk import DatalabClient, ConvertOptions, ExtractOptions
import json

client = DatalabClient()

# Step 1: Convert and save checkpoint
convert_result = client.convert("invoice.pdf", options=ConvertOptions(save_checkpoint=True))
checkpoint_id = convert_result.checkpoint_id

# Step 2: Extract using checkpoint (no re-parsing needed)
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice ID"},
        "total_amount": {"type": "number", "description": "Total due"}
    }
}

options = ExtractOptions(
    page_schema=json.dumps(schema),
    checkpoint_id=checkpoint_id
)
result = client.extract("invoice.pdf", options=options)
extracted = json.loads(result.extraction_schema_json)
```

The extract endpoint accepts the following parameters: `file`, `file_url`, `page_schema` or `schema_id` (one is required), `schema_version`, `mode`, `extraction_mode`, `max_pages`, `page_range`, `save_checkpoint`, `checkpoint_id`, `webhook_url`, and `processing_location` (for example `"eu"`, which routes processing and storage to EU infrastructure and requires `file_url` or a pre-uploaded `datalab://` reference instead of a multipart upload).

### Using Saved Schemas

Instead of passing `page_schema` inline, you can save schemas to Datalab and reference them by ID. This avoids repeating the schema in every request and enables versioning.

```bash theme={null}
curl -X POST https://www.datalab.to/api/v1/extract \
  -H "X-API-Key: $DATALAB_API_KEY" \
  -F "file=@invoice.pdf" \
  -F "schema_id=sch_k8Hx9mP2nQ4v"
```

Pass `schema_version` to pin to a specific schema version; omit it to always use the latest. See [Saved Schemas](/docs/recipes/structured-extraction/saved-schemas) for full CRUD API reference.

## Confidence Scoring

<Note>
  **Extraction scoring is in beta.**

  We'd love your feedback — reach out at [support@datalab.to](mailto:support@datalab.to).

  Scoring is free.
</Note>

Scoring runs automatically after every extraction. When you poll `request_check_url`, the response initially contains just the extracted fields and citations. If you continue polling the same URL, the response will eventually include `_score` fields and an `extraction_score_average` once scoring completes. No extra parameters or endpoints are needed.

Each `_score` field is a `{"score": int, "reasoning": str}` object explaining what evidence was found or missing.

### Score response format

Before scoring completes, `extraction_schema_json` contains only fields and citations:

```json theme={null}
{
  "invoice_number": "INV-2024-001",
  "invoice_number_citations": ["block_123"],
  "total_amount": 1500.00,
  "total_amount_citations": ["block_456"]
}
```

Once scoring finishes, each field also gets a `_score` object:

```json theme={null}
{
  "invoice_number": "INV-2024-001",
  "invoice_number_citations": ["block_123"],
  "invoice_number_score": {
    "score": 5,
    "reasoning": "Value found verbatim in the document header with a matching citation."
  },
  "total_amount": 1500.00,
  "total_amount_citations": ["block_456"],
  "total_amount_score": {
    "score": 4,
    "reasoning": "Amount found in the totals row; minor ambiguity due to a subtotal nearby."
  }
}
```

The top-level response also includes `extraction_score_average` (4.5 in this case), averaging all field scores.

**Score rubric:**

| Score | Meaning                                                    |
| ----- | ---------------------------------------------------------- |
| 5     | High confidence — clear match with strong citation support |
| 4     | Good confidence — match found with minor ambiguity         |
| 3     | Moderate confidence — partial match or uncertain citation  |
| 2     | Low confidence — match is inferred or weakly supported     |
| 1     | Very low confidence — no clear evidence found              |

See [Confidence Scoring](/docs/recipes/structured-extraction/confidence-scoring) for a full walkthrough with code examples.
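
One way to use the scores is to route low-confidence fields to manual review. The sketch below works on a parsed `extraction_schema_json` shaped like the example above; `field_scores`, `low_confidence_fields`, and the default threshold of 3 are illustrative choices, not API features.

```python
SCORE_SUFFIX = "_score"


def field_scores(extracted: dict) -> dict[str, int]:
    """Map each field name to its integer score."""
    return {
        key[: -len(SCORE_SUFFIX)]: value["score"]
        for key, value in extracted.items()
        if key.endswith(SCORE_SUFFIX) and isinstance(value, dict) and "score" in value
    }


def low_confidence_fields(extracted: dict, threshold: int = 3) -> list[str]:
    """Return field names whose score is at or below `threshold`."""
    return [name for name, score in field_scores(extracted).items() if score <= threshold]


extracted = {
    "invoice_number": "INV-2024-001",
    "invoice_number_score": {"score": 5, "reasoning": "Found verbatim."},
    "total_amount": 1500.00,
    "total_amount_score": {"score": 4, "reasoning": "Minor ambiguity."},
}
scores = field_scores(extracted)
print(sum(scores.values()) / len(scores))             # 4.5
print(low_confidence_fields(extracted, threshold=4))  # ['total_amount']
```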

## Auto-Generate Schemas

Don't want to write schemas by hand? Use the schema generation endpoint to automatically suggest schemas for your document. This requires a checkpoint from a previous conversion:

```python theme={null}
import os, requests, json, time

headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")}

# Step 1: Convert with checkpoint
with open("invoice.pdf", "rb") as f:
    resp = requests.post(
        "https://www.datalab.to/api/v1/convert",
        files={"file": ("invoice.pdf", f, "application/pdf")},
        data={"save_checkpoint": "true", "output_format": "markdown"},
        headers=headers
    )
check_url = resp.json()["request_check_url"]

# Poll until complete
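while True:
    result = requests.get(check_url, headers=headers).json()
    if result["status"] == "complete":
        checkpoint_id = result["checkpoint_id"]
        break
    if result["status"] == "failed":
        # Stop instead of looping forever if the conversion fails
        raise RuntimeError(f"Conversion failed: {result.get('error')}")
    time.sleep(2)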

# Step 2: Generate schemas
resp = requests.post(
    "https://www.datalab.to/api/v1/marker/extraction/gen_schemas",
    json={"checkpoint_id": checkpoint_id},
    headers=headers
)
gen_check_url = resp.json()["request_check_url"]

while True:
    result = requests.get(gen_check_url, headers=headers).json()
    if result["status"] == "complete":
        suggestions = result["suggestions"]
        print("Simple schema:", suggestions["simple_schema"])
        print("Moderate schema:", suggestions["moderate_schema"])
        print("Complex schema:", suggestions["complex_schema"])
        break
    if result["status"] == "failed":
        # Stop instead of looping forever if schema generation fails
        raise RuntimeError(f"Schema generation failed: {result.get('error')}")
    time.sleep(2)
```

The endpoint returns three schema options at different complexity levels — use the one that best matches your needs, then customize it.

## Using Forge Playground

Create and test schemas visually in [Forge Playground](https://www.datalab.to/app/playground):

1. Upload a sample document
2. Define fields in the visual editor
3. Switch to JSON Editor to copy the schema
4. Test extraction before deploying

## Next Steps

<CardGroup>
  <Card title="Balanced Extraction Mode" icon="shield-check" href="/docs/recipes/structured-extraction/balanced-mode">
    Per-field verification, reasoning, and extraction status for compliance workflows
  </Card>

  <Card title="Saved Schemas" icon="bookmark" href="/docs/recipes/structured-extraction/saved-schemas">
    Create reusable schemas and reference them by ID — no need to repeat the schema in each request
  </Card>

  <Card title="Confidence Scoring" icon="chart-bar" href="/docs/recipes/structured-extraction/confidence-scoring">
    Score extraction results with per-field confidence ratings
  </Card>

  <Card title="Handling Long Documents" icon="file-lines" href="/docs/recipes/structured-extraction/handling-long-documents">
    Strategies for extracting from 100+ page documents
  </Card>
</CardGroup>


# Balanced Extraction Mode
Source: https://documentation.datalab.to/docs/recipes/structured-extraction/balanced-mode

Extraction with per-field verification, reasoning, and citations.

Balanced mode runs an extraction pipeline with independent verification. Every extracted field includes an audit trail: where the value came from, how it was derived, and whether an independent check confirmed it.

**Before you begin**, make sure you have:

1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits)
2. Python 3.10+ installed
3. The Datalab SDK: `pip install datalab-python-sdk`
4. Your `DATALAB_API_KEY` environment variable set

## When to Use Fast vs Balanced

|                              | Fast                                                    | Balanced (default)                                                                         |
| ---------------------------- | ------------------------------------------------------- | ------------------------------------------------------------------------------------------ |
| **Price**                    | \$6 / 1K pages                                          | \$25 / 1K pages                                                                            |
| **Latency**                  | Lowest                                                  | Slower — trades speed for accuracy via independent verification                            |
| **Per-field citations**      | Yes                                                     | Yes                                                                                        |
| **Extraction status**        | No                                                      | Yes (EXTRACTED / NOT\_RESOLVABLE)                                                          |
| **Per-field reasoning**      | No                                                      | Yes                                                                                        |
| **Independent verification** | No                                                      | Yes (PASS / FAIL)                                                                          |
| **Best for**                 | High-volume workflows: invoices, forms, bank statements | Compliance, financial, legal, and medical workflows where every field needs an audit trail |

Use **fast** when speed and cost matter most. Use **balanced** when you need to trust every field and want metadata to power downstream decisions.
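
The price difference is easy to quantify for a given volume. A worked example using the per-1K-page prices from the table and assuming the listed price is the only extraction charge; the 40,000-page monthly volume is hypothetical.

```python
# Prices from the comparison table, in dollars per 1,000 pages.
PRICE_PER_1K = {"fast": 6.0, "balanced": 25.0}


def extraction_cost(pages: int, mode: str) -> float:
    """Return the extraction cost in dollars for `pages` pages."""
    return PRICE_PER_1K[mode] * pages / 1000


pages = 40_000  # hypothetical monthly volume
fast = extraction_cost(pages, "fast")
balanced = extraction_cost(pages, "balanced")
print(f"fast: ${fast:.2f}, balanced: ${balanced:.2f}, difference: ${balanced - fast:.2f}")
# fast: $240.00, balanced: $1000.00, difference: $760.00
```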

## Schema size on short documents

For shorter documents (**under 20 pages**), balanced mode limits how large your schema can be. Documents of **20+ pages have no schema-size limit**.

If a schema is too large for a short document, the request fails with a clear error telling you your field count and your options — and **you aren't charged**.

### Count your fields

A **field** is one value you get back — a string, number, date, true/false, or one choice from a fixed list. **Objects and lists are containers, not fields** — count the fields inside them. A list of repeated items counts its fields **once**, no matter how many rows the document has.

**4 fields:**

```json theme={null}
{ "invoice_number": "string", "invoice_date": "string",
  "total_amount": "number", "currency": "string" }
```

**5 fields** — the object is a container, so count what's inside it:

```json theme={null}
{ "vendor": { "name": "string", "address": "string" },
  "total": "number", "due_date": "string", "paid": "boolean" }
```

**4 fields** — a list's columns count once, not once per row:

```json theme={null}
{ "invoice_number": "string",
  "line_items": [ { "description": "string", "quantity": "number", "unit_price": "number" } ] }
```
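The counting rule above can be sketched as a small recursive function. This is an illustrative helper, not part of the Datalab SDK: `count_fields` is a hypothetical name, it assumes the schema is written as JSON Schema (`type`, `properties`, `items`), and it treats a list of plain values as a single field. The server's own count may differ in edge cases.

```python
def count_fields(schema: dict) -> int:
    """Count leaf fields in a JSON Schema, following the rule described above."""
    schema_type = schema.get("type")
    if schema_type == "object":
        # Objects are containers: count the fields inside them.
        return sum(count_fields(prop) for prop in schema.get("properties", {}).values())
    if schema_type == "array":
        # A list counts its item fields once, regardless of how many rows exist.
        return count_fields(schema.get("items", {}))
    # Strings, numbers, booleans, and enums are one field each.
    return 1


invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "unit_price": {"type": "number"},
                },
            },
        },
    },
}

print(count_fields(invoice_schema))  # 4
```

Running a check like this before submitting can tell you roughly how close a schema is to the comfortable range for short documents.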

### How many fields can I use?

About **25 fields** is a comfortable limit for any schema on a short document. Larger schemas often work too — especially flat ones without deep nesting — but the more fields you add, and the more deeply they're nested (lists of objects several levels down), the more likely you are to reach the limit.

You don't have to guess: if a schema is too large for a short document, the request fails with a clear error (and you aren't charged), so it's safe to try a larger one.

If you need a bigger schema on a short document:

1. **Split it into multiple extractions** — for example, header fields in one request and a large list in another.
2. **Use `fast` mode** — it supports larger schemas and costs less, without the per-field verification metadata.
3. **Trim and flatten** — drop fields you don't use and reduce nesting.

<Note>This applies only to balanced mode on documents under 20 pages. Documents of 20+ pages support schemas of any size.</Note>
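Splitting a schema into multiple extractions can be done mechanically. The sketch below is one possible approach, not a Datalab feature: `split_schema` and `max_properties` are hypothetical names, and it groups by top-level property only, so a single large nested list still ends up in one part.

```python
def split_schema(schema: dict, max_properties: int) -> list:
    """Split a JSON Schema's top-level properties into smaller schemas."""
    if max_properties < 1:
        raise ValueError("max_properties must be at least 1")

    properties = schema.get("properties", {})
    required = set(schema.get("required", []))
    names = list(properties)

    parts = []
    for start in range(0, len(names), max_properties):
        chunk = names[start:start + max_properties]
        part = {
            "type": "object",
            "properties": {name: properties[name] for name in chunk},
        }
        # Keep only the required names that belong to this part.
        chunk_required = [name for name in chunk if name in required]
        if chunk_required:
            part["required"] = chunk_required
        parts.append(part)
    return parts


schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "invoice_date": {"type": "string"},
        "total_amount": {"type": "number"},
        "currency": {"type": "string"},
        "vendor_name": {"type": "string"},
    },
    "required": ["invoice_number", "total_amount"],
}

parts = split_schema(schema, max_properties=2)
print([list(part["properties"]) for part in parts])
```

Each part can then be sent as its own `page_schema`, and the resulting dictionaries merged, since the property names do not overlap.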

## Quick Start

<CodeGroup>
  ```python Python SDK theme={null}
  import json
  from datalab_sdk import DatalabClient, ExtractOptions

  client = DatalabClient()

  schema = {
      "type": "object",
      "properties": {
          "company_name": {"type": "string", "description": "Full legal name of the company"},
          "fiscal_year_end": {"type": "string", "description": "End date of the fiscal year (YYYY-MM-DD)"},
          "total_revenue": {"type": "number", "description": "Total revenue in the reporting currency"},
          "auditor_name": {"type": "string", "description": "Name of the external audit firm"}
      },
      "required": ["company_name", "fiscal_year_end"]
  }

  options = ExtractOptions(
      page_schema=json.dumps(schema),
  )

  result = client.extract("annual_report.pdf", options=options)
  extracted = json.loads(result.extraction_schema_json)

  # Each field comes with citations; _meta is present only in balanced mode
  print(f"Company: {extracted['company_name']}")
  print(f"Citations: {extracted['company_name_citations']}")
  meta = extracted.get("company_name_meta")
  if meta:
      print(f"Status: {meta['extraction_status']}")
      print(f"Verified: {meta['verification']['status']}")
  ```

  ```bash cURL theme={null}
  curl -X POST https://www.datalab.to/api/v1/extract \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -F "file=@annual_report.pdf" \
    -F "extraction_mode=balanced" \
    -F 'page_schema={"type":"object","properties":{"company_name":{"type":"string","description":"Full legal name"},"total_revenue":{"type":"number","description":"Total revenue"}}}'

  # Poll request_check_url from response until status is "complete"
  ```

  ```python Python (requests) theme={null}
  import requests, json, time, os

  headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")}

  schema = {
      "type": "object",
      "properties": {
          "company_name": {"type": "string", "description": "Full legal name"},
          "total_revenue": {"type": "number", "description": "Total revenue"}
      }
  }

  with open("annual_report.pdf", "rb") as f:
      response = requests.post(
          "https://www.datalab.to/api/v1/extract",
          files={"file": ("annual_report.pdf", f, "application/pdf")},
          data={
              "page_schema": json.dumps(schema),
              "extraction_mode": "balanced"
          },
          headers=headers
      )

  check_url = response.json()["request_check_url"]

  while True:
      result = requests.get(check_url, headers=headers).json()
      if result["status"] == "complete":
          extracted = json.loads(result["extraction_schema_json"])
          print(json.dumps(extracted, indent=2))
          break
      elif result["status"] == "failed":
          print(f"Error: {result.get('error')}")
          break
      time.sleep(2)
  ```
</CodeGroup>

<Note>
  `extraction_mode` controls the extraction pipeline (`fast` or `balanced`). This is separate from `mode`, which controls the document parsing stage (`fast`, `balanced`, or `accurate`). You can combine them independently — for example, `mode="fast"` with `extraction_mode="balanced"`.

  `extraction_mode` is not yet exposed in the Python SDK's `ExtractOptions`. To set it explicitly, use the cURL or Python requests examples above. When omitted, the team's configured default applies (see the 6/4/2026 entry in the [Changelog](/platform/changelog) for the default rules).
</Note>

## Response Format

In balanced mode, each extracted field is returned as three sibling keys: the value itself, `<field>_citations`, and `<field>_meta`. The `_citations` key uses the same format as fast mode for compatibility; balanced mode adds `_meta` with richer metadata on top:

```json theme={null}
{
  "company_name": "Whitbread PLC",
  "company_name_citations": ["/page/0/Text/3", "/page/2/Table/1"],
  "company_name_meta": {
    "extraction_status": "EXTRACTED",
    "reasoning": "The company name 'Whitbread PLC' appears in the document header on the cover page (/page/0/Text/3) and is confirmed in the directors' report (/page/2/Table/1).",
    "citations": ["/page/0/Text/3", "/page/2/Table/1"],
    "verification": {
      "status": "PASS",
      "feedback": "The company name 'Whitbread PLC' is printed on the cover page (/page/0/Text/3) and confirmed in the directors' report. No conflicting name appears in the document."
    }
  }
}
```

The `_citations` key is shared with fast mode — if you switch between modes, citation-consuming code continues to work. The `_meta` key is balanced-mode-only and contains the full audit trail.
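Because the siblings are flat keys on one object, it can be convenient to regroup them per field before further processing. The following is a minimal sketch; `group_fields` is a hypothetical helper, and it assumes a key counts as a sibling only when the field it refers to is also present.

```python
def group_fields(extracted: dict) -> dict:
    """Group each value with its `_citations` and `_meta` siblings."""
    suffixes = ("_citations", "_meta")

    def is_sibling(key: str) -> bool:
        # Treat a key as a sibling only if its base field exists too.
        return any(
            key.endswith(suffix) and key[: -len(suffix)] in extracted
            for suffix in suffixes
        )

    return {
        key: {
            "value": value,
            "citations": extracted.get(f"{key}_citations", []),
            "meta": extracted.get(f"{key}_meta"),  # None when _meta is absent (fast mode)
        }
        for key, value in extracted.items()
        if not is_sibling(key)
    }


extracted = {
    "company_name": "Whitbread PLC",
    "company_name_citations": ["/page/0/Text/3", "/page/2/Table/1"],
    "company_name_meta": {
        "extraction_status": "EXTRACTED",
        "verification": {"status": "PASS"},
    },
}

grouped = group_fields(extracted)
print(grouped["company_name"]["meta"]["verification"]["status"])  # PASS
```

The same helper works on fast-mode output, where `meta` is simply `None` for every field.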

### Field Metadata

Each `_meta` object contains:

| Field               | Description                                                                                                                           |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `extraction_status` | How the value was produced: `EXTRACTED` (value found in the document) or `NOT_RESOLVABLE` (document doesn't contain this information) |
| `reasoning`         | Audit-ready prose explaining how the value was produced, with block ID citations                                                      |
| `citations`         | Block IDs from the source document that support the value                                                                             |
| `verification`      | Independent verification result with `status` and `feedback`                                                                          |

### Extraction Status

| Status           | Meaning                                             | Value               |
| ---------------- | --------------------------------------------------- | ------------------- |
| `EXTRACTED`      | The value was found in or derived from the document | The extracted value |
| `NOT_RESOLVABLE` | The document does not contain or imply this value   | `null`              |

### Verification Status

| Status              | Meaning                                                                                          |
| ------------------- | ------------------------------------------------------------------------------------------------ |
| `PASS`              | The value and citations were independently confirmed against the source document                 |
| `FAIL_UNRESOLVABLE` | The document does not support a value for this field                                             |
| `FAIL_FIX`          | The value was flagged as incorrect during verification — the document supports a different value |
| `FAIL_CITATIONS`    | The value is correct but the citations are wrong or insufficient                                 |
| `ITEMS_MISSING`     | (List fields only) The document contains entries that are not present in the extraction          |

In practice, most fields will be `PASS` or `FAIL_UNRESOLVABLE` after verification. The other statuses indicate cases where the verifier flagged an issue that could not be fully resolved automatically.
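One way to act on these statuses is to map each one to a follow-up step. The mapping below is only a sketch: the action names (`auto_approve`, `correct_value`, and so on) are hypothetical labels for your own review queue, not values returned by the API.

```python
# Hypothetical follow-up actions, keyed by the verification statuses listed above.
REVIEW_ACTIONS = {
    "PASS": "auto_approve",
    "FAIL_UNRESOLVABLE": "confirm_value_is_absent",
    "FAIL_FIX": "correct_value",
    "FAIL_CITATIONS": "check_citations",
    "ITEMS_MISSING": "add_missing_items",
}


def review_action(meta: dict) -> str:
    """Return the follow-up action for one field's `_meta` object."""
    status = (meta.get("verification") or {}).get("status")
    # Unknown or missing statuses fall back to a manual check.
    return REVIEW_ACTIONS.get(status, "manual_review")


print(review_action({"verification": {"status": "FAIL_CITATIONS"}}))  # check_citations
print(review_action({}))  # manual_review
```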

## Building Workflows with Verification Metadata

The per-field metadata enables automated quality gates:

```python theme={null}
import json

extracted = json.loads(result.extraction_schema_json)

# Separate fields by verification status
auto_approved = []
needs_review = []

# Walk all fields and check their _meta
for key, value in extracted.items():
    if key.endswith("_meta"):
        field_name = key.removesuffix("_meta")
        meta = value
        verification = meta.get("verification", {})

        if verification.get("status") == "PASS":
            auto_approved.append(field_name)
        else:
            needs_review.append({
                "field": field_name,
                "extraction_status": meta.get("extraction_status"),
                "reasoning": meta.get("reasoning"),
                "verification_feedback": verification.get("feedback"),
            })

print(f"Auto-approved: {len(auto_approved)} fields")
print(f"Needs review: {len(needs_review)} fields")

# Route to human review queue
for item in needs_review:
    print(f"  {item['field']}: {item['extraction_status']}")
    print(f"    Reason: {(item['reasoning'] or '')[:100]}...")  # reasoning may be missing
```

### Common Workflow Patterns

* **Auto-approve** when all fields have `verification.status == "PASS"` — no human review needed
* **Flag for review** when any field is `NOT_RESOLVABLE` or has a verification status other than `PASS` (`FAIL_*` or `ITEMS_MISSING`): the document may be missing information or the extraction needs a human check
* **Show citations** to reviewers so they can verify in seconds — each field links back to specific blocks in the document
* **Use reasoning as an audit trail** — for compliance workflows, the per-field reasoning documents exactly how each value was produced, with block-level citations back to the source document

## Next Steps

<CardGroup>
  <Card title="Structured Extraction Overview" icon="table" href="/docs/recipes/structured-extraction/api-overview">
    Schema format, response structure, and extraction tips
  </Card>

  <Card title="Confidence Scoring" icon="chart-bar" href="/docs/recipes/structured-extraction/confidence-scoring">
    Additional per-field confidence scores (works with both modes)
  </Card>

  <Card title="Saved Schemas" icon="bookmark" href="/docs/recipes/structured-extraction/saved-schemas">
    Save and version schemas for reuse across requests
  </Card>

  <Card title="Handling Long Documents" icon="file-lines" href="/docs/recipes/structured-extraction/handling-long-documents">
    Tips for extracting from 100+ page documents
  </Card>
</CardGroup>


# Extraction Confidence Scoring
Source: https://documentation.datalab.to/docs/recipes/structured-extraction/confidence-scoring

Score extraction results with per-field confidence ratings and reasoning.

Score your structured extraction results to get per-field confidence ratings (1–5) with reasoning that explains what evidence was found or missing.

<Note>
  **Extraction scoring is in beta.**

  We'd love your feedback — reach out at [support@datalab.to](mailto:support@datalab.to).

  Scoring is free.
</Note>

**Before you begin**, make sure you have:

1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits)
2. Python 3.10+ installed
3. The Datalab SDK: `pip install datalab-python-sdk`
4. Your `DATALAB_API_KEY` environment variable set

## How It Works

<Warning>
  **Confidence scoring runs in `fast` extraction mode only**, and `extraction_mode` defaults to `balanced`. To receive scores you must request `fast` mode explicitly: `extraction_mode="fast"`. In `balanced` and `turbo` modes, `_score` fields and `extraction_score_average` are **never** returned no matter how long you poll (see the note below).
</Warning>

When you run extraction with `extraction_mode="fast"`, scoring runs automatically afterward. When you poll `request_check_url`, the extraction result initially contains just the extracted fields and citations. If you continue polling the same URL, the response will eventually include `_score` fields and an `extraction_score_average` once scoring completes (typically within a minute of `status` becoming `complete`).

Each scored field receives:

* A **score** from 1 (very low confidence) to 5 (high confidence)
* A **reasoning** string explaining what evidence supports or undermines the extracted value

Beyond setting `extraction_mode="fast"`, no extra parameters or endpoints are needed — just keep polling until scores appear.

<Info>
  **Using balanced extraction mode?** Balanced mode does **not** produce `_score` fields or `extraction_score_average`. Instead it includes its own per-field verification (`_meta.verification`, with a `status` of `PASS`/`FAIL_*` and `feedback`) that runs inline as part of the extraction pipeline — a richer, different signal than the numeric confidence scores described here. The two mechanisms are mutually exclusive: use `fast` mode for numeric `_score`s, or `balanced` mode for inline verification. See [Balanced Mode](/docs/recipes/structured-extraction/balanced-mode).
</Info>

## Example

<CodeGroup>
  ```python Python (requests) theme={null}
  import requests, json, time, os

  headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")}

  schema = {
      "type": "object",
      "properties": {
          "invoice_number": {"type": "string", "description": "Invoice ID or number"},
          "total_amount": {"type": "number", "description": "Total amount due"},
          "vendor_name": {"type": "string", "description": "Vendor or company name"}
      },
      "required": ["invoice_number", "total_amount"]
  }

  with open("invoice.pdf", "rb") as f:
      resp = requests.post(
          "https://www.datalab.to/api/v1/extract",
          files={"file": ("invoice.pdf", f, "application/pdf")},
          data={
              "page_schema": json.dumps(schema),
              "extraction_mode": "fast"  # scoring runs in fast mode only
          },
          headers=headers
      )
  check_url = resp.json()["request_check_url"]

  # Poll until extraction is complete (stop early if the request fails)
  while True:
      result = requests.get(check_url, headers=headers).json()
      if result["status"] == "complete":
          extracted = json.loads(result["extraction_schema_json"])
          print("Extraction:", extracted)
          break
      elif result["status"] == "failed":
          raise SystemExit(f"Error: {result.get('error')}")
      time.sleep(2)

  # Scores are enriched asynchronously after completion. Keep polling the same
  # URL until extraction_score_average appears (bounded so we don't loop forever).
  for _ in range(30):
      if "extraction_score_average" in result:
          break
      time.sleep(2)
      result = requests.get(check_url, headers=headers).json()

  scored = json.loads(result["extraction_schema_json"])
  for key, value in scored.items():
      if key.endswith("_score"):
          field = key.removesuffix("_score")  # strip only the trailing suffix
          print(f"{field}: score={value['score']}, reasoning={value['reasoning']}")
  ```

  ```bash cURL theme={null}
  curl -X POST https://www.datalab.to/api/v1/extract \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -F "file=@invoice.pdf" \
    -F 'page_schema={"type":"object","properties":{"invoice_number":{"type":"string","description":"Invoice ID"},"total_amount":{"type":"number","description":"Total due"},"vendor_name":{"type":"string","description":"Vendor or company name"}}}' \
    -F "extraction_mode=fast"

  # Poll request_check_url until status is "complete" for extraction results.
  # Keep polling the same URL — scores (_score fields + extraction_score_average)
  # appear once scoring finishes. Scoring runs in fast mode only.
  ```
</CodeGroup>
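The two polling phases (wait for `complete`, then wait for scores) can be folded into one bounded helper. This is a sketch under stated assumptions: `poll_until` is a hypothetical function, and the `fetch` callable is injected so the logic can be exercised with canned responses instead of real API calls.

```python
import time
from typing import Callable


def poll_until(
    fetch: Callable[[], dict],
    is_done: Callable[[dict], bool],
    interval: float = 2.0,
    max_attempts: int = 30,
) -> dict:
    """Call `fetch` until `is_done(result)` is true, the request fails, or attempts run out."""
    result = fetch()
    for _ in range(max_attempts - 1):
        if is_done(result) or result.get("status") == "failed":
            break
        time.sleep(interval)
        result = fetch()
    # The caller should still check the returned result: it may be unfinished.
    return result


# Canned responses standing in for successive GETs of request_check_url.
responses = iter([
    {"status": "processing"},
    {"status": "complete"},
    {"status": "complete", "extraction_score_average": 4.5},
])

result = poll_until(
    fetch=lambda: next(responses),
    is_done=lambda r: "extraction_score_average" in r,
    interval=0,
)
print(result["extraction_score_average"])  # 4.5
```

With `requests`, `fetch` would be something like `lambda: requests.get(check_url, headers=headers).json()`.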

## Response Format

Without scoring, `extraction_schema_json` contains fields and citations:

```json theme={null}
{
  "invoice_number": "INV-2024-001",
  "invoice_number_citations": ["block_123"],
  "total_amount": 1500.00,
  "total_amount_citations": ["block_456"]
}
```

With scoring, each field also gets a `_score` object, and the top-level response includes an `extraction_score_average`:

```json theme={null}
{
  "invoice_number": "INV-2024-001",
  "invoice_number_citations": ["block_123"],
  "invoice_number_score": {
    "score": 5,
    "reasoning": "Value found verbatim in the document header with a matching citation."
  },
  "total_amount": 1500.00,
  "total_amount_citations": ["block_456"],
  "total_amount_score": {
    "score": 4,
    "reasoning": "Amount found in the totals row; minor ambiguity due to a subtotal nearby."
  }
}
```

In this example, `extraction_score_average` is 4.5: the mean of the two field scores (5 and 4).
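To sanity-check the arithmetic, the average can be recomputed from the `_score` objects. This sketch assumes a plain arithmetic mean over every scored field; `average_score` is a hypothetical helper, and the API's own rounding may differ.

```python
from typing import Optional


def average_score(extracted: dict) -> Optional[float]:
    """Return the mean of all `_score` values, or None if nothing is scored."""
    scores = [
        value["score"]
        for key, value in extracted.items()
        if key.endswith("_score") and isinstance(value, dict) and "score" in value
    ]
    return sum(scores) / len(scores) if scores else None


extracted = {
    "invoice_number": "INV-2024-001",
    "invoice_number_score": {"score": 5, "reasoning": "..."},
    "total_amount": 1500.00,
    "total_amount_score": {"score": 4, "reasoning": "..."},
}

print(average_score(extracted))  # 4.5
```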

### Score Rubric

| Score | Meaning                                                    |
| ----- | ---------------------------------------------------------- |
| 5     | High confidence — clear match with strong citation support |
| 4     | Good confidence — match found with minor ambiguity         |
| 3     | Moderate confidence — partial match or uncertain citation  |
| 2     | Low confidence — match is inferred or weakly supported     |
| 1     | Very low confidence — no clear evidence found              |

## Using Scores in Practice

Use `extraction_score_average` for a quick quality check, then inspect individual `_score` fields to flag low-confidence results:

```python theme={null}
import json

# After scoring has completed (see the polling example above)
avg = result["extraction_score_average"]
print(f"Average score: {avg}")

scored = json.loads(result["extraction_schema_json"])
for key, value in scored.items():
    if not key.endswith("_score"):
        continue

    field = key.removesuffix("_score")  # strip only the trailing suffix
    if value["score"] <= 2:
        print(f"Low confidence for '{field}': {value['reasoning']}")
    elif value["score"] >= 4:
        print(f"'{field}' = {scored[field]}")
```

This is useful for building review workflows — auto-accept high-confidence fields and route low-confidence ones to a human reviewer.
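A review workflow usually needs a decision for every score, including the middle of the rubric. The sketch below buckets fields into three groups; the bucket names and the default thresholds (`accept_at=4`, `review_at=2`) are assumptions chosen for illustration, not recommendations from Datalab.

```python
def partition_by_score(extracted: dict, accept_at: int = 4, review_at: int = 2) -> dict:
    """Bucket scored fields into accept / spot_check / review lists."""
    buckets = {"accept": [], "spot_check": [], "review": []}
    for key, value in extracted.items():
        # Skip anything that is not a `_score` object (for example a real
        # schema field that happens to end in "_score").
        if not key.endswith("_score") or not isinstance(value, dict):
            continue
        field = key[: -len("_score")]
        if value["score"] >= accept_at:
            buckets["accept"].append(field)
        elif value["score"] <= review_at:
            buckets["review"].append(field)
        else:
            buckets["spot_check"].append(field)
    return buckets


scored = {
    "invoice_number": "INV-2024-001",
    "invoice_number_score": {"score": 5, "reasoning": "..."},
    "total_amount": 1500.00,
    "total_amount_score": {"score": 3, "reasoning": "..."},
    "vendor_name": None,
    "vendor_name_score": {"score": 1, "reasoning": "..."},
}

print(partition_by_score(scored))
```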

## Next Steps

<CardGroup>
  <Card title="Structured Extraction" icon="table" href="/docs/recipes/structured-extraction/api-overview">
    Full extraction API reference and schema examples
  </Card>

  <Card title="Handling Long Documents" icon="file-lines" href="/docs/recipes/structured-extraction/handling-long-documents">
    Strategies for extracting from 100+ page documents
  </Card>

  <Card title="Pipelines" icon="workflow" href="/docs/recipes/pipelines/pipeline-overview">
    Chain processors into versioned, reusable pipelines.
  </Card>

  <Card title="Document Conversion" icon="file-text" href="/docs/recipes/conversion/conversion-api-overview">
    Convert documents to various formats
  </Card>
</CardGroup>


# Handling Long Documents
Source: https://documentation.datalab.to/docs/recipes/structured-extraction/handling-long-documents

Tips for structured extraction on documents with 50+ pages.

For long documents, use page ranges and document segmentation to improve speed and accuracy.

## Restrict to Specific Pages

If you know which pages contain the data you need, use `page_range`:

```python theme={null}
from datalab_sdk import DatalabClient, ConvertOptions
import json

client = DatalabClient()

schema = {
    "type": "object",
    "properties": {
        "executive_summary": {"type": "string", "description": "Executive summary text"}
    }
}

# Only process pages 0-5 (first 6 pages)
options = ConvertOptions(
    page_schema=json.dumps(schema),
    page_range="0-5",
    mode="balanced"
)

result = client.convert("long_document.pdf", options=options)
```

You're only charged for the pages you process.
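To estimate how many pages a request will process, you can count the pages in a `page_range` string before sending it. This is a rough sketch: `count_pages` is a hypothetical helper, and it assumes `page_range` accepts comma-separated pages and inclusive ranges (only the `"0-5"` form is shown in this guide).

```python
def count_pages(page_range: str) -> int:
    """Count distinct pages in a range string such as "0-5" or "0,5-10"."""
    pages = set()
    for part in page_range.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(piece) for piece in part.split("-", 1))
            if end < start:
                raise ValueError(f"Invalid range: {part!r}")
            # Ranges are inclusive on both ends.
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return len(pages)


print(count_pages("0-5"))  # 6
```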

## Segment and Chain Extractions

For documents with distinct sections (like financial reports or contracts), extract the table of contents first, then process each section separately.

### Step 1: Extract Table of Contents

```python theme={null}
import json
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

toc_schema = {
    "type": "object",
    "properties": {
        "table_of_contents": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "section_name": {"type": "string"},
                    "page_number": {"type": "number"}
                }
            }
        }
    }
}

# Extract TOC from first few pages
options = ConvertOptions(
    page_schema=json.dumps(toc_schema),
    page_range="0-5",
    mode="balanced"
)

result = client.convert("report.pdf", options=options)
toc = json.loads(result.extraction_schema_json)

print("Sections found:")
for item in toc["table_of_contents"]:
    print(f"  {item['section_name']}: page {item['page_number']}")
```

### Step 2: Extract Each Section

```python theme={null}
# Define schemas for different sections
section_schemas = {
    "Financial Highlights": {
        "type": "object",
        "properties": {
            "revenue": {"type": "number"},
            "net_income": {"type": "number"},
            "eps": {"type": "number"}
        }
    },
    "Risk Factors": {
        "type": "object",
        "properties": {
            "risks": {
                "type": "array",
                "items": {"type": "string"}
            }
        }
    }
}

# Build page ranges from TOC
sections = toc["table_of_contents"]
results = {}

for i, section in enumerate(sections):
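    section_name = section["section_name"]
    # page_number is typed as "number", so cast to int before building a range.
    # page_range is 0-indexed; adjust if the TOC lists printed (1-based) page numbers.
    start_page = int(section["page_number"])

    # End page is the page before the next section starts.
    # The last section has no known end, so only its start page is processed.
    end_page = int(sections[i + 1]["page_number"]) - 1 if i + 1 < len(sections) else None

    # Get schema for this section if we have one
    schema = section_schemas.get(section_name)
    if schema:
        page_range = f"{start_page}-{end_page}" if end_page is not None else str(start_page)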

        options = ConvertOptions(
            page_schema=json.dumps(schema),
            page_range=page_range,
            mode="balanced"
        )

        result = client.convert("report.pdf", options=options)
        results[section_name] = json.loads(result.extraction_schema_json)

print(results)
```
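The range-building step can be separated from the API calls so it is easy to test on its own. In this sketch, `toc_to_page_ranges`, `page_offset`, and `last_page` are hypothetical names: `page_offset` shifts printed page numbers to the 0-indexed values `page_range` expects, and `last_page` lets the final section run to the end of the document when you know the page count.

```python
from typing import Optional


def toc_to_page_ranges(
    sections: list, page_offset: int = 0, last_page: Optional[int] = None
) -> dict:
    """Map each section name to a page_range string built from the TOC."""
    ranges = {}
    for i, section in enumerate(sections):
        start = int(section["page_number"]) + page_offset
        if i + 1 < len(sections):
            # End on the page before the next section starts.
            end = int(sections[i + 1]["page_number"]) + page_offset - 1
        else:
            # Without a known page count, fall back to the start page only.
            end = last_page if last_page is not None else start
        # Two sections can start on the same page; never end before the start.
        end = max(start, end)
        ranges[section["section_name"]] = str(start) if end == start else f"{start}-{end}"
    return ranges


toc = [
    {"section_name": "Financial Highlights", "page_number": 4.0},
    {"section_name": "Risk Factors", "page_number": 10},
    {"section_name": "Notes", "page_number": 10},
]

# Printed page numbers are 1-based here, and the document has 40 pages (0-39).
print(toc_to_page_ranges(toc, page_offset=-1, last_page=39))
```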

## Use Document Segmentation

For documents without a clear table of contents, use [Document Segmentation](/docs/recipes/document-segmentation/auto-segmentation) to automatically split by section headers.

```python theme={null}
segmentation_schema = {
    "type": "object",
    "properties": {
        "sections": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "type": {"type": "string", "enum": ["introduction", "methods", "results", "conclusion"]}
                }
            }
        }
    }
}

options = ConvertOptions(
    segmentation_schema=json.dumps(segmentation_schema),
    mode="balanced"
)

result = client.convert("paper.pdf", options=options)
# Access segmentation results
segments = result.segmentation_results
```

## Full Example

Complete workflow for processing a 100+ page financial report:

```python theme={null}
import json
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()


def extract_with_toc(pdf_path: str, section_schemas: dict) -> dict:
    """Extract data from a long document using TOC-based segmentation."""

    # Step 1: Extract table of contents
    toc_schema = {
        "type": "object",
        "properties": {
            "table_of_contents": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "section_name": {"type": "string"},
                        "page_number": {"type": "number"}
                    }
                }
            }
        }
    }

    options = ConvertOptions(
        page_schema=json.dumps(toc_schema),
        page_range="0-6",
        mode="balanced"
    )

    result = client.convert(pdf_path, options=options)
    toc = json.loads(result.extraction_schema_json)
    sections = toc.get("table_of_contents", [])

    # Step 2: Extract each section with its schema
    results = {}

    for i, section in enumerate(sections):
        section_name = section["section_name"]
        start_page = int(section["page_number"])

        # Calculate page range; the last section has no known end page,
        # so only its start page is processed
        if i + 1 < len(sections):
            end_page = int(sections[i + 1]["page_number"]) - 1
            page_range = f"{start_page}-{end_page}"
        else:
            page_range = str(start_page)

        # Check if we have a schema for this section
        schema = section_schemas.get(section_name)
        if not schema:
            continue

        options = ConvertOptions(
            page_schema=json.dumps(schema),
            page_range=page_range,
            mode="balanced"
        )

        try:
            result = client.convert(pdf_path, options=options)
            results[section_name] = json.loads(result.extraction_schema_json)
            print(f"Extracted: {section_name}")
        except Exception as e:
            print(f"Error extracting {section_name}: {e}")

    return results


# Define schemas for sections you care about
schemas = {
    "Financial Highlights": {
        "type": "object",
        "properties": {
            "total_revenue": {"type": "number", "description": "Total revenue"},
            "net_income": {"type": "number", "description": "Net income"},
            "year": {"type": "string", "description": "Fiscal year"}
        }
    },
    "Business Overview": {
        "type": "object",
        "properties": {
            "description": {"type": "string", "description": "Business description"},
            "products": {"type": "array", "items": {"type": "string"}}
        }
    }
}

results = extract_with_toc("annual_report.pdf", schemas)
print(json.dumps(results, indent=2))
```

## Tips

1. **Process only the pages you need** - Use `page_range` to avoid processing unnecessary pages
2. **Extract TOC first** - Build page ranges dynamically from the document structure
3. **Use appropriate modes** - `balanced` is usually sufficient; use `accurate` for complex tables
4. **Handle errors** - Some sections may not match your schema exactly

## Next Steps

<CardGroup>
  <Card title="Structured Extraction" icon="table" href="/docs/recipes/structured-extraction/api-overview">
    Learn the full structured extraction API and schema options.
  </Card>

  <Card title="Document Segmentation" icon="scissors" href="/docs/recipes/document-segmentation/auto-segmentation">
    Automatically split documents by section headers.
  </Card>

  <Card title="Batch Processing" icon="layer-group" href="/docs/recipes/conversion/batch-documents">
    Process multiple long documents efficiently in parallel.
  </Card>

  <Card title="Pipelines" icon="workflow" href="/docs/recipes/pipelines/pipeline-overview">
    Chain processors into versioned, reusable pipelines.
  </Card>
</CardGroup>


# Saved Schemas
Source: https://documentation.datalab.to/docs/recipes/structured-extraction/saved-schemas

Create and manage reusable extraction schemas in the Datalab UI. Reference saved schemas by ID instead of sending the full schema with every request.

**Before you begin**, make sure you have:

1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits)
2. Python 3.10+ installed
3. The Datalab SDK: `pip install datalab-python-sdk`
4. Your `DATALAB_API_KEY` environment variable set

## Overview

Saved Schemas let you store extraction schemas in Datalab and reference them by ID (`schema_id`) when calling `/api/v1/extract`. Instead of sending a full JSON schema with every request, you save it once and reference it by its stable ID.

Saved schemas also support **versioning** — you can update a schema while keeping a history of previous versions and pin extractions to a specific version using `schema_version`.

## Create a Schema

Create schemas via the SDK or the [Datalab UI](https://www.datalab.to/app/schemas). Each schema is assigned a `schema_id` (e.g. `sch_k8Hx9mP2nQ4v`) that you can reference in extraction requests.

<CodeGroup>
  ```python Python SDK theme={null}
  from datalab_sdk import DatalabClient

  client = DatalabClient()

  schema = client.create_extraction_schema(
      name="Invoice Schema",
      description="Extracts key fields from invoices",
      schema_json={
          "properties": {
              "invoice_number": {"type": "string", "description": "Invoice ID"},
              "total_amount": {"type": "number", "description": "Total amount due"},
              "vendor_name": {"type": "string", "description": "Vendor or supplier name"},
              "due_date": {"type": "string", "description": "Payment due date"},
          }
      },
  )
  print(schema.schema_id)  # e.g. sch_k8Hx9mP2nQ4v
  ```

  ```bash cURL theme={null}
  curl -X POST https://www.datalab.to/api/v1/extraction_schemas \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "name": "Invoice Schema",
      "description": "Extracts key fields from invoices",
      "schema_json": {
        "properties": {
          "invoice_number": {"type": "string", "description": "Invoice ID"},
          "total_amount": {"type": "number", "description": "Total amount due"},
          "vendor_name": {"type": "string", "description": "Vendor or supplier name"},
          "due_date": {"type": "string", "description": "Payment due date"}
        }
      }
    }'
  ```

  ```python Python (requests) theme={null}
  import os, requests

  resp = requests.post(
      "https://www.datalab.to/api/v1/extraction_schemas",
      headers={"X-API-Key": os.getenv("DATALAB_API_KEY")},
      json={
          "name": "Invoice Schema",
          "schema_json": {
              "properties": {
                  "invoice_number": {"type": "string"},
                  "total_amount": {"type": "number"},
              }
          },
      },
  )
  schema_id = resp.json()["schema_id"]
  print(schema_id)
  ```
</CodeGroup>

## Extract Using a Saved Schema

Pass `schema_id` to `/api/v1/extract` instead of `page_schema`:

<CodeGroup>
  ```python Python SDK theme={null}
  from datalab_sdk import DatalabClient, ExtractOptions
  import json

  client = DatalabClient()

  result = client.extract(
      "invoice.pdf",
      options=ExtractOptions(
          schema_id="sch_k8Hx9mP2nQ4v",
          mode="balanced",
      ),
  )
  extracted = json.loads(result.extraction_schema_json)
  print(extracted)
  ```

  ```bash cURL theme={null}
  curl -X POST https://www.datalab.to/api/v1/extract \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -F "file=@invoice.pdf" \
    -F "schema_id=sch_k8Hx9mP2nQ4v" \
    -F "mode=balanced"

  # Poll request_check_url from response until status is "complete"
  ```

  ```python Python (requests) theme={null}
  import requests, json, time, os

  headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")}

  with open("invoice.pdf", "rb") as f:
      resp = requests.post(
          "https://www.datalab.to/api/v1/extract",
          files={"file": ("invoice.pdf", f, "application/pdf")},
          data={"schema_id": "sch_k8Hx9mP2nQ4v", "mode": "balanced"},
          headers=headers
      )

  check_url = resp.json()["request_check_url"]

  while True:
      result = requests.get(check_url, headers=headers).json()
      if result["status"] == "complete":
          extracted = json.loads(result["extraction_schema_json"])
          print(extracted)
          break
      elif result["status"] == "failed":
          print(f"Error: {result.get('error')}")
          break
      time.sleep(2)
  ```
</CodeGroup>

<Warning>
  `page_schema` and `schema_id` are mutually exclusive — provide exactly one. If you pass both, the API returns a `400` error.
</Warning>

## Schema Versioning

When you update a schema in the [Datalab UI](https://www.datalab.to/app/schemas), you can choose to create a new version. This saves the current state to version history and increments the version number.

### Pin to a specific version

Pass `schema_version` alongside `schema_id` to use a specific version:

```bash theme={null}
curl -X POST https://www.datalab.to/api/v1/extract \
  -H "X-API-Key: $DATALAB_API_KEY" \
  -F "file=@invoice.pdf" \
  -F "schema_id=sch_k8Hx9mP2nQ4v" \
  -F "schema_version=1"
```

If you omit `schema_version`, the API uses the latest version.

<Tip>
  We recommend always specifying `schema_version` alongside `schema_id`. This ensures your extractions produce consistent results even if the schema is updated later.
</Tip>

## List Schemas

<CodeGroup>
  ```python Python SDK theme={null}
  result = client.list_extraction_schemas(limit=50, include_archived=False)
  for s in result["schemas"]:
      print(f"{s.schema_id}: {s.name} (v{s.version})")
  ```

  ```bash cURL theme={null}
  # List active schemas
  curl "https://www.datalab.to/api/v1/extraction_schemas" \
    -H "X-API-Key: $DATALAB_API_KEY"

  # Include archived schemas
  curl "https://www.datalab.to/api/v1/extraction_schemas?include_archived=true" \
    -H "X-API-Key: $DATALAB_API_KEY"
  ```

  ```python Python (requests) theme={null}
  import os, requests

  resp = requests.get(
      "https://www.datalab.to/api/v1/extraction_schemas",
      headers={"X-API-Key": os.getenv("DATALAB_API_KEY")},
  )
  for s in resp.json()["schemas"]:
      print(s["schema_id"], s["name"])
  ```
</CodeGroup>

The response includes `schemas` (array) and `total` (count). Schemas are ordered by creation date, newest first.

## Get a Schema

<CodeGroup>
  ```python Python SDK theme={null}
  schema = client.get_extraction_schema("sch_k8Hx9mP2nQ4v")
  print(schema.name, schema.version)
  ```

  ```bash cURL theme={null}
  curl "https://www.datalab.to/api/v1/extraction_schemas/sch_k8Hx9mP2nQ4v" \
    -H "X-API-Key: $DATALAB_API_KEY"
  ```

  ```python Python (requests) theme={null}
  import os, requests

  resp = requests.get(
      "https://www.datalab.to/api/v1/extraction_schemas/sch_k8Hx9mP2nQ4v",
      headers={"X-API-Key": os.getenv("DATALAB_API_KEY")},
  )
  print(resp.json())
  ```
</CodeGroup>

## Update a Schema

Update schema fields. Pass `create_new_version=True` to save the current state to version history before updating:

<CodeGroup>
  ```python Python SDK theme={null}
  # Update schema fields and create a new version
  schema = client.update_extraction_schema(
      "sch_k8Hx9mP2nQ4v",
      schema_json={
          "properties": {
              "invoice_number": {"type": "string"},
              "total_amount": {"type": "number"},
              "line_items": {"type": "array", "items": {"type": "string"}},  # New field
          }
      },
      create_new_version=True,
  )
  print(f"Now at v{schema.version}")
  ```

  ```bash cURL theme={null}
  curl -X PUT "https://www.datalab.to/api/v1/extraction_schemas/sch_k8Hx9mP2nQ4v" \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "schema_json": {
        "properties": {
          "invoice_number": {"type": "string"},
          "total_amount": {"type": "number"},
          "line_items": {"type": "array", "items": {"type": "string"}}
        }
      },
      "create_new_version": true
    }'
  ```

  ```python Python (requests) theme={null}
  import os, requests

  resp = requests.put(
      "https://www.datalab.to/api/v1/extraction_schemas/sch_k8Hx9mP2nQ4v",
      headers={"X-API-Key": os.getenv("DATALAB_API_KEY")},
      json={
          "schema_json": {"properties": {"invoice_number": {"type": "string"}}},
          "create_new_version": True,
      },
  )
  print(resp.json()["version"])
  ```
</CodeGroup>

## Archive a Schema

Archiving soft-deletes a schema — it no longer appears in list results (unless `include_archived=true`) and cannot be used for new extractions:

<CodeGroup>
  ```python Python SDK theme={null}
  client.delete_extraction_schema("sch_k8Hx9mP2nQ4v")
  ```

  ```bash cURL theme={null}
  curl -X DELETE "https://www.datalab.to/api/v1/extraction_schemas/sch_k8Hx9mP2nQ4v" \
    -H "X-API-Key: $DATALAB_API_KEY"
  ```

  ```python Python (requests) theme={null}
  import os, requests

  requests.delete(
      "https://www.datalab.to/api/v1/extraction_schemas/sch_k8Hx9mP2nQ4v",
      headers={"X-API-Key": os.getenv("DATALAB_API_KEY")},
  )
  ```
</CodeGroup>

## API Reference

### Schema Object

| Field             | Type         | Description                                             |
| ----------------- | ------------ | ------------------------------------------------------- |
| `schema_id`       | string       | Stable string ID (e.g. `sch_k8Hx9mP2nQ4v`)              |
| `name`            | string       | Human-readable name (max 200 chars)                     |
| `description`     | string\|null | Optional description                                    |
| `schema_json`     | object       | JSON schema with a `properties` key                     |
| `version`         | int          | Current version number (starts at 1)                    |
| `version_history` | array        | Previous versions saved with `create_new_version: true` |
| `archived`        | bool         | Whether the schema is archived                          |
| `created`         | datetime     | Creation timestamp                                      |
| `updated`         | datetime     | Last update timestamp                                   |

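When calling the REST endpoints directly, it can help to wrap the raw response in a small typed object. The sketch below mirrors the fields in the table above. `ExtractionSchema` and its methods are hypothetical names, not part of the Datalab SDK, and the sample payload values are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExtractionSchema:
    """Typed view of the Schema Object fields listed above (illustrative only)."""

    schema_id: str
    name: str
    schema_json: dict
    version: int = 1
    description: Optional[str] = None
    version_history: list = field(default_factory=list)
    archived: bool = False
    # Timestamps are kept as raw strings; their exact format is not specified here.
    created: Optional[str] = None
    updated: Optional[str] = None

    @classmethod
    def from_dict(cls, data: dict) -> "ExtractionSchema":
        # Drop keys that are not in the table so extra fields do not raise.
        known = {k: v for k, v in data.items() if k in cls.__dataclass_fields__}
        return cls(**known)

    def field_names(self) -> list:
        # Names of the extracted fields, read from the `properties` key.
        return sorted(self.schema_json.get("properties", {}))


# Made-up payload shaped like the Schema Object table.
payload = {
    "schema_id": "sch_k8Hx9mP2nQ4v",
    "name": "Invoice Schema",
    "schema_json": {
        "properties": {
            "invoice_number": {"type": "string"},
            "total_amount": {"type": "number"},
        }
    },
    "version": 2,
    "archived": False,
    "some_future_field": "ignored",
}
schema = ExtractionSchema.from_dict(payload)
print(schema.version, schema.field_names())
```
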
### `/extract` Parameters (schema-related)

| Parameter        | Type   | Description                                                      |
| ---------------- | ------ | ---------------------------------------------------------------- |
| `schema_id`      | string | ID of a saved schema. Mutually exclusive with `page_schema`.     |
| `schema_version` | int    | Version to use. Only valid with `schema_id`. Defaults to latest. |

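The two rules in the table above can be checked client-side before a request is sent. This is a minimal sketch: `build_schema_fields` is a hypothetical helper, not part of the Datalab SDK, and it only assembles the form fields that the `/extract` examples pass as `data`.

```python
import json
from typing import Optional


def build_schema_fields(
    page_schema: Optional[dict] = None,
    schema_id: Optional[str] = None,
    schema_version: Optional[int] = None,
) -> dict:
    """Return schema-related form fields for /extract (hypothetical helper)."""
    # Exactly one of page_schema / schema_id must be supplied.
    if (page_schema is None) == (schema_id is None):
        raise ValueError("Provide exactly one of page_schema or schema_id")
    # schema_version is only valid together with schema_id.
    if schema_version is not None and schema_id is None:
        raise ValueError("schema_version is only valid with schema_id")

    if schema_id is not None:
        fields = {"schema_id": schema_id}
        if schema_version is not None:
            fields["schema_version"] = str(schema_version)
        return fields
    # Inline schemas are sent as a JSON string.
    return {"page_schema": json.dumps(page_schema)}


fields = build_schema_fields(schema_id="sch_k8Hx9mP2nQ4v", schema_version=1)
print(fields)
```

The returned dictionary can be merged with other form fields such as `mode` before posting to `/api/v1/extract`.
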
## Next Steps

<CardGroup>
  <Card title="Structured Extraction" icon="table" href="/docs/recipes/structured-extraction/api-overview">
    Full guide to extraction with inline schemas, checkpoints, and options.
  </Card>

  <Card title="Confidence Scoring" icon="chart-bar" href="/docs/recipes/structured-extraction/confidence-scoring">
    Score extraction results with per-field confidence ratings.
  </Card>

  <Card title="Forge Evals" icon="chart-bar" href="/docs/recipes/forge-evals/overview">
    Compare extraction results across configurations using saved schemas.
  </Card>

  <Card title="Handling Long Documents" icon="file-lines" href="/docs/recipes/structured-extraction/handling-long-documents">
    Strategies for extracting from 100+ page documents.
  </Card>
</CardGroup>


# Table Recognition
Source: https://documentation.datalab.to/docs/recipes/table-recognition/table-rec-api-overview

Extract tables from documents.

<Warning>
  **Deprecated:** The standalone Table Recognition endpoint (`/api/v1/table_rec`) is deprecated. Table extraction is now integrated into the Convert API.

  Use the Convert API with `output_format: "json"` to get structured table data with bounding boxes.
</Warning>

## Recommended Approach

Use the Convert API for table extraction:

```python theme={null}
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

options = ConvertOptions(
    output_format="json",
    mode="balanced"
)

result = client.convert("document.pdf", options=options)

# Tables are in the JSON output with block_type: "Table"
for block in result.json.get("children", []):
    if block.get("block_type") == "Table":
        print(f"Table found: {block['id']}")
        print(f"Bounding box: {block['bbox']}")
        # Access table cells in block['children']
```

### REST API

```bash theme={null}
curl -X POST https://www.datalab.to/api/v1/convert \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@document.pdf" \
  -F "output_format=json" \
  -F "mode=balanced"
```

The JSON response includes `Table` and `TableCell` blocks with bounding boxes.

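Blocks in the JSON output are nested through `children`, so tables may sit below the top level (for example, inside a page block). The sketch below walks the whole tree. `iter_blocks` is a hypothetical helper, and the sample structure is a trimmed, made-up stand-in for real output that uses only the `block_type`, `children`, `id`, and `bbox` keys shown above.

```python
from typing import Iterator


def iter_blocks(block: dict, block_type: str) -> Iterator[dict]:
    """Yield every nested block whose `block_type` matches, depth-first."""
    if block.get("block_type") == block_type:
        yield block
    # `children` may be missing or None on leaf blocks.
    for child in block.get("children") or []:
        yield from iter_blocks(child, block_type)


# Made-up, heavily trimmed output used only to exercise the walker.
sample = {
    "block_type": "Document",
    "children": [
        {
            "block_type": "Page",
            "children": [
                {"block_type": "Text", "children": None},
                {
                    "id": "/page/0/Table/1",
                    "block_type": "Table",
                    "bbox": [10, 20, 300, 200],
                    "children": [{"block_type": "TableCell", "children": None}],
                },
            ],
        }
    ],
}

tables = list(iter_blocks(sample, "Table"))
print([t["id"] for t in tables])
```

With the SDK example above, the same walker could be called as `iter_blocks(result.json, "Table")`, assuming `result.json` is a plain dictionary.
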
## Why Use the Convert API Instead?

* **Single endpoint** - No need for a separate table-specific call
* **Better integration** - Tables are extracted in context with the full document
* **More features** - Access processing modes, structured extraction, and more
* **Consistent API** - Same patterns as all other document processing

## Related

* [Document Conversion](/docs/recipes/conversion/conversion-api-overview) - Full Convert API documentation
* [Structured Extraction](/docs/recipes/structured-extraction/api-overview) - Extract specific data from tables

<Card title="Try Datalab" icon="rocket" href="https://www.datalab.to/auth/sign_up">
  Get started with our API in less than a minute. We include free credits.
</Card>


# API Overview
Source: https://documentation.datalab.to/docs/welcome/api

REST API reference for document conversion, form filling, and file management.

Datalab provides REST APIs for document conversion, structured extraction, form filling, and file management. All APIs use the same authentication and follow similar patterns.

<Note>
  For the simplest integration, use the [Python SDK](/docs/welcome/sdk). The SDK handles authentication, polling, and provides typed responses.
</Note>

## Authentication

All requests require an API key in the `X-API-Key` header:

```bash theme={null}
curl -X POST https://www.datalab.to/api/v1/convert \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@document.pdf"
```

Get your API key from the [API Keys dashboard](https://www.datalab.to/app/keys).

## Request Pattern

All processing endpoints follow this pattern:

1. **Submit** a document for processing (returns immediately with a `request_id`)
2. **Poll** the status endpoint until processing completes
3. **Retrieve** results from the completed response

### Submit Request

```bash theme={null}
POST /api/v1/{endpoint}
```

Response:

```json theme={null}
{
  "success": true,
  "request_id": "abc123",
  "request_check_url": "https://www.datalab.to/api/v1/{endpoint}/abc123"
}
```

### Poll for Results

```bash theme={null}
GET /api/v1/{endpoint}/{request_id}
```

Response while processing:

```json theme={null}
{
  "status": "processing"
}
```

Response when complete:

```json theme={null}
{
  "status": "complete",
  "success": true,
  ...results...
}
```

<Warning>
  Results are deleted from Datalab servers one hour after processing completes. Retrieve your results promptly.
</Warning>

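The submit, poll, retrieve pattern can be wrapped in one reusable function. The sketch below is an illustration rather than SDK code: `poll_until_done` is a hypothetical helper, the `interval` and `timeout` defaults are assumptions, and the status fetcher is passed in as a callable so the loop can be tried without network access.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], dict],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status() until `status` is "complete" or "failed"."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch_status()
        status = result.get("status")
        if status == "complete":
            return result
        if status == "failed":
            raise RuntimeError(result.get("error") or "processing failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no result after {timeout} seconds")
        sleep(interval)


# Offline demonstration with canned responses instead of HTTP calls.
responses = iter([{"status": "processing"}, {"status": "complete", "success": True}])
result = poll_until_done(lambda: next(responses), sleep=lambda seconds: None)
print(result)
```

With `requests`, the fetcher would be something like `lambda: requests.get(check_url, headers=headers).json()`, where `check_url` is the `request_check_url` from the submit response.
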
## Document Conversion

Convert documents to Markdown, HTML, JSON, or chunks.

**Endpoint:** `POST /api/v1/convert`

### Request

```python theme={null}
import requests

url = "https://www.datalab.to/api/v1/convert"
headers = {"X-API-Key": "YOUR_API_KEY"}

with open("document.pdf", "rb") as f:
    response = requests.post(
        url,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={
            "output_format": "markdown",
            "mode": "balanced",
        },
        headers=headers
    )

data = response.json()
check_url = data["request_check_url"]
```

### Parameters

| Parameter                    | Type   | Default    | Description                                                                                                                                                                                                        |
| ---------------------------- | ------ | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `file`                       | file   | -          | Document file (multipart upload)                                                                                                                                                                                   |
| `file_url`                   | string | -          | URL to document (alternative to file upload)                                                                                                                                                                       |
| `output_format`              | string | `markdown` | Output format: `markdown`, `html`, `json`, `chunks`                                                                                                                                                                |
| `mode`                       | string | `fast`     | Processing mode: `fast`, `balanced`, `accurate`                                                                                                                                                                    |
| `max_pages`                  | int    | -          | Maximum pages to process                                                                                                                                                                                           |
| `page_range`                 | string | -          | Specific pages (e.g., `"0-5,10"`, 0-indexed). For spreadsheets, filters by sheet index.                                                                                                                            |
| `paginate`                   | bool   | `false`    | Add page delimiters to output                                                                                                                                                                                      |
| `skip_cache`                 | bool   | `false`    | Skip cached results                                                                                                                                                                                                |
| `disable_image_extraction`   | bool   | `false`    | Don't extract images                                                                                                                                                                                               |
| `disable_image_captions`     | bool   | `false`    | Don't generate image captions                                                                                                                                                                                      |
| `save_checkpoint`            | bool   | `false`    | Save checkpoint for reuse                                                                                                                                                                                          |
| `word_bboxes`                | bool   | `false`    | Predict per-word bounding boxes. Each word is inlined as `<span data-bbox="..." data-confidence="...">` in HTML output. Billed at \$0.30/1K pages.                                                                 |
| `extras`                     | string | -          | Comma-separated: `track_changes`, `chart_understanding`, `extract_links`, `table_cell_bboxes`, `list_item_bboxes`, `infographic`, `new_block_types`. (`table_row_bboxes` is deprecated — use `table_cell_bboxes`.) |
| `add_block_ids`              | bool   | `false`    | Add block IDs to HTML for citations                                                                                                                                                                                |
| `include_markdown_in_chunks` | bool   | `false`    | Include markdown content in chunks output                                                                                                                                                                          |
| `token_efficient_markdown`   | bool   | `false`    | Optimize markdown for LLM token efficiency                                                                                                                                                                         |
| `fence_synthetic_captions`   | bool   | `false`    | Wrap synthetic image captions in HTML comments                                                                                                                                                                     |
| `additional_config`          | string | -          | JSON with extra config options                                                                                                                                                                                     |
| `webhook_url`                | string | -          | Override webhook URL for this request                                                                                                                                                                              |

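To make the `page_range` format concrete, the sketch below expands a string such as `"0-5,10"` into explicit 0-indexed page numbers. `parse_page_range` is a hypothetical client-side helper, and it assumes ranges are inclusive at both ends, which the table does not state.

```python
def parse_page_range(page_range: str) -> list:
    """Expand a string such as "0-5,10" into a sorted list of 0-indexed pages.

    Assumption: ranges are inclusive at both ends.
    """
    pages = set()
    for part in page_range.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(piece) for piece in part.split("-", 1))
            if end < start:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_range("0-5,10"))
```
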
### Processing Modes

| Mode       | Description                                         |
| ---------- | --------------------------------------------------- |
| `fast`     | Lowest latency, good for simple documents (default) |
| `balanced` | Balance of speed and accuracy                       |
| `accurate` | Highest accuracy, best for complex layouts          |

### Response

Poll `request_check_url` until `status` is `complete`:

```python theme={null}
import time

while True:
    response = requests.get(check_url, headers=headers)
    result = response.json()

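    if result["status"] == "complete":
        break
    if result["status"] == "failed":
        # Stop polling instead of looping forever on a failed request
        raise RuntimeError(result.get("error", "Conversion failed"))
    time.sleep(2)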

print(result["markdown"])
```

Response fields:

| Field                 | Type   | Description                              |
| --------------------- | ------ | ---------------------------------------- |
| `status`              | string | `processing`, `complete`, or `failed`    |
| `success`             | bool   | Whether conversion succeeded             |
| `markdown`            | string | Markdown output (if format is markdown)  |
| `html`                | string | HTML output (if format is html)          |
| `json`                | object | JSON output (if format is json)          |
| `chunks`              | object | Chunked output (if format is chunks)     |
| `images`              | object | Extracted images as `{filename: base64}` |
| `metadata`            | object | Document metadata                        |
| `page_count`          | int    | Number of pages processed                |
| `parse_quality_score` | float  | Quality score (0-5)                      |
| `cost_breakdown`      | object | Cost in cents                            |
| `error`               | string | Error message if failed                  |

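Because `images` maps filenames to base64 strings, writing them to disk takes only a few lines. The sketch below shows one way to do it with a hypothetical `save_images` helper. The sample mapping is made-up data, not real API output.

```python
import base64
import tempfile
from pathlib import Path


def save_images(images: dict, out_dir: str) -> list:
    """Decode a {filename: base64} mapping and write each image to out_dir."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, encoded in images.items():
        # Keep only the final path component so a filename cannot escape out_dir.
        path = target / Path(filename).name
        path.write_bytes(base64.b64decode(encoded))
        written.append(path)
    return written


# Made-up payload: the bytes are a placeholder, not a real image.
sample = {"figure_1.png": base64.b64encode(b"placeholder bytes").decode("ascii")}
with tempfile.TemporaryDirectory() as tmp:
    paths = save_images(sample, tmp)
    print([p.name for p in paths])
```
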
<Note>
  For structured data extraction, see the [Extract endpoint](#structured-extraction). For document segmentation, see the [Segment endpoint](#document-segmentation).
</Note>

## Structured Extraction

Extract structured data from documents using a JSON schema.

**Endpoint:** `POST /api/v1/extract`

### Request

```python theme={null}
import requests
import json

headers = {"X-API-Key": "YOUR_API_KEY"}

schema = {
    "invoice_number": {"type": "string", "description": "Invoice ID"},
    "total": {"type": "number", "description": "Total amount"},
    "line_items": {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "description": {"type": "string"},
                "amount": {"type": "number"}
            }
        }
    }
}

# Open the file in a context manager so the handle is closed after upload
with open("invoice.pdf", "rb") as f:
    response = requests.post(
        "https://www.datalab.to/api/v1/extract",
        files={"file": ("invoice.pdf", f, "application/pdf")},
        data={
            "page_schema": json.dumps(schema),
            "mode": "balanced"
        },
        headers=headers
    )

data = response.json()
check_url = data["request_check_url"]
```

### Parameters

| Parameter         | Type   | Default    | Description                                                                                                                                            |
| ----------------- | ------ | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `file`            | file   | -          | Document file (multipart upload)                                                                                                                       |
| `file_url`        | string | -          | URL to document (alternative to file upload)                                                                                                           |
| `page_schema`     | string | -          | JSON schema defining the data to extract. Required unless `schema_id` is provided.                                                                     |
| `schema_id`       | string | -          | ID of a [saved extraction schema](/docs/recipes/structured-extraction/saved-schemas) (e.g. `sch_k8Hx9mP2nQ4v`). Mutually exclusive with `page_schema`. |
| `schema_version`  | int    | -          | Version of the saved schema to use. Only valid with `schema_id`; defaults to the latest version.                                                       |
| `checkpoint_id`   | string | -          | Checkpoint ID from a previous `/convert` call (with `save_checkpoint=true`). Skips re-parsing.                                                         |
| `mode`            | string | `fast`     | Processing mode: `fast`, `balanced`, `accurate`                                                                                                        |
| `output_format`   | string | `markdown` | Output format: `markdown`, `html`, `json`, `chunks`                                                                                                    |
| `max_pages`       | int    | -          | Maximum pages to process                                                                                                                               |
| `page_range`      | string | -          | Specific pages (e.g., `"0-5,10"`, 0-indexed). For spreadsheets, filters by sheet index.                                                                |
| `save_checkpoint` | bool   | `false`    | Save a checkpoint after processing for reuse with subsequent calls                                                                                     |
| `webhook_url`     | string | -          | Override webhook URL for this request                                                                                                                  |

The extracted data is returned in `extraction_schema_json` in the poll response.

See [Structured Extraction](/docs/recipes/structured-extraction/api-overview) for detailed examples.

## Document Segmentation

Segment documents into structured sections using a JSON schema.

**Endpoint:** `POST /api/v1/segment`

### Parameters

| Parameter             | Type   | Default      | Description                                                                                    |
| --------------------- | ------ | ------------ | ---------------------------------------------------------------------------------------------- |
| `file`                | file   | -            | Document file (multipart upload)                                                               |
| `file_url`            | string | -            | URL to document (alternative to file upload)                                                   |
| `segmentation_schema` | string | **required** | JSON schema defining the segments to extract                                                   |
| `checkpoint_id`       | string | -            | Checkpoint ID from a previous `/convert` call (with `save_checkpoint=true`). Skips re-parsing. |
| `mode`                | string | `fast`       | Processing mode: `fast`, `balanced`, `accurate`                                                |

See [Document Segmentation](/docs/recipes/document-segmentation/auto-segmentation) for detailed examples.

## Track Changes

Extract tracked changes (insertions and deletions) from DOCX files.

**Endpoint:** `POST /api/v1/track-changes`

```python theme={null}
docx_mime = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

# `headers` is the same X-API-Key dictionary used in the earlier examples
with open("document.docx", "rb") as f:
    response = requests.post(
        "https://www.datalab.to/api/v1/track-changes",
        files={"file": ("document.docx", f, docx_mime)},
        headers=headers
    )
```

See [Track Changes](/docs/recipes/extract-redlines-and-comments/track-changes-from-word-documents) for detailed examples.

## Custom Processor

Execute custom AI-powered processors on documents.

**Endpoint:** `POST /api/v1/custom-processor`

<Warning>
  `POST /api/v1/custom-pipeline` is deprecated (sunset: September 30, 2026). Migrate to `POST /api/v1/custom-processor`.
</Warning>

### Parameters

| Parameter       | Type   | Default      | Description                                         |
| --------------- | ------ | ------------ | --------------------------------------------------- |
| `file`          | file   | -            | Document file (multipart upload)                    |
| `file_url`      | string | -            | URL to document                                     |
| `pipeline_id`   | string | **required** | Custom processor ID (`cp_XXXXX`)                    |
| `version`       | int    | -            | Processor version to run (default: active version)  |
| `run_eval`      | bool   | `false`      | Run evaluation rules defined for the processor      |
| `mode`          | string | `fast`       | Processing mode: `fast`, `balanced`, `accurate`     |
| `output_format` | string | `markdown`   | Output format: `markdown`, `html`, `json`, `chunks` |
| `webhook_url`   | string | -            | URL to POST when complete                           |

## Form Filling

Fill forms in PDFs and images.

**Endpoint:** `POST /api/v1/fill`

### Request

```python theme={null}
import json

field_data = {
    "full_name": {"value": "John Doe", "description": "Full legal name"},
    "date": {"value": "2024-01-15", "description": "Today's date"},
    "signature": {"value": "John Doe", "description": "Signature field"}
}

response = requests.post(
    "https://www.datalab.to/api/v1/fill",
    files={"file": ("form.pdf", open("form.pdf", "rb"), "application/pdf")},
    data={
        "field_data": json.dumps(field_data),
        "confidence_threshold": "0.5"
    },
    headers=headers
)
```

### Parameters

| Parameter              | Type   | Default | Description                           |
| ---------------------- | ------ | ------- | ------------------------------------- |
| `file`                 | file   | -       | Form file (PDF or image)              |
| `file_url`             | string | -       | URL to form                           |
| `field_data`           | string | -       | JSON mapping field names to values    |
| `context`              | string | -       | Additional context for field matching |
| `confidence_threshold` | float  | `0.5`   | Minimum confidence for matching (0-1) |
| `page_range`           | string | -       | Specific pages to process             |
| `skip_cache`           | bool   | `false` | Skip cached results                   |

### Field Data Format

```json theme={null}
{
  "field_key": {
    "value": "The value to fill",
    "description": "Description to help match the field"
  }
}
```

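The `field_data` parameter is a JSON string in the format above, so it is often built from simpler application data. The sketch below uses a hypothetical `build_field_data` helper that takes `(value, description)` pairs. The pair layout is a convenience chosen for this example, not something the API requires.

```python
import json


def build_field_data(fields: dict) -> str:
    """Turn {key: (value, description)} into the JSON string for `field_data`."""
    payload = {
        key: {"value": value, "description": description}
        for key, (value, description) in fields.items()
    }
    return json.dumps(payload)


field_data = build_field_data(
    {
        "full_name": ("John Doe", "Full legal name"),
        "date": ("2024-01-15", "Today's date"),
    }
)
print(field_data)
```
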
### Response

| Field              | Type   | Description                     |
| ------------------ | ------ | ------------------------------- |
| `status`           | string | Processing status               |
| `success`          | bool   | Whether filling succeeded       |
| `output_format`    | string | `pdf` or `png`                  |
| `output_base64`    | string | Base64-encoded filled form      |
| `fields_filled`    | array  | Successfully filled field names |
| `fields_not_found` | array  | Unmatched field names           |
| `page_count`       | int    | Pages processed                 |
| `cost_breakdown`   | object | Cost details                    |

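A completed fill response can be turned into a file on disk, with unmatched fields reported alongside it. The sketch below shows one way to do that: `save_filled_form` is a hypothetical helper, the output file name is arbitrary, and the sample response is made-up data that uses only the fields from the table above.

```python
import base64
import tempfile
from pathlib import Path


def save_filled_form(result: dict, out_dir: str, stem: str = "filled_form"):
    """Write `output_base64` to disk and return (path, fields_not_found)."""
    output_format = result.get("output_format")
    # The response table lists only these two formats.
    if output_format not in ("pdf", "png"):
        raise ValueError(f"unexpected output_format: {output_format!r}")
    path = Path(out_dir) / f"{stem}.{output_format}"
    path.write_bytes(base64.b64decode(result["output_base64"]))
    return path, list(result.get("fields_not_found", []))


# Made-up response: the bytes are a placeholder, not a real PDF.
sample = {
    "success": True,
    "output_format": "pdf",
    "output_base64": base64.b64encode(b"placeholder bytes").decode("ascii"),
    "fields_filled": ["full_name", "date"],
    "fields_not_found": ["signature"],
}
with tempfile.TemporaryDirectory() as tmp:
    path, missing = save_filled_form(sample, tmp)
    print(path.name, missing)
```
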
See [Form Filling](/docs/recipes/form-filling/form-filling-api-overview) for more examples.

## File Management

Upload and manage files for use in pipelines.

### Upload File

**Step 1:** Request an upload URL

```bash theme={null}
POST /api/v1/files/upload
Content-Type: application/json

{
  "filename": "document.pdf",
  "content_type": "application/pdf"
}
```

Response:

```json theme={null}
{
  "file_id": 123,
  "upload_url": "https://...",
  "reference": "datalab://file-abc123"
}
```

**Step 2:** Upload directly to the presigned URL

```bash theme={null}
PUT {upload_url}
Content-Type: application/pdf

<file contents>
```

**Step 3:** Confirm upload

```bash theme={null}
GET /api/v1/files/{file_id}/confirm
```

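The three steps above can be chained in a single function. This is a sketch under stated assumptions: `upload_file` is a hypothetical helper, `http` is any object with `post`/`put`/`get` methods shaped like the `requests` library (for example `requests.Session()`), the API key header is sent on steps 1 and 3 as described under Authentication, and the presigned URL is assumed to need only the `Content-Type` header.

```python
import os


def upload_file(http, base_url: str, api_key: str, path: str, content_type: str) -> dict:
    """Run the three upload steps with a requests-style client (hypothetical helper)."""
    headers = {"X-API-Key": api_key}

    # Step 1: request a presigned upload URL.
    resp = http.post(
        f"{base_url}/api/v1/files/upload",
        json={"filename": os.path.basename(path), "content_type": content_type},
        headers=headers,
    )
    resp.raise_for_status()
    info = resp.json()

    # Step 2: send the file bytes to the presigned URL.
    # Assumption: only the Content-Type header is needed here.
    with open(path, "rb") as f:
        put_resp = http.put(
            info["upload_url"], data=f, headers={"Content-Type": content_type}
        )
    put_resp.raise_for_status()

    # Step 3: confirm the upload.
    confirm = http.get(
        f"{base_url}/api/v1/files/{info['file_id']}/confirm", headers=headers
    )
    confirm.raise_for_status()
    return info
```

A call might look like `upload_file(requests.Session(), "https://www.datalab.to", api_key, "document.pdf", "application/pdf")`. The returned dictionary carries the `file_id` and `reference` from step 1.
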
### List Files

```bash theme={null}
GET /api/v1/files?limit=50&offset=0
```

### Get File Metadata

```bash theme={null}
GET /api/v1/files/{file_id}
```

### Get Download URL

```bash theme={null}
GET /api/v1/files/{file_id}/download?expires_in=3600
```

### Delete File

```bash theme={null}
DELETE /api/v1/files/{file_id}
```

See [File Management](/docs/recipes/file-management/file-upload-api) for detailed examples.

## Thumbnails

Generate page thumbnails from a previously processed document:

```bash theme={null}
GET /api/v1/thumbnails/{lookup_key}?thumb_width=300&page_range=0-2
```

| Parameter     | Type   | Default   | Description                               |
| ------------- | ------ | --------- | ----------------------------------------- |
| `lookup_key`  | string | Required  | The request ID from a previous conversion |
| `thumb_width` | int    | 300       | Thumbnail width in pixels                 |
| `page_range`  | string | All pages | Pages to generate (e.g., `"0,2-4"`)       |

Response:

```json theme={null}
{
  "success": true,
  "thumbnails": ["base64_encoded_jpg_1", "base64_encoded_jpg_2"]
}
```

Thumbnails are returned as base64-encoded JPG images.

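The request URL can be assembled from the parameters in the table. The sketch below uses a hypothetical `thumbnails_url` helper built on the standard library, and `abc123` is a placeholder request ID.

```python
from typing import Optional
from urllib.parse import quote, urlencode

BASE_URL = "https://www.datalab.to"


def thumbnails_url(
    lookup_key: str, thumb_width: int = 300, page_range: Optional[str] = None
) -> str:
    """Build the thumbnails URL; page_range is omitted to request all pages."""
    params = {"thumb_width": thumb_width}
    if page_range is not None:
        params["page_range"] = page_range
    # Escape the key for use in the path and encode the query string.
    return f"{BASE_URL}/api/v1/thumbnails/{quote(lookup_key, safe='')}?{urlencode(params)}"


print(thumbnails_url("abc123", page_range="0-2"))
```
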
## Create Document

Generate DOCX files from markdown with track changes support:

```bash theme={null}
POST /api/v1/create-document
Content-Type: application/json

{
  "markdown": "# Title\n\nThis is <ins data-revision-author=\"Editor\">newly added</ins> text.",
  "output_format": "docx"
}
```

See [Create Document](/docs/recipes/create-document/create-document-api-overview) for detailed examples.

## Webhooks

Configure webhooks to receive notifications when processing completes instead of polling.

Set a default webhook URL in your [account settings](https://www.datalab.to/settings), or override per-request with the `webhook_url` parameter.

See [Webhooks](/platform/webhooks) for configuration details.

## Rate Limits

Default rate limits apply per API key. If you exceed limits, you'll receive a `429` response.

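A common way to handle `429` responses is to retry with exponential backoff. The sketch below is a generic client-side pattern, not documented Datalab behavior: `send_with_backoff` is a hypothetical helper, and the attempt count and delays are assumptions. The request is passed in as a callable that returns an object with a `status_code` attribute, as `requests` responses do.

```python
import time
from typing import Callable


def send_with_backoff(
    send: Callable[[], object],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
):
    """Call send() and retry with exponential backoff while it returns HTTP 429."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 429:
            return response
        # Wait 1x, 2x, 4x, ... the base delay; skip the wait after the last attempt.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    # Still rate limited after all attempts: hand the last response to the caller.
    return response
```

With `requests`, the callable could be `lambda: requests.get(check_url, headers=headers)`. When the request uploads a file, reopen or rewind the file inside the callable so each retry sends the full body.
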
See [Rate Limits](/docs/common/limits) for details and how to request higher limits.

## Next Steps

<CardGroup>
  <Card title="SDK Reference" icon="code" href="/docs/welcome/sdk">
    Use the Python SDK for a simpler integration with typed responses.
  </Card>

  <Card title="Webhooks" icon="bell" href="/platform/webhooks">
    Receive notifications when processing completes instead of polling.
  </Card>

  <Card title="API Limits" icon="gauge" href="/docs/common/limits">
    Understand file size limits, page limits, and rate limiting.
  </Card>

  <Card title="Document Conversion" icon="file-lines" href="/docs/recipes/conversion/conversion-api-overview">
    Detailed guide to converting documents to Markdown, HTML, or JSON.
  </Card>
</CardGroup>


# Quickstart
Source: https://documentation.datalab.to/docs/welcome/quickstart

Get started with Datalab to convert PDFs, images, and documents into Markdown, HTML, or JSON in minutes.

## Get Your API Key

Sign up at [datalab.to/auth/sign\_up](https://www.datalab.to/auth/sign_up) — new accounts include a **free monthly usage allowance** (no credit card required), enough to run a full proof of concept on your own documents.

Then grab your API key from the [API Keys dashboard](https://www.datalab.to/app/keys).

<Tip>
  **Want to try before writing code?** Upload a document to the [Forge Playground](https://www.datalab.to/app/playground) to see results instantly — no API key required.
</Tip>

## Installation

Install the Datalab SDK:

```bash theme={null}
pip install datalab-python-sdk
```

Set your API key as an environment variable:

```bash theme={null}
export DATALAB_API_KEY=your_api_key_here
```
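
A script can fail early with a clear message when the variable is missing. This is a small sketch, assuming only the `DATALAB_API_KEY` variable name and the `X-API-Key` header used by the REST examples on this page; the helper names are hypothetical.

```python
import os


def require_api_key(env=None) -> str:
    """Return the API key from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("DATALAB_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "DATALAB_API_KEY is not set; export it before creating the client"
        )
    return key


def auth_headers(api_key: str) -> dict:
    """Build the header used by the REST examples on this page."""
    return {"X-API-Key": api_key}
```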

## Convert a Document

The SDK provides a simple interface to convert documents to Markdown, HTML, JSON, or chunks.

<CodeGroup>
  ```python SDK theme={null}
  from datalab_sdk import DatalabClient

  client = DatalabClient()  # Uses DATALAB_API_KEY env var

  # Convert PDF to markdown
  result = client.convert("document.pdf")
  print(result.markdown)

  # Save output and images
  result.save_output("output/")
  ```

  ```python Python (requests) theme={null}
  import requests
  import time

  url = "https://www.datalab.to/api/v1/convert"
  headers = {"X-API-Key": "YOUR_API_KEY"}

  with open("document.pdf", "rb") as f:
      response = requests.post(
          url,
          files={"file": ("document.pdf", f, "application/pdf")},
          data={"output_format": "markdown"},
          headers=headers
      )

  data = response.json()
  check_url = data["request_check_url"]

  # Poll for completion
  while True:
      response = requests.get(check_url, headers=headers)
      result = response.json()
      if result["status"] == "complete":
          print(result["markdown"])
          break
      time.sleep(2)
  ```

  ```bash cURL theme={null}
  # Submit document
  curl -X POST https://www.datalab.to/api/v1/convert \
    -H "X-API-Key: YOUR_API_KEY" \
    -F "file=@document.pdf" \
    -F "output_format=markdown"

  # Poll for results (use request_check_url from response)
  curl -X GET "https://www.datalab.to/api/v1/convert/{request_id}" \
    -H "X-API-Key: YOUR_API_KEY"
  ```
</CodeGroup>

<Warning>
  **Common mistakes:**

  * Forgetting to set the `DATALAB_API_KEY` environment variable
  * Using `file_url` with a private/authenticated URL (must be publicly accessible)
  * Not polling for results: the initial response contains only a `request_id` and a `request_check_url`, not the converted output
</Warning>
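
When polling the REST API directly, it can help to cap the number of attempts so that a stuck request does not loop forever. The sketch below is one way to do that; `poll_until_complete` is a hypothetical helper, its defaults are assumptions, and it only relies on the `"status": "complete"` value shown in the example above.

```python
import time
from typing import Callable


def poll_until_complete(
    fetch: Callable[[], dict],
    max_polls: int = 300,
    poll_interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call `fetch` until it reports completion, or give up after `max_polls` tries."""
    for _ in range(max_polls):
        result = fetch()
        if result.get("status") == "complete":
            return result
        sleep(poll_interval)  # wait before asking again
    raise TimeoutError(f"request not complete after {max_polls} polls")


# Usage with the requests example above (not executed here):
# result = poll_until_complete(
#     lambda: requests.get(check_url, headers=headers).json()
# )
```

The `sleep` parameter exists only so the wait can be replaced in tests.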

## Conversion Options

Control the conversion with options:

```python theme={null}
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

options = ConvertOptions(
    output_format="markdown",  # "markdown", "html", "json", "chunks"
    mode="balanced",           # "fast", "balanced", "accurate"
    paginate=True,             # Add page delimiters
    page_range="0-10",         # Process specific pages (0-indexed)
)

result = client.convert("document.pdf", options=options)
```

### Processing Modes

| Mode       | Description                                             |
| ---------- | ------------------------------------------------------- |
| `fast`     | Lowest latency, good for simple documents (SDK default) |
| `balanced` | Balance of speed and accuracy                           |
| `accurate` | Highest accuracy, best for complex layouts              |

## Fill PDF Forms

Fill forms in PDFs or images with structured data:

<CodeGroup>
  ```python SDK theme={null}
  from datalab_sdk import DatalabClient, FormFillingOptions

  client = DatalabClient()

  options = FormFillingOptions(
      field_data={
          "full_name": {"value": "John Doe", "description": "Full legal name"},
          "date": {"value": "2024-01-15", "description": "Today's date"},
          "signature": {"value": "John Doe", "description": "Signature field"},
      }
  )

  result = client.fill("form.pdf", options=options)
  result.save_output("filled_form.pdf")
  ```

  ```python Python (requests) theme={null}
  import requests
  import json

  url = "https://www.datalab.to/api/v1/fill"
  headers = {"X-API-Key": "YOUR_API_KEY"}

  field_data = {
      "full_name": {"value": "John Doe", "description": "Full legal name"},
      "date": {"value": "2024-01-15", "description": "Today's date"},
  }

  with open("form.pdf", "rb") as f:
      response = requests.post(
          url,
          files={"file": ("form.pdf", f, "application/pdf")},
          data={"field_data": json.dumps(field_data)},
          headers=headers
      )
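  response.raise_for_status()
  check_url = response.json()["request_check_url"]
  # Poll check_url until the status is "complete", as in the conversion example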
  ```
</CodeGroup>
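
If the field values come from another system, the `field_data` structure can be generated rather than written by hand. This sketch builds the same `{"value": ..., "description": ...}` shape used above from simple rows; `build_field_data` and `to_form_payload` are hypothetical helpers.

```python
import json


def build_field_data(rows) -> dict:
    """Build field_data from (field_name, value, description) rows."""
    return {
        name: {"value": value, "description": description}
        for name, value, description in rows
    }


def to_form_payload(field_data: dict) -> dict:
    """Serialize field_data as the form field sent in the requests example."""
    return {"field_data": json.dumps(field_data)}


rows = [
    ("full_name", "John Doe", "Full legal name"),
    ("date", "2024-01-15", "Today's date"),
]
field_data = build_field_data(rows)
payload = to_form_payload(field_data)
```

`field_data` can be passed to `FormFillingOptions(field_data=...)`, and `payload` matches the `data=` argument of the `requests` example.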

## Upload and Manage Files

Upload files to Datalab for use in pipelines:

```python theme={null}
from datalab_sdk import DatalabClient

client = DatalabClient()

# Upload files
uploaded = client.upload_files(["doc1.pdf", "doc2.pdf"])
for file in uploaded:
    print(f"{file.original_filename}: {file.reference}")
    # Output: doc1.pdf: datalab://file-abc123

# List your files
files = client.list_files(limit=50)
print(f"Total files: {files['total']}")
```

## CLI

The SDK includes a command-line interface:

```bash theme={null}
# Convert a single document
datalab convert document.pdf --format markdown

# Convert with options
datalab convert document.pdf --mode accurate --paginate

# Convert a directory
datalab convert ./documents/ --output_dir ./output/
```

## Run a Pipeline

Pipelines chain processors (convert, extract, segment) into a single reusable call. Create them in [Forge](https://www.datalab.to/app/playground) or via the SDK:

```python theme={null}
from datalab_sdk import DatalabClient

client = DatalabClient()

# Run an existing pipeline
execution = client.run_pipeline(
    "pl_abc123",              # Your pipeline ID
    file_path="document.pdf"
)

# Poll until complete
execution = client.get_pipeline_execution(
    execution.execution_id,
    max_polls=300
)

# Get extraction results (step index 1 = extract step)
result = client.get_step_result(execution.execution_id, step_index=1)
print(result)
```

See [Pipelines](/docs/recipes/pipelines/pipeline-overview) for creating, versioning, and running pipelines.

## Async Support

For high-throughput applications, use the async client:

```python theme={null}
import asyncio
from datalab_sdk import AsyncDatalabClient

async def convert_documents():
    async with AsyncDatalabClient() as client:
        result = await client.convert("document.pdf")
        print(result.markdown)

asyncio.run(convert_documents())
```

## Next Steps

<CardGroup>
  <Card title="SDK Reference" icon="code" href="/docs/welcome/sdk">
    Full Python SDK documentation with typed clients and async support.
  </Card>

  <Card title="API Reference" icon="book" href="/docs/welcome/api">
    REST API reference for document conversion, form filling, and file management.
  </Card>

  <Card title="Pipelines" icon="workflow" href="/docs/recipes/pipelines/pipeline-overview">
    Chain processors into versioned, reusable pipelines.
  </Card>

  <Card title="Document Conversion" icon="file-lines" href="/docs/recipes/conversion/conversion-api-overview">
    Detailed guide to converting PDFs and documents to Markdown, HTML, or JSON.
  </Card>
</CardGroup>


# Python SDK
Source: https://documentation.datalab.to/docs/welcome/sdk

The Datalab Python SDK provides a simple interface for document conversion, pipelines, structured extraction, form filling, and file management.

## Installation

```bash theme={null}
pip install datalab-python-sdk
```

Requires Python 3.10 or higher.

## Authentication

Set your API key as an environment variable (recommended):

```bash theme={null}
export DATALAB_API_KEY=your_api_key_here
```

Or pass it directly to the client:

```python theme={null}
from datalab_sdk import DatalabClient

client = DatalabClient(api_key="your_api_key_here")
```

Get your API key from the [API Keys dashboard](https://www.datalab.to/app/keys).

## Quick Example

```python theme={null}
from datalab_sdk import DatalabClient

client = DatalabClient()

# Convert a document to markdown
result = client.convert("document.pdf")
print(result.markdown)

# Save output with images
result.save_output("output/")
```

## Client Options

Both sync and async clients accept the same configuration options:

```python theme={null}
from datalab_sdk import DatalabClient, AsyncDatalabClient

# Synchronous client (blocking)
client = DatalabClient(
    api_key="your_key",           # Or use DATALAB_API_KEY env var
    base_url="https://www.datalab.to",  # API endpoint
    timeout=300,                  # Request timeout in seconds
)

# Asynchronous client (non-blocking)
async_client = AsyncDatalabClient(
    api_key="your_key",
    base_url="https://www.datalab.to",
    timeout=300,
)
```

| Parameter  | Type | Default                   | Description                |
| ---------- | ---- | ------------------------- | -------------------------- |
| `api_key`  | str  | `DATALAB_API_KEY` env var | Your Datalab API key       |
| `base_url` | str  | `https://www.datalab.to`  | API base URL               |
| `timeout`  | int  | `300`                     | Request timeout in seconds |

## Async Support

For high-throughput applications, use `AsyncDatalabClient`:

```python theme={null}
import asyncio
from datalab_sdk import AsyncDatalabClient

async def process_documents():
    async with AsyncDatalabClient() as client:
        result = await client.convert("document.pdf")
        print(result.markdown)

asyncio.run(process_documents())
```

The async client is recommended when processing multiple documents concurrently.
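
One common way to process several documents concurrently is to bound the number of in-flight calls with a semaphore. The sketch below is generic: `run_limited` is a hypothetical helper that accepts any async callable, and the default limit of 5 is an arbitrary assumption rather than a documented recommendation.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def run_limited(
    worker: Callable[[str], Awaitable],
    paths: Iterable[str],
    limit: int = 5,
) -> list:
    """Run `worker(path)` for every path, with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(path: str):
        async with semaphore:  # blocks while `limit` workers are already running
            return await worker(path)

    # gather() returns results in the same order as `paths`
    return await asyncio.gather(*(guarded(path) for path in paths))


# Usage with the async client (not executed here):
# async with AsyncDatalabClient() as client:
#     results = await run_limited(client.convert, ["a.pdf", "b.pdf"], limit=5)
```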

## Error Handling

The SDK raises specific exceptions for different error types:

```python theme={null}
from datalab_sdk import DatalabClient
from datalab_sdk.exceptions import (
    DatalabAPIError,
    DatalabTimeoutError,
    DatalabFileError,
    DatalabValidationError,
)

client = DatalabClient()

try:
    result = client.convert("document.pdf")
except DatalabAPIError as e:
    print(f"API error {e.status_code}: {e.response_data}")
except DatalabTimeoutError:
    print("Request timed out")
except DatalabFileError as e:
    print(f"File error: {e}")
except DatalabValidationError as e:
    print(f"Invalid input: {e}")
```

| Exception                | Description                                                                 |
| ------------------------ | --------------------------------------------------------------------------- |
| `DatalabAPIError`        | API returned an error response (includes `status_code` and `response_data`) |
| `DatalabTimeoutError`    | Request exceeded timeout                                                    |
| `DatalabFileError`       | File not found or cannot be read                                            |
| `DatalabValidationError` | Invalid parameters provided                                                 |

## Automatic Retries

The SDK automatically retries requests for:

* `408` Request Timeout
* `429` Rate Limit Exceeded
* `5xx` Server Errors

Retries use exponential backoff. You can control polling behavior with `max_polls` and `poll_interval` parameters on individual methods.
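
For code that calls the REST API without the SDK, a comparable policy can be sketched in a few lines. This is an illustration of the rule described above, not the SDK's implementation; the base delay, cap, and retry count are assumptions.

```python
def is_retryable(status_code: int) -> bool:
    """Return True for the status codes listed above: 408, 429, and 5xx."""
    return status_code in (408, 429) or 500 <= status_code <= 599


def backoff_delays(max_retries: int = 5, base: float = 1.0, cap: float = 30.0) -> list:
    """Exponential backoff schedule: base, 2*base, 4*base, ... limited to `cap`."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]


# With the defaults, the wait before each retry doubles: 1, 2, 4, 8, 16 seconds.
delays = backoff_delays()
```

A caller would sleep for the next delay whenever `is_retryable` returns `True`, and stop once the schedule is exhausted.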

## SDK Features

<CardGroup>
  <Card title="Document Conversion" icon="file-lines" href="/docs/welcome/sdk/conversion">
    Convert PDFs, images, and documents to Markdown, HTML, JSON, or chunks.
  </Card>

  <Card title="Structured Extraction" icon="table" href="/docs/welcome/sdk/extraction">
    Extract structured data from documents using JSON schemas.
  </Card>

  <Card title="Document Segmentation" icon="scissors" href="/docs/welcome/sdk/segmentation">
    Segment documents into logical sections.
  </Card>

  <Card title="Form Filling" icon="pen-to-square" href="/docs/welcome/sdk/form-filling">
    Fill PDF and image forms with structured field data.
  </Card>

  <Card title="Pipelines" icon="workflow" href="/docs/welcome/sdk/pipelines">
    Chain processors into versioned, reusable pipelines.
  </Card>

  <Card title="File Management" icon="folder-open" href="/docs/welcome/sdk/file-management">
    Upload, list, and manage files in Datalab storage.
  </Card>

  <Card title="CLI" icon="terminal" href="/docs/welcome/sdk/cli">
    Command-line interface for document conversion.
  </Card>
</CardGroup>

## Method Summary

| Method                             | Description                                                 |
| ---------------------------------- | ----------------------------------------------------------- |
| `convert()`                        | Convert documents to markdown, HTML, JSON, or chunks        |
| `extract()`                        | Extract structured data from documents using JSON schemas   |
| `segment()`                        | Segment documents into sections using a schema              |
| `track_changes()`                  | Extract tracked changes from DOCX documents                 |
| `create_document()`                | Create DOCX from markdown with track changes                |
| `run_custom_processor()`           | Execute a custom processor on a document                    |
| `fill()`                           | Fill PDF or image forms with field data                     |
| `upload_files()`                   | Upload files to Datalab storage                             |
| `list_files()`                     | List uploaded files                                         |
| `get_file_metadata()`              | Get metadata for a specific file                            |
| `get_file_download_url()`          | Generate presigned download URL                             |
| `delete_file()`                    | Delete an uploaded file                                     |
| `create_pipeline()`                | Create a new pipeline                                       |
| `list_pipelines()`                 | List pipelines for your team                                |
| `get_pipeline()`                   | Get a pipeline by ID                                        |
| `update_pipeline()`                | Update pipeline steps (creates a draft)                     |
| `save_pipeline()`                  | Promote a pipeline draft to a named, published version      |
| `archive_pipeline()`               | Archive a pipeline                                          |
| `unarchive_pipeline()`             | Restore an archived pipeline                                |
| `create_pipeline_version()`        | Snapshot the current pipeline steps as an immutable version |
| `list_pipeline_versions()`         | List all versions of a pipeline                             |
| `discard_pipeline_draft()`         | Discard draft changes and revert to a published version     |
| `get_pipeline_rate()`              | Get per-page rate for a pipeline                            |
| `run_pipeline()`                   | Execute a pipeline on a file                                |
| `get_pipeline_execution()`         | Poll pipeline execution status                              |
| `list_pipeline_executions()`       | List recent executions for a pipeline                       |
| `get_step_result()`                | Fetch the result of a specific pipeline step                |
| `list_custom_processors()`         | List custom processors for your team                        |
| `get_custom_processor_status()`    | Check custom processor generation status                    |
| `list_custom_processor_versions()` | List versions of a custom processor                         |
| `set_active_processor_version()`   | Set the active version of a custom processor                |
| `archive_custom_processor()`       | Archive a custom processor                                  |
| `create_extraction_schema()`       | Create a reusable extraction schema                         |
| `list_extraction_schemas()`        | List saved extraction schemas                               |
| `get_extraction_schema()`          | Get a schema by ID                                          |
| `update_extraction_schema()`       | Update schema fields or create a new version                |
| `delete_extraction_schema()`       | Archive (soft-delete) an extraction schema                  |
| `delete_workflow()`                | Delete a workflow definition                                |
| `run_custom_pipeline()`            | *(Deprecated)* Use `run_custom_processor()` instead         |
| `ocr()`                            | *(Deprecated)* Use `convert()` instead                      |

## Next Steps

<CardGroup>
  <Card title="Document Conversion" icon="file-lines" href="/docs/welcome/sdk/conversion">
    Convert PDFs, images, and documents to Markdown, HTML, JSON, or chunks.
  </Card>

  <Card title="Structured Extraction" icon="table" href="/docs/welcome/sdk/extraction">
    Extract structured data from documents using JSON schemas.
  </Card>

  <Card title="Document Segmentation" icon="scissors" href="/docs/welcome/sdk/segmentation">
    Segment documents into logical sections.
  </Card>

  <Card title="Form Filling" icon="pen-to-square" href="/docs/welcome/sdk/form-filling">
    Fill PDF and image forms with structured field data.
  </Card>

  <Card title="Pipelines" icon="workflow" href="/docs/welcome/sdk/pipelines">
    Chain processors into versioned, reusable pipelines.
  </Card>

  <Card title="File Management" icon="folder-open" href="/docs/welcome/sdk/file-management">
    Upload, list, and manage files in Datalab storage.
  </Card>
</CardGroup>


# Command Line Interface
Source: https://documentation.datalab.to/docs/welcome/sdk/cli

Use the Datalab CLI to convert documents from the command line.

## Installation

The CLI is included with the SDK:

```bash theme={null}
pip install datalab-python-sdk
```

## Authentication

Set your API key as an environment variable:

```bash theme={null}
export DATALAB_API_KEY=your_api_key_here
```

Or pass it with each command:

```bash theme={null}
datalab convert document.pdf --api_key YOUR_API_KEY
```

## Convert Documents

Convert documents to markdown, HTML, JSON, or chunks.

### Basic Usage

```bash theme={null}
# Convert a single file
datalab convert document.pdf

# Convert to specific format
datalab convert document.pdf --format html

# Convert with processing mode
datalab convert document.pdf --mode accurate
```

### Output Options

```bash theme={null}
# Save to specific directory
datalab convert document.pdf --output_dir ./output/

# Output formats
datalab convert document.pdf --format markdown
datalab convert document.pdf --format html
datalab convert document.pdf --format json
datalab convert document.pdf --format chunks
```

### Processing Options

```bash theme={null}
# Processing modes
datalab convert document.pdf --mode fast       # Lowest latency (default)
datalab convert document.pdf --mode balanced   # Balance of speed and accuracy
datalab convert document.pdf --mode accurate   # Highest accuracy

# Limit pages
datalab convert document.pdf --max_pages 10

# Specific page range (0-indexed)
datalab convert document.pdf --page_range "0-5,10,15-20"

# For spreadsheets, page_range filters by sheet index
datalab convert workbook.xlsx --page_range "0,2"

# Add page delimiters
datalab convert document.pdf --paginate
```
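
To check a `page_range` value before submitting it, the string can be expanded locally. This sketch assumes that ranges such as `0-5` include both endpoints, which the examples suggest but do not state explicitly; `parse_page_range` is a hypothetical helper, not part of the CLI or SDK.

```python
def parse_page_range(spec: str) -> list:
    """Expand a spec like "0-5,10,15-20" into a sorted list of 0-indexed pages."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))  # assumption: inclusive end
        else:
            pages.add(int(part))  # raises ValueError on empty or non-numeric parts
    return sorted(pages)
```

For example, `parse_page_range("0,2")` yields `[0, 2]`, matching the two sheets selected in the spreadsheet example.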

### Advanced Options

```bash theme={null}
# Add block IDs for citations (HTML only)
datalab convert document.pdf --format html --add_block_ids

# Disable image extraction
datalab convert document.pdf --disable_image_extraction

# Disable image captions
datalab convert document.pdf --disable_image_captions

# Skip cached results
datalab convert document.pdf --skip_cache
```

### Directory Processing

Convert all documents in a directory:

```bash theme={null}
# Convert all supported files
datalab convert ./documents/ --output_dir ./output/

# Filter by extension
datalab convert ./documents/ --extensions pdf,docx

# Control concurrency
datalab convert ./documents/ --max_concurrent 5
```
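
To preview which files a filter such as `--extensions pdf,docx` might cover, the directory can be scanned locally first. This is only a sketch: it assumes case-insensitive matching and no recursion into subdirectories, and the CLI's actual matching rules may differ.

```python
from pathlib import Path


def select_files(directory: str, extensions: str) -> list:
    """List files in `directory` whose suffix is in the comma-separated list."""
    wanted = {
        "." + ext.strip().lstrip(".").lower()
        for ext in extensions.split(",")
        if ext.strip()
    }
    return sorted(
        path
        for path in Path(directory).iterdir()  # top level only (assumption)
        if path.is_file() and path.suffix.lower() in wanted
    )
```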

### Convert Command Reference

| Option                       | Description                                         |
| ---------------------------- | --------------------------------------------------- |
| `--format`                   | Output format: `markdown`, `html`, `json`, `chunks` |
| `--mode`                     | Processing mode: `fast`, `balanced`, `accurate`     |
| `--output_dir`, `-o`         | Output directory                                    |
| `--max_pages`                | Maximum pages to process                            |
| `--page_range`               | Specific pages (e.g., `"0-5,10"`)                   |
| `--paginate`                 | Add page delimiters                                 |
| `--add_block_ids`            | Add block IDs to HTML output                        |
| `--disable_image_extraction` | Don't extract images                                |
| `--disable_image_captions`   | Don't generate image captions                       |
| `--skip_cache`               | Force reprocessing                                  |
| `--extensions`               | File extensions to process (for directories)        |
| `--max_concurrent`           | Maximum concurrent requests                         |
| `--max_polls`                | Maximum polling attempts                            |
| `--poll_interval`            | Seconds between polls                               |
| `--api_key`                  | Datalab API key                                     |
| `--base_url`                 | API base URL                                        |

## Extract Structured Data

Extract structured data from documents using a JSON schema.

### Basic Usage

```bash theme={null}
# Extract data using a page schema
datalab extract invoice.pdf \
  --page_schema '{"invoice_number": {"type": "string"}, "total": {"type": "number"}}'

# Extract with a specific mode
datalab extract invoice.pdf \
  --page_schema '{"title": {"type": "string"}}' \
  --mode accurate

# Extract using a checkpoint from a previous conversion
datalab extract invoice.pdf \
  --page_schema '{"total": {"type": "number"}}' \
  --checkpoint_id "ckpt_abc123"
```
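
Quoting JSON on the command line is error-prone, so a script can build the argument list instead. The sketch below uses only the `--page_schema` and `--mode` options shown above; `build_extract_command` is a hypothetical helper, and the command is printed rather than executed.

```python
import json
import shlex
from typing import Optional


def build_extract_command(path: str, schema: dict, mode: Optional[str] = None) -> list:
    """Build the argv list for `datalab extract` with a JSON-encoded schema."""
    command = ["datalab", "extract", path, "--page_schema", json.dumps(schema)]
    if mode is not None:
        command += ["--mode", mode]
    return command


schema = {"invoice_number": {"type": "string"}, "total": {"type": "number"}}
command = build_extract_command("invoice.pdf", schema, mode="accurate")

# shlex.join shows a shell-safe version of the command for logging.
print(shlex.join(command))
```

Passing `command` to `subprocess.run(command, check=True)` would run it without involving shell quoting at all.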

### Extract Command Reference

| Option               | Description                                           |
| -------------------- | ----------------------------------------------------- |
| `--page_schema`      | **(Required)** JSON schema defining fields to extract |
| `--checkpoint_id`    | Checkpoint ID from a previous conversion              |
| `--format`           | Output format: `markdown`, `html`, `json`, `chunks`   |
| `--mode`             | Processing mode: `fast`, `balanced`, `accurate`       |
| `--output_dir`, `-o` | Output directory                                      |
| `--max_pages`        | Maximum pages to process                              |
| `--page_range`       | Specific pages (e.g., `"0-5,10"`)                     |
| `--skip_cache`       | Force reprocessing                                    |
| `--api_key`          | Datalab API key                                       |
| `--base_url`         | API base URL                                          |

## Segment Documents

Segment documents into logical sections using a schema.

### Basic Usage

```bash theme={null}
# Segment a document
datalab segment report.pdf \
  --segmentation_schema '{"sections": [{"name": "intro", "description": "Introduction"}, {"name": "body", "description": "Main content"}]}'

# Segment with a checkpoint
datalab segment report.pdf \
  --segmentation_schema '{"sections": [{"name": "summary", "description": "Executive summary"}]}' \
  --checkpoint_id "ckpt_abc123"
```
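
The segmentation schema in these examples is a `sections` list of `name` and `description` pairs, so it can be generated from plain tuples. This sketch adds a local sanity check for empty values; that check is an assumption for illustration, not a documented server rule, and `build_segmentation_schema` is a hypothetical helper.

```python
import json


def build_segmentation_schema(sections) -> str:
    """Build the JSON string for --segmentation_schema from (name, description) pairs."""
    entries = []
    for name, description in sections:
        if not name or not description:
            raise ValueError("each section needs a name and a description")
        entries.append({"name": name, "description": description})
    if not entries:
        raise ValueError("at least one section is required")
    return json.dumps({"sections": entries})


schema_json = build_segmentation_schema(
    [("intro", "Introduction"), ("body", "Main content")]
)
```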

### Segment Command Reference

| Option                  | Description                                                        |
| ----------------------- | ------------------------------------------------------------------ |
| `--segmentation_schema` | **(Required)** JSON schema defining segment names and descriptions |
| `--checkpoint_id`       | Checkpoint ID from a previous conversion                           |
| `--mode`                | Processing mode: `fast`, `balanced`, `accurate`                    |
| `--output_dir`, `-o`    | Output directory                                                   |
| `--max_pages`           | Maximum pages to process                                           |
| `--page_range`          | Specific pages (e.g., `"0-5,10"`)                                  |
| `--skip_cache`          | Force reprocessing                                                 |
| `--api_key`             | Datalab API key                                                    |
| `--base_url`            | API base URL                                                       |

## Track Changes

Extract tracked changes from DOCX documents.

### Basic Usage

```bash theme={null}
# Extract tracked changes from a Word document
datalab track-changes contract.docx

# Specify output format
datalab track-changes contract.docx --format html

# With pagination
datalab track-changes contract.docx --format html --paginate
```

### Track Changes Command Reference

| Option               | Description                                                                       |
| -------------------- | --------------------------------------------------------------------------------- |
| `--format`           | Comma-separated output formats: `markdown`, `html`, `chunks` (default: all three) |
| `--paginate`         | Add page delimiters to output                                                     |
| `--output_dir`, `-o` | Output directory                                                                  |
| `--api_key`          | Datalab API key                                                                   |
| `--base_url`         | API base URL                                                                      |

## Custom Processor

<Warning>
  The `custom-pipeline` CLI command is deprecated. It continues to work and calls the new `/api/v1/custom-processor` endpoint internally, but the command name itself will be updated in a future SDK release.
</Warning>

Execute a custom processor on a document.

### Basic Usage

```bash theme={null}
# Run a custom processor
datalab custom-pipeline document.pdf --pipeline_id "cp_XXXXX"

# Run with evaluation
datalab custom-pipeline document.pdf \
  --pipeline_id "cp_XXXXX" \
  --run_eval

# Specify format and mode
datalab custom-pipeline document.pdf \
  --pipeline_id "cp_XXXXX" \
  --format json \
  --mode accurate
```

### Custom Processor Command Reference

| Option               | Description                                         |
| -------------------- | --------------------------------------------------- |
| `--pipeline_id`      | **(Required)** Custom processor ID (`cp_XXXXX`)     |
| `--run_eval`         | Run evaluation rules for the processor              |
| `--format`           | Output format: `markdown`, `html`, `json`, `chunks` |
| `--mode`             | Processing mode: `fast`, `balanced`, `accurate`     |
| `--output_dir`, `-o` | Output directory                                    |
| `--api_key`          | Datalab API key                                     |
| `--base_url`         | API base URL                                        |

## Create Document

Create a DOCX document from markdown with track changes.

### Basic Usage

```bash theme={null}
# Create a document from a markdown file
datalab create-document --markdown input.md --output output.docx

# Create a document from inline markdown content
datalab create-document \
  --markdown $'# Title\n\nDocument content here.' \
  --output document.docx
```

### Create Document Command Reference

| Option           | Description                                                |
| ---------------- | ---------------------------------------------------------- |
| `--markdown`     | **(Required)** Markdown content or path to a markdown file |
| `--output`, `-o` | **(Required)** Output file path for the generated DOCX     |
| `--api_key`      | Datalab API key                                            |
| `--base_url`     | API base URL                                               |

## Examples

### Batch Convert PDFs

```bash theme={null}
# Convert all PDFs in a directory with accurate mode
datalab convert ./invoices/ \
  --extensions pdf \
  --mode accurate \
  --format json \
  --output_dir ./processed/
```

### Extract Data from Documents

```bash theme={null}
# Extract structured data using a schema
datalab extract invoice.pdf \
  --page_schema '{
    "invoice_number": {"type": "string", "description": "Invoice ID"},
    "total": {"type": "number", "description": "Total amount"},
    "vendor": {"type": "string", "description": "Vendor name"}
  }' \
  --mode balanced \
  --output_dir ./extracted/
```

### High-Throughput Processing

```bash theme={null}
# Process many files with high concurrency
datalab convert ./documents/ \
  --max_concurrent 10 \
  --mode fast \
  --output_dir ./output/
```

## Getting Help

```bash theme={null}
# General help
datalab --help

# Command-specific help
datalab convert --help
datalab extract --help
datalab segment --help
datalab track-changes --help
datalab custom-pipeline --help
datalab create-document --help
```

## Next Steps

<CardGroup>
  <Card title="Quickstart" icon="bolt" href="/docs/welcome/quickstart">
    Get up and running with Datalab in minutes.
  </Card>

  <Card title="Batch Processing" icon="layer-group" href="/docs/recipes/conversion/batch-documents">
    Process multiple documents efficiently in parallel.
  </Card>

  <Card title="SDK Reference" icon="code" href="/docs/welcome/sdk">
    Explore the full Python SDK for advanced usage.
  </Card>

  <Card title="Supported File Types" icon="file-circle-check" href="/docs/common/supportedfiletypes">
    See all document formats supported by Datalab.
  </Card>
</CardGroup>


# Document Conversion
Source: https://documentation.datalab.to/docs/welcome/sdk/conversion

Convert PDFs, images, and documents to Markdown, HTML, JSON, or chunks using the Datalab SDK.

## Basic Usage

```python theme={null}
from datalab_sdk import DatalabClient

client = DatalabClient()

# Convert to markdown (default)
result = client.convert("document.pdf")
print(result.markdown)

# Convert from URL
result = client.convert(file_url="https://example.com/document.pdf")
print(result.markdown)
```

## Conversion Options

Use `ConvertOptions` to control the conversion:

```python theme={null}
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

options = ConvertOptions(
    output_format="markdown",     # Output format
    mode="balanced",              # Processing mode
    paginate=True,                # Add page delimiters
    max_pages=10,                 # Limit pages processed
    page_range="0-5,10",          # Specific pages (0-indexed)
)

result = client.convert("document.pdf", options=options)
```

### All Options

| Option                        | Type | Default      | Description                                                                                                                                                                                                                                           |
| ----------------------------- | ---- | ------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `output_format`               | str  | `"markdown"` | Output format: `"markdown"`, `"html"`, `"json"`, `"chunks"`                                                                                                                                                                                           |
| `mode`                        | str  | `"fast"`     | Processing mode: `"fast"`, `"balanced"`, `"accurate"`                                                                                                                                                                                                 |
| `paginate`                    | bool | `False`      | Add page delimiters to output                                                                                                                                                                                                                         |
| `max_pages`                   | int  | None         | Maximum number of pages to process                                                                                                                                                                                                                    |
| `page_range`                  | str  | None         | Specific pages to process (e.g., `"0-5,10,15-20"`). For spreadsheets, filters by sheet index.                                                                                                                                                         |
| `skip_cache`                  | bool | `False`      | Skip cached results, force reprocessing                                                                                                                                                                                                               |
| `disable_image_extraction`    | bool | `False`      | Don't extract images from document                                                                                                                                                                                                                    |
| `disable_image_captions`      | bool | `False`      | Don't generate captions for images                                                                                                                                                                                                                    |
| `token_efficient_markdown`    | bool | `False`      | Optimize markdown output for LLM token usage                                                                                                                                                                                                          |
| `fence_synthetic_captions`    | bool | `False`      | Fence synthetic image captions                                                                                                                                                                                                                        |
| `include_markdown_in_chunks`  | bool | `False`      | Include markdown in chunks/JSON output                                                                                                                                                                                                                |
| `save_checkpoint`             | bool | `False`      | Save intermediate checkpoint for reuse                                                                                                                                                                                                                |
| `extras`                      | str  | None         | Comma-separated features: `"track_changes"`, `"chart_understanding"`, `"extract_links"`, `"table_cell_bboxes"`, `"list_item_bboxes"`, `"infographic"`, `"new_block_types"`. (`"table_row_bboxes"` is deprecated — use `"table_cell_bboxes"` instead.) |
| `add_block_ids`               | bool | `False`      | Add block IDs to HTML output for citations                                                                                                                                                                                                            |
| `keep_spreadsheet_formatting` | bool | `False`      | Preserve spreadsheet styling in HTML output                                                                                                                                                                                                           |
| `webhook_url`                 | str  | None         | Override account webhook URL for this request                                                                                                                                                                                                         |
| `additional_config`           | dict | None         | Additional configuration options                                                                                                                                                                                                                      |

<Tip>
  Use `save_checkpoint=True` to save the parsed document state. Then call `client.extract()` or `client.segment()` with the returned `checkpoint_id` to run extraction or segmentation without re-parsing.
</Tip>
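
The `page_range` strings in the table above (for example `"0-5,10,15-20"`) can be expanded locally before a request is sent, which helps when estimating how many pages a call will touch. The sketch below is not part of the SDK: `parse_page_range` is a hypothetical helper, and it assumes that ranges are inclusive on both ends, which the table does not state explicitly.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a page_range string such as "0-5,10" into sorted 0-indexed pages.

    Assumption: "a-b" includes both a and b.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Range start exceeds end: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_range("0-5,10"))  # [0, 1, 2, 3, 4, 5, 10]
```

The length of the returned list gives a rough page count to compare against `max_pages`.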

### Processing Modes

| Mode       | Description                   | Use Case                                 |
| ---------- | ----------------------------- | ---------------------------------------- |
| `fast`     | Lowest latency (default)      | Simple documents, real-time applications |
| `balanced` | Balance of speed and accuracy | General use                              |
| `accurate` | Highest accuracy              | Complex layouts, tables, figures         |

### Output Formats

| Format     | Description                                |
| ---------- | ------------------------------------------ |
| `markdown` | Clean markdown with headers, lists, tables |
| `html`     | Structured HTML preserving layout          |
| `json`     | Block-level structure with bounding boxes  |
| `chunks`   | Pre-chunked output for RAG applications    |

## Conversion Result

The `ConversionResult` object contains the converted content and metadata:

```python theme={null}
result = client.convert("document.pdf")

# Access content based on output format
print(result.markdown)        # Markdown output
print(result.html)            # HTML output
print(result.json)            # JSON structure
print(result.chunks)          # Chunked output

# Metadata
print(result.success)         # True if conversion succeeded
print(result.page_count)      # Number of pages processed
print(result.images)          # Dict of extracted images (filename -> base64)
print(result.metadata)        # Document metadata
print(result.parse_quality_score)  # Quality score (0-5)
print(result.cost_breakdown)  # Cost breakdown, in cents
```
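
Because `result.images` maps filenames to base64 data, the images can also be written to disk by hand when `save_output` is not a good fit. The following is a minimal sketch; `write_images` is a hypothetical helper, not an SDK function, and it assumes the dict values are plain base64 strings without a `data:` URI prefix.

```python
import base64
from pathlib import Path


def write_images(images: dict[str, str], out_dir: str) -> list[Path]:
    """Decode {filename: base64_data} and write each image into out_dir."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, encoded in images.items():
        # Keep only the final path component so a filename cannot escape out_dir.
        path = target / Path(filename).name
        path.write_bytes(base64.b64decode(encoded))
        written.append(path)
    return written
```

A call such as `write_images(result.images, "output/images")` would then return the list of paths that were written.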

### Result Fields

| Field                 | Type  | Description                                          |
| --------------------- | ----- | ---------------------------------------------------- |
| `success`             | bool  | Whether conversion succeeded                         |
| `markdown`            | str   | Markdown output (if format is markdown)              |
| `html`                | str   | HTML output (if format is html)                      |
| `json`                | dict  | JSON output (if format is json)                      |
| `chunks`              | dict  | Chunked output (if format is chunks)                 |
| `images`              | dict  | Extracted images as `{filename: base64_data}`        |
| `metadata`            | dict  | Document metadata                                    |
| `page_count`          | int   | Number of pages processed                            |
| `parse_quality_score` | float | Quality score from 0-5                               |
| `cost_breakdown`      | dict  | Cost details (`list_cost_cents`, `final_cost_cents`) |
| `checkpoint_id`       | str   | Checkpoint ID if `save_checkpoint` was True          |
| `error`               | str   | Error message if conversion failed                   |

## Saving Output

Save the conversion result to files:

```python theme={null}
# Save during conversion
result = client.convert("document.pdf", save_output="output/document")

# Or save afterward
result.save_output("output/document", save_images=True)
```

This creates:

* `document.md` (or `.html`, `.json` based on format)
* `document_images/` directory with extracted images (if `save_images=True`)

## Async Usage

For high-throughput applications, use `AsyncDatalabClient` as an async context manager:

```python theme={null}
import asyncio
from datalab_sdk import AsyncDatalabClient, ConvertOptions

async def convert_documents():
    async with AsyncDatalabClient() as client:
        options = ConvertOptions(mode="fast", max_pages=5)
        result = await client.convert("document.pdf", options=options)
        return result.markdown

markdown = asyncio.run(convert_documents())
```
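
To convert several documents at once, the awaitable call can be fanned out with a concurrency cap. This is a sketch under assumptions: `convert_all` is a hypothetical helper, the `convert` argument stands in for a coroutine function such as a wrapper around `client.convert`, and it presumes the async client tolerates concurrent calls, which this page does not confirm.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def convert_all(
    convert: Callable[[str], Awaitable[T]],
    paths: list[str],
    max_concurrency: int = 4,
) -> list[T]:
    """Run convert(path) for every path, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(path: str) -> T:
        async with semaphore:
            return await convert(path)

    # gather returns results in the same order as the input paths.
    return await asyncio.gather(*(bounded(p) for p in paths))
```

Inside `async with AsyncDatalabClient() as client:`, this could be called as `await convert_all(client.convert, ["a.pdf", "b.pdf"])`. Keeping the cap low is a conservative choice with respect to rate limits.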

## Polling Configuration

Control polling behavior for long-running conversions:

```python theme={null}
result = client.convert(
    "large_document.pdf",
    max_polls=600,        # Maximum polling attempts (default: 300)
    poll_interval=2,      # Seconds between polls (default: 1)
)
```
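
The two settings together bound how long a call may wait. As a rough sketch, assuming the client sleeps `poll_interval` seconds between attempts and gives up after `max_polls` attempts (request latency is ignored, and `max_wait_seconds` is a hypothetical helper):

```python
def max_wait_seconds(max_polls: int = 300, poll_interval: float = 1) -> float:
    """Approximate upper bound on time spent waiting between polls."""
    return max_polls * poll_interval


print(max_wait_seconds())        # 300  (defaults: about 5 minutes)
print(max_wait_seconds(600, 2))  # 1200 (the example above: about 20 minutes)
```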

## Special Features

### Track Changes (Word Documents)

Extract tracked changes and comments from DOCX files:

```python theme={null}
options = ConvertOptions(
    output_format="html",
    extras="track_changes",
)
result = client.convert("contract.docx", options=options)
# HTML contains <ins>, <del>, and <comment> tags
```
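
Once the HTML is available, the tracked changes can be pulled out with the standard library. The sketch below runs on a hand-written sample string; the exact markup Datalab emits (attributes, nesting) may differ, and only the tag names `<ins>`, `<del>`, and `<comment>` are taken from this page.

```python
from html.parser import HTMLParser


class TrackedChangeCollector(HTMLParser):
    """Collect (tag, text) pairs for text inside <ins>, <del>, and <comment>."""

    TRACKED = {"ins", "del", "comment"}

    def __init__(self) -> None:
        super().__init__()
        self.changes: list[tuple[str, str]] = []
        self._open: list[str] = []  # stack of currently open tracked tags

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        if self._open and data.strip():
            self.changes.append((self._open[-1], data.strip()))


# Hypothetical sample, standing in for result.html
sample_html = (
    "<p>Payment due in <del>30</del><ins>45</ins> days."
    "<comment>Confirm with legal</comment></p>"
)
collector = TrackedChangeCollector()
collector.feed(sample_html)
collector.close()
print(collector.changes)
# [('del', '30'), ('ins', '45'), ('comment', 'Confirm with legal')]
```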

### Chart Understanding

Extract data from charts and graphs:

```python theme={null}
options = ConvertOptions(
    extras="chart_understanding",
)
result = client.convert("report.pdf", options=options)
```

### Block IDs for Citations

Add block IDs for tracking content back to source locations:

```python theme={null}
options = ConvertOptions(
    output_format="html",
    add_block_ids=True,
)
result = client.convert("document.pdf", options=options)
# HTML elements include data-block-id attributes
```
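
For citation lookups, it can help to index the HTML by block ID. The following sketch builds a `{block_id: text}` map with the standard library; the sample markup and the ID values `b0` and `b1` are made up for illustration, and only the `data-block-id` attribute name comes from this page.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class BlockIndexer(HTMLParser):
    """Map each data-block-id to the text contained in that element."""

    def __init__(self) -> None:
        super().__init__()
        self.blocks: dict[str, str] = {}
        self._stack: list = []  # block ID (or None) for every open element

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._stack.append(dict(attrs).get("data-block-id"))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Attribute text to the nearest enclosing element that has a block ID.
        for block_id in reversed(self._stack):
            if block_id is not None:
                self.blocks[block_id] = self.blocks.get(block_id, "") + data
                break


# Hypothetical sample, standing in for result.html
sample_html = (
    '<h1 data-block-id="b0">Report</h1>'
    '<p data-block-id="b1">Revenue grew <b>12%</b>.</p>'
)
indexer = BlockIndexer()
indexer.feed(sample_html)
indexer.close()
print(indexer.blocks)
# {'b0': 'Report', 'b1': 'Revenue grew 12%.'}
```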

### Structured Extraction

For structured data extraction, use the dedicated [`client.extract()`](/docs/welcome/sdk/extraction) method.

## Next Steps

<CardGroup>
  <Card title="Structured Extraction Recipe" icon="table" href="/docs/recipes/structured-extraction/api-overview">
    Extract structured data from documents using JSON schemas.
  </Card>

  <Card title="Batch Processing" icon="layer-group" href="/docs/recipes/conversion/batch-documents">
    Process multiple documents efficiently in parallel.
  </Card>

  <Card title="Form Filling SDK" icon="pen-to-square" href="/docs/welcome/sdk/form-filling">
    Programmatically fill PDF and image forms with field data.
  </Card>

  <Card title="CLI Reference" icon="terminal" href="/docs/welcome/sdk/cli">
    Convert documents from the command line.
  </Card>
</CardGroup>


# Structured Extraction
Source: https://documentation.datalab.to/docs/welcome/sdk/extraction

Extract structured data from documents using JSON schemas with the Datalab SDK.

## Basic Usage

```python theme={null}
import json
from datalab_sdk import DatalabClient, ExtractOptions

client = DatalabClient()

# Define a JSON schema for extraction
page_schema = json.dumps({
    "invoice_number": {"type": "string", "description": "Invoice number"},
    "total": {"type": "number", "description": "Total amount due"},
    "vendor": {"type": "string", "description": "Vendor or company name"},
    "items": {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "description": {"type": "string"},
                "amount": {"type": "number"}
            }
        }
    }
})

options = ExtractOptions(page_schema=page_schema)
result = client.extract("invoice.pdf", options=options)

# Access the extracted data
extracted = json.loads(result.extraction_schema_json)
print(extracted)
```

## Extract Options

Use `ExtractOptions` to configure extraction behavior:

| Option            | Type | Default      | Description                                                                                       |
| ----------------- | ---- | ------------ | ------------------------------------------------------------------------------------------------- |
| `page_schema`     | str  | **Required** | JSON schema of fields to extract. Required unless `schema_id` is set; mutually exclusive with it. |
| `schema_id`       | str  | None         | ID of a saved extraction schema (e.g. `sch_k8Hx9mP2nQ4v`). Mutually exclusive with `page_schema`. |
| `schema_version`  | int  | None         | Schema version to pin to. Only valid with `schema_id`.                                            |
| `checkpoint_id`   | str  | None         | Checkpoint ID from a previous `convert()` call                                                    |
| `mode`            | str  | `"fast"`     | Parse mode: `"fast"`, `"balanced"`, `"accurate"`. Controls document parsing quality.              |
| `output_format`   | str  | `"markdown"` | Output format: `"markdown"`, `"html"`, `"json"`, `"chunks"`                                       |
| `save_checkpoint` | bool | `False`      | Save checkpoint for reuse with subsequent calls                                                   |
| `max_pages`       | int  | None         | Maximum number of pages to process                                                                |
| `page_range`      | str  | None         | Specific pages to process (e.g., `"0-5,10"`). For spreadsheets, filters by sheet index.           |
| `skip_cache`      | bool | `False`      | Skip cached results, force reprocessing                                                           |
| `webhook_url`     | str  | None         | Webhook URL for completion notification                                                           |

<Note>
  To control the **extraction pipeline mode** (fast vs. balanced), pass `extraction_mode` as a form field via the REST API directly — it is not yet exposed in `ExtractOptions`. See [Balanced Extraction Mode](/docs/recipes/structured-extraction/balanced-mode) for details on the two modes.
</Note>
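
The constraints between `page_schema`, `schema_id`, and `schema_version` can be checked before a request is made. This is a sketch of such a pre-flight check, not SDK behavior: `check_schema_args` is a hypothetical helper, and treating "neither option given" as an error is an assumption drawn from the table above.

```python
from typing import Optional


def check_schema_args(
    page_schema: Optional[str] = None,
    schema_id: Optional[str] = None,
    schema_version: Optional[int] = None,
) -> None:
    """Raise ValueError if the schema options contradict each other."""
    # Exactly one of page_schema / schema_id must be provided.
    if (page_schema is None) == (schema_id is None):
        raise ValueError("Provide exactly one of page_schema or schema_id")
    # A version pin only makes sense for a saved schema.
    if schema_version is not None and schema_id is None:
        raise ValueError("schema_version is only valid with schema_id")


check_schema_args(page_schema='{"title": {"type": "string"}}')  # passes
check_schema_args(schema_id="sch_k8Hx9mP2nQ4v", schema_version=2)  # passes
```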

## Checkpoint Reuse

Use checkpoints to avoid re-parsing a document when running extraction after conversion. First convert with `save_checkpoint=True`, then extract using the returned `checkpoint_id`:

```python theme={null}
import json
from datalab_sdk import DatalabClient, ConvertOptions, ExtractOptions

client = DatalabClient()

# Step 1: Convert and save a checkpoint
convert_options = ConvertOptions(
    mode="accurate",
    save_checkpoint=True,
)
convert_result = client.convert("report.pdf", options=convert_options)
print(convert_result.markdown)

# Step 2: Extract using the checkpoint (no re-parsing needed)
page_schema = json.dumps({
    "title": {"type": "string", "description": "Document title"},
    "author": {"type": "string", "description": "Author name"},
    "date": {"type": "string", "description": "Publication date"},
    "summary": {"type": "string", "description": "Brief summary of the document"},
})

extract_options = ExtractOptions(
    page_schema=page_schema,
    checkpoint_id=convert_result.checkpoint_id,
)
extract_result = client.extract("report.pdf", options=extract_options)
extracted = json.loads(extract_result.extraction_schema_json)
print(extracted)
```

## Extraction Result

The result object contains the extracted data alongside standard conversion fields:

```python theme={null}
result = client.extract("invoice.pdf", options=options)

# Extracted structured data (JSON string)
extracted = json.loads(result.extraction_schema_json)
print(extracted["invoice_number"])
print(extracted["total"])

# Standard conversion fields are also available
print(result.success)
print(result.markdown)
print(result.page_count)
print(result.cost_breakdown)
```
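
Since `extraction_schema_json` is a JSON string, defensive code may want to guard against an empty or malformed value before indexing into it. A minimal sketch follows; `parse_extraction` is a hypothetical helper, and returning an empty dict on failure is one design choice among several (raising would be equally reasonable).

```python
import json
from typing import Any, Optional


def parse_extraction(raw: Optional[str]) -> dict[str, Any]:
    """Parse an extraction JSON string, returning {} if it is missing or invalid."""
    if not raw:
        return {}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    # Only a JSON object is useful for field lookups.
    return data if isinstance(data, dict) else {}


fields = parse_extraction('{"invoice_number": "INV-1", "total": 42.5}')
print(fields.get("invoice_number"))  # INV-1
print(fields.get("vendor", "n/a"))   # n/a
```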

## Async Usage

```python theme={null}
import asyncio
import json
from datalab_sdk import AsyncDatalabClient, ExtractOptions

async def extract_data():
    async with AsyncDatalabClient() as client:
        page_schema = json.dumps({
            "title": {"type": "string", "description": "Document title"},
            "author": {"type": "string", "description": "Author name"},
        })
        options = ExtractOptions(page_schema=page_schema)
        result = await client.extract("document.pdf", options=options)
        return json.loads(result.extraction_schema_json)

extracted = asyncio.run(extract_data())
print(extracted)
```

## Next Steps

<CardGroup>
  <Card title="Extraction Recipe" icon="table" href="/docs/recipes/structured-extraction/api-overview">
    Learn more about structured extraction patterns and best practices.
  </Card>

  <Card title="Document Segmentation" icon="scissors" href="/docs/welcome/sdk/segmentation">
    Segment documents into logical sections using schemas.
  </Card>

  <Card title="Document Conversion" icon="file-lines" href="/docs/welcome/sdk/conversion">
    Convert documents to Markdown, HTML, JSON, or chunks.
  </Card>

  <Card title="Batch Processing" icon="layer-group" href="/docs/recipes/conversion/batch-documents">
    Process multiple documents efficiently in parallel.
  </Card>
</CardGroup>


# File Management
Source: https://documentation.datalab.to/docs/welcome/sdk/file-management

Upload, list, and manage files in Datalab storage using the SDK.

## Overview

Datalab provides file storage for documents you want to process with pipelines or reuse across multiple API calls. Uploaded files get a reference URL (`datalab://file-xxx`) that you can use in pipelines.

## Upload Files

Upload one or more files to Datalab storage:

```python theme={null}
from datalab_sdk import DatalabClient

client = DatalabClient()

# Upload a single file
file = client.upload_files("document.pdf")
print(f"Uploaded: {file.original_filename}")
print(f"Reference: {file.reference}")  # datalab://file-abc123

# Upload multiple files
files = client.upload_files(["doc1.pdf", "doc2.pdf", "doc3.pdf"])
for f in files:
    print(f"{f.original_filename}: {f.reference}")
```

### Upload Result

The `UploadedFileMetadata` object contains:

| Field               | Type | Description                                    |
| ------------------- | ---- | ---------------------------------------------- |
| `file_id`           | int  | Unique file ID                                 |
| `original_filename` | str  | Original filename                              |
| `content_type`      | str  | MIME type                                      |
| `reference`         | str  | Datalab reference URL (`datalab://file-xxx`)   |
| `upload_status`     | str  | Status: `"pending"`, `"completed"`, `"failed"` |
| `file_size`         | int  | File size in bytes                             |
| `created`           | str  | Upload timestamp                               |

## List Files

List all uploaded files with pagination:

```python theme={null}
# List first 50 files
result = client.list_files(limit=50, offset=0)

print(f"Total files: {result['total']}")
for file in result['files']:
    print(f"  {file.original_filename} ({file.file_size} bytes)")
    print(f"    Reference: {file.reference}")
    print(f"    Status: {file.upload_status}")
```

### Pagination

```python theme={null}
# Page through all files
offset = 0
limit = 50

while True:
    result = client.list_files(limit=limit, offset=offset)

    for file in result['files']:
        print(file.original_filename)

    if offset + limit >= result['total']:
        break

    offset += limit
```
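
The same loop can be wrapped in a generator so that callers iterate over files without tracking offsets. In this sketch, `iter_files` is a hypothetical helper; it takes the listing function as an argument and assumes only the `{'files': [...], 'total': n}` shape shown above.

```python
from typing import Any, Callable, Iterator


def iter_files(list_files: Callable[..., dict], limit: int = 50) -> Iterator[Any]:
    """Yield every file by walking limit/offset pages until 'total' is reached."""
    offset = 0
    while True:
        page = list_files(limit=limit, offset=offset)
        files = page["files"]
        yield from files
        offset += limit
        # Stop on the last page, or on an empty page to avoid looping forever.
        if not files or offset >= page["total"]:
            break
```

With a client in scope, this would be used as `for file in iter_files(client.list_files): ...`.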

## Get File Metadata

Get details for a specific file:

```python theme={null}
# By file ID (integer)
file = client.get_file_metadata(123)

# By hashid (string from reference URL)
file = client.get_file_metadata("abc123")

print(f"Filename: {file.original_filename}")
print(f"Size: {file.file_size} bytes")
print(f"Type: {file.content_type}")
print(f"Created: {file.created}")
```
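
The hashid can be derived from a stored reference URL. The sketch below assumes the hashid is everything after the `datalab://file-` prefix, as the `datalab://file-abc123` and `"abc123"` examples on this page suggest; `hashid_from_reference` is a hypothetical helper, not an SDK function.

```python
REFERENCE_PREFIX = "datalab://file-"


def hashid_from_reference(reference: str) -> str:
    """Return the hashid portion of a datalab://file-<hashid> reference."""
    if not reference.startswith(REFERENCE_PREFIX):
        raise ValueError(f"Not a Datalab file reference: {reference!r}")
    hashid = reference[len(REFERENCE_PREFIX):]
    if not hashid:
        raise ValueError("Reference has no hashid")
    return hashid


print(hashid_from_reference("datalab://file-abc123"))  # abc123
```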

## Get Download URL

Generate a presigned URL to download a file:

```python theme={null}
result = client.get_file_download_url(
    file_id=123,
    expires_in=3600  # URL valid for 1 hour (default)
)

print(f"Download URL: {result['download_url']}")
print(f"Expires in: {result['expires_in']} seconds")

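# Download the file (requires the third-party `requests` package)
import requests
response = requests.get(result['download_url'], timeout=60)
response.raise_for_status()  # Fail loudly instead of saving an error page
with open("downloaded.pdf", "wb") as f:
    f.write(response.content)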
```

### Expiration Options

The `expires_in` parameter accepts values from 60 to 86400 seconds (1 minute to 24 hours):

```python theme={null}
file_id = 123  # ID of a previously uploaded file

# Short-lived URL (1 minute)
result = client.get_file_download_url(file_id, expires_in=60)

# Long-lived URL (24 hours)
result = client.get_file_download_url(file_id, expires_in=86400)
```
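
When the lifetime comes from user input or configuration, it can be clamped to the documented 60 to 86400 second window before the call. A small sketch, with `clamp_expires_in` as a hypothetical helper:

```python
MIN_EXPIRES_IN = 60       # 1 minute
MAX_EXPIRES_IN = 86_400   # 24 hours


def clamp_expires_in(seconds: int) -> int:
    """Force a requested lifetime into the accepted range."""
    return max(MIN_EXPIRES_IN, min(MAX_EXPIRES_IN, seconds))


print(clamp_expires_in(10))       # 60
print(clamp_expires_in(3600))     # 3600
print(clamp_expires_in(100_000))  # 86400
```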

## Delete File

Delete an uploaded file:

```python theme={null}
result = client.delete_file(123)

if result['success']:
    print(f"Deleted: {result['message']}")
```

## Using Files in Pipelines

File references can be used as input to pipelines:

```python theme={null}
from datalab_sdk import DatalabClient

client = DatalabClient()

# Upload files
files = client.upload_files(["invoice1.pdf", "invoice2.pdf"])

# Run pipeline on each uploaded file
for f in files:
    execution = client.run_pipeline(
        "pl_abc123",
        file_url=f.reference  # e.g., 'datalab://file-abc123'
    )
    print(f"{f.original_filename}: {execution.execution_id}")
```

See [Pipelines](/docs/recipes/pipelines/pipeline-overview) for more details.

## Async Usage

```python theme={null}
import asyncio
from datalab_sdk import AsyncDatalabClient

async def manage_files():
    async with AsyncDatalabClient() as client:
        # Upload
        files = await client.upload_files(["doc.pdf"])

        # List
        result = await client.list_files(limit=10)

        # Get metadata
        file = await client.get_file_metadata(files[0].file_id)

        # Download URL
        url = await client.get_file_download_url(files[0].file_id)

        # Delete
        await client.delete_file(files[0].file_id)

asyncio.run(manage_files())
```

## Example: Batch Upload and Process

```python theme={null}
from datalab_sdk import DatalabClient
from pathlib import Path

client = DatalabClient()

# Find all PDFs in a directory
pdf_files = list(Path("./documents").glob("*.pdf"))

# Upload all files
uploaded = client.upload_files([str(p) for p in pdf_files])

print(f"Uploaded {len(uploaded)} files:")
for file in uploaded:
    print(f"  {file.original_filename}: {file.reference}")

# Store references for later use
references = {f.original_filename: f.reference for f in uploaded}
```

## Supported File Types

See [Supported File Types](/docs/common/supportedfiletypes) for a complete list of supported formats.

## Next Steps

<CardGroup>
  <Card title="File Upload Recipe" icon="cloud-arrow-up" href="/docs/recipes/file-management/file-upload-api">
    Step-by-step guide for uploading and managing files via the API.
  </Card>

  <Card title="Pipelines" icon="workflow" href="/docs/recipes/pipelines/pipeline-overview">
    Chain processors into versioned, reusable pipelines.
  </Card>

  <Card title="Conversion SDK" icon="file-export" href="/docs/welcome/sdk/conversion">
    Convert documents to Markdown, HTML, JSON, or chunks.
  </Card>

  <Card title="API Limits" icon="gauge-high" href="/docs/common/limits">
    Understand rate limits and file size constraints.
  </Card>
</CardGroup>


# Form Filling
Source: https://documentation.datalab.to/docs/welcome/sdk/form-filling

Fill PDF and image forms with structured field data using the Datalab SDK.

## Overview

The form filling API lets you programmatically fill forms in PDFs and images. It works with both:

* **Native PDF forms** - Forms with actual form fields
* **Image-based forms** - Scanned forms or images with visual form layouts

The API matches your field data to form fields and returns a filled PDF or image.

## Basic Usage

```python theme={null}
from datalab_sdk import DatalabClient, FormFillingOptions

client = DatalabClient()

options = FormFillingOptions(
    field_data={
        "full_name": {"value": "John Doe", "description": "Full legal name"},
        "date_of_birth": {"value": "1990-01-15", "description": "Date of birth"},
        "address": {"value": "123 Main St, City, ST 12345", "description": "Mailing address"},
    }
)

result = client.fill("form.pdf", options=options)
result.save_output("filled_form.pdf")
```

## Form Filling Options

| Option                 | Type  | Default  | Description                                                                          |
| ---------------------- | ----- | -------- | ------------------------------------------------------------------------------------ |
| `field_data`           | dict  | Required | Field names mapped to values and descriptions                                        |
| `context`              | str   | None     | Additional context to help match fields                                              |
| `confidence_threshold` | float | `0.5`    | Minimum confidence for field matching (0.0-1.0)                                      |
| `max_pages`            | int   | None     | Maximum pages to process                                                             |
| `page_range`           | str   | None     | Specific pages to process (e.g., `"0-2"`). For spreadsheets, filters by sheet index. |
| `skip_cache`           | bool  | `False`  | Skip cached results                                                                  |

### Field Data Format

Each field in `field_data` is a dictionary with:

```python theme={null}
field_data = {
    "field_key": {
        "value": "The value to fill",
        "description": "Description to help match the field"
    }
}
```

The `description` helps the API match your field key to the actual form field, especially when field names in the PDF don't match your data structure.
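
If the values already live in a flat record, the nested structure can be generated rather than typed by hand. The sketch below is illustrative only: `build_field_data` is a hypothetical helper, converting values to strings mirrors the examples on this page rather than a stated requirement, and falling back to the key name as the description is an assumption.

```python
def build_field_data(
    values: dict[str, object],
    descriptions: dict[str, str],
) -> dict[str, dict[str, str]]:
    """Turn {key: value} plus {key: description} into the field_data shape."""
    field_data = {}
    for key, value in values.items():
        field_data[key] = {
            "value": str(value),
            # Fall back to a readable form of the key when no description is given.
            "description": descriptions.get(key, key.replace("_", " ")),
        }
    return field_data


record = {"full_name": "John Doe", "zip_code": 62701}
print(build_field_data(record, {"full_name": "Full legal name"}))
# {'full_name': {'value': 'John Doe', 'description': 'Full legal name'},
#  'zip_code': {'value': '62701', 'description': 'zip code'}}
```

Explicit descriptions are still likely to match better than the fallback, so the fallback is best treated as a stopgap.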

### Example with Multiple Field Types

```python theme={null}
options = FormFillingOptions(
    field_data={
        # Text fields
        "name": {"value": "Jane Smith", "description": "Full name"},
        "email": {"value": "jane@example.com", "description": "Email address"},

        # Date fields
        "date": {"value": "2024-01-15", "description": "Today's date"},

        # Numeric fields
        "amount": {"value": "1500.00", "description": "Total amount"},

        # Checkbox (use descriptive value)
        "agree_terms": {"value": "Yes", "description": "Agreement checkbox"},

        # Signature (text is rendered)
        "signature": {"value": "Jane Smith", "description": "Signature field"},
    },
    context="This is an employment application form"
)
```

### Using Context

The `context` parameter provides additional information to improve field matching:

```python theme={null}
options = FormFillingOptions(
    field_data={
        "ssn": {"value": "123-45-6789", "description": "Social Security Number"},
        "employer": {"value": "Acme Corp", "description": "Current employer name"},
    },
    context="W-4 tax withholding form for new employee onboarding"
)
```

### Confidence Threshold

Adjust `confidence_threshold` to control field matching strictness:

```python theme={null}
options = FormFillingOptions(
    field_data={...},
    confidence_threshold=0.7,  # Higher = more strict matching
)
```

* **Lower values (0.3-0.5)**: More fields are matched, but some matches may be incorrect
* **Higher values (0.7-0.9)**: Fewer fields are matched, but the matches are more reliable

## Form Filling Result

```python theme={null}
result = client.fill("form.pdf", options=options)

# Check results
print(result.success)           # True if filling succeeded
print(result.status)            # "complete" when done
print(result.output_format)     # "pdf" or "png"
print(result.fields_filled)     # List of successfully filled fields
print(result.fields_not_found)  # List of fields that couldn't be matched
print(result.page_count)        # Number of pages processed
print(result.cost_breakdown)    # Cost details
```

### Result Fields

| Field              | Type  | Description                               |
| ------------------ | ----- | ----------------------------------------- |
| `success`          | bool  | Whether form filling succeeded            |
| `status`           | str   | Processing status                         |
| `output_format`    | str   | Output type: `"pdf"` or `"png"`           |
| `output_base64`    | str   | Base64-encoded filled form                |
| `fields_filled`    | list  | Field names that were successfully filled |
| `fields_not_found` | list  | Field names that couldn't be matched      |
| `page_count`       | int   | Number of pages processed                 |
| `runtime`          | float | Processing time in seconds                |
| `cost_breakdown`   | dict  | Cost details                              |

## Saving the Filled Form

```python theme={null}
# Save to file
result.save_output("filled_form.pdf")

# Or access the raw base64 data
import base64
pdf_bytes = base64.b64decode(result.output_base64)
with open("filled.pdf", "wb") as f:
    f.write(pdf_bytes)
```
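
When decoding by hand, the file extension can follow `output_format` so that image forms are not saved with a `.pdf` name. A minimal sketch, assuming `output_format` is always `"pdf"` or `"png"` as listed in the result fields; `write_filled_form` is a hypothetical helper.

```python
import base64
from pathlib import Path


def write_filled_form(output_base64: str, output_format: str, stem: str) -> Path:
    """Decode the filled form and write it to '<stem>.<output_format>'."""
    if output_format not in {"pdf", "png"}:
        raise ValueError(f"Unexpected output format: {output_format!r}")
    path = Path(f"{stem}.{output_format}")
    path.write_bytes(base64.b64decode(output_base64))
    return path
```

For example, `write_filled_form(result.output_base64, result.output_format, "filled_form")` would produce `filled_form.pdf` or `filled_form.png`.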

## Filling Image Forms

The API also works with image-based forms (PNG, JPG, etc.):

```python theme={null}
result = client.fill("scanned_form.png", options=options)
result.save_output("filled_form.png")  # Saves the filled image
```

For images, the output is a PNG with the field values rendered onto the image.

## From URL

Fill a form from a URL:

```python theme={null}
result = client.fill(
    file_url="https://example.com/form.pdf",
    options=options
)
```

## Async Usage

```python theme={null}
import asyncio
from datalab_sdk import AsyncDatalabClient, FormFillingOptions

async def fill_form():
    async with AsyncDatalabClient() as client:
        options = FormFillingOptions(
            field_data={
                "name": {"value": "John Doe", "description": "Full name"},
            }
        )
        result = await client.fill("form.pdf", options=options)
        result.save_output("filled.pdf")

asyncio.run(fill_form())
```

## Handling Unmatched Fields

Check which fields couldn't be matched:

```python theme={null}
result = client.fill("form.pdf", options=options)

if result.fields_not_found:
    print("These fields couldn't be matched:")
    for field in result.fields_not_found:
        print(f"  - {field}")

    # Consider adjusting descriptions or lowering confidence threshold
```
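
For batch jobs, a compact summary of the two lists makes it easier to decide whether a form needs another pass. The sketch below works on plain lists of field names; `fill_report` is a hypothetical helper, and the 0.0 rate for an empty result is an arbitrary convention.

```python
def fill_report(fields_filled: list[str], fields_not_found: list[str]) -> dict:
    """Summarize how many requested fields were matched."""
    total = len(fields_filled) + len(fields_not_found)
    return {
        "filled": len(fields_filled),
        "unmatched": sorted(fields_not_found),
        "fill_rate": len(fields_filled) / total if total else 0.0,
    }


print(fill_report(["name", "email", "date"], ["signature"]))
# {'filled': 3, 'unmatched': ['signature'], 'fill_rate': 0.75}
```

With a real result, the call would be `fill_report(result.fields_filled, result.fields_not_found)`.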

## Example: Tax Form

```python theme={null}
from datalab_sdk import DatalabClient, FormFillingOptions

client = DatalabClient()

options = FormFillingOptions(
    field_data={
        "first_name": {"value": "John", "description": "First name"},
        "last_name": {"value": "Doe", "description": "Last name"},
        "ssn": {"value": "123-45-6789", "description": "Social Security Number"},
        "address": {"value": "123 Main Street", "description": "Street address"},
        "city": {"value": "Springfield", "description": "City"},
        "state": {"value": "IL", "description": "State abbreviation"},
        "zip": {"value": "62701", "description": "ZIP code"},
        "filing_status": {"value": "Single", "description": "Filing status"},
        "signature": {"value": "John Doe", "description": "Taxpayer signature"},
        "date": {"value": "2024-04-15", "description": "Date signed"},
    },
    context="IRS W-4 Employee's Withholding Certificate"
)

result = client.fill("w4_form.pdf", options=options)

print(f"Filled {len(result.fields_filled)} fields")
print(f"Unmatched: {result.fields_not_found}")

result.save_output("w4_filled.pdf")
```

## Next Steps

<CardGroup>
  <Card title="Form Filling Recipe" icon="file-pen" href="/docs/recipes/form-filling/form-filling-api-overview">
    Detailed guide on form filling with field matching and templates.
  </Card>

  <Card title="File Management" icon="folder-open" href="/docs/welcome/sdk/file-management">
    Upload, list, and manage files in Datalab storage.
  </Card>

  <Card title="Conversion SDK" icon="file-export" href="/docs/welcome/sdk/conversion">
    Convert documents to Markdown, HTML, JSON, or chunks.
  </Card>

  <Card title="Pipelines" icon="workflow" href="/docs/recipes/pipelines/pipeline-overview">
    Chain processors into versioned, reusable pipelines.
  </Card>
</CardGroup>


# Pipelines
Source: https://documentation.datalab.to/docs/welcome/sdk/pipelines

Create, version, and run document processing pipelines using the Datalab SDK.

## Overview

Pipelines chain processors (convert, extract, segment, custom) into reusable, versioned configurations. See [Pipeline Overview](/docs/recipes/pipelines/pipeline-overview) for concepts.

## Basic Usage

```python theme={null}
from datalab_sdk import DatalabClient, PipelineProcessor

client = DatalabClient()

# Create a pipeline
pipeline = client.create_pipeline(steps=[
    PipelineProcessor(type="convert", settings={"mode": "balanced"}),
    PipelineProcessor(type="extract", settings={
        "page_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "date": {"type": "string"}
            }
        }
    })
])

# Save and publish
pipeline = client.save_pipeline(pipeline.pipeline_id, name="My Pipeline")
version = client.create_pipeline_version(pipeline.pipeline_id)

# Run
execution = client.run_pipeline(pipeline.pipeline_id, file_path="doc.pdf")
execution = client.get_pipeline_execution(execution.execution_id, max_polls=300)

# Get results
result = client.get_step_result(execution.execution_id, step_index=1)
```

## Models

### PipelineProcessor

Defines a single processor in a pipeline.

```python theme={null}
from datalab_sdk import PipelineProcessor

step = PipelineProcessor(
    type="extract",                      # Step type
    settings={"page_schema": {...}},     # Step-specific config
    custom_processor_id="cp_abc123",     # For custom steps
    eval_rubric_id=42,                   # Optional eval rubric
)
```

| Field                 | Type | Required | Description                                          |
| --------------------- | ---- | -------- | ---------------------------------------------------- |
| `type`                | str  | Yes      | `"convert"`, `"extract"`, `"segment"`, or `"custom"` |
| `settings`            | dict | Yes      | Step-specific configuration                          |
| `custom_processor_id` | str  | No       | Custom processor ID for `"custom"` steps             |
| `eval_rubric_id`      | int  | No       | Evaluation rubric to apply                           |

### PipelineConfig

Returned by pipeline CRUD methods.

| Field            | Type     | Description                                            |
| ---------------- | -------- | ------------------------------------------------------ |
| `pipeline_id`    | str      | Unique ID (`pl_XXXXX`)                                 |
| `steps`          | list     | Ordered list of step definitions                       |
| `name`           | str      | Pipeline name (set via `save_pipeline`)                |
| `is_saved`       | bool     | Whether pipeline has been saved                        |
| `archived`       | bool     | Whether pipeline is archived                           |
| `active_version` | int      | Current published version (`0` = no published version) |
| `created`        | datetime | Creation timestamp                                     |
| `updated`        | datetime | Last update timestamp                                  |

### PipelineVersion

Immutable snapshot of pipeline steps at a point in time.

| Field         | Type     | Description                |
| ------------- | -------- | -------------------------- |
| `version`     | int      | Version number             |
| `steps`       | list     | Steps at this version      |
| `description` | str      | Version description        |
| `created`     | datetime | When version was published |

### PipelineExecution

Result from running a pipeline.

| Field              | Type     | Description                                                          |
| ------------------ | -------- | -------------------------------------------------------------------- |
| `execution_id`     | str      | Unique ID (`pex_XXXXX`)                                              |
| `pipeline_id`      | str      | Pipeline that was executed                                           |
| `pipeline_version` | int      | Version used (`0` = draft)                                           |
| `status`           | str      | `pending`, `running`, `completed`, `completed_with_errors`, `failed` |
| `steps`            | list     | List of `PipelineExecutionStepResult`                                |
| `started_at`       | datetime | Execution start time                                                 |
| `completed_at`     | datetime | Execution end time                                                   |
| `created`          | datetime | When execution was created                                           |
| `config_snapshot`  | dict     | Frozen step configuration used                                       |
| `input_config`     | dict     | Input file details                                                   |
| `rate_breakdown`   | dict     | Billing breakdown                                                    |

### PipelineExecutionStepResult

Status of a single step within an execution.

| Field           | Type     | Description                                                          |
| --------------- | -------- | -------------------------------------------------------------------- |
| `step_index`    | int      | Position in pipeline                                                 |
| `step_type`     | str      | Step type                                                            |
| `status`        | str      | `pending`, `dispatched`, `running`, `completed`, `failed`, `skipped` |
| `result_url`    | str      | URL to fetch step result                                             |
| `checkpoint_id` | str      | Checkpoint passed to downstream steps                                |
| `started_at`    | datetime | Step start time                                                      |
| `finished_at`   | datetime | Step end time                                                        |
| `error_message` | str      | Error details if failed                                              |
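The status fields in the two tables above can be combined to report which steps went wrong. This is a minimal sketch that uses `SimpleNamespace` objects as stand-ins for `PipelineExecution` and `PipelineExecutionStepResult`. Treating `completed`, `completed_with_errors`, and `failed` as the terminal statuses is an assumption inferred from their names, and `is_terminal` and `failed_steps` are hypothetical helpers.

```python
from types import SimpleNamespace

# Assumption: these execution statuses mean no further progress will occur
TERMINAL_STATUSES = {"completed", "completed_with_errors", "failed"}


def is_terminal(execution):
    """Return True once the execution has stopped making progress."""
    return execution.status in TERMINAL_STATUSES


def failed_steps(execution):
    """Return (step_index, error_message) for every failed step."""
    return [
        (step.step_index, step.error_message)
        for step in execution.steps
        if step.status == "failed"
    ]


# Stand-in object with the documented field names (not a real SDK response)
execution = SimpleNamespace(
    status="completed_with_errors",
    steps=[
        SimpleNamespace(step_index=0, step_type="convert",
                        status="completed", error_message=None),
        SimpleNamespace(step_index=1, step_type="extract",
                        status="failed", error_message="example error"),
    ],
)

if is_terminal(execution):
    for index, message in failed_steps(execution):
        print(f"step {index} failed: {message}")
```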

## Pipeline Management

### Create

```python theme={null}
pipeline = client.create_pipeline(steps=[
    PipelineProcessor(type="convert", settings={"mode": "balanced"}),
    PipelineProcessor(type="extract", settings={"page_schema": {...}})
])
```

### Save

```python theme={null}
pipeline = client.save_pipeline(pipeline.pipeline_id, name="Invoice Parser")
```

### Update

If the pipeline already has a published version, the update is stored as a draft:

```python theme={null}
pipeline = client.update_pipeline(pipeline.pipeline_id, steps=[
    PipelineProcessor(type="convert", settings={"mode": "accurate"}),
    PipelineProcessor(type="extract", settings={"page_schema": {...}})
])
```

### List

```python theme={null}
result = client.list_pipelines(
    saved_only=True,           # Only saved pipelines (default)
    include_archived=False,    # Include archived (default: False)
    limit=50,
    offset=0
)

for p in result["pipelines"]:
    print(f"{p.pipeline_id}: {p.name}")
```
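To walk through more pipelines than fit in one response, `limit` and `offset` can be advanced in a loop. The sketch below assumes that a page shorter than `limit` marks the end of the list, which the documentation above does not state. `iter_pipelines` is a hypothetical helper, and `FakeClient` only stands in for `DatalabClient` so the example runs without network access.

```python
def iter_pipelines(client, page_size=50, **filters):
    """Yield every pipeline by advancing offset one page at a time.

    Assumption: a page with fewer than page_size items is the last one.
    """
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    offset = 0
    while True:
        page = client.list_pipelines(limit=page_size, offset=offset, **filters)["pipelines"]
        yield from page
        if len(page) < page_size:
            break
        offset += page_size


class FakeClient:
    """Stand-in that mimics the documented list_pipelines() return shape."""

    def __init__(self, items):
        self._items = items

    def list_pipelines(self, limit=50, offset=0, **filters):
        return {"pipelines": self._items[offset:offset + limit]}


fake = FakeClient(["pl_a", "pl_b", "pl_c", "pl_d", "pl_e"])
all_pipelines = list(iter_pipelines(fake, page_size=2, saved_only=True))
print(all_pipelines)
```

With a real client, the same generator would be called as `iter_pipelines(client, saved_only=True)`.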

### Get

```python theme={null}
pipeline = client.get_pipeline("pl_abc123")
```

### Archive / Unarchive

```python theme={null}
client.archive_pipeline("pl_abc123")
client.unarchive_pipeline("pl_abc123")
```

## Versioning

### Publish a Version

```python theme={null}
version = client.create_pipeline_version(
    "pl_abc123",
    description="Added line items extraction"
)
print(f"Published v{version.version}")
```

### List Versions

```python theme={null}
result = client.list_pipeline_versions("pl_abc123")
for v in result["versions"]:
    print(f"v{v.version}: {v.description}")
```

### Discard Draft

```python theme={null}
# Revert to active published version
pipeline = client.discard_pipeline_draft("pl_abc123")

# Revert to a specific version
pipeline = client.discard_pipeline_draft("pl_abc123", version=1)
```

### Get Rate

```python theme={null}
rate = client.get_pipeline_rate("pl_abc123")
print(f"{rate['rate_per_1000_pages_cents']} cents per 1000 pages")
```
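The rate is quoted per 1,000 pages, so a per-document estimate needs a small conversion. The sketch below rounds up to the next whole cent, following the rounding rule described on the Billing page. Applying that rule once per request is an assumption, `estimate_cost_cents` is a hypothetical helper, and the rate of 600 is an illustrative value rather than a quoted price.

```python
import math


def estimate_cost_cents(rate_per_1000_pages_cents, pages):
    """Estimate the charge in cents for `pages` pages, rounded up to a whole cent."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return math.ceil(rate_per_1000_pages_cents * pages / 1000)


# Illustrative rate; in practice use rate["rate_per_1000_pages_cents"]
example_rate = 600
for page_count in (1, 250, 1000):
    cents = estimate_cost_cents(example_rate, page_count)
    print(f"{page_count} pages: about {cents} cents")
```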

## Execution

### Run

```python theme={null}
execution = client.run_pipeline(
    "pl_abc123",
    file_path="document.pdf",     # or file_url="https://..."
    page_range="0-10",
    output_format="json",
    skip_cache=False,
    run_evals=False,
    webhook_url="https://example.com/hook",
    version=2,                    # omit for active version
    max_polls=1,                  # polls after submission
    poll_interval=1,
)
```

| Parameter       | Type | Default  | Description                                       |
| --------------- | ---- | -------- | ------------------------------------------------- |
| `pipeline_id`   | str  | Required | Pipeline to run                                   |
| `file_path`     | str  | -        | Local file path                                   |
| `file_url`      | str  | -        | URL to document                                   |
| `page_range`    | str  | -        | Pages to process (`"0-5,10"`)                     |
| `output_format` | str  | -        | Override output format                            |
| `skip_cache`    | bool | `False`  | Skip cached results                               |
| `run_evals`     | bool | `False`  | Run eval rubrics on steps                         |
| `webhook_url`   | str  | -        | Webhook URL for completion                        |
| `version`       | int  | -        | Version to run (omit=active, 0=draft, N=specific) |
| `max_polls`     | int  | `1`      | Polling attempts                                  |
| `poll_interval` | int  | `1`      | Seconds between polls                             |
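The `page_range` string combines single indices and ranges. As a rough illustration of how a value such as `"0-5,10"` expands, the sketch below parses it into a sorted list of page indices. It assumes that indices are 0-based and that ranges include both endpoints; `expand_page_range` is a hypothetical helper, not an SDK function.

```python
def expand_page_range(page_range):
    """Expand a string like "0-5,10" into a sorted list of page indices.

    Assumptions: indices are 0-based and ranges are inclusive on both ends.
    """
    pages = set()
    for part in page_range.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if end < start:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(expand_page_range("0-5,10"))  # [0, 1, 2, 3, 4, 5, 10]
```

A check like this can be useful for validating user input before a request is submitted.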

### Poll Execution

```python theme={null}
execution = client.get_pipeline_execution(
    "pex_abc123",
    max_polls=300,
    poll_interval=2
)
```
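With `max_polls=300` and `poll_interval=2`, the call above would wait roughly ten minutes at most before returning. The sketch below shows the general shape of such a bounded polling loop. It is an illustration of the pattern only, not the SDK's implementation; `poll_until_terminal` and the set of terminal statuses are assumptions, and the sleep function is injectable so the example runs instantly.

```python
import time

# Assumption: these statuses mean the execution will not change any further
TERMINAL_STATUSES = {"completed", "completed_with_errors", "failed"}


def poll_until_terminal(fetch_status, max_polls=300, poll_interval=2, sleep=time.sleep):
    """Call fetch_status() up to max_polls times, sleeping between attempts.

    Returns the last status seen, which may be non-terminal if polls run out.
    """
    status = None
    for attempt in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            break
        if attempt < max_polls - 1:
            sleep(poll_interval)
    return status


# Stand-in for a real status lookup: pending, then running, then completed
fake_statuses = iter(["pending", "running", "completed"])
waits = []  # records requested sleeps instead of actually sleeping
final_status = poll_until_terminal(
    lambda: next(fake_statuses), max_polls=5, poll_interval=2, sleep=waits.append
)
print(final_status, waits)
```

Because the loop can end without reaching a terminal status, callers should check the returned status rather than assume completion.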

### List Executions

```python theme={null}
result = client.list_pipeline_executions("pl_abc123", limit=20)
for ex in result["executions"]:
    print(f"{ex.execution_id}: {ex.status}")
```

### Get Step Result

```python theme={null}
result = client.get_step_result("pex_abc123", step_index=1)
```

## Async Usage

All pipeline methods are available on `AsyncDatalabClient`:

```python theme={null}
import asyncio
from datalab_sdk import AsyncDatalabClient, PipelineProcessor

async def run():
    async with AsyncDatalabClient() as client:
        pipeline = await client.create_pipeline(steps=[
            PipelineProcessor(type="convert", settings={"mode": "balanced"}),
            PipelineProcessor(type="extract", settings={"page_schema": {
                "type": "object",
                "properties": {"title": {"type": "string"}}
            }})
        ])

        pipeline = await client.save_pipeline(
            pipeline.pipeline_id, name="Async Pipeline"
        )

        execution = await client.run_pipeline(
            pipeline.pipeline_id, file_path="doc.pdf"
        )

        execution = await client.get_pipeline_execution(
            execution.execution_id, max_polls=300
        )

        result = await client.get_step_result(
            execution.execution_id, step_index=1
        )
        return result

result = asyncio.run(run())
```

## Next Steps

<CardGroup>
  <Card title="Pipeline Overview" icon="sitemap" href="/docs/recipes/pipelines/pipeline-overview">
    Concepts, processor types, and when to use pipelines.
  </Card>

  <Card title="Create a Pipeline" icon="hammer" href="/docs/recipes/pipelines/create-pipeline">
    Step-by-step guide to building pipelines.
  </Card>

  <Card title="Pipeline Versioning" icon="code-branch" href="/docs/recipes/pipelines/pipeline-versioning">
    Manage drafts and publish versions.
  </Card>

  <Card title="Run a Pipeline" icon="play" href="/docs/recipes/pipelines/run-pipeline">
    Execution, overrides, and result retrieval.
  </Card>
</CardGroup>


# Document Segmentation
Source: https://documentation.datalab.to/docs/welcome/sdk/segmentation

Segment documents into logical sections using the Datalab SDK.

## Basic Usage

```python theme={null}
import json
from datalab_sdk import DatalabClient, SegmentOptions

client = DatalabClient()

# Define a segmentation schema with section names and descriptions
segmentation_schema = json.dumps({
    "sections": [
        {"name": "introduction", "description": "Introduction and overview"},
        {"name": "methodology", "description": "Methods and approach"},
        {"name": "results", "description": "Findings and results"},
        {"name": "conclusion", "description": "Summary and conclusions"},
        {"name": "references", "description": "Bibliography and references"}
    ]
})

options = SegmentOptions(segmentation_schema=segmentation_schema)
result = client.segment("research_paper.pdf", options=options)

# Access segmentation results
segments = result.segmentation_results
for segment in segments:
    print(f"{segment['name']}: pages {segment['page_range']}")
```

## Segment Options

Use `SegmentOptions` to configure segmentation behavior:

| Option                | Type | Default      | Description                                                                             |
| --------------------- | ---- | ------------ | --------------------------------------------------------------------------------------- |
| `segmentation_schema` | str  | **Required** | JSON string defining segment names and descriptions                                     |
| `checkpoint_id`       | str  | None         | Checkpoint ID from a previous `convert()` call                                          |
| `mode`                | str  | `"fast"`     | Processing mode: `"fast"`, `"balanced"`, `"accurate"`                                   |
| `save_checkpoint`     | bool | `False`      | Save checkpoint for reuse with subsequent calls                                         |
| `max_pages`           | int  | None         | Maximum number of pages to process                                                      |
| `page_range`          | str  | None         | Specific pages to process (e.g., `"0-5,10"`). For spreadsheets, filters by sheet index. |
| `skip_cache`          | bool | `False`      | Skip cached results, force reprocessing                                                 |
| `webhook_url`         | str  | None         | Webhook URL for completion notification                                                 |

## Checkpoint Reuse

Use checkpoints to avoid re-parsing a document when running segmentation after conversion. First convert with `save_checkpoint=True`, then segment using the returned `checkpoint_id`:

```python theme={null}
import json
from datalab_sdk import DatalabClient, ConvertOptions, SegmentOptions

client = DatalabClient()

# Step 1: Convert and save a checkpoint
convert_options = ConvertOptions(
    mode="accurate",
    save_checkpoint=True,
)
convert_result = client.convert("report.pdf", options=convert_options)
print(convert_result.markdown)

# Step 2: Segment using the checkpoint (no re-parsing needed)
segmentation_schema = json.dumps({
    "sections": [
        {"name": "executive_summary", "description": "Executive summary"},
        {"name": "financials", "description": "Financial data and analysis"},
        {"name": "outlook", "description": "Future outlook and projections"},
    ]
})

segment_options = SegmentOptions(
    segmentation_schema=segmentation_schema,
    checkpoint_id=convert_result.checkpoint_id,
)
segment_result = client.segment("report.pdf", options=segment_options)
print(segment_result.segmentation_results)
```

## Segmentation Result

The result object contains the segmentation data alongside standard conversion fields:

```python theme={null}
result = client.segment("document.pdf", options=options)

# Segmentation results (list of segments with names and page ranges)
segments = result.segmentation_results
for segment in segments:
    print(f"Section: {segment['name']}")
    print(f"  Pages: {segment['page_range']}")

# Standard conversion fields are also available
print(result.success)
print(result.markdown)
print(result.page_count)
print(result.cost_breakdown)
```

## Async Usage

```python theme={null}
import asyncio
import json
from datalab_sdk import AsyncDatalabClient, SegmentOptions

async def segment_document():
    async with AsyncDatalabClient() as client:
        segmentation_schema = json.dumps({
            "sections": [
                {"name": "introduction", "description": "Introduction"},
                {"name": "body", "description": "Main content"},
                {"name": "conclusion", "description": "Conclusion"},
            ]
        })
        options = SegmentOptions(segmentation_schema=segmentation_schema)
        result = await client.segment("document.pdf", options=options)
        return result.segmentation_results

segments = asyncio.run(segment_document())
print(segments)
```

## Next Steps

<CardGroup>
  <Card title="Segmentation Recipe" icon="scissors" href="/docs/recipes/document-segmentation/auto-segmentation">
    Learn more about document segmentation patterns and use cases.
  </Card>

  <Card title="Structured Extraction" icon="table" href="/docs/welcome/sdk/extraction">
    Extract structured data from documents using JSON schemas.
  </Card>

  <Card title="Document Conversion" icon="file-lines" href="/docs/welcome/sdk/conversion">
    Convert documents to Markdown, HTML, JSON, or chunks.
  </Card>
</CardGroup>


# Welcome to Datalab
Source: https://documentation.datalab.to/index


Datalab provides document intelligence APIs to convert PDFs, spreadsheets, images, and other formats into structured, machine-readable outputs — fast, accurately, and at scale.

We offer a [fully managed platform](./docs/welcome/api), [on-prem deployment](./docs/on-prem/overview) for sensitive documents, and open-source tools for developers. **New accounts include \$5 in free credits** — [sign up here](https://www.datalab.to/auth/sign_up).

## Key Capabilities

* **Document Conversion** — Parse PDFs, Word docs, and spreadsheets into Markdown, HTML, or JSON (powered by [Marker](https://github.com/datalab-to/marker), [Surya](https://github.com/datalab-to/surya), and [Chandra](https://github.com/datalab-to/chandra))
* **Pipelines** — Chain processors into versioned, reusable configurations and deploy to production
* **Structured Extraction** — Extract specific fields with citations back to source bounding boxes for auditability
* **Form Filling** — Automatically fill PDF and image forms with structured data
* **Document Segmentation** — Split multi-document PDFs into separate logical sections
* **Track Changes** — Extract redlines and comments from Word documents
* **OCR** — High-accuracy text recognition supporting 90+ languages

## What do you want to do?

**Convert documents to structured formats**
→ [Document Conversion](./docs/recipes/conversion/conversion-api-overview)

**Extract specific data from documents**
→ [Structured Extraction](./docs/recipes/structured-extraction/api-overview)

**Automatically fill PDF forms**
→ [Form Filling](./docs/recipes/form-filling/form-filling-api-overview)

**Split combined PDFs into separate documents**
→ [Document Segmentation](./docs/recipes/document-segmentation/auto-segmentation)

**Build document processing pipelines**
→ [Pipelines](./docs/recipes/pipelines/pipeline-overview)

**Extract tracked changes from Word documents**
→ [Track Changes](./docs/recipes/extract-redlines-and-comments/track-changes-from-word-documents)

## Who uses Datalab?

Datalab serves teams building AI agents, RAG systems, and document automation workflows:

* **AI/ML teams** — Feed knowledge graphs, retrieval systems, and automation pipelines with clean, structured document data
* **Enterprises** — Automate high-volume document processing with auditability and citation tracking
* **Product teams** — Convert financial statements, legal filings, tax forms, and research papers into product-ready content

## Getting Started

<CardGroup>
  <Card title="SDK Quickstart" icon="rocket" href="/docs/welcome/quickstart">
    Start converting documents in minutes with Python.
  </Card>

  <Card title="API Reference" icon="bolt" href="/docs/welcome/api">
    REST API documentation.
  </Card>

  <Card title="Build a Pipeline" icon="workflow" href="/docs/recipes/pipelines/pipeline-overview">
    Chain processors into versioned, reusable pipelines.
  </Card>

  <Card title="Open Source" icon="github" href="https://github.com/datalab-to">
    Run our models locally.
  </Card>
</CardGroup>

## Support

<CardGroup>
  <Card title="Contact Support" icon="circle-question">
    Email [support@datalab.to](mailto:support@datalab.to) for help.
  </Card>

  <Card title="Service Status" icon="chart-line" href="https://status.datalab.to/">
    Check API availability.
  </Card>
</CardGroup>


# Billing
Source: https://documentation.datalab.to/platform/billing


Datalab uses per-page pricing — you pay only for the pages you process, across whichever processors you run. This page explains how billing works and how to manage your usage.

### Per-Page Pricing

Every API request consumes credits based on the number of pages processed:

* Charges are rounded up to the nearest cent.
* Rates vary by processor (convert, extract, segment, etc.) and processing mode.
* Add-ons such as `word_bboxes`, `table_cell_bboxes`, and `list_item_bboxes` are billed additively per 1K pages on top of the base rate.

See the full rate card at [datalab.to/pricing](https://www.datalab.to/pricing).

### Free Tier

New accounts receive a **monthly usage allowance** with no credit card required:

* **\$20/month** for accounts created with a work email address
* **\$10/month** for accounts created with a personal email address

Credits reset at the start of each 30-day cycle. The free tier supports up to 10 requests per minute and is designed to let you run a complete proof of concept before committing to a paid plan.

### Pay-as-You-Go

When you outgrow the free allowance, add a payment method and switch to pay-as-you-go. There is no subscription, no plan to choose, and no minimum spend. You are billed only for the pages you actually process.

Processors are additive — if you convert and then extract the same document, you pay for each step separately.

### Team Plan

For production workloads, the **Team** plan is \$400/month and includes:

* \$400 of monthly usage (same per-page rates)
* Production rate limits
* Clickthrough BAA/DPA
* SOC 2 report
* Additional custom processor capacity

[Contact us](https://www.datalab.to/contact) for Enterprise pricing with volume discounts and custom SLAs.

### Payment Failures and Grace Periods

When a payment fails, you will receive an email notification from Stripe.

* On failure, your account enters an `unpaid` state.
* A 24-hour grace period with a usage cap gives you time to resolve the issue.
* After the grace period, API access is blocked until payment is resolved.

Update your payment method in the [billing dashboard](https://www.datalab.to/app/billing) or [contact support](mailto:support@datalab.to).

## Understanding Your Usage

### What Counts as a Page?

* **PDF files**: Each page in the PDF
* **Images**: Each image file counts as one page
* **Office documents**: Each page in the document
* **Multi-page TIFFs**: Each frame counts as a separate page
* **Spreadsheets**: Pricing varies by extraction mode:
  * **Simple mode**: 2,500 cells per page, capped at 100 pages (\$0.60) per sheet
  * **Advanced mode**: 500 cells per page, no cap
  * For files with multiple sheets, each sheet is calculated separately and then summed
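The spreadsheet rules above can be turned into a quick page estimate. The sketch below applies the per-mode cell counts and the simple-mode cap sheet by sheet, then sums the results. Rounding a partially filled page up to a whole page is an assumption, the mode labels `"simple"` and `"advanced"` are used here only as dictionary keys, and `billable_pages` is a hypothetical helper.

```python
import math

CELLS_PER_PAGE = {"simple": 2500, "advanced": 500}
SIMPLE_MODE_PAGE_CAP = 100  # per sheet, simple mode only


def billable_pages(cells_per_sheet, mode):
    """Estimate billable pages for a workbook, given the cell count of each sheet.

    Assumption: a partially filled page counts as a full page.
    """
    cells_per_page = CELLS_PER_PAGE[mode]
    total = 0
    for cells in cells_per_sheet:
        pages = math.ceil(cells / cells_per_page)
        if mode == "simple":
            pages = min(pages, SIMPLE_MODE_PAGE_CAP)
        total += pages  # each sheet is calculated separately, then summed
    return total


workbook = [5_000, 1_000_000]  # two sheets: one small, one very large
print(billable_pages(workbook, "simple"))    # 2 + 100 (capped) = 102
print(billable_pages(workbook, "advanced"))  # 10 + 2000 = 2010
```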

### Monitoring Usage

Track usage through:

1. **Dashboard Overview**: Real-time usage statistics at [datalab.to/app](https://www.datalab.to/app)
2. **Usage Reports**: Detailed breakdown by processing type

Usage statistics may be slightly delayed.

## Next Steps

<CardGroup>
  <Card title="API Limits" icon="gauge" href="/docs/common/limits">
    Understand file size limits, page limits, and rate limiting.
  </Card>

  <Card title="Error Codes" icon="circle-exclamation" href="/platform/errors">
    Understand HTTP error codes and subscription errors.
  </Card>

  <Card title="Quickstart" icon="rocket" href="/docs/welcome/quickstart">
    Get started converting documents in minutes.
  </Card>

  <Card title="Changelog" icon="clock-rotate-left" href="/platform/changelog">
    See the latest updates and changes to the Datalab platform.
  </Card>
</CardGroup>


# Changelog
Source: https://documentation.datalab.to/platform/changelog

Major changes to the Datalab hosted service are listed here.

## 6/24/2026

* **Breaking change for `word_bboxes` users:** The `metadata.words` JSON array is no longer emitted in API responses. Per-word bounding boxes are now exclusively available as inline `<span data-bbox="..." data-confidence="...">` elements in HTML output. If you were reading word bboxes from `page_info[id].metadata.words`, update your code to parse them from the HTML spans instead.
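For code that previously read `metadata.words`, one possible migration path is to collect the annotated spans with Python's standard `html.parser`. The sketch below keeps the `data-bbox` and `data-confidence` values as raw strings because their exact format is not described here. The sample HTML and its attribute values are invented for illustration, `WordSpanParser` is a hypothetical helper, and nested annotated spans are not handled.

```python
from html.parser import HTMLParser


class WordSpanParser(HTMLParser):
    """Collect text and raw attributes from <span data-bbox=...> elements."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # span being read, if any

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "span" and "data-bbox" in attr_map:
            self._current = {
                "text": "",
                "bbox": attr_map["data-bbox"],  # kept as a raw string
                "confidence": attr_map.get("data-confidence"),
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


# Hypothetical sample; real attribute values may be formatted differently
sample_html = (
    '<p><span data-bbox="10,20,60,35" data-confidence="0.98">Hello</span> '
    '<span data-bbox="65,20,120,35" data-confidence="0.91">world</span></p>'
)

parser = WordSpanParser()
parser.feed(sample_html)
for word in parser.words:
    print(word["text"], word["bbox"], word["confidence"])
```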

## 6/22/2026

* API keys that reached their 30-day spend cap now return HTTP 402 with a clear spend-limit message instead of a misleading "Invalid API key" 401 error. If you were catching 401 errors to detect spend-cap exhaustion, update to catch 402.
* Checkboxes and radio buttons detected in documents now render as `☒` (checked) or `☐` (unchecked) in markdown output instead of being silently dropped.

## 6/18/2026

* Free tier and pay-as-you-go pricing launched — new accounts receive a monthly usage allowance (\$20 on a work email, \$10 on a personal email) with no credit card required. Add a card to upgrade to pay-as-you-go and be billed only for pages you process, with no subscription or minimum commitment. See the [pricing page](https://www.datalab.to/pricing) for per-processor rates.

## 6/16/2026

* Word-level bounding boxes (`word_bboxes`), table cell/row/column bboxes (`table_cell_bboxes` extra), and list item bboxes (`list_item_bboxes` extra) are now available to all teams — no longer require allowlist access. `table_cell_bboxes` and `list_item_bboxes` are billed at \$0.30 per 1K pages each and automatically enable `word_bboxes`. HTML output carries `data-bbox` and `data-confidence` attributes on the annotated elements; `table_row_bboxes` is deprecated and replaced by `table_cell_bboxes`.
* Maximum input image dimensions increased by 1.5× (from 4,800×7,800 px to 7,200×11,700 px), reducing rejections for large-format scans and high-resolution page images.

## 6/15/2026

* Custom Processors are now generally available to all authenticated teams — no allowlist required. Create, iterate, and run processors from the [dashboard](https://www.datalab.to/app/processors) or via `POST /api/v1/custom-processor` (SDK: `client.run_custom_processor()`).

## 6/12/2026

* On-premises container adds structured extraction via `POST /api/v1/extract`. Supports `fast` and `turbo` extraction modes. Requires the Chandra model with the Lift model enabled; `balanced` mode is not available on-prem.

## 6/4/2026

* Structured extraction now offers two modes via the `extraction_mode` parameter on `/api/v1/extract`: **fast** (lowest latency, \$6 / 1K pages) and **balanced** (higher accuracy with per-field verification, reasoning, and citations, \$25 / 1K pages). `balanced` is the default. See [Balanced Extraction Mode](/docs/recipes/structured-extraction/balanced-mode).
* Teams that made an extraction request in the 30 days before June 4, 2026 keep **fast** as their default extraction mode; all other teams (and new teams) default to **balanced**. Set `extraction_mode` explicitly on any request to override the default.

## 5/22/2026

* `processing_location` is now supported in the direct file upload API — include `"processing_location": "eu"` in the `POST /api/v1/files/upload` request body to store uploaded files in EU infrastructure before passing the reference to inference endpoints.

## 5/21/2026

* New `processing_location` parameter on all inference API endpoints (`/api/v1/convert`, `/api/v1/extract`, `/api/v1/segment`, `/api/v1/fill`, `/api/v1/track-changes`, and pipeline runs) — pass `"eu"` to route processing and result storage to EU infrastructure for data residency requirements. When using `processing_location`, send the file via `file_url` or a pre-uploaded `datalab://` reference; multipart form uploads are not supported with this parameter. EU-region processing carries a regional pricing premium.
* Helm chart is now available for deploying the on-prem inference container on Kubernetes clusters.

## 4/9/2026

* Custom processor creation in Forge is now a guided 3-step wizard (Describe → Documents → Review) with a chat-driven builder that helps you articulate what your processor should do before generating it.

## 4/8/2026

* Form filling is now a first-class pipeline step type in Forge — build standalone fill pipelines with `PipelineProcessor(type="fill", settings={"field_data": {...}})` to apply versioning and execution tracking to your form-filling workflows.
* Forge pipeline workspace now shows extraction confidence scores inline as each pipeline step completes.
* Playground block annotations — give feedback on individual parsed blocks directly in the document view.

## 4/6/2026

* Spreadsheets now support the `page_range` parameter — use 0-based sheet indices (e.g., `"0,2"`) to process only specific sheets from a workbook.
* Forge is now the primary hub for creating and editing pipelines. Build processor chains visually, configure per-processor settings, and deploy versioned pipelines directly from the UI.
* Pipeline versioning UI — create, browse, and restore versions from within Forge. Discard draft changes to revert to any published version.
* Per-processor execution status tracking in Forge — watch each processor (convert, extract, segment) complete in real-time as a pipeline runs.
* New pipeline draft/discard API — `POST /api/v1/pipelines/{pipeline_id}/discard` (SDK: `client.discard_pipeline_draft()`) discards unsaved edits and reverts the pipeline to any published version.
* `POST /api/v1/custom-pipeline` is deprecated — use `POST /api/v1/pipelines/{pipeline_id}/run` instead.
* Workflows API (`/api/v1/workflows`) is deprecated. We recommend using [Pipelines](/docs/recipes/pipelines/pipeline-overview) for all new integrations and migrating any existing ones.

## 4/2/2026

* Saved Schemas are now available to all users. Create and manage reusable extraction schemas from the [dashboard](https://www.datalab.to/app/schemas), then pass `schema_id` to `POST /api/v1/extract` instead of an inline `page_schema`. Use `schema_version` to pin extractions to a specific schema version.
* Forge Evals now supports extraction comparison — select saved schemas in the compare flow to run and score extractions side-by-side across document collections. Scores display inline in the eval grid.

## 4/1/2026

* Playground now shows real-time scoring-in-progress indicators while extraction confidence scores are being computed asynchronously.
* Improved parse quality scoring accuracy with an upgraded underlying model.

## 3/31/2026

* Saved Schemas — create and manage reusable extraction schemas via `POST /api/v1/extraction_schemas`. Pass `schema_id` to `POST /api/v1/extract` instead of inline `page_schema` to reference a saved schema. Schemas support versioning; use `schema_version` to pin to a specific version.

## 3/26/2026

* Extraction confidence scoring released in beta — pass `include_scores=true` to `POST /api/v1/extract` to receive per-field `_score` values alongside citations, or score asynchronously via the new `POST /api/v1/extract/score` endpoint. Scoring is free. [Learn more](/docs/recipes/structured-extraction/confidence-scoring).
* New usage threshold alerts — set daily page-count thresholds in the dashboard to receive email notifications when your API usage approaches or exceeds them.

## 3/23/2026

* New pipeline templates in Forge — browse and run example pipeline configurations directly in PipelineWorkspace to see results without any setup.

## 3/20/2026

* Custom Pipelines now support a `classify` modification type — use LLM structured output to classify pages into categories and route subsequent processing steps based on classification results.

## 3/16/2026

* Per-page concurrency limit enforcement is now active. API results will return `success: false` with an error message if your team exceeds 5,000 pages in flight simultaneously. See [API Limits](/docs/common/limits) for details.

## 1/25/2026

* New Create Document API (`POST /api/v1/create-document`) — generate DOCX files from markdown with native Word track changes. Supports insertions (`<ins>`), deletions (`<del>`), and comments (`<comment>`) that appear as reviewable changes in Microsoft Word. SDK: `client.create_document()`.

## 1/24/2026

* Custom Pipelines beta launch — create reusable AI-powered document processing pipelines and execute them via the API (`POST /api/v1/custom-pipeline`). SDK: `client.run_custom_pipeline()`.

## 1/22/2026

* Chandra 1.5 release with improved table extraction, chemistry support, diagram rendering, and latency improvements. New `new_block_types` extra enables detection of chemistry structures, handwriting, and signatures.

## 12/19/2025

* Forge Evals now supports comparing against external providers — evaluate Datalab's parsing against other open source models (OlmoOCR, RolmoOCR, DotsOCR, DeepSeekOCR) and third-party services (upon request).

## 12/17/2025

* Form Filling API launch — automatically fill PDF and image forms with structured data. Supports native PDF form fields and visual/scanned forms.

## 12/10/2025

* Forge Evals launch — evaluate and compare parsing configurations across your documents to find optimal settings.

## 12/5/2025

* Improved tracked changes extraction from Word documents with better performance and accuracy.

## 12/4/2025

* Spreadsheet parsing support — parse Excel (.xlsx, .xls) and other spreadsheet formats with the Convert API.

## 12/3/2025

* Agni model improvements for better multi-page section hierarchy detection in OCR.

## 12/2/2025

* Chandra speed improvements — faster document processing with optimized inference.

## 12/1/2025

* Chandra 1.1 release with improved accuracy and performance.

## 11/18/2025

* Enhanced password security: the minimum password length is now 12 characters, and passwords are validated against a list of 100K common or compromised passwords, per NIST SP 800-63B.
* Improved section header hierarchy detection in accurate mode.

## 10/23/2025

* Workflows beta launch! You can now use the API and SDK to compose various steps like parse, extract, segment, and conditional logic to create document processing workflows that are reusable.

## 10/22/2025

* New model launch! Our SOTA model, Chandra, is now publicly-available, open-source, and accessible via our API (when using modes `balanced` and `accurate`).

## 10/20/2025

* During the global AWS outage we put mitigations in place to work around issues our upstream providers were experiencing. With these mitigations, despite ongoing upstream outages, we restored API service to our customers.

## 10/10/2025

* If a parse takes over 10 seconds in the playground, users can opt in to an email notification when it completes.
* Fixes and improvements to long-document processing in the playground.
* Fixes to how request statuses are updated (e.g. from "processing" to "complete"), so they update properly and on time.

## 10/8/2025

* v1.0.7 of our container released with stability improvements for very long-running containers (self-serve and enterprise customers only).

## 10/6/2025

* Users can now click on "View in Playground" on API requests in the Usage tab to view how their document was parsed, segmented, or extracted. This feature is enabled as long as users have the correct data retention settings.

## 10/3/2025

* v1.0.5 of our container released with settings to significantly reduce log output, useful for highly-scaled workloads.

## 9/25/2025

* Improvements to Segment/Extract UX in the playground.
* Fixes and improvements to segmentation results.

## 9/18/2025

* High Accuracy Mode launch -- API users can select `mode: "accurate"` for our highest accuracy document processing, trading off latency and cost (both higher).
* Public playground launch -- unauthenticated users can access the same playground experience as subscribers (with limitations) at [https://www.datalab.to/playground](https://www.datalab.to/playground)

## 9/16/2025

* New playground launch -- we now offer a significantly-improved playground where users can inspect how their documents are parsed or view document segmentation/structured extraction results.
* Segmentation V1 launch -- API users can segment documents automatically or with a schema.

## 9/5/2025

* v1.0.2 of our container released supporting both self-serve and enterprise customers with improved functionality and stability.
* Added `marker_lite` support to our container to measure OCR-likelihood.

## 9/1/2025

* Users can view showcased static examples in the Datalab playground.
* RTF file format support added to the API.

## 8/27/2025

* Launched our self-serve on-prem container, purchasable via Stripe checkout -- no sales or contracting process required.
* Added support for our `/ocr` endpoint in the container in addition to `/marker`.

## 8/20/2025

* Users can generate schemas automatically based on document content in the playground.
* Improvements to structured extraction quality and latency.

## 8/15/2025

* Users can view citation highlights from structured extraction requests in the playground.
* If parse quality scores are available, they will now be returned in the `/marker` response.

## 8/5/2025

* Launch a new OCR model with improved math performance.
* Improve marker quality in cases where there are inline equations or other text that needs OCR.

## 7/25/2025

* Improve speed of LLM mode and when outputting multiple output formats.

## 7/20/2025

* Launch a visual editor for structured extraction that lets you edit schemas and visualize results.

## 7/15/2025

* Add a visual editor for marker prompts that lets you see how the document was changed, test across documents, and save prompts.

## 7/1/2025

* Structured extraction beta - pass `page_schema` to the `marker` endpoint to extract structured data from documents. The schema should be a JSON schema, for example one generated from a Pydantic model with `.model_json_schema()`.
* Support the new `chunks` output format for marker, which is a simplified list of blocks with their full html, ideal for chunking/RAG.
* Marker endpoint is now promptable - pass `block_correction_prompt` to the marker endpoint to correct the output of marker with your custom logic.
* We support additional configuration parameters for marker via the `additional_config` parameter. This is a JSON object where the keys are the configuration options and the values are the values for those options. You can see the exact options in the API schema.

## 6/26/2025

* Support multiple output formats for one doc by passing them as comma-separated values in `output_format` for marker.
* Complete redesign of the dashboard, with a new look and feel. This will also make it easier for us to improve functionality in the future.

## 6/18/2025

* Improve the playground to make it more functional (easier to test options)
* Significantly improve styling in the playground
* Add a public version of the playground to make marker easier to test

## 6/3/2025

* Initial launch of playground, for testing marker parsing configurations

## 5/27/2025

* New OCR model which benchmarks better overall, handles inline math, gives detailed character bboxes.
* Add a `format_lines` flag to marker that adds inline math and formatting to lines. It also automatically OCRs any lines that need it.

## 3/26/2025

* Add support for multiple file formats - spreadsheets, epub, html, in addition to existing document, image, pdf, and presentation formats.
* Improve inline math and formatting when passing `use_llm`.
* `use_llm` (the high accuracy mode) now costs the same as regular inference.

## 1/30/2025

Marker:

* Integrate a new table recognition model, which handles rowspans and colspans better. This is a significant improvement on the old model.
* Improve the `--use_llm` option to merge tables across pages, OCR handwriting, OCR forms, and generally have much higher quality than before.
* Integrate a new LaTeX OCR model that is significantly more accurate.
* Add links and references to the markdown - the references include internal links.

General:

* Speed up inference time.
* Remove the line detection endpoint - it had low usage.
* Improve the `table_rec` endpoint - it now takes the `--use_llm` flag, and should run much faster.

## 1/3/2025

* Add the `use_llm` option to the marker API - this uses an LLM to make conversion much more accurate for tables, forms, inline math, and complex pages. It's a beta feature, and will currently double the cost per request.
* Added other options to the marker endpoint.
  * Use `disable_image_extraction` to disable image extraction for marker.
  * Use `strip_existing_ocr` to strip all existing OCR text and re-OCR (if it was added by something like tesseract)
* Better automatic heuristics for when to OCR with marker.
* Better text extraction and layout detection for marker.
* Speed up the marker and OCR endpoints by \~30%.

## 12/4/2024

* Uploaded files can now be up to 200MB in size.
* Improved speed by optimizing file handling on the backend.

## 12/3/2024

* We now offer \$5 in free credits to new signups
* Additional bugfixes to improve markdown output quality

## 12/2/2024

* We sped up file operations internally, which should result in a decent API speed boost
* We now handle blockquotes and nested lists with the marker endpoint

## 11/27/2024

* Marker is now at v1, with a lot of improvements - it's 4x faster than a month ago, and quality is much higher across all document types
* The layout model has been upgraded to a new version, with more potential prediction types

## 10/31/2024

* More API speedups, on the order of 15-20% for marker.
* Bump concurrency/rate limits to 200.
* Improve stability of service under load.
* If you cancel, you will now retain your credits until the end of the month.
* Visual improvements on the marketing site.

## 10/28/2024

* Significant API speedups, on the order of 40% faster.

## 10/25/2024

* Flatten form fields into pdf when extracting tables and markdown
* Fix page separators, they now appear at the start of every page, and include a page number

## 10/23/2024

* Speed up marker, layout, and detection by 20-30%
* Fix various bugs that cause edge case errors in conversion
* Increase concurrent request limit to 100

## 10/21/2024

* Significantly improve marker output quality
  * Include header levels like h1, h2, etc.
  * Parse tables very accurately
  * Improve block type detection and markdown quality
  * Fix many output bugs
* Add in new table recognition model at the /table\_rec endpoint
  * This will detect and convert tables into a given format
* Improve OCR, layout, text detection quality
* Fix memory leaks and improve performance
* Fix bugs with pagination and marker

## 8/19/2024

* Add in new OCR model with better accuracy across the board
* Language is now optional for marker and OCR model
* Increase max page count and max pixel width

## 7/20/2024

* Drop prices for marker and surya inference.

## 7/12/2024

* Significant speedup for marker and surya text detection/layout. 10-15% faster.

## 7/10/2024

* Increase concurrent request limit to 50.

## 7/6/2024

* Major infrastructure stability improvements.

## 7/3/2024

* Added response caching for up to 1 hour. If you send the same document to the same endpoint, with the same options, within that time, you'll get a cache hit and won't be billed again.

## 7/2/2024

* Improved parsing for Powerpoint presentations and Word documents.
* Add status page and changelog.

## 6/26/2024

* Increase concurrency limits for all users

## 6/25/2024

* Return page count from all endpoints
* Users can now disable marker image extraction
* Webhooks are now supported instead of polling. Webhooks will ping a given URL when inference is complete.

## 6/21/2024

* Initial support for Microsoft Word and Microsoft Powerpoint documents (docx/doc/pptx/ppt).

## 6/18/2024

* Enable paginating marker output.

## 5/31/2024

* Initial launch of marker and surya APIs.


# Error Codes
Source: https://documentation.datalab.to/platform/errors

HTTP error codes, response formats, and retry guidance.

## Error Response Format

All API errors return a JSON response with a `detail` field:

```json theme={null}
{
  "detail": "Error message describing what went wrong"
}
```

For validation errors (malformed request body), the response includes field-level details:

```json theme={null}
{
  "detail": [
    {
      "type": "validation_error",
      "loc": ["body", "field_name"],
      "msg": "Field validation message",
      "input": "provided_value"
    }
  ]
}
```
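
Because `detail` can be either a string or a list of field-level entries, client code usually needs to handle both shapes. The following is a minimal sketch of one way to do that; the function name `error_messages` and the `location: message` output format are illustrative choices, not part of the API.

```python
from typing import Any, List


def error_messages(payload: dict) -> List[str]:
    """Return human-readable messages from an error response body."""
    detail: Any = payload.get("detail")
    if isinstance(detail, str):
        # Plain error: {"detail": "..."}
        return [detail]
    if isinstance(detail, list):
        messages = []
        for item in detail:
            # "loc" is a path such as ["body", "field_name"].
            location = ".".join(str(part) for part in item.get("loc", []))
            msg = item.get("msg", "")
            messages.append(f"{location}: {msg}" if location else msg)
        return messages
    return []


validation_body = {
    "detail": [
        {
            "type": "validation_error",
            "loc": ["body", "field_name"],
            "msg": "Field validation message",
            "input": "provided_value",
        }
    ]
}
print(error_messages(validation_body))
```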

## HTTP Error Codes

| Code | Type                    | Retryable | Description                                                                                       |
| ---- | ----------------------- | --------- | ------------------------------------------------------------------------------------------------- |
| 400  | `invalid_request_error` | No        | Issue with the format or content of your request                                                  |
| 401  | `authentication_error`  | No        | Invalid or missing API key                                                                        |
| 402  | `spend_cap_error`       | No        | API key has reached its configured 30-day spend cap — increase the limit in your billing settings |
| 403  | `permission_error`      | No        | API key lacks permission or subscription issue                                                    |
| 404  | `not_found_error`       | No        | Requested resource not found or expired                                                           |
| 413  | `request_too_large`     | No        | File exceeds the maximum allowed size                                                             |
| 429  | `rate_limit_error`      | **Yes**   | Rate limit exceeded — wait and retry                                                              |
| 500  | `api_error`             | **Yes**   | Internal server error — wait and retry                                                            |
| 529  | `overloaded_error`      | **Yes**   | API temporarily overloaded — wait and retry                                                       |
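
For clients that call the HTTP API directly, the retry guidance in the table can be sketched as a small wrapper with exponential backoff. The set of retryable codes comes from the table; the base delay, cap, jitter amount, attempt count, and the `send` callable (any function returning a status code and parsed body) are assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, Tuple

RETRYABLE_STATUS_CODES = {429, 500, 529}  # marked "Yes" in the table above


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay in seconds before retry number `attempt` (0-based), without jitter."""
    return min(cap, base * (2 ** attempt))


def call_with_retries(
    send: Callable[[], Tuple[int, dict]],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, dict]:
    """Call `send` until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE_STATUS_CODES:
            return status, body
        if attempt < max_attempts - 1:
            # Add up to 10% random jitter so clients do not retry in lockstep.
            delay = backoff_delay(attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    # Attempts exhausted: return the last retryable response to the caller.
    return status, body


# Demo with a fake sender and a no-op sleep, so nothing touches the network.
fake_responses = iter([(429, {}), (200, {"success": True})])
print(call_with_retries(lambda: next(fake_responses), sleep=lambda seconds: None))
```

Non-retryable codes such as 400 or 401 are returned immediately, since repeating the same request would fail the same way.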

## SDK Exception Mapping

The Python SDK maps HTTP errors to specific exception classes:

| HTTP Code  | SDK Exception            | Description                                                |
| ---------- | ------------------------ | ---------------------------------------------------------- |
| 400        | `DatalabAPIError`        | Check the `response_data` field for details                |
| 401        | `DatalabAPIError`        | Invalid API key                                            |
| 402        | `DatalabAPIError`        | Spend cap exceeded — check `response_data` for the message |
| 403        | `DatalabAPIError`        | Subscription or permission issue                           |
| 404        | `DatalabAPIError`        | Resource not found or expired                              |
| 413        | `DatalabAPIError`        | File too large                                             |
| 429        | Auto-retried             | SDK retries automatically with exponential backoff         |
| 500        | Auto-retried             | SDK retries automatically                                  |
| Timeout    | `DatalabTimeoutError`    | Request timed out                                          |
| File error | `DatalabFileError`       | File not found or empty                                    |
| Validation | `DatalabValidationError` | Invalid input parameters                                   |

```python theme={null}
from datalab_sdk import DatalabClient
from datalab_sdk.exceptions import (
    DatalabAPIError,
    DatalabTimeoutError,
    DatalabFileError,
    DatalabValidationError,
)

client = DatalabClient()

try:
    result = client.convert("document.pdf")
except DatalabFileError as e:
    print(f"File issue: {e}")
except DatalabTimeoutError as e:
    print(f"Timed out: {e}")
except DatalabAPIError as e:
    print(f"API error (HTTP {e.status_code}): {e}")
    if e.response_data:
        print(f"Details: {e.response_data}")
```

## Common Error Messages

### 400 Bad Request

```json theme={null}
{"detail": "Invalid file type. Only PDF files, word documents, spreadsheets, powerpoints, HTML, and PNG, JPG, GIF, TIFF, and WEBP images are accepted."}
```

```json theme={null}
{"detail": "File size exceeds upload limit of 209715200 bytes."}
```

### 401 Unauthorized

```json theme={null}
{"detail": "Invalid API key provided. Set the X-API-Key header to your API key."}
```

### 403 Forbidden

```json theme={null}
{"detail": "You need an active, paid subscription to use this API."}
```

```json theme={null}
{"detail": "Your subscription has expired. You may need to re-enable your plan, or pay an unpaid invoice."}
```

```json theme={null}
{"detail": "Your payment has failed. Please pay any unpaid invoices to continue using the API."}
```

### 429 Too Many Requests

```json theme={null}
{"detail": "Rate limit exceeded for endpoint /api/v1/convert. You can make 200 requests every 60 seconds. Please try again later, or reach out to support@datalab.to if you need a higher limit."}
```

```json theme={null}
{"detail": "Concurrency exceeded for endpoint /api/v1/convert. You can have 400 concurrent requests running at once. Please try again later, or reach out to support@datalab.to if you need a higher limit."}
```

### Page Concurrency Limit (returned in results, not as an HTTP error)

The page concurrency limit is enforced during processing, not at submission time. Instead of a `429` response, the result will return with `success` set to `false`:

```json theme={null}
{"success": false, "error": "Page rate limit exceeded. Your team has {current_pages} pages in flight and this request adds {page_count} more ({total} total, limit: 5,000). Please wait for some requests to complete before submitting more, or contact support@datalab.to for a higher limit."}
```

See [API Limits](/docs/common/limits#page-concurrency-limit) for details.
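
Since this failure arrives in the result body rather than as an HTTP status, a check on the status code alone will miss it. A minimal sketch of detecting it, assuming the result has already been parsed into a dictionary and that the `Page rate limit exceeded` prefix shown above is stable enough to match on:

```python
def is_page_limit_error(result: dict) -> bool:
    """True if a result failed because of the page concurrency limit."""
    if result.get("success") is not False:
        return False
    # Assumption: the message keeps the prefix shown in the example above.
    return str(result.get("error", "")).startswith("Page rate limit exceeded")


example = {
    "success": False,
    "error": "Page rate limit exceeded. Your team has too many pages in flight.",
}
if is_page_limit_error(example):
    print("Wait for in-flight requests to finish before resubmitting")
```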

## Subscription and Access Errors

When making API requests, you may encounter 403 errors related to your subscription status:

### No Active Subscription

**Error**: `"You need an active, paid subscription to use this API."`

This occurs when you don't have an active subscription and have exhausted your free credits. To resolve:

* Subscribe to a paid plan in the [dashboard](https://www.datalab.to/app/billing)
* New accounts include free credits — verify your email to claim them

### Expired Subscription

**Error**: `"Your subscription has expired. You may need to re-enable your plan, or pay an unpaid invoice."`

Your subscription has passed its end date and grace period. To resolve:

* Renew your subscription in the [dashboard](https://www.datalab.to/app/billing)
* Pay any outstanding invoices

### Payment Failed

**Error**: `"Your payment has failed. Please pay any unpaid invoices to continue using the API."`

A payment for your subscription has failed and you've exceeded the grace period. To resolve:

* Update your payment method in the [dashboard](https://www.datalab.to/app/billing)
* Pay any unpaid invoices

### Inactive Subscription

**Error**: `"Your subscription is not active. You may need to re-enable your plan or pay an unpaid invoice."`

Your subscription is canceled or inactive. To resolve:

* Reactivate your subscription in the [dashboard](https://www.datalab.to/app/billing)
* Subscribe to a new plan
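
An integration that surfaces these errors to operators could map each message to its suggested fix. The sketch below matches on a distinguishing fragment of each 403 message listed above; the fragments, the function name `suggest_action`, and the fallback text are illustrative assumptions, and the matching would need updating if the message wording changes.

```python
BILLING_URL = "https://www.datalab.to/app/billing"

# Distinguishing fragment of each 403 message -> suggested action.
REMEDIATION = [
    ("active, paid subscription", "Subscribe to a paid plan"),
    ("subscription has expired", "Renew the subscription and pay outstanding invoices"),
    ("payment has failed", "Update the payment method and pay unpaid invoices"),
    ("subscription is not active", "Reactivate the subscription or subscribe to a new plan"),
]


def suggest_action(detail: str) -> str:
    """Return a suggested next step for a 403 error message."""
    lowered = detail.lower()
    for fragment, action in REMEDIATION:
        if fragment in lowered:
            return f"{action}: {BILLING_URL}"
    return "Unrecognized 403 error; check the API key's permissions"


print(suggest_action("Your payment has failed. Please pay any unpaid invoices to continue using the API."))
```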

## Next Steps

<CardGroup>
  <Card title="Troubleshooting" icon="wrench" href="/platform/troubleshooting">
    Detailed debugging guide for common issues
  </Card>

  <Card title="Billing" icon="credit-card" href="/platform/billing">
    Per-page pricing, payment failures, and grace periods
  </Card>

  <Card title="API Limits" icon="gauge" href="/docs/common/limits">
    File size limits, page limits, and rate limiting
  </Card>

  <Card title="SDK Reference" icon="code" href="/docs/welcome/sdk">
    Python SDK with automatic retries and error handling
  </Card>
</CardGroup>


# Migration Guide
Source: https://documentation.datalab.to/platform/migration

Migrate from deprecated endpoints to the current API.

This guide helps you migrate from deprecated Datalab API endpoints to their current replacements.

## Marker → Dedicated Endpoints

<Warning>
  The `/api/v1/marker` endpoint is deprecated. Migrate to the new dedicated endpoints below.
</Warning>

The monolithic `/api/v1/marker` endpoint has been replaced with dedicated endpoints for each operation:

| Old Usage                             | New Endpoint                    | SDK Method                      |
| ------------------------------------- | ------------------------------- | ------------------------------- |
| `/marker` (basic conversion)          | `POST /api/v1/convert`          | `client.convert()`              |
| `/marker` with `page_schema`          | `POST /api/v1/extract`          | `client.extract()`              |
| `/marker` with `segmentation_schema`  | `POST /api/v1/segment`          | `client.segment()`              |
| `/marker` with `extras=track_changes` | `POST /api/v1/track-changes`    | `client.track_changes()`        |
| `/marker` with `pipeline_id`          | `POST /api/v1/custom-processor` | `client.run_custom_processor()` |
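
When auditing existing `/marker` call sites, the table can be expressed as a small lookup. This is a sketch only: the function name `replacement_endpoint`, the precedence order when several parameters are set at once, and the treatment of `extras` as a comma-separated string are assumptions, not documented behavior.

```python
def replacement_endpoint(params: dict) -> str:
    """Pick the replacement for a legacy /api/v1/marker request."""
    # Assumption: precedence follows the order below if several are set.
    if params.get("pipeline_id"):
        return "/api/v1/custom-processor"
    if params.get("page_schema"):
        return "/api/v1/extract"
    if params.get("segmentation_schema"):
        return "/api/v1/segment"
    extras = [part.strip() for part in str(params.get("extras", "")).split(",")]
    if "track_changes" in extras:
        return "/api/v1/track-changes"
    # Plain conversion with no special parameters.
    return "/api/v1/convert"


print(replacement_endpoint({"output_format": "markdown"}))
print(replacement_endpoint({"extras": "track_changes"}))
```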

### SDK upgrade

Update to the latest SDK for the new dedicated methods:

```bash theme={null}
pip install --upgrade datalab-python-sdk
```

SDK users who only use `client.convert()` do not need to change code — it continues to work and now calls `/api/v1/convert` internally.

### Document Conversion

<CodeGroup>
  ```python Python SDK theme={null}
  # No changes needed — convert() works the same
  from datalab_sdk import DatalabClient, ConvertOptions

  client = DatalabClient()
  result = client.convert("document.pdf")
  print(result.markdown)
  ```

  ```bash cURL (before) theme={null}
  # Old
  curl -X POST https://www.datalab.to/api/v1/marker \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -F "file=@document.pdf" \
    -F "output_format=markdown"
  ```

  ```bash cURL (after) theme={null}
  # New
  curl -X POST https://www.datalab.to/api/v1/convert \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -F "file=@document.pdf" \
    -F "output_format=markdown"
  ```
</CodeGroup>

### Structured Extraction

<CodeGroup>
  ```python Python SDK (before) theme={null}
  # Old: page_schema on ConvertOptions
  from datalab_sdk import DatalabClient, ConvertOptions
  options = ConvertOptions(page_schema=schema)
  result = client.convert("invoice.pdf", options=options)
  ```

  ```python Python SDK (after) theme={null}
  # New: Dedicated extract() method with ExtractOptions
  from datalab_sdk import DatalabClient, ExtractOptions
  import json

  client = DatalabClient()
  options = ExtractOptions(
      page_schema=json.dumps(schema)
  )
  result = client.extract("invoice.pdf", options=options)
  extracted = json.loads(result.extraction_schema_json)
  ```
</CodeGroup>

### Document Segmentation

<CodeGroup>
  ```python Python SDK (before) theme={null}
  # Old: segmentation_schema on ConvertOptions
  options = ConvertOptions(segmentation_schema=json.dumps(schema))
  result = client.convert("document.pdf", options=options)
  ```

  ```python Python SDK (after) theme={null}
  # New: Dedicated segment() method with SegmentOptions
  from datalab_sdk import DatalabClient, SegmentOptions
  import json

  client = DatalabClient()
  options = SegmentOptions(
      segmentation_schema=json.dumps(schema)
  )
  result = client.segment("document.pdf", options=options)
  segments = result.segmentation_results
  ```
</CodeGroup>

### Track Changes

<CodeGroup>
  ```python Python SDK (before) theme={null}
  # Old: extras parameter on ConvertOptions
  options = ConvertOptions(extras="track_changes", output_format="html")
  result = client.convert("contract.docx", options=options)
  ```

  ```python Python SDK (after) theme={null}
  # New: Dedicated track_changes() method
  from datalab_sdk import DatalabClient, TrackChangesOptions

  client = DatalabClient()
  options = TrackChangesOptions(output_format="markdown,html,chunks")
  result = client.track_changes("contract.docx", options=options)
  ```
</CodeGroup>

### Checkpoint reuse

The new endpoints support a checkpoint system to avoid re-parsing documents. Convert once, then extract or segment multiple times:

```python theme={null}
from datalab_sdk import DatalabClient, ConvertOptions, ExtractOptions, SegmentOptions
import json

client = DatalabClient()

# Step 1: Convert and save checkpoint
options = ConvertOptions(save_checkpoint=True)
result = client.convert("document.pdf", options=options)
checkpoint_id = result.checkpoint_id

# Step 2: Extract using checkpoint (no re-parsing)
extract_opts = ExtractOptions(
    checkpoint_id=checkpoint_id,
    page_schema=json.dumps({"invoice_number": {"type": "string"}})
)
extracted = client.extract(options=extract_opts)

# Step 3: Segment using same checkpoint
segment_opts = SegmentOptions(
    checkpoint_id=checkpoint_id,
    segmentation_schema=json.dumps({"sections": ["Header", "Body", "Footer"]})
)
segmented = client.segment(options=segment_opts)
```

## Workflows → Pipelines

<Warning>
  The Workflows API (`/api/v1/workflows`) is deprecated. Use [Pipelines](/docs/recipes/pipelines/pipeline-overview) for all new integrations and migrate existing workflows.
</Warning>

Pipelines replace Workflows with a simpler API, per-step status tracking, versioning, and a visual editor in Forge.

| Workflows                                       | Pipelines                                                                            |
| ----------------------------------------------- | ------------------------------------------------------------------------------------ |
| `POST /api/v1/workflows/workflows`              | `POST /api/v1/pipelines` (via SDK: `client.create_pipeline()`)                       |
| `POST /api/v1/workflows/workflows/{id}/execute` | `POST /api/v1/pipelines/{id}/run` (via SDK: `client.run_pipeline()`)                 |
| `GET /api/v1/workflows/executions/{id}`         | `GET /api/v1/pipelines/executions/{id}` (via SDK: `client.get_pipeline_execution()`) |

See [Pipelines](/docs/recipes/pipelines/pipeline-overview) for a full walkthrough.
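
To locate and rewrite hard-coded Workflows paths in a codebase, the mapping above can be sketched as a set of regular-expression rules. Note that this only rewrites the shape of the path: workflow IDs are not assumed to be valid pipeline IDs, so each pipeline still has to be created first. The function name `to_pipeline_path` is hypothetical.

```python
import re

# (pattern for the deprecated path, replacement path) taken from the table above.
_RULES = [
    (re.compile(r"^/api/v1/workflows/workflows/([^/]+)/execute$"), r"/api/v1/pipelines/\1/run"),
    (re.compile(r"^/api/v1/workflows/executions/([^/]+)$"), r"/api/v1/pipelines/executions/\1"),
    (re.compile(r"^/api/v1/workflows/workflows$"), "/api/v1/pipelines"),
]


def to_pipeline_path(path: str) -> str:
    """Translate a deprecated Workflows path into its Pipelines equivalent."""
    for pattern, replacement in _RULES:
        if pattern.match(path):
            return pattern.sub(replacement, path)
    raise ValueError(f"No pipeline equivalent listed for {path}")


print(to_pipeline_path("/api/v1/workflows/workflows/{id}/execute"))
```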

## Custom Pipeline → Custom Processor

<Warning>
  `POST /api/v1/custom-pipeline` is deprecated (sunset: September 30, 2026). Migrate to `POST /api/v1/custom-processor`. The management routes `/api/v1/custom_pipelines/*` are also deprecated; use `/api/v1/custom_processors/*` instead.
</Warning>

<CodeGroup>
  ```python Python SDK (before) theme={null}
  from datalab_sdk import DatalabClient, CustomProcessorOptions

  client = DatalabClient()
  options = CustomProcessorOptions(pipeline_id="cp_XXXXX")
  result = client.run_custom_pipeline("document.pdf", options=options)
  ```

  ```python Python SDK (after) theme={null}
  from datalab_sdk import DatalabClient, CustomProcessorOptions

  client = DatalabClient()
  options = CustomProcessorOptions(pipeline_id="cp_XXXXX")
  result = client.run_custom_processor("document.pdf", options=options)
  ```

  ```bash cURL (before) theme={null}
  curl -X POST https://www.datalab.to/api/v1/custom-pipeline \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -F "file=@document.pdf" \
    -F "pipeline_id=cp_XXXXX"
  ```

  ```bash cURL (after) theme={null}
  curl -X POST https://www.datalab.to/api/v1/custom-processor \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -F "file=@document.pdf" \
    -F "pipeline_id=cp_XXXXX"
  ```
</CodeGroup>

The response format is identical. `CustomPipelineOptions` remains as a backward-compatible alias for `CustomProcessorOptions`.

## Table Recognition → Document Conversion

The standalone Table Recognition endpoint (`/api/v1/table_rec`) is deprecated. Use the Document Conversion endpoint with JSON output instead.

### Before (deprecated)

```python theme={null}
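# Old: Dedicated table recognition endpoint
import os
import requests

API_KEY = os.environ["DATALAB_API_KEY"]

with open("doc.pdf", "rb") as f:
    response = requests.post(
        "https://www.datalab.to/api/v1/table_rec",
        files={"file": ("doc.pdf", f, "application/pdf")},
        headers={"X-API-Key": API_KEY},
    )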
```

### After (current)

<CodeGroup>
  ```python Python SDK theme={null}
  from datalab_sdk import DatalabClient, ConvertOptions

  client = DatalabClient()

  options = ConvertOptions(
      output_format="json",
      mode="balanced"
  )

  result = client.convert("document.pdf", options=options)

  # Tables are in the JSON output with block_type "Table"
  for block in result.json.get("children", []):
      if block.get("block_type") == "Table":
          print(f"Table: {block['id']}")
          print(f"Bounding box: {block['bbox']}")
          # Access cells in block['children']
  ```

  ```bash cURL theme={null}
  curl -X POST https://www.datalab.to/api/v1/convert \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -F "file=@document.pdf" \
    -F "output_format=json" \
    -F "mode=balanced"
  ```
</CodeGroup>
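
The SDK example above inspects only the top-level `children`. If tables sit deeper in the tree (for example under page blocks), a recursive walk is one way to collect them. This sketch assumes the JSON output is a nested dictionary in which every block carries `block_type` and an optional `children` list, as the example suggests; the sample tree and its `id` values are made up for illustration.

```python
from typing import Iterator


def iter_blocks(node: dict, block_type: str) -> Iterator[dict]:
    """Yield every block of `block_type` at any depth under `node`."""
    if node.get("block_type") == block_type:
        yield node
    # `children` may be missing or null on leaf blocks.
    for child in node.get("children") or []:
        yield from iter_blocks(child, block_type)


# Made-up sample tree; real output will contain more fields.
sample = {
    "block_type": "Document",
    "children": [
        {
            "block_type": "Page",
            "children": [
                {"block_type": "Text", "id": "text-0", "children": None},
                {"block_type": "Table", "id": "table-0", "children": []},
            ],
        }
    ],
}
table_ids = [block["id"] for block in iter_blocks(sample, "Table")]
print(table_ids)
```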

## OCR → Document Conversion

The standalone OCR endpoint (`/api/v1/ocr`) is deprecated. Use the Document Conversion endpoint instead, which includes OCR as part of its processing pipeline.

### Before (deprecated)

```python theme={null}
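# Old: Dedicated OCR endpoint
import os
import requests

API_KEY = os.environ["DATALAB_API_KEY"]

with open("doc.pdf", "rb") as f:
    response = requests.post(
        "https://www.datalab.to/api/v1/ocr",
        files={"file": ("doc.pdf", f, "application/pdf")},
        headers={"X-API-Key": API_KEY},
    )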
```

### After (current)

<CodeGroup>
  ```python Python SDK theme={null}
  from datalab_sdk import DatalabClient, ConvertOptions

  client = DatalabClient()

  # For text extraction, use markdown output
  result = client.convert("document.pdf")
  print(result.markdown)

  # For page-level text, use JSON output
  options = ConvertOptions(output_format="json")
  result = client.convert("document.pdf", options=options)
  ```

  ```bash cURL theme={null}
  curl -X POST https://www.datalab.to/api/v1/convert \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -F "file=@document.pdf" \
    -F "output_format=markdown"
  ```
</CodeGroup>

## Next Steps

<CardGroup>
  <Card title="Document Conversion" icon="file-text" href="/docs/recipes/conversion/conversion-api-overview">
    Full guide to the current conversion API
  </Card>

  <Card title="Structured Extraction" icon="table" href="/docs/recipes/structured-extraction/api-overview">
    Extract structured data using JSON schemas
  </Card>

  <Card title="Changelog" icon="clock-rotate-left" href="/platform/changelog">
    See all API changes and deprecations
  </Card>

  <Card title="SDK Reference" icon="code" href="/docs/welcome/sdk">
    Use the SDK for the simplest migration path
  </Card>
</CardGroup>


# Security Best Practices
Source: https://documentation.datalab.to/platform/security

Keep your Datalab integration secure with these best practices.

Follow these practices to keep your Datalab integration secure in production.

## API Key Management

### Store keys in environment variables

Never hardcode API keys in source code. Use environment variables:

```bash theme={null}
export DATALAB_API_KEY="your-api-key"
```

```python theme={null}
# The SDK reads DATALAB_API_KEY automatically
from datalab_sdk import DatalabClient
client = DatalabClient()  # Uses env var
```

<Warning>
  Never commit API keys to version control. Add `.env` files to your `.gitignore`.
</Warning>

### Use per-key spend limits

Create separate API keys for different environments and set spend limits on each:

* **Development key** — low spend limit for testing
* **Staging key** — moderate limit for integration testing
* **Production key** — appropriate limit for your expected usage

Manage keys at [datalab.to/app/keys](https://www.datalab.to/app/keys).
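
One way to keep per-environment keys apart in code is to resolve the key from an environment-specific variable. The sketch below is illustrative only: the `DATALAB_API_KEY_DEV`, `DATALAB_API_KEY_STAGING`, and `DATALAB_API_KEY_PROD` names and the `api_key_for` helper are hypothetical and are not SDK conventions (the SDK itself reads `DATALAB_API_KEY`).

```python
import os

# Hypothetical variable names, one per environment; not an SDK convention.
KEY_VARS = {
    "development": "DATALAB_API_KEY_DEV",
    "staging": "DATALAB_API_KEY_STAGING",
    "production": "DATALAB_API_KEY_PROD",
}

def api_key_for(environment, environ=None):
    """Return the API key for the given environment, failing loudly if unset."""
    environ = os.environ if environ is None else environ
    if environment not in KEY_VARS:
        raise ValueError(f"Unknown environment: {environment!r}")
    var_name = KEY_VARS[environment]
    key = environ.get(var_name)
    if not key:
        raise RuntimeError(f"{var_name} is not set")
    return key
```

The returned value can then be passed to the client as `api_key`, so a development process never picks up the production key by accident.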

### Rotate keys regularly

Rotate keys on a regular schedule, and immediately if you suspect a key has been compromised:

1. Create a new API key at [datalab.to/app/keys](https://www.datalab.to/app/keys)
2. Update your application to use the new key
3. Revoke the old key

<Tip>
  Create the new key before revoking the old one to avoid downtime.
</Tip>

## Webhook Security

### Always use HTTPS

Configure your webhook endpoint to use HTTPS. Webhook payloads contain request data that should be encrypted in transit.

### Verify webhook signatures

Always verify the webhook signature before processing the payload:

```python theme={null}
import hashlib
import hmac
import os
from fastapi import FastAPI, Request, HTTPException

app = FastAPI()
# Read the secret from the environment instead of hardcoding it
WEBHOOK_SECRET = os.environ["DATALAB_WEBHOOK_SECRET"]

@app.post("/webhook")
async def handle_webhook(request: Request):
    body = await request.body()
    # Default to "" so a missing header fails verification instead of raising
    signature = request.headers.get("X-Webhook-Signature", "")

    expected = hmac.new(
        WEBHOOK_SECRET.encode(),
        body,
        hashlib.sha256
    ).hexdigest()

    # Constant-time comparison on bytes avoids timing leaks
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        raise HTTPException(status_code=401, detail="Invalid signature")

    # Process the webhook payload
    payload = await request.json()
    return {"status": "ok"}
```

### Handle duplicate events

Webhook deliveries may be retried on 5xx errors or timeouts. Use the `request_id` field to deduplicate:

```python theme={null}
from fastapi import FastAPI, Request

app = FastAPI()
processed_ids = set()  # Use a database in production

@app.post("/webhook")
async def handle_webhook(request: Request):
    payload = await request.json()
    request_id = payload["request_id"]

    if request_id in processed_ids:
        return {"status": "already processed"}

    processed_ids.add(request_id)
    # Process the payload
    return {"status": "ok"}
```
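
The in-memory set above is lost on restart and is not shared between workers. As a rough sketch of the "use a database" suggestion, the example below records request IDs in SQLite and relies on the primary key to reject duplicates. The table name and the `open_store` / `claim_request` helpers are hypothetical; any database with a unique constraint would serve the same purpose.

```python
import sqlite3

def open_store(path=":memory:"):
    """Create the table that records which request IDs were already handled."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_webhooks (request_id TEXT PRIMARY KEY)"
    )
    return conn

def claim_request(conn, request_id):
    """Return True the first time a request_id is seen, False for duplicates."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT OR IGNORE INTO processed_webhooks (request_id) VALUES (?)",
            (request_id,),
        )
    # rowcount is 1 when the row was inserted and 0 when it already existed
    return cursor.rowcount == 1
```

Because the insert and the duplicate check happen in a single statement, two deliveries of the same event cannot both be claimed.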

<Warning>
  Do not log webhook secrets or full webhook payloads containing sensitive document data.
</Warning>

## Data Handling

### Results expiration

Conversion results are automatically deleted from Datalab servers **one hour** after processing completes. Retrieve and store results in your own infrastructure promptly.
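
A minimal sketch of storing a retrieved result locally before it expires. The `save_result` helper, the directory name, and the one-file-per-request layout are assumptions for illustration; in practice you might write to object storage or a database instead.

```python
import json
from pathlib import Path

def save_result(result, request_id, directory="datalab_results"):
    """Write a retrieved result to local storage and return the file path."""
    # Reject IDs that could escape the target directory
    if Path(request_id).name != request_id:
        raise ValueError("request_id must not contain path separators")
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{request_id}.json"
    # Write to a temporary file first so a crash never leaves a partial result
    tmp_path = out_dir / f"{request_id}.json.tmp"
    tmp_path.write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")
    tmp_path.replace(out_path)
    return out_path
```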

### Data retention consent

You can control whether your documents are used to improve Datalab's models. This is an opt-in setting configurable in your team settings. Teams that opt in receive discounted rates.

### Minimize data exposure

* Only send documents that need to be processed — avoid sending unnecessary files
* Use `page_range` to process only the pages you need rather than entire documents
* Download and delete results as soon as they're available
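
To illustrate the `page_range` suggestion, the helper below collapses a list of page numbers into a compact range string. It assumes `page_range` accepts comma-separated, zero-indexed pages and ranges (for example `0,2-4,9`); confirm the exact format in the API reference before relying on it. The `to_page_range` name is hypothetical.

```python
def to_page_range(pages):
    """Collapse page numbers into a compact string such as "0,2-4,9"."""
    ordered = sorted(set(pages))
    if not ordered:
        return ""
    parts = []
    start = prev = ordered[0]
    for page in ordered[1:]:
        if page == prev + 1:
            prev = page  # still inside a consecutive run
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = page
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)
```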

## Network Security

### For on-premises deployments

* Place the Datalab container behind a reverse proxy with TLS termination
* Restrict network access to the container's port (8000) to trusted clients only
* The on-premises container does not require API key authentication by default — implement authentication at the network or reverse proxy level
* See [On-Premises Overview](/docs/on-prem/overview) for deployment details

### IP restrictions

For additional security, consider restricting API access to known IP addresses using your infrastructure's firewall or WAF rules.
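
Where a firewall or WAF rule is not available, the same allowlist idea can be approximated in application code, for example in a service sitting in front of a self-hosted deployment. This is only a sketch: the `is_allowed` helper is hypothetical, and the networks shown are reserved documentation ranges that you would replace with your own.

```python
import ipaddress

# Example ranges from the documentation-reserved blocks; replace with your own.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]

def is_allowed(client_ip, networks=ALLOWED_NETWORKS):
    """Return True if client_ip falls inside any allowed network."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # Malformed addresses are rejected
    return any(address in network for network in networks)
```

If the service runs behind a proxy, make sure the address you check is the real client address and not the proxy's.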

## Next Steps

<CardGroup>
  <Card title="Webhooks" icon="bell" href="/platform/webhooks">
    Configure and verify webhook signatures
  </Card>

  <Card title="API Limits" icon="gauge" href="/docs/common/limits">
    Understand rate limits and quotas
  </Card>

  <Card title="Billing" icon="credit-card" href="/platform/billing">
    Manage spend limits and usage
  </Card>

  <Card title="On-Premises" icon="server" href="/docs/on-prem/overview">
    Self-hosted deployment security
  </Card>
</CardGroup>


# Troubleshooting
Source: https://documentation.datalab.to/platform/troubleshooting

Common issues and solutions when using the Datalab API.

This page covers the most common issues you may encounter when using the Datalab API, organized by the error messages you'll see.

## Authentication Errors

### "Invalid API key provided"

```json theme={null}
{"detail": "Invalid API key provided. Set the X-API-Key header to your API key."}
```

**Status:** 401 Unauthorized

**Cause:** The `X-API-Key` header is missing or contains an invalid key.

**Solution:**

1. Check that you're passing the header: `X-API-Key: YOUR_KEY`
2. Verify your key at [datalab.to/app/keys](https://www.datalab.to/app/keys)
3. If using the SDK, set `DATALAB_API_KEY` environment variable or pass `api_key` to the client
4. If the key was recently created, wait a few seconds and retry

### "You need an active, paid subscription"

```json theme={null}
{"detail": "You need an active, paid subscription to use this API."}
```

**Status:** 403 Forbidden

**Cause:** Your team does not have an active subscription or has exhausted free credits.

**Solution:**

1. Sign up for a plan at [datalab.to/pricing](https://www.datalab.to/pricing)
2. If you recently signed up, check that payment was processed successfully
3. Contact [support@datalab.to](mailto:support@datalab.to) if you believe this is in error

### "Your subscription has expired"

```json theme={null}
{"detail": "Your subscription has expired. You may need to re-enable your plan, or pay an unpaid invoice."}
```

**Status:** 403 Forbidden

**Solution:** Check your billing dashboard for unpaid invoices and update your payment method if needed.

### "Your payment has failed"

```json theme={null}
{"detail": "Your payment has failed. Please pay any unpaid invoices to continue using the API."}
```

**Status:** 403 Forbidden

**Solution:** Update your payment method and pay any outstanding invoices at [datalab.to/app/billing](https://www.datalab.to/app/billing).

***

## Rate Limiting

### "Rate limit exceeded"

```json theme={null}
{"detail": "Rate limit exceeded for endpoint /api/v1/convert. You can make 200 requests every 60 seconds. Please try again later, or reach out to support@datalab.to if you need a higher limit."}
```

**Status:** 429 Too Many Requests

**Cause:** You've exceeded the request rate limit for your plan.

**Solution:**

* Wait and retry with exponential backoff (the SDK does this automatically)
* Reduce request frequency or spread requests over time
* For higher limits, contact [support@datalab.to](mailto:support@datalab.to)

```python theme={null}
# The SDK handles retries automatically with exponential backoff.
# For the REST API, implement retry logic:
import time

import requests

def request_with_retry(url, headers, max_retries=5):
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers)
        if response.status_code == 429:
            wait = min(2 ** attempt * 5, 120)
            time.sleep(wait)
            continue
        return response
    raise Exception("Max retries exceeded")
```

### "Concurrency exceeded"

```json theme={null}
{"detail": "Concurrency exceeded for endpoint /api/v1/convert. You can have 400 concurrent requests running at once. Please try again later, or reach out to support@datalab.to if you need a higher limit."}
```

**Status:** 429 Too Many Requests

**Cause:** Too many requests are being processed simultaneously.

**Solution:** Queue your requests and limit the number of concurrent submissions. See [Batch Processing](/docs/recipes/conversion/batch-documents) for patterns.
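
A minimal sketch of capping concurrent submissions on the client side with a thread pool. `submit_fn` is a hypothetical callable that submits one document and returns when that request has finished; the default cap of 10 is arbitrary and should sit well below your account's concurrency limit.

```python
from concurrent.futures import ThreadPoolExecutor

def submit_all(paths, submit_fn, max_concurrent=10):
    """Run submit_fn over paths with at most max_concurrent calls in flight."""
    # The pool size is the concurrency cap: extra work waits in the pool's queue
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # map() preserves input order and re-raises the first exception, if any
        return list(pool.map(submit_fn, paths))
```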

### "Page rate limit exceeded"

```json theme={null}
{"success": false, "error": "Page rate limit exceeded. Your team has {current_pages} pages in flight and this request adds {page_count} more ({total} total, limit: 5,000). Please wait for some requests to complete before submitting more, or contact support@datalab.to for a higher limit."}
```

**Status:** Not an HTTP error — returned in the result payload with `success: false`

**Cause:** Your team has too many pages being processed concurrently across all requests. The default limit is 5,000 concurrent pages.

**Solution:**

* Wait for in-flight requests to complete before submitting more documents
* If you're polling for results, back off when you see this error and retry after some results return
* If you're using webhooks, wait for completion notifications before submitting more
* For a higher page limit, contact [support@datalab.to](mailto:support@datalab.to)

<Warning>
  This limit is **not** enforced at submission time. Your request will be accepted, but the result will come back with `success: false`. Always check the `success` field when retrieving results.
</Warning>
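
As a sketch of checking the `success` field, the helper below classifies a finished result and singles out the page rate limit case so the caller can back off and resubmit. It assumes the result is a dict shaped like the example above; matching on the message prefix is an assumption based on that example, and the `check_result` name is hypothetical.

```python
def check_result(result):
    """Classify a finished result as "ok", "retry_later", or "failed"."""
    if result.get("success", False):
        return "ok"
    error = str(result.get("error") or "")
    # Matching on the message prefix is an assumption based on the example above
    if error.startswith("Page rate limit exceeded"):
        return "retry_later"
    return "failed"
```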

***

## File Errors

### "Invalid file type"

```json theme={null}
{"detail": "Invalid file type. Only PDF files, word documents, spreadsheets, powerpoints, HTML, and PNG, JPG, GIF, TIFF, and WEBP images are accepted."}
```

**Status:** 400 Bad Request

**Cause:** The uploaded file's content type is not supported.

**Solution:**

1. Check [Supported File Types](/docs/common/supportedfiletypes) for the full list
2. Ensure the file extension matches the actual content type
3. If uploading via cURL, the content type may be auto-detected from the extension

### "File size exceeds upload limit"

```json theme={null}
{"detail": "File size exceeds upload limit of 209715200 bytes."}
```

**Status:** 400 Bad Request

**Cause:** The file exceeds the 200 MB size limit.

**Solution:**

* Split large PDFs into smaller files using page ranges
* Use the `page_range` parameter to process specific pages
* See [API Limits](/docs/common/limits) for current limits
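
Checking the file size locally avoids uploading a file that will be rejected. A minimal sketch, using the 209715200-byte limit from the error message above; the `check_upload_size` helper is hypothetical, and the limit may differ for your plan.

```python
import os

MAX_UPLOAD_BYTES = 209_715_200  # 200 MB, taken from the error message above

def check_upload_size(path, limit=MAX_UPLOAD_BYTES):
    """Raise ValueError before uploading if the file is over the size limit."""
    size = os.path.getsize(path)
    if size > limit:
        raise ValueError(f"{path} is {size} bytes; the upload limit is {limit} bytes")
    return size
```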

### "File too large"

```json theme={null}
{"detail": "File too large. Maximum size: 200MB"}
```

**Status:** 413 Payload Too Large

**Solution:** Same as above — reduce file size or use page ranges.

***

## Request Errors

### "Request not found"

```json theme={null}
{"detail": "Request not found."}
```

**Status:** 404 Not Found

**Cause:** The request ID doesn't exist or has expired.

**Solution:**

* Results are deleted one hour after processing completes — retrieve them promptly
* Verify the request ID is correct
* Submit a new request if the results have expired

### "This resource has expired"

```json theme={null}
{"detail": "This resource has expired."}
```

**Status:** 404 Not Found

**Cause:** The conversion results have been cleaned up (1 hour after completion).

**Solution:** Submit a new conversion request. Consider using [webhooks](/platform/webhooks) to be notified immediately when results are ready.

### "This request was not made by you"

```json theme={null}
{"detail": "This request was not made by you."}
```

**Status:** 403 Forbidden

**Cause:** You're trying to retrieve results for a request made by a different team.

**Solution:** Ensure you're using the same API key that submitted the original request.

***

## Webhook Issues

### Webhook not firing

**Possible causes:**

1. The webhook URL is not configured. Set it in the [dashboard](https://www.datalab.to/app/settings) or per-request via `webhook_url`
2. Your server is not reachable from Datalab's servers
3. Your server is returning 4xx errors (webhooks are not retried for client errors)

**Debugging steps:**

1. Check your webhook URL is accessible from the internet
2. Verify HTTPS is properly configured (self-signed certificates may cause issues)
3. Check your server logs for incoming requests
4. Use a tool like [webhook.site](https://webhook.site) to test webhook delivery

### Webhook signature verification failing

**Possible causes:**

1. Using the wrong webhook secret
2. Request body is being modified by middleware before verification
3. Encoding issues (verify UTF-8 encoding)

**Solution:** See [Webhook Verification](/platform/webhooks#verifying-webhook-signatures) for the correct verification implementation.
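
The middleware cause is easy to reproduce. The sketch below assumes an HMAC-SHA256 signature computed over the raw request body, as in the Security Best Practices example; the secret and body are placeholders. It shows that parsing and re-serializing the JSON changes the bytes, so a signature computed over the re-serialized body no longer matches.

```python
import hashlib
import hmac
import json

def sign(secret, body):
    """HMAC-SHA256 hex digest of the exact bytes given."""
    return hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()

secret = "example-secret"  # placeholder value for the demonstration
raw_body = b'{"request_id":"abc123"}'  # bytes exactly as received

# Parsing and re-serializing changes whitespace, so the bytes no longer match
reserialized = json.dumps(json.loads(raw_body)).encode()

signature_from_sender = sign(secret, raw_body)
matches_raw = hmac.compare_digest(sign(secret, raw_body), signature_from_sender)
matches_reserialized = hmac.compare_digest(
    sign(secret, reserialized), signature_from_sender
)
```

Here `matches_raw` is true and `matches_reserialized` is false, which is why verification should run on the raw bytes before any framework parsing.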

***

## Processing Issues

### Conversion returns empty or poor results

**Possible causes:**

1. Scanned PDF with no OCR layer — use `mode: "accurate"` for better OCR
2. Very complex layout — try `mode: "accurate"`
3. File is corrupted or password-protected

**Solution:**

* Try a different processing mode (fast → balanced → accurate)
* Check the `parse_quality_score` in the response (0-5 scale) to assess output quality
* For scanned documents, `accurate` mode provides the best OCR
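
The escalation path above can be sketched as a loop that stops at the first mode whose `parse_quality_score` is good enough. `convert_fn` is a hypothetical callable taking `(path, mode)` and returning a dict that includes `parse_quality_score`; the threshold of 3.0 is an arbitrary example and not a recommended value.

```python
MODES = ["fast", "balanced", "accurate"]

def convert_with_escalation(convert_fn, path, min_score=3.0):
    """Try each mode in order until parse_quality_score reaches min_score."""
    result = None
    for mode in MODES:
        result = convert_fn(path, mode)
        score = result.get("parse_quality_score")
        if score is not None and score >= min_score:
            break
    # Falls through with the "accurate" result if no mode met the threshold
    return result
```

Each escalation step is a separate conversion request, so this trades extra requests for quality on difficult documents only.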

### Conversion is slow

**Possible causes:**

1. Using `accurate` mode on large documents
2. Large file with many pages

**Solution:**

* Use `fast` or `balanced` mode for lower latency
* Use `page_range` to process only the pages you need
* Use `max_pages` to limit processing

***

## Server Errors

### "Database error" / "Redis error"

```json theme={null}
{"detail": "Database error"}
```

**Status:** 500 Internal Server Error

**Cause:** Temporary infrastructure issue on Datalab's side.

**Solution:** Wait a moment and retry. If the issue persists, contact [support@datalab.to](mailto:support@datalab.to).

### 529 Service Overloaded

**Cause:** Datalab's servers are temporarily overloaded.

**Solution:** Wait and retry with exponential backoff. The SDK handles this automatically.
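
For REST clients, a rough sketch of a backoff schedule that covers the temporary statuses described on this page (429, 500, and 529). The base and cap mirror the retry example in the Rate Limiting section; the random jitter is an addition that is not described in the source, and both helper names are hypothetical.

```python
import random

RETRYABLE_STATUSES = {429, 500, 529}

def is_retryable(status_code):
    """Statuses this page describes as temporary and worth retrying."""
    return status_code in RETRYABLE_STATUSES

def backoff_delays(max_retries=5, base=5.0, cap=120.0, rng=random.random):
    """Yield one delay per retry: exponential growth, capped, with full jitter."""
    for attempt in range(max_retries):
        ceiling = min(base * 2 ** attempt, cap)
        # Full jitter: scale the ceiling by a random factor to spread out retries
        yield rng() * ceiling
```

A caller would sleep for each yielded delay between attempts and stop as soon as a response is not retryable.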

***

## Next Steps

<CardGroup>
  <Card title="Error Codes" icon="circle-exclamation" href="/platform/errors">
    Complete HTTP error code reference
  </Card>

  <Card title="API Limits" icon="gauge" href="/docs/common/limits">
    Rate limits and file size limits
  </Card>

  <Card title="Webhooks" icon="bell" href="/platform/webhooks">
    Set up webhook notifications
  </Card>

  <Card title="SDK Reference" icon="code" href="/docs/welcome/sdk">
    SDK with built-in error handling and retries
  </Card>
</CardGroup>


# Version Policies
Source: https://documentation.datalab.to/platform/versioning


Datalab is designed to be enterprise-ready.

This means that we will not introduce breaking changes within a top-level version,
and that we will provide a clear upgrade path when versions change.

## API Policy

For any given API version, we will preserve:

* Existing input parameters
* Existing output parameters

However, we may do the following:

* Add additional optional inputs
* Add additional values to the output
* Change conditions for specific error types
* Add new variants to enum-like output values (for example, streaming event types)
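
Because outputs may gain new fields and enum-like values within a version, clients should parse responses tolerantly. The sketch below shows one way to do that; the `ConvertResult` fields are a hypothetical subset, and the event type names are invented for illustration.

```python
from dataclasses import dataclass, fields

KNOWN_EVENT_TYPES = {"started", "completed"}  # hypothetical values for illustration

@dataclass
class ConvertResult:
    # Hypothetical subset of response fields; real responses may carry more
    request_id: str
    success: bool = False

def parse_result(payload):
    """Build a ConvertResult from a dict, ignoring keys this client does not know."""
    known = {f.name for f in fields(ConvertResult)}
    return ConvertResult(**{k: v for k, v in payload.items() if k in known})

def handle_event(event_type):
    """Return a handler name, falling back gracefully on unknown event types."""
    if event_type in KNOWN_EVENT_TYPES:
        return f"handle_{event_type}"
    return "ignore_unknown"
```

A client written this way keeps working when a new output field or event type appears, instead of failing on the unexpected value.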

## SDK Policy

The SDK is built on top of the API, so it will follow the same versioning principles.

## Model Output

We frequently update our models to improve accuracy and performance. This can
introduce subtle changes in outputs. At this time, we do not support version
pinning outside of enterprise plans. Contact us at
[support@datalab.to](mailto:support@datalab.to) for information.

## Next Steps

<CardGroup>
  <Card title="Changelog" icon="clock-rotate-left" href="/platform/changelog">
    See the latest updates and changes to the Datalab platform.
  </Card>

  <Card title="SDK Reference" icon="code" href="/docs/welcome/sdk">
    Full Python SDK documentation with typed clients and async support.
  </Card>

  <Card title="API Reference" icon="book" href="/docs/welcome/api">
    REST API reference for document conversion, form filling, and file management.
  </Card>

  <Card title="Error Codes" icon="circle-exclamation" href="/platform/errors">
    Understand HTTP error codes and subscription errors.
  </Card>
</CardGroup>


# Webhooks
Source: https://documentation.datalab.to/platform/webhooks


Webhooks provide real-time notifications when your document processing jobs complete, eliminating the need for continuous polling. This event-driven approach improves efficiency and reduces unnecessary API calls.

## Setting Up Webhooks

1. Navigate to the Settings Panel
2. Locate the "Webhooks" section
3. Enter your webhook endpoint URL with an optional secret

We currently only support a single webhook per account.

<Warning>
  **Webhook reliability notes:**

  * Webhooks are retried on 5xx errors and timeouts, but **not** on 4xx errors
  * Always implement idempotent webhook handlers using `request_id` to deduplicate
  * Set a reasonable server timeout — Datalab waits up to 30 seconds for your endpoint to respond
</Warning>
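
To stay well inside the 30-second window, a handler can acknowledge the webhook immediately and do the real work in the background. The sketch below uses a standard-library queue and a worker thread; the `on_webhook` and `worker` names are hypothetical, and a production setup would more likely use a task queue.

```python
import queue
import threading

jobs = queue.Queue()
handled = []  # stands in for real processing results

def worker():
    """Process queued webhook payloads outside the request/response cycle."""
    while True:
        payload = jobs.get()
        if payload is None:  # sentinel: stop the worker
            jobs.task_done()
            break
        handled.append(payload["request_id"])  # replace with real work
        jobs.task_done()

def on_webhook(payload):
    """Enqueue the payload and acknowledge immediately."""
    jobs.put(payload)
    return {"status": "accepted"}

threading.Thread(target=worker, daemon=True).start()
```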

### Per-Request Webhook Override

You can override the default webhook URL for specific API requests by including the `webhook_url` parameter:

```python theme={null}
import os
import requests

url = "https://www.datalab.to/api/v1/convert"
headers = {"X-API-Key": os.environ["DATALAB_API_KEY"]}

# Open the file in a context manager so the handle is closed after the upload
with open('document.pdf', 'rb') as f:
    form_data = {
        'file': ('document.pdf', f, 'application/pdf'),
        'output_format': (None, 'markdown'),
        'webhook_url': (None, 'https://your-custom-webhook.com/endpoint')
    }
    response = requests.post(url, files=form_data, headers=headers)
```

This is useful when:

* Different projects need different webhook endpoints
* You want to route notifications to specific services
* Testing webhook integrations without changing account settings

The per-request webhook URL will be used instead of your account's default webhook URL for that specific request only.

## Webhook Payload

When a webhook is triggered, Datalab sends a POST request to your configured endpoint with a JSON payload containing the following fields:

```json theme={null}
{
  "request_id": "abc123",
  "request_check_url": "https://api.datalab.to/api/v1/convert/abc123",
  "webhook_secret": "your_configured_secret"
}
```

| Field               | Description                                                |
| ------------------- | ---------------------------------------------------------- |
| `request_id`        | The unique identifier for the processing request           |
| `request_check_url` | URL to retrieve the full results of the processed document |
| `webhook_secret`    | Your configured webhook secret (if set)                    |
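
A small sketch of parsing this payload into a typed object, using only the three fields listed in the table. The `WebhookPayload` class and `parse_webhook` helper are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class WebhookPayload:
    request_id: str
    request_check_url: str
    webhook_secret: Optional[str] = None  # only present if a secret is configured

def parse_webhook(body):
    """Parse a raw webhook body (bytes or str) into a WebhookPayload."""
    data = json.loads(body)
    return WebhookPayload(
        request_id=data["request_id"],
        request_check_url=data["request_check_url"],
        webhook_secret=data.get("webhook_secret"),
    )
```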

## Webhook Secret Verification

The webhook secret is included in the JSON request body, allowing you to verify that incoming webhooks are authentic requests from Datalab.

### Verifying Webhooks on Your Server

Here's an example of how to verify the webhook secret in your receiving endpoint:

```python theme={null}
from fastapi import FastAPI, Request, HTTPException
import hmac
import os

app = FastAPI()

@app.post("/my-webhook")
async def receive_webhook(request: Request):
    data = await request.json()

    # Verify the webhook secret
    expected_secret = os.environ["DATALAB_WEBHOOK_SECRET"]
    received_secret = str(data.get("webhook_secret") or "")

    # Compare in constant time; a missing secret is treated as a mismatch
    if not hmac.compare_digest(received_secret.encode(), expected_secret.encode()):
        raise HTTPException(status_code=401, detail="Invalid webhook secret")

    # Process the webhook
    request_id = data["request_id"]
    check_url = data["request_check_url"]

    # Fetch the full results using check_url...
```

<Warning>
  The webhook secret is transmitted in plaintext within the request body. Ensure your webhook endpoint uses HTTPS to encrypt the data in transit. Avoid logging the full request body in production to prevent secret exposure.
</Warning>
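
If you do need to log webhook payloads for debugging, redact the secret first. A minimal sketch; the `redact` helper is a hypothetical name, and the field list covers only the `webhook_secret` field documented above.

```python
def redact(payload, secret_fields=("webhook_secret",)):
    """Return a copy of the payload that is safe to log."""
    return {
        key: "[REDACTED]" if key in secret_fields else value
        for key, value in payload.items()
    }
```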

## Troubleshooting

**Not Receiving Events**

If your webhook is not receiving events, try the following:

* Verify that the URL is publicly accessible
* Validate that the webhook secret matches
* Check your server logs for 4xx errors (authentication, invalid endpoint, etc.)
* Ensure your endpoint responds within 30 seconds

**Duplicate Events**

We may deliver the same event to a webhook endpoint more than once. To handle this, we
recommend implementing idempotency checks so that each event is processed only once.

## Coming Soon

* Project-specific webhooks