Workflows allow you to chain together multiple document processing steps to create powerful, automated pipelines. Process single or multiple documents in parallel, extract structured data, and build complex document understanding pipelines—all with a single API call.

What Are Workflows?

A workflow is a reusable template that defines a series of document processing steps. Each workflow:
  • Consists of multiple steps that execute in order based on dependencies
  • Can process single or multiple documents in parallel
  • Passes data between steps automatically
  • Handles errors gracefully with per-file isolation
Think of workflows as recipes: define once, execute many times with different documents.

Key Concepts

Steps

A step is a single operation in your workflow (parse, extract, segment). Steps are defined with:
  • step_key: The type of operation (marker_parse, marker_extract, etc.)
  • unique_name: A unique identifier for this step within the workflow
  • settings: Configuration specific to the step type
  • depends_on: List of step names that must complete before this step runs

Dependencies

Steps execute in order based on their depends_on array. Multiple steps can depend on the same parent and will execute in parallel once dependencies are satisfied.
{
  "steps": [
    {
      "unique_name": "parse",
      "step_key": "marker_parse"
    },
    {
      "unique_name": "extract",
      "step_key": "marker_extract",
      "depends_on": ["parse"]  // Waits for parse
    },
    {
      "unique_name": "segment",
      "step_key": "marker_segment",
      "depends_on": ["parse"]  // Also waits for parse, runs in parallel with extract
    }
  ]
}

Execution Flow

  1. Create workflow: Define the template once
  2. Execute workflow: Run it with specific files (single or multiple)
  3. Track progress: Poll execution status
  4. Get results: Retrieve structured outputs organized by step and file
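For example, the full lifecycle with curl looks roughly like the sketch below. The create and execute endpoint paths, the name field, and the files form field are assumptions for illustration (only the step-types and executions endpoints shown on this page are confirmed); check the API reference for the exact request shapes.
# 1. Create the workflow template once (endpoint path assumed)
curl -X POST https://www.datalab.to/api/v1/workflows \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "parse_and_extract",
    "steps": [
      {"unique_name": "parse", "step_key": "marker_parse"},
      {"unique_name": "extract", "step_key": "marker_extract", "depends_on": ["parse"]}
    ]
  }'

# 2. Execute it with one or more files (endpoint and form field assumed)
curl -X POST https://www.datalab.to/api/v1/workflows/{workflow_id}/executions \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "files=@invoice.pdf"

# 3. Poll for status and results (endpoint shown later on this page)
curl -X GET https://www.datalab.to/api/v1/workflows/executions/{execution_id} \
  -H "X-API-Key: YOUR_API_KEY"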

Available Step Types

You can get a complete list of available step types programmatically:
curl -X GET https://www.datalab.to/api/v1/workflows/step-types \
  -H "X-API-Key: YOUR_API_KEY"
Response:
{
  "step_types": [
    {
      "step_type": "marker_parse",
      "name": "Marker PDF Parse",
      "description": "Parse PDF documents using Marker. Automatically saves checkpoint if downstream steps need it.",
      "settings_schema": {
        "type": "object",
        "properties": {
          "max_pages": {"type": "integer"},
          "page_range": {"type": "string"},
          "config": {"type": "object"}
        }
      }
    },
    ...
  ]
}

marker_parse

Parse PDF documents into markdown format. Automatically saves a checkpoint when followed by extraction or segmentation steps.
Common Settings:
{
  "max_pages": 10,              // Maximum pages to process
  "page_range": "0-5,10",       // Specific pages (optional)
  "config": {
    "force_ocr": false,         // Force OCR on all pages
    "high_accuracy_mode": false // Use slower, more accurate parsing
  }
}
Output:
  • checkpoint_id: Reference for downstream extraction/segmentation
  • lookup_key: Request ID for checking parse results
  • result: Full parse output including markdown content
Use Cases:
  • Convert PDFs to markdown for downstream processing
  • Create checkpoints for multiple extraction attempts
  • Parse documents with different quality settings
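For instance, a parse step inside a workflow definition nests these options under its settings key (values are illustrative):
{
  "unique_name": "parse",
  "step_key": "marker_parse",
  "settings": {
    "max_pages": 10,
    "config": {
      "force_ocr": false,
      "high_accuracy_mode": true
    }
  }
}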

marker_extract

Extract structured data from parsed documents using LLM-powered extraction.
Common Settings:
{
  "page_schema": {
    "title": {"type": "string"},
    "date": {"type": "string"},
    "total": {"type": "number"}
  },
  "page_range": "0-10"  // Optional: pages to extract from
}
Output:
  • Structured data matching your schema
  • lookup_key: Request ID for checking extraction results
Automatic Checkpoint Detection: Automatically uses the checkpoint_id from the most recent parse step in the dependency chain.
Use Cases:
  • Extract invoice data (amounts, dates, line items)
  • Pull document metadata (title, author, date)
  • Parse forms and structured documents
  • Extract table data into JSON
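Putting it together, a minimal parse-then-extract workflow defines its steps like this (schema fields are illustrative; note that page_schema sits under the extract step's settings):
{
  "steps": [
    {
      "unique_name": "parse",
      "step_key": "marker_parse"
    },
    {
      "unique_name": "extract_invoice",
      "step_key": "marker_extract",
      "depends_on": ["parse"],
      "settings": {
        "page_schema": {
          "invoice_number": {"type": "string"},
          "date": {"type": "string"},
          "total": {"type": "number"}
        }
      }
    }
  ]
}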

marker_segment

Segment documents into logical sections using LLM-powered detection.
Common Settings:
{
  "segmentation_schema": {
    "Introduction": "The opening section of the document",
    "Methods": "Description of methodology",
    "Results": "Findings and analysis"
  }
}
Output:
  • Identified segments with page ranges
  • Segment metadata and content
Use Cases:
  • Split research papers by section
  • Identify document structure (header, body, footer)
  • Find specific sections in long documents
  • Break reports into chapters
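As with extraction, a segment step takes its schema through settings and depends on a parse step (section names are illustrative):
{
  "unique_name": "segment_paper",
  "step_key": "marker_segment",
  "depends_on": ["parse"],
  "settings": {
    "segmentation_schema": {
      "Introduction": "The opening section of the document",
      "Results": "Findings and analysis"
    }
  }
}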

conditional

Make routing decisions based on step outputs or document properties. Enable different downstream steps based on conditions.
Common Settings:
{
  "conditions": [
    {
      "left": "{{previous_step.field_name}}",
      "operator": ">=",
      "right": 4.0
    }
  ],
  "logic": "AND",
  "routes": {
    "true": {
      "enable_steps": ["step_a", "step_b"]
    },
    "false": {
      "enable_steps": ["step_c"]
    }
  }
}
Operators: >=, <=, =, !=, >, <
Logic: AND, OR
Output:
  • condition_result: Boolean result of the evaluation
  • enabled_steps: Array of steps enabled by this route
Use Cases:
  • Re-parse low-quality documents with OCR
  • Route invoices above a threshold to detailed extraction
  • Skip extraction for empty pages
  • Apply different processing based on document type
Learn More: See the Conditional Routing guide for detailed examples.

await_parse_quality

Wait for parse quality scoring to complete. Used before conditional routing based on quality.
Common Settings:
{
  "max_wait_seconds": 120,
  "poll_interval_seconds": 10
}
Output:
  • parse_quality_score: Quality score from 0-5
  • quality_metadata: Additional quality information (OCR detection, page count)
Note: Must follow a marker_parse step. Quality scores are typically available within 30-60 seconds.
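A typical quality-gated chain pairs this step with a conditional. The sketch below assumes the conditional can reference the quality score through the template syntax shown above, and that the enabled step names ("extract", "reparse_with_ocr") match steps defined elsewhere in the workflow:
{
  "steps": [
    {"unique_name": "parse", "step_key": "marker_parse"},
    {
      "unique_name": "await_quality",
      "step_key": "await_parse_quality",
      "depends_on": ["parse"],
      "settings": {"max_wait_seconds": 120, "poll_interval_seconds": 10}
    },
    {
      "unique_name": "check_quality",
      "step_key": "conditional",
      "depends_on": ["await_quality"],
      "settings": {
        "conditions": [
          {"left": "{{await_quality.parse_quality_score}}", "operator": ">=", "right": 4.0}
        ],
        "logic": "AND",
        "routes": {
          "true": {"enable_steps": ["extract"]},
          "false": {"enable_steps": ["reparse_with_ocr"]}
        }
      }
    }
  ]
}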

Multi-File Processing

Workflows automatically handle multiple files efficiently:

Parallel Execution

When you provide multiple files:
  1. Each file is processed independently through all single_input steps
  2. Files execute in parallel (not sequentially)
  3. Results are organized by file ID
  4. If one file fails, others continue processing

Performance

  • Total execution time ≈ time for slowest file (not sum of all files)
  • No hard limit on the number of files (subject to your plan's limits)
  • Each file counts as a separate request for rate limiting

Example Flow

Input: 3 files (A, B, C)
Steps: parse → extract

Execution:
  File A: parse → extract  |
  File B: parse → extract  | All in parallel
  File C: parse → extract  |

Results:
{
  "parse": {
    "file_A": {...},
    "file_B": {...},
    "file_C": {...}
  },
  "extract": {
    "file_A": {...},
    "file_B": {...},
    "file_C": {...}
  }
}
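To run a flow like this over several files, pass them all in one execution request. As in the earlier sketch, the endpoint path and the repeated files form field are assumptions to verify against the API reference:
curl -X POST https://www.datalab.to/api/v1/workflows/{workflow_id}/executions \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "files=@file_A.pdf" \
  -F "files=@file_B.pdf" \
  -F "files=@file_C.pdf"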

Execution Lifecycle

Status Progression

  1. PENDING: Execution created, queued to start
  2. IN_PROGRESS: Steps are actively running
  3. COMPLETED: All steps finished successfully
  4. FAILED: One or more critical errors occurred

Tracking Progress

Poll the execution endpoint to track real-time progress:
curl -X GET https://www.datalab.to/api/v1/workflows/executions/{execution_id} \
  -H "X-API-Key: YOUR_API_KEY"
The response includes:
  • Current status
  • Completed steps and their outputs
  • In-progress steps
  • Any errors encountered
Webhook support: Coming soon for event-driven updates.
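Until webhooks land, a simple polling loop does the job. This sketch assumes the execution response exposes its status in a top-level status field; adjust the jq path to the actual response shape:
while true; do
  STATUS=$(curl -s https://www.datalab.to/api/v1/workflows/executions/$EXECUTION_ID \
    -H "X-API-Key: YOUR_API_KEY" | jq -r '.status')
  echo "Execution status: $STATUS"
  if [ "$STATUS" = "COMPLETED" ] || [ "$STATUS" = "FAILED" ]; then
    break
  fi
  sleep 10
done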

Error Handling

Per-File Isolation

In multi-file workflows, errors are isolated:
  • File A fails → Files B and C continue processing
  • Check individual file results in step_outputs
  • Execution status reflects overall state

Error Responses

Failed steps include error details:
{
  "step_outputs": {
    "extract": {
      "file_abc123": {
        "error": "Extraction timeout",
        "error_code": "TIMEOUT"
      },
      "file_def456": {
        "title": "Success",
        "amount": 1250.00
      }
    }
  }
}
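To separate failures from successes across files, filter step_outputs for entries that carry an error key. This jq sketch assumes step_outputs appears at the top level of the execution response, as in the example above, and prints the file IDs whose extract step failed:
curl -s https://www.datalab.to/api/v1/workflows/executions/{execution_id} \
  -H "X-API-Key: YOUR_API_KEY" \
  | jq '.step_outputs.extract | to_entries | map(select(.value.error != null)) | map(.key)'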

Billing

Workflows are currently in Beta; until they hit public release, there is no added cost for Workflows themselves. You will still pay for the underlying Marker API requests in line with our API billing.

Try it out

Sign up for Datalab and try out Marker - it’s free, and we’ll include credits. If you need a self-hosted solution, you can directly purchase an on-prem license, no crazy sales process needed, or reach out for custom enterprise quotes / contracts. As always, write to us at support@datalab.to if you want credits or have any specific questions / requests!