Workflows allow you to chain together multiple document processing steps to create powerful, automated pipelines. Process single or multiple documents in parallel, extract structured data, and build complex document understanding pipelines—all with a single API call.

What Are Workflows?

A workflow is a reusable template that defines a series of document processing steps. Each workflow:
  • Consists of multiple steps that execute in order based on dependencies
  • Can process single or multiple documents in parallel
  • Passes data between steps automatically
  • Handles errors gracefully with per-file isolation
Think of workflows as recipes: define once, execute many times with different documents.

Key Concepts

Steps

A step is a single operation in your workflow (parse, extract, segment). Steps are defined with:
  • step_key: The type of operation (marker_parse, marker_extract, etc.)
  • unique_name: A unique identifier for this step within the workflow
  • settings: Configuration specific to the step type
  • depends_on: List of step names that must complete before this step runs

Dependencies

Steps execute in order based on their depends_on array. Multiple steps can depend on the same parent and will execute in parallel once dependencies are satisfied.
{
  "steps": [
    {
      "unique_name": "parse",
      "step_key": "marker_parse"
    },
    {
      "unique_name": "extract",
      "step_key": "marker_extract",
      "depends_on": ["parse"]  // Waits for parse
    },
    {
      "unique_name": "segment",
      "step_key": "marker_segment",
      "depends_on": ["parse"]  // Also waits for parse, runs in parallel with extract
    }
  ]
}

Execution Flow

  1. Create workflow: Define the template once
  2. Execute workflow: Run it with specific files (single or multiple)
  3. Track progress: Poll execution status
  4. Get results: Retrieve structured outputs organized by step and file
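For example, the full lifecycle with curl looks roughly like the sketch below. The create and execute endpoint paths, the name field, and the files form field are assumptions for illustration (only the step-types and executions endpoints shown on this page are confirmed); check the API reference for the exact request shapes.
# 1. Create the workflow template once (endpoint path assumed)
curl -X POST https://www.datalab.to/api/v1/workflows \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "parse_and_extract",
    "steps": [
      {"unique_name": "parse", "step_key": "marker_parse"},
      {"unique_name": "extract", "step_key": "marker_extract", "depends_on": ["parse"]}
    ]
  }'

# 2. Execute it with one or more files (endpoint and form field assumed)
curl -X POST https://www.datalab.to/api/v1/workflows/{workflow_id}/executions \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "files=@invoice.pdf"

# 3. Poll for status and results (endpoint shown later on this page)
curl -X GET https://www.datalab.to/api/v1/workflows/executions/{execution_id} \
  -H "X-API-Key: YOUR_API_KEY"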

Available Step Types

You can get a complete list of available step types programmatically:
curl -X GET https://www.datalab.to/api/v1/workflows/step-types \
  -H "X-API-Key: YOUR_API_KEY"
Response:
{
  "step_types": [
    {
      "step_type": "marker_parse",
      "name": "Marker PDF Parse",
      "description": "Parse PDF documents using Marker. Automatically saves checkpoint if downstream steps need it.",
      "settings_schema": {
        "type": "object",
        "properties": {
          "max_pages": {"type": "integer"},
          "page_range": {"type": "string"},
          "config": {"type": "object"}
        }
      }
    },
    ...
  ]
}

marker_parse

Parse PDF documents into markdown format. Automatically saves a checkpoint when followed by extraction or segmentation steps.
Common Settings:
{
  "max_pages": 10,              // Maximum pages to process
  "page_range": "0-5,10",       // Specific pages (optional)
  "config": {
    "force_ocr": false,         // Force OCR on all pages
    "high_accuracy_mode": false // Use slower, more accurate parsing
  }
}
Output:
  • checkpoint_id: Reference for downstream extraction/segmentation
  • lookup_key: Request ID for checking parse results
  • result: Full parse output including markdown content
Use Cases:
  • Convert PDFs to markdown for downstream processing
  • Create checkpoints for multiple extraction attempts
  • Parse documents with different quality settings
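For instance, a parse step inside a workflow definition nests these options under its settings key (values are illustrative):
{
  "unique_name": "parse",
  "step_key": "marker_parse",
  "settings": {
    "max_pages": 10,
    "config": {
      "force_ocr": false,
      "high_accuracy_mode": true
    }
  }
}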

marker_extract

Extract structured data from parsed documents using LLM-powered extraction.
Common Settings:
{
  "page_schema": {
    "title": {"type": "string"},
    "date": {"type": "string"},
    "total": {"type": "number"}
  },
  "page_range": "0-10"  // Optional: pages to extract from
}
Output:
  • Structured data matching your schema
  • lookup_key: Request ID for checking extraction results
Automatic Checkpoint Detection: Automatically uses the checkpoint_id from the most recent parse step in the dependency chain.
Use Cases:
  • Extract invoice data (amounts, dates, line items)
  • Pull document metadata (title, author, date)
  • Parse forms and structured documents
  • Extract table data into JSON
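Putting it together, a minimal parse-then-extract workflow defines its steps like this (schema fields are illustrative; note that page_schema sits under the extract step's settings):
{
  "steps": [
    {
      "unique_name": "parse",
      "step_key": "marker_parse"
    },
    {
      "unique_name": "extract_invoice",
      "step_key": "marker_extract",
      "depends_on": ["parse"],
      "settings": {
        "page_schema": {
          "invoice_number": {"type": "string"},
          "date": {"type": "string"},
          "total": {"type": "number"}
        }
      }
    }
  ]
}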

marker_segment

Segment documents into logical sections using LLM-powered detection.
Common Settings:
{
  "segmentation_schema": {
    "Introduction": "The opening section of the document",
    "Methods": "Description of methodology",
    "Results": "Findings and analysis"
  }
}
Output:
  • Identified segments with page ranges
  • Segment metadata and content
Use Cases:
  • Split research papers by section
  • Identify document structure (header, body, footer)
  • Find specific sections in long documents
  • Break reports into chapters
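As with extraction, a segment step takes its schema through settings and depends on a parse step (section names are illustrative):
{
  "unique_name": "segment_paper",
  "step_key": "marker_segment",
  "depends_on": ["parse"],
  "settings": {
    "segmentation_schema": {
      "Introduction": "The opening section of the document",
      "Results": "Findings and analysis"
    }
  }
}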

conditional

Make routing decisions based on step outputs or document properties. Enable different downstream steps based on conditions.
Common Settings:
{
  "conditions": [
    {
      "left": "{{previous_step.field_name}}",
      "operator": ">=",
      "right": 4.0
    }
  ],
  "logic": "AND",
  "routes": {
    "true": {
      "enable_steps": ["step_a", "step_b"]
    },
    "false": {
      "enable_steps": ["step_c"]
    }
  }
}
Operators: >=, <=, =, !=, >, <
Logic: AND, OR
Output:
  • condition_result: Boolean result of the evaluation
  • enabled_steps: Array of steps enabled by this route
Use Cases:
  • Re-parse low-quality documents with OCR
  • Route invoices above a threshold to detailed extraction
  • Skip extraction for empty pages
  • Apply different processing based on document type
Learn More: See the Conditional Routing guide for detailed examples.

await_parse_quality

Wait for parse quality scoring to complete. Used before conditional routing based on quality.
Common Settings:
{
  "max_wait_seconds": 120,
  "poll_interval_seconds": 10
}
Output:
  • parse_quality_score: Quality score from 0-5
  • quality_metadata: Additional quality information (OCR detection, page count)
Note: Must follow a marker_parse step. Quality scores are typically available within 30-60 seconds.
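A typical quality-gated chain pairs this step with a conditional. The sketch below assumes the conditional can reference the quality score through the template syntax shown above, and that the enabled step names ("extract", "reparse_with_ocr") match steps defined elsewhere in the workflow:
{
  "steps": [
    {"unique_name": "parse", "step_key": "marker_parse"},
    {
      "unique_name": "await_quality",
      "step_key": "await_parse_quality",
      "depends_on": ["parse"],
      "settings": {"max_wait_seconds": 120, "poll_interval_seconds": 10}
    },
    {
      "unique_name": "check_quality",
      "step_key": "conditional",
      "depends_on": ["await_quality"],
      "settings": {
        "conditions": [
          {"left": "{{await_quality.parse_quality_score}}", "operator": ">=", "right": 4.0}
        ],
        "logic": "AND",
        "routes": {
          "true": {"enable_steps": ["extract"]},
          "false": {"enable_steps": ["reparse_with_ocr"]}
        }
      }
    }
  ]
}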

Multi-File Processing

Workflows automatically handle multiple files efficiently:

Parallel Execution

When you provide multiple files:
  1. Each file is processed independently through all single_input steps
  2. Files execute in parallel (not sequentially)
  3. Results are organized by file ID
  4. If one file fails, others continue processing

Performance

  • Total execution time ≈ time for slowest file (not sum of all files)
  • No hard limit on the number of files (subject to your plan's limits)
  • Each file counts as a separate request for rate limiting

Example Flow

Input: 3 files (A, B, C)
Steps: parse → extract

Execution:
  File A: parse → extract  |
  File B: parse → extract  | All in parallel
  File C: parse → extract  |

Results:
{
  "parse": {
    "file_A": {...},
    "file_B": {...},
    "file_C": {...}
  },
  "extract": {
    "file_A": {...},
    "file_B": {...},
    "file_C": {...}
  }
}
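To run a flow like this over several files, pass them all in one execution request. As in the earlier sketch, the endpoint path and the repeated files form field are assumptions to verify against the API reference:
curl -X POST https://www.datalab.to/api/v1/workflows/{workflow_id}/executions \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "files=@file_A.pdf" \
  -F "files=@file_B.pdf" \
  -F "files=@file_C.pdf"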

Execution Lifecycle

Status Progression

  1. PENDING: Execution created, queued to start
  2. IN_PROGRESS: Steps are actively running
  3. COMPLETED: All steps finished successfully
  4. FAILED: One or more critical errors occurred

Tracking Progress

Poll the execution endpoint to track real-time progress:
curl -X GET https://www.datalab.to/api/v1/workflows/executions/{execution_id} \
  -H "X-API-Key: YOUR_API_KEY"
The response includes:
  • Current status
  • Completed steps and their outputs
  • In-progress steps
  • Any errors encountered
Webhook support: Coming soon for event-driven updates.
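Until webhooks land, a simple polling loop does the job. This sketch assumes the execution response exposes its status in a top-level status field; adjust the jq path to the actual response shape:
while true; do
  STATUS=$(curl -s https://www.datalab.to/api/v1/workflows/executions/$EXECUTION_ID \
    -H "X-API-Key: YOUR_API_KEY" | jq -r '.status')
  echo "Execution status: $STATUS"
  if [ "$STATUS" = "COMPLETED" ] || [ "$STATUS" = "FAILED" ]; then
    break
  fi
  sleep 10
done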

Error Handling

Per-File Isolation

In multi-file workflows, errors are isolated:
  • File A fails → Files B and C continue processing
  • Check individual file results in step_outputs
  • Execution status reflects overall state

Error Responses

Failed steps include error details:
{
  "step_outputs": {
    "extract": {
      "file_abc123": {
        "error": "Extraction timeout",
        "error_code": "TIMEOUT"
      },
      "file_def456": {
        "title": "Success",
        "amount": 1250.00
      }
    }
  }
}
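To separate failures from successes across files, filter step_outputs for entries that carry an error key. This jq sketch assumes step_outputs appears at the top level of the execution response, as in the example above, and prints the file IDs whose extract step failed:
curl -s https://www.datalab.to/api/v1/workflows/executions/{execution_id} \
  -H "X-API-Key: YOUR_API_KEY" \
  | jq '.step_outputs.extract | to_entries | map(select(.value.error != null)) | map(.key)'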

Billing

Workflows are currently in Beta; until they hit public release, there is no added cost for Workflows themselves. You will still pay for the underlying Marker API requests in line with our API billing.

Try it out

Sign up for Datalab and try out Marker - it’s free, and we’ll include credits. If you need a self-hosted solution, you can directly purchase an on-prem license, no crazy sales process needed, or reach out for custom enterprise quotes / contracts. As always, write to us at support@datalab.to if you want credits or have any specific questions / requests!