Learn how to create and execute workflows with single and multiple files. This tutorial walks through invoice data extraction—one of the most common workflow use cases.

Prerequisites

  • An active Datalab account
  • An API key (get one from your account settings)
  • A workflow-enabled subscription plan
Concepts Covered:
  • Creating a workflow
  • Executing with a single file
  • Executing with multiple files in parallel
  • Checking execution status and retrieving results

Tutorial: Invoice Data Extraction

We’ll build a workflow that:
  1. Parses a PDF invoice
  2. Extracts structured data (invoice number, vendor, amount, line items)
This same workflow works for both single and multiple files.

Create the Workflow

Define your workflow template. This is created once and can be executed many times with different files.
curl -X POST https://www.datalab.to/api/v1/workflows/workflows \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Invoice Data Extraction",
    "steps": [
      {
        "step_key": "marker_parse",
        "unique_name": "parse",
        "settings": {
          "max_pages": 10
        }
      },
      {
        "step_key": "marker_extract",
        "unique_name": "extract",
        "settings": {
          "page_schema": {
            "invoice_number": {
              "type": "string",
              "description": "The invoice number or ID"
            },
            "vendor_name": {
              "type": "string",
              "description": "Name of the company issuing the invoice"
            },
            "total_amount": {
              "type": "number",
              "description": "Total amount due including tax"
            },
            "invoice_date": {
              "type": "string",
              "description": "Date the invoice was issued"
            },
            "line_items": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "description": {"type": "string"},
                  "quantity": {"type": "number"},
                  "unit_price": {"type": "number"},
                  "total": {"type": "number"}
                }
              }
            }
          }
        },
        "depends_on": ["parse"]
      }
    ]
  }'
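The request above links steps by name: the extract step lists "parse" in depends_on, and "parse" is the unique_name of the first step. Before sending a workflow definition, it can help to confirm locally that these names line up. The following is a minimal sketch of such a check; check_dependencies is a hypothetical helper, not part of the Datalab API, and the server may apply its own validation rules.

```python
def check_dependencies(steps):
    """Return a list of problems found in a workflow's step definitions."""
    names = [step["unique_name"] for step in steps]
    problems = []
    # Every unique_name should appear only once.
    for name in sorted(set(names)):
        if names.count(name) > 1:
            problems.append(f"duplicate unique_name: {name}")
    # Every depends_on entry should name a step that exists in the workflow.
    for step in steps:
        for dep in step.get("depends_on", []):
            if dep not in names:
                problems.append(
                    f"{step['unique_name']} depends on undefined step: {dep}"
                )
    return problems


# The two steps from the request above, with the schema omitted for brevity.
steps = [
    {"step_key": "marker_parse", "unique_name": "parse",
     "settings": {"max_pages": 10}},
    {"step_key": "marker_extract", "unique_name": "extract",
     "settings": {}, "depends_on": ["parse"]},
]
print(check_dependencies(steps))
```

An empty list means every depends_on entry points at a defined step.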

Understanding the Structure

Parse Step:
  • step_key: "marker_parse": Uses Marker to parse the PDF
  • unique_name: "parse": Referenced by the extract step
  • max_pages: 10: Processes only the first 10 pages (cost optimization)
Extract Step:
  • step_key: "marker_extract": Extracts structured data
  • unique_name: "extract": Identifies this step in results
  • page_schema: Defines what data to extract
  • depends_on: ["parse"]: Waits for parse to complete
Response:
{
  "workflow_id": 42,
  "name": "Invoice Data Extraction",
  "team_id": 123,
  "created_at": "2024-01-20T10:00:00Z",
  "steps": [...]
}
Save the workflow_id - you’ll use it to execute the workflow.
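For scripts, the same create call can be made from Python. This is a sketch using only the standard library; the endpoint, headers, and workflow_id field come from the example above, while the function names build_create_request and create_workflow are hypothetical. Error handling is left out.

```python
import json
import urllib.request

API_BASE = "https://www.datalab.to/api/v1/workflows"


def build_create_request(api_key, name, steps):
    """Build (but do not send) the POST request that creates a workflow."""
    body = json.dumps({"name": name, "steps": steps}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/workflows",
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def create_workflow(api_key, name, steps):
    """Send the request and return the workflow_id from the JSON response."""
    request = build_create_request(api_key, name, steps)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["workflow_id"]
```

Splitting the request construction from the network call makes the payload easy to inspect or log before anything is sent.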

Execute with a Single File

Now execute your workflow with a single invoice:
curl -X POST https://www.datalab.to/api/v1/workflows/workflows/42/execute \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input_config": {
      "file_urls": [
          "https://www.wmaccess.com/downloads/sample-invoice.pdf"
      ]
    }
  }'
Response:
{
  "execution_id": 101,
  "workflow_id": 42,
  "status": "PENDING",
  "created_at": "2024-01-20T10:05:00Z",
  "temporal_workflow_id": "workflow_execution_101_abc123"
}
Save the execution_id - you’ll use it to check status.
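The execute call translates to Python in the same way. The sketch below assumes the request and response shapes shown above (an input_config with file_urls, and an execution_id in the response); build_execute_request and start_execution are hypothetical names, and no retry or error handling is included.

```python
import json
import urllib.request


def build_execute_request(api_key, workflow_id, file_urls):
    """Build (but do not send) the POST request that starts an execution."""
    payload = {"input_config": {"file_urls": list(file_urls)}}
    url = (
        "https://www.datalab.to/api/v1/workflows/workflows/"
        f"{workflow_id}/execute"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def start_execution(api_key, workflow_id, file_urls):
    """Send the request and return the execution_id from the JSON response."""
    request = build_execute_request(api_key, workflow_id, file_urls)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["execution_id"]
```

Because file_urls is a list, the same helper covers one file or several.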

Check Execution Status

Poll the execution endpoint to track progress:
curl -X GET https://www.datalab.to/api/v1/workflows/executions/101 \
  -H "X-API-Key: YOUR_API_KEY"

While Processing

{
  "execution_id": 101,
  "workflow_id": 42,
  "status": "IN_PROGRESS",
  "files_processed": 1,
  "created_at": "2024-01-20T10:05:00Z",
  "started_at": "2024-01-20T10:05:02Z"
}

When Complete

{
  "execution_id": 101,
  "workflow_id": 42,
  "status": "COMPLETED",
  "files_processed": 1,
  "created_at": "2024-01-20T10:05:00Z",
  "started_at": "2024-01-20T10:05:02Z",
  "completed_at": "2024-01-20T10:06:45Z",
  "step_outputs": {
    "parse": {
      "id": 1,
      "status": "COMPLETED",
      "started_at": "...",
      "finished_at": "...",
      "file_id": "5f0ebd60-d0c4-4696-af87-3453d0293d98",
      "output_url": "<PRESIGNED_URL>"
    },
    ...
  }
}

Understanding the Results

Status Codes:
  • PENDING: Queued, not started yet
  • IN_PROGRESS: Steps are running
  • COMPLETED: All steps finished successfully
  • FAILED: An error occurred
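Polling usually means repeating the status request until the execution reaches COMPLETED or FAILED. The loop below is a sketch under that assumption; wait_for_execution and make_http_fetcher are hypothetical names, and the 5-second interval and 10-minute timeout are arbitrary defaults, not values recommended by Datalab.

```python
import json
import time
import urllib.request

TERMINAL_STATUSES = {"COMPLETED", "FAILED"}


def make_http_fetcher(api_key):
    """Return a function that fetches one execution record over HTTP."""
    def fetch(execution_id):
        request = urllib.request.Request(
            f"https://www.datalab.to/api/v1/workflows/executions/{execution_id}",
            headers={"X-API-Key": api_key},
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    return fetch


def wait_for_execution(fetch_execution, execution_id,
                       interval=5.0, timeout=600.0, sleep=time.sleep):
    """Poll until the execution is COMPLETED or FAILED, or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        execution = fetch_execution(execution_id)
        if execution["status"] in TERMINAL_STATUSES:
            return execution
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"execution {execution_id} still {execution['status']} "
                f"after {timeout} seconds"
            )
        sleep(interval)
```

A call such as wait_for_execution(make_http_fetcher("YOUR_API_KEY"), 101) returns the final execution record; the caller still needs to check whether its status is COMPLETED or FAILED.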
Step Outputs Structure:
step_outputs
├── parse
│   └── file_abc123 (file ID)
│       ├── checkpoint_id
│       └── success
└── extract
    └── file_abc123
        ├── invoice_number
        ├── vendor_name
        └── ...
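When working with several files, it is often easier to read results per file than per step. The sketch below regroups step_outputs that way, assuming the nested shape drawn in the tree above (step name, then file ID, then fields). The function name outputs_by_file is hypothetical, and the sample values are placeholders rather than real API output.

```python
def outputs_by_file(step_outputs):
    """Regroup {step: {file_id: result}} into {file_id: {step: result}}."""
    by_file = {}
    for step_name, files in step_outputs.items():
        for file_id, result in files.items():
            by_file.setdefault(file_id, {})[step_name] = result
    return by_file


# Placeholder data shaped like the tree above; the values are made up.
sample = {
    "parse": {"file_abc123": {"checkpoint_id": "chk_1", "success": True}},
    "extract": {"file_abc123": {"invoice_number": "INV-001",
                                "vendor_name": "Example Vendor"}},
}
grouped = outputs_by_file(sample)
print(grouped["file_abc123"]["extract"]["invoice_number"])  # prints INV-001
```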

Execute with Multiple Files

The same workflow can process multiple invoices in parallel:
curl -X POST https://www.datalab.to/api/v1/workflows/workflows/42/execute \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input_config": {
      "file_urls": [
        "https://www.wmaccess.com/downloads/sample-invoice.pdf",
        "https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf",
        "https://pdfobject.com/pdf/sample.pdf"
      ]
    }
  }'
NOTE: Soon you’ll be able to create collections of documents. If you need this urgently, reach out to us at support@datalab.to!
Response:
{
  "execution_id": 102,
  "workflow_id": 42,
  "status": "PENDING",
  "created_at": "2024-01-20T10:10:00Z"
}
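When the list of files is assembled programmatically, a small amount of cleanup before submission can avoid wasted executions. The sketch below trims whitespace, drops duplicate URLs, and rejects anything that is not an http(s) URL. These checks are a local precaution and an assumption about what is useful; they are not documented requirements of the API, and build_input_config is a hypothetical helper.

```python
from urllib.parse import urlparse


def build_input_config(file_urls):
    """Return an execute payload with cleaned, de-duplicated file URLs."""
    cleaned = []
    for raw in file_urls:
        url = raw.strip()
        parsed = urlparse(url)
        # Reject anything that is not an absolute http(s) URL.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an http(s) URL: {raw!r}")
        # Keep the first occurrence of each URL, preserving order.
        if url not in cleaned:
            cleaned.append(url)
    return {"input_config": {"file_urls": cleaned}}


payload = build_input_config([
    "https://www.wmaccess.com/downloads/sample-invoice.pdf",
    " https://pdfobject.com/pdf/sample.pdf ",
    "https://www.wmaccess.com/downloads/sample-invoice.pdf",
])
print(len(payload["input_config"]["file_urls"]))  # prints 2
```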

Try it out

Sign up for Datalab and try out Marker - it’s free, and we’ll include credits. If you need a self-hosted solution, you can directly purchase an on-prem license, no crazy sales process needed, or reach out for custom enterprise quotes / contracts. As always, write to us at support@datalab.to if you want credits or have any specific questions / requests!