This tutorial shows how to create and execute a workflow by building an invoice data extraction pipeline: a parse step followed by an extract step.

SDK Approach

The SDK is the simplest way to create a workflow. Define the steps, then register them with create_workflow:
from datalab_sdk import DatalabClient
from datalab_sdk.models import WorkflowStep, InputConfig

client = DatalabClient()

# Define steps
steps = [
    WorkflowStep(
        unique_name="parse",
        step_key="marker_parse",
        settings={"max_pages": 10},
        depends_on=[]
    ),
    WorkflowStep(
        unique_name="extract",
        step_key="marker_extract",
        settings={
            "page_schema": {
                "invoice_number": {"type": "string", "description": "Invoice ID"},
                "vendor_name": {"type": "string", "description": "Company name"},
                "total_amount": {"type": "number", "description": "Total due"},
                "invoice_date": {"type": "string", "description": "Issue date"},
                "line_items": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "description": {"type": "string"},
                            "quantity": {"type": "number"},
                            "unit_price": {"type": "number"},
                            "total": {"type": "number"}
                        }
                    }
                }
            }
        },
        depends_on=["parse"]
    )
]

# Create workflow
workflow = client.create_workflow(name="Invoice Extraction", steps=steps)
print(f"Created workflow: {workflow.id}")

Execute with Files

# Single file
input_config = InputConfig(file_urls=["https://example.com/invoice.pdf"])
execution = client.execute_workflow(workflow.id, input_config)

# Multiple files (processed in parallel)
input_config = InputConfig(file_urls=[
    "https://example.com/invoice1.pdf",
    "https://example.com/invoice2.pdf",
    "https://example.com/invoice3.pdf"
])
execution = client.execute_workflow(workflow.id, input_config)

print(f"Execution ID: {execution.id}")

Wait for Results

# Poll the execution (up to max_polls times) and download step outputs when it finishes
result = client.get_execution_status(
    execution.id,
    max_polls=300,
    download_results=True
)

if result.status == "COMPLETED":
    print("Workflow completed!")
    for step_name, outputs in result.steps.items():
        print(f"\n{step_name}:")
        print(outputs)
elif result.status == "FAILED":
    print(f"Failed: {result.error}")

Use Local Files

To run a workflow on local files, upload them first, then pass the returned references in place of URLs:
# Upload files
files = client.upload_files(["invoice1.pdf", "invoice2.pdf"])
references = [f.reference for f in files]

# Use in workflow
input_config = InputConfig(file_urls=references)
execution = client.execute_workflow(workflow.id, input_config)

REST API

Create Workflow

curl -X POST https://www.datalab.to/api/v1/workflows/workflows \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Invoice Extraction",
    "steps": [
      {
        "step_key": "marker_parse",
        "unique_name": "parse",
        "settings": {"max_pages": 10}
      },
      {
        "step_key": "marker_extract",
        "unique_name": "extract",
        "settings": {
          "page_schema": {
            "invoice_number": {"type": "string"},
            "total_amount": {"type": "number"}
          }
        },
        "depends_on": ["parse"]
      }
    ]
  }'
Response:
{
  "workflow_id": 42,
  "name": "Invoice Extraction",
  "steps": [...]
}
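
The same request can be assembled from Python without the SDK. The following is a minimal sketch using only the standard library; the helper name build_create_request is hypothetical, and the request is built but not sent, so the response handling is left to the reader.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://www.datalab.to/api/v1/workflows/workflows"


def build_create_request(api_key: str, name: str, steps: list) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a workflow.

    Hypothetical helper for illustration; it mirrors the curl call above.
    """
    body = json.dumps({"name": name, "steps": steps}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Same two steps as the curl payload above.
steps = [
    {"step_key": "marker_parse", "unique_name": "parse", "settings": {"max_pages": 10}},
    {
        "step_key": "marker_extract",
        "unique_name": "extract",
        "settings": {
            "page_schema": {
                "invoice_number": {"type": "string"},
                "total_amount": {"type": "number"},
            }
        },
        "depends_on": ["parse"],
    },
]

request = build_create_request("YOUR_API_KEY", "Invoice Extraction", steps)
# To send it: with urllib.request.urlopen(request) as resp: workflow = json.load(resp)
```

Building the request separately from sending it makes the payload easy to inspect or log before any credits are spent.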

Execute Workflow

curl -X POST https://www.datalab.to/api/v1/workflows/workflows/42/execute \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input_config": {
      "file_urls": ["https://example.com/invoice.pdf"]
    }
  }'
Response:
{
  "execution_id": 101,
  "workflow_id": 42,
  "status": "PENDING"
}

Check Status

curl https://www.datalab.to/api/v1/workflows/executions/101 \
  -H "X-API-Key: YOUR_API_KEY"
Response when complete:
{
  "execution_id": 101,
  "status": "COMPLETED",
  "step_outputs": {
    "parse": {
      "file_abc123": {"status": "COMPLETED", "output_url": "..."}
    },
    "extract": {
      "file_abc123": {
        "invoice_number": "INV-2024-001",
        "total_amount": 1500.00
      }
    }
  }
}
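
Because an execution starts as PENDING, a REST client usually polls the status endpoint until it reaches a final state. The sketch below shows one way to structure that loop. The names wait_for_execution, fetch_status and interval_s are hypothetical, and it assumes COMPLETED and FAILED (the two final statuses shown on this page) are the only terminal states. The HTTP call is injected as a function, so the demo runs against canned responses instead of the network.

```python
import time
from typing import Callable

# Assumption: these are the only terminal statuses; anything else means "still running".
TERMINAL_STATUSES = {"COMPLETED", "FAILED"}


def wait_for_execution(
    fetch_status: Callable[[], dict],
    max_polls: int = 300,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status until the execution is terminal or max_polls is reached."""
    for attempt in range(max_polls):
        payload = fetch_status()
        if payload.get("status") in TERMINAL_STATUSES:
            return payload
        if attempt < max_polls - 1:
            sleep(interval_s)  # wait before asking again
    raise TimeoutError(f"execution not finished after {max_polls} polls")


# Demo with canned responses shaped like the examples above.
responses = iter([
    {"execution_id": 101, "status": "PENDING"},
    {"execution_id": 101, "status": "PENDING"},
    {
        "execution_id": 101,
        "status": "COMPLETED",
        "step_outputs": {
            "extract": {
                "file_abc123": {"invoice_number": "INV-2024-001", "total_amount": 1500.00}
            }
        },
    },
])

final = wait_for_execution(lambda: next(responses), max_polls=5, sleep=lambda _: None)

# Extract results are keyed by step name, then by file.
for file_id, fields in final["step_outputs"]["extract"].items():
    print(file_id, fields)
```

Against the real API, fetch_status would be a function that issues the GET request shown above with the X-API-Key header and returns the parsed JSON body.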

Understanding Steps

Parse Step

  • Uses Marker to parse the document
  • Creates a checkpoint for downstream steps
  • max_pages caps the number of pages processed, which helps control cost
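
A downstream step is one that names another step in its depends_on list, so a valid run order follows from those lists. The sketch below derives such an order with Python's standard graphlib module (Python 3.9+). It illustrates the dependency idea only and is not a description of how the service schedules steps; the run_order helper is hypothetical.

```python
from graphlib import TopologicalSorter

# Step definitions reduced to the two fields that matter for ordering.
steps = [
    {"unique_name": "extract", "depends_on": ["parse"]},
    {"unique_name": "parse", "depends_on": []},
]


def run_order(steps: list) -> list:
    """Return step names in an order where every step follows its dependencies."""
    graph = {s["unique_name"]: set(s.get("depends_on", [])) for s in steps}
    # Reject references to steps that were never defined.
    unknown = {dep for deps in graph.values() for dep in deps} - set(graph)
    if unknown:
        raise ValueError(f"depends_on references unknown steps: {sorted(unknown)}")
    # TopologicalSorter raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())


order = run_order(steps)
print(order)  # ['parse', 'extract']
```

Running a check like this locally before creating a workflow catches misspelled step names and circular dependencies early.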

Extract Step

  • Runs after parse (depends_on: ["parse"])
  • Uses page_schema to define extraction fields
  • Returns structured JSON matching your schema
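
Since the extract output is meant to match the schema, a client can check a returned record against its page_schema before using it. The following is a minimal sketch under stated assumptions: it checks only top-level fields, it handles only the type names that appear in this tutorial, and the helper name schema_mismatches is hypothetical. How the service represents fields it could not extract is not covered here, so an absent field is simply reported as missing.

```python
from numbers import Real

# Type names used in this tutorial's schemas, mapped to Python checks.
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "number": lambda v: isinstance(v, Real) and not isinstance(v, bool),
    "array": lambda v: isinstance(v, list),
    "object": lambda v: isinstance(v, dict),
}


def schema_mismatches(schema: dict, record: dict) -> list:
    """List top-level fields that are missing or have an unexpected type."""
    problems = []
    for field, spec in schema.items():
        if field not in record:
            problems.append(f"{field}: missing")
            continue
        check = TYPE_CHECKS.get(spec.get("type"))
        # Unknown type names are skipped rather than guessed at.
        if check is not None and not check(record[field]):
            problems.append(f"{field}: expected {spec['type']}")
    return problems


page_schema = {
    "invoice_number": {"type": "string"},
    "total_amount": {"type": "number"},
}

good = {"invoice_number": "INV-2024-001", "total_amount": 1500.00}
bad = {"invoice_number": "INV-2024-001", "total_amount": "1500"}

print(schema_mismatches(page_schema, good))  # []
print(schema_mismatches(page_schema, bad))   # ['total_amount: expected number']
```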
