Run Extraction with Marker API

In this example, you’ll learn how to use Datalab Marker API to extract specific data out of any PDF by running marker in extraction mode. What you’ll need:

one or more PDFs you want to pull data out of
a compliant schema that describes what you want to extract - we’ll describe how below using Forge Playground
a way to run our script

Let’s get started!

Overview

While marker lets you convert PDFs into HTML, JSON, or Markdown, Structured Extraction lets you define a schema and pull out only the fields you care about You can do this by setting the page_schema parameter in your marker request, which forces it to fill in your schema after PDF conversion finishes. If you don’t include that parameter, marker will just parse your document and convert it to JSON / Markdown / HTML without running extraction. The easiest way to generate page_schema correctly is to use our editor in Forge Playground where you can define fields manually and switch into the JSON Editor tab to pull it out. You could also create a Pydantic schema, then convert to JSON with .model_dump_json(). We always recommend trying your schemas in Forge Playground first since it’s easy to visually debug issues with parse settings and schemas before running on a larger batch.

Making an API Call to Run Extraction

Let’s say we’re using a recent 10-K filing from Meta We might use a schema like this to pull a few basic metrics out. Note that the description field is useful to add more context around what the field is. (In the future, we’ll be adding separate field validator rules.)

PAGE_SCHEMA = """{
    "type": "object",
    "properties": {
      "metrics": {
        "type": "object",
        "properties": {
          "diluted_eps_2025": {
            "type": "number",
            "description": "The diluted Earnings per Share (EPS) for 2025"
          },
          "diluted_eps_2024": {
            "type": "number",
            "description": "The diluted Earnings per Share (EPS) for 2024"
          },
          "pct_change_diluted_eps_2024_to_2025": {
            "type": "number",
            "description": "The percentage change in diluted Earnings per Share (EPS) from 2024 to 2025"
          }
        }
      }
    },
    "required": ["metrics"]
  }"""

Submitting a request to marker consists of two things:

Triggering the request with your configuration and file
Polling it to see if it’s complete

Let’s go ahead and submit our request.

import requests

url = "<https://www.datalab.to/api/v1/marker>"

form_data = {
    'file': ('meta_10k.pdf', open('meta_10k.pdf', 'rb'), 'application/pdf'),
    'page_schema': (None, PAGE_SCHEMA),
    'output_format': (None, 'json'),
    'use_llm': (None, True)
}
headers = {"X-Api-Key": "YOUR_API_KEY"}

# Submit your request

response = requests.post(url, files=form_data, headers=headers)
data = response.json()

Your response will look something like this

{
	"success": true,
	"error": null,
	"request_id": "<your_request_id>",
	"request_check_url": "<https://www.datalab.to/api/v1/marker/<your_request_id>",
	"versions": null
}

You can then poll for completion by using request_check_url every few seconds.

import time

# Use request_check_url to poll for job completion

max_polls = 300
check_url = data["request_check_url"]

for i in range(max_polls):
    time.sleep(2)
    response = requests.get(check_url, headers=headers) # Don't forget to send the auth headers
    data = response.json()

    if data["status"] == "complete":
        break

Note that status will be "processing" until it’s done (at which point it changes to "complete" . When it’s done, your response will look something like this:

{
	"status": "complete",
	"json": {
		"children": [...],
	},
	"extraction_schema_json": "{...your extraction results...}",
	...
}

Two really important things to call out:

When you run in extraction mode, your extracted schema results will be available within extraction_schema_json .
- This field is returned as a string instead of a dict in case of JSON parse issues (we sometimes see edge cases especially with inline math equations and LaTeX, but it’s rare). You can usually load it directly into JSON and recover your whole schema.
- For each field you requested, we’ll also include a [fieldname]_citations which includes a list of Block IDs from your converted PDF that we cited.
When you run marker in Extraction mode, the original converted PDF is always available within the json response field. You can access all blocks within the children tag, and they maintain their original hierarchy (if there is one). Each block includes its original ID and bounding boxes, so you can show citations and track data lineage easily as part of your document ingestion pipelines!

If you’re working on super long documents, check out our next guide

Try it out

Sign up for Datalab and try out Marker - it’s free, and we’ll include credits. If you need a self-hosted solution, you can directly purchase an on-prem license, no crazy sales process needed, or reach out for custom enterprise quotes / contracts. As always, write to us at support@datalab.to if you want credits or have any specific questions / requests!

General

Parsing PDFs with Marker

Table Recognition

Structured Extraction

Document Segmentation

Try Datalab

Try Datalab

Overview

Making an API Call to Run Extraction

Try it out

General

Parsing PDFs with Marker

Table Recognition

Structured Extraction

Document Segmentation

Try Datalab

Try Datalab

​Overview

​Making an API Call to Run Extraction

​Try it out

Overview

Making an API Call to Run Extraction

Try it out