In this example, you’ll learn how to use the Datalab Marker API to extract specific data out of any PDF by running marker in extraction mode. What you’ll need:
  • one or more PDFs you want to pull data out of
  • a compliant schema that describes what you want to extract - we’ll show how to build one below using Forge Extract
  • a way to run our script
Let’s get started!

Submitting an Extraction Request using Marker API

While marker lets you convert PDFs into HTML, JSON, or Markdown, Structured Extraction lets you define a schema and pull out only the fields you care about. You can do this by setting the page_schema parameter in your marker request, which tells marker to fill in your schema after PDF conversion finishes. If you don’t include that parameter, marker will simply parse your document and convert it to JSON / Markdown / HTML without running extraction.

The easiest way to generate page_schema correctly is to use our editor in Forge Extract, where you can define fields manually and switch to the JSON Editor tab to copy the schema out. You could also define a Pydantic model and convert it to a JSON schema with model_json_schema(). We always recommend trying your schemas in Forge Extract first, since it’s easy to visually debug issues with parse settings and schemas before running a larger batch.
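If you go the Pydantic route, here’s a minimal sketch of generating a schema string in code. This assumes Pydantic v2, and the model and field names here are just illustrative placeholders:

```python
import json
from pydantic import BaseModel, Field

class Metrics(BaseModel):
    # Hypothetical fields -- replace with whatever you want to extract
    diluted_eps_2025: float = Field(description="The diluted EPS for 2025")
    diluted_eps_2024: float = Field(description="The diluted EPS for 2024")

class Page(BaseModel):
    metrics: Metrics

# model_json_schema() returns a dict; serialize it to pass as page_schema
page_schema = json.dumps(Page.model_json_schema())
```

Note that Pydantic emits nested models under $defs with $ref pointers, so the output won’t be byte-for-byte identical to a hand-written inline schema; the JSON Editor tab in Forge Extract is the easiest place to confirm the exact shape you want.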

Making an API Call to Run Extraction

Let’s say we’re using a recent 10-K filing from Meta. We might use a schema like this to pull out a few basic metrics. Note that the description field is useful for adding context about what each field means. (In the future, we’ll be adding separate field validator rules.)
PAGE_SCHEMA = """{
    "type": "object",
    "properties": {
      "metrics": {
        "type": "object",
        "properties": {
          "diluted_eps_2025": {
            "type": "number",
            "description": "The diluted Earnings per Share (EPS) for 2025"
          },
          "diluted_eps_2024": {
            "type": "number",
            "description": "The diluted Earnings per Share (EPS) for 2024"
          },
          "pct_change_diluted_eps_2024_to_2025": {
            "type": "number",
            "description": "The percentage change in diluted Earnings per Share (EPS) from 2024 to 2025"
          }
        }
      }
    },
    "required": ["metrics"]
  }"""
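Since page_schema is passed as a string, a quick local json.loads check catches typos before you spend an API call (abbreviated to one field here for brevity):

```python
import json

PAGE_SCHEMA = """{
  "type": "object",
  "properties": {
    "metrics": {
      "type": "object",
      "properties": {
        "diluted_eps_2025": {
          "type": "number",
          "description": "The diluted Earnings per Share (EPS) for 2025"
        }
      }
    }
  },
  "required": ["metrics"]
}"""

# json.loads raises ValueError if the schema string is malformed
schema = json.loads(PAGE_SCHEMA)
assert schema["required"] == ["metrics"]
```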
Submitting a request to marker consists of two things:
  • Triggering the request with your configuration and file
  • Polling it to see if it’s complete
Let’s go ahead and submit our request.
import requests

url = "https://www.datalab.to/api/v1/marker"

form_data = {
    'file': ('meta_10k.pdf', open('meta_10k.pdf', 'rb'), 'application/pdf'),
    'page_schema': (None, PAGE_SCHEMA),
    'output_format': (None, 'json'),
    'use_llm': (None, 'true')  # multipart form values must be strings
}
headers = {"X-Api-Key": "YOUR_API_KEY"}

# Submit your request
response = requests.post(url, files=form_data, headers=headers)
data = response.json()
Your response will look something like this:
{
	"success": true,
	"error": null,
	"request_id": "<your_request_id>",
	"request_check_url": "https://www.datalab.to/api/v1/marker/<your_request_id>",
	"versions": null
}
You can then poll for completion by using request_check_url every few seconds.
import time

# Use request_check_url to poll for job completion
max_polls = 300
check_url = data["request_check_url"]

for i in range(max_polls):
    time.sleep(2)
    response = requests.get(check_url, headers=headers) # Don't forget to send the auth headers
    data = response.json()

    if data["status"] == "complete":
        break
Note that status will be "processing" until the job is done, at which point it changes to "complete". When it’s done, your response will look something like this:
{
	"status": "complete",
	"json": {
		"children": [...],
	},
	"extraction_schema_json": "{...your extraction results...}",
	...
}
Two really important things to call out:
  • When you run in extraction mode, your extracted schema results will be available within extraction_schema_json.
    • This field is returned as a string instead of a dict in case of JSON parse issues (we sometimes see edge cases, especially with inline math equations and LaTeX, but it’s rare). You can usually parse it directly with json.loads and recover your whole schema.
    • For each field you requested, we’ll also include a [fieldname]_citations which includes a list of Block IDs from your converted PDF that we cited.
  • When you run marker in Extraction mode, the original converted PDF is always available within the json response field. You can access all blocks within the children tag, and they maintain their original hierarchy (if there is one). Each block includes its original ID and bounding boxes, so you can show citations and track data lineage easily as part of your document ingestion pipelines!
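Putting those two points together, here’s a small sketch of post-processing a completed response. The payload below is made up for illustration (real responses contain many more blocks, and the exact block ID format may differ), but the field names match the ones described above:

```python
import json

# Made-up completed response, shaped like the fields described above
data = {
    "status": "complete",
    "json": {
        "children": [
            {"id": "/page/0/Table/3", "bbox": [50, 100, 500, 300], "children": None},
        ],
    },
    "extraction_schema_json": json.dumps({
        "metrics": {
            "diluted_eps_2025": 25.61,  # illustrative value, not real data
            "diluted_eps_2025_citations": ["/page/0/Table/3"],
        }
    }),
}

# extraction_schema_json arrives as a string, so parse it first
results = json.loads(data["extraction_schema_json"])

# Index every block in the converted document by ID so citations resolve
def walk(block, out):
    out[block["id"]] = block
    for child in block.get("children") or []:
        walk(child, out)

blocks = {}
for child in data["json"]["children"]:
    walk(child, blocks)

# Each citation ID maps back to a block (and its bounding box)
for cited_id in results["metrics"]["diluted_eps_2025_citations"]:
    print(cited_id, blocks[cited_id]["bbox"])
```

This is the basic pattern for showing citations in a UI: look up each cited Block ID in the converted document and highlight its bounding box on the source page.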
If you’re working on super long documents, check out our next guide.

Try it out

Sign up for Datalab and try out Forge Extract. Reach out to us at support@datalab.to if you want credits or have any questions about your use case!