marker in extraction mode.
What you’ll need:
- one or more PDFs you want to pull data out of
- a compliant schema that describes what you want to extract - we’ll describe how below using Forge Playground
- a way to run our script
Overview
While marker lets you convert PDFs into HTML, JSON, or Markdown, Structured Extraction lets you define a schema and pull out only the fields you care about. You can do this by setting the page_schema parameter in your marker request, which forces it to fill in your schema after PDF conversion finishes. If you don’t include that parameter, marker will just parse your document and convert it to JSON / Markdown / HTML without running extraction.
The easiest way to generate page_schema correctly is to use our editor in Forge Playground, where you can define fields manually and then switch to the JSON Editor tab to copy the schema out. You could also define a Pydantic model and export its JSON Schema with .model_json_schema() (note that .model_dump_json() serializes a model instance's data, not its schema).
We always recommend trying your schemas in Forge Playground first since it’s easy to visually debug issues with parse settings and schemas before running on a larger batch.
Making an API Call to Run Extraction
Let’s say we’re using a recent 10-K filing from Meta. We might use a schema to pull out a few basic metrics. Note that the description field is useful for adding more context about what each field is. (In the future, we’ll be adding separate field validator rules.)
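As a rough sketch of what such a schema could look like, the snippet below builds a small JSON Schema as a Python dict and serializes it to a string suitable for the page_schema parameter. The field names (fiscal_year, total_revenue, net_income) are hypothetical and chosen only for illustration, and the exact schema shape marker accepts is an assumption that is best confirmed in Forge Playground.

```python
import json

# Hypothetical fields for a 10-K filing; adjust names and types to your document.
page_schema = {
    "type": "object",
    "properties": {
        "fiscal_year": {
            "type": "integer",
            "description": "Fiscal year the filing covers",
        },
        "total_revenue": {
            "type": "number",
            "description": "Total revenue for the fiscal year, in millions of USD",
        },
        "net_income": {
            "type": "number",
            "description": "Net income for the fiscal year, in millions of USD",
        },
    },
    "required": ["fiscal_year", "total_revenue", "net_income"],
}


def schema_to_param(schema: dict) -> str:
    """Serialize the schema dict to a JSON string for the page_schema parameter."""
    return json.dumps(schema)


page_schema_param = schema_to_param(page_schema)
```

Each property carries a description, which gives the extractor extra context about what the field means.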
A marker request consists of two things:
- Triggering the request with your configuration and file
- Polling it to see if it’s complete
Poll request_check_url every few seconds. status will be "processing" until it’s done (at which point it changes to "complete").
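The polling step described above could be sketched as follows, using only the standard library. The helper names (fetch_status, poll_until_complete) are hypothetical, and the X-Api-Key header is an assumption about how authentication works; only the status field and its "processing" / "complete" values come from the text above.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_status(check_url: str, api_key: str) -> dict:
    """GET the check URL once and decode the JSON body.

    The X-Api-Key header name is an assumption; use whatever auth your account requires.
    """
    request = urllib.request.Request(check_url, headers={"X-Api-Key": api_key})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def poll_until_complete(
    fetch: Callable[[], dict],
    interval_s: float = 2.0,
    max_polls: int = 300,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until its status is "complete", sleeping between attempts."""
    for _ in range(max_polls):
        result = fetch()
        if result.get("status") == "complete":
            return result
        sleep(interval_s)
    raise TimeoutError(f"request still not complete after {max_polls} polls")
```

With a real request you might call poll_until_complete(lambda: fetch_status(request_check_url, api_key)). Passing the fetch step in as a callable keeps the loop easy to test without network access, and max_polls stops a stuck request from polling forever.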
When it’s done, your response will look something like this:
- When you run in extraction mode, your extracted schema results will be available within extraction_schema_json.
  - This field is returned as a string instead of a dict in case of JSON parse issues (we sometimes see edge cases, especially with inline math equations and LaTeX, but it’s rare). You can usually load it directly into JSON and recover your whole schema.
  - For each field you requested, we’ll also include a [fieldname]_citations which includes a list of Block IDs from your converted PDF that we cited.
- When you run marker in Extraction mode, the original converted PDF is always available within the json response field. You can access all blocks within the children tag, and they maintain their original hierarchy (if there is one). Each block includes its original ID and bounding boxes, so you can show citations and track data lineage easily as part of your document ingestion pipelines!
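To make the response handling above concrete, here is a minimal sketch that parses extraction_schema_json defensively, separates field values from their [fieldname]_citations lists, and builds a lookup from block ID to block. The function names are hypothetical, and the block key names "id" and "children" (and "bbox" in the usage note) are assumptions about the response layout, so check them against a real response.

```python
import json
from typing import Optional

CITATION_SUFFIX = "_citations"


def parse_extraction(response: dict) -> Optional[tuple[dict, dict]]:
    """Split extraction_schema_json into (field values, per-field citations).

    Returns None when the string is missing or is not a valid JSON object.
    """
    raw = response.get("extraction_schema_json")
    try:
        data = json.loads(raw)
    except (TypeError, json.JSONDecodeError):
        return None
    if not isinstance(data, dict):
        return None
    values, citations = {}, {}
    for key, value in data.items():
        if key.endswith(CITATION_SUFFIX):
            citations[key[: -len(CITATION_SUFFIX)]] = value
        else:
            values[key] = value
    return values, citations


def index_blocks(block: dict, index: Optional[dict] = None) -> dict:
    """Walk the block tree depth-first and map each block ID to its block.

    Assumes each block stores its ID under "id" and nested blocks under "children".
    """
    if index is None:
        index = {}
    if "id" in block:
        index[block["id"]] = block
    for child in block.get("children") or []:
        index_blocks(child, index)
    return index
```

With both in hand, a citation can be resolved to its source block: after values, citations = parse_extraction(response) and blocks = index_blocks(response["json"]), each ID in citations["some_field"] can be looked up in blocks to retrieve its bounding box (assumed here to live under a "bbox" key).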