Get fields you need directly out of any PDF (invoices, financial statements, scientific papers, and more) with marker in extraction mode.
What you’ll need:
marker lets you convert PDFs into HTML, JSON, or Markdown; Structured Extraction lets you define a schema and pull out only the fields you care about.
You can do this by setting the page_schema parameter in your marker request, which forces it to fill in your schema after PDF conversion finishes. If you don’t include that parameter, marker will just parse your document and convert it to JSON / Markdown / HTML without running extraction.
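For instance, here’s a rough sketch of a submission with Python’s requests library. The endpoint URL, auth header, and field names other than page_schema are illustrative assumptions — check the API reference for the exact values:

```python
import requests

url = "https://www.datalab.to/api/v1/marker"  # assumed endpoint
headers = {"X-Api-Key": "YOUR_API_KEY"}       # assumed auth header name

page_schema_json = "{...}"  # your schema as a JSON string (building it is covered below)

with open("invoice.pdf", "rb") as f:
    resp = requests.post(
        url,
        headers=headers,
        files={"file": ("invoice.pdf", f, "application/pdf")},
        data={"page_schema": page_schema_json},
    )

print(resp.json())  # should include a request_check_url to poll (see below)
```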
The easiest way to generate page_schema correctly is to use our editor in Forge Extract, where you can define fields manually and switch into the JSON Editor tab to pull it out. You could also create a Pydantic schema, then convert it to JSON with .model_dump_json().
We always recommend trying your schemas in Forge Extract first since it’s easy to visually debug issues with parse settings and schemas before running on a larger batch.
The description field is useful for adding more context around what the field is. (In the future, we’ll be adding separate field validator rules.)
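For example, here’s a minimal Pydantic sketch, assuming page_schema accepts the standard JSON Schema that a model class emits; compare the output against what the JSON Editor tab shows before relying on it:

```python
import json
from pydantic import BaseModel, Field

class Invoice(BaseModel):
    invoice_number: str = Field(description="The invoice's unique identifier, e.g. INV-1042")
    total_due: float = Field(description="Total amount due, in the invoice currency")
    due_date: str = Field(description="Payment due date exactly as written on the invoice")

# Each Field description becomes the per-field description discussed above.
page_schema_json = json.dumps(Invoice.model_json_schema())
print(page_schema_json)
```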
Running a request through marker consists of two things: submitting the job, and then polling the request_check_url it returns every few seconds. The status will be "processing" until it’s done (at which point it changes to "complete").
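A minimal polling sketch (again with requests; the auth header name is an assumption, and the status values are the ones described above):

```python
import time
import requests

headers = {"X-Api-Key": "YOUR_API_KEY"}       # assumed auth header name
check_url = resp.json()["request_check_url"]  # returned by the submission above

result = requests.get(check_url, headers=headers).json()
while result["status"] == "processing":
    time.sleep(2)  # poll every few seconds
    result = requests.get(check_url, headers=headers).json()
```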
When it’s done, the final response contains both your extraction results and the converted document.
Your extracted data is returned in extraction_schema_json. We return it as a string instead of a dict in case of JSON parse issues (we sometimes see edge cases, especially with inline math equations and LaTeX, but it’s rare). You can usually load it straight back into JSON and recover your whole schema. Each extracted field also comes with [fieldname]_citations, which includes a list of Block IDs from your converted PDF that we cited.
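A small sketch of reading the results back out — result here is the final payload from the polling sketch above, and treating each field’s citations as a sibling <fieldname>_citations key inside the parsed extraction is an assumption based on the naming:

```python
import json

extracted = json.loads(result["extraction_schema_json"])  # returned as a string

# Assumption: each field's citations sit alongside it as <fieldname>_citations,
# holding the Block IDs from the converted document that were cited.
for key, value in extracted.items():
    if key.endswith("_citations"):
        print(f"{key}: {value}")
```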
When you run marker in Extraction mode, the original converted PDF is always available within the json response field. You can access all blocks within the children tag, and they maintain their original hierarchy (if there is one). Each block includes its original ID and bounding boxes, so you can show citations and track data lineage easily as part of your document ingestion pipelines!
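Continuing the sketch, you can index the converted document’s blocks by ID and resolve citations against them. The id, children, and bbox key names are assumptions based on the description above, and invoice_number is the example field from the Pydantic sketch earlier:

```python
def index_blocks(block, index=None):
    """Recursively map block ID -> block across the converted document."""
    if index is None:
        index = {}
    if isinstance(block, dict):
        if block.get("id"):
            index[block["id"]] = block
        for child in block.get("children") or []:
            index_blocks(child, index)
    return index

block_index = index_blocks(result["json"])  # the converted document from the same payload

# Look up the blocks behind one field's citations to get their bounding boxes.
for block_id in extracted.get("invoice_number_citations", []):
    block = block_index.get(block_id)
    if block:
        print(block_id, block.get("bbox"))
```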