Extract structured data from documents using JSON schemas.
Extract specific fields from documents by providing a JSON schema. Marker parses the document and fills in your schema with extracted values.

Before you begin, make sure you have:
The extraction_mode parameter controls how extraction runs. This is separate from mode, which controls document parsing.
| Mode | Description | Price | Latency |
| --- | --- | --- | --- |
| fast | Extraction with per-field citations | $6 / 1K pages | Lowest |
| balanced (default) | Extraction with independent verification, per-field reasoning, and extraction status | $25 / 1K pages | Slower; trades speed for higher accuracy |
Both modes return citations for every extracted field. Balanced mode additionally returns _meta per field with extraction_status, reasoning, and verification results.
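To compare the two modes on cost, the per-page prices in the table can be turned into a quick estimate. This is a minimal sketch that assumes cost scales linearly with page count; actual billing may round or bundle pages differently, and the function name is illustrative only.

```python
# Prices per 1,000 pages, copied from the mode table above.
PRICE_PER_1K_PAGES_USD = {"fast": 6.0, "balanced": 25.0}


def estimate_extraction_cost(pages: int, extraction_mode: str = "balanced") -> float:
    """Estimate extraction cost in USD, assuming linear per-page pricing."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if extraction_mode not in PRICE_PER_1K_PAGES_USD:
        raise ValueError(f"unknown extraction_mode: {extraction_mode!r}")
    return pages * PRICE_PER_1K_PAGES_USD[extraction_mode] / 1000
```

Under that assumption, a 200-page document would cost about $1.20 in fast mode and $5.00 in balanced mode.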
balanced is the default. Teams that made an extraction request in the 30 days before June 4, 2026 default to fast instead. Pass extraction_mode explicitly to override the default in either case.
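The default rule above can be written out as a small decision function. This sketch runs entirely on the client side and is only an illustration of the rule as stated. How the service treats the exact boundaries of the 30-day window is an assumption here, and the function and argument names are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional

CUTOVER = date(2026, 6, 4)      # date named in the rule above
LOOKBACK = timedelta(days=30)   # "in the 30 days before" the cutover


def effective_extraction_mode(
    explicit_mode: Optional[str],
    last_extraction_request: Optional[date],
) -> str:
    """Return the mode that would apply to a request under the stated rule."""
    # An explicit extraction_mode always overrides the default.
    if explicit_mode is not None:
        return explicit_mode
    # Assumption: the window includes its first day and excludes the cutover day.
    if (
        last_extraction_request is not None
        and CUTOVER - LOOKBACK <= last_extraction_request < CUTOVER
    ):
        return "fast"
    return "balanced"
```

Passing extraction_mode on every request avoids depending on this rule at all, which keeps behavior stable across teams and accounts.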
If you already converted a document with save_checkpoint=True using the Convert API, pass the checkpoint_id to ExtractOptions to skip re-parsing. This saves time and cost when running extraction on a previously converted document.
The extract endpoint accepts the following parameters: file, page_schema or schema_id (one is required), schema_version, mode, max_pages, page_range, save_checkpoint, checkpoint_id, webhook_url, and processing_location (e.g. "eu" — routes processing and storage to EU infrastructure; requires file_url or a pre-uploaded datalab:// reference instead of a multipart upload).
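Two of the constraints in the parameter list can be checked before a request is sent. The helper below is a hypothetical client-side pre-flight check, not part of any Datalab SDK. It assumes request parameters are held in a plain dict and that the caller knows whether the file is being sent as a multipart upload.

```python
from typing import Any, Dict, List


def check_extract_params(params: Dict[str, Any], multipart_upload: bool) -> List[str]:
    """Return a list of problems found in the request parameters (empty if none)."""
    problems: List[str] = []
    # The endpoint requires page_schema or schema_id.
    if not params.get("page_schema") and not params.get("schema_id"):
        problems.append("one of page_schema or schema_id is required")
    # processing_location needs file_url or a datalab:// reference, not a multipart upload.
    if params.get("processing_location") and multipart_upload:
        problems.append(
            "processing_location requires file_url or a datalab:// reference, "
            "not a multipart upload"
        )
    return problems
```

For example, `check_extract_params({"schema_id": "abc"}, multipart_upload=True)` returns an empty list, while adding `"processing_location": "eu"` to the same call reports the upload conflict.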
Instead of passing page_schema inline, you can save schemas to Datalab and reference them by ID. This avoids repeating the schema in every request and enables versioning.
Extraction scoring is in beta. We’d love your feedback: reach out at support@datalab.to. Scoring is free.
Scoring runs automatically after every extraction. When you poll request_check_url, the response initially contains just the extracted fields and citations. If you continue polling the same URL, the response will eventually include _score fields and an extraction_score_average once scoring completes. No extra parameters or endpoints are needed.
Each _score field is a {"score": int, "reasoning": str} object explaining what evidence was found or missing.
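The "keep polling until scores appear" behavior can be sketched as a loop that stops once extraction_score_average is present. To stay independent of any HTTP client, this sketch takes a hypothetical `fetch` callable that the caller supplies. It is assumed to request request_check_url and return the parsed response as a dict in which extraction_score_average appears once scoring completes. The interval and retry limit are arbitrary.

```python
import time
from typing import Any, Callable, Dict


def wait_for_scores(
    fetch: Callable[[], Dict[str, Any]],
    interval_s: float = 2.0,
    max_polls: int = 30,
) -> Dict[str, Any]:
    """Poll until the payload contains extraction_score_average, or give up.

    Returns the last payload seen, which may lack scores if max_polls is reached.
    """
    payload: Dict[str, Any] = {}
    for attempt in range(max_polls):
        payload = fetch()
        if "extraction_score_average" in payload:
            return payload
        # Sleep between attempts, but not after the final one.
        if attempt < max_polls - 1:
            time.sleep(interval_s)
    return payload
```

Because scoring completes after extraction, callers that only need the extracted values can use the first complete response and skip this wait entirely.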
Once scoring finishes, each field also gets a _score object, and the top-level response includes an extraction_score_average:
{
  "invoice_number": "INV-2024-001",
  "invoice_number_citations": ["block_123"],
  "invoice_number_score": {
    "score": 5,
    "reasoning": "Value found verbatim in the document header with a matching citation."
  },
  "total_amount": 1500.00,
  "total_amount_citations": ["block_456"],
  "total_amount_score": {
    "score": 4,
    "reasoning": "Amount found in the totals row; minor ambiguity due to a subtotal nearby."
  }
}
The top-level response also includes extraction_score_average (4.5 in this case), averaging all field scores.

Score rubric:
| Score | Meaning |
| --- | --- |
| 5 | High confidence: clear match with strong citation support |
| 4 | Good confidence: match found with minor ambiguity |
| 3 | Moderate confidence: partial match or uncertain citation |
| 2 | Low confidence: match is inferred or weakly supported |
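One way to act on the rubric is to collect the per-field scores and flag anything below a chosen threshold for manual review. The sketch below assumes the field layout from the example response, where each field has a matching `<field>_score` object. The threshold of 3 and the function name are illustrative choices, not part of the API.

```python
from typing import Any, Dict, List, Optional, Tuple

SCORE_SUFFIX = "_score"


def summarize_scores(
    extracted: Dict[str, Any], threshold: int = 3
) -> Tuple[Optional[float], List[str]]:
    """Return (average score, fields scoring below threshold).

    The average is None when no *_score objects are present yet.
    """
    scores: Dict[str, int] = {}
    for key, value in extracted.items():
        # A score entry is a dict with a "score" key stored under "<field>_score".
        if key.endswith(SCORE_SUFFIX) and isinstance(value, dict) and "score" in value:
            scores[key[: -len(SCORE_SUFFIX)]] = value["score"]
    if not scores:
        return None, []
    average = sum(scores.values()) / len(scores)
    needs_review = sorted(name for name, s in scores.items() if s < threshold)
    return average, needs_review
```

Applied to the example response, the two field scores of 5 and 4 average to 4.5, matching extraction_score_average, and no field falls below the default threshold.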
Don’t want to write schemas by hand? Use the schema generation endpoint to automatically suggest schemas for your document. This requires a checkpoint from a previous conversion: