Accurate Extraction Mode - Datalab Documentation

Accurate mode runs a multi-pass extraction pipeline with independent verification. Every extracted field includes an audit trail: where the value came from, how it was derived, and whether an independent check confirmed it.

Before you begin, make sure you have:

A Datalab account with an API key (new accounts include $5 in free credits)
Python 3.10+ installed
The Datalab SDK: pip install datalab-python-sdk
Your DATALAB_API_KEY environment variable set

When to Use Accurate vs Balanced

	Balanced (default)	Accurate
Price	$6 / 1K pages	$25 / 1K pages
Latency	Fast	Slower — trades speed for accuracy via multi-pass verification
Per-field citations	Yes	Yes
Extraction status	No	Yes (EXTRACTED / NOT_RESOLVABLE)
Per-field reasoning	No	Yes
Independent verification	No	Yes (PASS / FAIL_* / ITEMS_MISSING)
Best for	High-volume workflows: invoices, forms, bank statements	Compliance, financial, legal, and medical workflows where every field needs an audit trail

Use balanced when speed and cost matter most. Use accurate when you need to trust every field and want metadata to power downstream decisions.
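
To make the price difference concrete, the sketch below estimates extraction cost from the per-1K-page prices in the table above. The function name `estimate_cost` and the 40,000-page volume are illustrative assumptions, and actual billing may differ (for example, in rounding).

```python
# Prices per 1,000 pages, taken from the comparison table above.
PRICE_PER_1K_PAGES = {"balanced": 6.00, "accurate": 25.00}


def estimate_cost(pages: int, extraction_mode: str) -> float:
    """Return the estimated extraction cost in USD for a page count."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages / 1000 * PRICE_PER_1K_PAGES[extraction_mode]


if __name__ == "__main__":
    pages = 40_000  # hypothetical monthly volume
    for mode in PRICE_PER_1K_PAGES:
        print(f"{mode}: ${estimate_cost(pages, mode):,.2f}")
```

At this volume, accurate mode costs roughly four times as much as balanced mode, so one option is to reserve it for the document types that need a per-field audit trail.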

Quick Start

import json
from datalab_sdk import DatalabClient, ExtractOptions

client = DatalabClient()

schema = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string", "description": "Full legal name of the company"},
        "fiscal_year_end": {"type": "string", "description": "End date of the fiscal year (YYYY-MM-DD)"},
        "total_revenue": {"type": "number", "description": "Total revenue in the reporting currency"},
        "auditor_name": {"type": "string", "description": "Name of the external audit firm"}
    },
    "required": ["company_name", "fiscal_year_end"]
}

options = ExtractOptions(
    page_schema=json.dumps(schema),
    extraction_mode="accurate"
)

result = client.extract("annual_report.pdf", options=options)
extracted = json.loads(result.extraction_schema_json)

# Each field comes with citations and metadata
print(f"Company: {extracted['company_name']}")
print(f"Citations: {extracted['company_name_citations']}")
print(f"Status: {extracted['company_name_meta']['extraction_status']}")
print(f"Verified: {extracted['company_name_meta']['verification']['status']}")

extraction_mode controls the extraction pipeline (balanced or accurate). This is separate from mode, which controls the document parsing stage (fast, balanced, or accurate). You can combine them independently — for example, mode="fast" with extraction_mode="accurate".
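
Because the two settings are independent, it can help to validate the pair before building options. The sketch below is a minimal example: `build_option_kwargs` is a hypothetical helper, the allowed values are the ones listed in the paragraph above, and passing the result as `ExtractOptions(**kwargs)` assumes the SDK accepts `mode` as a keyword argument, as that paragraph implies.

```python
import json

# Allowed values as listed in the paragraph above.
PARSE_MODES = {"fast", "balanced", "accurate"}
EXTRACTION_MODES = {"balanced", "accurate"}


def build_option_kwargs(schema: dict, mode: str, extraction_mode: str) -> dict:
    """Validate both settings and return keyword arguments for ExtractOptions."""
    if mode not in PARSE_MODES:
        raise ValueError(f"mode must be one of {sorted(PARSE_MODES)}")
    if extraction_mode not in EXTRACTION_MODES:
        raise ValueError(
            f"extraction_mode must be one of {sorted(EXTRACTION_MODES)}"
        )
    return {
        "page_schema": json.dumps(schema),
        "mode": mode,
        "extraction_mode": extraction_mode,
    }


# Fast parsing combined with accurate extraction.
kwargs = build_option_kwargs(
    {"type": "object", "properties": {"company_name": {"type": "string"}}},
    mode="fast",
    extraction_mode="accurate",
)
# Assumed usage: options = ExtractOptions(**kwargs)
```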

Response Format

In accurate mode, each extracted field is returned as three keys: the value itself, a _citations sibling, and a _meta sibling. The _citations sibling uses the same format as balanced mode for compatibility; accurate mode adds _meta with richer metadata on top:

{
  "company_name": "Whitbread PLC",
  "company_name_citations": ["/page/0/Text/3", "/page/2/Table/1"],
  "company_name_meta": {
    "extraction_status": "EXTRACTED",
    "reasoning": "The company name 'Whitbread PLC' appears in the document header on the cover page (/page/0/Text/3) and is confirmed in the directors' report (/page/2/Table/1).",
    "citations": ["/page/0/Text/3", "/page/2/Table/1"],
    "verification": {
      "status": "PASS",
      "feedback": "The company name 'Whitbread PLC' is printed on the cover page (/page/0/Text/3) and confirmed in the directors' report. No conflicting name appears in the document."
    }
  }
}

The _citations key is shared with balanced mode — if you switch between modes, citation-consuming code continues to work. The _meta key is accurate-mode-only and contains the full audit trail.
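
Since the response is a flat object, downstream code often wants each value grouped with its siblings. The sketch below shows one way to do that; `group_fields` is a hypothetical helper, and it assumes no schema field has a name that itself ends in `_citations` or `_meta`.

```python
SUFFIXES = ("_citations", "_meta")


def group_fields(extracted: dict) -> dict:
    """Collect each value together with its _citations and _meta siblings."""
    grouped = {}
    for key, value in extracted.items():
        if key.endswith(SUFFIXES):
            continue  # sibling keys are attached to their base field below
        grouped[key] = {
            "value": value,
            "citations": extracted.get(f"{key}_citations", []),
            "meta": extracted.get(f"{key}_meta"),  # None in balanced mode
        }
    return grouped


# Abbreviated version of the response shown above.
sample = {
    "company_name": "Whitbread PLC",
    "company_name_citations": ["/page/0/Text/3", "/page/2/Table/1"],
    "company_name_meta": {
        "extraction_status": "EXTRACTED",
        "verification": {"status": "PASS"},
    },
}

grouped = group_fields(sample)
print(grouped["company_name"]["value"])
print(grouped["company_name"]["meta"]["verification"]["status"])
```

Because `meta` falls back to `None`, the same helper should also work on balanced-mode responses, which carry citations but no `_meta` key.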

Field Metadata

Each _meta object contains:

Field	Description
`extraction_status`	How the value was produced: `EXTRACTED` (value found in or derived from the document) or `NOT_RESOLVABLE` (the document doesn’t contain or imply this information)
`reasoning`	Audit-ready prose explaining how the value was produced, with block ID citations
`citations`	Block IDs from the source document that support the value
`verification`	Independent verification result with `status` and `feedback`
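
The block IDs in the example response, such as `/page/0/Text/3`, appear to follow the pattern `/page/<page index>/<block type>/<block index>`. The sketch below splits an ID into those parts so a review tool could jump to the right page. The pattern is inferred from the example only and is an assumption, as are the `BlockRef` and `parse_block_id` names.

```python
from typing import NamedTuple


class BlockRef(NamedTuple):
    page: int        # page index as it appears in the ID
    block_type: str  # e.g. "Text" or "Table"
    index: int       # block index within the page


def parse_block_id(block_id: str) -> BlockRef:
    """Split an ID like '/page/0/Text/3' into its parts (assumed format)."""
    parts = block_id.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "page":
        raise ValueError(f"unexpected block ID format: {block_id!r}")
    return BlockRef(page=int(parts[1]), block_type=parts[2], index=int(parts[3]))


ref = parse_block_id("/page/2/Table/1")
print(ref.page, ref.block_type, ref.index)
```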

Extraction Status

Status	Meaning	Field value
`EXTRACTED`	The value was found in or derived from the document	The extracted value
`NOT_RESOLVABLE`	The document does not contain or imply this value	`null`
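
A common use of the extraction status is to check that the schema's required fields were actually resolved. The sketch below is a minimal example; `unresolved_required` is a hypothetical helper, and the sample response is made up for illustration.

```python
def unresolved_required(extracted: dict, schema: dict) -> list[str]:
    """Return required schema fields that are null or marked NOT_RESOLVABLE."""
    missing = []
    for name in schema.get("required", []):
        meta = extracted.get(f"{name}_meta") or {}
        not_resolvable = meta.get("extraction_status") == "NOT_RESOLVABLE"
        if not_resolvable or extracted.get(name) is None:
            missing.append(name)
    return missing


schema = {"required": ["company_name", "fiscal_year_end"]}

# Illustrative response: the fiscal year end could not be resolved.
extracted = {
    "company_name": "Whitbread PLC",
    "company_name_meta": {"extraction_status": "EXTRACTED"},
    "fiscal_year_end": None,
    "fiscal_year_end_meta": {"extraction_status": "NOT_RESOLVABLE"},
}

print(unresolved_required(extracted, schema))
```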

Verification Status

Status	Meaning
`PASS`	The value and citations were independently confirmed against the source document
`FAIL_UNRESOLVABLE`	The document does not support a value for this field
`FAIL_FIX`	The value was flagged as incorrect during verification — the document supports a different value
`FAIL_CITATIONS`	The value is correct but the citations are wrong or insufficient
`ITEMS_MISSING`	(List fields only) The document contains entries that are not present in the extraction

In practice, most fields will be PASS or FAIL_UNRESOLVABLE after verification. The other statuses indicate cases where the verifier flagged an issue that could not be fully resolved automatically.
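
Each verification status suggests a different follow-up, so one option is to map statuses to actions explicitly. In the sketch below the status names come from the table above, while the action names and the `action_for` helper are hypothetical.

```python
# Status names are from the table above; action names are illustrative.
ACTIONS = {
    "PASS": "accept",
    "FAIL_UNRESOLVABLE": "treat_as_missing",
    "FAIL_FIX": "review_value",
    "FAIL_CITATIONS": "review_citations",
    "ITEMS_MISSING": "review_list_completeness",
}


def action_for(meta: dict) -> str:
    """Pick a follow-up action from a field's _meta object."""
    status = (meta.get("verification") or {}).get("status")
    # Unknown or absent statuses go to review rather than being accepted.
    return ACTIONS.get(status, "review_unknown_status")


print(action_for({"verification": {"status": "FAIL_CITATIONS"}}))
```

Sending unknown statuses to review keeps the gate conservative if new status values are introduced later.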

Building Workflows with Verification Metadata

The per-field metadata enables automated quality gates:

import json

# `result` is the response returned by client.extract() in the Quick Start
extracted = json.loads(result.extraction_schema_json)

# Separate fields by verification status
auto_approved = []
needs_review = []

# Walk all fields and check their _meta
for key, value in extracted.items():
    if key.endswith("_meta"):
        field_name = key.removesuffix("_meta")
        meta = value
        verification = meta.get("verification", {})

        if verification.get("status") == "PASS":
            auto_approved.append(field_name)
        else:
            needs_review.append({
                "field": field_name,
                "extraction_status": meta.get("extraction_status"),
                "reasoning": meta.get("reasoning"),
                "verification_feedback": verification.get("feedback"),
            })

print(f"Auto-approved: {len(auto_approved)} fields")
print(f"Needs review: {len(needs_review)} fields")

# Route to human review queue
for item in needs_review:
    print(f"  {item['field']}: {item['extraction_status']}")
    # reasoning may be missing, so fall back to an empty string before slicing
    print(f"    Reason: {(item['reasoning'] or '')[:100]}...")

Common Workflow Patterns

Auto-approve when all fields have verification.status == "PASS" — no human review needed
Flag for review when any field is NOT_RESOLVABLE or has a FAIL_* verification status — the document may be missing information or the extraction needs a human check
Show citations to reviewers so they can verify in seconds — each field links back to specific blocks in the document
Use reasoning as an audit trail — for compliance workflows, the per-field reasoning documents exactly how each value was produced, with block-level citations back to the source document
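
The first two patterns above can be combined into a single document-level gate. The sketch below is one possible implementation, not the only one; `document_decision` and its return values are hypothetical, and it sends documents with no `_meta` keys to review on the assumption that they have no audit trail to rely on.

```python
def document_decision(extracted: dict) -> str:
    """Return 'auto_approve' only if every field was resolved and verified."""
    metas = [
        value
        for key, value in extracted.items()
        if key.endswith("_meta") and isinstance(value, dict)
    ]
    if not metas:
        return "needs_review"  # no audit trail, e.g. a balanced-mode response
    for meta in metas:
        if meta.get("extraction_status") == "NOT_RESOLVABLE":
            return "needs_review"
        if (meta.get("verification") or {}).get("status") != "PASS":
            return "needs_review"
    return "auto_approve"


# Illustrative responses, reduced to the keys the gate inspects.
clean = {
    "company_name": "Whitbread PLC",
    "company_name_meta": {
        "extraction_status": "EXTRACTED",
        "verification": {"status": "PASS"},
    },
}
flagged = {
    **clean,
    "total_revenue": 1.0,
    "total_revenue_meta": {
        "extraction_status": "EXTRACTED",
        "verification": {"status": "FAIL_FIX"},
    },
}

print(document_decision(clean))
print(document_decision(flagged))
```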

Next Steps

Structured Extraction Overview

Schema format, response structure, and extraction tips

Confidence Scoring

Additional per-field confidence scores (works with both modes)

Saved Schemas

Save and version schemas for reuse across requests

Handling Long Documents

Tips for extracting from 100+ page documents
