Accurate mode runs a multi-pass extraction pipeline with independent verification. Every extracted field includes an audit trail: where the value came from, how it was derived, and whether an independent check confirmed it.

Before you begin, make sure you have:
  1. A Datalab account with an API key (new accounts include $5 in free credits)
  2. Python 3.10+ installed
  3. The Datalab SDK: pip install datalab-python-sdk
  4. Your DATALAB_API_KEY environment variable set
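
The prerequisites above can be checked before running any extraction. The following is a minimal sketch using only the standard library; `check_prerequisites` is a hypothetical helper name, not part of the Datalab SDK.

```python
import os
import sys


def check_prerequisites(env=None, version=None):
    """Return a list of problems; an empty list means the basics are in place.

    `env` and `version` are injectable so the check can be exercised
    without touching the real environment.
    """
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else version

    problems = []
    if version < (3, 10):
        problems.append(f"Python 3.10+ required, found {version[0]}.{version[1]}")
    if not env.get("DATALAB_API_KEY"):
        problems.append("DATALAB_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(f"Missing prerequisite: {problem}")
```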

When to Use Accurate vs Balanced

| | Balanced (default) | Accurate |
|---|---|---|
| Price | $6 / 1K pages | $25 / 1K pages |
| Latency | Fast | Slower: trades speed for accuracy via multi-pass verification |
| Per-field citations | Yes | Yes |
| Extraction status | No | Yes (EXTRACTED / NOT_RESOLVABLE) |
| Per-field reasoning | No | Yes |
| Independent verification | No | Yes (PASS, FAIL_*, or ITEMS_MISSING) |
| Best for | High-volume workflows: invoices, forms, bank statements | Compliance, financial, legal, and medical workflows where every field needs an audit trail |

Use balanced when speed and cost matter most. Use accurate when you need to trust every field and want metadata to power downstream decisions.
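
To make the cost trade-off concrete, here is a small sketch that estimates spend from a page count. It assumes the per-1K-page prices listed in the comparison above and simple linear pricing; `estimate_cost` is a hypothetical helper, and actual billing may differ.

```python
# Prices per 1,000 pages, taken from the comparison above (assumed linear).
PRICE_PER_1K_PAGES = {"balanced": 6.00, "accurate": 25.00}


def estimate_cost(pages: int, extraction_mode: str = "balanced") -> float:
    """Estimate the extraction cost in USD for a given number of pages."""
    if extraction_mode not in PRICE_PER_1K_PAGES:
        raise ValueError(f"Unknown extraction_mode: {extraction_mode!r}")
    return pages / 1000 * PRICE_PER_1K_PAGES[extraction_mode]


if __name__ == "__main__":
    for mode in ("balanced", "accurate"):
        print(f"{mode}: ${estimate_cost(10_000, mode):.2f} for 10,000 pages")
```

At 10,000 pages this works out to $60 in balanced mode and $250 in accurate mode, so accurate mode is usually worth reserving for documents where the audit trail matters.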

Quick Start

import json
from datalab_sdk import DatalabClient, ExtractOptions

client = DatalabClient()

schema = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string", "description": "Full legal name of the company"},
        "fiscal_year_end": {"type": "string", "description": "End date of the fiscal year (YYYY-MM-DD)"},
        "total_revenue": {"type": "number", "description": "Total revenue in the reporting currency"},
        "auditor_name": {"type": "string", "description": "Name of the external audit firm"}
    },
    "required": ["company_name", "fiscal_year_end"]
}

options = ExtractOptions(
    page_schema=json.dumps(schema),
    extraction_mode="accurate"
)

result = client.extract("annual_report.pdf", options=options)
extracted = json.loads(result.extraction_schema_json)

# Each field comes with citations and metadata
print(f"Company: {extracted['company_name']}")
print(f"Citations: {extracted['company_name_citations']}")
print(f"Status: {extracted['company_name_meta']['extraction_status']}")
print(f"Verified: {extracted['company_name_meta']['verification']['status']}")

extraction_mode controls the extraction pipeline (balanced or accurate). This is separate from mode, which controls the document parsing stage (fast, balanced, or accurate). You can combine them independently; for example, mode="fast" with extraction_mode="accurate".
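
Because the two settings accept different values, it can help to validate them together before building the options. The sketch below is an illustration only: `build_mode_kwargs` is a hypothetical helper, and passing `mode` to `ExtractOptions` as a keyword is an assumption based on the description above rather than a confirmed signature.

```python
# Allowed values as described above.
PARSE_MODES = {"fast", "balanced", "accurate"}   # values for `mode`
EXTRACTION_MODES = {"balanced", "accurate"}      # values for `extraction_mode`


def build_mode_kwargs(mode: str, extraction_mode: str) -> dict:
    """Validate both settings and return them as keyword arguments."""
    if mode not in PARSE_MODES:
        raise ValueError(f"mode must be one of {sorted(PARSE_MODES)}, got {mode!r}")
    if extraction_mode not in EXTRACTION_MODES:
        raise ValueError(
            f"extraction_mode must be one of {sorted(EXTRACTION_MODES)}, got {extraction_mode!r}"
        )
    return {"mode": mode, "extraction_mode": extraction_mode}


kwargs = build_mode_kwargs(mode="fast", extraction_mode="accurate")

# Assumption: ExtractOptions accepts `mode` alongside `extraction_mode`.
# options = ExtractOptions(page_schema=json.dumps(schema), **kwargs)
```

Note that "fast" is valid for `mode` but not for `extraction_mode`, which is the mistake this check is meant to catch.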

Response Format

In accurate mode, each extracted field is returned as three keys: the value itself, a _citations sibling, and a _meta sibling. The _citations sibling uses the same format as balanced mode for compatibility; accurate mode adds _meta with richer metadata on top:
{
  "company_name": "Whitbread PLC",
  "company_name_citations": ["/page/0/Text/3", "/page/2/Table/1"],
  "company_name_meta": {
    "extraction_status": "EXTRACTED",
    "reasoning": "The company name 'Whitbread PLC' appears in the document header on the cover page (/page/0/Text/3) and is confirmed in the directors' report (/page/2/Table/1).",
    "citations": ["/page/0/Text/3", "/page/2/Table/1"],
    "verification": {
      "status": "PASS",
      "feedback": "The company name 'Whitbread PLC' is printed on the cover page (/page/0/Text/3) and confirmed in the directors' report. No conflicting name appears in the document."
    }
  }
}

The _citations key is shared with balanced mode, so citation-consuming code continues to work if you switch between modes. The _meta key is accurate-mode-only and contains the full audit trail.
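
Since the response is flat, downstream code often wants one record per field. The sketch below folds the value, `_citations`, and `_meta` keys together, driven by the field names from your schema. `group_fields` is a hypothetical helper, and the sample is a trimmed copy of the response shown above.

```python
def group_fields(extracted: dict, field_names: list[str]) -> dict:
    """Fold `<field>`, `<field>_citations`, and `<field>_meta` into one record per field."""
    grouped = {}
    for name in field_names:
        grouped[name] = {
            "value": extracted.get(name),
            "citations": extracted.get(f"{name}_citations") or [],
            # `_meta` is accurate-mode-only, so this is None for balanced responses
            "meta": extracted.get(f"{name}_meta"),
        }
    return grouped


# Trimmed copy of the example response
sample = {
    "company_name": "Whitbread PLC",
    "company_name_citations": ["/page/0/Text/3", "/page/2/Table/1"],
    "company_name_meta": {
        "extraction_status": "EXTRACTED",
        "citations": ["/page/0/Text/3", "/page/2/Table/1"],
        "verification": {"status": "PASS"},
    },
}

fields = group_fields(sample, ["company_name"])
print(fields["company_name"]["value"])
print(fields["company_name"]["meta"]["verification"]["status"])
```

Driving the loop from the schema's property names, rather than from key suffixes, avoids misreading a schema field whose own name happens to end in `_meta` or `_citations`.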

Field Metadata

Each _meta object contains:

| Field | Description |
|---|---|
| extraction_status | How the value was produced: EXTRACTED (value found in or derived from the document) or NOT_RESOLVABLE (the document doesn't contain this information) |
| reasoning | Audit-ready prose explaining how the value was produced, with block ID citations |
| citations | Block IDs from the source document that support the value |
| verification | Independent verification result with status and feedback |
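
The block IDs in `citations` can be split into their parts when you need to show a reviewer which page to look at. This sketch assumes every ID follows the `/page/<page index>/<block type>/<block index>` shape seen in the example response (with `/page/0` being the first page); that shape is inferred from the example, not from a documented grammar, and `parse_block_id` is a hypothetical helper.

```python
from typing import NamedTuple


class BlockRef(NamedTuple):
    page: int        # page index as it appears in the ID
    block_type: str  # e.g. "Text" or "Table"
    index: int       # block index within the page


def parse_block_id(block_id: str) -> BlockRef:
    """Parse an ID such as '/page/0/Text/3' into its components."""
    parts = block_id.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "page":
        raise ValueError(f"Unexpected block ID format: {block_id!r}")
    return BlockRef(page=int(parts[1]), block_type=parts[2], index=int(parts[3]))


def pages_cited(citations: list[str]) -> list[int]:
    """Return the sorted, de-duplicated page indexes referenced by a citation list."""
    return sorted({parse_block_id(c).page for c in citations})


print(pages_cited(["/page/0/Text/3", "/page/2/Table/1"]))
```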

Extraction Status

| Status | Meaning | Value |
|---|---|---|
| EXTRACTED | The value was found in or derived from the document | The extracted value |
| NOT_RESOLVABLE | The document does not contain or imply this value | null |
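
A common first check is whether any field the schema marks as required came back as NOT_RESOLVABLE. The sketch below shows one way to do that; `unresolved_fields` and `missing_required` are hypothetical helper names, and the sample data is illustrative rather than real output.

```python
def unresolved_fields(extracted: dict, field_names: list[str]) -> list[str]:
    """Return the fields whose extraction_status is NOT_RESOLVABLE."""
    result = []
    for name in field_names:
        meta = extracted.get(f"{name}_meta") or {}
        if meta.get("extraction_status") == "NOT_RESOLVABLE":
            result.append(name)
    return result


def missing_required(extracted: dict, schema: dict) -> list[str]:
    """Return required schema fields that the document could not resolve."""
    return unresolved_fields(extracted, schema.get("required", []))


# Illustrative data only
schema = {"required": ["company_name", "fiscal_year_end"]}
extracted = {
    "company_name": "Whitbread PLC",
    "company_name_meta": {"extraction_status": "EXTRACTED"},
    "fiscal_year_end": None,
    "fiscal_year_end_meta": {"extraction_status": "NOT_RESOLVABLE"},
}

print(missing_required(extracted, schema))
```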

Verification Status

| Status | Meaning |
|---|---|
| PASS | The value and citations were independently confirmed against the source document |
| FAIL_UNRESOLVABLE | The document does not support a value for this field |
| FAIL_FIX | The value was flagged as incorrect during verification: the document supports a different value |
| FAIL_CITATIONS | The value is correct but the citations are wrong or insufficient |
| ITEMS_MISSING | (List fields only) The document contains entries that are not present in the extraction |

In practice, most fields will be PASS or FAIL_UNRESOLVABLE after verification. The other statuses indicate cases where the verifier flagged an issue that could not be fully resolved automatically.
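
Each verification status suggests a different follow-up. The mapping below is one possible routing policy, sketched for illustration: the action names and the `review_action` helper are hypothetical, and the right policy depends on your workflow.

```python
# Hypothetical follow-up actions keyed by verification status.
REVIEW_ACTIONS = {
    "PASS": "accept",
    "FAIL_UNRESOLVABLE": "confirm_value_is_absent",
    "FAIL_FIX": "correct_value",
    "FAIL_CITATIONS": "fix_citations",
    "ITEMS_MISSING": "complete_list",
}


def review_action(meta: dict) -> str:
    """Map a field's _meta object to a follow-up action.

    Unknown or missing statuses fall back to manual review so that
    nothing is accepted by accident.
    """
    status = (meta.get("verification") or {}).get("status")
    return REVIEW_ACTIONS.get(status, "manual_review")


print(review_action({"verification": {"status": "FAIL_CITATIONS"}}))
```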

Building Workflows with Verification Metadata

The per-field metadata enables automated quality gates:
import json

# `result` is the value returned by client.extract(...) in the Quick Start
extracted = json.loads(result.extraction_schema_json)

# Separate fields by verification status
auto_approved = []
needs_review = []

# Walk all fields and check their _meta
for key, value in extracted.items():
    if key.endswith("_meta"):
        field_name = key.removesuffix("_meta")
        meta = value
        verification = meta.get("verification", {})

        if verification.get("status") == "PASS":
            auto_approved.append(field_name)
        else:
            needs_review.append({
                "field": field_name,
                "extraction_status": meta.get("extraction_status"),
                "reasoning": meta.get("reasoning"),
                "verification_feedback": verification.get("feedback"),
            })

print(f"Auto-approved: {len(auto_approved)} fields")
print(f"Needs review: {len(needs_review)} fields")

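# Route to human review queue
for item in needs_review:
    print(f"  {item['field']}: {item['extraction_status']}")
    # reasoning may be absent from _meta, so fall back to an empty string before slicing
    reasoning = item["reasoning"] or ""
    print(f"    Reason: {reasoning[:100]}...")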

Common Workflow Patterns

  • Auto-approve when all fields have verification.status == "PASS": no human review is needed
  • Flag for review when any field is NOT_RESOLVABLE or has a non-PASS verification status (FAIL_* or ITEMS_MISSING): the document may be missing information or the extraction needs a human check
  • Show citations to reviewers so they can verify quickly: each field links back to specific blocks in the document
  • Use reasoning as an audit trail: for compliance workflows, the per-field reasoning documents how each value was produced, with block-level citations back to the source document
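
The first two patterns can be combined into a single document-level decision. The sketch below is one way to do it, not a prescribed implementation: `document_decision` is a hypothetical helper, it treats any non-PASS status as needing review, and it conservatively sends documents with no `_meta` keys (for example, balanced-mode responses) to review as well.

```python
def document_decision(extracted: dict) -> tuple[str, list[dict]]:
    """Return ("auto_approve", []) or ("review", flagged_fields) for one document."""
    flagged = []
    seen_meta = False

    for key, meta in extracted.items():
        if not key.endswith("_meta") or not isinstance(meta, dict):
            continue
        seen_meta = True
        status = (meta.get("verification") or {}).get("status")
        not_resolvable = meta.get("extraction_status") == "NOT_RESOLVABLE"
        if status != "PASS" or not_resolvable:
            # Keep citations and reasoning so a reviewer can check the source quickly
            flagged.append({
                "field": key.removesuffix("_meta"),
                "verification_status": status,
                "citations": meta.get("citations") or [],
                "reasoning": meta.get("reasoning") or "",
            })

    if seen_meta and not flagged:
        return "auto_approve", []
    return "review", flagged


# Illustrative data only
decision, flagged = document_decision({
    "company_name": "Whitbread PLC",
    "company_name_meta": {
        "extraction_status": "EXTRACTED",
        "verification": {"status": "PASS"},
    },
    "total_revenue": 1.0,
    "total_revenue_meta": {
        "extraction_status": "EXTRACTED",
        "citations": ["/page/2/Table/1"],
        "verification": {"status": "FAIL_FIX"},
    },
})
print(decision, [item["field"] for item in flagged])
```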

Next Steps

  • Structured Extraction Overview: schema format, response structure, and extraction tips
  • Confidence Scoring: additional per-field confidence scores (works with both modes)
  • Saved Schemas: save and version schemas for reuse across requests
  • Handling Long Documents: tips for extracting from 100+ page documents