- A Datalab account with an API key (new accounts include $5 in free credits)
- Python 3.10+ installed
- The Datalab SDK: `pip install datalab-python-sdk`
- Your `DATALAB_API_KEY` environment variable set
Building for production? Use Pipelines to chain processors, version your configuration, and deploy with a single API call.
Quick Start
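The quick-start code block did not survive extraction; below is a minimal sketch of what a first extraction call might look like. The client class name `DatalabClient`, the import path, and the `extract` method are assumptions — only `ExtractOptions`, `page_schema`, and `extraction_schema_json` are named elsewhere on this page.

```python
import os

# A simple JSON Schema describing the fields to extract.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": "string",
            "description": "The invoice identifier",
        },
        "total_amount": {
            "type": "number",
            "description": "Total amount due",
        },
    },
}

# Hypothetical client usage -- class and method names are assumptions,
# not confirmed SDK API. Guarded so the sketch runs without a key.
if os.environ.get("DATALAB_API_KEY"):
    from datalab_sdk import DatalabClient, ExtractOptions  # assumed import path

    client = DatalabClient()
    result = client.extract(
        "invoice.pdf",
        options=ExtractOptions(page_schema=invoice_schema),
    )
    print(result.extraction_schema_json)
```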
Schema Format
Use JSON Schema format to define what you want to extract.

Tips for Better Extraction
- Use descriptive field names - `invoice_number` is clearer than `id`
- Add descriptions - The `description` field helps the model understand context
- Specify types correctly - Use `number` for numeric values, `string` for text
- Use arrays for repeating data - Line items, table rows, etc.
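Putting these tips together, a schema for an invoice with line items might look like the following. The schema is illustrative, not tied to any particular document:

```python
page_schema = {
    "type": "object",
    "properties": {
        # Descriptive names: "invoice_number" is clearer than "id".
        "invoice_number": {
            "type": "string",
            "description": "The unique identifier printed on the invoice",
        },
        # Correct types: "number" for numeric values.
        "total_amount": {
            "type": "number",
            "description": "Total amount due, without the currency symbol",
        },
        # Arrays for repeating data such as line items.
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string", "description": "What was billed"},
                    "quantity": {"type": "number", "description": "Units billed"},
                    "unit_price": {"type": "number", "description": "Price per unit"},
                },
            },
        },
    },
}
```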
Response
The extracted data is returned in `extraction_schema_json`:
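The response example was lost in extraction; the shape below is a hypothetical illustration. Field names follow the schema you supplied, and any envelope keys other than `extraction_schema_json` are assumptions:

```python
import json

# Hypothetical response body -- "status" is an assumed envelope key;
# "extraction_schema_json" is the field named in this page.
response_body = """
{
  "status": "complete",
  "extraction_schema_json": {
    "invoice_number": "INV-2024-001",
    "total_amount": 1250.00
  }
}
"""

data = json.loads(response_body)
extracted = data["extraction_schema_json"]
print(extracted["invoice_number"])  # -> INV-2024-001
```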
Citation Tracking
Each extracted field includes citations to the source blocks. Use the citations in the JSON output to trace extracted values back to the source document.
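The citation payload itself was not preserved here; the structure below is a hypothetical sketch of how a field's value might be paired with the source block IDs that support it. The block-ID format is an assumption:

```python
# Hypothetical citation structure -- the real field layout may differ.
field_with_citations = {
    "value": "INV-2024-001",
    "citations": ["/page/0/Text/3"],  # assumed block-ID format
}

# Tracing a value back to its source blocks:
for block_id in field_with_citations["citations"]:
    print(f"{field_with_citations['value']} supported by block {block_id}")
```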
Schema Examples
Financial Document
Scientific Paper
Contract
Using Checkpoints
If you already converted a document with `save_checkpoint=True` using the Convert API, pass the `checkpoint_id` to `ExtractOptions` to skip re-parsing. This saves time and cost when running extraction on a previously converted document.
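As a sketch, reusing a checkpoint might look like the following. `ExtractOptions` and `checkpoint_id` come from this page; the client class, the import path, and the `"ckpt_abc123"` ID are assumptions:

```python
import os

# Request parameters for extracting against an existing checkpoint.
# The schema is illustrative; "ckpt_abc123" is a hypothetical ID from
# an earlier convert call made with save_checkpoint=True.
extract_params = {
    "page_schema": {
        "type": "object",
        "properties": {"title": {"type": "string"}},
    },
    "checkpoint_id": "ckpt_abc123",
}

if os.environ.get("DATALAB_API_KEY"):
    from datalab_sdk import DatalabClient, ExtractOptions  # assumed import path

    client = DatalabClient()
    result = client.extract("report.pdf", options=ExtractOptions(**extract_params))
```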
The extract request accepts the following parameters: `file`, `page_schema` or `schema_id` (one is required), `schema_version`, `mode`, `max_pages`, `page_range`, `save_checkpoint`, `checkpoint_id`, and `webhook_url`.
Using Saved Schemas
Instead of passing `page_schema` inline, you can save schemas to Datalab and reference them by ID. This avoids repeating the schema in every request and enables versioning.
Pass `schema_version` to pin to a specific schema version; omit it to always use the latest. See Saved Schemas for the full CRUD API reference.
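A sketch of referencing a saved schema follows. `schema_id` and `schema_version` come from this page; the client class, the import path, and the `"sch_invoices_v2"` ID are assumptions:

```python
import os

# Reference a saved schema by ID instead of sending page_schema inline.
# "sch_invoices_v2" is a hypothetical schema ID.
extract_params = {
    "schema_id": "sch_invoices_v2",
    "schema_version": 3,  # pin a version; omit to use the latest
}

if os.environ.get("DATALAB_API_KEY"):
    from datalab_sdk import DatalabClient, ExtractOptions  # assumed import path

    client = DatalabClient()
    result = client.extract("invoice.pdf", options=ExtractOptions(**extract_params))
```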
Confidence Scoring
Extraction scoring is in beta. Scoring is free, and we'd love your feedback: reach out at support@datalab.to.
When you poll the `request_check_url`, the response initially contains just the extracted fields and citations. If you continue polling the same URL, the response will eventually include `_score` fields and an `extraction_score_average` once scoring completes. No extra parameters or endpoints are needed.
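The polling flow described above can be sketched as follows. The helper only inspects the response body; the completion signal (`extraction_score_average` appearing) comes from this page, and the sample payloads are illustrative:

```python
def scoring_complete(body: dict) -> bool:
    """True once the polled response includes the scoring fields."""
    return "extraction_score_average" in body

# Before scoring finishes, only fields and citations are present:
first_poll = {"extraction_schema_json": {"invoice_number": "INV-2024-001"}}

# A later poll of the same request_check_url adds the scores:
later_poll = {
    "extraction_schema_json": {"invoice_number": "INV-2024-001"},
    "extraction_score_average": 4.5,
}

print(scoring_complete(first_poll), scoring_complete(later_poll))  # -> False True
```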
Each `_score` field is a `{"score": int, "reasoning": str}` object explaining what evidence was found or missing.
Score response format
Before scoring completes, `extraction_schema_json` contains only the fields and citations:
Once scoring completes, each field gains a `_score` object, and the top-level response includes an `extraction_score_average`:
The response also reports an `extraction_score_average` (4.5 in this case), averaging all field scores.
Score rubric:
| Score | Meaning |
|---|---|
| 5 | High confidence — clear match with strong citation support |
| 4 | Good confidence — match found with minor ambiguity |
| 3 | Moderate confidence — partial match or uncertain citation |
| 2 | Low confidence — match is inferred or weakly supported |
| 1 | Very low confidence — no clear evidence found |
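One way to use the rubric is to flag fields scoring below a threshold for human review. The `{"score": ..., "reasoning": ...}` shape comes from this page; the `_score` key-suffix convention and the sample data are assumptions for illustration:

```python
def low_confidence_fields(extraction: dict, threshold: int = 3) -> list[str]:
    """Return names of fields whose _score falls below the threshold."""
    flagged = []
    for key, value in extraction.items():
        # Assumes scores sit beside each field under a "<field>_score" key.
        if key.endswith("_score") and value["score"] < threshold:
            flagged.append(key.removesuffix("_score"))
    return flagged

extraction = {
    "invoice_number": "INV-2024-001",
    "invoice_number_score": {"score": 5, "reasoning": "Exact match with citation"},
    "due_date": "2024-03-01",
    "due_date_score": {"score": 2, "reasoning": "Date inferred from context"},
}

print(low_confidence_fields(extraction))  # -> ['due_date']
```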
Auto-Generate Schemas
Don’t want to write schemas by hand? Use the schema generation endpoint to automatically suggest schemas for your document. This requires a checkpoint from a previous conversion.

Using Forge Playground
Create and test schemas visually in Forge Playground:

- Upload a sample document
- Define fields in the visual editor
- Switch to JSON Editor to copy the schema
- Test extraction before deploying
Next Steps
Saved Schemas
Create reusable schemas and reference them by ID — no need to repeat the schema in each request
Confidence Scoring
Score extraction results with per-field confidence ratings
Handling Long Documents
Strategies for extracting from 100+ page documents
Document Segmentation
Split documents by section before extraction