Basic Usage
Extract Options
Use ExtractOptions to configure extraction behavior:
| Option | Type | Default | Description |
|---|---|---|---|
| page_schema | str | Required | JSON schema defining the fields to extract |
| checkpoint_id | str | None | Checkpoint ID from a previous convert() call |
| mode | str | "fast" | Processing mode: "fast", "balanced", "accurate" |
| output_format | str | "markdown" | Output format: "markdown", "html", "json", "chunks" |
| save_checkpoint | bool | False | Save checkpoint for reuse with subsequent calls |
| max_pages | int | None | Maximum number of pages to process |
| page_range | str | None | Specific pages to process (e.g., "0-5,10"). For spreadsheets, filters by sheet index. |
| skip_cache | bool | False | Skip cached results, force reprocessing |
| webhook_url | str | None | Webhook URL for completion notification |
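
Because page_schema is typed as str, a schema written as a Python dict has to be serialized before it is passed in. The sketch below shows one way to do that with the standard library. The invoice fields are hypothetical and chosen only for illustration, and collecting the options as keyword arguments is an assumption about how ExtractOptions is constructed, not something confirmed by the table.

```python
import json

# Hypothetical invoice fields, used only to illustrate the schema shape.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice identifier"},
        "total_amount": {"type": "number", "description": "Total amount due"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
            },
        },
    },
    "required": ["invoice_number", "total_amount"],
}

# page_schema is listed as str in the table, so serialize the dict to JSON.
page_schema = json.dumps(invoice_schema)

# Option names and values come from the table above. Passing them as
# keyword arguments to ExtractOptions is an assumption about its constructor.
option_kwargs = {
    "page_schema": page_schema,
    "mode": "balanced",
    "page_range": "0-5,10",
}
```

Keeping the schema as a dict until the last moment makes it easier to validate or reuse before serializing it.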
Checkpoint Reuse
Use checkpoints to avoid re-parsing a document when running extraction after conversion. First convert with save_checkpoint=True, then extract using the returned checkpoint_id:
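
The flow can be illustrated with a small stand-in. StubClient below is hypothetical and is not the SDK's client: it only models the bookkeeping, so the method signatures, the dict return values, and the file name are all assumptions. What it shows is that the document is parsed once during conversion and the extraction step reuses that work through checkpoint_id.

```python
import uuid


class StubClient:
    """Hypothetical stand-in for the real client; models checkpoints only."""

    def __init__(self):
        self.parse_count = 0    # how many times a document was parsed
        self._checkpoints = {}  # checkpoint_id -> parsed document

    def _parse(self, path):
        self.parse_count += 1
        return f"parsed:{path}"

    def convert(self, path, save_checkpoint=False):
        parsed = self._parse(path)
        checkpoint_id = None
        if save_checkpoint:
            # Store the parsed document so later calls can reuse it.
            checkpoint_id = uuid.uuid4().hex
            self._checkpoints[checkpoint_id] = parsed
        return {"markdown": parsed, "checkpoint_id": checkpoint_id}

    def extract(self, path, page_schema, checkpoint_id=None):
        # Reuse the checkpoint when one is given; otherwise parse again.
        parsed = self._checkpoints.get(checkpoint_id)
        if parsed is None:
            parsed = self._parse(path)
        return {"source": parsed, "schema": page_schema}


client = StubClient()

# Step 1: convert and ask for a checkpoint.
converted = client.convert("report.pdf", save_checkpoint=True)

# Step 2: extract, passing the checkpoint_id returned by convert().
result = client.extract(
    "report.pdf",
    page_schema='{"type": "object"}',
    checkpoint_id=converted["checkpoint_id"],
)
```

In this stand-in, omitting checkpoint_id from the extract call causes a second parse, which is the cost the checkpoint is meant to avoid.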
Extraction Result
The result object contains the extracted data alongside standard conversion fields:
Async Usage
Next Steps
Extraction Recipe
Learn more about structured extraction patterns and best practices.
Document Segmentation
Segment documents into logical sections using schemas.
Document Conversion
Convert documents to Markdown, HTML, JSON, or chunks.
Batch Processing
Process multiple documents efficiently in parallel.