# [DEPRECATED] Marker
Source: https://documentation.datalab.to/api-reference/[deprecated]-marker

https://www.datalab.to/openapi.json post /api/v1/marker

**DEPRECATED**: Use the new endpoints instead:

- `/convert` for document conversion
- `/extract` for structured data extraction
- `/segment` for document segmentation
- `/custom-pipeline` for custom pipeline execution

This endpoint will be removed in a future version.

# [DEPRECATED] OCR
Source: https://documentation.datalab.to/api-reference/[deprecated]-ocr

https://www.datalab.to/openapi.json post /api/v1/ocr

[DEPRECATED] This endpoint is deprecated and will be removed in the future.

This endpoint is used to submit a PDF or image for OCR. The OCR text lines will be returned, along with their bbox and polygon coordinates.

# [DEPRECATED] Table Recognition
Source: https://documentation.datalab.to/api-reference/[deprecated]-table-recognition

https://www.datalab.to/openapi.json post /api/v1/table_rec

[DEPRECATED] This endpoint is deprecated and will be removed in the future.

This endpoint is used to submit a request for table recognition. The detected tables will be returned, as well as their parsed structure.

# Api Health
Source: https://documentation.datalab.to/api-reference/api-health

https://www.datalab.to/openapi.json get /api/v1/user_health

This endpoint is used to check the health of the API, given an API key.

# Add Files To Collection
Source: https://documentation.datalab.to/api-reference/collections/add-files-to-collection

https://www.datalab.to/openapi.json post /api/v1/collections/{collection_id}/files

Link existing uploaded files to a collection.

# Create Collection
Source: https://documentation.datalab.to/api-reference/collections/create-collection

https://www.datalab.to/openapi.json post /api/v1/collections

Create a new collection.
# Delete Collection
Source: https://documentation.datalab.to/api-reference/collections/delete-collection

https://www.datalab.to/openapi.json delete /api/v1/collections/{collection_id}

Soft-delete (archive) collection.

# Get Batch Run
Source: https://documentation.datalab.to/api-reference/collections/get-batch-run

https://www.datalab.to/openapi.json get /api/v1/eval_batch_runs/{run_id}

Get batch run status and progress.

# Get Batch Run Results
Source: https://documentation.datalab.to/api-reference/collections/get-batch-run-results

https://www.datalab.to/openapi.json get /api/v1/eval_batch_runs/{run_id}/results

Get per-file results for a batch run.

# Get Collection
Source: https://documentation.datalab.to/api-reference/collections/get-collection

https://www.datalab.to/openapi.json get /api/v1/collections/{collection_id}

Get collection with file list.

# List Batch Runs
Source: https://documentation.datalab.to/api-reference/collections/list-batch-runs

https://www.datalab.to/openapi.json get /api/v1/eval_batch_runs

List batch runs for the team, optionally filtered by collection, eval rubric, and/or pipeline.

# List Collections
Source: https://documentation.datalab.to/api-reference/collections/list-collections

https://www.datalab.to/openapi.json get /api/v1/collections

List collections for the team.

# Remove File From Collection
Source: https://documentation.datalab.to/api-reference/collections/remove-file-from-collection

https://www.datalab.to/openapi.json delete /api/v1/collections/{collection_id}/files/{uploaded_file_id}

Unlink a file from a collection (does NOT delete the uploaded file).

# Start Batch Run
Source: https://documentation.datalab.to/api-reference/collections/start-batch-run

https://www.datalab.to/openapi.json post /api/v1/eval_batch_runs

Start a batch evaluation run on all files in the collection.
# Update Collection
Source: https://documentation.datalab.to/api-reference/collections/update-collection

https://www.datalab.to/openapi.json put /api/v1/collections/{collection_id}

Update collection name/description.

# Convert Document
Source: https://documentation.datalab.to/api-reference/convert-document

https://www.datalab.to/openapi.json post /api/v1/convert

Convert a PDF, image, or document to markdown, HTML, JSON, or chunks. Use save_checkpoint=true to save parsed state for later /extract or /segment calls.

# Convert Result Check
Source: https://documentation.datalab.to/api-reference/convert-result-check

https://www.datalab.to/openapi.json get /api/v1/convert/{request_id}

Poll this endpoint to check the status of a Convert request and retrieve the converted document.

# Create Document
Source: https://documentation.datalab.to/api-reference/create-document

https://www.datalab.to/openapi.json post /api/v1/create-document

Create a DOCX document from markdown with track changes support. Supports , , and tags.

# Create Document Result Check
Source: https://documentation.datalab.to/api-reference/create-document-result-check

https://www.datalab.to/openapi.json get /api/v1/create-document/{request_id}

Poll this endpoint to check status of a Create Document request and retrieve the generated document.

# Create Workflow
Source: https://documentation.datalab.to/api-reference/create-workflow

https://www.datalab.to/openapi.json post /api/v1/workflows/workflows

Create a new workflow definition.

Example:

```json
{
  "name": "PDF Processing Pipeline",
  "team_id": 1,
  "steps": [
    {
      "step_key": "marker_parse",
      "unique_name": "parse",
      "settings": {"extract_images": true}
    },
    {
      "step_key": "marker_extract",
      "unique_name": "extract",
      "version": "1.0.0",
      "settings": {},
      "depends_on": ["parse"]
    },
    {
      "step_key": "marker_segment",
      "unique_name": "segment",
      "settings": {"method": "auto"},
      "depends_on": ["parse"]
    }
  ]
}
```

This creates a template that can be executed multiple times.

Note:

- version is optional and defaults to the latest active version
- unique_name is required and must be unique within the workflow
- depends_on references other steps by their unique_name

# Archive Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/archive-custom-pipeline

https://www.datalab.to/openapi.json post /api/v1/custom_pipelines/{processor_id}/archive

Archive a custom processor (soft-delete). Available to any team member with pipeline access.

# Check Pipeline Access
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/check-pipeline-access

https://www.datalab.to/openapi.json get /api/v1/custom_pipelines/access

Check if the current user's team has access to custom processors.

Custom processors are generally available: every team has access and can create/iterate. Kept (always-true) for backwards compatibility with deployed frontends and API integrations that still poll this endpoint; creation volume is governed by the per-plan creation allowance, not by an access gate.

# Delete Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/delete-custom-pipeline

https://www.datalab.to/openapi.json delete /api/v1/custom_pipelines/{processor_id}

Permanently delete a custom processor and all its versions. Admin-only.

# Describe Customizer
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/describe-customizer

https://www.datalab.to/openapi.json post /api/v1/custom_pipelines/describe

Conversational endpoint for building a custom processor description. Accepts the chat history, returns the next assistant message. When the system has enough context, includes a proposed_description.
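The Create Workflow notes (unique_name is required and unique within the workflow; depends_on refers to other steps by unique_name) can be checked client-side before a definition is submitted. The following is a minimal sketch of such a check, not part of the Datalab API: the function name `validate_workflow_steps` and the wording of the problem messages are hypothetical, and the server may apply additional rules that this reference does not describe.

```python
from typing import Any


def validate_workflow_steps(steps: list[dict[str, Any]]) -> list[str]:
    """Return a list of problems found in a workflow's steps (empty if none).

    Hypothetical helper: it only encodes the two rules stated in the
    Create Workflow notes, nothing more.
    """
    problems: list[str] = []
    seen: set[str] = set()

    # Rule 1: unique_name is required and must be unique within the workflow.
    for index, step in enumerate(steps):
        name = step.get("unique_name")
        if not name:
            problems.append(f"step {index}: unique_name is required")
        elif name in seen:
            problems.append(f"step {index}: duplicate unique_name {name!r}")
        else:
            seen.add(name)

    # Rule 2: depends_on must reference steps by their unique_name.
    for index, step in enumerate(steps):
        for dep in step.get("depends_on", []):
            if dep not in seen:
                problems.append(f"step {index}: depends_on references unknown step {dep!r}")

    return problems


# The steps from the Create Workflow example pass both checks.
example_steps = [
    {"step_key": "marker_parse", "unique_name": "parse", "settings": {"extract_images": True}},
    {"step_key": "marker_extract", "unique_name": "extract", "settings": {}, "depends_on": ["parse"]},
    {"step_key": "marker_segment", "unique_name": "segment", "settings": {"method": "auto"}, "depends_on": ["parse"]},
]
print(validate_workflow_steps(example_steps))
```

The version field is deliberately not validated here, since the notes say it is optional and defaults to the latest active version.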
# Export Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/export-custom-pipeline

https://www.datalab.to/openapi.json get /api/v1/custom_pipelines/{processor_id}/export

Export a custom processor with all versions. Admin-only.

Note: This endpoint allows admins to export ANY processor across all teams, not just processors belonging to the admin's team.

Returns the full processor record and all version data, suitable for use as training data or re-importing via the seed endpoint.

# Get Creation Allowance
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/get-creation-allowance

https://www.datalab.to/openapi.json get /api/v1/custom_pipelines/creation-allowance

2026 custom-processor CREATION allowance preview for the current team (§5).

The frontend reads this BEFORE creating a NEW processor to show the at-cap block (Free / developer hard cap) or the one-time $5 confirmation (Team).

NOTE: the path is registered BEFORE /custom_pipelines/{lookup_key} so the literal "creation-allowance" segment is matched by this route, not captured as a lookup_key.

# Get Custom Pipeline Status
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/get-custom-pipeline-status

https://www.datalab.to/openapi.json get /api/v1/custom_pipelines/{lookup_key}

Check the status of a custom processor generation request using the request_check_url from the initial submission.

# Get Pipeline Eval Definition
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/get-pipeline-eval-definition

https://www.datalab.to/openapi.json get /api/v1/custom_pipelines/{processor_id}/eval_definition

Get the eval_definition from a custom processor's active version.
# Get Pipelines Using Processor
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/get-pipelines-using-processor

https://www.datalab.to/openapi.json get /api/v1/custom_pipelines/{processor_id}/pipelines

List pipelines (from the Pipeline table) that reference this custom processor in their steps JSON.

# Get Processor Version Detail
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/get-processor-version-detail

https://www.datalab.to/openapi.json get /api/v1/custom_pipelines/{processor_id}/versions/{version}

Get detailed data for a specific processor version, including pipeline_params and eval_definition.

# Iterate Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/iterate-custom-pipeline

https://www.datalab.to/openapi.json post /api/v1/custom_pipelines/{processor_id}/iterate

Iterate on an existing custom processor. Provides feedback to the agent which resumes from the previous session, creating a new version of the processor parameters.

# List Custom Pipelines
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/list-custom-pipelines

https://www.datalab.to/openapi.json get /api/v1/custom_pipelines

List all custom processors for a team. Returns processors ordered by creation date (newest first).

# List Pipeline Versions
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/list-pipeline-versions

https://www.datalab.to/openapi.json get /api/v1/custom_pipelines/{processor_id}/versions

List all versions of a custom processor, ordered by version descending.

# Restore Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/restore-custom-pipeline

https://www.datalab.to/openapi.json post /api/v1/custom_pipelines/{processor_id}/restore

Restore an archived custom processor. Admin-only.
# Seed Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/seed-custom-pipeline

https://www.datalab.to/openapi.json post /api/v1/custom_pipelines/seed

Directly create a completed custom processor from JSON. Admin-only.

Skips the agent entirely -- useful for seeding test data, local development, and populating demos. The processor is immediately usable via POST /api/v1/marker.

USE WITH CAUTION

# Set Active Pipeline Version
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/set-active-pipeline-version

https://www.datalab.to/openapi.json post /api/v1/custom_pipelines/{processor_id}/set_active

Set the active version of a custom processor. Changes the active_version pointer to any existing version.

# Submit Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/submit-custom-pipeline

https://www.datalab.to/openapi.json post /api/v1/custom_pipelines

Submit a custom processor generation request.

# Transfer Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/transfer-custom-pipeline

https://www.datalab.to/openapi.json post /api/v1/custom_pipelines/{processor_id}/transfer

Transfer a custom processor to another team.

This endpoint allows admins to transfer ownership of a custom processor from one team to another. This is useful for:

1. Beta testing: Create and test processors internally, then transfer to customers
2. Sharing: Move successful processor configurations between teams
3. Updating: Push iterated versions to an existing customer processor (via to_processor_id)

Superusers can transfer any processor regardless of team ownership.
# Update Pipeline Eval Definition
Source: https://documentation.datalab.to/api-reference/custom-pipelines-deprecated/update-pipeline-eval-definition

https://www.datalab.to/openapi.json put /api/v1/custom_pipelines/{processor_id}/eval_definition

Update the eval_definition on a custom processor's active version.

# Custom Processor Result Check
Source: https://documentation.datalab.to/api-reference/custom-processor-result-check

https://www.datalab.to/openapi.json get /api/v1/custom-processor/{request_id}

Poll this endpoint to check the status of a Custom Processor request and retrieve the results.

# Archive Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-processors/archive-custom-pipeline

https://www.datalab.to/openapi.json post /api/v1/custom_processors/{processor_id}/archive

Archive a custom processor (soft-delete). Available to any team member with pipeline access.

# Check Pipeline Access
Source: https://documentation.datalab.to/api-reference/custom-processors/check-pipeline-access

https://www.datalab.to/openapi.json get /api/v1/custom_processors/access

Check if the current user's team has access to custom processors.

Custom processors are generally available: every team has access and can create/iterate. Kept (always-true) for backwards compatibility with deployed frontends and API integrations that still poll this endpoint; creation volume is governed by the per-plan creation allowance, not by an access gate.

# Delete Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-processors/delete-custom-pipeline

https://www.datalab.to/openapi.json delete /api/v1/custom_processors/{processor_id}

Permanently delete a custom processor and all its versions. Admin-only.
# Describe Customizer
Source: https://documentation.datalab.to/api-reference/custom-processors/describe-customizer

https://www.datalab.to/openapi.json post /api/v1/custom_processors/describe

Conversational endpoint for building a custom processor description. Accepts the chat history, returns the next assistant message. When the system has enough context, includes a proposed_description.

# Export Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-processors/export-custom-pipeline

https://www.datalab.to/openapi.json get /api/v1/custom_processors/{processor_id}/export

Export a custom processor with all versions. Admin-only.

Note: This endpoint allows admins to export ANY processor across all teams, not just processors belonging to the admin's team.

Returns the full processor record and all version data, suitable for use as training data or re-importing via the seed endpoint.

# Get Creation Allowance
Source: https://documentation.datalab.to/api-reference/custom-processors/get-creation-allowance

https://www.datalab.to/openapi.json get /api/v1/custom_processors/creation-allowance

2026 custom-processor CREATION allowance preview for the current team (§5).

The frontend reads this BEFORE creating a NEW processor to show the at-cap block (Free / developer hard cap) or the one-time $5 confirmation (Team).

NOTE: the path is registered BEFORE /custom_pipelines/{lookup_key} so the literal "creation-allowance" segment is matched by this route, not captured as a lookup_key.

# Get Custom Pipeline Status
Source: https://documentation.datalab.to/api-reference/custom-processors/get-custom-pipeline-status

https://www.datalab.to/openapi.json get /api/v1/custom_processors/{lookup_key}

Check the status of a custom processor generation request using the request_check_url from the initial submission.
# Get Pipeline Eval Definition
Source: https://documentation.datalab.to/api-reference/custom-processors/get-pipeline-eval-definition

https://www.datalab.to/openapi.json get /api/v1/custom_processors/{processor_id}/eval_definition

Get the eval_definition from a custom processor's active version.

# Get Pipelines Using Processor
Source: https://documentation.datalab.to/api-reference/custom-processors/get-pipelines-using-processor

https://www.datalab.to/openapi.json get /api/v1/custom_processors/{processor_id}/pipelines

List pipelines (from the Pipeline table) that reference this custom processor in their steps JSON.

# Get Processor Version Detail
Source: https://documentation.datalab.to/api-reference/custom-processors/get-processor-version-detail

https://www.datalab.to/openapi.json get /api/v1/custom_processors/{processor_id}/versions/{version}

Get detailed data for a specific processor version, including pipeline_params and eval_definition.

# Iterate Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-processors/iterate-custom-pipeline

https://www.datalab.to/openapi.json post /api/v1/custom_processors/{processor_id}/iterate

Iterate on an existing custom processor. Provides feedback to the agent which resumes from the previous session, creating a new version of the processor parameters.

# List Custom Pipelines
Source: https://documentation.datalab.to/api-reference/custom-processors/list-custom-pipelines

https://www.datalab.to/openapi.json get /api/v1/custom_processors

List all custom processors for a team. Returns processors ordered by creation date (newest first).

# List Pipeline Versions
Source: https://documentation.datalab.to/api-reference/custom-processors/list-pipeline-versions

https://www.datalab.to/openapi.json get /api/v1/custom_processors/{processor_id}/versions

List all versions of a custom processor, ordered by version descending.
# Restore Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-processors/restore-custom-pipeline

https://www.datalab.to/openapi.json post /api/v1/custom_processors/{processor_id}/restore

Restore an archived custom processor. Admin-only.

# Seed Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-processors/seed-custom-pipeline

https://www.datalab.to/openapi.json post /api/v1/custom_processors/seed

Directly create a completed custom processor from JSON. Admin-only.

Skips the agent entirely -- useful for seeding test data, local development, and populating demos. The processor is immediately usable via POST /api/v1/marker.

USE WITH CAUTION

# Set Active Pipeline Version
Source: https://documentation.datalab.to/api-reference/custom-processors/set-active-pipeline-version

https://www.datalab.to/openapi.json post /api/v1/custom_processors/{processor_id}/set_active

Set the active version of a custom processor. Changes the active_version pointer to any existing version.

# Submit Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-processors/submit-custom-pipeline

https://www.datalab.to/openapi.json post /api/v1/custom_processors

Submit a custom processor generation request.

# Transfer Custom Pipeline
Source: https://documentation.datalab.to/api-reference/custom-processors/transfer-custom-pipeline

https://www.datalab.to/openapi.json post /api/v1/custom_processors/{processor_id}/transfer

Transfer a custom processor to another team.

This endpoint allows admins to transfer ownership of a custom processor from one team to another. This is useful for:

1. Beta testing: Create and test processors internally, then transfer to customers
2. Sharing: Move successful processor configurations between teams
3. Updating: Push iterated versions to an existing customer processor (via to_processor_id)

Superusers can transfer any processor regardless of team ownership.
# Update Pipeline Eval Definition
Source: https://documentation.datalab.to/api-reference/custom-processors/update-pipeline-eval-definition

https://www.datalab.to/openapi.json put /api/v1/custom_processors/{processor_id}/eval_definition

Update the eval_definition on a custom processor's active version.

# Delete Workflow
Source: https://documentation.datalab.to/api-reference/delete-workflow

https://www.datalab.to/openapi.json delete /api/v1/workflows/workflows/{workflow_id}

Delete a workflow definition.

# Create Eval Rubric
Source: https://documentation.datalab.to/api-reference/eval_rubrics/create-eval-rubric

https://www.datalab.to/openapi.json post /api/v1/eval_rubrics

Create new eval rubric for the team.

# Create From Feedback
Source: https://documentation.datalab.to/api-reference/eval_rubrics/create-from-feedback

https://www.datalab.to/openapi.json post /api/v1/eval_rubrics/from_feedback

Convert user feedback items into structured eval rubric using LLM rewrite.

# Delete Eval Rubric
Source: https://documentation.datalab.to/api-reference/eval_rubrics/delete-eval-rubric

https://www.datalab.to/openapi.json delete /api/v1/eval_rubrics/{rubric_id}

Soft-delete (archive) eval rubric.

# Generate From Feedback
Source: https://documentation.datalab.to/api-reference/eval_rubrics/generate-from-feedback

https://www.datalab.to/openapi.json post /api/v1/eval_rubrics/generate_from_feedback

Generate eval rubric from feedback items using LLM rewrite (no DB save).

# Get Eval Rubric
Source: https://documentation.datalab.to/api-reference/eval_rubrics/get-eval-rubric

https://www.datalab.to/openapi.json get /api/v1/eval_rubrics/{rubric_id}

Get eval rubric by ID.

# Import From Pipeline
Source: https://documentation.datalab.to/api-reference/eval_rubrics/import-from-pipeline

https://www.datalab.to/openapi.json post /api/v1/eval_rubrics/import_from_pipeline

Import eval_definition from a custom pipeline's active version.
# List Eval Rubrics
Source: https://documentation.datalab.to/api-reference/eval_rubrics/list-eval-rubrics

https://www.datalab.to/openapi.json get /api/v1/eval_rubrics

List eval rubrics for the team.

# Update Eval Rubric
Source: https://documentation.datalab.to/api-reference/eval_rubrics/update-eval-rubric

https://www.datalab.to/openapi.json put /api/v1/eval_rubrics/{rubric_id}

Update eval rubric.

# Execute Workflow
Source: https://documentation.datalab.to/api-reference/execute-workflow

https://www.datalab.to/openapi.json post /api/v1/workflows/workflows/{workflow_id}/execute

Execute a workflow definition.

This creates a WorkflowExecution and starts a Temporal workflow that will dynamically load the steps and execute them.

Requires: X-API-Key header for authentication

Body (optional):

    {
      "input_config": {
        "type": "single_file",
        "file_url": "https://example.com/file.pdf"
      }
    }

or

    {
      "input_config": {
        "type": "file_list",
        "file_urls": ["https://example.com/file1.pdf", "https://example.com/file2.pdf"]
      }
    }

# Extract Result Check
Source: https://documentation.datalab.to/api-reference/extract-result-check

https://www.datalab.to/openapi.json get /api/v1/extract/{request_id}

Poll this endpoint to check the status of an Extract request and retrieve the extracted structured data.

# Extract Structured Data
Source: https://documentation.datalab.to/api-reference/extract-structured-data

https://www.datalab.to/openapi.json post /api/v1/extract

Extract structured data from a document using a JSON schema. Provide a file for end-to-end processing, or a checkpoint_id from a previous /convert call to skip re-parsing.
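The two Execute Workflow body shapes (single_file and file_list) can be assembled with a small helper. This is a sketch under stated assumptions: the helper names `build_execute_body` and `build_execute_headers` are hypothetical, sending `Content-Type: application/json` is an assumption, and rejecting an empty list is a choice made by this helper only (the reference marks the body itself as optional).

```python
import json
from typing import Any


def build_execute_body(file_urls: list[str]) -> dict[str, Any]:
    """Build an Execute Workflow request body from one or more file URLs.

    Hypothetical helper: one URL maps to the "single_file" shape, several
    URLs map to the "file_list" shape shown in the Execute Workflow section.
    """
    if not file_urls:
        # Choice of this sketch; the API documents the body as optional.
        raise ValueError("at least one file URL is needed to build input_config")
    if len(file_urls) == 1:
        return {"input_config": {"type": "single_file", "file_url": file_urls[0]}}
    return {"input_config": {"type": "file_list", "file_urls": list(file_urls)}}


def build_execute_headers(api_key: str) -> dict[str, str]:
    """Headers for the request; X-API-Key is required per the reference."""
    # Content-Type is an assumption, not stated in the reference.
    return {"X-API-Key": api_key, "Content-Type": "application/json"}


print(json.dumps(build_execute_body(["https://example.com/file.pdf"]), indent=2))
```

The resulting dictionary and headers can be passed to any HTTP client when posting to /api/v1/workflows/workflows/{workflow_id}/execute.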
# Extraction Schema Generation Result Check
Source: https://documentation.datalab.to/api-reference/extraction-schema-generation-result-check

https://www.datalab.to/openapi.json get /api/v1/marker/extraction/gen_schemas/{request_id}

Poll this endpoint to check status of an Extraction Schema Generation request and retrieve final results.

# Create Extraction Schema
Source: https://documentation.datalab.to/api-reference/extraction_schemas/create-extraction-schema

https://www.datalab.to/openapi.json post /api/v1/extraction_schemas

Create a new extraction schema for the team.

# Delete Extraction Schema
Source: https://documentation.datalab.to/api-reference/extraction_schemas/delete-extraction-schema

https://www.datalab.to/openapi.json delete /api/v1/extraction_schemas/{schema_id}

Soft-delete (archive) extraction schema.

# Get Extraction Schema
Source: https://documentation.datalab.to/api-reference/extraction_schemas/get-extraction-schema

https://www.datalab.to/openapi.json get /api/v1/extraction_schemas/{schema_id}

Get extraction schema by ID.

# List Extraction Schemas
Source: https://documentation.datalab.to/api-reference/extraction_schemas/list-extraction-schemas

https://www.datalab.to/openapi.json get /api/v1/extraction_schemas

List extraction schemas for the team.

# Update Extraction Schema
Source: https://documentation.datalab.to/api-reference/extraction_schemas/update-extraction-schema

https://www.datalab.to/openapi.json put /api/v1/extraction_schemas/{schema_id}

Update extraction schema. Optionally create a new version.

# Confirm Upload
Source: https://documentation.datalab.to/api-reference/files/confirm-upload

https://www.datalab.to/openapi.json get /api/v1/files/{file_id_or_hashid}/confirm

Confirm that a file was successfully uploaded to storage.

Call this endpoint after successfully uploading a file using the presigned URL from /upload. This will verify the file exists, get the actual file size, and mark it as completed.
Accepts either integer file_id (e.g., "4") or hashid (e.g., "npl94jxy"). This makes the file available for use in workflows.

# Delete File
Source: https://documentation.datalab.to/api-reference/files/delete-file

https://www.datalab.to/openapi.json delete /api/v1/files/{file_id}

Delete an uploaded file.

Removes the file from both storage and the database.

# Get File Download Url
Source: https://documentation.datalab.to/api-reference/files/get-file-download-url

https://www.datalab.to/openapi.json get /api/v1/files/{file_id}/download

Generate presigned URL for downloading a file.

The URL is valid for the specified expiry time (default: 1 hour).

Args:

- file_id: File ID
- expires_in: URL expiry time in seconds (default: 3600, max: 86400)

# Get File Metadata
Source: https://documentation.datalab.to/api-reference/files/get-file-metadata

https://www.datalab.to/openapi.json get /api/v1/files/{file_id}

Get metadata for an uploaded file.

Returns file information including size, content type, and upload timestamp.

# List Files
Source: https://documentation.datalab.to/api-reference/files/list-files

https://www.datalab.to/openapi.json get /api/v1/files

List all uploaded files for the team.

Supports pagination with limit and offset parameters.

Args:

- limit: Maximum number of files to return (default: 50, max: 100)
- offset: Number of files to skip (default: 0)

# Request Upload Url
Source: https://documentation.datalab.to/api-reference/files/request-upload-url

https://www.datalab.to/openapi.json post /api/v1/files/upload

Request a presigned upload URL for direct client-side upload to storage.

This is the recommended upload flow:

1. Client calls this endpoint with filename and content_type
2. Backend creates a pending file record and returns presigned PUT URL
3. Client uploads directly to storage using the presigned URL
4. Client calls /confirm to verify upload and get actual file size

# Form Filling
Source: https://documentation.datalab.to/api-reference/form-filling

https://www.datalab.to/openapi.json post /api/v1/fill

Fill PDF or image forms with provided field data. Supports PDFs with and without native form fields.

# Form Filling Result Check
Source: https://documentation.datalab.to/api-reference/form-filling-result-check

https://www.datalab.to/openapi.json get /api/v1/fill/{request_id}

Poll this endpoint to check status of a Form Filling request and retrieve the filled form.

# Generate Extraction Schemas
Source: https://documentation.datalab.to/api-reference/generate-extraction-schemas

https://www.datalab.to/openapi.json post /api/v1/marker/extraction/gen_schemas

For a given file, generate potential extraction schemas.

# Get Execution Status
Source: https://documentation.datalab.to/api-reference/get-execution-status

https://www.datalab.to/openapi.json get /api/v1/workflows/executions/{execution_id}

Get the status and results of a workflow execution.

Returns execution status and step data keyed by unique_name. For completed or failed steps, output data is provided as presigned URLs since outputs can be large/complex.

Users can poll this endpoint until status is COMPLETED or FAILED.

Response:

    {
      "execution_id": 123,
      "workflow_id": 456,
      "status": "IN_PROGRESS" | "COMPLETED" | "FAILED" | "QUEUED" | "PENDING",
      "created": "2025-10-22T10:00:00",
      "updated": "2025-10-22T10:05:00",
      "steps": {
        "parse": {
          "status": "COMPLETED",
          "started_at": "2025-10-22T10:00:00",
          "completed_at": "2025-10-22T10:02:00",
          "output_url": "https://presigned-url-to-output.json"
        },
        "extract": {
          "status": "IN_PROGRESS",
          "started_at": "2025-10-22T10:02:00"
        },
        "segment": {
          "status": "PENDING"
        }
      }
    }

# Get Workflow
Source: https://documentation.datalab.to/api-reference/get-workflow

https://www.datalab.to/openapi.json get /api/v1/workflows/workflows/{workflow_id}

Get workflow definition with all steps.
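The Get Execution Status description says clients can poll until status is COMPLETED or FAILED. A minimal polling sketch follows, written against an injected `fetch_status` callable so it makes no assumption about the HTTP client. The names `poll_execution` and `completed_output_urls`, the default interval, and the attempt limit are all hypothetical choices for illustration; only the status values and the `steps` / `output_url` fields come from the response shape documented in that section.

```python
import time
from typing import Any, Callable

# Terminal states named in the Get Execution Status description.
TERMINAL_STATUSES = {"COMPLETED", "FAILED"}


def poll_execution(
    fetch_status: Callable[[], dict[str, Any]],
    interval_seconds: float = 2.0,  # assumed default, not from the reference
    max_attempts: int = 30,         # assumed default, not from the reference
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Call fetch_status until the execution reaches a terminal status."""
    for attempt in range(max_attempts):
        response = fetch_status()
        if response.get("status") in TERMINAL_STATUSES:
            return response
        # Wait between polls, but not after the final attempt.
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"execution not finished after {max_attempts} polls")


def completed_output_urls(response: dict[str, Any]) -> dict[str, str]:
    """Map each step's unique_name to its presigned output_url, when present."""
    return {
        name: step["output_url"]
        for name, step in response.get("steps", {}).items()
        if "output_url" in step
    }


# Demo with canned responses standing in for real GET requests.
_fake_responses = iter([
    {"status": "IN_PROGRESS", "steps": {"parse": {"status": "IN_PROGRESS"}}},
    {"status": "COMPLETED", "steps": {"parse": {
        "status": "COMPLETED",
        "output_url": "https://presigned-url-to-output.json",
    }}},
])
final = poll_execution(lambda: next(_fake_responses), sleep=lambda _seconds: None)
print(final["status"], completed_output_urls(final))
```

In real use, `fetch_status` would wrap a GET to /api/v1/workflows/executions/{execution_id} with the API key header and return the parsed JSON body.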
# Health
Source: https://documentation.datalab.to/api-reference/health

https://www.datalab.to/openapi.json get /api/v1/health

This endpoint is used to check the health of the API. Returns a JSON object with the key "status" set to "ok".

# List Step Types
Source: https://documentation.datalab.to/api-reference/list-step-types

https://www.datalab.to/openapi.json get /api/v1/workflows/step-types

List all available step types that can be used in workflows.

These are the building blocks users can compose into workflows.

# List Workflows
Source: https://documentation.datalab.to/api-reference/list-workflows

https://www.datalab.to/openapi.json get /api/v1/workflows/workflows

List all workflow definitions with their steps.

# Marker Result Check
Source: https://documentation.datalab.to/api-reference/marker-result-check

https://www.datalab.to/openapi.json get /api/v1/marker/{request_id}

Poll this endpoint to check status of Marker request and retrieve final results.

# OCR Result Check
Source: https://documentation.datalab.to/api-reference/ocr-result-check

https://www.datalab.to/openapi.json get /api/v1/ocr/{request_id}

Poll this endpoint to check status of an OCR request and retrieve final results.

# Add Template Examples
Source: https://documentation.datalab.to/api-reference/pipeline-templates-deprecated/add-template-examples

https://www.datalab.to/openapi.json post /api/v1/pipeline_templates/{slug}/examples

Upload example files for a template. Admin-only.

# Clone Template
Source: https://documentation.datalab.to/api-reference/pipeline-templates-deprecated/clone-template

https://www.datalab.to/openapi.json post /api/v1/pipeline_templates/{slug}/clone

Clone a template to the user's team as a new custom processor.
# Download Template Example
Source: https://documentation.datalab.to/api-reference/pipeline-templates-deprecated/download-template-example

https://www.datalab.to/openapi.json get /api/v1/pipeline_templates/{slug}/examples/{filename}

Fetch example file from R2 and return content directly.

# Download Template Example Thumbnail
Source: https://documentation.datalab.to/api-reference/pipeline-templates-deprecated/download-template-example-thumbnail

https://www.datalab.to/openapi.json get /api/v1/pipeline_templates/{slug}/examples/{filename}/thumbnail

Stream thumbnail image for an example file.

# Get Template
Source: https://documentation.datalab.to/api-reference/pipeline-templates-deprecated/get-template

https://www.datalab.to/openapi.json get /api/v1/pipeline_templates/{slug}

Get detailed info for a pipeline template.

# List Templates
Source: https://documentation.datalab.to/api-reference/pipeline-templates-deprecated/list-templates

https://www.datalab.to/openapi.json get /api/v1/pipeline_templates

List all published pipeline templates.

# Promote To Template
Source: https://documentation.datalab.to/api-reference/pipeline-templates-deprecated/promote-to-template

https://www.datalab.to/openapi.json post /api/v1/pipeline_templates

Create a template by copying an existing completed processor. Admin-only.

Creates an independent copy so the admin can iterate on the source processor without affecting the template. Only the active version is copied, and agent session/checkpoint data is stripped so cloned copies don't share Claude sessions.

# Remove Template
Source: https://documentation.datalab.to/api-reference/pipeline-templates-deprecated/remove-template

https://www.datalab.to/openapi.json delete /api/v1/pipeline_templates/{slug}

Un-template a pipeline (sets is_template=False). Admin-only.
# Remove Template Example
Source: https://documentation.datalab.to/api-reference/pipeline-templates-deprecated/remove-template-example

https://www.datalab.to/openapi.json delete /api/v1/pipeline_templates/{slug}/examples/{filename}

Remove an example file from a template. Admin-only.

# Update Template
Source: https://documentation.datalab.to/api-reference/pipeline-templates-deprecated/update-template

https://www.datalab.to/openapi.json put /api/v1/pipeline_templates/{slug}

Update template metadata. Admin-only.

# Archive Pipeline
Source: https://documentation.datalab.to/api-reference/pipelines/archive-pipeline

https://www.datalab.to/openapi.json post /api/v1/pipelines/{pipeline_id}/archive

Archive a pipeline, hiding it from the default list.

# Create Pipeline
Source: https://documentation.datalab.to/api-reference/pipelines/create-pipeline

https://www.datalab.to/openapi.json post /api/v1/pipelines

Create a new pipeline for the team.

# Create Pipeline Version
Source: https://documentation.datalab.to/api-reference/pipelines/create-pipeline-version

https://www.datalab.to/openapi.json post /api/v1/pipelines/{pipeline_id}/versions

Create a new version snapshot of the pipeline's current steps.

# Discard Draft
Source: https://documentation.datalab.to/api-reference/pipelines/discard-draft

https://www.datalab.to/openapi.json post /api/v1/pipelines/{pipeline_id}/discard

Discard draft changes and reset Pipeline.steps to a published version's steps.

# Get Pipeline
Source: https://documentation.datalab.to/api-reference/pipelines/get-pipeline

https://www.datalab.to/openapi.json get /api/v1/pipelines/{pipeline_id}

Get pipeline by pipeline_id.

# Get Pipeline Execution
Source: https://documentation.datalab.to/api-reference/pipelines/get-pipeline-execution

https://www.datalab.to/openapi.json get /api/v1/pipelines/executions/{execution_id}

Poll execution status. Returns per-step status with lookup keys for partial results.

Decision rule: Check PG PipelineExecution.status first.
- If terminal (completed/failed): return from PostgreSQL (post-sync, complete data)
- If running/pending: read Firestore for real-time step status

# Get Pipeline Rate
Source: https://documentation.datalab.to/api-reference/pipelines/get-pipeline-rate

https://www.datalab.to/openapi.json get /api/v1/pipelines/{pipeline_id}/rate

Get the pipeline rate based on plan and effective processing region.

# Get Step Result
Source: https://documentation.datalab.to/api-reference/pipelines/get-step-result

https://www.datalab.to/openapi.json get /api/v1/pipelines/executions/{execution_id}/steps/{step_index}/result

Fetch intermediate result for a specific pipeline execution step.

# List Pipeline Executions
Source: https://documentation.datalab.to/api-reference/pipelines/list-pipeline-executions

https://www.datalab.to/openapi.json get /api/v1/pipelines/{pipeline_id}/executions

List recent executions for a pipeline.

# List Pipeline Versions
Source: https://documentation.datalab.to/api-reference/pipelines/list-pipeline-versions

https://www.datalab.to/openapi.json get /api/v1/pipelines/{pipeline_id}/versions

List all versions of a pipeline, newest first.

# List Pipelines
Source: https://documentation.datalab.to/api-reference/pipelines/list-pipelines

https://www.datalab.to/openapi.json get /api/v1/pipelines

List pipelines for the team.

# Run Pipeline
Source: https://documentation.datalab.to/api-reference/pipelines/run-pipeline

https://www.datalab.to/openapi.json post /api/v1/pipelines/{pipeline_id}/run

Execute a pipeline on a file, creating an execution DAG with per-step tracking and billing.

# Save Pipeline
Source: https://documentation.datalab.to/api-reference/pipelines/save-pipeline

https://www.datalab.to/openapi.json put /api/v1/pipelines/{pipeline_id}/save

Name and promote a pipeline to saved status.
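Polling the Get Pipeline Execution endpoint listed earlier in this group until a run reaches a terminal state can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the SDK's implementation: `fetch_status` is a hypothetical caller-supplied function that performs the `GET /api/v1/pipelines/executions/{execution_id}` request and returns the status string, the only status values taken from the reference are the terminal ones (`completed`, `failed`), and the interval and timeout defaults are arbitrary.

```python
import time
from typing import Callable

# Terminal states named in the Get Pipeline Execution reference.
TERMINAL_STATUSES = {"completed", "failed"}


def poll_until_terminal(
    fetch_status: Callable[[], str],  # hypothetical: wraps the GET request
    interval_secs: float = 2.0,       # assumed default, tune for your workload
    timeout_secs: float = 600.0,      # assumed default
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_status() until it returns a terminal status or time runs out."""
    waited = 0.0
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_secs:
            raise TimeoutError(f"execution still {status!r} after {waited:.0f}s")
        sleep(interval_secs)
        waited += interval_secs
```

Injecting `sleep` keeps the loop easy to exercise in tests without real waiting; in production code the default `time.sleep` is used.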
# Unarchive Pipeline
Source: https://documentation.datalab.to/api-reference/pipelines/unarchive-pipeline

https://www.datalab.to/openapi.json post /api/v1/pipelines/{pipeline_id}/unarchive

Unarchive a pipeline, restoring it to the default list.

# Update Pipeline
Source: https://documentation.datalab.to/api-reference/pipelines/update-pipeline

https://www.datalab.to/openapi.json put /api/v1/pipelines/{pipeline_id}

Update pipeline steps. This is the auto-save path.

# Add Template Examples
Source: https://documentation.datalab.to/api-reference/processor-templates/add-template-examples

https://www.datalab.to/openapi.json post /api/v1/processor_templates/{slug}/examples

Upload example files for a template. Admin-only.

# Clone Template
Source: https://documentation.datalab.to/api-reference/processor-templates/clone-template

https://www.datalab.to/openapi.json post /api/v1/processor_templates/{slug}/clone

Clone a template to the user's team as a new custom processor.

# Download Template Example
Source: https://documentation.datalab.to/api-reference/processor-templates/download-template-example

https://www.datalab.to/openapi.json get /api/v1/processor_templates/{slug}/examples/{filename}

Fetch example file from R2 and return content directly.

# Download Template Example Thumbnail
Source: https://documentation.datalab.to/api-reference/processor-templates/download-template-example-thumbnail

https://www.datalab.to/openapi.json get /api/v1/processor_templates/{slug}/examples/{filename}/thumbnail

Stream thumbnail image for an example file.

# Get Template
Source: https://documentation.datalab.to/api-reference/processor-templates/get-template

https://www.datalab.to/openapi.json get /api/v1/processor_templates/{slug}

Get detailed info for a pipeline template.

# List Templates
Source: https://documentation.datalab.to/api-reference/processor-templates/list-templates

https://www.datalab.to/openapi.json get /api/v1/processor_templates

List all published pipeline templates.
# Promote To Template
Source: https://documentation.datalab.to/api-reference/processor-templates/promote-to-template

https://www.datalab.to/openapi.json post /api/v1/processor_templates

Create a template by copying an existing completed processor. Admin-only.

Creates an independent copy so the admin can iterate on the source processor without affecting the template. Only the active version is copied, and agent session/checkpoint data is stripped so cloned copies don't share Claude sessions.

# Remove Template
Source: https://documentation.datalab.to/api-reference/processor-templates/remove-template

https://www.datalab.to/openapi.json delete /api/v1/processor_templates/{slug}

Un-template a pipeline (sets is_template=False). Admin-only.

# Remove Template Example
Source: https://documentation.datalab.to/api-reference/processor-templates/remove-template-example

https://www.datalab.to/openapi.json delete /api/v1/processor_templates/{slug}/examples/{filename}

Remove an example file from a template. Admin-only.

# Update Template
Source: https://documentation.datalab.to/api-reference/processor-templates/update-template

https://www.datalab.to/openapi.json put /api/v1/processor_templates/{slug}

Update template metadata. Admin-only.

# Run Custom Pipeline
Source: https://documentation.datalab.to/api-reference/run-custom-pipeline

https://www.datalab.to/openapi.json post /api/v1/custom-pipeline

Execute a custom pipeline configuration. The pipeline_id must reference a completed custom pipeline ID or a template ID.

# Run Custom Processor
Source: https://documentation.datalab.to/api-reference/run-custom-processor

https://www.datalab.to/openapi.json post /api/v1/custom-processor

Execute a custom processor configuration. The pipeline_id must reference a completed custom processor ID or a template ID.
# Segment Document
Source: https://documentation.datalab.to/api-reference/segment-document

https://www.datalab.to/openapi.json post /api/v1/segment

Segment a document into sections using a schema. Returns page ranges for each identified segment. Provide a file for end-to-end processing, or a checkpoint_id from a previous /convert call.

# Segment Result Check
Source: https://documentation.datalab.to/api-reference/segment-result-check

https://www.datalab.to/openapi.json get /api/v1/segment/{request_id}

Poll this endpoint to check the status of a Segment request and retrieve the segmentation results.

# Table Rec Result Check
Source: https://documentation.datalab.to/api-reference/table-rec-result-check

https://www.datalab.to/openapi.json get /api/v1/table_rec/{request_id}

Poll this endpoint to check status of Table Rec request and retrieve final results.

# Thumbnails
Source: https://documentation.datalab.to/api-reference/thumbnails

https://www.datalab.to/openapi.json get /api/v1/thumbnails/{lookup_key}

# Track Changes
Source: https://documentation.datalab.to/api-reference/track-changes

https://www.datalab.to/openapi.json post /api/v1/track-changes

Extract and display tracked changes from DOCX documents. Returns markdown, HTML, and/or chunks with change annotations.

# Track Changes Result Check
Source: https://documentation.datalab.to/api-reference/track-changes-result-check

https://www.datalab.to/openapi.json get /api/v1/track-changes/{request_id}

Poll this endpoint to check the status of a Track Changes request and retrieve the results.

# API Limits & Rate Limiting
Source: https://documentation.datalab.to/docs/common/limits

Datalab implements limits to ensure fair usage and maintain service quality. This guide covers file size limits, page limits, and rate limiting.
## File Size Limits

| File Type        | Maximum Size |
| ---------------- | ------------ |
| PDF Documents    | 200 MB       |
| Images           | 200 MB       |
| Office Documents | 200 MB       |

## Page Limits

| Limit                     | Value |
| ------------------------- | ----- |
| Maximum pages per request | 7,000 |

For documents exceeding these limits, use the `page_range` parameter to process in segments:

```python
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

# Process a large document in segments
options = ConvertOptions(page_range="0-999")
result1 = client.convert("large_document.pdf", options=options)

options = ConvertOptions(page_range="1000-1999")
result2 = client.convert("large_document.pdf", options=options)
```

## Rate Limits

### Request Rate Limit

| Plan      | Requests per minute | Concurrent requests |
| --------- | ------------------- | ------------------- |
| Free tier | 10                  | 5                   |
| Team      | 200                 | 400                 |

When you exceed request rate limits, you'll receive a `429` response. The SDK handles retries automatically. For raw API calls, implement retry logic:

```python
import time
import requests

def api_call_with_retry(url, headers, files, data, max_retries=3):
    for attempt in range(max_retries):
        response = requests.post(url, headers=headers, files=files, data=data)
        if response.status_code == 429:
            time.sleep(60)
            continue
        return response
    raise Exception("Max retries exceeded")
```

### Page Concurrency Limit

In addition to request rate limits, Datalab enforces a limit on the total number of pages being processed concurrently across all your requests.

| Limit                      | Value |
| -------------------------- | ----- |
| Concurrent pages in flight | 5,000 |

Most workloads will not hit this limit. It primarily affects high-volume workloads with longer-running requests (for example, large or complex documents processed in accurate mode with additional features enabled) or extremely high-volume workloads.
Such sustained workloads would benefit from an enterprise agreement or a batch job that we orchestrate for you. Contact [support@datalab.to](mailto:support@datalab.to) to discuss your requirements.

This limit differs from request rate limits in two important ways:

1. **It is not time-bound.** It limits the number of pages actively being processed at any given moment, not the number of requests per minute.
2. **It is enforced during processing, not at submission.** You will not receive a `429` response when submitting a document. Instead, the result will return with `success` set to `false` and an error message:

```json
{
  "success": false,
  "error": "Page rate limit exceeded. Your team has {current_pages} pages in flight and this request adds {page_count} more ({total} total, limit: 5,000). Please wait for some requests to complete before submitting more, or contact support@datalab.to for a higher limit."
}
```

Because this limit is not enforced at submission time, you won't get an HTTP error when submitting. Always check the `success` field in your results. If you're polling for results, back off and wait for in-flight requests to complete before submitting more.

## Enterprise Limits

Custom limits are available for enterprise plans:

* Higher file size limits
* Increased rate limits
* Priority processing

See [pricing](https://www.datalab.to/pricing) for details, or contact support to discuss your requirements.

## Next Steps

* Process multiple documents efficiently in batch.
* Understand HTTP error codes and subscription errors.
* Learn about per-page pricing and usage monitoring.
* Receive notifications when processing completes instead of polling.
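The two limit signals described on this page (a `429` at submission, or a result with `success` set to `false` and a page rate limit error) can be handled with one small helper. The following is a rough sketch, not part of the Datalab SDK: the function names are hypothetical, matching on the "Page rate limit exceeded" prefix assumes the error text stays as shown above, and the delay schedule is an arbitrary choice.

```python
from typing import Iterator


def should_back_off(status_code: int, result: dict) -> bool:
    """Return True when a response signals a request-rate or page-concurrency limit."""
    if status_code == 429:
        return True
    # Assumption: the page-concurrency error keeps the prefix shown in the docs.
    error = result.get("error") or ""
    return result.get("success") is False and "Page rate limit exceeded" in error


def backoff_delays(
    base_secs: float = 5.0,
    factor: float = 2.0,
    max_secs: float = 60.0,
    attempts: int = 5,
) -> Iterator[float]:
    """Yield exponentially growing wait times, capped at max_secs."""
    delay = base_secs
    for _ in range(attempts):
        yield min(delay, max_secs)
        delay *= factor
```

A caller would loop over `backoff_delays()`, sleep for each delay, and resubmit only while `should_back_off(...)` stays true; any other failure should be surfaced rather than retried.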
# Supported File Types
Source: https://documentation.datalab.to/docs/common/supportedfiletypes

Datalab supports the following file types for document conversion:

## PDF

| Extension | MIME Type         |
| --------- | ----------------- |
| `.pdf`    | `application/pdf` |

## Spreadsheets

| Extension | MIME Type                                                              |
| --------- | ---------------------------------------------------------------------- |
| `.xls`    | `application/vnd.ms-excel`                                             |
| `.xlsx`   | `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`    |
| `.xlsm`   | `application/vnd.ms-excel.sheet.macroEnabled.12`                       |
| `.xltx`   | `application/vnd.openxmlformats-officedocument.spreadsheetml.template` |
| `.csv`    | `text/csv`                                                             |
| `.ods`    | `application/vnd.oasis.opendocument.spreadsheet`                       |

## Word Documents

| Extension | MIME Type                                                                 |
| --------- | ------------------------------------------------------------------------- |
| `.doc`    | `application/msword`                                                      |
| `.docx`   | `application/vnd.openxmlformats-officedocument.wordprocessingml.document` |
| `.odt`    | `application/vnd.oasis.opendocument.text`                                 |

## Presentations

| Extension | MIME Type                                                                   |
| --------- | --------------------------------------------------------------------------- |
| `.ppt`    | `application/vnd.ms-powerpoint`                                             |
| `.pptx`   | `application/vnd.openxmlformats-officedocument.presentationml.presentation` |
| `.odp`    | `application/vnd.oasis.opendocument.presentation`                           |

## HTML

| Extension | MIME Type   |
| --------- | ----------- |
| `.html`   | `text/html` |

## Ebooks

| Extension | MIME Type              |
| --------- | ---------------------- |
| `.epub`   | `application/epub+zip` |

## Images

| Extension | MIME Type    |
| --------- | ------------ |
| `.png`    | `image/png`  |
| `.jpg`    | `image/jpeg` |
| `.jpeg`   | `image/jpeg` |
| `.webp`   | `image/webp` |
| `.gif`    | `image/gif`  |
| `.tiff`   | `image/tiff` |

## Detecting MIME Types

To automatically detect a file's MIME type in Python:

```python
import filetype

mime = filetype.guess("document.pdf")
if mime:
    print(mime.mime)  # application/pdf
```

Install with `pip install filetype`.

## Size Limits

See [API Limits](/docs/common/limits) for file size and page limits.

## Next Steps

* Get started converting documents in minutes.
* Detailed guide to converting documents to Markdown, HTML, or JSON.
* Understand file size limits, page limits, and rate limiting.
* Upload files to Datalab storage for use in pipelines.

# API
Source: https://documentation.datalab.to/docs/on-prem/api

Our on-prem container's API mimics Datalab's API.

Our cloud-hosted API documentation can be found [here](https://documentation.datalab.to/docs/welcome/api). With caveats and exceptions detailed below, the container image shares the same API.

# Supported endpoints

The container currently supports:

* `/api/v1/marker` documented [here](https://documentation.datalab.to/docs/welcome/api#marker). This uses both the Marker and Chandra models.
* `/api/v1/ocr` documented [here](https://documentation.datalab.to/docs/welcome/api#ocr).
* `/api/v1/extract`: structured extraction via JSON schema. Supports `fast` and `turbo` extraction modes. Requires the Chandra model with the Lift model enabled; `balanced` mode is not available on-prem.
* `/api/v1/usage` documented [here](/docs/on-prem/usage-analytics), which provides usage analytics and performance metrics for your on-prem deployment.

# Authentication

API authentication is not supported in the container. We assume customers will be running our image on their own infrastructure in private networks. You may send the `X-API-Key` header detailed [here](https://documentation.datalab.to/docs/welcome/api#authentication), but it will be ignored and any value works.

# PDFs and images are supported; document conversion is not yet supported

Datalab's API supports [many file types](https://documentation.datalab.to/docs/common/supportedfiletypes). The container currently supports PDFs and image file types. Other file types are not yet supported, but will be supported in an upcoming release.
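Because the container accepts only PDFs and images, it can help to filter files on the client side before submitting them. The snippet below is a minimal sketch, assuming the container accepts the same PDF and image extensions listed in the cloud Supported File Types tables; that mapping is an assumption, and `is_supported_on_prem` is a hypothetical helper, not part of any Datalab package.

```python
from pathlib import Path

# Assumption: PDF plus the image extensions from the cloud Supported File Types page.
ON_PREM_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".webp", ".gif", ".tiff"}


def is_supported_on_prem(filename: str) -> bool:
    """Check a file's extension against the assumed on-prem allow-list."""
    return Path(filename).suffix.lower() in ON_PREM_EXTENSIONS
```

An extension check is only a first filter. For files with missing or misleading extensions, content-based detection such as the `filetype` approach shown on the Supported File Types page is more reliable.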
## Feature Parity

| Feature                                       | Cloud API | On-Premises                                                                              |
| --------------------------------------------- | --------- | ---------------------------------------------------------------------------------------- |
| Document conversion (`/marker`)               | Yes       | Yes                                                                                      |
| OCR (`/ocr`)                                  | Yes       | Yes                                                                                      |
| Output formats (markdown, html, json, chunks) | Yes       | Yes                                                                                      |
| Parse quality scoring                         | Yes       | Yes                                                                                      |
| Chart understanding                           | Yes       | Yes (Chandra containers only)                                                            |
| Page range selection                          | Yes       | Yes                                                                                      |
| Block IDs                                     | Yes       | Yes                                                                                      |
| Token-efficient markdown                      | Yes       | Yes                                                                                      |
| Form filling (`/fill`)                        | Yes       | **No**                                                                                   |
| Create document (`/create-document`)          | Yes       | **No**                                                                                   |
| Thumbnails                                    | Yes       | **No**                                                                                   |
| Accurate mode                                 | Yes       | **No**                                                                                   |
| Fast mode                                     | Yes       | **No**                                                                                   |
| Link extraction                               | Yes       | **No**                                                                                   |
| Checkpoints                                   | Yes       | **No**                                                                                   |
| File URL download                             | Yes       | **No**                                                                                   |
| Structured extraction (`/extract`)            | Yes       | Yes, `fast` and `turbo` modes only (requires Lift model); `balanced` mode not available |
| Document segmentation                         | Yes       | **No**                                                                                   |

On-premises containers do not require API key authentication. Implement access control at the network or reverse proxy level.

## Next Steps

* Monitor request volumes, performance metrics, and system status.
* Get the on-prem container up and running in minutes.
* Full REST API reference that the on-prem container mirrors.
* Compare open-source and paid on-prem options.

# Overview
Source: https://documentation.datalab.to/docs/on-prem/overview

Run inference on your own infrastructure

**Customers can run our models on infrastructure they control with an Enterprise contract.** To get started, please [**fill out this form**](https://www.datalab.to/contact).
# **What’s the difference between Open Source and Datalab's paid On-Prem options?**

Our free open-source options ([Chandra](https://github.com/datalab-to/chandra), [Marker](https://github.com/datalab-to/marker), and [Surya](https://github.com/datalab-to/surya)) are ideal for research, personal use, and early-stage startups.

Our paid on-prem options are for teams that need a commercial license to run our models and have one or more of the following requirements:

* Require data privacy/operate in highly-regulated environments
* Extremely high volume
* Model training or customization
* White-glove support and SLAs

Here’s a more detailed breakdown.

|                           | **Free (Open Source)**                               | **Datalab On-Prem** |
| ------------------------- | ---------------------------------------------------- | :------------------ |
| **Intended Use**          | Research, personal use, startups \< \$2M ARR/funding | Commercial workloads requiring data privacy, easy deployments, and white-glove support |
| **Models**                | Open source model weights                            | Access to newer, more accurate models not available in open source (e.g., latest Chandra versions) |
| **License**               | GPL + custom RAILs                                   | Commercial license to use our models on-prem (Marker/Surya/Chandra) without sublicensing. Custom rights as needed. |
| **Deployment**            | Self-install from OSS repos                          | Custom deployment topologies and Datalab-assisted rollout options as-needed. For small deployments, we have a Docker image that is simple to use and upgrade. |
| **Support**               | Community only                                       | Premium support with SLAs |
| **Contracting**           | None                                                 | Custom agreements and security reviews. For POCs and small deployments we offer no-redline agreements to get started, fast. |
| **Billing**               | Free                                                 | Invoice/PO, custom terms. For small deployments we offer credit card checkout. |
| **Scale and Performance** | Self-effort                                          | Throughput/latency/accuracy tuning; custom page counts; custom rate limits. For small deployments we offer optimized single-GPU support that is simple to deploy and operate. |

For higher page volume & GPU concurrency, fully-airgapped deployments, white glove support, and other custom needs,

# **Try before you buy**

We have two easy ways for customers to try our models:

* Our open-source projects, [Chandra](https://github.com/datalab-to/chandra), [Marker](https://github.com/datalab-to/marker), and [Surya](https://github.com/datalab-to/surya).
* Datalab's [Cloud API](https://www.datalab.to).

Our container image mimics the cloud-hosted API for a simple transition:

* [Cloud API Docs](https://documentation.datalab.to/docs/welcome/api)
* [On Prem Container API Docs](./api)

## Next Steps

* Get the on-prem container up and running in minutes.
* API reference for the on-prem container image.
* Try the cloud-hosted API to evaluate before deploying on-prem.
* Understand pricing for on-prem and cloud plans.

# Running the Container
Source: https://documentation.datalab.to/docs/on-prem/running-the-container

Getting our container up-and-running takes minutes.

Running Datalab's containers requires **a Google Cloud service account key** to pull the container image. If the terms of your agreement require a license, we'll also provide **a license key.**

# License-enabled containers

Copy your license key, download the service account key, and [run the script in this GitHub repository to get up-and-running](https://github.com/datalab-to/datalab-on-prem):

```shellscript
export DATALAB_LICENSE_KEY=your-license-key
export SERVICE_ACCOUNT_KEY_FILE=path/to/key.json
./run-datalab-inference-container.sh
```

# Kubernetes deployment (Helm)

A Helm chart is available for deploying the container on Kubernetes clusters. Contact [support@datalab.to](mailto:support@datalab.to) to receive the chart and values reference for your deployment.
# Fully-airgapped containers

A license key is not required to run a fully-airgapped container. If the terms of your agreement require a fully-airgapped container, we will provide:

* Access to private registries that contain those images.
* A Google Cloud service account key to pull images.
* Directions for how to run the container.

# [www.datalab.to](http://www.datalab.to) must be reachable

Our on-prem license requires that [https://www.datalab.to](https://www.datalab.to) is reachable in order to:

1. Activate and register your license with our servers.
2. Send usage metrics.

# Usage data sent to Datalab

License activation and usage heartbeats **do not send private data to Datalab.** Our intent is to ensure compliance with our license and to easily support customers when they run into problems.

The container sends the following to our servers:

1. On container startup we activate your license. In that request, we send information about your hardware and OS available in `/proc` and `/sys` (in the container, not on your host).
2. At regular intervals we send usage heartbeats that contain:
   1. The number of successful/failed inference requests completed since the last heartbeat
   2. The number of inference requests submitted to the container over a recent time window

# I need a fully-airgapped deployment

We also support fully-airgapped deployments that do not require a license. [Get started by filling out this form.](https://www.datalab.to/contact)

Please reach out to us at [support@datalab.to](mailto:support@datalab.to) if you have questions.
## Hardware Requirements

| Container Type  | GPU Required | Minimum VRAM | Recommended Use                             |
| --------------- | ------------ | ------------ | ------------------------------------------- |
| `marker`        | Yes (CUDA)   | 24 GB        | Standard document conversion with Surya OCR |
| `chandra`       | Yes (CUDA)   | 80 GB        | Full Chandra VLM for highest accuracy       |
| `chandra-small` | Yes (CUDA)   | 16 GB        | Smaller Chandra variants (2B/4B models)     |

## Health Check

Verify your container is running:

```bash
curl http://localhost:8000/health_check
```

Expected response:

```json
{"status": "healthy"}
```

## Next Steps

* API reference for the on-prem container image.
* Compare open-source and paid on-prem deployment options.
* Try the cloud-hosted API for quick evaluation.
* Understand HTTP error codes and troubleshooting steps.

# Usage Analytics
Source: https://documentation.datalab.to/docs/on-prem/usage-analytics

Monitor inference request analytics and performance metrics in your on-prem deployment.

The usage analytics endpoint provides comprehensive metrics about inference requests processed by your on-premises container. Use this endpoint to monitor request volumes, success rates, performance statistics, and current system status.

This endpoint is only available in on-premises deployments and requires a valid license.

## Endpoint

```bash
GET /api/v1/usage
```

## Query Parameters

| Parameter    | Type              | Default      | Description                       |
| ------------ | ----------------- | ------------ | --------------------------------- |
| `start_date` | string (ISO 8601) | 24 hours ago | Start of time range for analytics |
| `end_date`   | string (ISO 8601) | Now          | End of time range for analytics   |

The time range cannot exceed 7 days. Requests with larger ranges will return a 400 error.

## Authentication

This endpoint requires a valid on-premises license. If your license is invalid or expired, the endpoint returns a 423 (Locked) status code.
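Since a single query cannot span more than 7 days, reporting over a longer period means issuing several requests. One way to do that is to split the period into consecutive windows first. This is a sketch under stated assumptions: `split_into_windows` is a hypothetical helper, and it assumes that a window of exactly 7 days is accepted by the endpoint.

```python
from datetime import datetime, timedelta

MAX_WINDOW = timedelta(days=7)  # the documented cap on start_date..end_date


def split_into_windows(
    start: datetime,
    end: datetime,
    max_window: timedelta = MAX_WINDOW,
) -> list[tuple[datetime, datetime]]:
    """Split [start, end] into consecutive windows no longer than max_window."""
    if start >= end:
        raise ValueError("start must be before end")
    windows = []
    cursor = start
    while cursor < end:
        window_end = min(cursor + max_window, end)
        windows.append((cursor, window_end))
        cursor = window_end
    return windows
```

Each `(start, end)` pair can then be passed as `start_date` and `end_date` after calling `.isoformat()`. Counts such as `total_requests` can be summed across windows, but percentile fields cannot be combined this way, and adjacent windows share a boundary instant, so a request finishing exactly on a boundary could be counted twice depending on how the endpoint treats range ends.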
## Response Structure

The endpoint returns a comprehensive analytics object with five main sections:

### Period

The effective time range for the query (normalized to UTC):

```json
{
  "period": {
    "start_date": "2024-06-01T00:00:00+00:00",
    "end_date": "2024-06-01T23:59:59+00:00"
  }
}
```

### Summary

Aggregate statistics across all request types:

```json
{
  "summary": {
    "total_requests": 1250,
    "successful_requests": 1200,
    "failed_requests": 50,
    "successful_pages_processed": 15000,
    "failed_pages_processed": 500,
    "success_rate": 0.96
  }
}
```

| Field                        | Type  | Description                                 |
| ---------------------------- | ----- | ------------------------------------------- |
| `total_requests`             | int   | Total completed requests in time range      |
| `successful_requests`        | int   | Requests completed without errors           |
| `failed_requests`            | int   | Requests that failed with errors            |
| `successful_pages_processed` | int   | Total pages from successful requests        |
| `failed_pages_processed`     | int   | Total pages from failed requests            |
| `success_rate`               | float | Ratio of successful to total requests (0-1) |

### By Request Type

Per-type breakdown of the same metrics:

```json
{
  "by_request_type": {
    "marker": {
      "total_requests": 1000,
      "successful_requests": 980,
      "failed_requests": 20,
      "successful_pages_processed": 12000,
      "failed_pages_processed": 200
    },
    "ocr": {
      "total_requests": 250,
      "successful_requests": 220,
      "failed_requests": 30,
      "successful_pages_processed": 3000,
      "failed_pages_processed": 300
    }
  }
}
```

### Performance

Processing time and queue wait statistics (only includes successful requests):

```json
{
  "performance": {
    "average_processing_time_secs": 12.5,
    "median_processing_time_secs": 10.2,
    "p95_processing_time_secs": 25.8,
    "p99_processing_time_secs": 35.4,
    "average_queue_wait_secs": 2.3
  }
}
```

| Field                          | Type  | Description                            |
| ------------------------------ | ----- | -------------------------------------- |
| `average_processing_time_secs` | float | Mean time from start to completion     |
| `median_processing_time_secs`  | float | 50th percentile processing time        |
| `p95_processing_time_secs`     | float | 95th percentile processing time        |
| `p99_processing_time_secs`     | float | 99th percentile processing time        |
| `average_queue_wait_secs`      | float | Mean time from submission to start     |

Performance metrics are `null` when there are no successful requests in the time range. Failed requests are excluded from performance calculations.

### Current Status

Live snapshot of in-progress and queued requests (not filtered by time range):

```json
{
  "current_status": {
    "requests_in_progress": 5,
    "requests_queued": 12
  }
}
```

| Field                  | Type | Description                        |
| ---------------------- | ---- | ---------------------------------- |
| `requests_in_progress` | int  | Requests currently being processed |
| `requests_queued`      | int  | Requests waiting to be processed   |

## Examples

### Basic Usage (Default 24-Hour Window)

```python Python SDK
# The Python SDK does not yet support the usage endpoint
# Use the requests library directly
import requests

response = requests.get(
    "http://localhost:8000/api/v1/usage",
    headers={"X-API-Key": "any-value"}  # Not validated in on-prem
)
data = response.json()

print(f"Total requests: {data['summary']['total_requests']}")
print(f"Success rate: {data['summary']['success_rate']:.2%}")
```

```bash cURL
curl -X GET http://localhost:8000/api/v1/usage \
  -H "X-API-Key: any-value"
```

```python Python (requests)
import requests

response = requests.get(
    "http://localhost:8000/api/v1/usage",
    headers={"X-API-Key": "any-value"}
)
data = response.json()

# Print summary
summary = data["summary"]
print(f"Total: {summary['total_requests']}")
print(f"Success: {summary['successful_requests']}")
print(f"Failed: {summary['failed_requests']}")
print(f"Success rate: {summary['success_rate']:.2%}")

# Print performance metrics (null when there were no successful requests)
perf = data["performance"]
if perf["average_processing_time_secs"] is not None:
    print(f"\nAvg processing time: {perf['average_processing_time_secs']:.2f}s")
    print(f"P95 processing time: {perf['p95_processing_time_secs']:.2f}s")
```

### Custom Time Range

```python Python SDK
import requests
from datetime import datetime, timedelta, timezone

# Query last 7 days
end_date = datetime.now(timezone.utc)
start_date = end_date - timedelta(days=7)

response = requests.get(
    "http://localhost:8000/api/v1/usage",
    params={
        "start_date": start_date.isoformat(),
        "end_date": end_date.isoformat()
    },
    headers={"X-API-Key": "any-value"}
)
data = response.json()
```

```bash cURL
# Query specific date range
curl -X GET "http://localhost:8000/api/v1/usage?start_date=2024-06-01T00:00:00Z&end_date=2024-06-07T23:59:59Z" \
  -H "X-API-Key: any-value"
```

```python Python (requests)
import requests
from datetime import datetime, timedelta, timezone

# Query last 3 days
end_date = datetime.now(timezone.utc)
start_date = end_date - timedelta(days=3)

response = requests.get(
    "http://localhost:8000/api/v1/usage",
    params={
        "start_date": start_date.isoformat(),
        "end_date": end_date.isoformat()
    },
    headers={"X-API-Key": "any-value"}
)
data = response.json()

print(f"Period: {data['period']['start_date']} to {data['period']['end_date']}")
```

### Monitoring Dashboard Example

```python Python SDK
import requests

def get_usage_metrics():
    """Fetch current usage metrics for monitoring dashboard."""
    response = requests.get(
        "http://localhost:8000/api/v1/usage",
        headers={"X-API-Key": "any-value"}
    )
    if response.status_code != 200:
        raise Exception(f"Failed to fetch metrics: {response.status_code}")
    return response.json()

def print_dashboard():
    """Print a simple monitoring dashboard."""
    data = get_usage_metrics()

    print("=" * 60)
    print("DATALAB ON-PREM USAGE DASHBOARD")
    print("=" * 60)

    # Summary
    summary = data["summary"]
    print(f"\n📊 SUMMARY (Last 24 Hours)")
    print(f" Total Requests: {summary['total_requests']:,}")
    print(f" Successful: {summary['successful_requests']:,}")
    print(f" Failed: {summary['failed_requests']:,}")
    print(f" Success Rate: {summary['success_rate']:.2%}")
    print(f" Pages Processed: {summary['successful_pages_processed']:,}")

    # By type
    print(f"\n📈 BY REQUEST TYPE")
    for req_type, metrics in data["by_request_type"].items():
        print(f" {req_type.upper()}:")
        print(f" Requests: {metrics['total_requests']:,} ({metrics['successful_requests']:,} successful)")
        print(f" Pages: {metrics['successful_pages_processed']:,}")

    # Performance (null when there were no successful requests)
    perf = data["performance"]
    if perf["average_processing_time_secs"] is not None:
        print(f"\n⚡ PERFORMANCE")
        print(f" Avg Processing: {perf['average_processing_time_secs']:.2f}s")
        print(f" Median Processing: {perf['median_processing_time_secs']:.2f}s")
        print(f" P95 Processing: {perf['p95_processing_time_secs']:.2f}s")
        print(f" P99 Processing: {perf['p99_processing_time_secs']:.2f}s")
        print(f" Avg Queue Wait: {perf['average_queue_wait_secs']:.2f}s")

    # Current status
    status = data["current_status"]
    print(f"\n🔄 CURRENT STATUS")
    print(f" In Progress: {status['requests_in_progress']}")
    print(f" Queued: {status['requests_queued']}")
    print("=" * 60)

if __name__ == "__main__":
    print_dashboard()
```

```bash cURL
# Simple monitoring script
curl -s http://localhost:8000/api/v1/usage \
  -H "X-API-Key: any-value" | \
  jq '{
    total: .summary.total_requests,
    success_rate: .summary.success_rate,
    in_progress: .current_status.requests_in_progress,
    queued: .current_status.requests_queued
  }'
```

```python Python (requests)
import requests

def monitor_system_health():
    """Check system health based on usage metrics."""
    response = requests.get(
        "http://localhost:8000/api/v1/usage",
        headers={"X-API-Key": "any-value"}
    )
    data = response.json()

    summary = data["summary"]
    status = data["current_status"]
    perf = data["performance"]

    # Check success rate
    if summary["success_rate"] < 0.95:
        print(f"⚠️ WARNING: Success rate is {summary['success_rate']:.2%}")

    # Check queue depth
    if status["requests_queued"] > 100:
        print(f"⚠️ WARNING: {status['requests_queued']} requests queued")

    # Check processing time
    if perf["p95_processing_time_secs"] and perf["p95_processing_time_secs"] > 60:
        print(f"⚠️ WARNING: P95 processing time is {perf['p95_processing_time_secs']:.1f}s")

    print("✅ System health check complete")

monitor_system_health()
```

## Error Responses

### 400 Bad Request

Invalid query parameters:

```json
{
  "detail": "start_date must be before end_date."
}
```

```json
{
  "detail": "Time range must not exceed 7 days."
}
```

### 423 Locked

License validation failed:

```json
{
  "detail": "License validation failed"
}
```

## Implementation Notes

* Only **completed requests** (with `end_time` set) are included in summary statistics
* Failed requests are counted in totals but excluded from performance metrics
* Performance percentiles use linear interpolation for accurate calculation
* Queue wait time is calculated as `start_time - submission_time`
* Processing time is calculated as `end_time - start_time`
* Naive datetimes (without timezone) are treated as UTC
* The `current_status` section provides a live snapshot and is not filtered by the time range

## Use Cases

### Capacity Planning

Monitor request volumes and processing times to plan infrastructure scaling:

```python
import requests
from datetime import datetime, timedelta, timezone

# Get last 7 days of data
end = datetime.now(timezone.utc)
start = end - timedelta(days=7)

response = requests.get(
    "http://localhost:8000/api/v1/usage",
    params={"start_date": start.isoformat(), "end_date": end.isoformat()},
    headers={"X-API-Key": "any-value"}
)
data = response.json()

avg_daily_requests = data["summary"]["total_requests"] / 7
avg_daily_pages = data["summary"]["successful_pages_processed"] / 7

print(f"Average daily requests: 
{avg_daily_requests:.0f}") print(f"Average daily pages: {avg_daily_pages:.0f}") ``` ### Performance Monitoring Track processing times to identify performance degradation: ```python theme={null} import requests response = requests.get( "http://localhost:8000/api/v1/usage", headers={"X-API-Key": "any-value"} ) perf = response.json()["performance"] # Alert if P95 exceeds threshold if perf["p95_processing_time_secs"] and perf["p95_processing_time_secs"] > 30: print(f"ALERT: P95 processing time is {perf['p95_processing_time_secs']:.1f}s") ``` ### Queue Monitoring Monitor queue depth to detect bottlenecks: ```python theme={null} import requests response = requests.get( "http://localhost:8000/api/v1/usage", headers={"X-API-Key": "any-value"} ) status = response.json()["current_status"] if status["requests_queued"] > 50: print(f"WARNING: {status['requests_queued']} requests in queue") ``` ## Next Steps Full API reference for the on-prem container. Get the on-prem container up and running. Understand HTTP error codes and troubleshooting. Compare open-source and paid on-prem options. # Batch Processing Source: https://documentation.datalab.to/docs/recipes/conversion/batch-documents Convert multiple documents efficiently with parallel processing. Process directories of documents with the SDK or CLI. Both handle rate limiting and retries automatically. 
## SDK Batch Processing

Process multiple files using Python's async capabilities:

### Async Batch Processing

For higher throughput:

```python theme={null}
import asyncio
from pathlib import Path
from datalab_sdk import AsyncDatalabClient, ConvertOptions

async def process_directory(input_dir: str, output_dir: str):
    async with AsyncDatalabClient() as client:
        pdf_files = list(Path(input_dir).glob("*.pdf"))

        # Process all files concurrently
        tasks = [
            client.convert(str(pdf), options=ConvertOptions(mode="balanced"))
            for pdf in pdf_files
        ]
        results = await asyncio.gather(*tasks, return_exceptions=True)

        for pdf, result in zip(pdf_files, results):
            if isinstance(result, Exception):
                print(f"Error processing {pdf.name}: {result}")
            else:
                output_path = Path(output_dir) / f"{pdf.stem}.md"
                output_path.write_text(result.markdown)
                print(f"Saved: {output_path}")

asyncio.run(process_directory("./documents/", "./output/"))
```

## CLI Batch Processing

The CLI handles directory processing automatically:

```bash theme={null}
# Convert all PDFs in a directory
datalab convert ./documents/ --output_dir ./output/

# Filter by extension
datalab convert ./documents/ --extensions pdf,docx

# Control concurrency
datalab convert ./documents/ --max_concurrent 10

# With processing options
datalab convert ./documents/ \
  --mode balanced \
  --format markdown \
  --output_dir ./output/
```

See [CLI Reference](/docs/welcome/sdk/cli) for all options.
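The async example in this section starts every conversion at once with `asyncio.gather`, while the tips on this page suggest starting with 5-10 concurrent requests. One way to cap in-flight work is an `asyncio.Semaphore`. The sketch below is illustrative only: `bounded_gather` and `fake_convert` are hypothetical names, and `fake_convert` only simulates a conversion. In real use you would pass your own coroutine function (for example one that wraps `client.convert`) as `worker`.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable, List


async def bounded_gather(
    items: Iterable[Any],
    worker: Callable[[Any], Awaitable[Any]],
    max_concurrent: int = 5,
) -> List[Any]:
    """Run worker(item) for every item with at most max_concurrent in flight.

    Exceptions are returned in place of results (return_exceptions=True),
    so one failed document does not cancel the rest of the batch.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(item: Any) -> Any:
        async with semaphore:  # wait for a free slot before starting
            return await worker(item)

    return await asyncio.gather(
        *(run_one(item) for item in items), return_exceptions=True
    )


async def fake_convert(name: str) -> str:
    """Hypothetical stand-in for a real conversion call; it only sleeps."""
    await asyncio.sleep(0.01)
    if name.endswith(".bad"):
        raise ValueError(f"cannot convert {name}")
    return f"converted {name}"


def demo() -> List[Any]:
    files = ["a.pdf", "b.pdf", "c.bad", "d.pdf"]
    return asyncio.run(bounded_gather(files, fake_convert, max_concurrent=2))
```

Results come back in the same order as the inputs, so they can be zipped with the file list exactly as in the SDK example.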
## REST API Batch Processing For raw API usage, implement parallel requests with retry handling: ```python theme={null} import os import time import requests from pathlib import Path from concurrent.futures import ThreadPoolExecutor, as_completed from requests.adapters import HTTPAdapter, Retry API_URL = "https://www.datalab.to/api/v1/convert" API_KEY = os.getenv("DATALAB_API_KEY") # Configure session with retries session = requests.Session() retries = Retry( total=20, backoff_factor=4, status_forcelist=[429], allowed_methods=["GET", "POST"], raise_on_status=False, ) session.mount("https://", HTTPAdapter(max_retries=retries)) def convert_document(pdf_path: Path, output_format="markdown", mode="balanced"): """Convert a single document with polling.""" headers = {"X-API-Key": API_KEY} # Submit request with open(pdf_path, "rb") as f: response = session.post( API_URL, files={"file": (pdf_path.name, f, "application/pdf")}, data={"output_format": output_format, "mode": mode}, headers=headers ) data = response.json() check_url = data["request_check_url"] # Poll for completion for _ in range(300): result = session.get(check_url, headers=headers).json() if result["status"] == "complete": return result elif result["status"] == "failed": raise Exception(f"Failed: {result.get('error')}") time.sleep(2) raise Exception("Timeout") def batch_convert(directory: str, max_workers: int = 5): """Process all PDFs in a directory.""" doc_dir = Path(directory) pdfs = list(doc_dir.glob("*.pdf")) print(f"Found {len(pdfs)} PDFs") results = {} with ThreadPoolExecutor(max_workers=max_workers) as executor: futures = { executor.submit(convert_document, pdf): pdf.name for pdf in pdfs } for future in as_completed(futures): filename = futures[future] try: result = future.result() results[filename] = result print(f"Converted: {filename}") except Exception as e: print(f"Error processing {filename}: {e}") return results # Usage results = batch_convert("./documents/", max_workers=5) ``` ## Rate Limits * 
**Request rate limit:** 200 requests per minute per account (429 on exceed) * **Concurrent request limit:** 400 concurrent requests (429 on exceed) * **Page concurrency limit:** 5,000 pages in flight across all requests — this is enforced during processing, not at submission. Results return with `success: false` if exceeded. Always check the `success` field when polling for results. * The SDK and CLI handle request rate limiting and retries automatically * For higher limits, contact [support@datalab.to](mailto:support@datalab.to) See [API Limits](/docs/common/limits) for details. ## Tips 1. **Use async for high throughput** - Async processing handles many concurrent requests efficiently 2. **Limit concurrency** - Start with 5-10 concurrent requests and adjust based on your rate limits 3. **Handle failures gracefully** - Use `return_exceptions=True` with `asyncio.gather` to continue processing on errors 4. **Save progress** - Write results incrementally to avoid losing work on long batches ## Next Steps Learn more about Marker's conversion API and output formats. Understand rate limits and how to optimize throughput. Chain processors into versioned, reusable pipelines. Get notified when batch conversions complete via webhooks. # Document Conversion Source: https://documentation.datalab.to/docs/recipes/conversion/conversion-api-overview Convert documents to Markdown, HTML, JSON, or chunks using the Convert API. Convert PDFs, Word documents, spreadsheets, and images to machine-readable formats. Marker handles complex layouts, tables, math, and images. **Before you begin**, make sure you have: 1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits) 2. Python 3.10+ installed 3. The Datalab SDK: `pip install datalab-python-sdk` 4. 
Your `DATALAB_API_KEY` environment variable set **Building for production?** Use [Pipelines](/docs/recipes/pipelines/pipeline-overview) to chain processors, version your configuration, and deploy with a single API call. ## Quick Start ```python Python SDK theme={null} from datalab_sdk import DatalabClient, ConvertOptions client = DatalabClient() # Basic conversion result = client.convert("document.pdf") print(result.markdown) # With options options = ConvertOptions( output_format="markdown", mode="balanced", paginate=True ) result = client.convert("document.pdf", options=options) ``` ```bash cURL theme={null} # Submit request curl -X POST https://www.datalab.to/api/v1/convert \ -H "X-API-Key: $DATALAB_API_KEY" \ -F "file=@document.pdf" \ -F "output_format=markdown" \ -F "mode=balanced" # Poll for results (use request_check_url from response) curl https://www.datalab.to/api/v1/convert/REQUEST_ID \ -H "X-API-Key: $DATALAB_API_KEY" ``` ```python Python (requests) theme={null} import os, time, requests API_URL = "https://www.datalab.to/api/v1/convert" headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")} # Submit request with open("document.pdf", "rb") as f: response = requests.post( API_URL, files={"file": ("document.pdf", f, "application/pdf")}, data={"output_format": "markdown", "mode": "balanced"}, headers=headers ) check_url = response.json()["request_check_url"] # Poll for completion for _ in range(300): result = requests.get(check_url, headers=headers).json() if result["status"] == "complete": print(result["markdown"]) break time.sleep(2) ``` The SDK handles polling automatically. For the REST API, you submit a request and poll the `request_check_url` until the status is `complete`. See [SDK Conversion](/docs/welcome/sdk/conversion) for complete SDK documentation. **File limits:** Maximum file size is 200 MB, with up to 7,000 pages per request. See [API Limits](/docs/common/limits) for the full list. 
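The raw `requests` loop in the Quick Start stops only on `complete` and falls through silently if the job fails or never finishes. A minimal sketch of a reusable polling helper is shown here, assuming the response fields documented on this page (`status`, `success`, `error`). The names `poll_until_done` and `fetch_status` are hypothetical, not part of the SDK; `fetch_status` is any zero-argument callable that returns the decoded JSON body.

```python
import time
from typing import Any, Callable, Dict


def poll_until_done(
    fetch_status: Callable[[], Dict[str, Any]],
    max_polls: int = 300,
    poll_interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status() until it reports a terminal state.

    Example fetch_status (assumption, mirrors the Quick Start):
        lambda: requests.get(check_url, headers=headers).json()
    """
    for _ in range(max_polls):
        result = fetch_status()
        status = result.get("status")
        if status == "complete":
            # A completed poll can still carry success == False.
            if result.get("success") is False:
                raise RuntimeError(f"Conversion failed: {result.get('error')}")
            return result
        if status == "failed":
            raise RuntimeError(f"Conversion failed: {result.get('error')}")
        sleep(poll_interval)
    raise TimeoutError(f"No result after {max_polls} polls")


def demo() -> str:
    # Canned responses stand in for real HTTP calls.
    responses = iter([
        {"status": "processing"},
        {"status": "processing"},
        {"status": "complete", "success": True, "markdown": "# Title"},
    ])
    result = poll_until_done(lambda: next(responses), sleep=lambda _: None)
    return result["markdown"]
```

Passing `sleep` as a parameter keeps the helper easy to test without waiting; in normal use the default `time.sleep` applies.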
## Parameters ### Core Parameters | Parameter | Type | Default | Description | | --------------- | ------ | ---------- | --------------------------------------------------- | | `file` | file | - | Document file (multipart upload) | | `file_url` | string | - | URL to document (alternative to file) | | `output_format` | string | `markdown` | Output format: `markdown`, `html`, `json`, `chunks` | | `mode` | string | `fast` | Processing mode (see below) | **Which output format should I use?** * **LLM/RAG pipelines** → `markdown` (default, most compatible) * **Web display** → `html` (preserves visual structure) * **Programmatic access to blocks** → `json` (includes bounding boxes and block types) * **Embedding and search** → `chunks` (pre-chunked for vector databases) ### Processing Modes | Mode | Description | Best For | | ---------- | ----------------------------------------------- | ------------------------------------------------ | | `fast` | Lowest latency, good for simple documents | High-throughput pipelines, simple layouts | | `balanced` | Balance of speed and accuracy **(recommended)** | Most use cases | | `accurate` | Highest accuracy, best for complex layouts | Complex tables, dense layouts, scanned documents | **Which mode should I use?** * **Most use cases** → `balanced` (recommended default) * **Simple, clean PDFs** at high throughput → `fast` * **Scanned documents, complex tables, or dense layouts** → `accurate` ### Page Control | Parameter | Type | Default | Description | | ------------ | ------ | ------- | --------------------------------------------------------------------------------------- | | `max_pages` | int | - | Maximum pages to process | | `page_range` | string | - | Specific pages (e.g., `"0-5,10"`, 0-indexed). For spreadsheets, filters by sheet index. 
|
| `paginate` | bool | `false` | Add page delimiters to output |

### Image Handling

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `disable_image_extraction` | bool | `false` | Don't extract images |
| `disable_image_captions` | bool | `false` | Don't generate image captions |

### Advanced Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `add_block_ids` | bool | `false` | Add `data-block-id` attributes to HTML elements |
| `skip_cache` | bool | `false` | Skip cached results |
| `save_checkpoint` | bool | `false` | Save checkpoint for reuse |
| `word_bboxes` | bool | `false` | Predict per-word bounding boxes with confidence scores. Each word is inlined into HTML output as a `<span>` element (markdown output strips these). Billed at \$0.30 per 1K pages. |
| `extras` | string | - | Comma-separated: `track_changes`, `chart_understanding`, `extract_links`, `table_cell_bboxes`, `list_item_bboxes`, `infographic`, `new_block_types`. (`table_row_bboxes` is deprecated; use `table_cell_bboxes` instead.) |
| `include_markdown_in_chunks` | bool | `false` | Include markdown content in chunks/JSON output |
| `token_efficient_markdown` | bool | `false` | Optimize markdown for LLM token efficiency |
| `fence_synthetic_captions` | bool | `false` | Wrap synthetic image captions in HTML comments |
| `additional_config` | string | - | JSON with extra config (see below) |
| `webhook_url` | string | - | Override webhook URL for this request |
| `processing_location` | string | - | Data residency region override: `"eu"` or `"us"`. When set, use `file_url` or a pre-uploaded `datalab://` reference; multipart uploads are not supported. EU processing carries a regional pricing premium. |

For structured extraction, use the [Extract API](/docs/recipes/structured-extraction/api-overview). For document segmentation, use the [Segment API](/docs/recipes/document-segmentation/auto-segmentation).

The `track_changes` extra is supported on this endpoint. You can also use the dedicated [Track Changes endpoint](/docs/recipes/extract-redlines-and-comments/track-changes-from-word-documents).

### Bounding Box Add-ons

Three add-ons annotate HTML output with spatial coordinates and confidence scores. All are billed at **\$0.30 per 1K pages** each (additive on top of the base conversion rate) and require the `html` output format to expose the attributes.

| Add-on | How to enable | What it annotates |
| --- | --- | --- |
| Word bboxes | `word_bboxes=True` | Every word in the document gets a `data-bbox` and `data-confidence` span in HTML |
| Table cell bboxes | `extras="table_cell_bboxes"` | ``, ``, and ``/`` elements get `data-bbox`/`data-confidence`; also enables `word_bboxes` |
| List item bboxes | `extras="list_item_bboxes"` | Each `<li>` element gets `data-bbox`/`data-confidence`; also enables `word_bboxes` |

```python theme={null}
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

# Get table cell bboxes (also includes word bboxes)
options = ConvertOptions(
    output_format="html",
    extras="table_cell_bboxes,list_item_bboxes",
)
result = client.convert("document.pdf", options=options)
# HTML contains data-bbox and data-confidence on table cells, list items, and words
```

### Additional Config Options

Pass as a JSON string in `additional_config`:

| Key | Type | Description |
| --- | --- | --- |
| `keep_spreadsheet_formatting` | bool | Preserve spreadsheet formatting |
| `keep_pageheader_in_output` | bool | Include page headers |
| `keep_pagefooter_in_output` | bool | Include page footers |

Example:

```python theme={null}
options = ConvertOptions(
    additional_config={
        "keep_spreadsheet_formatting": True,
        "keep_pageheader_in_output": False
    }
)
```

## Response Fields

| Field | Type | Description |
| --- | --- | --- |
| `status` | string | `processing`, `complete`, or `failed` |
| `success` | bool | Whether conversion succeeded |
| `output_format` | string | Requested output format |
| `markdown` | string | Markdown output (if format is markdown) |
| `html` | string | HTML output (if format is html) |
| `json` | object | JSON output (if format is json) |
| `chunks` | object | Chunked output (if format is chunks) |
| `images` | object | Extracted images as `{filename: base64}` |
| `metadata` | object | Document metadata |
| `page_count` | int | Number of pages processed |
| `parse_quality_score` | float | Quality score (0-5) |
| `cost_breakdown` | object | Cost in cents |
| `checkpoint_id` | string | Checkpoint ID (if `save_checkpoint` was true) |
| `error` | string | Error message if failed |

## Examples

### Convert with High Accuracy

```python theme={null}
from datalab_sdk import DatalabClient, ConvertOptions client = DatalabClient() options = ConvertOptions( mode="accurate", output_format="markdown" ) result = client.convert("complex_document.pdf", options=options) print(f"Quality score: {result.parse_quality_score}") print(result.markdown) ``` ### HTML with Block IDs for Citations ```python theme={null} options = ConvertOptions( output_format="html", add_block_ids=True ) result = client.convert("document.pdf", options=options) # HTML elements have data-block-id attributes for citation tracking ``` ### Process Specific Pages ```python theme={null} options = ConvertOptions( page_range="0-4,10,15-20", # Pages 0-4, 10, and 15-20 output_format="markdown" ) result = client.convert("large_document.pdf", options=options) ``` ### Process Specific Sheets from a Spreadsheet For spreadsheet files, `page_range` filters by sheet index (0-based): ```python theme={null} options = ConvertOptions( page_range="0,2", # First and third sheets only output_format="markdown" ) result = client.convert("workbook.xlsx", options=options) ``` ### Extract Track Changes from Word Documents ```python theme={null} options = ConvertOptions( extras="track_changes", output_format="json" ) result = client.convert("document_with_changes.docx", options=options) ``` ## Parse Quality Score Every conversion response includes a `parse_quality_score` (0-5) that indicates how well the document was parsed: | Score Range | Quality | Recommended Action | | ----------- | --------- | -------------------------------------------------- | | 4.0 - 5.0 | Excellent | Use the output directly | | 3.0 - 3.9 | Good | Review for minor issues | | 2.0 - 2.9 | Fair | Consider retrying with `accurate` mode | | 0.0 - 1.9 | Poor | Retry with `accurate` mode or check the input file | Use quality scores to build automated quality gates: ```python theme={null} result = client.convert("document.pdf", options=ConvertOptions(mode="balanced")) if result.parse_quality_score < 3.0: # Retry 
with higher accuracy result = client.convert("document.pdf", options=ConvertOptions(mode="accurate")) ``` Use quality scores to gate pipeline execution or route documents to different processing configurations. ## Checkpoints Save a processing checkpoint to reuse parsed results for extraction or segmentation without re-processing: ```python theme={null} from datalab_sdk import DatalabClient, ConvertOptions, ExtractOptions import json client = DatalabClient() # Step 1: Convert and save checkpoint options = ConvertOptions( save_checkpoint=True, output_format="markdown" ) result = client.convert("document.pdf", options=options) checkpoint_id = result.checkpoint_id # Step 2: Use checkpoint for extraction (no re-processing needed) extraction_options = ExtractOptions( page_schema=json.dumps({"type": "object", "properties": {"title": {"type": "string"}}}), checkpoint_id=checkpoint_id ) extract_result = client.extract("document.pdf", options=extraction_options) ``` Checkpoints save time and cost when you need to run multiple operations (extraction, segmentation) on the same document. Results are deleted from Datalab servers one hour after processing completes. Retrieve your results promptly. ## Next Steps Extract structured data from documents using JSON schemas Process multiple documents concurrently Split multi-document PDFs into segments Get notified when conversions complete # Create Document Source: https://documentation.datalab.to/docs/recipes/create-document/create-document-api-overview Generate DOCX files from markdown with track changes support. Convert markdown to Word documents (DOCX) with support for track changes, insertions, deletions, and comments. This is useful for generating legal documents, contracts with redlines, and collaborative review documents. ## Quick Start ```python Python SDK theme={null} from datalab_sdk import DatalabClient client = DatalabClient() markdown = ( "# Contract\n\n" "This agreement is between " '' "Acme Corp and the client." 
) result = client.create_document(markdown=markdown) result.save_output("contract") # saves contract.docx print(f"Document created: {result.page_count} page(s)") ``` ```bash cURL theme={null} # Submit request curl -X POST https://www.datalab.to/api/v1/create-document \ -H "X-API-Key: $DATALAB_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "markdown": "# Contract\n\nThis agreement is between Acme Corp and the client.", "output_format": "docx" }' # Poll for results (use request_check_url from response) curl https://www.datalab.to/api/v1/create-document/REQUEST_ID \ -H "X-API-Key: $DATALAB_API_KEY" ``` ```python Python (requests) theme={null} import requests, json, time, base64, os headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")} # Submit request response = requests.post( "https://www.datalab.to/api/v1/create-document", json={ "markdown": "# Contract\n\nThis agreement is between " '' "Acme Corp and the client.", "output_format": "docx" }, headers=headers ) check_url = response.json()["request_check_url"] # Poll for results while True: result = requests.get(check_url, headers=headers).json() if result["status"] == "complete": docx_bytes = base64.b64decode(result["output_base64"]) with open("contract.docx", "wb") as f: f.write(docx_bytes) print("Document saved as contract.docx") break elif result.get("error"): print(f"Error: {result['error']}") break time.sleep(2) ``` ## SDK Usage Use `client.create_document()` to create a DOCX from markdown: ```python theme={null} from datalab_sdk import DatalabClient client = DatalabClient() result = client.create_document( markdown="# Title\n\nDocument content here.", output_format="docx", # Only 'docx' is supported webhook_url=None, # Optional completion webhook save_output="output/doc", # Optional: saves output.docx automatically ) print(result.success) # True if creation succeeded print(result.page_count) # Number of pages print(result.cost_breakdown) # Cost details result.save_output("output/contract") # Saves 
contract.docx ``` ### SDK Method Parameters | Parameter | Type | Default | Description | | --------------- | -------- | ------------ | --------------------------------------------------- | | `markdown` | str | **Required** | Markdown content with optional track changes markup | | `output_format` | str | `"docx"` | Output format (only `"docx"` is supported) | | `webhook_url` | str | None | Optional webhook URL for completion notification | | `save_output` | str/Path | None | File path to save the output DOCX | | `max_polls` | int | `300` | Maximum polling attempts | | `poll_interval` | int | `1` | Seconds between polls | ### SDK Result Fields | Field | Type | Description | | ---------------- | ----- | ----------------------------------- | | `success` | bool | Whether document creation succeeded | | `status` | str | `"complete"` when done | | `output_format` | str | `"docx"` | | `output_base64` | str | Base64-encoded DOCX file | | `runtime` | float | Processing time in seconds | | `page_count` | int | Pages in the generated document | | `cost_breakdown` | dict | Cost details | | `error` | str | Error message if creation failed | ## How It Works 1. Send markdown content with optional track changes markup 2. The API converts it to a DOCX file with proper Word formatting 3. Track changes tags become native Word revision marks 4. 
The DOCX file is returned as a base64-encoded string ## Track Changes Markup ### Insertions Mark inserted text with `` tags: ```html theme={null} newly added text ``` | Attribute | Required | Description | | ------------------------ | -------- | ------------------------------------------------- | | `data-revision-author` | Yes | Author name for the insertion | | `data-revision-datetime` | Yes | ISO 8601 timestamp (e.g., `2024-01-15T10:00:00Z`) | ### Deletions Mark deleted text with `` tags: ```html theme={null} removed text ``` | Attribute | Required | Description | | ------------------------ | -------- | ---------------------------- | | `data-revision-author` | Yes | Author name for the deletion | | `data-revision-datetime` | Yes | ISO 8601 timestamp | ### Comments Add comments with `` tags: ```html theme={null} annotated text ``` | Attribute | Required | Description | | ----------------------- | -------- | ----------------------------------------------------- | | `data-comment-author` | Yes | Author/reviewer name | | `text` | Yes | The comment text | | `data-comment-datetime` | No | ISO 8601 timestamp (defaults to current time) | | `data-comment-initial` | No | Author initials (auto-generated from name if omitted) | ## Parameters | Parameter | Type | Required | Description | | --------------- | ------ | -------- | --------------------------------------------------- | | `markdown` | string | Yes | Markdown content with optional track changes markup | | `output_format` | string | No | Output format (currently only `docx` is supported) | | `webhook_url` | string | No | Webhook URL to notify when processing completes | ## Response The response follows the standard async pattern — submit, then poll: **Initial response:** ```json theme={null} { "success": true, "request_id": "abc123", "request_check_url": "https://www.datalab.to/api/v1/create-document/abc123" } ``` **Final response (when polling):** | Field | Type | Description | | ---------------- | ------ | 
----------------------------------- | | `status` | string | `processing` or `complete` | | `success` | bool | Whether document creation succeeded | | `output_format` | string | `docx` | | `output_base64` | string | Base64-encoded DOCX file | | `runtime` | float | Processing time in seconds | | `page_count` | int | Pages in the generated document | | `cost_breakdown` | object | Cost details | | `error` | string | Error message if creation failed | ## Full Example A contract with insertions, deletions, and reviewer comments: ```python theme={null} import requests, json, time, base64, os headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")} markdown = """# Service Agreement ## Parties This agreement is between Acme Corporation ("Provider") and the ClientGlobalTech Inc. ("Client"). ## Terms The service period begins on January 1, 2025 and continues for 1224 months. ## Payment The total contract value is $150,000 payable in quarterly installments. """ response = requests.post( "https://www.datalab.to/api/v1/create-document", json={"markdown": markdown, "output_format": "docx"}, headers=headers ) check_url = response.json()["request_check_url"] while True: result = requests.get(check_url, headers=headers).json() if result["status"] == "complete": docx_bytes = base64.b64decode(result["output_base64"]) with open("service_agreement.docx", "wb") as f: f.write(docx_bytes) print(f"Document saved ({result['page_count']} pages)") break time.sleep(2) ``` The generated DOCX file opens in Microsoft Word with native track changes visible, allowing reviewers to accept or reject each change. 
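The examples above decode `output_base64` and write it straight to disk. A small helper can add two checks first: that the job actually succeeded, and that the decoded bytes look like a DOCX file. The sketch below relies on the fact that a DOCX file is a ZIP container; `save_docx` is a hypothetical helper name, and the field names are the ones listed in the response table on this page.

```python
import base64
import io
import zipfile
from pathlib import Path
from typing import Any, Dict


def save_docx(result: Dict[str, Any], path: str) -> Path:
    """Decode output_base64 from a completed response and write it to path."""
    if result.get("status") != "complete" or not result.get("success"):
        raise RuntimeError(f"Document creation failed: {result.get('error')}")

    docx_bytes = base64.b64decode(result["output_base64"])

    # A DOCX file is a ZIP archive, so a cheap sanity check is to confirm
    # the payload parses as one before writing it out.
    if not zipfile.is_zipfile(io.BytesIO(docx_bytes)):
        raise ValueError("Decoded payload is not a ZIP/DOCX container")

    out_path = Path(path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_bytes(docx_bytes)
    return out_path
```

Called as `save_docx(result, "output/service_agreement.docx")` once polling returns, this also creates the output directory if it does not exist yet.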
## Use Cases

* **Legal document generation**: create contracts with tracked revisions
* **Contract redlining**: mark up agreements with insertions and deletions
* **Collaborative review**: add reviewer comments to documents
* **Document automation**: generate Word documents from templates with dynamic content

## Next Steps

* Extract track changes from existing Word documents
* Convert documents to markdown, HTML, or JSON
* Get notified when document creation completes
* Chain processors into versioned, reusable pipelines

# Document Segmentation

Source: https://documentation.datalab.to/docs/recipes/document-segmentation/auto-segmentation

Automatically split multi-document PDFs into separate segments.

Automatically identify and split PDFs that contain multiple documents (like batch-scanned files) into their component parts.

**Before you begin**, make sure you have:

1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits)
2. Python 3.10+ installed
3. The Datalab SDK: `pip install datalab-python-sdk`
4. Your `DATALAB_API_KEY` environment variable set

**Building for production?** Use [Pipelines](/docs/recipes/pipelines/pipeline-overview) to chain processors, version your configuration, and deploy with a single API call.
## Quick Start

```python Python SDK theme={null}
import json
from datalab_sdk import DatalabClient, SegmentOptions

client = DatalabClient()

# Define segmentation schema
segmentation_schema = {
    "segments": []
}

options = SegmentOptions(
    segmentation_schema=json.dumps(segmentation_schema),
    mode="balanced"
)

result = client.segment("combined_documents.pdf", options=options)

# Access segmentation results
for segment in result.segmentation_results["segments"]:
    print(f"{segment['name']}: pages {segment['pages']}")
```

```bash cURL theme={null}
curl -X POST https://www.datalab.to/api/v1/segment \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@combined_documents.pdf" \
  -F "output_format=markdown" \
  -F "mode=balanced" \
  -F 'segmentation_schema={"segments": []}'
```

```python Python (requests) theme={null}
import requests
import json
import time

API_KEY = "YOUR_API_KEY"
headers = {"X-API-Key": API_KEY}

# Submit segmentation request
with open("combined.pdf", "rb") as f:
    response = requests.post(
        "https://www.datalab.to/api/v1/segment",
        files={"file": ("combined.pdf", f, "application/pdf")},
        data={
            "output_format": "markdown",
            "mode": "balanced",
            "segmentation_schema": json.dumps({"segments": []})
        },
        headers=headers
    )

check_url = response.json()["request_check_url"]

# Poll for results
while True:
    result = requests.get(check_url, headers=headers).json()
    if result["status"] == "complete":
        segments = result["segmentation_results"]["segments"]
        for seg in segments:
            print(f"{seg['name']}: pages {seg['pages']}")
        break
    elif result["status"] == "failed":
        print(f"Error: {result.get('error')}")
        break
    time.sleep(2)
```

## When to Use

Segmentation is useful when:

* Batch-scanned documents are combined into a single PDF
* Multiple document types are stapled together
* You need to apply different processing to different sections

## Response Format

```json theme={null}
{
  "segmentation_results": {
    "segments": [
      {
        "name": "Research Paper",
        "pages": [0, 1, 2],
        "confidence": "medium"
      },
      {
        "name": "Invoice",
        "pages": [3, 4],
        "confidence": "high"
      }
    ],
    "metadata": {
      "total_pages": 5,
      "segmentation_method": "auto_detected"
    }
  }
}
```

## Process Each Segment

After segmentation, process each segment separately:

```python theme={null}
import json
from datalab_sdk import DatalabClient, SegmentOptions, ExtractOptions

client = DatalabClient()

# First, get segments
seg_options = SegmentOptions(
    segmentation_schema=json.dumps({"segments": []}),
    mode="balanced"
)
result = client.segment("combined.pdf", options=seg_options)

# Process each segment with appropriate schema using the Extract API
extraction_schemas = {
    "Invoice": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "total": {"type": "number"}
        }
    },
    "Contract": {
        "type": "object",
        "properties": {
            "parties": {"type": "array", "items": {"type": "string"}},
            "effective_date": {"type": "string"}
        }
    }
}

extracted_data = {}
for segment in result.segmentation_results["segments"]:
    segment_name = segment["name"]
    pages = segment["pages"]
    schema = extraction_schemas.get(segment_name)

    if schema:
        # Build page range string
        page_range = ",".join(str(p) for p in pages)

        options = ExtractOptions(
            page_schema=json.dumps(schema),
            page_range=page_range,
            mode="balanced"
        )
        seg_result = client.extract("combined.pdf", options=options)
        extracted_data[segment_name] = json.loads(seg_result.extraction_schema_json)

print(extracted_data)
```

## Using Checkpoints

If you already converted a document with `save_checkpoint=True` using the [Convert API](/docs/recipes/conversion/conversion-api-overview), pass the `checkpoint_id` to `SegmentOptions` to skip re-parsing. This saves time and cost when running segmentation on a previously converted document.
```python theme={null}
from datalab_sdk import DatalabClient, ConvertOptions, SegmentOptions
import json

client = DatalabClient()

# Step 1: Convert and save checkpoint
convert_result = client.convert("combined.pdf", options=ConvertOptions(save_checkpoint=True))
checkpoint_id = convert_result.checkpoint_id

# Step 2: Segment using checkpoint (no re-parsing needed)
options = SegmentOptions(
    segmentation_schema=json.dumps({"segments": []}),
    checkpoint_id=checkpoint_id
)
result = client.segment("combined.pdf", options=options)
```

## Custom Segmentation Schema

Define expected segment types for better accuracy:

```python theme={null}
segmentation_schema = {
    "segments": [
        {"type": "invoice", "description": "Invoice or billing document"},
        {"type": "contract", "description": "Legal contract or agreement"},
        {"type": "receipt", "description": "Payment receipt"}
    ]
}
```

## Next Steps

* Extract structured data from document segments using JSON schemas.
* Tips for TOC-based segmentation on documents with 50+ pages.
* Convert documents to Markdown, HTML, JSON, or chunks.
* Chain processors into versioned, reusable pipelines.

# Track Changes in Word Docs
Source: https://documentation.datalab.to/docs/recipes/extract-redlines-and-comments/track-changes-from-word-documents

Pull tracked changes and comments from Word documents for review workflows

If you're working with legal documents, contracts, or any collaborative review process, you know how painful it is to manually track all the changes, comments, and revisions in Word documents. This guide shows you how to extract all that markup programmatically using the Track Changes API.

# Overview

The Track Changes API extracts:

* Tracked changes: insertions and deletions with author names and timestamps
* Comments: all margin comments with author details

This lets you pull the full revision history of your Word docs into clean HTML and Markdown.
`track_changes` is perfect for legal workflows where you need to:

* Generate redline summaries for clients
* Identify all changes made by specific parties
* Extract action items from comments
* Analyze negotiation patterns across contract versions
* Create audit trails of document revisions

Submit your Word document to the dedicated Track Changes endpoint. The output will be provided in Markdown and HTML format by default, with all tracked changes and comments preserved in the markup.

# Quick Start (SDK)

The simplest way to extract tracked changes is with the Python SDK:

```python theme={null}
from datalab_sdk import DatalabClient, TrackChangesOptions

client = DatalabClient()
options = TrackChangesOptions(output_format="markdown,html,chunks")
result = client.track_changes("contract.docx", options=options)
print(result.markdown)
```

# Making the API Request

Here's how to submit a Word document and extract its tracked changes using the REST API:

```python theme={null}
import requests
import time
import os

API_URL = "https://www.datalab.to/api/v1/track-changes"
API_KEY = os.getenv("DATALAB_API_KEY")

def extract_tracked_changes(docx_path, output_format='html,markdown'):
    """
    Extract tracked changes and comments from a Word document.

    Args:
        docx_path: Path to the .docx file
        output_format: 'html', 'markdown', or 'html,markdown'

    Returns:
        Dictionary with the converted content including tracked changes
    """
    with open(docx_path, 'rb') as f:
        form_data = {
            'file': (docx_path, f, 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'),
            'output_format': (None, output_format),
            'paginate': (None, False)  # Set to True if you want page breaks
        }

        headers = {"X-API-Key": API_KEY}
        response = requests.post(API_URL, files=form_data, headers=headers)
        data = response.json()

    # Poll for completion
    check_url = data["request_check_url"]
    max_polls = 300  # Set longer if needed

    for i in range(max_polls):
        time.sleep(2)
        response = requests.get(check_url, headers=headers)
        result = response.json()

        if result["status"] == "complete":
            return result
        elif result["status"] == "failed":
            raise Exception(f"Conversion failed: {result.get('error')}")

    raise TimeoutError("Conversion did not complete in time")
```

The response will contain your document with all tracked changes preserved. Here's what the markup looks like:

* Insertions: `new text`
* Deletions: `old text`
* Comments: `marked text`

This markup will appear in both HTML and Markdown output.

# Analyzing Changes with LLMs

Once you have the extracted markup, you can use an LLM to analyze the changes. Here's an example using OpenRouter to generate a redline summary:

```python theme={null}
import requests
import os

OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
OPENROUTER_MODEL = os.getenv("OPENROUTER_MODEL")

def analyze_changes_with_llm(marked_up_content, analysis_type='summary'):
    """
    Use an LLM via OpenRouter to analyze tracked changes.

    Args:
        marked_up_content: The HTML or Markdown with tracked changes
        analysis_type: Type of analysis ('summary', 'risks', 'by_author', etc.)

    Returns:
        LLM analysis of the changes
    """
    prompts = {
        'summary': """Analyze this contract with tracked changes and provide:
1. A concise summary of all changes made
2. Key changes that materially affect the agreement
3. Any changes that shift risk or obligations between parties
4. Recommended action items for legal review

Document with tracked changes:
{content}""",

        'by_author': """Review this document with tracked changes and create a report organized by author:
- List each author's changes
- Categorize changes as substantive vs. stylistic
- Highlight any conflicting changes between authors

Document:
{content}""",

        'risks': """Analyze this contract's tracked changes for potential legal risks:
- Identify changes that increase liability or obligations
- Flag any deletions of protective language
- Note additions that could be problematic
- Assess the overall risk profile of the revisions

Document:
{content}"""
    }

    prompt = prompts.get(analysis_type, prompts['summary']).format(content=marked_up_content)

    response = requests.post(
        url="https://openrouter.ai/api/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {OPENROUTER_API_KEY}",
            "Content-Type": "application/json"
        },
        json={
            "model": OPENROUTER_MODEL,
            "messages": [
                {
                    "role": "user",
                    "content": prompt
                }
            ]
        }
    )

    return response.json()['choices'][0]['message']['content']

# Example usage
result = extract_tracked_changes('nda_draft_v3.docx', output_format='html')
marked_up_doc = result['html']

# Generate different types of analysis
summary = analyze_changes_with_llm(marked_up_doc, 'summary')
risk_analysis = analyze_changes_with_llm(marked_up_doc, 'risks')
by_author = analyze_changes_with_llm(marked_up_doc, 'by_author')

print("Change Summary:")
print(summary)
print("\n" + "="*80 + "\n")
print("Risk Analysis:")
print(risk_analysis)
```

# Pagination

For longer documents, you may want to preserve page breaks in the output so you can split them.
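As a rough sketch of that splitting step for Markdown output: with pagination enabled, each page is preceded by a separator line made of the page number in braces followed by a run of dashes (the format is shown later in this section). The helper name `split_paginated_markdown` and the exact regular expression are assumptions for illustration, not part of the Datalab SDK.

```python
import re

# Assumed separator shape: "{<page number>}" followed by a run of dashes,
# alone on its line, e.g. "{0}------------------------------------------------".
PAGE_SEPARATOR = re.compile(r"^\{(\d+)\}-{3,}[ \t]*$", re.MULTILINE)


def split_paginated_markdown(markdown: str) -> dict[int, str]:
    """Split paginated Markdown into a {page_number: page_text} mapping."""
    pages: dict[int, str] = {}
    # With one capture group, re.split returns:
    # [text_before_first_separator, page_no, page_text, page_no, page_text, ...]
    parts = PAGE_SEPARATOR.split(markdown)
    for i in range(1, len(parts) - 1, 2):
        pages[int(parts[i])] = parts[i + 1].strip()
    return pages


# Hypothetical two-page document for demonstration.
sample = (
    "{0}" + "-" * 48 + "\n# Title\n\nFirst page.\n\n"
    "{1}" + "-" * 48 + "\nSecond page.\n"
)
pages = split_paginated_markdown(sample)
for number, text in pages.items():
    print(number, repr(text))
```

Any text before the first separator is ignored here; keep it if your documents can start without one.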
Set `paginate` to `True` in your request:

```python theme={null}
form_data = {
    'file': (docx_path, f, 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'),
    'output_format': (None, 'html'),
    'paginate': (None, True)  # Enable pagination
}
```

**For Markdown output**, each page will be preceded by a horizontal rule containing the page number:

```
{0}------------------------------------------------

{1}------------------------------------------------
```

**For HTML output**, each page will be wrapped in a div with the page number:

```html theme={null}
```

This makes it easy to process documents page-by-page or display them with proper pagination in your UI.

# Full Code Sample

Here's a complete example that extracts tracked changes and generates a legal review summary:

```python theme={null}
import os
import requests
import time
import json

API_URL = "https://www.datalab.to/api/v1/track-changes"
DATALAB_API_KEY = os.getenv("DATALAB_API_KEY")
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")

def extract_tracked_changes(docx_path, output_format='html', paginate=False):
    """Extract tracked changes from a Word document."""
    with open(docx_path, 'rb') as f:
        form_data = {
            'file': (os.path.basename(docx_path), f, 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'),
            'output_format': (None, output_format),
            'paginate': (None, paginate)
        }

        headers = {"X-API-Key": DATALAB_API_KEY}
        response = requests.post(API_URL, files=form_data, headers=headers)
        data = response.json()

    # Poll for completion
    check_url = data["request_check_url"]
    max_polls = 300

    for i in range(max_polls):
        time.sleep(2)
        response = requests.get(check_url, headers=headers)
        result = response.json()

        if result["status"] == "complete":
            return result
        elif result["status"] == "failed":
            raise Exception(f"Conversion failed: {result.get('error')}")

    raise TimeoutError("Conversion did not complete in time")

def analyze_with_llm(content, prompt_template):
    """Send content to LLM for analysis via OpenRouter."""
    response = requests.post(
        url="https://openrouter.ai/api/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {OPENROUTER_API_KEY}",
            "Content-Type": "application/json"
        },
        json={
            "model": "anthropic/claude-3.5-sonnet",
            "messages": [
                {
                    "role": "user",
                    "content": prompt_template.format(content=content)
                }
            ]
        }
    )

    return response.json()['choices'][0]['message']['content']

def generate_legal_review(docx_path):
    """
    Complete workflow: extract tracked changes and generate legal review.
    """
    print(f"Processing {docx_path}...")

    # Extract tracked changes
    result = extract_tracked_changes(docx_path, output_format='html', paginate=True)
    marked_up_doc = result['html']

    print("Document converted with tracked changes preserved.")

    # Generate comprehensive legal review
    review_prompt = """You are a legal reviewer analyzing a contract with tracked changes.

Please provide:

1. **Executive Summary**: Brief overview of the document and key changes
2. **Material Changes**: List substantive changes that affect rights, obligations, or liabilities
3. **Risk Assessment**: Identify any changes that increase risk exposure
4. **Comments Analysis**: Summarize unresolved comments and action items
5. **Recommendations**: Specific next steps for legal review

Document with tracked changes:
{content}"""

    print("\nGenerating legal review with LLM...")
    review = analyze_with_llm(marked_up_doc, review_prompt)

    # Also generate author-specific analysis
    author_prompt = """Analyze this document's tracked changes by author.

For each author who made changes:
- Total number of insertions and deletions
- Types of changes (substantive vs. editorial)
- Key themes in their revisions
- Any patterns in their negotiation strategy

Document:
{content}"""

    print("Generating per-author analysis...")
    author_analysis = analyze_with_llm(marked_up_doc, author_prompt)

    return {
        'marked_up_document': marked_up_doc,
        'legal_review': review,
        'author_analysis': author_analysis
    }

if __name__ == "__main__":
    # Process a contract with tracked changes
    results = generate_legal_review('contract_redline_v3.docx')

    # Save results
    with open('legal_review.txt', 'w') as f:
        f.write("LEGAL REVIEW\n")
        f.write("="*80 + "\n\n")
        f.write(results['legal_review'])
        f.write("\n\n" + "="*80 + "\n\n")
        f.write("AUTHOR ANALYSIS\n")
        f.write("="*80 + "\n\n")
        f.write(results['author_analysis'])

    with open('marked_up_document.html', 'w') as f:
        f.write(results['marked_up_document'])

    print("\nReview complete! Results saved to:")
    print("  - legal_review.txt")
    print("  - marked_up_document.html")
```

## Next Steps

* Explore the full conversion API and output format options.
* Extract structured data from documents using JSON schemas.
* Chain processors into versioned, reusable pipelines.
* Use the Python SDK for simpler document conversion workflows.

# File Upload
Source: https://documentation.datalab.to/docs/recipes/file-management/file-upload-api

Upload and manage files for use in pipelines and document processing.

Upload files to Datalab storage and reference them across API calls and pipelines.

## SDK Usage

The SDK handles the upload flow automatically:

```python theme={null}
from datalab_sdk import DatalabClient

client = DatalabClient()

# Upload a single file
file = client.upload_files("document.pdf")
print(f"Uploaded: {file.reference}")  # datalab://file-abc123

# Upload multiple files
files = client.upload_files(["doc1.pdf", "doc2.pdf", "doc3.pdf"])
for f in files:
    print(f"{f.original_filename}: {f.reference}")
```

### Use in Pipelines

```python theme={null}
# Upload files
files = client.upload_files(["invoice1.pdf", "invoice2.pdf"])

# Use in pipeline
for f in files:
    execution = client.run_pipeline("pl_abc123", file_url=f.reference)
```

### File Management

```python theme={null}
# List files
result = client.list_files(limit=50)
for file in result['files']:
    print(f"{file.original_filename}: {file.file_size} bytes")

# Get metadata
file = client.get_file_metadata(123)

# Get download URL
download = client.get_file_download_url(file_id=123, expires_in=3600)
print(download['download_url'])

# Delete file
client.delete_file(123)
```

See [SDK File Management](/docs/welcome/sdk/file-management) for complete documentation.

## REST API

The upload flow has three steps:

### 1. Request Upload URL

```bash theme={null}
curl -X POST https://www.datalab.to/api/v1/files/upload \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"filename": "document.pdf", "content_type": "application/pdf"}'
```

To store the file in EU infrastructure, add `"processing_location": "eu"` to the request body:

```bash theme={null}
curl -X POST https://www.datalab.to/api/v1/files/upload \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"filename": "document.pdf", "content_type": "application/pdf", "processing_location": "eu"}'
```

Response:

```json theme={null}
{
  "file_id": 123,
  "upload_url": "https://presigned-url...",
  "expires_in": 3600,
  "reference": "datalab://file-abc123"
}
```

### 2. Upload File

```bash theme={null}
curl -X PUT "{upload_url}" \
  -H "Content-Type: application/pdf" \
  --data-binary @document.pdf
```

### 3. Confirm Upload

```bash theme={null}
curl https://www.datalab.to/api/v1/files/123/confirm \
  -H "X-API-Key: YOUR_API_KEY"
```

### Complete Python Example

```python theme={null}
import requests

API_KEY = "YOUR_API_KEY"
headers = {"X-API-Key": API_KEY}

# Step 1: Request upload URL
response = requests.post(
    "https://www.datalab.to/api/v1/files/upload",
    json={"filename": "document.pdf", "content_type": "application/pdf"},
    headers=headers
)
data = response.json()
file_id = data["file_id"]
upload_url = data["upload_url"]
reference = data["reference"]

# Step 2: Upload file
with open("document.pdf", "rb") as f:
    requests.put(upload_url, data=f, headers={"Content-Type": "application/pdf"})

# Step 3: Confirm upload
requests.get(f"https://www.datalab.to/api/v1/files/{file_id}/confirm", headers=headers)

print(f"File ready: {reference}")
```

## File Management API

### List Files

```bash theme={null}
GET /api/v1/files?limit=50&offset=0
```

### Get File Metadata

```bash theme={null}
GET /api/v1/files/{file_id}
```

### Get Download URL

```bash theme={null}
GET /api/v1/files/{file_id}/download?expires_in=3600
```

### Delete File

```bash theme={null}
DELETE /api/v1/files/{file_id}
```

## Using File References

Once uploaded, use `datalab://file-{id}` references in any API call:

```python theme={null}
import json
import requests

# Assumes `headers` (with your X-API-Key) and `field_data` are already defined

# In Convert API
response = requests.post(
    "https://www.datalab.to/api/v1/convert",
    data={
        "file_url": "datalab://file-abc123",
        "output_format": "markdown",
        "mode": "balanced"
    },
    headers=headers
)

# In Form Filling API
response = requests.post(
    "https://www.datalab.to/api/v1/fill",
    data={
        "file_url": "datalab://file-abc123",
        "field_data": json.dumps(field_data)
    },
    headers=headers
)
```

## Limits

| Limit               | Value                |
| ------------------- | -------------------- |
| Maximum file size   | 200 MB               |
| Upload URL expiry   | 1 hour               |
| Download URL expiry | 1 minute to 24 hours |

See [API Limits](/docs/common/limits) for complete details.

Get started with our API in less than a minute. We include free credits.

# Forge Evals
Source: https://documentation.datalab.to/docs/recipes/forge-evals/overview

Compare parsing configurations across multiple documents to find the best settings for your use case

Forge Evals is a powerful tool for evaluating and comparing different parsing configurations across multiple documents. Use it to determine which settings work best for your specific document types and use cases.

## What is Forge Evals?

Forge Evals allows you to:

* Upload up to 10 documents at once
* Test up to 5 different parsing configurations simultaneously
* Compare results side-by-side with visual diff highlighting
* Identify the optimal parsing settings for your document types

This is particularly useful when you need to:

* Determine which parsing mode (Fast, Balanced, or Accurate) works best for your documents
* Evaluate special features like Track Changes or Chart Understanding
* Compare parsing results across different document types
* Optimize for speed vs. accuracy trade-offs

## Getting started

Access Forge Evals at [https://www.datalab.to/app/evals](https://www.datalab.to/app/evals)

### Step 1: Upload documents

Upload the documents you want to evaluate. You can:

* Drag and drop files directly into the upload zone
* Click to browse and select files
* Upload up to 10 documents per evaluation session

**Supported formats:** PDF, DOCX, XLSX, PPTX, images, and more. See [supported file types](/docs/common/supportedfiletypes) for the complete list.

Spreadsheet files (XLS, XLSX, CSV, ODS) are processed automatically without additional configuration options.

### Step 2: Select configurations

Choose which parsing configurations to test. Configurations are organized into three tabs:

#### Datalab tab

Select from Datalab's preset configurations or create custom ones:

**Preset configurations:**

* **Fast Mode**: Lowest latency, great for real-time use cases
* **Balanced Mode**: Balanced accuracy and latency, works well with most documents
* **Accurate Mode**: Highest accuracy and latency, good for complex documents
* **Track Changes**: Extract tracked changes from DOCX files (DOCX only)
* **Chart Understanding**: Extract data from charts and graphs

**Custom configurations:**

Create custom configurations to test specific combinations of:

* Processing mode (Fast, Balanced, or Accurate)
* Page range selection
* Special features (Track Changes, Chart Understanding)
* Output options (pagination, headers, footers)
* Run count (1-3×): Run the same configuration multiple times to test consistency

Track Changes only works with DOCX files. The grid will show "N/A" for incompatible document/configuration combinations.

#### Other Models tab

Compare Datalab against other open source models hosted on our infrastructure:

* **OlmoOCR**
* **RolmoOCR**
* **DotsOCR**
* **DeepSeekOCR**

These models are hosted by Datalab and don't require any API credentials.
Because Datalab models receive additional optimizations when hosted on our managed API, a fair timing comparison is difficult, so we omit timing numbers for the other hosted models to avoid confusion. If you'd like to see additional models or want help with custom evals / timings, contact us at [support@datalab.to](mailto:support@datalab.to).

#### External Providers tab

Access to external providers is currently limited to select users. If you're actively evaluating Datalab against other providers, [contact us](mailto:support@datalab.to) to request access.

You can also use Evals to compare Datalab outputs to other proprietary document processing providers. Get in touch to enable this.

### Step 3: Run evaluation

Click "Start Evaluation" to begin processing. The system will:

1. Process each document with each selected configuration
2. Display progress in a grid view
3. Show completion status and processing time for each run

You can:

* Monitor progress in real-time
* Cancel all runs if needed
* Retry failed runs

### Step 4: Compare results

Once runs complete, click any two cells in the grid to compare their results side-by-side.
The comparison view shows:

* **Parallel view**: Full documents side-by-side with inline diff highlighting
* **Multiple output formats**: Switch between Markdown, HTML, JSON, and Chunks
* **Rendered output**: Toggle between raw and rendered views for HTML, Markdown, and JSON formats
* **Visual diffs**: When enabled with rendered output, see word-level highlighting of changes
* **JSON visualization**: View JSON output with document thumbnails and bounding boxes overlaid
* **Processing metrics**: Duration and configuration details for each run
* **Diff statistics**: Lines added, removed, and changed

#### Viewing modes

* **Raw view**: See the original output text with line numbers
* **Rendered view**: View formatted HTML/Markdown or visualized JSON with thumbnails
* **Diff view**: Compare outputs with line-by-line or word-level highlighting
* **Rendered diff**: Combine rendered output with word-level diff highlighting (HTML/Markdown only)

Rendered diff view is only available for HTML and Markdown formats. JSON rendered view shows bounding boxes but does not support diff highlighting.

Use the "Switch Runs" button to select different runs for comparison without leaving the comparison view.

## Visualization features

### Rendered output

Toggle the "Render" button to view formatted output instead of raw text:

* **HTML/Markdown**: See the fully rendered document with proper formatting, including math equations rendered with MathJax
* **JSON**: View document thumbnails with bounding boxes overlaid on detected blocks (text, tables, figures, etc.)
### Diff highlighting

When comparing two runs, enable "Show Diff" to see differences:

* **Raw diff**: Line-by-line comparison with added/removed lines highlighted
* **Rendered diff**: Word-level highlighting within rendered HTML/Markdown output, preserving formatting and math rendering

The rendered diff view intelligently highlights:

* Changed paragraphs with block-level highlighting
* Specific changed words within modified paragraphs
* Preserved math equations with accurate semantic comparison

Rendered diff is not available for JSON format. Use raw diff view to compare JSON outputs.

### Multiple iterations

When a configuration is set to run multiple times (2× or 3×), each iteration appears as a separate column in the grid (e.g., "Accurate #1", "Accurate #2"). This allows you to:

* Compare consistency across multiple runs of the same configuration
* Identify variability in parsing results
* Validate that your configuration produces stable outputs

## Excluding runs

Right-click any cell in the grid to exclude that specific document/configuration combination from running. This is useful when:

* You know certain configurations won't work for specific documents
* You want to reduce the total number of runs
* You need to focus on specific comparisons

Excluded cells appear with a yellow background and can be re-included by clicking them again.

## Best practices

### Choosing configurations

* Start with the three preset modes (Fast, Balanced, Accurate) to establish a baseline
* Add Track Changes if you're working with DOCX files that contain revisions
* Add Chart Understanding if your documents contain charts or graphs
* Create custom configurations to test specific parameter combinations

### Document selection

* Include representative samples of your document types
* Test edge cases (complex layouts, mixed content, etc.)
* Keep document count manageable (3-5 documents is often sufficient)

### Interpreting results

* Compare processing times to understand speed/accuracy trade-offs
* Use the diff view to identify where configurations produce different outputs
* Toggle between raw and rendered views to see formatted output
* Use rendered diff view for word-level highlighting of changes in HTML/Markdown
* Visualize JSON output with bounding boxes to see document structure
* Pay attention to "N/A" cells indicating incompatible combinations
* Look for patterns across similar document types
* Run configurations multiple times (using run count) to test consistency and identify variability

## Limitations

* Maximum 10 documents per evaluation session
* Maximum 5 run configurations per session
* Maximum 3 iterations per configuration
* Track Changes feature only works with DOCX files
* Spreadsheet files use automatic configuration (no mode selection)
* Rendered diff view only available for HTML and Markdown formats
* External provider access is limited to select users (contact us for access)

## Custom evaluations

For larger document sets or custom evaluation needs, [contact us](https://www.datalab.to/contact) to discuss enterprise evaluation options.

## Next Steps

* Dive into Marker's conversion API to configure the settings you evaluated.
* Extract structured data from documents using JSON schemas.
* Automatically fill PDF forms with extracted data.
* Get up and running with the Datalab API in minutes.

# Form Filling
Source: https://documentation.datalab.to/docs/recipes/form-filling/form-filling-api-overview

Automatically fill PDF and image forms with structured data.

The form filling API fills PDF and image forms with your structured data. It works with native PDF form fields and scanned/image forms.

**Before you begin**, make sure you have:

1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits)
2. Python 3.10+ installed
3. The Datalab SDK: `pip install datalab-python-sdk`
4. Your `DATALAB_API_KEY` environment variable set

## Quick Start

```python Python SDK theme={null}
from datalab_sdk import DatalabClient, FormFillingOptions

client = DatalabClient()

options = FormFillingOptions(
    field_data={
        "name": {"value": "John Doe", "description": "Full name"},
        "email": {"value": "john@example.com", "description": "Email address"},
        "date": {"value": "12/15/2024", "description": "Today's date"},
    }
)

result = client.fill("form.pdf", options=options)
result.save_output("filled_form.pdf")

print(f"Fields filled: {result.fields_filled}")
print(f"Fields not found: {result.fields_not_found}")
```

```bash cURL theme={null}
curl -X POST https://www.datalab.to/api/v1/fill \
  -H "X-API-Key: $DATALAB_API_KEY" \
  -F "file=@form.pdf" \
  -F 'field_data={"name": {"value": "John Doe", "description": "Full name"}, "email": {"value": "john@example.com", "description": "Email address"}}'

# Poll request_check_url from response until status is "complete"
# Response includes output_base64 with the filled form
```

```python Python (requests) theme={null}
import requests, json, time, base64, os

headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")}

field_data = {
    "name": {"value": "John Doe", "description": "Full name"},
    "email": {"value": "john@example.com", "description": "Email"},
    "date": {"value": "12/15/2024", "description": "Date"}
}

with open("form.pdf", "rb") as f:
    response = requests.post(
        "https://www.datalab.to/api/v1/fill",
        files={"file": ("form.pdf", f, "application/pdf")},
        data={"field_data": json.dumps(field_data), "confidence_threshold": "0.5"},
        headers=headers
    )

check_url = response.json()["request_check_url"]

while True:
    result = requests.get(check_url, headers=headers).json()
    if result["status"] == "complete":
        pdf_bytes = base64.b64decode(result["output_base64"])
        with open("filled_form.pdf", "wb") as f:
            f.write(pdf_bytes)
        print(f"Fields filled: {result['fields_filled']}")
        break
    elif result["status"] == "failed":
        print(f"Error: {result.get('error')}")
        break
    time.sleep(2)
```

See [SDK Form Filling](/docs/welcome/sdk/form-filling) for complete SDK documentation.

## How It Works

1. Upload your form (PDF or image) with field data
2. The API detects form fields and matches them to your data
3. Fields are filled and the form is returned as PDF or PNG

## Field Data Format

Provide field names with values and descriptions:

```python theme={null}
field_data = {
    "field_key": {
        "value": "The value to fill",
        "description": "Description to help match the field"
    }
}
```

### Examples

**Basic fields:**

```python theme={null}
field_data = {
    "first_name": {"value": "John", "description": "First name"},
    "last_name": {"value": "Doe", "description": "Last name"},
    "ssn": {"value": "123-45-6789", "description": "Social Security Number"}
}
```

**Checkboxes:**

```python theme={null}
field_data = {
    "is_citizen": {"value": "yes", "description": "US citizenship status"},
    "agree_terms": {"value": "checked", "description": "Terms agreement"}
}
```

Values like `"yes"`, `"true"`, `"1"`, `"checked"`, `"x"` will check boxes.

**Compound data:**

```python theme={null}
field_data = {
    "full_address": {
        "value": "123 Main St, New York, NY, 10001",
        "description": "Complete address"
    }
}
```

The API can split compound data across multiple form fields.
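If your application keeps form values in a flat dictionary, a small adapter can produce the `value`/`description` shape shown above. The following is a minimal sketch under stated assumptions: `build_field_data` is a hypothetical helper (not part of the Datalab SDK), it maps `True` to `"yes"` because that is one of the box-checking values listed above, and it leaves `False` fields out because the text does not say which value unchecks a box.

```python
import json
from typing import Optional


def build_field_data(values: dict, descriptions: Optional[dict] = None) -> dict:
    """Wrap flat {key: value} data in the {"value", "description"} shape.

    Hypothetical helper for illustration; it is not part of the Datalab SDK.
    """
    descriptions = descriptions or {}
    field_data = {}
    for key, value in values.items():
        if isinstance(value, bool):
            if not value:
                # Assumption: only box-checking values are documented,
                # so unchecked boxes are left out of the payload.
                continue
            value = "yes"  # one of the documented box-checking values
        elif isinstance(value, (list, tuple)):
            # Compound data: join the parts into one comma-separated string.
            value = ", ".join(str(part) for part in value)
        field_data[key] = {
            "value": str(value),
            # Fall back to a readable form of the key when no description is given.
            "description": descriptions.get(key, key.replace("_", " ")),
        }
    return field_data


record = {
    "first_name": "John",
    "is_citizen": True,
    "newsletter": False,
    "full_address": ["123 Main St", "New York", "NY", "10001"],
}
field_data = build_field_data(record, {"first_name": "First name"})

# The REST examples above send field_data as a JSON string.
payload = json.dumps(field_data)
print(payload)
```

Explicit descriptions are worth supplying for ambiguous keys, since the description is what helps match a value to a form field.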
## Options

| Option                 | Type   | Default  | Description |
| ---------------------- | ------ | -------- | ----------- |
| `field_data`           | dict   | Required | Field names mapped to values and descriptions |
| `context`              | str    | None     | Additional context to help match fields |
| `confidence_threshold` | float  | `0.5`    | Minimum confidence for field matching (0.0-1.0) |
| `max_pages`            | int    | None     | Maximum pages to process |
| `page_range`           | str    | None     | Specific pages to process |
| `skip_cache`           | bool   | `False`  | Skip cached results |
| `processing_location`  | string | -        | Data residency region: `"eu"` or `"us"`. When set, use `file_url` or a pre-uploaded `datalab://` reference; multipart uploads are not supported. EU carries a regional pricing premium. |

### Context Parameter

Use `context` to improve matching for specific form types:

```python theme={null}
options = FormFillingOptions(
    field_data={...},
    context="W-4 Employee's Withholding Certificate for new hire"
)
```

## Response

| Field              | Type  | Description                           |
| ------------------ | ----- | ------------------------------------- |
| `status`           | str   | `processing`, `complete`, or `failed` |
| `success`          | bool  | Whether filling succeeded             |
| `output_format`    | str   | `pdf` or `png`                        |
| `output_base64`    | str   | Base64-encoded filled form            |
| `fields_filled`    | list  | Field names that were filled          |
| `fields_not_found` | list  | Field names that couldn't be matched  |
| `page_count`       | int   | Pages processed                       |
| `runtime`          | float | Processing time in seconds            |
| `cost_breakdown`   | dict  | Cost details                          |

## Supported Form Types

* **PDF with native AcroForm fields** - Uses PDF form fields directly
* **PDF with visual fields** - Detects field locations and adds text overlays
* **Images** (PNG, JPG) - Detects field locations and draws text on image

The API automatically detects the input type and uses the appropriate method.

Results are deleted from Datalab servers one hour after processing completes.

## Next Steps

* Complete SDK reference for form filling
* Upload forms for reuse across requests
* Chain processors into versioned, reusable pipelines.
* Get notified when form filling completes

# Recipes Overview
Source: https://documentation.datalab.to/docs/recipes/overview

End-to-end guides for common document processing workflows.

Recipes are detailed, end-to-end guides with fully working code samples. Pick a recipe based on what you're trying to accomplish.

## By Use Case

* Chain processors into versioned pipelines for production use
* Convert documents to markdown or chunks for retrieval-augmented generation
* Pull structured fields (amounts, dates, line items) from financial documents
* Extract parties, dates, and clauses from legal documents
* Extract titles, authors, abstracts, and citations from academic papers
* Automatically fill PDF and image forms with structured data
* Create Word documents from markdown with track changes
* Separate multi-document PDFs into individual documents
* Extract redlines, insertions, deletions, and comments from Word documents

## By Feature

| Feature | Description | Guide |
| ---------------------- | ----------------------------------------------------------- | -------------------------------------------------------------------------------------- |
| Document Conversion | Convert PDFs, images, and office docs to markdown/HTML/JSON | [Guide](/docs/recipes/conversion/conversion-api-overview) |
| Batch Processing | Process multiple documents concurrently | [Guide](/docs/recipes/conversion/batch-documents) |
| Structured Extraction | Extract fields using JSON schemas | [Guide](/docs/recipes/structured-extraction/api-overview) |
| Long Document Handling | Strategies for 100+ page documents | [Guide](/docs/recipes/structured-extraction/handling-long-documents) |
| Document Segmentation | Split multi-document PDFs by section
| [Guide](/docs/recipes/document-segmentation/auto-segmentation) | | Form Filling | Fill PDF and image forms programmatically | [Guide](/docs/recipes/form-filling/form-filling-api-overview) | | Create Document | Generate DOCX files from markdown | [Guide](/docs/recipes/create-document/create-document-api-overview) | | File Upload | Upload and manage files for reuse | [Guide](/docs/recipes/file-management/file-upload-api) | | Pipelines | Chain processors into versioned, reusable configurations | [Guide](/docs/recipes/pipelines/pipeline-overview) | | Pipeline Versioning | Manage drafts, publish versions, pin production deployments | [Guide](/docs/recipes/pipelines/pipeline-versioning) | | Track Changes | Extract redlines and comments from Word docs | [Guide](/docs/recipes/extract-redlines-and-comments/track-changes-from-word-documents) | | Forge Evals | Compare parsing configurations side-by-side | [Guide](/docs/recipes/forge-evals/overview) | ## Self-Hosted All cloud API recipes work with our [on-premises containers](/docs/on-prem/overview) for sensitive documents. See the [feature parity table](/docs/on-prem/api#feature-parity) for available features. # Create a Pipeline Source: https://documentation.datalab.to/docs/recipes/pipelines/create-pipeline Build pipelines using Forge or the SDK to chain document processors. **Before you begin**, make sure you have: 1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits) 2. Python 3.10+ installed 3. The Datalab SDK: `pip install datalab-python-sdk` 4. Your `DATALAB_API_KEY` environment variable set ## Using Forge [Forge](https://www.datalab.to/app/playground) provides a visual pipeline builder where you can: 1. **Start from a template** or create a blank pipeline 2. **Add processors** — click to add convert, extract, segment, custom, or fill processors 3. 
**Configure each processor** — set processing mode, schemas, field data, and options in the configuration panel 4. **Test with a document** — run the pipeline and watch each processor complete in real-time 5. **Save and version** — name your pipeline and publish versions for production use Edits in Forge auto-save as a draft. Your published versions remain unchanged until you explicitly publish a new version. ## Using the SDK ### Create a Pipeline Define processors using `PipelineProcessor` and create the pipeline: ```python theme={null} from datalab_sdk import DatalabClient, PipelineProcessor client = DatalabClient() steps = [ PipelineProcessor(type="convert", settings={ "mode": "balanced", "output_format": "markdown" }), PipelineProcessor(type="extract", settings={ "page_schema": { "type": "object", "properties": { "title": {"type": "string", "description": "Document title"}, "date": {"type": "string", "description": "Document date"}, "summary": {"type": "string", "description": "Brief summary"} } } }) ] pipeline = client.create_pipeline(steps=steps) print(f"Created: {pipeline.pipeline_id}") # pl_XXXXX ``` The pipeline starts as an unsaved draft. ### Save the Pipeline Name and save the pipeline so it appears in your pipeline list: ```python theme={null} pipeline = client.save_pipeline( pipeline.pipeline_id, name="Document Summarizer" ) print(f"Saved: {pipeline.name}") ``` ### Update Steps Update a pipeline's steps. 
This creates a draft if the pipeline has a published version:

```python theme={null}
updated_steps = [
    PipelineProcessor(type="convert", settings={
        "mode": "accurate",  # Changed from balanced
        "output_format": "markdown"
    }),
    PipelineProcessor(type="extract", settings={
        "page_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "date": {"type": "string"},
                "summary": {"type": "string"},
                "author": {"type": "string"}  # Added field
            }
        }
    })
]

pipeline = client.update_pipeline(pipeline.pipeline_id, steps=updated_steps)
```

## Using the REST API

```bash Create theme={null}
curl -X POST https://www.datalab.to/api/v1/pipelines \
  -H "X-API-Key: $DATALAB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "steps": [
      {"type": "convert", "settings": {"mode": "balanced"}},
      {"type": "extract", "settings": {
        "page_schema": "{\"type\":\"object\",\"properties\":{\"title\":{\"type\":\"string\"}}}"
      }}
    ]
  }'
```

```bash Save theme={null}
curl -X PUT https://www.datalab.to/api/v1/pipelines/PIPELINE_ID/save \
  -H "X-API-Key: $DATALAB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "Document Summarizer"}'
```

```bash Update steps theme={null}
curl -X PUT https://www.datalab.to/api/v1/pipelines/PIPELINE_ID \
  -H "X-API-Key: $DATALAB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "steps": [
      {"type": "convert", "settings": {"mode": "accurate"}},
      {"type": "extract", "settings": {
        "page_schema": "{\"type\":\"object\",\"properties\":{\"title\":{\"type\":\"string\"}}}"
      }}
    ]
  }'
```

## Processor Configuration Reference

### Convert Processor

Controls how the document is parsed.
```python theme={null}
PipelineProcessor(type="convert", settings={
    "mode": "balanced",              # fast, balanced, accurate
    "output_format": "markdown",     # markdown, html, json, chunks
    "paginate": True,                # Add page delimiters
    "include_images": True,          # Extract images
    "include_image_captions": True,
    "add_block_ids": False,          # Block IDs for citations
})
```

| Setting                    | Type | Default      | Description                         |
| -------------------------- | ---- | ------------ | ----------------------------------- |
| `mode`                     | str  | `"fast"`     | Processing mode                     |
| `output_format`            | str  | `"markdown"` | Output format                       |
| `paginate`                 | bool | `false`      | Add page delimiters                 |
| `include_images`           | bool | `true`       | Extract images from document        |
| `include_image_captions`   | bool | `true`       | Generate image captions             |
| `include_headers_footers`  | bool | `false`      | Include page headers/footers        |
| `add_block_ids`            | bool | `false`      | Add block IDs for citation tracking |
| `fence_synthetic_captions` | bool | `false`      | Fence synthetic image captions      |

### Extract Processor

Extracts structured data using a JSON schema. Requires a preceding `convert` processor (or `segment` / `custom`).

```python theme={null}
PipelineProcessor(type="extract", settings={
    "page_schema": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string", "description": "Invoice ID"},
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "amount": {"type": "number"}
                    }
                }
            }
        }
    }
})
```

| Setting       | Type | Description                            |
| ------------- | ---- | -------------------------------------- |
| `page_schema` | dict | JSON schema defining fields to extract |

Use detailed `description` fields in your schema to improve extraction accuracy. Tell the model what to look for.

### Segment Processor

Splits a document into logical sections. Requires a preceding `convert` processor.
```python theme={null}
PipelineProcessor(type="segment", settings={
    "segmentation_schema": {
        "Cover Letter": "The cover letter or introductory section",
        "Resume": "The applicant's resume or CV",
        "References": "Reference letters or contact information"
    }
})
```

| Setting               | Type | Description                          |
| --------------------- | ---- | ------------------------------------ |
| `segmentation_schema` | dict | Map of section names to descriptions |

### Custom Processor

Applies use-case-specific customizations to convert output. Requires a preceding `convert` processor. See [Custom Processors](/docs/recipes/pipelines/custom-processors) for details.

```python theme={null}
PipelineProcessor(
    type="custom",
    settings={},
    custom_processor_id="cp_abc123"  # Your custom processor ID
)
```

| Field                 | Type | Description                             |
| --------------------- | ---- | --------------------------------------- |
| `custom_processor_id` | str  | ID of the custom processor (`cp_XXXXX`) |
| `eval_rubric_id`      | int  | Optional evaluation rubric to apply     |

### Fill Processor

Fills form fields in a PDF or image. `fill` is always the only step in a pipeline — it cannot be chained with `convert`, `extract`, or `segment`. Use it to apply versioning and execution tracking to your form-filling workflows.
```python theme={null} PipelineProcessor(type="fill", settings={ "field_data": { "full_name": {"value": "John Doe", "description": "Full legal name"}, "date": {"value": "2024-01-15", "description": "Today's date"}, }, "context": "Employee onboarding form", # Optional "confidence_threshold": 0.5, # Optional, default 0.5 }) ``` | Setting | Type | Required | Description | | ---------------------- | ----- | -------- | -------------------------------------------------------------- | | `field_data` | dict | Yes | Map of field keys to `{value, description}` objects | | `context` | str | No | Additional context to improve field matching | | `confidence_threshold` | float | No | Minimum confidence for field matching (0.0–1.0, default `0.5`) | ## List and Manage Pipelines ```python theme={null} # List saved pipelines result = client.list_pipelines(saved_only=True, limit=50) for p in result["pipelines"]: print(f"{p.pipeline_id}: {p.name} (v{p.active_version})") # Get a specific pipeline pipeline = client.get_pipeline("pl_abc123") # Archive (soft-delete) client.archive_pipeline("pl_abc123") # Restore client.unarchive_pipeline("pl_abc123") ``` ## Next Steps Manage drafts, publish versions, and pin production deployments. Execute pipelines with overrides and track results. Deep dive on extraction schemas and confidence scoring. Full SDK reference for all pipeline methods. # Custom Processors Source: https://documentation.datalab.to/docs/recipes/pipelines/custom-processors Fine-tune document conversion output with AI-generated custom processors. Custom processors customize the output of the `convert` processor. When standard conversion doesn't produce exactly what you need — edge-case layouts, domain-specific formatting, or use-case-specific output transformations — custom processors let you fine-tune the result. **Before you begin**, make sure you have: 1. 
A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits) 2. Python 3.10+ installed 3. The Datalab SDK: `pip install datalab-python-sdk` 4. Your `DATALAB_API_KEY` environment variable set ## How Custom Processors Work A custom processor applies modifications on top of document conversion. The flow is: 1. The `convert` processor parses your document into structured output 2. The custom processor applies your modifications to refine that output Modifications can operate at different levels: * **Block-level** — Modify individual blocks (e.g., rewrite table captions, summarize content) * **Page-level** — Modify entire pages with full structural control (e.g., reorder blocks, add/remove elements) * **Classification** — Classify pages into categories for downstream routing ## Creating a Custom Processor The recommended way to create a custom processor is through [Forge](https://www.datalab.to/app/playground). The creation flow is a 3-step guided wizard: 1. **Describe** — Use the chat-driven builder to articulate what your processor should do. Describe your goal in natural language (e.g., "Summarize all tables into bullet points" or "Extract only the financial data sections") and the AI assistant will help you refine and confirm the specification before generating the processor. 2. **Documents** — Upload example documents that represent your use case. These are used to generate and validate the processor configuration. 3. **Review** — See the generated processor run on your examples. If the results aren't right, use the **Improve** tab in the sidebar to describe what to change and generate a new version. The **History** tab shows all past versions and lets you revert to any of them; **Details** shows the active configuration. Each custom processor gets an ID in the format `cp_XXXXX`. 
## Using a Custom Processor

### Standalone

Run a custom processor directly on a document:

```python Python SDK theme={null}
from datalab_sdk import DatalabClient, CustomProcessorOptions

client = DatalabClient()

options = CustomProcessorOptions(
    pipeline_id="cp_abc123",  # Your custom processor ID
    mode="balanced",
    output_format="markdown",
)

result = client.run_custom_processor("document.pdf", options=options)
print(result.markdown)
```

```bash cURL theme={null}
curl -X POST https://www.datalab.to/api/v1/custom-processor \
  -H "X-API-Key: $DATALAB_API_KEY" \
  -F "file=@document.pdf" \
  -F "pipeline_id=cp_abc123" \
  -F "mode=balanced" \
  -F "output_format=markdown"
```

```python Python (requests) theme={null}
import os, time, requests

url = "https://www.datalab.to/api/v1/custom-processor"
headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")}

# Submit the document
with open("document.pdf", "rb") as f:
    resp = requests.post(url, headers=headers,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={
            "pipeline_id": "cp_abc123",
            "mode": "balanced",
            "output_format": "markdown"
        })

check_url = resp.json()["request_check_url"]

# Poll until the request completes or fails
for _ in range(300):
    result = requests.get(check_url, headers=headers).json()
    if result["status"] == "complete":
        print(result["markdown"])
        break
    elif result["status"] == "failed":
        # Stop on failure instead of waiting out the remaining polls
        print(f"Error: {result.get('error')}")
        break
    time.sleep(2)
```

### In a Pipeline

Use a custom processor as part of a pipeline by adding it as a `custom` processor:

```python theme={null}
from datalab_sdk import DatalabClient, PipelineProcessor

client = DatalabClient()

pipeline = client.create_pipeline(steps=[
    PipelineProcessor(type="convert", settings={"mode": "balanced"}),
    PipelineProcessor(type="custom", settings={}, custom_processor_id="cp_abc123"),
    PipelineProcessor(type="extract", settings={
        "page_schema": {
            "type": "object",
            "properties": {
                "summary": {"type": "string"}
            }
        }
    })
])
```

This chains convert → custom → extract: the document is parsed, your custom modifications are applied, then structured data is extracted from the customized output.
## CustomProcessorOptions

| Option                     | Type | Default        | Description                                                 |
| -------------------------- | ---- | -------------- | ----------------------------------------------------------- |
| `pipeline_id`              | str  | Required       | Custom processor ID (`cp_XXXXX`)                            |
| `version`                  | int  | Active version | Specific processor version to run                           |
| `run_eval`                 | bool | `False`        | Run evaluation rules after processing                       |
| `mode`                     | str  | `"fast"`       | Processing mode: `"fast"`, `"balanced"`, `"accurate"`       |
| `output_format`            | str  | `"markdown"`   | Output format: `"markdown"`, `"html"`, `"json"`, `"chunks"` |
| `paginate`                 | bool | `False`        | Add page delimiters                                         |
| `add_block_ids`            | bool | `False`        | Add block IDs for citation tracking                         |
| `disable_image_extraction` | bool | `False`        | Don't extract images                                        |
| `disable_image_captions`   | bool | `False`        | Don't generate image captions                               |
| `webhook_url`              | str  | -              | Webhook URL for completion notification                     |

## Versioning

Custom processors support versioning. Each iteration creates a new version, letting you refine behavior over time:

```python theme={null}
# List versions
versions = client.list_custom_processor_versions("cp_abc123")
for v in versions["versions"]:
    print(f"v{v.version}: {v.description}")

# Switch active version
client.set_active_processor_version("cp_abc123", version=2)
```

## Managing Custom Processors

```python theme={null}
# List your custom processors
result = client.list_custom_processors(limit=50)
for p in result["processors"]:
    print(f"{p.processor_id}: {p.name} (v{p.active_version})")

# Archive
client.archive_custom_processor("cp_abc123")
```

## Next Steps

* Processor types, composition rules, and when to use pipelines.
* Build pipelines that include custom processors.
* Understand the convert processor that custom processors build on.
* Evaluate and compare processor configurations across your document collection.
# Pipelines

Source: https://documentation.datalab.to/docs/recipes/pipelines/pipeline-overview

Build versioned document processing pipelines by chaining processors together.

Pipelines chain processors — convert, extract, segment, and custom — into a single reusable unit. Define a pipeline once, version it, and run it against any document with one API call.

**Before you begin**, make sure you have:

1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits)
2. Python 3.10+ installed
3. The Datalab SDK: `pip install datalab-python-sdk`
4. Your `DATALAB_API_KEY` environment variable set

## Why Pipelines

Individual endpoints like `/convert` and `/extract` work well for one-off tasks. Pipelines are better when you need to:

* **Chain processors** — Convert a document, then extract structured data, in one call
* **Version your configuration** — Pin production integrations to a specific version while iterating on drafts
* **Standardize processing** — Share pipeline configurations across your team
* **Track execution** — Monitor each processor's status as a pipeline runs

You can build pipelines visually in [Forge](https://www.datalab.to/app/playground) or programmatically via the SDK and API.

## How Pipelines Work

A pipeline is an ordered chain of processors. Each processor processes the document and passes its output to the next via checkpoints.

```
convert → segment → extract
```

Most pipelines start with `convert`. The `fill` processor is the exception — it runs as a standalone step and cannot be chained.
### Processor Types

| Processor | Description                                                         | Can Follow                     |
| --------- | ------------------------------------------------------------------- | ------------------------------ |
| `convert` | Parse document to markdown/HTML/JSON                                | Must be first                  |
| `segment` | Split document into logical sections                                | `convert`                      |
| `extract` | Extract structured data using a JSON schema                         | `convert`, `segment`, `custom` |
| `custom`  | Run a [custom processor](/docs/recipes/pipelines/custom-processors) | `convert`                      |
| `fill`    | Fill form fields in a PDF or image                                  | Standalone only                |

### Composition Rules

* Every pipeline starts with a `convert` or `fill` processor
* `extract` is always terminal (nothing can follow it)
* `segment` can feed into `extract`
* `custom` can feed into `extract`
* `fill` is always standalone — it cannot follow or precede other processors

Common patterns:

| Pattern                       | Use Case                                 |
| ----------------------------- | ---------------------------------------- |
| `convert`                     | Simple document parsing                  |
| `convert → extract`           | Parse and extract structured fields      |
| `convert → segment`           | Parse and split into sections            |
| `convert → segment → extract` | Split, then extract from each section    |
| `convert → custom → extract`  | Apply custom processing, then extract    |
| `fill`                        | Version and track form-filling workflows |

## Pipeline Lifecycle

Pipelines have three states:

1. **Draft** — Edits auto-save. Not versioned yet.
2. **Saved** — Named and visible in your pipeline list.
3. **Published** — An immutable version snapshot. Safe to use in production.

```
Create (draft) → Save (named) → Publish version (immutable)
                      ↑                              |
                      └──── Edit (new draft) ←───────┘
```

When you edit a published pipeline, your changes go into a draft. The published version remains unchanged until you publish a new version. You can discard the draft at any time to revert.

See [Pipeline Versioning](/docs/recipes/pipelines/pipeline-versioning) for the full lifecycle.
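The "Can Follow" column and the composition rules above can be checked locally before a pipeline is created. The following is a minimal sketch, assuming the table is complete; `validate_chain` and `ALLOWED_PREDECESSORS` are hypothetical names, not part of the Datalab SDK, and the API may enforce additional rules.

```python
from typing import List, Optional

# Which processor may directly precede each type (None means "must be first").
# Transcribed from the "Can Follow" column of the Processor Types table.
ALLOWED_PREDECESSORS = {
    "convert": {None},
    "segment": {"convert"},
    "extract": {"convert", "segment", "custom"},
    "custom": {"convert"},
    "fill": {None},
}


def validate_chain(step_types: List[str]) -> List[str]:
    """Return a list of rule violations; an empty list means the chain looks valid."""
    if not step_types:
        return ["pipeline has no steps"]
    errors = []
    if "fill" in step_types and len(step_types) > 1:
        errors.append("fill must be the only step in a pipeline")
    previous: Optional[str] = None
    for index, step_type in enumerate(step_types):
        allowed = ALLOWED_PREDECESSORS.get(step_type)
        if allowed is None:
            errors.append(f"step {index}: unknown processor type {step_type!r}")
        elif previous not in allowed:
            errors.append(f"step {index}: {step_type!r} cannot follow {previous!r}")
        previous = step_type
    return errors


print(validate_chain(["convert", "segment", "extract"]))  # []
print(validate_chain(["convert", "extract", "segment"]))  # one violation
```

Because `extract` never appears as an allowed predecessor, the "always terminal" rule falls out of the table without a separate check.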
## Quick Example

Create a pipeline that converts a document and extracts invoice data:

```python Python SDK theme={null}
from datalab_sdk import DatalabClient, PipelineProcessor

client = DatalabClient()

# Define steps
steps = [
    PipelineProcessor(type="convert", settings={"mode": "balanced"}),
    PipelineProcessor(type="extract", settings={
        "page_schema": {
            "type": "object",
            "properties": {
                "invoice_number": {"type": "string"},
                "total_amount": {"type": "number"},
                "vendor_name": {"type": "string"}
            }
        }
    })
]

# Create and save
pipeline = client.create_pipeline(steps=steps)
pipeline = client.save_pipeline(pipeline.pipeline_id, name="Invoice Extractor")

# Run on a document
execution = client.run_pipeline(
    pipeline.pipeline_id,
    file_path="invoice.pdf"
)

# Poll until complete
execution = client.get_pipeline_execution(
    execution.execution_id,
    max_polls=300
)

# Get extraction result
result = client.get_step_result(execution.execution_id, step_index=1)
print(result)
```

```bash cURL theme={null}
# Create pipeline
curl -X POST https://www.datalab.to/api/v1/pipelines \
  -H "X-API-Key: $DATALAB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "steps": [
      {"type": "convert", "settings": {"mode": "balanced"}},
      {"type": "extract", "settings": {
        "page_schema": {
          "type": "object",
          "properties": {
            "invoice_number": {"type": "string"},
            "total_amount": {"type": "number"}
          }
        }
      }}
    ]
  }'

# Save pipeline (use pipeline_id from response)
curl -X PUT https://www.datalab.to/api/v1/pipelines/PIPELINE_ID/save \
  -H "X-API-Key: $DATALAB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "Invoice Extractor"}'

# Run pipeline
curl -X POST https://www.datalab.to/api/v1/pipelines/PIPELINE_ID/run \
  -H "X-API-Key: $DATALAB_API_KEY" \
  -F "file=@invoice.pdf"

# Poll execution (use execution_id from response)
curl https://www.datalab.to/api/v1/pipelines/executions/EXECUTION_ID \
  -H "X-API-Key: $DATALAB_API_KEY"
```

```python Python (requests) theme={null}
import os, time, requests, json

BASE = "https://www.datalab.to/api/v1"
headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")}

# Create pipeline
resp = requests.post(f"{BASE}/pipelines",
    headers={**headers, "Content-Type": "application/json"},
    json={
        "steps": [
            {"type": "convert", "settings": {"mode": "balanced"}},
            {"type": "extract", "settings": {
                "page_schema": json.dumps({
                    "type": "object",
                    "properties": {
                        "invoice_number": {"type": "string"},
                        "total_amount": {"type": "number"}
                    }
                })
            }}
        ]
    })
pipeline_id = resp.json()["pipeline_id"]

# Save
requests.put(f"{BASE}/pipelines/{pipeline_id}/save",
    headers={**headers, "Content-Type": "application/json"},
    json={"name": "Invoice Extractor"})

# Run
with open("invoice.pdf", "rb") as f:
    resp = requests.post(f"{BASE}/pipelines/{pipeline_id}/run",
        headers=headers,
        files={"file": ("invoice.pdf", f, "application/pdf")})
execution_id = resp.json()["execution_id"]

# Poll until the execution reaches a terminal status
for _ in range(300):
    resp = requests.get(f"{BASE}/pipelines/executions/{execution_id}",
        headers=headers)
    data = resp.json()
    if data["status"] in ("completed", "completed_with_errors", "failed"):
        break
    time.sleep(2)

# Get step result
resp = requests.get(
    f"{BASE}/pipelines/executions/{execution_id}/steps/1/result",
    headers=headers)
print(resp.json())
```

## Pipelines vs Individual Endpoints

|                   | Individual Endpoints      | Pipelines                        |
| ----------------- | ------------------------- | -------------------------------- |
| **Processors**    | One at a time             | Chain multiple processors        |
| **Versioning**    | None                      | Draft, saved, published versions |
| **Configuration** | Pass options per request  | Configure once, reuse            |
| **Forge UI**      | Playground                | Full pipeline builder            |
| **Best for**      | Quick tests, simple tasks | Production integrations          |

Individual endpoints (`/convert`, `/extract`, `/segment`) are not going away. Use them for simple, one-off processing. Use Pipelines when you need repeatability, versioning, or multi-processor chains.

## Next Steps

Build your first pipeline with Forge or the SDK.
Manage drafts, versions, and production deployments. Execute pipelines with overrides, polling, and webhooks. Full SDK reference for all pipeline methods. # Pipeline Versioning Source: https://documentation.datalab.to/docs/recipes/pipelines/pipeline-versioning Manage pipeline drafts, publish immutable versions, and pin production deployments. **Before you begin**, make sure you have: 1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits) 2. Python 3.10+ installed 3. The Datalab SDK: `pip install datalab-python-sdk` 4. Your `DATALAB_API_KEY` environment variable set ## Version Lifecycle Every pipeline goes through a predictable lifecycle: | State | `active_version` | Description | | --------- | ---------------- | ------------------------------------------- | | Draft | `0` | Edits auto-save. No published version yet. | | Saved | `0` | Named pipeline, still no published version. | | Published | `1`, `2`, ... | Immutable version snapshots exist. | When you edit a published pipeline, your changes go into a draft. The published version is untouched until you explicitly publish again. 
## Publish a Version

Create an immutable snapshot of the current pipeline steps:

```python Python SDK theme={null}
from datalab_sdk import DatalabClient

client = DatalabClient()

# Publish version 1
version = client.create_pipeline_version(
    "pl_abc123",
    description="Initial production release"
)
print(f"Published v{version.version}")  # v1
```

```bash cURL theme={null}
curl -X POST https://www.datalab.to/api/v1/pipelines/PIPELINE_ID/versions \
  -H "X-API-Key: $DATALAB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"description": "Initial production release"}'
```

```python Python (requests) theme={null}
import os, requests

BASE = "https://www.datalab.to/api/v1"
headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")}

resp = requests.post(f"{BASE}/pipelines/pl_abc123/versions",
    headers={**headers, "Content-Type": "application/json"},
    json={"description": "Initial production release"})
print(resp.json())
```

Each call increments the version number. Published versions are immutable — their steps cannot be changed.

## Edit and Iterate

After publishing, any edits create a draft that is separate from the published version:

```python theme={null}
from datalab_sdk import PipelineProcessor

# Edit steps — this creates a draft
client.update_pipeline("pl_abc123", steps=[
    PipelineProcessor(type="convert", settings={"mode": "accurate"}),  # Changed
    PipelineProcessor(type="extract", settings={
        "page_schema": {"type": "object", "properties": {
            "title": {"type": "string"},
            "author": {"type": "string"}  # Added field
        }}
    })
])

# Test the draft
execution = client.run_pipeline("pl_abc123", file_path="test.pdf", version=0)

# Happy with changes? Publish a new version
version = client.create_pipeline_version("pl_abc123", description="Added author field")
print(f"Published v{version.version}")  # v2
```

`version=0` explicitly runs the draft. Omitting `version` runs the active published version. See [Run a Pipeline](/docs/recipes/pipelines/run-pipeline) for version parameter details.
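The version rules can be summarized as a small decision function. This is a local illustration only, not SDK code: `resolve_version` is a hypothetical name, and the fallback to the draft when nothing is published follows the behavior described in the Run a Pipeline guide.

```python
from typing import Optional


def resolve_version(requested: Optional[int], active_version: int) -> str:
    """Describe which configuration a run would use.

    requested: the `version` argument passed to run_pipeline (None if omitted).
    active_version: the pipeline's active_version (0 means nothing is published).
    """
    if requested is not None and requested < 0:
        raise ValueError("version must be 0 (draft) or a positive integer")
    if requested is None:
        # Omitted: the active published version, or the draft if none is published.
        return f"v{active_version}" if active_version >= 1 else "draft"
    if requested == 0:
        return "draft"
    return f"v{requested}"


print(resolve_version(None, 2))  # v2
print(resolve_version(0, 2))     # draft
print(resolve_version(None, 0))  # draft
```

The last case is the one to watch: an unpinned run against a pipeline with no published version executes whatever is currently in the draft.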
## Discard a Draft

Discard the draft's unpublished changes and restore the published version's steps:

```python Python SDK theme={null}
# Discard draft, revert to active version
pipeline = client.discard_pipeline_draft("pl_abc123")

# Or revert to a specific version
pipeline = client.discard_pipeline_draft("pl_abc123", version=1)
```

```bash cURL theme={null}
# Discard draft, revert to active version
curl -X POST https://www.datalab.to/api/v1/pipelines/PIPELINE_ID/discard \
  -H "X-API-Key: $DATALAB_API_KEY"

# Revert to specific version
curl -X POST https://www.datalab.to/api/v1/pipelines/PIPELINE_ID/discard \
  -H "X-API-Key: $DATALAB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"version": 1}'
```

```python Python (requests) theme={null}
import os, requests

BASE = "https://www.datalab.to/api/v1"
headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")}

# Discard draft, revert to active version
resp = requests.post(f"{BASE}/pipelines/pl_abc123/discard", headers=headers)
print(resp.json())
```

## Browse Version History

List all published versions for a pipeline:

```python theme={null}
result = client.list_pipeline_versions("pl_abc123")
for v in result["versions"]:
    print(f"v{v.version}: {v.description} (created {v.created})")
    print(f"  Steps: {[s['type'] for s in v.steps]}")
```

Versions are returned newest-first.

## Best Practices

**Pin production integrations to a specific version.** When calling `run_pipeline()` from production code, pass an explicit `version` number.
This protects you from accidental changes: ```python theme={null} # Production code — pinned to v2 execution = client.run_pipeline( "pl_abc123", file_path="document.pdf", version=2 # Always runs v2, even if v3 is published later ) ``` **Test drafts before publishing.** Use `version=0` to run the draft version against test documents: ```python theme={null} # Test draft changes execution = client.run_pipeline( "pl_abc123", file_path="test_document.pdf", version=0 # Runs draft ) ``` **Use descriptions.** Include a meaningful description when publishing so your team can understand what changed: ```python theme={null} client.create_pipeline_version( "pl_abc123", description="Switch to accurate mode, add line_items extraction" ) ``` **Archive unused pipelines.** Keep your pipeline list clean: ```python theme={null} client.archive_pipeline("pl_old123") # List includes archived if you need them result = client.list_pipelines(include_archived=True) ``` ## Next Steps Execute pipelines with version selection, overrides, and polling. Build pipelines with Forge or the SDK. Processor types, composition rules, and when to use pipelines. Full SDK reference for all pipeline methods. # Run a Pipeline Source: https://documentation.datalab.to/docs/recipes/pipelines/run-pipeline Execute pipelines with version selection, overrides, polling, and per-processor result retrieval. **Before you begin**, make sure you have: 1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits) 2. Python 3.10+ installed 3. The Datalab SDK: `pip install datalab-python-sdk` 4. 
Your `DATALAB_API_KEY` environment variable set ## Basic Execution Run a pipeline on a document: ```python Python SDK theme={null} from datalab_sdk import DatalabClient client = DatalabClient() execution = client.run_pipeline( "pl_abc123", file_path="document.pdf" ) # Poll until complete execution = client.get_pipeline_execution( execution.execution_id, max_polls=300, poll_interval=2 ) print(f"Status: {execution.status}") ``` ```bash cURL theme={null} # Start execution curl -X POST https://www.datalab.to/api/v1/pipelines/PIPELINE_ID/run \ -H "X-API-Key: $DATALAB_API_KEY" \ -F "file=@document.pdf" # Poll for completion (use execution_id from response) curl https://www.datalab.to/api/v1/pipelines/executions/EXECUTION_ID \ -H "X-API-Key: $DATALAB_API_KEY" ``` ```python Python (requests) theme={null} import os, time, requests BASE = "https://www.datalab.to/api/v1" headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")} # Start execution with open("document.pdf", "rb") as f: resp = requests.post(f"{BASE}/pipelines/pl_abc123/run", headers=headers, files={"file": ("document.pdf", f, "application/pdf")}) execution_id = resp.json()["execution_id"] # Poll for _ in range(300): resp = requests.get(f"{BASE}/pipelines/executions/{execution_id}", headers=headers) data = resp.json() if data["status"] in ("completed", "completed_with_errors", "failed"): break time.sleep(2) print(f"Status: {data['status']}") ``` You can also pass a URL instead of a file: ```python theme={null} execution = client.run_pipeline( "pl_abc123", file_url="https://example.com/document.pdf" ) ``` ## Version Selection The `version` parameter controls which pipeline configuration runs: | Value | Behavior | | ---------------- | ---------------------------------------------------------------------------------- | | Omitted / `None` | Runs the **active published version**. If no version is published, runs the draft. | | `0` | Explicitly runs the **draft** (current unpublished edits). | | `1`, `2`, ... 
| Runs a **specific published version**. | ```python theme={null} # Run active published version (recommended for production) execution = client.run_pipeline("pl_abc123", file_path="doc.pdf") # Run draft for testing execution = client.run_pipeline("pl_abc123", file_path="doc.pdf", version=0) # Pin to specific version execution = client.run_pipeline("pl_abc123", file_path="doc.pdf", version=2) ``` If you omit `version` and no version has been published, the draft runs. Publish a version before using a pipeline in production to avoid running unfinished drafts. ## Run-Level Overrides Override pipeline behavior per execution without changing the pipeline configuration: ```python theme={null} execution = client.run_pipeline( "pl_abc123", file_path="document.pdf", page_range="0-5", # Process specific pages output_format="json", # Override output format skip_cache=True, # Force reprocessing (skip cached results) run_evals=True, # Run evaluation rubrics defined on steps webhook_url="https://example.com/webhook", # Notify on completion version=2, # Pin to version 2 ) ``` ### Override Reference | Parameter | Type | Default | Description | | --------------- | ---- | ------- | ------------------------------------------------------------ | | `file_path` | str | - | Local file to process (mutually exclusive with `file_url`) | | `file_url` | str | - | URL to document | | `page_range` | str | - | Pages to process (e.g., `"0-5,10"`, 0-indexed) | | `output_format` | str | - | Override output format: `markdown`, `html`, `json`, `chunks` | | `skip_cache` | bool | `False` | Skip cached results, reprocess from scratch | | `run_evals` | bool | `False` | Run evaluation rubrics configured on steps | | `webhook_url` | str | - | URL to POST when execution completes | | `version` | int | - | Pipeline version to run (see above) | | `max_polls` | int | `1` | Polling attempts after submission | | `poll_interval` | int | `1` | Seconds between polls | ## Execution Status Poll for status using 
`get_pipeline_execution()`:

```python theme={null}
execution = client.get_pipeline_execution(
    execution.execution_id,
    max_polls=300,     # Keep polling until complete
    poll_interval=2    # Check every 2 seconds
)

print(f"Status: {execution.status}")
print(f"Version: {execution.pipeline_version}")
print(f"Started: {execution.started_at}")
print(f"Completed: {execution.completed_at}")
```

### Status Values

| Status                  | Description                       |
| ----------------------- | --------------------------------- |
| `pending`               | Queued, not started               |
| `running`               | Processors are executing          |
| `completed`             | All steps finished successfully   |
| `completed_with_errors` | Some steps completed, some failed |
| `failed`                | Execution failed                  |

### Per-Processor Tracking

Each processor in the execution reports its own status:

```python theme={null}
for step in execution.steps:
    print(f"Step {step.step_index} ({step.step_type}): {step.status}")
    if step.error_message:
        print(f"  Error: {step.error_message}")
    if step.result_url:
        print("  Result available")
```

Step status values: `pending`, `dispatched`, `running`, `completed`, `failed`, `skipped`.

## Retrieve Processor Results

Fetch the output of a specific processor:

```python theme={null}
# Get result for step at index 1 (e.g., extract step)
result = client.get_step_result(execution.execution_id, step_index=1)
print(result)
```

```bash cURL theme={null}
curl https://www.datalab.to/api/v1/pipelines/executions/EXECUTION_ID/steps/1/result \
  -H "X-API-Key: $DATALAB_API_KEY"
```

```python Python (requests) theme={null}
resp = requests.get(
    f"{BASE}/pipelines/executions/{execution_id}/steps/1/result",
    headers=headers)
print(resp.json())
```

Results are deleted from Datalab servers one hour after processing completes. Retrieve your results promptly.
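Because results expire after an hour, it can help to persist every step's output as soon as an execution finishes. The following is a minimal sketch, not part of the SDK: `save_step_results` and the `pipeline_results` directory are hypothetical names, and it assumes the value returned by `get_step_result()` can be serialized as JSON (anything that cannot falls back to `str`).

```python
import json
from pathlib import Path


def save_step_results(client, execution, out_dir="pipeline_results"):
    """Write each completed step's result to <out_dir>/<execution_id>_step<N>.json.

    `client` is expected to expose get_step_result(execution_id, step_index=...)
    and `execution` to expose .execution_id and .steps, as in the examples above.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for step in execution.steps:
        if step.status != "completed":
            continue  # only completed steps are fetched in this sketch
        result = client.get_step_result(
            execution.execution_id, step_index=step.step_index
        )
        path = out / f"{execution.execution_id}_step{step.step_index}.json"
        # Assumption: the result is JSON-serializable; default=str covers the rest.
        path.write_text(json.dumps(result, default=str, indent=2))
        saved.append(path)
    return saved
```

Calling `save_step_results(client, execution)` right after polling returns a terminal status keeps a local copy that outlives the one-hour retention window.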
## Webhooks Get notified when a pipeline execution completes instead of polling: ```python theme={null} execution = client.run_pipeline( "pl_abc123", file_path="document.pdf", webhook_url="https://your-server.com/pipeline-webhook" ) ``` Datalab sends a POST request to your webhook URL when the execution reaches a terminal status. See [Webhooks](/platform/webhooks) for payload details. ## List Executions View recent executions for a pipeline: ```python theme={null} result = client.list_pipeline_executions("pl_abc123", limit=20) for ex in result["executions"]: print(f"{ex.execution_id}: {ex.status} (v{ex.pipeline_version})") ``` ## Billing Pipeline execution is billed per page, with rates additive across processors. Each processor type has its own per-page rate. Check a pipeline's rate before running: ```python theme={null} rate = client.get_pipeline_rate("pl_abc123") print(f"Rate per 1000 pages: {rate['rate_per_1000_pages_cents']} cents") print(f"Breakdown: {rate['rate_breakdown']}") ``` ## End-to-End Example Create a pipeline, publish it, and run it in production: ```python theme={null} from datalab_sdk import DatalabClient, PipelineProcessor client = DatalabClient() # 1. Create and save pipeline = client.create_pipeline(steps=[ PipelineProcessor(type="convert", settings={"mode": "balanced"}), PipelineProcessor(type="extract", settings={ "page_schema": { "type": "object", "properties": { "vendor": {"type": "string", "description": "Vendor name"}, "amount": {"type": "number", "description": "Total amount"}, "date": {"type": "string", "description": "Invoice date"} } } }) ]) pipeline = client.save_pipeline(pipeline.pipeline_id, name="Invoice Parser") # 2. Test the draft test_exec = client.run_pipeline( pipeline.pipeline_id, file_path="test_invoice.pdf", version=0 ) test_exec = client.get_pipeline_execution(test_exec.execution_id, max_polls=300) test_result = client.get_step_result(test_exec.execution_id, step_index=1) print(f"Test result: {test_result}") # 3. 
Publish version = client.create_pipeline_version( pipeline.pipeline_id, description="Initial release — balanced mode, basic fields" ) # 4. Run in production (pinned to version) execution = client.run_pipeline( pipeline.pipeline_id, file_path="real_invoice.pdf", version=version.version ) execution = client.get_pipeline_execution(execution.execution_id, max_polls=300) if execution.status == "completed": result = client.get_step_result(execution.execution_id, step_index=1) print(f"Extracted: {result}") else: for step in execution.steps: if step.error_message: print(f"Step {step.step_index} failed: {step.error_message}") ``` ## Next Steps Processor types, composition rules, and when to use pipelines. Manage drafts, versions, and production pinning. Configure webhook notifications for pipeline executions. Full SDK reference for all pipeline methods. # Structured Extraction Source: https://documentation.datalab.to/docs/recipes/structured-extraction/api-overview Extract structured data from documents using JSON schemas. Extract specific fields from documents by providing a JSON schema. Marker parses the document and fills in your schema with extracted values. **Before you begin**, make sure you have: 1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits) 2. Python 3.10+ installed 3. The Datalab SDK: `pip install datalab-python-sdk` 4. Your `DATALAB_API_KEY` environment variable set **Building for production?** Use [Pipelines](/docs/recipes/pipelines/pipeline-overview) to chain processors, version your configuration, and deploy with a single API call. 
## Quick Start ```python Python SDK theme={null} import json from datalab_sdk import DatalabClient, ExtractOptions client = DatalabClient() schema = { "type": "object", "properties": { "invoice_number": {"type": "string", "description": "Invoice ID or number"}, "total_amount": {"type": "number", "description": "Total amount due"}, "vendor_name": {"type": "string", "description": "Company or vendor name"} }, "required": ["invoice_number", "total_amount"] } options = ExtractOptions( page_schema=json.dumps(schema), mode="balanced" ) result = client.extract("invoice.pdf", options=options) extracted = json.loads(result.extraction_schema_json) print(f"Invoice: {extracted['invoice_number']}") print(f"Total: ${extracted['total_amount']}") ``` ```bash cURL theme={null} curl -X POST https://www.datalab.to/api/v1/extract \ -H "X-API-Key: $DATALAB_API_KEY" \ -F "file=@invoice.pdf" \ -F "mode=balanced" \ -F 'page_schema={"type":"object","properties":{"invoice_number":{"type":"string","description":"Invoice ID"},"total_amount":{"type":"number","description":"Total due"}}}' # Poll request_check_url from response until status is "complete" ``` ```python Python (requests) theme={null} import requests, json, time, os headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")} schema = { "type": "object", "properties": { "invoice_number": {"type": "string", "description": "Invoice ID"}, "total_amount": {"type": "number", "description": "Total due"} } } with open("invoice.pdf", "rb") as f: response = requests.post( "https://www.datalab.to/api/v1/extract", files={"file": ("invoice.pdf", f, "application/pdf")}, data={"page_schema": json.dumps(schema), "mode": "balanced"}, headers=headers ) check_url = response.json()["request_check_url"] while True: result = requests.get(check_url, headers=headers).json() if result["status"] == "complete": extracted = json.loads(result["extraction_schema_json"]) print(extracted) break elif result["status"] == "failed": print(f"Error: {result.get('error')}") 
break time.sleep(2) ``` ## Extraction Modes The `extraction_mode` form parameter controls how extraction runs. This is separate from `mode`, which controls document parsing quality. | Mode | Description | Price | Latency | | ---------------------- | ---------------------------------------------------------------------------- | --------------- | ----------------------------------------- | | **fast** | Extraction with per-field citations | \$6 / 1K pages | Lowest | | **balanced** (default) | Extraction with independent verification, per-field reasoning, and citations | \$25 / 1K pages | Slower — trades speed for higher accuracy | Both modes return citations for every extracted field. Balanced mode additionally returns `_meta` per field with `extraction_status`, `reasoning`, and `verification` results. `balanced` is the default. Teams that made an extraction request in the 30 days before June 4, 2026 default to `fast` instead. Pass `extraction_mode` explicitly to override the default in either case. ```bash cURL theme={null} # Fast extraction mode curl -X POST https://www.datalab.to/api/v1/extract \ -H "X-API-Key: $DATALAB_API_KEY" \ -F "file=@invoice.pdf" \ -F 'page_schema={"type":"object","properties":{"invoice_number":{"type":"string"}}}' \ -F "extraction_mode=fast" ``` ```python theme={null} # The SDK's ExtractOptions controls document parse mode via `mode`. # To set extraction_mode, use the REST API directly (see cURL tab above) # or pass it as a raw form field via requests. options = ExtractOptions(page_schema=json.dumps(schema)) # defaults to balanced extraction ``` See [Balanced Extraction Mode](/docs/recipes/structured-extraction/balanced-mode) for a full guide on the balanced mode response format and building workflows with verification metadata. 
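Since `ExtractOptions` does not expose `extraction_mode`, the raw form-field route mentioned above could look roughly like this. It is a sketch under assumptions: `build_extract_form` and `submit_extraction` are hypothetical helper names, the file is assumed to be a PDF, and only the form fields already shown on this page are used.

```python
import json
import os

EXTRACT_URL = "https://www.datalab.to/api/v1/extract"


def build_extract_form(schema, extraction_mode="fast", mode="balanced"):
    """Build the form fields for /extract.

    `mode` controls document parsing; `extraction_mode` controls extraction.
    """
    return {
        "page_schema": json.dumps(schema),
        "mode": mode,
        "extraction_mode": extraction_mode,
    }


def submit_extraction(path, schema, extraction_mode="fast"):
    """Submit a file for extraction and return the URL to poll."""
    # Imported here so build_extract_form works without requests installed.
    import requests

    headers = {"X-API-Key": os.environ["DATALAB_API_KEY"]}
    with open(path, "rb") as f:
        resp = requests.post(
            EXTRACT_URL,
            files={"file": (os.path.basename(path), f, "application/pdf")},
            data=build_extract_form(schema, extraction_mode),
            headers=headers,
        )
    resp.raise_for_status()
    return resp.json()["request_check_url"]
```

Passing `extraction_mode` explicitly also makes the request independent of the team-level default described above.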
## Schema Format

Use JSON Schema format to define what you want to extract:

```json theme={null}
{
  "type": "object",
  "properties": {
    "field_name": {
      "type": "string",
      "description": "Describe what this field contains"
    },
    "numeric_field": {
      "type": "number",
      "description": "A numeric value"
    },
    "list_field": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "nested_field": {"type": "string"}
        }
      }
    }
  },
  "required": ["field_name"]
}
```

### Tips for Better Extraction

1. **Use descriptive field names** - `invoice_number` is clearer than `id`
2. **Add descriptions** - The `description` field helps the model understand context
3. **Specify types correctly** - Use `number` for numeric values, `string` for text
4. **Use arrays for repeating data** - Line items, table rows, etc.

**Common schema pitfalls:**

* Using vague field names like `data` or `info`: be specific (e.g., `invoice_number`, `total_amount`)
* Forgetting `description` fields: these help the model understand what to extract
* Setting `type: "string"` for numeric values: use `type: "number"` for amounts, quantities, etc.
* Deeply nested schemas: keep schemas as flat as possible for better extraction accuracy

## Response

The extracted data is returned in `extraction_schema_json`:

```json theme={null}
{
  "status": "complete",
  "success": true,
  "json": {...},
  "extraction_schema_json": "{\"invoice_number\": \"INV-2024-001\", \"total_amount\": 1500.00, ...}",
  "page_count": 2
}
```

### Citation Tracking

Each extracted field includes citations to the source blocks:

```json theme={null}
{
  "invoice_number": "INV-2024-001",
  "invoice_number_citations": ["block_123", "block_124"],
  "total_amount": 1500.00,
  "total_amount_citations": ["block_456"]
}
```

Use these block IDs with the `json` output to trace extracted values back to the source document.
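The sibling-key layout above can be regrouped so each field carries its own citations. This is a small sketch, assuming only the top-level `<field>` / `<field>_citations` convention shown in the example; `split_citations` is a hypothetical helper, and nested objects are left untouched.

```python
import json

# Suffixes the API appends next to a field (see the response examples on this page).
SIBLING_SUFFIXES = ("_citations", "_meta", "_score")


def split_citations(extraction_schema_json):
    """Return {field: {"value": ..., "citations": [...]}} for top-level fields."""
    data = json.loads(extraction_schema_json)
    fields = {}
    for key, value in data.items():
        is_sibling = any(
            key.endswith(suffix) and key[: -len(suffix)] in data
            for suffix in SIBLING_SUFFIXES
        )
        if is_sibling:
            continue  # metadata key that belongs to another field
        fields[key] = {
            "value": value,
            "citations": data.get(f"{key}_citations", []),
        }
    return fields
```

A key is treated as metadata only when its base field also exists, so a real schema field that happens to end in `_score` or `_meta` is still kept as a field.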
## Schema Examples ### Financial Document ```python theme={null} schema = { "type": "object", "properties": { "company_name": {"type": "string", "description": "Company name"}, "fiscal_year": {"type": "string", "description": "Fiscal year"}, "total_revenue": {"type": "number", "description": "Total revenue in dollars"}, "net_income": {"type": "number", "description": "Net income in dollars"}, "eps": {"type": "number", "description": "Earnings per share"} } } ``` ### Scientific Paper ```python theme={null} schema = { "type": "object", "properties": { "title": {"type": "string", "description": "Paper title"}, "authors": { "type": "array", "items": {"type": "string"}, "description": "List of author names" }, "abstract": {"type": "string", "description": "Paper abstract"}, "keywords": { "type": "array", "items": {"type": "string"}, "description": "Keywords or tags" } } } ``` ### Contract ```python theme={null} schema = { "type": "object", "properties": { "parties": { "type": "array", "items": { "type": "object", "properties": { "name": {"type": "string"}, "role": {"type": "string"} } } }, "effective_date": {"type": "string", "description": "Contract start date"}, "termination_date": {"type": "string", "description": "Contract end date"}, "total_value": {"type": "number", "description": "Total contract value"} } } ``` ## Using Checkpoints If you already converted a document with `save_checkpoint=True` using the [Convert API](/docs/recipes/conversion/conversion-api-overview), pass the `checkpoint_id` to `ExtractOptions` to skip re-parsing. This saves time and cost when running extraction on a previously converted document. 
```python theme={null}
from datalab_sdk import DatalabClient, ConvertOptions, ExtractOptions
import json

client = DatalabClient()

# Step 1: Convert and save checkpoint
convert_result = client.convert("invoice.pdf", options=ConvertOptions(save_checkpoint=True))
checkpoint_id = convert_result.checkpoint_id

# Step 2: Extract using checkpoint (no re-parsing needed)
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice ID"},
        "total_amount": {"type": "number", "description": "Total due"}
    }
}

options = ExtractOptions(
    page_schema=json.dumps(schema),
    checkpoint_id=checkpoint_id
)

result = client.extract("invoice.pdf", options=options)
extracted = json.loads(result.extraction_schema_json)
```

The extract endpoint accepts the following parameters: `file` or `file_url`, `page_schema` or `schema_id` (one is required), `schema_version`, `mode`, `extraction_mode`, `max_pages`, `page_range`, `save_checkpoint`, `checkpoint_id`, `webhook_url`, and `processing_location` (e.g. `"eu"`, which routes processing and storage to EU infrastructure and requires `file_url` or a pre-uploaded `datalab://` reference instead of a multipart upload).

### Using Saved Schemas

Instead of passing `page_schema` inline, you can save schemas to Datalab and reference them by ID. This avoids repeating the schema in every request and enables versioning.

```bash theme={null}
curl -X POST https://www.datalab.to/api/v1/extract \
  -H "X-API-Key: $DATALAB_API_KEY" \
  -F "file=@invoice.pdf" \
  -F "schema_id=sch_k8Hx9mP2nQ4v"
```

Pass `schema_version` to pin to a specific schema version; omit it to always use the latest. See [Saved Schemas](/docs/recipes/structured-extraction/saved-schemas) for full CRUD API reference.

## Confidence Scoring

**Extraction scoring is in beta.** We'd love your feedback: reach out at [support@datalab.to](mailto:support@datalab.to). Scoring is free.

Scoring runs automatically after every extraction made with `extraction_mode="fast"`; it does not run in balanced mode.
When you poll `request_check_url`, the response initially contains just the extracted fields and citations. If you continue polling the same URL, the response will eventually include `_score` fields and an `extraction_score_average` once scoring completes. Beyond setting `extraction_mode="fast"` (scoring does not run in balanced mode), no extra parameters or endpoints are needed.

Each `_score` field is a `{"score": int, "reasoning": str}` object explaining what evidence was found or missing.

### Score response format

Before scoring completes, `extraction_schema_json` contains fields and citations:

```json theme={null}
{
  "invoice_number": "INV-2024-001",
  "invoice_number_citations": ["block_123"],
  "total_amount": 1500.00,
  "total_amount_citations": ["block_456"]
}
```

Once scoring finishes, each field also gets a `_score` object:

```json theme={null}
{
  "invoice_number": "INV-2024-001",
  "invoice_number_citations": ["block_123"],
  "invoice_number_score": {
    "score": 5,
    "reasoning": "Value found verbatim in the document header with a matching citation."
  },
  "total_amount": 1500.00,
  "total_amount_citations": ["block_456"],
  "total_amount_score": {
    "score": 4,
    "reasoning": "Amount found in the totals row; minor ambiguity due to a subtotal nearby."
  }
}
```

The top-level response also includes `extraction_score_average` (4.5 in this case), the average of all field scores.

**Score rubric:**

| Score | Meaning                                                    |
| ----- | ---------------------------------------------------------- |
| 5     | High confidence: clear match with strong citation support  |
| 4     | Good confidence: match found with minor ambiguity          |
| 3     | Moderate confidence: partial match or uncertain citation   |
| 2     | Low confidence: match is inferred or weakly supported      |
| 1     | Very low confidence: no clear evidence found               |

See [Confidence Scoring](/docs/recipes/structured-extraction/confidence-scoring) for a full walkthrough with code examples.

## Auto-Generate Schemas

Don't want to write schemas by hand?
Use the schema generation endpoint to automatically suggest schemas for your document. This requires a checkpoint from a previous conversion: ```python theme={null} import os, requests, json, time headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")} # Step 1: Convert with checkpoint with open("invoice.pdf", "rb") as f: resp = requests.post( "https://www.datalab.to/api/v1/convert", files={"file": ("invoice.pdf", f, "application/pdf")}, data={"save_checkpoint": "true", "output_format": "markdown"}, headers=headers ) check_url = resp.json()["request_check_url"] # Poll until complete while True: result = requests.get(check_url, headers=headers).json() if result["status"] == "complete": checkpoint_id = result["checkpoint_id"] break time.sleep(2) # Step 2: Generate schemas resp = requests.post( "https://www.datalab.to/api/v1/marker/extraction/gen_schemas", json={"checkpoint_id": checkpoint_id}, headers=headers ) gen_check_url = resp.json()["request_check_url"] while True: result = requests.get(gen_check_url, headers=headers).json() if result["status"] == "complete": suggestions = result["suggestions"] print("Simple schema:", suggestions["simple_schema"]) print("Moderate schema:", suggestions["moderate_schema"]) print("Complex schema:", suggestions["complex_schema"]) break time.sleep(2) ``` The endpoint returns three schema options at different complexity levels — use the one that best matches your needs, then customize it. ## Using Forge Playground Create and test schemas visually in [Forge Playground](https://www.datalab.to/app/playground): 1. Upload a sample document 2. Define fields in the visual editor 3. Switch to JSON Editor to copy the schema 4. 
Test extraction before deploying ## Next Steps Per-field verification, reasoning, and extraction status for compliance workflows Create reusable schemas and reference them by ID — no need to repeat the schema in each request Score extraction results with per-field confidence ratings Strategies for extracting from 100+ page documents # Balanced Extraction Mode Source: https://documentation.datalab.to/docs/recipes/structured-extraction/balanced-mode Extraction with per-field verification, reasoning, and citations. Balanced mode runs an extraction pipeline with independent verification. Every extracted field includes an audit trail: where the value came from, how it was derived, and whether an independent check confirmed it. **Before you begin**, make sure you have: 1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits) 2. Python 3.10+ installed 3. The Datalab SDK: `pip install datalab-python-sdk` 4. Your `DATALAB_API_KEY` environment variable set ## When to Use Fast vs Balanced | | Fast | Balanced (default) | | ---------------------------- | ------------------------------------------------------- | ------------------------------------------------------------------------------------------ | | **Price** | \$6 / 1K pages | \$25 / 1K pages | | **Latency** | Lowest | Slower — trades speed for accuracy via independent verification | | **Per-field citations** | Yes | Yes | | **Extraction status** | No | Yes (EXTRACTED / NOT\_RESOLVABLE) | | **Per-field reasoning** | No | Yes | | **Independent verification** | No | Yes (PASS / FAIL) | | **Best for** | High-volume workflows: invoices, forms, bank statements | Compliance, financial, legal, and medical workflows where every field needs an audit trail | Use **fast** when speed and cost matter most. Use **balanced** when you need to trust every field and want metadata to power downstream decisions. 
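To make the price difference in the table concrete, here is a rough cost estimate. It is a sketch only: the rates are copied from the table above, `estimate_extraction_cost` is a hypothetical helper, and it assumes the listed price is the only charge being compared (this page does not say whether other charges apply).

```python
# Per-1K-page prices in USD, taken from the comparison table above.
PRICE_PER_1K_PAGES_USD = {"fast": 6.0, "balanced": 25.0}


def estimate_extraction_cost(pages, extraction_mode="balanced"):
    """Estimate the extraction charge in USD for a given page count."""
    return pages / 1000 * PRICE_PER_1K_PAGES_USD[extraction_mode]


for mode in ("fast", "balanced"):
    print(f"{mode}: ${estimate_extraction_cost(10_000, mode):.2f} for 10,000 pages")
```

At 10,000 pages this works out to \$60 in `fast` mode and \$250 in `balanced` mode, so the verification metadata costs roughly four times as much per page.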
## Schema size on short documents For shorter documents (**under 20 pages**), balanced mode limits how large your schema can be. Documents of **20+ pages have no schema-size limit**. If a schema is too large for a short document, the request fails with a clear error telling you your field count and your options — and **you aren't charged**. ### Count your fields A **field** is one value you get back — a string, number, date, true/false, or one choice from a fixed list. **Objects and lists are containers, not fields** — count the fields inside them. A list of repeated items counts its fields **once**, no matter how many rows the document has. **4 fields:** ```json theme={null} { "invoice_number": "string", "invoice_date": "string", "total_amount": "number", "currency": "string" } ``` **5 fields** — the object is a container, so count what's inside it: ```json theme={null} { "vendor": { "name": "string", "address": "string" }, "total": "number", "due_date": "string", "paid": "boolean" } ``` **4 fields** — a list's columns count once, not once per row: ```json theme={null} { "invoice_number": "string", "line_items": [ { "description": "string", "quantity": "number", "unit_price": "number" } ] } ``` ### How many fields can I use? About **25 fields** is a comfortable limit for any schema on a short document. Larger schemas often work too — especially flat ones without deep nesting — but the more fields you add, and the more deeply they're nested (lists of objects several levels down), the more likely you are to reach the limit. You don't have to guess: if a schema is too large for a short document, the request fails with a clear error (and you aren't charged), so it's safe to try a larger one. If you need a bigger schema on a short document: 1. **Split it into multiple extractions** — for example, header fields in one request and a large list in another. 2. **Use `fast` mode** — it supports larger schemas and costs less, without the per-field verification metadata. 3. 
**Trim and flatten** — drop fields you don't use and reduce nesting. This applies only to balanced mode on documents under 20 pages. Documents of 20+ pages support schemas of any size. ## Quick Start ```python Python SDK theme={null} import json from datalab_sdk import DatalabClient, ExtractOptions client = DatalabClient() schema = { "type": "object", "properties": { "company_name": {"type": "string", "description": "Full legal name of the company"}, "fiscal_year_end": {"type": "string", "description": "End date of the fiscal year (YYYY-MM-DD)"}, "total_revenue": {"type": "number", "description": "Total revenue in the reporting currency"}, "auditor_name": {"type": "string", "description": "Name of the external audit firm"} }, "required": ["company_name", "fiscal_year_end"] } options = ExtractOptions( page_schema=json.dumps(schema), ) result = client.extract("annual_report.pdf", options=options) extracted = json.loads(result.extraction_schema_json) # Each field comes with citations and metadata print(f"Company: {extracted['company_name']}") print(f"Citations: {extracted['company_name_citations']}") print(f"Status: {extracted['company_name_meta']['extraction_status']}") print(f"Verified: {extracted['company_name_meta']['verification']['status']}") ``` ```bash cURL theme={null} curl -X POST https://www.datalab.to/api/v1/extract \ -H "X-API-Key: $DATALAB_API_KEY" \ -F "file=@annual_report.pdf" \ -F "extraction_mode=balanced" \ -F 'page_schema={"type":"object","properties":{"company_name":{"type":"string","description":"Full legal name"},"total_revenue":{"type":"number","description":"Total revenue"}}}' # Poll request_check_url from response until status is "complete" ``` ```python Python (requests) theme={null} import requests, json, time, os headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")} schema = { "type": "object", "properties": { "company_name": {"type": "string", "description": "Full legal name"}, "total_revenue": {"type": "number", "description": "Total 
revenue"} } } with open("annual_report.pdf", "rb") as f: response = requests.post( "https://www.datalab.to/api/v1/extract", files={"file": ("annual_report.pdf", f, "application/pdf")}, data={ "page_schema": json.dumps(schema), "extraction_mode": "balanced" }, headers=headers ) check_url = response.json()["request_check_url"] while True: result = requests.get(check_url, headers=headers).json() if result["status"] == "complete": extracted = json.loads(result["extraction_schema_json"]) print(json.dumps(extracted, indent=2)) break elif result["status"] == "failed": print(f"Error: {result.get('error')}") break time.sleep(2) ``` `extraction_mode` controls the extraction pipeline (`fast` or `balanced`). This is separate from `mode`, which controls the document parsing stage (`fast`, `balanced`, or `accurate`). You can combine them independently — for example, `mode="fast"` with `extraction_mode="balanced"`. `extraction_mode` is not yet exposed in the Python SDK's `ExtractOptions`. To set it explicitly, use the cURL or Python requests examples above. When omitted, the team's configured default applies (see [Changelog](/platform/changelog) — 6/4/2026 for default rules). ## Response Format In balanced mode, each extracted field includes three sibling keys. The `_citations` sibling is the same format as fast mode for compatibility — balanced mode adds `_meta` with richer metadata on top: ```json theme={null} { "company_name": "Whitbread PLC", "company_name_citations": ["/page/0/Text/3", "/page/2/Table/1"], "company_name_meta": { "extraction_status": "EXTRACTED", "reasoning": "The company name 'Whitbread PLC' appears in the document header on the cover page (/page/0/Text/3) and is confirmed in the directors' report (/page/2/Table/1).", "citations": ["/page/0/Text/3", "/page/2/Table/1"], "verification": { "status": "PASS", "feedback": "The company name 'Whitbread PLC' is printed on the cover page (/page/0/Text/3) and confirmed in the directors' report. 
No conflicting name appears in the document." } } } ``` The `_citations` key is shared with fast mode — if you switch between modes, citation-consuming code continues to work. The `_meta` key is balanced-mode-only and contains the full audit trail. ### Field Metadata Each `_meta` object contains: | Field | Description | | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | | `extraction_status` | How the value was produced: `EXTRACTED` (value found in the document) or `NOT_RESOLVABLE` (document doesn't contain this information) | | `reasoning` | Audit-ready prose explaining how the value was produced, with block ID citations | | `citations` | Block IDs from the source document that support the value | | `verification` | Independent verification result with `status` and `feedback` | ### Extraction Status | Status | Meaning | Value | | ---------------- | --------------------------------------------------- | ------------------- | | `EXTRACTED` | The value was found in or derived from the document | The extracted value | | `NOT_RESOLVABLE` | The document does not contain or imply this value | `null` | ### Verification Status | Status | Meaning | | ------------------- | ------------------------------------------------------------------------------------------------ | | `PASS` | The value and citations were independently confirmed against the source document | | `FAIL_UNRESOLVABLE` | The document does not support a value for this field | | `FAIL_FIX` | The value was flagged as incorrect during verification — the document supports a different value | | `FAIL_CITATIONS` | The value is correct but the citations are wrong or insufficient | | `ITEMS_MISSING` | (List fields only) The document contains entries that are not present in the extraction | In practice, most fields will be `PASS` or `FAIL_UNRESOLVABLE` after verification. 
The other statuses indicate cases where the verifier flagged an issue that could not be fully resolved automatically.

## Building Workflows with Verification Metadata

The per-field metadata enables automated quality gates:

```python theme={null}
import json

# `result` is the extraction result from the Quick Start above
extracted = json.loads(result.extraction_schema_json)

# Separate fields by verification status
auto_approved = []
needs_review = []

# Walk all top-level fields and check their _meta
for key, value in extracted.items():
    if key.endswith("_meta"):
        field_name = key.removesuffix("_meta")
        meta = value
        verification = meta.get("verification") or {}

        if verification.get("status") == "PASS":
            auto_approved.append(field_name)
        else:
            needs_review.append({
                "field": field_name,
                "extraction_status": meta.get("extraction_status"),
                "reasoning": meta.get("reasoning"),
                "verification_feedback": verification.get("feedback"),
            })

print(f"Auto-approved: {len(auto_approved)} fields")
print(f"Needs review: {len(needs_review)} fields")

# Route to human review queue
for item in needs_review:
    print(f"  {item['field']}: {item['extraction_status']}")
    # reasoning may be missing, so fall back to an empty string before slicing
    print(f"    Reason: {(item['reasoning'] or '')[:100]}...")
```

### Common Workflow Patterns

* **Auto-approve** when all fields have `verification.status == "PASS"`: no human review needed
* **Flag for review** when any field is `NOT_RESOLVABLE` or has a non-`PASS` verification status (`FAIL_*` or `ITEMS_MISSING`): the document may be missing information or the extraction needs a human check
* **Show citations** to reviewers so they can verify in seconds: each field links back to specific blocks in the document
* **Use reasoning as an audit trail**: for compliance workflows, the per-field reasoning documents exactly how each value was produced, with block-level citations back to the source document

## Next Steps

* Schema format, response structure, and extraction tips
* Additional per-field confidence scores (fast extraction mode only)
* Save and version schemas for reuse across requests
* Tips for extracting from 100+ page documents

# Extraction Confidence
Scoring Source: https://documentation.datalab.to/docs/recipes/structured-extraction/confidence-scoring Score extraction results with per-field confidence ratings and reasoning. Score your structured extraction results to get per-field confidence ratings (1–5) with reasoning that explains what evidence was found or missing. **Extraction scoring is in beta.** We'd love your feedback — reach out at [support@datalab.to](mailto:support@datalab.to). Scoring is free. **Before you begin**, make sure you have: 1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits) 2. Python 3.10+ installed 3. The Datalab SDK: `pip install datalab-python-sdk` 4. Your `DATALAB_API_KEY` environment variable set ## How It Works **Confidence scoring runs in `fast` extraction mode only**, and `extraction_mode` defaults to `balanced`. To receive scores you must request `fast` mode explicitly: `extraction_mode="fast"`. In `balanced` and `turbo` modes, `_score` fields and `extraction_score_average` are **never** returned no matter how long you poll (see the note below). When you run extraction with `extraction_mode="fast"`, scoring runs automatically afterward. When you poll `request_check_url`, the extraction result initially contains just the extracted fields and citations. If you continue polling the same URL, the response will eventually include `_score` fields and an `extraction_score_average` once scoring completes (typically within a minute of `status` becoming `complete`). Each scored field receives: * A **score** from 1 (very low confidence) to 5 (high confidence) * A **reasoning** string explaining what evidence supports or undermines the extracted value Beyond setting `extraction_mode="fast"`, no extra parameters or endpoints are needed — just keep polling until scores appear. 
**Using balanced extraction mode?** Balanced mode does **not** produce `_score` fields or `extraction_score_average`. Instead it includes its own per-field verification (`_meta.verification`, with a `status` of `PASS`/`FAIL_*` and `feedback`) that runs inline as part of the extraction pipeline — a richer, different signal than the numeric confidence scores described here. The two mechanisms are mutually exclusive: use `fast` mode for numeric `_score`s, or `balanced` mode for inline verification. See [Balanced Mode](/docs/recipes/structured-extraction/balanced-mode). ## Example ```python Python (requests) theme={null} import requests, json, time, os headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")} schema = { "type": "object", "properties": { "invoice_number": {"type": "string", "description": "Invoice ID or number"}, "total_amount": {"type": "number", "description": "Total amount due"}, "vendor_name": {"type": "string", "description": "Vendor or company name"} }, "required": ["invoice_number", "total_amount"] } with open("invoice.pdf", "rb") as f: resp = requests.post( "https://www.datalab.to/api/v1/extract", files={"file": ("invoice.pdf", f, "application/pdf")}, data={ "page_schema": json.dumps(schema), "extraction_mode": "fast" # scoring runs in fast mode only }, headers=headers ) check_url = resp.json()["request_check_url"] # Poll until extraction is complete while True: result = requests.get(check_url, headers=headers).json() if result["status"] == "complete": extracted = json.loads(result["extraction_schema_json"]) print("Extraction:", extracted) break time.sleep(2) # Scores are enriched asynchronously after completion. Keep polling the same # URL until extraction_score_average appears (bounded so we don't loop forever). 
for _ in range(30): if "extraction_score_average" in result: break time.sleep(2) result = requests.get(check_url, headers=headers).json() scored = json.loads(result["extraction_schema_json"]) for key, value in scored.items(): if key.endswith("_score"): field = key.removesuffix("_score") print(f"{field}: score={value['score']}, reasoning={value['reasoning']}") ``` ```bash cURL theme={null} curl -X POST https://www.datalab.to/api/v1/extract \ -H "X-API-Key: $DATALAB_API_KEY" \ -F "file=@invoice.pdf" \ -F 'page_schema={"type":"object","properties":{"invoice_number":{"type":"string","description":"Invoice ID"},"total_amount":{"type":"number","description":"Total due"},"vendor_name":{"type":"string","description":"Vendor or company name"}}}' \ -F "extraction_mode=fast" # Poll request_check_url until status is "complete" for extraction results. # Keep polling the same URL — scores (_score fields + extraction_score_average) # appear once scoring finishes. Scoring runs in fast mode only. ``` ## Response Format Without scoring, `extraction_schema_json` contains fields and citations: ```json theme={null} { "invoice_number": "INV-2024-001", "invoice_number_citations": ["block_123"], "total_amount": 1500.00, "total_amount_citations": ["block_456"] } ``` With scoring, each field also gets a `_score` object: ```json theme={null} { "invoice_number": "INV-2024-001", "invoice_number_citations": ["block_123"], "invoice_number_score": { "score": 5, "reasoning": "Value found verbatim in the document header with a matching citation." }, "total_amount": 1500.00, "total_amount_citations": ["block_456"], "total_amount_score": { "score": 4, "reasoning": "Amount found in the totals row; minor ambiguity due to a subtotal nearby." } } ``` The top-level response also includes `extraction_score_average` (4.5 in this case), the average of all field scores.
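
To make the relationship between the per-field scores and the average concrete, the sketch below recomputes a mean from the `_score` objects in the sample response. It assumes the average is a plain arithmetic mean of the field scores, which the text suggests but does not spell out. `mean_field_score` is a hypothetical helper name, not part of the API or SDK.

```python
import json
from statistics import mean
from typing import Optional

# Sample payload copied from the scored response in this section.
extraction_schema_json = json.dumps({
    "invoice_number": "INV-2024-001",
    "invoice_number_citations": ["block_123"],
    "invoice_number_score": {"score": 5, "reasoning": "Value found verbatim."},
    "total_amount": 1500.00,
    "total_amount_citations": ["block_456"],
    "total_amount_score": {"score": 4, "reasoning": "Minor ambiguity."},
})


def mean_field_score(payload: str) -> Optional[float]:
    """Average the `score` of every `<field>_score` object (assumed: arithmetic mean)."""
    scored = json.loads(payload)
    scores = [
        value["score"]
        for key, value in scored.items()
        if key.endswith("_score") and isinstance(value, dict) and "score" in value
    ]
    # Return None rather than raising when no field has been scored yet.
    return mean(scores) if scores else None


print(mean_field_score(extraction_schema_json))  # 4.5
```

A locally computed value like this can serve as a sanity check while scores are still arriving; the `extraction_score_average` returned by the API remains the authoritative number.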
### Score Rubric | Score | Meaning | | ----- | ---------------------------------------------------------- | | 5 | High confidence — clear match with strong citation support | | 4 | Good confidence — match found with minor ambiguity | | 3 | Moderate confidence — partial match or uncertain citation | | 2 | Low confidence — match is inferred or weakly supported | | 1 | Very low confidence — no clear evidence found | ## Using Scores in Practice Use `extraction_score_average` for a quick quality check, then inspect individual `_score` fields to flag low-confidence results: ```python theme={null} import json # After polling until the scored result is available avg = result["extraction_score_average"] print(f"Average score: {avg}") scored = json.loads(result["extraction_schema_json"]) for key, value in scored.items(): if not key.endswith("_score"): continue field = key.removesuffix("_score") if value["score"] <= 2: print(f"Low confidence for '{field}': {value['reasoning']}") elif value["score"] >= 4: print(f"'{field}' = {scored[field]}") ``` This is useful for building review workflows — auto-accept high-confidence fields and route low-confidence ones to a human reviewer. ## Next Steps Full extraction API reference and schema examples Strategies for extracting from 100+ page documents Chain processors into versioned, reusable pipelines. Convert documents to various formats # Handling Long Documents Source: https://documentation.datalab.to/docs/recipes/structured-extraction/handling-long-documents Tips for structured extraction on documents with 50+ pages. For long documents, use page ranges and document segmentation to improve speed and accuracy.
## Restrict to Specific Pages If you know which pages contain the data you need, use `page_range`: ```python theme={null} from datalab_sdk import DatalabClient, ConvertOptions import json client = DatalabClient() schema = { "type": "object", "properties": { "executive_summary": {"type": "string", "description": "Executive summary text"} } } # Only process pages 0-5 (first 6 pages) options = ConvertOptions( page_schema=json.dumps(schema), page_range="0-5", mode="balanced" ) result = client.convert("long_document.pdf", options=options) ``` You're only charged for the pages you process. ## Segment and Chain Extractions For documents with distinct sections (like financial reports or contracts), extract the table of contents first, then process each section separately. ### Step 1: Extract Table of Contents ```python theme={null} import json from datalab_sdk import DatalabClient, ConvertOptions client = DatalabClient() toc_schema = { "type": "object", "properties": { "table_of_contents": { "type": "array", "items": { "type": "object", "properties": { "section_name": {"type": "string"}, "page_number": {"type": "number"} } } } } } # Extract TOC from first few pages options = ConvertOptions( page_schema=json.dumps(toc_schema), page_range="0-5", mode="balanced" ) result = client.convert("report.pdf", options=options) toc = json.loads(result.extraction_schema_json) print("Sections found:") for item in toc["table_of_contents"]: print(f" {item['section_name']}: page {item['page_number']}") ``` ### Step 2: Extract Each Section ```python theme={null} # Define schemas for different sections section_schemas = { "Financial Highlights": { "type": "object", "properties": { "revenue": {"type": "number"}, "net_income": {"type": "number"}, "eps": {"type": "number"} } }, "Risk Factors": { "type": "object", "properties": { "risks": { "type": "array", "items": {"type": "string"} } } } } # Build page ranges from TOC sections = toc["table_of_contents"] results = {} for i, section in 
enumerate(sections): section_name = section["section_name"] # page_range is 0-indexed; adjust here if the TOC lists printed (1-indexed) page numbers start_page = int(section["page_number"]) # End page is the page before the next section starts; the last section has no known end end_page = int(sections[i + 1]["page_number"]) - 1 if i + 1 < len(sections) else None # Get schema for this section if we have one schema = section_schemas.get(section_name) if schema: # Without a known end page, only the section's first page is processed page_range = f"{start_page}-{end_page}" if end_page is not None else str(start_page) options = ConvertOptions( page_schema=json.dumps(schema), page_range=page_range, mode="balanced" ) result = client.convert("report.pdf", options=options) results[section_name] = json.loads(result.extraction_schema_json) print(results) ``` ## Use Document Segmentation For documents without a clear table of contents, use [Document Segmentation](/docs/recipes/document-segmentation/auto-segmentation) to automatically split by section headers. ```python theme={null} segmentation_schema = { "type": "object", "properties": { "sections": { "type": "array", "items": { "type": "object", "properties": { "title": {"type": "string"}, "type": {"type": "string", "enum": ["introduction", "methods", "results", "conclusion"]} } } } } } options = ConvertOptions( segmentation_schema=json.dumps(segmentation_schema), mode="balanced" ) result = client.convert("paper.pdf", options=options) # Access segmentation results segments = result.segmentation_results ``` ## Full Example Complete workflow for processing a 100+ page financial report: ```python theme={null} import json from datalab_sdk import DatalabClient, ConvertOptions client = DatalabClient() def extract_with_toc(pdf_path: str, section_schemas: dict) -> dict: """Extract data from a long document using TOC-based segmentation.""" # Step 1: Extract table of contents toc_schema = { "type": "object", "properties": { "table_of_contents": { "type": "array", "items": { "type": "object", "properties": { "section_name": {"type": "string"}, "page_number": {"type": "number"} } } } } } options = ConvertOptions(
page_schema=json.dumps(toc_schema), page_range="0-6", mode="balanced" ) result = client.convert(pdf_path, options=options) toc = json.loads(result.extraction_schema_json) sections = toc.get("table_of_contents", []) # Step 2: Extract each section with its schema results = {} for i, section in enumerate(sections): section_name = section["section_name"] start_page = int(section["page_number"]) # Calculate page range if i + 1 < len(sections): end_page = int(sections[i + 1]["page_number"]) - 1 page_range = f"{start_page}-{end_page}" else: page_range = str(start_page) # Check if we have a schema for this section schema = section_schemas.get(section_name) if not schema: continue options = ConvertOptions( page_schema=json.dumps(schema), page_range=page_range, mode="balanced" ) try: result = client.convert(pdf_path, options=options) results[section_name] = json.loads(result.extraction_schema_json) print(f"Extracted: {section_name}") except Exception as e: print(f"Error extracting {section_name}: {e}") return results # Define schemas for sections you care about schemas = { "Financial Highlights": { "type": "object", "properties": { "total_revenue": {"type": "number", "description": "Total revenue"}, "net_income": {"type": "number", "description": "Net income"}, "year": {"type": "string", "description": "Fiscal year"} } }, "Business Overview": { "type": "object", "properties": { "description": {"type": "string", "description": "Business description"}, "products": {"type": "array", "items": {"type": "string"}} } } } results = extract_with_toc("annual_report.pdf", schemas) print(json.dumps(results, indent=2)) ``` ## Tips 1. **Process pages you need** - Use `page_range` to avoid processing unnecessary pages 2. **Extract TOC first** - Build page ranges dynamically from the document structure 3. **Use appropriate modes** - `balanced` is usually sufficient; use `accurate` for complex tables 4. 
**Handle errors** - Some sections may not match your schema exactly ## Next Steps Learn the full structured extraction API and schema options. Automatically split documents by section headers. Process multiple long documents efficiently in parallel. Chain processors into versioned, reusable pipelines. # Saved Schemas Source: https://documentation.datalab.to/docs/recipes/structured-extraction/saved-schemas Create and manage reusable extraction schemas in the Datalab UI. Reference saved schemas by ID instead of sending the full schema with every request. **Before you begin**, make sure you have: 1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits) 2. Python 3.10+ installed 3. The Datalab SDK: `pip install datalab-python-sdk` 4. Your `DATALAB_API_KEY` environment variable set ## Overview Saved Schemas let you store extraction schemas in Datalab and reference them by ID (`schema_id`) when calling `/api/v1/extract`. Instead of sending a full JSON schema with every request, you save it once and reference it by its stable ID. Saved schemas also support **versioning** — you can update a schema while keeping a history of previous versions and pin extractions to a specific version using `schema_version`. ## Create a Schema Create schemas via the SDK or the [Datalab UI](https://www.datalab.to/app/schemas). Each schema is assigned a `schema_id` (e.g. `sch_k8Hx9mP2nQ4v`) that you can reference in extraction requests. 
```python Python SDK theme={null} from datalab_sdk import DatalabClient client = DatalabClient() schema = client.create_extraction_schema( name="Invoice Schema", description="Extracts key fields from invoices", schema_json={ "properties": { "invoice_number": {"type": "string", "description": "Invoice ID"}, "total_amount": {"type": "number", "description": "Total amount due"}, "vendor_name": {"type": "string", "description": "Vendor or supplier name"}, "due_date": {"type": "string", "description": "Payment due date"}, } }, ) print(schema.schema_id) # e.g. sch_k8Hx9mP2nQ4v ``` ```bash cURL theme={null} curl -X POST https://www.datalab.to/api/v1/extraction_schemas \ -H "X-API-Key: $DATALAB_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "name": "Invoice Schema", "description": "Extracts key fields from invoices", "schema_json": { "properties": { "invoice_number": {"type": "string", "description": "Invoice ID"}, "total_amount": {"type": "number", "description": "Total amount due"}, "vendor_name": {"type": "string", "description": "Vendor or supplier name"}, "due_date": {"type": "string", "description": "Payment due date"} } } }' ``` ```python Python (requests) theme={null} import os, requests resp = requests.post( "https://www.datalab.to/api/v1/extraction_schemas", headers={"X-API-Key": os.getenv("DATALAB_API_KEY")}, json={ "name": "Invoice Schema", "schema_json": { "properties": { "invoice_number": {"type": "string"}, "total_amount": {"type": "number"}, } }, }, ) schema_id = resp.json()["schema_id"] print(schema_id) ``` ## Extract Using a Saved Schema Pass `schema_id` to `/api/v1/extract` instead of `page_schema`: ```python Python SDK theme={null} from datalab_sdk import DatalabClient, ExtractOptions import json client = DatalabClient() result = client.extract( "invoice.pdf", options=ExtractOptions( schema_id="sch_k8Hx9mP2nQ4v", mode="balanced", ), ) extracted = json.loads(result.extraction_schema_json) print(extracted) ``` ```bash cURL theme={null} curl -X 
POST https://www.datalab.to/api/v1/extract \ -H "X-API-Key: $DATALAB_API_KEY" \ -F "file=@invoice.pdf" \ -F "schema_id=sch_k8Hx9mP2nQ4v" \ -F "mode=balanced" # Poll request_check_url from response until status is "complete" ``` ```python Python (requests) theme={null} import requests, time, os headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")} with open("invoice.pdf", "rb") as f: resp = requests.post( "https://www.datalab.to/api/v1/extract", files={"file": ("invoice.pdf", f, "application/pdf")}, data={"schema_id": "sch_k8Hx9mP2nQ4v", "mode": "balanced"}, headers=headers ) check_url = resp.json()["request_check_url"] while True: result = requests.get(check_url, headers=headers).json() if result["status"] == "complete": import json extracted = json.loads(result["extraction_schema_json"]) print(extracted) break elif result["status"] == "failed": print(f"Error: {result.get('error')}") break time.sleep(2) ``` `page_schema` and `schema_id` are mutually exclusive — provide exactly one. If you pass both, the API returns a `400` error. ## Schema Versioning When you update a schema in the [Datalab UI](https://www.datalab.to/app/schemas), you can choose to create a new version. This saves the current state to version history and increments the version number. ### Pin to a specific version Pass `schema_version` alongside `schema_id` to use a specific version: ```bash theme={null} curl -X POST https://www.datalab.to/api/v1/extract \ -H "X-API-Key: $DATALAB_API_KEY" \ -F "file=@invoice.pdf" \ -F "schema_id=sch_k8Hx9mP2nQ4v" \ -F "schema_version=1" ``` Omitting `schema_version` always uses the latest version. We recommend always specifying `schema_version` alongside `schema_id`. This ensures your extractions produce consistent results even if the schema is updated later. 
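
For Python callers, the pinned request can be sketched with `requests` along the following lines. This is only an illustration: `build_extract_form` is a hypothetical helper (not part of the SDK) that assembles the `schema_id` and `schema_version` form fields described above, and the network call is left commented out because it needs a real file and API key.

```python
from typing import Optional


def build_extract_form(schema_id: str, schema_version: Optional[int] = None) -> dict:
    """Assemble form fields for /api/v1/extract using a saved schema (hypothetical helper)."""
    form = {"schema_id": schema_id}
    if schema_version is not None:
        # Multipart form values are sent as strings.
        form["schema_version"] = str(schema_version)
    return form


form = build_extract_form("sch_k8Hx9mP2nQ4v", schema_version=1)
print(form)

# To submit (requires `requests`, a local invoice.pdf, and DATALAB_API_KEY):
# import os, requests
# with open("invoice.pdf", "rb") as f:
#     resp = requests.post(
#         "https://www.datalab.to/api/v1/extract",
#         headers={"X-API-Key": os.getenv("DATALAB_API_KEY")},
#         files={"file": ("invoice.pdf", f, "application/pdf")},
#         data=form,
#     )
# check_url = resp.json()["request_check_url"]
```

Keeping the pinned version in one place (a constant or config value passed to the helper) makes it easier to bump deliberately when a new schema version has been reviewed.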
## List Schemas ```python Python SDK theme={null} result = client.list_extraction_schemas(limit=50, include_archived=False) for s in result["schemas"]: print(f"{s.schema_id}: {s.name} (v{s.version})") ``` ```bash cURL theme={null} # List active schemas curl "https://www.datalab.to/api/v1/extraction_schemas" \ -H "X-API-Key: $DATALAB_API_KEY" # Include archived schemas curl "https://www.datalab.to/api/v1/extraction_schemas?include_archived=true" \ -H "X-API-Key: $DATALAB_API_KEY" ``` ```python Python (requests) theme={null} import os, requests resp = requests.get( "https://www.datalab.to/api/v1/extraction_schemas", headers={"X-API-Key": os.getenv("DATALAB_API_KEY")}, ) for s in resp.json()["schemas"]: print(s["schema_id"], s["name"]) ``` The response includes `schemas` (array) and `total` (count). Schemas are ordered by creation date, newest first. ## Get a Schema ```python Python SDK theme={null} schema = client.get_extraction_schema("sch_k8Hx9mP2nQ4v") print(schema.name, schema.version) ``` ```bash cURL theme={null} curl "https://www.datalab.to/api/v1/extraction_schemas/sch_k8Hx9mP2nQ4v" \ -H "X-API-Key: $DATALAB_API_KEY" ``` ```python Python (requests) theme={null} import os, requests resp = requests.get( "https://www.datalab.to/api/v1/extraction_schemas/sch_k8Hx9mP2nQ4v", headers={"X-API-Key": os.getenv("DATALAB_API_KEY")}, ) print(resp.json()) ``` ## Update a Schema Update schema fields. 
Pass `create_new_version=True` to save the current state to version history before updating: ```python Python SDK theme={null} # Update schema fields and create a new version schema = client.update_extraction_schema( "sch_k8Hx9mP2nQ4v", schema_json={ "properties": { "invoice_number": {"type": "string"}, "total_amount": {"type": "number"}, "line_items": {"type": "array", "items": {"type": "string"}}, # New field } }, create_new_version=True, ) print(f"Now at v{schema.version}") ``` ```bash cURL theme={null} curl -X PUT "https://www.datalab.to/api/v1/extraction_schemas/sch_k8Hx9mP2nQ4v" \ -H "X-API-Key: $DATALAB_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "schema_json": { "properties": { "invoice_number": {"type": "string"}, "total_amount": {"type": "number"}, "line_items": {"type": "array", "items": {"type": "string"}} } }, "create_new_version": true }' ``` ```python Python (requests) theme={null} import os, requests resp = requests.put( "https://www.datalab.to/api/v1/extraction_schemas/sch_k8Hx9mP2nQ4v", headers={"X-API-Key": os.getenv("DATALAB_API_KEY")}, json={ "schema_json": {"properties": {"invoice_number": {"type": "string"}}}, "create_new_version": True, }, ) print(resp.json()["version"]) ``` ## Archive a Schema Archiving soft-deletes a schema — it no longer appears in list results (unless `include_archived=true`) and cannot be used for new extractions: ```python Python SDK theme={null} client.delete_extraction_schema("sch_k8Hx9mP2nQ4v") ``` ```bash cURL theme={null} curl -X DELETE "https://www.datalab.to/api/v1/extraction_schemas/sch_k8Hx9mP2nQ4v" \ -H "X-API-Key: $DATALAB_API_KEY" ``` ```python Python (requests) theme={null} import os, requests requests.delete( "https://www.datalab.to/api/v1/extraction_schemas/sch_k8Hx9mP2nQ4v", headers={"X-API-Key": os.getenv("DATALAB_API_KEY")}, ) ``` ## API Reference ### Schema Object | Field | Type | Description | | ----------------- | ------------ | ------------------------------------------------------- | | 
`schema_id` | string | Stable string ID (e.g. `sch_k8Hx9mP2nQ4v`) | | `name` | string | Human-readable name (max 200 chars) | | `description` | string\|null | Optional description | | `schema_json` | object | JSON schema with a `properties` key | | `version` | int | Current version number (starts at 1) | | `version_history` | array | Previous versions saved with `create_new_version: true` | | `archived` | bool | Whether the schema is archived | | `created` | datetime | Creation timestamp | | `updated` | datetime | Last update timestamp | ### `/extract` Parameters (schema-related) | Parameter | Type | Description | | ---------------- | ------ | ---------------------------------------------------------------- | | `schema_id` | string | ID of a saved schema. Mutually exclusive with `page_schema`. | | `schema_version` | int | Version to use. Only valid with `schema_id`. Defaults to latest. | ## Next Steps Full guide to extraction with inline schemas, checkpoints, and options. Score extraction results with per-field confidence ratings. Compare extraction results across configurations using saved schemas. Strategies for extracting from 100+ page documents. # Table Recognition Source: https://documentation.datalab.to/docs/recipes/table-recognition/table-rec-api-overview Extract tables from documents. **Deprecated:** The standalone Table Recognition endpoint (`/api/v1/table_rec`) is deprecated. Table extraction is now integrated into the Convert API. Use the Convert API with `output_format: "json"` to get structured table data with bounding boxes. 
## Recommended Approach Use the Convert API for table extraction: ```python theme={null} from datalab_sdk import DatalabClient, ConvertOptions client = DatalabClient() options = ConvertOptions( output_format="json", mode="balanced" ) result = client.convert("document.pdf", options=options) # Tables are in the JSON output with block_type: "Table" for block in result.json.get("children", []): if block.get("block_type") == "Table": print(f"Table found: {block['id']}") print(f"Bounding box: {block['bbox']}") # Access table cells in block['children'] ``` ### REST API ```bash theme={null} curl -X POST https://www.datalab.to/api/v1/convert \ -H "X-API-Key: YOUR_API_KEY" \ -F "file=@document.pdf" \ -F "output_format=json" \ -F "mode=balanced" ``` The JSON response includes `Table` and `TableCell` blocks with bounding boxes. ## Why Use the Convert API Instead? * **Single endpoint** - No need for a separate table-specific call * **Better integration** - Tables are extracted in context with the full document * **More features** - Access processing modes, structured extraction, and more * **Consistent API** - Same patterns as all other document processing ## Related * [Document Conversion](/docs/recipes/conversion/conversion-api-overview) - Full Convert API documentation * [Structured Extraction](/docs/recipes/structured-extraction/api-overview) - Extract specific data from tables Get started with our API in less than a minute. We include free credits. # API Overview Source: https://documentation.datalab.to/docs/welcome/api REST API reference for document conversion, form filling, and file management. Datalab provides REST APIs for document conversion, structured extraction, form filling, and file management. All APIs use the same authentication and follow similar patterns. For the simplest integration, use the [Python SDK](/docs/welcome/sdk). The SDK handles authentication, polling, and provides typed responses.
## Authentication All requests require an API key in the `X-API-Key` header: ```bash theme={null} curl -X POST https://www.datalab.to/api/v1/convert \ -H "X-API-Key: YOUR_API_KEY" \ -F "file=@document.pdf" ``` Get your API key from the [API Keys dashboard](https://www.datalab.to/app/keys). ## Request Pattern All processing endpoints follow this pattern: 1. **Submit** a document for processing (returns immediately with a `request_id`) 2. **Poll** the status endpoint until processing completes 3. **Retrieve** results from the completed response ### Submit Request ```bash theme={null} POST /api/v1/{endpoint} ``` Response: ```json theme={null} { "success": true, "request_id": "abc123", "request_check_url": "https://www.datalab.to/api/v1/{endpoint}/abc123" } ``` ### Poll for Results ```bash theme={null} GET /api/v1/{endpoint}/{request_id} ``` Response while processing: ```json theme={null} { "status": "processing" } ``` Response when complete: ```json theme={null} { "status": "complete", "success": true, ...results... } ``` Results are deleted from Datalab servers one hour after processing completes. Retrieve your results promptly. ## Document Conversion Convert documents to Markdown, HTML, JSON, or chunks. 
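
Like every processing endpoint, conversion follows the submit-and-poll request pattern described above. A minimal sketch of a bounded polling loop follows. `poll_until_done`, its parameters, and the injected `fetch` callable are illustrative names rather than SDK functions; the `processing`, `complete`, and `failed` status values are the ones this page documents.

```python
import time
from typing import Callable


def poll_until_done(
    fetch: Callable[[], dict],
    interval: float = 2.0,
    max_attempts: int = 150,
) -> dict:
    """Call `fetch` until the response reports `complete`, raising on failure or timeout."""
    for _ in range(max_attempts):
        result = fetch()
        status = result.get("status")
        if status == "complete":
            return result
        if status == "failed":
            raise RuntimeError(f"Processing failed: {result.get('error')}")
        time.sleep(interval)
    raise TimeoutError(f"No final status after {max_attempts} attempts")


# Demo with canned responses standing in for GET request_check_url.
responses = iter([{"status": "processing"}, {"status": "complete", "success": True}])
final = poll_until_done(lambda: next(responses), interval=0)
print(final["status"])  # complete
```

In real use, `fetch` would wrap the HTTP call, for example `lambda: requests.get(check_url, headers=headers).json()`. Injecting it keeps the loop easy to test and puts the timeout policy in one place.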
**Endpoint:** `POST /api/v1/convert` ### Request ```python theme={null} import requests url = "https://www.datalab.to/api/v1/convert" headers = {"X-API-Key": "YOUR_API_KEY"} with open("document.pdf", "rb") as f: response = requests.post( url, files={"file": ("document.pdf", f, "application/pdf")}, data={ "output_format": "markdown", "mode": "balanced", }, headers=headers ) data = response.json() check_url = data["request_check_url"] ``` ### Parameters | Parameter | Type | Default | Description | | ---------------------------- | ------ | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | `file` | file | - | Document file (multipart upload) | | `file_url` | string | - | URL to document (alternative to file upload) | | `output_format` | string | `markdown` | Output format: `markdown`, `html`, `json`, `chunks` | | `mode` | string | `fast` | Processing mode: `fast`, `balanced`, `accurate` | | `max_pages` | int | - | Maximum pages to process | | `page_range` | string | - | Specific pages (e.g., `"0-5,10"`, 0-indexed). For spreadsheets, filters by sheet index. | | `paginate` | bool | `false` | Add page delimiters to output | | `skip_cache` | bool | `false` | Skip cached results | | `disable_image_extraction` | bool | `false` | Don't extract images | | `disable_image_captions` | bool | `false` | Don't generate image captions | | `save_checkpoint` | bool | `false` | Save checkpoint for reuse | | `word_bboxes` | bool | `false` | Predict per-word bounding boxes. Each word is inlined as `` in HTML output. Billed at \$0.30/1K pages. | | `extras` | string | - | Comma-separated: `track_changes`, `chart_understanding`, `extract_links`, `table_cell_bboxes`, `list_item_bboxes`, `infographic`, `new_block_types`. (`table_row_bboxes` is deprecated — use `table_cell_bboxes`.) 
| | `add_block_ids` | bool | `false` | Add block IDs to HTML for citations | | `include_markdown_in_chunks` | bool | `false` | Include markdown content in chunks output | | `token_efficient_markdown` | bool | `false` | Optimize markdown for LLM token efficiency | | `fence_synthetic_captions` | bool | `false` | Wrap synthetic image captions in HTML comments | | `additional_config` | string | - | JSON with extra config options | | `webhook_url` | string | - | Override webhook URL for this request | ### Processing Modes | Mode | Description | | ---------- | --------------------------------------------------- | | `fast` | Lowest latency, good for simple documents (default) | | `balanced` | Balance of speed and accuracy | | `accurate` | Highest accuracy, best for complex layouts | ### Response Poll `request_check_url` until `status` is `complete` or `failed`: ```python theme={null} import time while True: response = requests.get(check_url, headers=headers) result = response.json() if result["status"] == "complete": break # Stop polling on failure instead of looping forever if result["status"] == "failed": raise RuntimeError(result.get("error")) time.sleep(2) print(result["markdown"]) ``` Response fields: | Field | Type | Description | | --------------------- | ------ | ---------------------------------------- | | `status` | string | `processing`, `complete`, or `failed` | | `success` | bool | Whether conversion succeeded | | `markdown` | string | Markdown output (if format is markdown) | | `html` | string | HTML output (if format is html) | | `json` | object | JSON output (if format is json) | | `chunks` | object | Chunked output (if format is chunks) | | `images` | object | Extracted images as `{filename: base64}` | | `metadata` | object | Document metadata | | `page_count` | int | Number of pages processed | | `parse_quality_score` | float | Quality score (0-5) | | `cost_breakdown` | object | Cost in cents | | `error` | string | Error message if failed | For structured data extraction, see the [Extract endpoint](#structured-extraction).
For document segmentation, see the [Segment endpoint](#document-segmentation). ## Structured Extraction Extract structured data from documents using a JSON schema. **Endpoint:** `POST /api/v1/extract` ### Request ```python theme={null} import requests import json headers = {"X-API-Key": "YOUR_API_KEY"} # Fields go under a top-level "properties" key, as in the other schema examples schema = { "type": "object", "properties": { "invoice_number": {"type": "string", "description": "Invoice ID"}, "total": {"type": "number", "description": "Total amount"}, "line_items": { "type": "array", "items": { "type": "object", "properties": { "description": {"type": "string"}, "amount": {"type": "number"} } } } } } # Open the file in a context manager so the handle is closed after upload with open("invoice.pdf", "rb") as f: response = requests.post( "https://www.datalab.to/api/v1/extract", files={"file": ("invoice.pdf", f, "application/pdf")}, data={ "page_schema": json.dumps(schema), "mode": "balanced" }, headers=headers ) data = response.json() check_url = data["request_check_url"] ``` ### Parameters | Parameter | Type | Default | Description | | ----------------- | ------ | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ | | `file` | file | - | Document file (multipart upload) | | `file_url` | string | - | URL to document (alternative to file upload) | | `page_schema` | string | - | JSON schema defining the data to extract. Required unless `schema_id` is provided. | | `schema_id` | string | - | ID of a [saved extraction schema](/docs/recipes/structured-extraction/saved-schemas) (e.g. `sch_k8Hx9mP2nQ4v`). Mutually exclusive with `page_schema`. | | `schema_version` | int | - | Version of the saved schema to use. Only valid with `schema_id`; defaults to the latest version. | | `checkpoint_id` | string | - | Checkpoint ID from a previous `/convert` call (with `save_checkpoint=true`). Skips re-parsing.
| | `mode` | string | `fast` | Processing mode: `fast`, `balanced`, `accurate` | | `output_format` | string | `markdown` | Output format: `markdown`, `html`, `json`, `chunks` | | `max_pages` | int | - | Maximum pages to process | | `page_range` | string | - | Specific pages (e.g., `"0-5,10"`, 0-indexed). For spreadsheets, filters by sheet index. | | `save_checkpoint` | bool | `false` | Save a checkpoint after processing for reuse with subsequent calls | | `webhook_url` | string | - | Override webhook URL for this request | The extracted data is returned in `extraction_schema_json` in the poll response. See [Structured Extraction](/docs/recipes/structured-extraction/api-overview) for detailed examples. ## Document Segmentation Segment documents into structured sections using a JSON schema. **Endpoint:** `POST /api/v1/segment` ### Parameters | Parameter | Type | Default | Description | | --------------------- | ------ | ------------ | ---------------------------------------------------------------------------------------------- | | `file` | file | - | Document file (multipart upload) | | `file_url` | string | - | URL to document (alternative to file upload) | | `segmentation_schema` | string | **required** | JSON schema defining the segments to extract | | `checkpoint_id` | string | - | Checkpoint ID from a previous `/convert` call (with `save_checkpoint=true`). Skips re-parsing. | | `mode` | string | `fast` | Processing mode: `fast`, `balanced`, `accurate` | See [Document Segmentation](/docs/recipes/document-segmentation/auto-segmentation) for detailed examples. ## Track Changes Extract tracked changes (insertions and deletions) from DOCX files. 
**Endpoint:** `POST /api/v1/track-changes` ```python theme={null} response = requests.post( "https://www.datalab.to/api/v1/track-changes", files={"file": ("document.docx", open("document.docx", "rb"), "application/vnd.openxmlformats-officedocument.wordprocessingml.document")}, headers=headers ) ``` See [Track Changes](/docs/recipes/extract-redlines-and-comments/track-changes-from-word-documents) for detailed examples. ## Custom Processor Execute custom AI-powered processors on documents. **Endpoint:** `POST /api/v1/custom-processor` `POST /api/v1/custom-pipeline` is deprecated (sunset: September 30, 2026). Migrate to `POST /api/v1/custom-processor`. ### Parameters | Parameter | Type | Default | Description | | --------------- | ------ | ------------ | --------------------------------------------------- | | `file` | file | - | Document file (multipart upload) | | `file_url` | string | - | URL to document | | `pipeline_id` | string | **required** | Custom processor ID (`cp_XXXXX`) | | `version` | int | - | Processor version to run (default: active version) | | `run_eval` | bool | `false` | Run evaluation rules defined for the processor | | `mode` | string | `fast` | Processing mode: `fast`, `balanced`, `accurate` | | `output_format` | string | `markdown` | Output format: `markdown`, `html`, `json`, `chunks` | | `webhook_url` | string | - | URL to POST when complete | ## Form Filling Fill forms in PDFs and images. 
**Endpoint:** `POST /api/v1/fill` ### Request ```python theme={null} import json field_data = { "full_name": {"value": "John Doe", "description": "Full legal name"}, "date": {"value": "2024-01-15", "description": "Today's date"}, "signature": {"value": "John Doe", "description": "Signature field"} } response = requests.post( "https://www.datalab.to/api/v1/fill", files={"file": ("form.pdf", open("form.pdf", "rb"), "application/pdf")}, data={ "field_data": json.dumps(field_data), "confidence_threshold": "0.5" }, headers=headers ) ``` ### Parameters | Parameter | Type | Default | Description | | ---------------------- | ------ | ------- | ------------------------------------- | | `file` | file | - | Form file (PDF or image) | | `file_url` | string | - | URL to form | | `field_data` | string | - | JSON mapping field names to values | | `context` | string | - | Additional context for field matching | | `confidence_threshold` | float | `0.5` | Minimum confidence for matching (0-1) | | `page_range` | string | - | Specific pages to process | | `skip_cache` | bool | `false` | Skip cached results | ### Field Data Format ```json theme={null} { "field_key": { "value": "The value to fill", "description": "Description to help match the field" } } ``` ### Response | Field | Type | Description | | ------------------ | ------ | ------------------------------- | | `status` | string | Processing status | | `success` | bool | Whether filling succeeded | | `output_format` | string | `pdf` or `png` | | `output_base64` | string | Base64-encoded filled form | | `fields_filled` | array | Successfully filled field names | | `fields_not_found` | array | Unmatched field names | | `page_count` | int | Pages processed | | `cost_breakdown` | object | Cost details | See [Form Filling](/docs/recipes/form-filling/form-filling-api-overview) for more examples. ## File Management Upload and manage files for use in pipelines. 
### Upload File

**Step 1:** Request an upload URL

```bash theme={null}
POST /api/v1/files/upload
Content-Type: application/json

{
  "filename": "document.pdf",
  "content_type": "application/pdf"
}
```

Response:

```json theme={null}
{
  "file_id": 123,
  "upload_url": "https://...",
  "reference": "datalab://file-abc123"
}
```

**Step 2:** Upload directly to the presigned URL

```bash theme={null}
PUT {upload_url}
Content-Type: application/pdf
```

**Step 3:** Confirm upload

```bash theme={null}
GET /api/v1/files/{file_id}/confirm
```

### List Files

```bash theme={null}
GET /api/v1/files?limit=50&offset=0
```

### Get File Metadata

```bash theme={null}
GET /api/v1/files/{file_id}
```

### Get Download URL

```bash theme={null}
GET /api/v1/files/{file_id}/download?expires_in=3600
```

### Delete File

```bash theme={null}
DELETE /api/v1/files/{file_id}
```

See [File Management](/docs/recipes/file-management/file-upload-api) for detailed examples.

## Thumbnails

Generate page thumbnails from a previously processed document:

```bash theme={null}
GET /api/v1/thumbnails/{lookup_key}?thumb_width=300&page_range=0-2
```

| Parameter     | Type   | Default   | Description                               |
| ------------- | ------ | --------- | ----------------------------------------- |
| `lookup_key`  | string | Required  | The request ID from a previous conversion |
| `thumb_width` | int    | 300       | Thumbnail width in pixels                 |
| `page_range`  | string | All pages | Pages to generate (e.g., `"0,2-4"`)       |

Response:

```json theme={null}
{
  "success": true,
  "thumbnails": ["base64_encoded_jpg_1", "base64_encoded_jpg_2"]
}
```

Thumbnails are returned as base64-encoded JPG images.

## Create Document

Generate DOCX files from markdown with track changes support:

```bash theme={null}
POST /api/v1/create-document
Content-Type: application/json

{
  "markdown": "# Title\n\nThis is newly added text.",
  "output_format": "docx"
}
```

See [Create Document](/docs/recipes/create-document/create-document-api-overview) for detailed examples.
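
The Create Document request above can be assembled in Python as a rough sketch. It only builds the request object and does not send it. The helper name `build_create_document_request` is hypothetical, and the `X-API-Key` header is assumed to work here the same way it does for the other endpoints on this page; the response shape is not documented in this section, so the sketch does not try to parse one.

```python
import json
import urllib.request

API_BASE = "https://www.datalab.to"


def build_create_document_request(markdown: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST /api/v1/create-document request.

    Hypothetical helper: the JSON body mirrors the example request above,
    and the X-API-Key header is an assumption carried over from the other
    endpoints.
    """
    body = json.dumps({"markdown": markdown, "output_format": "docx"}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/api/v1/create-document",
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_create_document_request("# Title\n\nThis is newly added text.", "YOUR_API_KEY")
    print(request.get_method(), request.full_url)
    # To actually submit it: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the payload easy to inspect before any usage is billed.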
## Webhooks Configure webhooks to receive notifications when processing completes instead of polling. Set a default webhook URL in your [account settings](https://www.datalab.to/settings), or override per-request with the `webhook_url` parameter. See [Webhooks](/platform/webhooks) for configuration details. ## Rate Limits Default rate limits apply per API key. If you exceed limits, you'll receive a `429` response. See [Rate Limits](/docs/common/limits) for details and how to request higher limits. ## Next Steps Use the Python SDK for a simpler integration with typed responses. Receive notifications when processing completes instead of polling. Understand file size limits, page limits, and rate limiting. Detailed guide to converting documents to Markdown, HTML, or JSON. # Quickstart Source: https://documentation.datalab.to/docs/welcome/quickstart Get started with Datalab to convert PDFs, images, and documents into Markdown, HTML, or JSON in minutes. ## Get Your API Key Sign up at [datalab.to/auth/sign\_up](https://www.datalab.to/auth/sign_up) — new accounts include a **free monthly usage allowance** (no credit card required), enough to run a full proof of concept on your own documents. Then grab your API key from the [API Keys dashboard](https://www.datalab.to/app/keys). **Want to try before writing code?** Upload a document to the [Forge Playground](https://www.datalab.to/app/playground) to see results instantly — no API key required. ## Installation Install the Datalab SDK: ```bash theme={null} pip install datalab-python-sdk ``` Set your API key as an environment variable: ```bash theme={null} export DATALAB_API_KEY=your_api_key_here ``` ## Convert a Document The SDK provides a simple interface to convert documents to Markdown, HTML, JSON, or chunks. 
```python SDK theme={null}
from datalab_sdk import DatalabClient

client = DatalabClient()  # Uses DATALAB_API_KEY env var

# Convert PDF to markdown
result = client.convert("document.pdf")
print(result.markdown)

# Save output and images
result.save_output("output/")
```

```python Python (requests) theme={null}
import requests
import time

url = "https://www.datalab.to/api/v1/convert"
headers = {"X-API-Key": "YOUR_API_KEY"}

with open("document.pdf", "rb") as f:
    response = requests.post(
        url,
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"output_format": "markdown"},
        headers=headers
    )

data = response.json()
check_url = data["request_check_url"]

# Poll until the request reaches a terminal state
while True:
    response = requests.get(check_url, headers=headers)
    result = response.json()
    if result["status"] == "complete":
        print(result["markdown"])
        break
    if result["status"] == "failed":
        # Stop instead of polling a failed request forever
        raise RuntimeError(result.get("error"))
    time.sleep(2)
```

```bash cURL theme={null}
# Submit document
curl -X POST https://www.datalab.to/api/v1/convert \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"

# Poll for results (use request_check_url from response)
curl -X GET "https://www.datalab.to/api/v1/convert/{request_id}" \
  -H "X-API-Key: YOUR_API_KEY"
```

**Common mistakes:**

* Forgetting to set the `DATALAB_API_KEY` environment variable
* Using `file_url` with a private/authenticated URL (must be publicly accessible)
* Not polling for results: the initial response contains only a `request_id` and `request_check_url`, not the converted output

## Conversion Options

Control the conversion with options:

```python theme={null}
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

options = ConvertOptions(
    output_format="markdown",  # "markdown", "html", "json", "chunks"
    mode="balanced",           # "fast", "balanced", "accurate"
    paginate=True,             # Add page delimiters
    page_range="0-10",         # Process specific pages (0-indexed)
)

result = client.convert("document.pdf", options=options)
```

### Processing Modes

| Mode | Description | | ---------- |
------------------------------------------------------- | | `fast` | Lowest latency, good for simple documents (SDK default) | | `balanced` | Balance of speed and accuracy | | `accurate` | Highest accuracy, best for complex layouts | ## Fill PDF Forms Fill forms in PDFs or images with structured data: ```python SDK theme={null} from datalab_sdk import DatalabClient, FormFillingOptions client = DatalabClient() options = FormFillingOptions( field_data={ "full_name": {"value": "John Doe", "description": "Full legal name"}, "date": {"value": "2024-01-15", "description": "Today's date"}, "signature": {"value": "John Doe", "description": "Signature field"}, } ) result = client.fill("form.pdf", options=options) result.save_output("filled_form.pdf") ``` ```python Python (requests) theme={null} import requests import json url = "https://www.datalab.to/api/v1/fill" headers = {"X-API-Key": "YOUR_API_KEY"} field_data = { "full_name": {"value": "John Doe", "description": "Full legal name"}, "date": {"value": "2024-01-15", "description": "Today's date"}, } with open("form.pdf", "rb") as f: response = requests.post( url, files={"file": ("form.pdf", f, "application/pdf")}, data={"field_data": json.dumps(field_data)}, headers=headers ) # Poll for completion using request_check_url ``` ## Upload and Manage Files Upload files to Datalab for use in pipelines: ```python theme={null} from datalab_sdk import DatalabClient client = DatalabClient() # Upload files uploaded = client.upload_files(["doc1.pdf", "doc2.pdf"]) for file in uploaded: print(f"{file.original_filename}: {file.reference}") # Output: doc1.pdf: datalab://file-abc123 # List your files files = client.list_files(limit=50) print(f"Total files: {files['total']}") ``` ## CLI The SDK includes a command-line interface: ```bash theme={null} # Convert a single document datalab convert document.pdf --format markdown # Convert with options datalab convert document.pdf --mode accurate --paginate # Convert a directory datalab convert 
./documents/ --output_dir ./output/ ``` ## Run a Pipeline Pipelines chain processors (convert, extract, segment) into a single reusable call. Create them in [Forge](https://www.datalab.to/app/playground) or via the SDK: ```python theme={null} from datalab_sdk import DatalabClient client = DatalabClient() # Run an existing pipeline execution = client.run_pipeline( "pl_abc123", # Your pipeline ID file_path="document.pdf" ) # Poll until complete execution = client.get_pipeline_execution( execution.execution_id, max_polls=300 ) # Get extraction results (step index 1 = extract step) result = client.get_step_result(execution.execution_id, step_index=1) print(result) ``` See [Pipelines](/docs/recipes/pipelines/pipeline-overview) for creating, versioning, and running pipelines. ## Async Support For high-throughput applications, use the async client: ```python theme={null} import asyncio from datalab_sdk import AsyncDatalabClient async def convert_documents(): async with AsyncDatalabClient() as client: result = await client.convert("document.pdf") print(result.markdown) asyncio.run(convert_documents()) ``` ## Next Steps Full Python SDK documentation with typed clients and async support. REST API reference for document conversion, form filling, and file management. Chain processors into versioned, reusable pipelines. Detailed guide to converting PDFs and documents to Markdown, HTML, or JSON. # Python SDK Source: https://documentation.datalab.to/docs/welcome/sdk The Datalab Python SDK provides a simple interface for document conversion, pipelines, structured extraction, form filling, and file management. ## Installation ```bash theme={null} pip install datalab-python-sdk ``` Requires Python 3.10 or higher. 
## Authentication

Set your API key as an environment variable (recommended):

```bash theme={null}
export DATALAB_API_KEY=your_api_key_here
```

Or pass it directly to the client:

```python theme={null}
from datalab_sdk import DatalabClient

client = DatalabClient(api_key="your_api_key_here")
```

Get your API key from the [API Keys dashboard](https://www.datalab.to/app/keys).

## Quick Example

```python theme={null}
from datalab_sdk import DatalabClient

client = DatalabClient()

# Convert a document to markdown
result = client.convert("document.pdf")
print(result.markdown)

# Save output with images
result.save_output("output/")
```

## Client Options

Both sync and async clients accept the same configuration options:

```python theme={null}
from datalab_sdk import DatalabClient, AsyncDatalabClient

# Synchronous client (blocking)
client = DatalabClient(
    api_key="your_key",                 # Or use DATALAB_API_KEY env var
    base_url="https://www.datalab.to",  # API endpoint
    timeout=300,                        # Request timeout in seconds
)

# Asynchronous client (non-blocking)
async_client = AsyncDatalabClient(
    api_key="your_key",
    base_url="https://www.datalab.to",
    timeout=300,
)
```

| Parameter  | Type | Default                   | Description                |
| ---------- | ---- | ------------------------- | -------------------------- |
| `api_key`  | str  | `DATALAB_API_KEY` env var | Your Datalab API key       |
| `base_url` | str  | `https://www.datalab.to`  | API base URL               |
| `timeout`  | int  | `300`                     | Request timeout in seconds |

## Async Support

For high-throughput applications, use `AsyncDatalabClient`:

```python theme={null}
import asyncio
from datalab_sdk import AsyncDatalabClient

async def process_documents():
    async with AsyncDatalabClient() as client:
        result = await client.convert("document.pdf")
        print(result.markdown)

asyncio.run(process_documents())
```

The async client is recommended when processing multiple documents concurrently.
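
One way to process several documents concurrently is to cap the number of in-flight conversions with a semaphore. The sketch below is not part of the SDK: `convert_many` is a hypothetical helper that works with any object exposing an awaitable `convert(path)` method, which is how `AsyncDatalabClient` is used above. The default limit of 5 is an arbitrary assumption, so check it against your rate limits.

```python
import asyncio
from typing import Any, Iterable


async def convert_many(client: Any, paths: Iterable[str], max_concurrent: int = 5) -> dict[str, Any]:
    """Convert several files with at most `max_concurrent` requests in flight.

    Hypothetical helper: `client` is anything with an awaitable
    `convert(path)` method. A failed conversion is stored as the exception
    object so one bad file does not cancel the others.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def convert_one(path: str) -> tuple[str, Any]:
        async with semaphore:
            try:
                return path, await client.convert(path)
            except Exception as exc:  # keep going; report the failure per file
                return path, exc

    pairs = await asyncio.gather(*(convert_one(path) for path in paths))
    return dict(pairs)


# Usage with the SDK (assumes DATALAB_API_KEY is set):
#
#     from datalab_sdk import AsyncDatalabClient
#
#     async def main():
#         async with AsyncDatalabClient() as client:
#             results = await convert_many(client, ["a.pdf", "b.pdf"])
#
#     asyncio.run(main())
```

Returning exceptions alongside results keeps the mapping complete, so callers can retry only the failed paths.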
## Error Handling

The SDK raises specific exceptions for different error types:

```python theme={null}
from datalab_sdk import DatalabClient
from datalab_sdk.exceptions import (
    DatalabAPIError,
    DatalabTimeoutError,
    DatalabFileError,
    DatalabValidationError,
)

client = DatalabClient()

try:
    result = client.convert("document.pdf")
except DatalabAPIError as e:
    print(f"API error {e.status_code}: {e.response_data}")
except DatalabTimeoutError:
    print("Request timed out")
except DatalabFileError as e:
    print(f"File error: {e}")
except DatalabValidationError as e:
    print(f"Invalid input: {e}")
```

| Exception                | Description                                                                 |
| ------------------------ | --------------------------------------------------------------------------- |
| `DatalabAPIError`        | API returned an error response (includes `status_code` and `response_data`) |
| `DatalabTimeoutError`    | Request exceeded timeout                                                    |
| `DatalabFileError`       | File not found or cannot be read                                            |
| `DatalabValidationError` | Invalid parameters provided                                                 |

## Automatic Retries

The SDK automatically retries requests for:

* `408` Request Timeout
* `429` Rate Limit Exceeded
* `5xx` Server Errors

Retries use exponential backoff. You can control polling behavior with `max_polls` and `poll_interval` parameters on individual methods.

## SDK Features

* Convert PDFs, images, and documents to Markdown, HTML, JSON, or chunks.
* Extract structured data from documents using JSON schemas.
* Segment documents into logical sections.
* Fill PDF and image forms with structured field data.
* Chain processors into versioned, reusable pipelines.
* Upload, list, and manage files in Datalab storage.
* Command-line interface for document conversion.
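
If you call the REST API directly rather than through the SDK, the retry policy described under Automatic Retries can be approximated as follows. This is an illustrative sketch, not the SDK's implementation: the function names, the default of three retries, and the one-second base delay are all assumptions. Only the set of retryable status codes (`408`, `429`, `5xx`) comes from the list above.

```python
import time
from typing import Any, Callable, Tuple


def is_retryable(status_code: int) -> bool:
    """Status codes listed under Automatic Retries: 408, 429 and any 5xx."""
    return status_code in (408, 429) or 500 <= status_code <= 599


def call_with_retries(
    send: Callable[[], Tuple[int, Any]],
    max_retries: int = 3,      # assumed value, not taken from the SDK
    base_delay: float = 1.0,   # assumed value, not taken from the SDK
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Any]:
    """Call `send()` until it returns a non-retryable status or retries run out.

    `send` is any zero-argument callable returning (status_code, body).
    The wait doubles after each failed attempt: base_delay, 2x, 4x, ...
    """
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    for attempt in range(max_retries + 1):
        status, body = send()
        if not is_retryable(status) or attempt == max_retries:
            return status, body
        sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter makes the backoff schedule easy to check without real delays.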
## Method Summary | Method | Description | | ---------------------------------- | ----------------------------------------------------------- | | `convert()` | Convert documents to markdown, HTML, JSON, or chunks | | `extract()` | Extract structured data from documents using JSON schemas | | `segment()` | Segment documents into sections using a schema | | `track_changes()` | Extract tracked changes from DOCX documents | | `create_document()` | Create DOCX from markdown with track changes | | `run_custom_processor()` | Execute a custom processor on a document | | `fill()` | Fill PDF or image forms with field data | | `upload_files()` | Upload files to Datalab storage | | `list_files()` | List uploaded files | | `get_file_metadata()` | Get metadata for a specific file | | `get_file_download_url()` | Generate presigned download URL | | `delete_file()` | Delete an uploaded file | | `create_pipeline()` | Create a new pipeline | | `list_pipelines()` | List pipelines for your team | | `get_pipeline()` | Get a pipeline by ID | | `update_pipeline()` | Update pipeline steps (creates a draft) | | `save_pipeline()` | Promote a pipeline draft to a named, published version | | `archive_pipeline()` | Archive a pipeline | | `unarchive_pipeline()` | Restore an archived pipeline | | `create_pipeline_version()` | Snapshot the current pipeline steps as an immutable version | | `list_pipeline_versions()` | List all versions of a pipeline | | `discard_pipeline_draft()` | Discard draft changes and revert to a published version | | `get_pipeline_rate()` | Get per-page rate for a pipeline | | `run_pipeline()` | Execute a pipeline on a file | | `get_pipeline_execution()` | Poll pipeline execution status | | `list_pipeline_executions()` | List recent executions for a pipeline | | `get_step_result()` | Fetch the result of a specific pipeline step | | `list_custom_processors()` | List custom processors for your team | | `get_custom_processor_status()` | Check custom processor generation status 
| | `list_custom_processor_versions()` | List versions of a custom processor | | `set_active_processor_version()` | Set the active version of a custom processor | | `archive_custom_processor()` | Archive a custom processor | | `create_extraction_schema()` | Create a reusable extraction schema | | `list_extraction_schemas()` | List saved extraction schemas | | `get_extraction_schema()` | Get a schema by ID | | `update_extraction_schema()` | Update schema fields or create a new version | | `delete_extraction_schema()` | Archive (soft-delete) an extraction schema | | `delete_workflow()` | Delete a workflow definition | | `run_custom_pipeline()` | *(Deprecated)* Use `run_custom_processor()` instead | | `ocr()` | *(Deprecated)* Use `convert()` instead | ## Next Steps Convert PDFs, images, and documents to Markdown, HTML, JSON, or chunks. Extract structured data from documents using JSON schemas. Segment documents into logical sections. Fill PDF and image forms with structured field data. Chain processors into versioned, reusable pipelines. Upload, list, and manage files in Datalab storage. # Command Line Interface Source: https://documentation.datalab.to/docs/welcome/sdk/cli Use the Datalab CLI to convert documents from the command line. ## Installation The CLI is included with the SDK: ```bash theme={null} pip install datalab-python-sdk ``` ## Authentication Set your API key as an environment variable: ```bash theme={null} export DATALAB_API_KEY=your_api_key_here ``` Or pass it with each command: ```bash theme={null} datalab convert document.pdf --api_key YOUR_API_KEY ``` ## Convert Documents Convert documents to markdown, HTML, JSON, or chunks. 
### Basic Usage ```bash theme={null} # Convert a single file datalab convert document.pdf # Convert to specific format datalab convert document.pdf --format html # Convert with processing mode datalab convert document.pdf --mode accurate ``` ### Output Options ```bash theme={null} # Save to specific directory datalab convert document.pdf --output_dir ./output/ # Output formats datalab convert document.pdf --format markdown datalab convert document.pdf --format html datalab convert document.pdf --format json datalab convert document.pdf --format chunks ``` ### Processing Options ```bash theme={null} # Processing modes datalab convert document.pdf --mode fast # Lowest latency (default) datalab convert document.pdf --mode balanced # Balance of speed and accuracy datalab convert document.pdf --mode accurate # Highest accuracy # Limit pages datalab convert document.pdf --max_pages 10 # Specific page range (0-indexed) datalab convert document.pdf --page_range "0-5,10,15-20" # For spreadsheets, page_range filters by sheet index datalab convert workbook.xlsx --page_range "0,2" # Add page delimiters datalab convert document.pdf --paginate ``` ### Advanced Options ```bash theme={null} # Add block IDs for citations (HTML only) datalab convert document.pdf --format html --add_block_ids # Disable image extraction datalab convert document.pdf --disable_image_extraction # Disable image captions datalab convert document.pdf --disable_image_captions # Skip cached results datalab convert document.pdf --skip_cache ``` ### Directory Processing Convert all documents in a directory: ```bash theme={null} # Convert all supported files datalab convert ./documents/ --output_dir ./output/ # Filter by extension datalab convert ./documents/ --extensions pdf,docx # Control concurrency datalab convert ./documents/ --max_concurrent 5 ``` ### Convert Command Reference | Option | Description | | ---------------------------- | --------------------------------------------------- | | `--format` | 
Output format: `markdown`, `html`, `json`, `chunks` | | `--mode` | Processing mode: `fast`, `balanced`, `accurate` | | `--output_dir`, `-o` | Output directory | | `--max_pages` | Maximum pages to process | | `--page_range` | Specific pages (e.g., `"0-5,10"`) | | `--paginate` | Add page delimiters | | `--add_block_ids` | Add block IDs to HTML output | | `--disable_image_extraction` | Don't extract images | | `--disable_image_captions` | Don't generate image captions | | `--skip_cache` | Force reprocessing | | `--extensions` | File extensions to process (for directories) | | `--max_concurrent` | Maximum concurrent requests | | `--max_polls` | Maximum polling attempts | | `--poll_interval` | Seconds between polls | | `--api_key` | Datalab API key | | `--base_url` | API base URL | ## Extract Structured Data Extract structured data from documents using a JSON schema. ### Basic Usage ```bash theme={null} # Extract data using a page schema datalab extract invoice.pdf \ --page_schema '{"invoice_number": {"type": "string"}, "total": {"type": "number"}}' # Extract with a specific mode datalab extract invoice.pdf \ --page_schema '{"title": {"type": "string"}}' \ --mode accurate # Extract using a checkpoint from a previous conversion datalab extract invoice.pdf \ --page_schema '{"total": {"type": "number"}}' \ --checkpoint_id "ckpt_abc123" ``` ### Extract Command Reference | Option | Description | | -------------------- | ----------------------------------------------------- | | `--page_schema` | **(Required)** JSON schema defining fields to extract | | `--checkpoint_id` | Checkpoint ID from a previous conversion | | `--format` | Output format: `markdown`, `html`, `json`, `chunks` | | `--mode` | Processing mode: `fast`, `balanced`, `accurate` | | `--output_dir`, `-o` | Output directory | | `--max_pages` | Maximum pages to process | | `--page_range` | Specific pages (e.g., `"0-5,10"`) | | `--skip_cache` | Force reprocessing | | `--api_key` | Datalab API key | | `--base_url` | 
API base URL | ## Segment Documents Segment documents into logical sections using a schema. ### Basic Usage ```bash theme={null} # Segment a document datalab segment report.pdf \ --segmentation_schema '{"sections": [{"name": "intro", "description": "Introduction"}, {"name": "body", "description": "Main content"}]}' # Segment with a checkpoint datalab segment report.pdf \ --segmentation_schema '{"sections": [{"name": "summary", "description": "Executive summary"}]}' \ --checkpoint_id "ckpt_abc123" ``` ### Segment Command Reference | Option | Description | | ----------------------- | ------------------------------------------------------------------ | | `--segmentation_schema` | **(Required)** JSON schema defining segment names and descriptions | | `--checkpoint_id` | Checkpoint ID from a previous conversion | | `--mode` | Processing mode: `fast`, `balanced`, `accurate` | | `--output_dir`, `-o` | Output directory | | `--max_pages` | Maximum pages to process | | `--page_range` | Specific pages (e.g., `"0-5,10"`) | | `--skip_cache` | Force reprocessing | | `--api_key` | Datalab API key | | `--base_url` | API base URL | ## Track Changes Extract tracked changes from DOCX documents. ### Basic Usage ```bash theme={null} # Extract tracked changes from a Word document datalab track-changes contract.docx # Specify output format datalab track-changes contract.docx --format html # With pagination datalab track-changes contract.docx --format html --paginate ``` ### Track Changes Command Reference | Option | Description | | -------------------- | --------------------------------------------------------------------------------- | | `--format` | Comma-separated output formats: `markdown`, `html`, `chunks` (default: all three) | | `--paginate` | Add page delimiters to output | | `--output_dir`, `-o` | Output directory | | `--api_key` | Datalab API key | | `--base_url` | API base URL | ## Custom Processor The `custom-pipeline` CLI command is deprecated. 
It continues to work and calls the new `/api/v1/custom-processor` endpoint internally, but the command name itself will be updated in a future SDK release. Execute a custom processor on a document. ### Basic Usage ```bash theme={null} # Run a custom processor datalab custom-pipeline document.pdf --pipeline_id "cp_XXXXX" # Run with evaluation datalab custom-pipeline document.pdf \ --pipeline_id "cp_XXXXX" \ --run_eval # Specify format and mode datalab custom-pipeline document.pdf \ --pipeline_id "cp_XXXXX" \ --format json \ --mode accurate ``` ### Custom Processor Command Reference | Option | Description | | -------------------- | --------------------------------------------------- | | `--pipeline_id` | **(Required)** Custom processor ID (`cp_XXXXX`) | | `--run_eval` | Run evaluation rules for the processor | | `--format` | Output format: `markdown`, `html`, `json`, `chunks` | | `--mode` | Processing mode: `fast`, `balanced`, `accurate` | | `--output_dir`, `-o` | Output directory | | `--api_key` | Datalab API key | | `--base_url` | API base URL | ## Create Document Create a DOCX document from markdown with track changes. ### Basic Usage ```bash theme={null} # Create a document from a markdown file datalab create-document --markdown input.md --output output.docx # Create a document from inline markdown content datalab create-document \ --markdown "# Title\n\nDocument content here." 
\ --output document.docx ``` ### Create Document Command Reference | Option | Description | | ---------------- | ---------------------------------------------------------- | | `--markdown` | **(Required)** Markdown content or path to a markdown file | | `--output`, `-o` | **(Required)** Output file path for the generated DOCX | | `--api_key` | Datalab API key | | `--base_url` | API base URL | ## Examples ### Batch Convert PDFs ```bash theme={null} # Convert all PDFs in a directory with accurate mode datalab convert ./invoices/ \ --extensions pdf \ --mode accurate \ --format json \ --output_dir ./processed/ ``` ### Extract Data from Documents ```bash theme={null} # Extract structured data using a schema datalab extract invoice.pdf \ --page_schema '{ "invoice_number": {"type": "string", "description": "Invoice ID"}, "total": {"type": "number", "description": "Total amount"}, "vendor": {"type": "string", "description": "Vendor name"} }' \ --mode balanced \ --output_dir ./extracted/ ``` ### High-Throughput Processing ```bash theme={null} # Process many files with high concurrency datalab convert ./documents/ \ --max_concurrent 10 \ --mode fast \ --output_dir ./output/ ``` ## Getting Help ```bash theme={null} # General help datalab --help # Command-specific help datalab convert --help datalab extract --help datalab segment --help datalab track-changes --help datalab custom-pipeline --help datalab create-document --help ``` ## Next Steps Get up and running with Datalab in minutes. Process multiple documents efficiently in parallel. Explore the full Python SDK for advanced usage. See all document formats supported by Datalab. # Document Conversion Source: https://documentation.datalab.to/docs/welcome/sdk/conversion Convert PDFs, images, and documents to Markdown, HTML, JSON, or chunks using the Datalab SDK. 
## Basic Usage

```python theme={null}
from datalab_sdk import DatalabClient

client = DatalabClient()

# Convert to markdown (default)
result = client.convert("document.pdf")
print(result.markdown)

# Convert from URL
result = client.convert(file_url="https://example.com/document.pdf")
print(result.markdown)
```

## Conversion Options

Use `ConvertOptions` to control the conversion:

```python theme={null}
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

options = ConvertOptions(
    output_format="markdown",  # Output format
    mode="balanced",           # Processing mode
    paginate=True,             # Add page delimiters
    max_pages=10,              # Limit pages processed
    page_range="0-5,10",       # Specific pages (0-indexed)
)

result = client.convert("document.pdf", options=options)
```

### All Options

| Option | Type | Default | Description |
| ----------------------------- | ---- | ------------ | ----------- |
| `output_format` | str | `"markdown"` | Output format: `"markdown"`, `"html"`, `"json"`, `"chunks"` |
| `mode` | str | `"fast"` | Processing mode: `"fast"`, `"balanced"`, `"accurate"` |
| `paginate` | bool | `False` | Add page delimiters to output |
| `max_pages` | int | None | Maximum number of pages to process |
| `page_range` | str | None | Specific pages to process (e.g., `"0-5,10,15-20"`). For spreadsheets, filters by sheet index. |
| `skip_cache` | bool | `False` | Skip cached results, force reprocessing |
| `disable_image_extraction` | bool | `False` | Don't extract images from document |
| `disable_image_captions` | bool | `False` | Don't generate captions for images |
| `token_efficient_markdown` | bool | `False` | Optimize markdown output for LLM token usage |
| `fence_synthetic_captions` | bool | `False` | Fence synthetic image captions |
| `include_markdown_in_chunks` | bool | `False` | Include markdown in chunks/JSON output |
| `save_checkpoint` | bool | `False` | Save intermediate checkpoint for reuse |
| `extras` | str | None | Comma-separated features: `"track_changes"`, `"chart_understanding"`, `"extract_links"`, `"table_cell_bboxes"`, `"list_item_bboxes"`, `"infographic"`, `"new_block_types"`. (`"table_row_bboxes"` is deprecated; use `"table_cell_bboxes"` instead.) |
| `add_block_ids` | bool | `False` | Add block IDs to HTML output for citations |
| `keep_spreadsheet_formatting` | bool | `False` | Preserve spreadsheet styling in HTML output |
| `webhook_url` | str | None | Override account webhook URL for this request |
| `additional_config` | dict | None | Additional configuration options |

Use `save_checkpoint=True` to save the parsed document state. Then call `client.extract()` or `client.segment()` with the returned `checkpoint_id` to run extraction or segmentation without re-parsing.
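The `extras` option in the table above takes a single comma-separated string, which is easy to mistype. A minimal sketch of a helper that assembles it from a list might look like the following. `build_extras`, `KNOWN_EXTRAS`, and `DEPRECATED_EXTRAS` are hypothetical names, not part of `datalab_sdk`, and the feature names are copied from the table rather than from the SDK source.

```python
# Hypothetical helper, not part of datalab_sdk: builds the comma-separated
# `extras` string from the feature names listed in the options table.
KNOWN_EXTRAS = (
    "track_changes",
    "chart_understanding",
    "extract_links",
    "table_cell_bboxes",
    "list_item_bboxes",
    "infographic",
    "new_block_types",
)

# Deprecated name mapped to the replacement the table recommends.
DEPRECATED_EXTRAS = {"table_row_bboxes": "table_cell_bboxes"}


def build_extras(features):
    """Return a de-duplicated, comma-separated extras string."""
    selected = []
    for name in features:
        # Swap a deprecated name for its documented replacement.
        name = DEPRECATED_EXTRAS.get(name, name)
        if name not in KNOWN_EXTRAS:
            raise ValueError(f"Unknown extras feature: {name!r}")
        if name not in selected:
            selected.append(name)
    return ",".join(selected)


extras = build_extras(["track_changes", "table_row_bboxes", "table_cell_bboxes"])
print(extras)  # track_changes,table_cell_bboxes
```

The resulting string can then be passed as `ConvertOptions(extras=extras)`.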
### Processing Modes | Mode | Description | Use Case | | ---------- | ----------------------------- | ---------------------------------------- | | `fast` | Lowest latency (default) | Simple documents, real-time applications | | `balanced` | Balance of speed and accuracy | General use | | `accurate` | Highest accuracy | Complex layouts, tables, figures | ### Output Formats | Format | Description | | ---------- | ------------------------------------------ | | `markdown` | Clean markdown with headers, lists, tables | | `html` | Structured HTML preserving layout | | `json` | Block-level structure with bounding boxes | | `chunks` | Pre-chunked output for RAG applications | ## Conversion Result The `ConversionResult` object contains the converted content and metadata: ```python theme={null} result = client.convert("document.pdf") # Access content based on output format print(result.markdown) # Markdown output print(result.html) # HTML output print(result.json) # JSON structure print(result.chunks) # Chunked output # Metadata print(result.success) # True if conversion succeeded print(result.page_count) # Number of pages processed print(result.images) # Dict of extracted images (filename -> base64) print(result.metadata) # Document metadata print(result.parse_quality_score) # Quality score (0-5) print(result.cost_breakdown) # Cost in cents ``` ### Result Fields | Field | Type | Description | | --------------------- | ----- | ---------------------------------------------------- | | `success` | bool | Whether conversion succeeded | | `markdown` | str | Markdown output (if format is markdown) | | `html` | str | HTML output (if format is html) | | `json` | dict | JSON output (if format is json) | | `chunks` | dict | Chunked output (if format is chunks) | | `images` | dict | Extracted images as `{filename: base64_data}` | | `metadata` | dict | Document metadata | | `page_count` | int | Number of pages processed | | `parse_quality_score` | float | Quality score from 0-5 | | 
`cost_breakdown` | dict | Cost details (`list_cost_cents`, `final_cost_cents`) | | `checkpoint_id` | str | Checkpoint ID if `save_checkpoint` was True | | `error` | str | Error message if conversion failed | ## Saving Output Save the conversion result to files: ```python theme={null} # Save during conversion result = client.convert("document.pdf", save_output="output/document") # Or save afterward result.save_output("output/document", save_images=True) ``` This creates: * `document.md` (or `.html`, `.json` based on format) * `document_images/` directory with extracted images (if `save_images=True`) ## Async Usage For high-throughput applications: ```python theme={null} import asyncio from datalab_sdk import AsyncDatalabClient, ConvertOptions async def convert_documents(): async with AsyncDatalabClient() as client: options = ConvertOptions(mode="fast", max_pages=5) result = await client.convert("document.pdf", options=options) return result.markdown markdown = asyncio.run(convert_documents()) ``` ## Polling Configuration Control polling behavior for long-running conversions: ```python theme={null} result = client.convert( "large_document.pdf", max_polls=600, # Maximum polling attempts (default: 300) poll_interval=2, # Seconds between polls (default: 1) ) ``` ## Special Features ### Track Changes (Word Documents) Extract tracked changes and comments from DOCX files: ```python theme={null} options = ConvertOptions( output_format="html", extras="track_changes", ) result = client.convert("contract.docx", options=options) # HTML contains tags marking tracked changes and comments ``` ### Chart Understanding Extract data from charts and graphs: ```python theme={null} options = ConvertOptions( extras="chart_understanding", ) result = client.convert("report.pdf", options=options) ``` ### Block IDs for Citations Add block IDs for tracking content back to source locations: ```python theme={null} options = ConvertOptions( output_format="html", add_block_ids=True, ) result = client.convert("document.pdf", 
options=options) # HTML elements include data-block-id attributes ``` ### Structured Extraction For structured data extraction, use the dedicated [`client.extract()`](/docs/welcome/sdk/extraction) method. ## Next Steps Extract structured data from documents using JSON schemas. Process multiple documents efficiently in parallel. Programmatically fill PDF and image forms with field data. Convert documents from the command line. # Structured Extraction Source: https://documentation.datalab.to/docs/welcome/sdk/extraction Extract structured data from documents using JSON schemas with the Datalab SDK. ## Basic Usage ```python theme={null} import json from datalab_sdk import DatalabClient, ExtractOptions client = DatalabClient() # Define a JSON schema for extraction page_schema = json.dumps({ "invoice_number": {"type": "string", "description": "Invoice number"}, "total": {"type": "number", "description": "Total amount due"}, "vendor": {"type": "string", "description": "Vendor or company name"}, "items": { "type": "array", "items": { "type": "object", "properties": { "description": {"type": "string"}, "amount": {"type": "number"} } } } }) options = ExtractOptions(page_schema=page_schema) result = client.extract("invoice.pdf", options=options) # Access the extracted data extracted = json.loads(result.extraction_schema_json) print(extracted) ``` ## Extract Options Use `ExtractOptions` to configure extraction behavior: | Option | Type | Default | Description | | ----------------- | ---- | ------------ | ------------------------------------------------------------------------------------------------- | | `page_schema` | str | **Required** | JSON schema defining the fields to extract. Mutually exclusive with `schema_id`. | | `schema_id` | str | None | ID of a saved extraction schema (e.g. `sch_k8Hx9mP2nQ4v`). Mutually exclusive with `page_schema`. | | `schema_version` | int | None | Schema version to pin to. Only valid with `schema_id`. 
| | `checkpoint_id` | str | None | Checkpoint ID from a previous `convert()` call | | `mode` | str | `"fast"` | Parse mode: `"fast"`, `"balanced"`, `"accurate"`. Controls document parsing quality. | | `output_format` | str | `"markdown"` | Output format: `"markdown"`, `"html"`, `"json"`, `"chunks"` | | `save_checkpoint` | bool | `False` | Save checkpoint for reuse with subsequent calls | | `max_pages` | int | None | Maximum number of pages to process | | `page_range` | str | None | Specific pages to process (e.g., `"0-5,10"`). For spreadsheets, filters by sheet index. | | `skip_cache` | bool | `False` | Skip cached results, force reprocessing | | `webhook_url` | str | None | Webhook URL for completion notification | To control the **extraction pipeline mode** (fast vs. balanced), pass `extraction_mode` as a form field via the REST API directly — it is not yet exposed in `ExtractOptions`. See [Balanced Extraction Mode](/docs/recipes/structured-extraction/balanced-mode) for details on the two modes. ## Checkpoint Reuse Use checkpoints to avoid re-parsing a document when running extraction after conversion. 
First convert with `save_checkpoint=True`, then extract using the returned `checkpoint_id`: ```python theme={null} import json from datalab_sdk import DatalabClient, ConvertOptions, ExtractOptions client = DatalabClient() # Step 1: Convert and save a checkpoint convert_options = ConvertOptions( mode="accurate", save_checkpoint=True, ) convert_result = client.convert("report.pdf", options=convert_options) print(convert_result.markdown) # Step 2: Extract using the checkpoint (no re-parsing needed) page_schema = json.dumps({ "title": {"type": "string", "description": "Document title"}, "author": {"type": "string", "description": "Author name"}, "date": {"type": "string", "description": "Publication date"}, "summary": {"type": "string", "description": "Brief summary of the document"}, }) extract_options = ExtractOptions( page_schema=page_schema, checkpoint_id=convert_result.checkpoint_id, ) extract_result = client.extract("report.pdf", options=extract_options) extracted = json.loads(extract_result.extraction_schema_json) print(extracted) ``` ## Extraction Result The result object contains the extracted data alongside standard conversion fields: ```python theme={null} result = client.extract("invoice.pdf", options=options) # Extracted structured data (JSON string) extracted = json.loads(result.extraction_schema_json) print(extracted["invoice_number"]) print(extracted["total"]) # Standard conversion fields are also available print(result.success) print(result.markdown) print(result.page_count) print(result.cost_breakdown) ``` ## Async Usage ```python theme={null} import asyncio import json from datalab_sdk import AsyncDatalabClient, ExtractOptions async def extract_data(): async with AsyncDatalabClient() as client: page_schema = json.dumps({ "title": {"type": "string", "description": "Document title"}, "author": {"type": "string", "description": "Author name"}, }) options = ExtractOptions(page_schema=page_schema) result = await client.extract("document.pdf", 
options=options) return json.loads(result.extraction_schema_json) extracted = asyncio.run(extract_data()) print(extracted) ``` ## Next Steps Learn more about structured extraction patterns and best practices. Segment documents into logical sections using schemas. Convert documents to Markdown, HTML, JSON, or chunks. Process multiple documents efficiently in parallel. # File Management Source: https://documentation.datalab.to/docs/welcome/sdk/file-management Upload, list, and manage files in Datalab storage using the SDK. ## Overview Datalab provides file storage for documents you want to process with pipelines or reuse across multiple API calls. Uploaded files get a reference URL (`datalab://file-xxx`) that you can use in pipelines. ## Upload Files Upload one or more files to Datalab storage: ```python theme={null} from datalab_sdk import DatalabClient client = DatalabClient() # Upload a single file file = client.upload_files("document.pdf") print(f"Uploaded: {file.original_filename}") print(f"Reference: {file.reference}") # datalab://file-abc123 # Upload multiple files files = client.upload_files(["doc1.pdf", "doc2.pdf", "doc3.pdf"]) for f in files: print(f"{f.original_filename}: {f.reference}") ``` ### Upload Result The `UploadedFileMetadata` object contains: | Field | Type | Description | | ------------------- | ---- | ---------------------------------------------- | | `file_id` | int | Unique file ID | | `original_filename` | str | Original filename | | `content_type` | str | MIME type | | `reference` | str | Datalab reference URL (`datalab://file-xxx`) | | `upload_status` | str | Status: `"pending"`, `"completed"`, `"failed"` | | `file_size` | int | File size in bytes | | `created` | str | Upload timestamp | ## List Files List all uploaded files with pagination: ```python theme={null} # List first 50 files result = client.list_files(limit=50, offset=0) print(f"Total files: {result['total']}") for file in result['files']: print(f" {file.original_filename} 
({file.file_size} bytes)") print(f" Reference: {file.reference}") print(f" Status: {file.upload_status}") ``` ### Pagination ```python theme={null} # Page through all files offset = 0 limit = 50 while True: result = client.list_files(limit=limit, offset=offset) for file in result['files']: print(file.original_filename) if offset + limit >= result['total']: break offset += limit ``` ## Get File Metadata Get details for a specific file: ```python theme={null} # By file ID (integer) file = client.get_file_metadata(123) # By hashid (string from reference URL) file = client.get_file_metadata("abc123") print(f"Filename: {file.original_filename}") print(f"Size: {file.file_size} bytes") print(f"Type: {file.content_type}") print(f"Created: {file.created}") ``` ## Get Download URL Generate a presigned URL to download a file: ```python theme={null} result = client.get_file_download_url( file_id=123, expires_in=3600 # URL valid for 1 hour (default) ) print(f"Download URL: {result['download_url']}") print(f"Expires in: {result['expires_in']} seconds") # Download the file import requests response = requests.get(result['download_url']) with open("downloaded.pdf", "wb") as f: f.write(response.content) ``` ### Expiration Options The `expires_in` parameter accepts values from 60 to 86400 seconds (1 minute to 24 hours): ```python theme={null} # Short-lived URL (1 minute) result = client.get_file_download_url(file_id, expires_in=60) # Long-lived URL (24 hours) result = client.get_file_download_url(file_id, expires_in=86400) ``` ## Delete File Delete an uploaded file: ```python theme={null} result = client.delete_file(123) if result['success']: print(f"Deleted: {result['message']}") ``` ## Using Files in Pipelines File references can be used as input to pipelines: ```python theme={null} from datalab_sdk import DatalabClient client = DatalabClient() # Upload files files = client.upload_files(["invoice1.pdf", "invoice2.pdf"]) # Run pipeline on each uploaded file for f in files: execution 
= client.run_pipeline( "pl_abc123", file_url=f.reference # e.g., 'datalab://file-abc123' ) print(f"{f.original_filename}: {execution.execution_id}") ``` See [Pipelines](/docs/recipes/pipelines/pipeline-overview) for more details. ## Async Usage ```python theme={null} import asyncio from datalab_sdk import AsyncDatalabClient async def manage_files(): async with AsyncDatalabClient() as client: # Upload files = await client.upload_files(["doc.pdf"]) # List result = await client.list_files(limit=10) # Get metadata file = await client.get_file_metadata(files[0].file_id) # Download URL url = await client.get_file_download_url(files[0].file_id) # Delete await client.delete_file(files[0].file_id) asyncio.run(manage_files()) ``` ## Example: Batch Upload and Process ```python theme={null} from datalab_sdk import DatalabClient from pathlib import Path client = DatalabClient() # Find all PDFs in a directory pdf_files = list(Path("./documents").glob("*.pdf")) # Upload all files uploaded = client.upload_files([str(p) for p in pdf_files]) print(f"Uploaded {len(uploaded)} files:") for file in uploaded: print(f" {file.original_filename}: {file.reference}") # Store references for later use references = {f.original_filename: f.reference for f in uploaded} ``` ## Supported File Types See [Supported File Types](/docs/common/supportedfiletypes) for a complete list of supported formats. ## Next Steps Step-by-step guide for uploading and managing files via the API. Chain processors into versioned, reusable pipelines. Convert documents to Markdown, HTML, JSON, or chunks. Understand rate limits and file size constraints. # Form Filling Source: https://documentation.datalab.to/docs/welcome/sdk/form-filling Fill PDF and image forms with structured field data using the Datalab SDK. ## Overview The form filling API lets you programmatically fill forms in PDFs and images. 
It works with both: * **Native PDF forms** - Forms with actual form fields * **Image-based forms** - Scanned forms or images with visual form layouts The API matches your field data to form fields and returns a filled PDF or image. ## Basic Usage ```python theme={null} from datalab_sdk import DatalabClient, FormFillingOptions client = DatalabClient() options = FormFillingOptions( field_data={ "full_name": {"value": "John Doe", "description": "Full legal name"}, "date_of_birth": {"value": "1990-01-15", "description": "Date of birth"}, "address": {"value": "123 Main St, City, ST 12345", "description": "Mailing address"}, } ) result = client.fill("form.pdf", options=options) result.save_output("filled_form.pdf") ``` ## Form Filling Options | Option | Type | Default | Description | | ---------------------- | ----- | -------- | ------------------------------------------------------------------------------------ | | `field_data` | dict | Required | Field names mapped to values and descriptions | | `context` | str | None | Additional context to help match fields | | `confidence_threshold` | float | `0.5` | Minimum confidence for field matching (0.0-1.0) | | `max_pages` | int | None | Maximum pages to process | | `page_range` | str | None | Specific pages to process (e.g., `"0-2"`). For spreadsheets, filters by sheet index. | | `skip_cache` | bool | `False` | Skip cached results | ### Field Data Format Each field in `field_data` is a dictionary with: ```python theme={null} field_data = { "field_key": { "value": "The value to fill", "description": "Description to help match the field" } } ``` The `description` helps the API match your field key to the actual form field, especially when field names in the PDF don't match your data structure. 
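When the values already live in a flat dictionary, the nested `field_data` shape can be produced mechanically. The sketch below assumes a hypothetical helper, `build_field_data`, that is not part of `datalab_sdk`; it only rearranges plain Python dictionaries into the format described above, and the fallback description is an illustrative choice rather than documented behavior.

```python
# Hypothetical helper, not part of datalab_sdk: converts flat values plus
# optional descriptions into the nested field_data structure.
def build_field_data(values, descriptions=None):
    descriptions = descriptions or {}
    field_data = {}
    for key, value in values.items():
        field_data[key] = {
            # The examples in this guide pass every value as a string.
            "value": str(value),
            # Fall back to a readable form of the key when no description
            # is supplied (an assumption made for this sketch).
            "description": descriptions.get(key, key.replace("_", " ")),
        }
    return field_data


field_data = build_field_data(
    {"full_name": "John Doe", "date_of_birth": "1990-01-15"},
    descriptions={"full_name": "Full legal name"},
)
print(field_data["date_of_birth"]["description"])  # date of birth
```

The returned dictionary can be passed directly as `FormFillingOptions(field_data=field_data)`. Explicit descriptions are still worth writing for fields whose keys are cryptic, since the description is what helps the match.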
### Example with Multiple Field Types

```python theme={null}
options = FormFillingOptions(
    field_data={
        # Text fields
        "name": {"value": "Jane Smith", "description": "Full name"},
        "email": {"value": "jane@example.com", "description": "Email address"},

        # Date fields
        "date": {"value": "2024-01-15", "description": "Today's date"},

        # Numeric fields
        "amount": {"value": "1500.00", "description": "Total amount"},

        # Checkbox (use descriptive value)
        "agree_terms": {"value": "Yes", "description": "Agreement checkbox"},

        # Signature (text is rendered)
        "signature": {"value": "Jane Smith", "description": "Signature field"},
    },
    context="This is an employment application form"
)
```

### Using Context

The `context` parameter provides additional information to improve field matching:

```python theme={null}
options = FormFillingOptions(
    field_data={
        "ssn": {"value": "123-45-6789", "description": "Social Security Number"},
        "employer": {"value": "Acme Corp", "description": "Current employer name"},
    },
    context="W-4 tax withholding form for new employee onboarding"
)
```

### Confidence Threshold

Adjust `confidence_threshold` to control field matching strictness:

```python theme={null}
options = FormFillingOptions(
    field_data={...},
    confidence_threshold=0.7,  # Higher = more strict matching
)
```

* **Lower values (0.3-0.5)**: More fields matched, but may have incorrect matches
* **Higher values (0.7-0.9)**: Fewer fields matched, but more accurate

## Form Filling Result

```python theme={null}
result = client.fill("form.pdf", options=options)

# Check results
print(result.success)           # True if filling succeeded
print(result.status)            # "complete" when done
print(result.output_format)     # "pdf" or "png"
print(result.fields_filled)     # List of successfully filled fields
print(result.fields_not_found)  # List of fields that couldn't be matched
print(result.page_count)        # Number of pages processed
print(result.cost_breakdown)    # Cost details
```

### Result Fields

| Field | Type | Description |
| ------------------ | ----- | ----------------------------------------- |
| `success` | bool | Whether form filling succeeded |
| `status` | str | Processing status |
| `output_format` | str | Output type: `"pdf"` or `"png"` |
| `output_base64` | str | Base64-encoded filled form |
| `fields_filled` | list | Field names that were successfully filled |
| `fields_not_found` | list | Field names that couldn't be matched |
| `page_count` | int | Number of pages processed |
| `runtime` | float | Processing time in seconds |
| `cost_breakdown` | dict | Cost details |

## Saving the Filled Form

```python theme={null}
# Save to file
result.save_output("filled_form.pdf")

# Or access the raw base64 data
import base64
pdf_bytes = base64.b64decode(result.output_base64)
with open("filled.pdf", "wb") as f:
    f.write(pdf_bytes)
```

## Filling Image Forms

The API also works with image-based forms (PNG, JPG, etc.):

```python theme={null}
result = client.fill("scanned_form.png", options=options)
result.save_output("filled_form.png")  # Returns filled image
```

For images, the output is a PNG with the field values rendered onto the image.
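Because `output_format` is either `"pdf"` or `"png"`, the file extension can be derived from the result instead of being hard-coded. The following is a minimal sketch under stated assumptions: `write_filled_form` is a hypothetical helper, and the `SimpleNamespace` merely stands in for a real result object, exposing only the `output_format` and `output_base64` fields listed in the Result Fields table.

```python
import base64
import tempfile
from pathlib import Path
from types import SimpleNamespace


def write_filled_form(result, stem):
    """Decode result.output_base64 and write it to <stem>.<output_format>."""
    if result.output_format not in ("pdf", "png"):
        raise ValueError(f"Unexpected output format: {result.output_format!r}")
    # Pick the extension from the documented output_format value.
    path = Path(stem).with_suffix(f".{result.output_format}")
    path.write_bytes(base64.b64decode(result.output_base64))
    return path


# Stand-in for a real form filling result (assumption for illustration only).
fake_result = SimpleNamespace(
    output_format="png",
    output_base64=base64.b64encode(b"not a real image").decode("ascii"),
)

with tempfile.TemporaryDirectory() as tmp:
    saved = write_filled_form(fake_result, Path(tmp) / "filled_form")
    print(saved.name)  # filled_form.png
```

With a real result, the same function could be called as `write_filled_form(result, "filled_form")`, which avoids saving PNG bytes under a `.pdf` name when the input was an image.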
## From URL Fill a form from a URL: ```python theme={null} result = client.fill( file_url="https://example.com/form.pdf", options=options ) ``` ## Async Usage ```python theme={null} import asyncio from datalab_sdk import AsyncDatalabClient, FormFillingOptions async def fill_form(): async with AsyncDatalabClient() as client: options = FormFillingOptions( field_data={ "name": {"value": "John Doe", "description": "Full name"}, } ) result = await client.fill("form.pdf", options=options) result.save_output("filled.pdf") asyncio.run(fill_form()) ``` ## Handling Unmatched Fields Check which fields couldn't be matched: ```python theme={null} result = client.fill("form.pdf", options=options) if result.fields_not_found: print("These fields couldn't be matched:") for field in result.fields_not_found: print(f" - {field}") # Consider adjusting descriptions or lowering confidence threshold ``` ## Example: Tax Form ```python theme={null} from datalab_sdk import DatalabClient, FormFillingOptions client = DatalabClient() options = FormFillingOptions( field_data={ "first_name": {"value": "John", "description": "First name"}, "last_name": {"value": "Doe", "description": "Last name"}, "ssn": {"value": "123-45-6789", "description": "Social Security Number"}, "address": {"value": "123 Main Street", "description": "Street address"}, "city": {"value": "Springfield", "description": "City"}, "state": {"value": "IL", "description": "State abbreviation"}, "zip": {"value": "62701", "description": "ZIP code"}, "filing_status": {"value": "Single", "description": "Filing status"}, "signature": {"value": "John Doe", "description": "Taxpayer signature"}, "date": {"value": "2024-04-15", "description": "Date signed"}, }, context="IRS W-4 Employee's Withholding Certificate" ) result = client.fill("w4_form.pdf", options=options) print(f"Filled {len(result.fields_filled)} fields") print(f"Unmatched: {result.fields_not_found}") result.save_output("w4_filled.pdf") ``` ## Next Steps Detailed guide on form 
filling with field matching and templates. Upload, list, and manage files in Datalab storage. Convert documents to Markdown, HTML, JSON, or chunks. Chain processors into versioned, reusable pipelines. # Pipelines Source: https://documentation.datalab.to/docs/welcome/sdk/pipelines Create, version, and run document processing pipelines using the Datalab SDK. ## Overview Pipelines chain processors (convert, extract, segment, custom) into reusable, versioned configurations. See [Pipeline Overview](/docs/recipes/pipelines/pipeline-overview) for concepts. ## Basic Usage ```python theme={null} from datalab_sdk import DatalabClient, PipelineProcessor client = DatalabClient() # Create a pipeline pipeline = client.create_pipeline(steps=[ PipelineProcessor(type="convert", settings={"mode": "balanced"}), PipelineProcessor(type="extract", settings={ "page_schema": { "type": "object", "properties": { "title": {"type": "string"}, "date": {"type": "string"} } } }) ]) # Save and publish pipeline = client.save_pipeline(pipeline.pipeline_id, name="My Pipeline") version = client.create_pipeline_version(pipeline.pipeline_id) # Run execution = client.run_pipeline(pipeline.pipeline_id, file_path="doc.pdf") execution = client.get_pipeline_execution(execution.execution_id, max_polls=300) # Get results result = client.get_step_result(execution.execution_id, step_index=1) ``` ## Models ### PipelineProcessor Defines a single processor in a pipeline. 
```python theme={null} from datalab_sdk import PipelineProcessor step = PipelineProcessor( type="extract", # Step type settings={"page_schema": {...}}, # Step-specific config custom_processor_id="cp_abc123", # For custom steps eval_rubric_id=42, # Optional eval rubric ) ``` | Field | Type | Required | Description | | --------------------- | ---- | -------- | ---------------------------------------------------- | | `type` | str | Yes | `"convert"`, `"extract"`, `"segment"`, or `"custom"` | | `settings` | dict | Yes | Step-specific configuration | | `custom_processor_id` | str | No | Custom processor ID for `"custom"` steps | | `eval_rubric_id` | int | No | Evaluation rubric to apply | ### PipelineConfig Returned by pipeline CRUD methods. | Field | Type | Description | | ---------------- | -------- | ------------------------------------------------------ | | `pipeline_id` | str | Unique ID (`pl_XXXXX`) | | `steps` | list | Ordered list of step definitions | | `name` | str | Pipeline name (set via `save_pipeline`) | | `is_saved` | bool | Whether pipeline has been saved | | `archived` | bool | Whether pipeline is archived | | `active_version` | int | Current published version (`0` = no published version) | | `created` | datetime | Creation timestamp | | `updated` | datetime | Last update timestamp | ### PipelineVersion Immutable snapshot of pipeline steps at a point in time. | Field | Type | Description | | ------------- | -------- | -------------------------- | | `version` | int | Version number | | `steps` | list | Steps at this version | | `description` | str | Version description | | `created` | datetime | When version was published | ### PipelineExecution Result from running a pipeline. 
| Field | Type | Description | | ------------------ | -------- | -------------------------------------------------------------------- | | `execution_id` | str | Unique ID (`pex_XXXXX`) | | `pipeline_id` | str | Pipeline that was executed | | `pipeline_version` | int | Version used (`0` = draft) | | `status` | str | `pending`, `running`, `completed`, `completed_with_errors`, `failed` | | `steps` | list | List of `PipelineExecutionStepResult` | | `started_at` | datetime | Execution start time | | `completed_at` | datetime | Execution end time | | `created` | datetime | When execution was created | | `config_snapshot` | dict | Frozen step configuration used | | `input_config` | dict | Input file details | | `rate_breakdown` | dict | Billing breakdown | ### PipelineExecutionStepResult Status of a single step within an execution. | Field | Type | Description | | --------------- | -------- | -------------------------------------------------------------------- | | `step_index` | int | Position in pipeline | | `step_type` | str | Step type | | `status` | str | `pending`, `dispatched`, `running`, `completed`, `failed`, `skipped` | | `result_url` | str | URL to fetch step result | | `checkpoint_id` | str | Checkpoint passed to downstream steps | | `started_at` | datetime | Step start time | | `finished_at` | datetime | Step end time | | `error_message` | str | Error details if failed | ## Pipeline Management ### Create ```python theme={null} pipeline = client.create_pipeline(steps=[ PipelineProcessor(type="convert", settings={"mode": "balanced"}), PipelineProcessor(type="extract", settings={"page_schema": {...}}) ]) ``` ### Save ```python theme={null} pipeline = client.save_pipeline(pipeline.pipeline_id, name="Invoice Parser") ``` ### Update Creates a draft if a published version exists: ```python theme={null} pipeline = client.update_pipeline(pipeline.pipeline_id, steps=[ PipelineProcessor(type="convert", settings={"mode": "accurate"}), PipelineProcessor(type="extract", 
settings={"page_schema": {...}}) ]) ``` ### List ```python theme={null} result = client.list_pipelines( saved_only=True, # Only saved pipelines (default) include_archived=False, # Include archived (default: False) limit=50, offset=0 ) for p in result["pipelines"]: print(f"{p.pipeline_id}: {p.name}") ``` ### Get ```python theme={null} pipeline = client.get_pipeline("pl_abc123") ``` ### Archive / Unarchive ```python theme={null} client.archive_pipeline("pl_abc123") client.unarchive_pipeline("pl_abc123") ``` ## Versioning ### Publish a Version ```python theme={null} version = client.create_pipeline_version( "pl_abc123", description="Added line items extraction" ) print(f"Published v{version.version}") ``` ### List Versions ```python theme={null} result = client.list_pipeline_versions("pl_abc123") for v in result["versions"]: print(f"v{v.version}: {v.description}") ``` ### Discard Draft ```python theme={null} # Revert to active published version pipeline = client.discard_pipeline_draft("pl_abc123") # Revert to a specific version pipeline = client.discard_pipeline_draft("pl_abc123", version=1) ``` ### Get Rate ```python theme={null} rate = client.get_pipeline_rate("pl_abc123") print(f"{rate['rate_per_1000_pages_cents']} cents per 1000 pages") ``` ## Execution ### Run ```python theme={null} execution = client.run_pipeline( "pl_abc123", file_path="document.pdf", # or file_url="https://..." 
page_range="0-10", output_format="json", skip_cache=False, run_evals=False, webhook_url="https://example.com/hook", version=2, # omit for active version max_polls=1, # polls after submission poll_interval=1, ) ``` | Parameter | Type | Default | Description | | --------------- | ---- | -------- | ------------------------------------------------- | | `pipeline_id` | str | Required | Pipeline to run | | `file_path` | str | - | Local file path | | `file_url` | str | - | URL to document | | `page_range` | str | - | Pages to process (`"0-5,10"`) | | `output_format` | str | - | Override output format | | `skip_cache` | bool | `False` | Skip cached results | | `run_evals` | bool | `False` | Run eval rubrics on steps | | `webhook_url` | str | - | Webhook URL for completion | | `version` | int | - | Version to run (omit=active, 0=draft, N=specific) | | `max_polls` | int | `1` | Polling attempts | | `poll_interval` | int | `1` | Seconds between polls | ### Poll Execution ```python theme={null} execution = client.get_pipeline_execution( "pex_abc123", max_polls=300, poll_interval=2 ) ``` ### List Executions ```python theme={null} result = client.list_pipeline_executions("pl_abc123", limit=20) for ex in result["executions"]: print(f"{ex.execution_id}: {ex.status}") ``` ### Get Step Result ```python theme={null} result = client.get_step_result("pex_abc123", step_index=1) ``` ## Async Usage All pipeline methods are available on `AsyncDatalabClient`: ```python theme={null} import asyncio from datalab_sdk import AsyncDatalabClient, PipelineProcessor async def run(): async with AsyncDatalabClient() as client: pipeline = await client.create_pipeline(steps=[ PipelineProcessor(type="convert", settings={"mode": "balanced"}), PipelineProcessor(type="extract", settings={"page_schema": { "type": "object", "properties": {"title": {"type": "string"}} }}) ]) pipeline = await client.save_pipeline( pipeline.pipeline_id, name="Async Pipeline" ) execution = await client.run_pipeline( 
pipeline.pipeline_id, file_path="doc.pdf" ) execution = await client.get_pipeline_execution( execution.execution_id, max_polls=300 ) result = await client.get_step_result( execution.execution_id, step_index=1 ) return result result = asyncio.run(run()) ``` ## Next Steps Concepts, processor types, and when to use pipelines. Step-by-step guide to building pipelines. Manage drafts and publish versions. Execution, overrides, and result retrieval. # Document Segmentation Source: https://documentation.datalab.to/docs/welcome/sdk/segmentation Segment documents into logical sections using the Datalab SDK. ## Basic Usage ```python theme={null} import json from datalab_sdk import DatalabClient, SegmentOptions client = DatalabClient() # Define a segmentation schema with section names and descriptions segmentation_schema = json.dumps({ "sections": [ {"name": "introduction", "description": "Introduction and overview"}, {"name": "methodology", "description": "Methods and approach"}, {"name": "results", "description": "Findings and results"}, {"name": "conclusion", "description": "Summary and conclusions"}, {"name": "references", "description": "Bibliography and references"} ] }) options = SegmentOptions(segmentation_schema=segmentation_schema) result = client.segment("research_paper.pdf", options=options) # Access segmentation results segments = result.segmentation_results for segment in segments: print(f"{segment['name']}: pages {segment['page_range']}") ``` ## Segment Options Use `SegmentOptions` to configure segmentation behavior: | Option | Type | Default | Description | | --------------------- | ---- | ------------ | --------------------------------------------------------------------------------------- | | `segmentation_schema` | str | **Required** | JSON schema defining segment names and descriptions | | `checkpoint_id` | str | None | Checkpoint ID from a previous `convert()` call | | `mode` | str | `"fast"` | Processing mode: `"fast"`, `"balanced"`, `"accurate"` | | 
`save_checkpoint` | bool | `False` | Save checkpoint for reuse with subsequent calls | | `max_pages` | int | None | Maximum number of pages to process | | `page_range` | str | None | Specific pages to process (e.g., `"0-5,10"`). For spreadsheets, filters by sheet index. | | `skip_cache` | bool | `False` | Skip cached results, force reprocessing | | `webhook_url` | str | None | Webhook URL for completion notification | ## Checkpoint Reuse Use checkpoints to avoid re-parsing a document when running segmentation after conversion. First convert with `save_checkpoint=True`, then segment using the returned `checkpoint_id`: ```python theme={null} import json from datalab_sdk import DatalabClient, ConvertOptions, SegmentOptions client = DatalabClient() # Step 1: Convert and save a checkpoint convert_options = ConvertOptions( mode="accurate", save_checkpoint=True, ) convert_result = client.convert("report.pdf", options=convert_options) print(convert_result.markdown) # Step 2: Segment using the checkpoint (no re-parsing needed) segmentation_schema = json.dumps({ "sections": [ {"name": "executive_summary", "description": "Executive summary"}, {"name": "financials", "description": "Financial data and analysis"}, {"name": "outlook", "description": "Future outlook and projections"}, ] }) segment_options = SegmentOptions( segmentation_schema=segmentation_schema, checkpoint_id=convert_result.checkpoint_id, ) segment_result = client.segment("report.pdf", options=segment_options) print(segment_result.segmentation_results) ``` ## Segmentation Result The result object contains the segmentation data alongside standard conversion fields: ```python theme={null} result = client.segment("document.pdf", options=options) # Segmentation results (list of segments with names and page ranges) segments = result.segmentation_results for segment in segments: print(f"Section: {segment['name']}") print(f" Pages: {segment['page_range']}") # Standard conversion fields are also available 
print(result.success) print(result.markdown) print(result.page_count) print(result.cost_breakdown) ``` ## Async Usage ```python theme={null} import asyncio import json from datalab_sdk import AsyncDatalabClient, SegmentOptions async def segment_document(): async with AsyncDatalabClient() as client: segmentation_schema = json.dumps({ "sections": [ {"name": "introduction", "description": "Introduction"}, {"name": "body", "description": "Main content"}, {"name": "conclusion", "description": "Conclusion"}, ] }) options = SegmentOptions(segmentation_schema=segmentation_schema) result = await client.segment("document.pdf", options=options) return result.segmentation_results segments = asyncio.run(segment_document()) print(segments) ``` ## Next Steps Learn more about document segmentation patterns and use cases. Extract structured data from documents using JSON schemas. Convert documents to Markdown, HTML, JSON, or chunks. # Welcome to Datalab Source: https://documentation.datalab.to/index Datalab provides document intelligence APIs to convert PDFs, spreadsheets, images, and other formats into structured, machine-readable outputs — fast, accurately, and at scale. We offer a [fully managed platform](./docs/welcome/api), [on-prem deployment](./docs/on-prem/overview) for sensitive documents, and open-source tools for developers. **New accounts include \$5 in free credits** — [sign up here](https://www.datalab.to/auth/sign_up). 
## Key Capabilities * **Document Conversion** — Parse PDFs, Word docs, and spreadsheets into Markdown, HTML, or JSON (powered by [Marker](https://github.com/datalab-to/marker), [Surya](https://github.com/datalab-to/surya), and [Chandra](https://github.com/datalab-to/chandra)) * **Pipelines** — Chain processors into versioned, reusable configurations and deploy to production * **Structured Extraction** — Extract specific fields with citations back to source bounding boxes for auditability * **Form Filling** — Automatically fill PDF and image forms with structured data * **Document Segmentation** — Split multi-document PDFs into separate logical sections * **Track Changes** — Extract redlines and comments from Word documents * **OCR** — High-accuracy text recognition supporting 90+ languages ## What do you want to do? **Convert documents to structured formats** → [Document Conversion](./docs/recipes/conversion/conversion-api-overview) **Extract specific data from documents** → [Structured Extraction](./docs/recipes/structured-extraction/api-overview) **Automatically fill PDF forms** → [Form Filling](./docs/recipes/form-filling/form-filling-api-overview) **Split combined PDFs into separate documents** → [Document Segmentation](./docs/recipes/document-segmentation/auto-segmentation) **Build document processing pipelines** → [Pipelines](./docs/recipes/pipelines/pipeline-overview) **Extract tracked changes from Word documents** → [Track Changes](./docs/recipes/extract-redlines-and-comments/track-changes-from-word-documents) ## Who uses Datalab? 
Datalab serves teams building AI agents, RAG systems, and document automation workflows: * **AI/ML teams** — Feed knowledge graphs, retrieval systems, and automation pipelines with clean, structured document data * **Enterprises** — Automate high-volume document processing with auditability and citation tracking * **Product teams** — Convert financial statements, legal filings, tax forms, and research papers into product-ready content ## Getting Started Start converting documents in minutes with Python. REST API documentation. Chain processors into versioned, reusable pipelines. Run our models locally. ## Support Email [support@datalab.to](mailto:support@datalab.to) for help. Check API availability. # Billing Source: https://documentation.datalab.to/platform/billing Datalab uses per-page pricing — you pay only for the pages you process, across whichever processors you run. This page explains how billing works and how to manage your usage. ### Per-Page Pricing Every API request consumes credits based on the number of pages processed: * Charges are rounded up to the nearest cent. * Rates vary by processor (convert, extract, segment, etc.) and processing mode. * Add-ons such as `word_bboxes`, `table_cell_bboxes`, and `list_item_bboxes` are billed additively per 1K pages on top of the base rate. See the full rate card at [datalab.to/pricing](https://www.datalab.to/pricing). ### Free Tier New accounts receive a **monthly usage allowance** with no credit card required: * **\$20/month** for accounts created with a work email address * **\$10/month** for accounts created with a personal email address Credits reset at the start of each 30-day cycle. The free tier supports up to 10 requests per minute and is designed to let you run a complete proof of concept before committing to a paid plan. ### Pay-as-You-Go When you outgrow the free allowance, add a payment method and switch to pay-as-you-go. There is no subscription, no plan to choose, and no minimum spend. 
You are billed only for the pages you actually process. Processors are additive — if you convert and then extract the same document, you pay for each step separately. ### Team Plan For production workloads, the **Team** plan is \$400/month and includes: * \$400 of monthly usage (same per-page rates) * Production rate limits * Clickthrough BAA/DPA * SOC 2 report * Additional custom processor capacity [Contact us](https://www.datalab.to/contact) for Enterprise pricing with volume discounts and custom SLAs. ### Payment Failures and Grace Periods When a payment fails, you will receive an email notification from Stripe. * On failure, your account enters an `unpaid` state. * A 24-hour grace period with a usage cap gives you time to resolve the issue. * After the grace period, API access is blocked until payment is resolved. Update your payment method in the [billing dashboard](https://www.datalab.to/app/billing) or [contact support](mailto:support@datalab.to). ## Understanding Your Usage ### What Counts as a Page? * **PDF files**: Each page in the PDF * **Images**: Each image file counts as one page * **Office documents**: Each page in the document * **Multi-page TIFFs**: Each frame counts as a separate page * **Spreadsheets**: Pricing varies by extraction mode: * **Simple mode**: 2,500 cells per page, capped at 100 pages (\$0.60) per sheet * **Advanced mode**: 500 cells per page, no cap * For files with multiple sheets, each sheet is calculated separately and then summed ### Monitoring Usage Track usage through: 1. **Dashboard Overview**: Real-time usage statistics at [datalab.to/app](https://www.datalab.to/app) 2. **Usage Reports**: Detailed breakdown by processing type Usage statistics may be slightly delayed. ## Next Steps Understand file size limits, page limits, and rate limiting. Understand HTTP error codes and subscription errors. Get started converting documents in minutes. See the latest updates and changes to the Datalab platform. 
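The spreadsheet rules under "What Counts as a Page?" can be turned into a rough page estimate. The sketch below is illustrative only: the function name and the `"simple"`/`"advanced"` labels are hypothetical (they are not confirmed API parameter values), and rounding a partially filled page up to a whole page is an assumption the rules above do not state.

```python
import math

# Cells-per-page figures taken from the "What Counts as a Page?" rules.
CELLS_PER_PAGE = {"simple": 2500, "advanced": 500}
SIMPLE_MODE_PAGE_CAP = 100  # per-sheet cap, applies to simple mode only


def estimate_spreadsheet_pages(cells_per_sheet, mode="simple"):
    """Estimate billable pages for a workbook, given one cell count per sheet.

    Assumption: a partially filled page is rounded up to a whole page.
    """
    if mode not in CELLS_PER_PAGE:
        raise ValueError(f"unknown mode: {mode!r}")
    total = 0
    for cells in cells_per_sheet:
        pages = math.ceil(cells / CELLS_PER_PAGE[mode])
        if mode == "simple":
            pages = min(pages, SIMPLE_MODE_PAGE_CAP)
        total += pages  # each sheet is calculated separately, then summed
    return total


sheets = [5_000, 1_200, 400_000]  # hypothetical cell counts for three sheets
print(estimate_spreadsheet_pages(sheets, mode="simple"))    # 2 + 1 + 100 (capped)
print(estimate_spreadsheet_pages(sheets, mode="advanced"))  # 10 + 3 + 800
```

Treat the result as an estimate for planning; the usage dashboard and the rate card remain the authoritative figures.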
# Changelog Source: https://documentation.datalab.to/platform/changelog Major changes to the Datalab hosted service are listed here. ## 6/24/2026 * **Breaking change for `word_bboxes` users:** The `metadata.words` JSON array is no longer emitted in API responses. Per-word bounding boxes are now exclusively available as inline `` elements in HTML output. If you were reading word bboxes from `page_info[id].metadata.words`, update your code to parse them from the HTML spans instead. ## 6/22/2026 * API keys that reached their 30-day spend cap now return HTTP 402 with a clear spend-limit message instead of a misleading "Invalid API key" 401 error. If you were catching 401 errors to detect spend-cap exhaustion, update to catch 402. * Checkboxes and radio buttons detected in documents now render as `☒` (checked) or `☐` (unchecked) in markdown output instead of being silently dropped. ## 6/18/2026 * Free tier and pay-as-you-go pricing launched — new accounts receive a monthly usage allowance ($20 on a work email, $10 on a personal email) with no credit card required. Add a card to upgrade to pay-as-you-go and be billed only for pages you process, with no subscription or minimum commitment. See the [pricing page](https://www.datalab.to/pricing) for per-processor rates. ## 6/16/2026 * Word-level bounding boxes (`word_bboxes`), table cell/row/column bboxes (`table_cell_bboxes` extra), and list item bboxes (`list_item_bboxes` extra) are now available to all teams — no longer require allowlist access. `table_cell_bboxes` and `list_item_bboxes` are billed at \$0.30 per 1K pages each and automatically enable `word_bboxes`. HTML output carries `data-bbox` and `data-confidence` attributes on the annotated elements; `table_row_bboxes` is deprecated and replaced by `table_cell_bboxes`. * Maximum input image dimensions increased by 1.5× (from 4,800×7,800 px to 7,200×11,700 px), reducing rejections for large-format scans and high-resolution page images. 
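For code that previously read bounding boxes from JSON metadata, one possible way to read them from HTML output is to collect every element that carries a `data-bbox` attribute. This is a minimal sketch using only the Python standard library. The sample HTML, its tag names, and the attribute values are made up for illustration, and the value format of `data-bbox` is deliberately left unparsed because it is not described here.

```python
from html.parser import HTMLParser


class BBoxCollector(HTMLParser):
    """Collect every element that carries a data-bbox attribute."""

    def __init__(self):
        super().__init__()
        self.annotated = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if "data-bbox" in attr_map:
            self.annotated.append({
                "tag": tag,
                "bbox": attr_map["data-bbox"],  # raw string; format not assumed
                "confidence": attr_map.get("data-confidence"),
            })


# Hypothetical snippet: real API output may use different tags and values.
sample_html = (
    '<p><span data-bbox="10 20 50 32" data-confidence="0.98">Hello</span> '
    "<span>plain</span></p>"
)

collector = BBoxCollector()
collector.feed(sample_html)
print(collector.annotated)
```

Check a real response to confirm which elements are annotated and how the attribute values are encoded before relying on this in production.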
## 6/15/2026 * Custom Processors are now generally available to all authenticated teams — no allowlist required. Create, iterate, and run processors from the [dashboard](https://www.datalab.to/app/processors) or via `POST /api/v1/custom-processor` (SDK: `client.run_custom_processor()`). ## 6/12/2026 * On-premises container adds structured extraction via `POST /api/v1/extract`. Supports `fast` and `turbo` extraction modes. Requires the Chandra model with the Lift model enabled; `balanced` mode is not available on-prem. ## 6/4/2026 * Structured extraction now offers two modes via the `extraction_mode` parameter on `/api/v1/extract`: **fast** (lowest latency, $6 / 1K pages) and **balanced** (higher accuracy with per-field verification, reasoning, and citations, $25 / 1K pages). `balanced` is the default. See [Balanced Extraction Mode](/docs/recipes/structured-extraction/balanced-mode). * Teams that made an extraction request in the 30 days before June 4, 2026 keep **fast** as their default extraction mode; all other teams (and new teams) default to **balanced**. Set `extraction_mode` explicitly on any request to override the default. ## 5/22/2026 * `processing_location` is now supported in the direct file upload API — include `"processing_location": "eu"` in the `POST /api/v1/files/upload` request body to store uploaded files in EU infrastructure before passing the reference to inference endpoints. ## 5/21/2026 * New `processing_location` parameter on all inference API endpoints (`/api/v1/convert`, `/api/v1/extract`, `/api/v1/segment`, `/api/v1/fill`, `/api/v1/track-changes`, and pipeline runs) — pass `"eu"` to route processing and result storage to EU infrastructure for data residency requirements. When using `processing_location`, send the file via `file_url` or a pre-uploaded `datalab://` reference; multipart form uploads are not supported with this parameter. EU-region processing carries a regional pricing premium. 
* Helm chart is now available for deploying the on-prem inference container on Kubernetes clusters. ## 4/9/2026 * Custom processor creation in Forge is now a guided 3-step wizard (Describe → Documents → Review) with a chat-driven builder that helps you articulate what your processor should do before generating it. ## 4/8/2026 * Form filling is now a first-class pipeline step type in Forge — build standalone fill pipelines with `PipelineProcessor(type="fill", settings={"field_data": {...}})` to apply versioning and execution tracking to your form-filling workflows. * Forge pipeline workspace now shows extraction confidence scores inline as each pipeline step completes. * Playground block annotations — give feedback on individual parsed blocks directly in the document view. ## 4/6/2026 * Spreadsheets now support the `page_range` parameter — use 0-based sheet indices (e.g., `"0,2"`) to process only specific sheets from a workbook. * Forge is now the primary hub for creating and editing pipelines. Build processor chains visually, configure per-processor settings, and deploy versioned pipelines directly from the UI. * Pipeline versioning UI — create, browse, and restore versions from within Forge. Discard draft changes to revert to any published version. * Per-processor execution status tracking in Forge — watch each processor (convert, extract, segment) complete in real-time as a pipeline runs. * New pipeline draft/discard API — `POST /api/v1/pipelines/{pipeline_id}/discard` (SDK: `client.discard_pipeline_draft()`) discards unsaved edits and reverts the pipeline to any published version. * `POST /api/v1/custom-pipeline` is deprecated — use `POST /api/v1/pipelines/{pipeline_id}/run` instead. * Workflows API (`/api/v1/workflows`) is deprecated. We recommend using [Pipelines](/docs/recipes/pipelines/pipeline-overview) for all new integrations and migrating any existing ones. ## 4/2/2026 * Saved Schemas are now available to all users. 
Create and manage reusable extraction schemas from the [dashboard](https://www.datalab.to/app/schemas), then pass `schema_id` to `POST /api/v1/extract` instead of an inline `page_schema`. Use `schema_version` to pin extractions to a specific schema version. * Forge Evals now supports extraction comparison — select saved schemas in the compare flow to run and score extractions side-by-side across document collections. Scores display inline in the eval grid. ## 4/1/2026 * Playground now shows real-time scoring-in-progress indicators while extraction confidence scores are being computed asynchronously. * Improved parse quality scoring accuracy with an upgraded underlying model. ## 3/31/2026 * Saved Schemas — create and manage reusable extraction schemas via `POST /api/v1/extraction_schemas`. Pass `schema_id` to `POST /api/v1/extract` instead of inline `page_schema` to reference a saved schema. Schemas support versioning; use `schema_version` to pin to a specific version. ## 3/26/2026 * Extraction confidence scoring released in beta — pass `include_scores=true` to `POST /api/v1/extract` to receive per-field `_score` values alongside citations, or score asynchronously via the new `POST /api/v1/extract/score` endpoint. Scoring is free. [Learn more](/docs/recipes/structured-extraction/confidence-scoring). * New usage threshold alerts — set daily page-count thresholds in the dashboard to receive email notifications when your API usage approaches or exceeds them. ## 3/23/2026 * New pipeline templates in Forge — browse and run example pipeline configurations directly in PipelineWorkspace to see results without any setup. ## 3/20/2026 * Custom Pipelines now support a `classify` modification type — use LLM structured output to classify pages into categories and route subsequent processing steps based on classification results. ## 3/16/2026 * Per-page concurrency limit enforcement is now active. 
API results will return `success: false` with an error message if your team exceeds 5,000 pages in flight simultaneously. See [API Limits](/docs/common/limits) for details. ## 1/25/2026 * New Create Document API (`POST /api/v1/create-document`) — generate DOCX files from markdown with native Word track changes. Supports insertions (``), deletions (``), and comments (``) that appear as reviewable changes in Microsoft Word. SDK: `client.create_document()`. ## 1/24/2026 * Custom Pipelines beta launch — create reusable AI-powered document processing pipelines and execute them via the API (`POST /api/v1/custom-pipeline`). SDK: `client.run_custom_pipeline()`. ## 1/22/2026 * Chandra 1.5 release with improved table extraction, chemistry support, diagram rendering, and latency improvements. New `new_block_types` extra enables detection of chemistry structures, handwriting, and signatures. ## 12/19/2025 * Forge Evals now supports comparing against external providers — evaluate Datalab's parsing against other open source models (OlmoOCR, RolmoOCR, DotsOCR, DeepSeekOCR) and third-party services (upon request). ## 12/17/2025 * Form Filling API launch — automatically fill PDF and image forms with structured data. Supports native PDF form fields and visual/scanned forms. ## 12/10/2025 * Forge Evals launch — evaluate and compare parsing configurations across your documents to find optimal settings. ## 12/5/2025 * Improved tracked changes extraction from Word documents with better performance and accuracy. ## 12/4/2025 * Spreadsheet parsing support — parse Excel (.xlsx, .xls) and other spreadsheet formats with the Convert API. ## 12/3/2025 * Agni model improvements for better multi-page section hierarchy detection in OCR. ## 12/2/2025 * Chandra speed improvements — faster document processing with optimized inference. ## 12/1/2025 * Chandra 1.1 release with improved accuracy and performance. 
## 11/18/2025

* Enhanced password security: the minimum password length is now 12 characters, and passwords are validated against a list of 100K common/compromised passwords, per NIST SP 800-63B.
* Improved section header hierarchy detection in accurate mode.

## 10/23/2025

* Workflows beta launch! You can now use the API and SDK to compose steps such as parse, extract, segment, and conditional logic into reusable document processing workflows.

## 10/22/2025

* New model launch! Our SOTA model, Chandra, is now publicly available, open-source, and accessible via our API (when using modes `balanced` and `accurate`).

## 10/20/2025

* During the global AWS outage we put mitigations in place to work around issues our upstream providers were experiencing. With these mitigations, despite ongoing upstream outages, we restored API service to our customers.

## 10/10/2025

* If a parse takes over 10 seconds in the playground, users are offered an email notification when it completes.
* Fixes and improvements to long-document processing in the playground.
* Fixes to how request statuses are updated (e.g., from "processing" to "complete"), so they update properly and on time.

## 10/8/2025

* v1.0.7 of our container released with stability improvements for very long-running containers (self-serve and enterprise customers only).

## 10/6/2025

* Users can now click on "View in Playground" on API requests in the Usage tab to view how their document was parsed, segmented, or extracted. This feature is enabled as long as users have the correct data retention settings.

## 10/3/2025

* v1.0.5 of our container released with settings to significantly reduce log output, useful for highly scaled workloads.

## 9/25/2025

* Improvements to Segment/Extract UX in the playground.
* Fixes and improvements to segmentation results.
## 9/18/2025

* High Accuracy Mode launch -- API users can select `mode: "accurate"` for our highest accuracy document processing, trading off latency and cost (both higher).
* Public playground launch -- unauthenticated users can access the same playground experience as subscribers (with limitations) at [https://www.datalab.to/playground](https://www.datalab.to/playground)

## 9/16/2025

* New playground launch -- we now offer a significantly improved playground where users can inspect how their documents are parsed or view document segmentation/structured extraction results.
* Segmentation V1 launch -- API users can segment documents automatically or with a schema.

## 9/5/2025

* v1.0.2 of our container released supporting both self-serve and enterprise customers with improved functionality and stability.
* Added `marker_lite` support to our container to measure OCR-likelihood.

## 9/1/2025

* Users can view showcased static examples in the Datalab playground.
* RTF file format support added to the API.

## 8/27/2025

* Launched our self-serve on-prem container, purchasable via Stripe checkout -- no sales or contracting process required.
* Added support for our `/ocr` endpoint in the container in addition to `/marker`.

## 8/20/2025

* Users can generate schemas automatically based on document content in the playground.
* Improvements to structured extraction quality and latency.

## 8/15/2025

* Users can view citation highlights from structured extraction requests in the playground.
* If parse quality scores are available, they will now be returned in the `/marker` response.

## 8/5/2025

* Launch a new OCR model with improved math performance.
* Improve marker quality in cases where there are inline equations or other text that needs OCR.

## 7/25/2025

* Improve speed of LLM mode and when outputting multiple output formats.

## 7/20/2025

* Launch a visual editor for structured extraction that lets you edit schemas and visualize results.
## 7/15/2025

* Add a visual editor for marker prompts that lets you see how the document was changed, test across documents, and save prompts.

## 7/1/2025

* Structured extraction beta - pass `page_schema` to the `marker` endpoint to extract structured data from documents. The schema should be a pydantic schema generated with `.model_json_schema()`, or another JSON schema format.
* Support the new `chunks` output format for marker, which is a simplified list of blocks with their full html, ideal for chunking/RAG.
* Marker endpoint is now promptable - pass `block_correction_prompt` to the marker endpoint to correct the output of marker with your custom logic.
* We support additional configuration parameters for marker via the `additional_config` parameter. This is a JSON object where the keys are the configuration options and the values are the settings for those options. You can see the exact options in the API schema.

## 6/26/2025

* Support multiple output formats for one doc by passing them as comma-separated values in `output_format` for marker.
* Complete redesign of the dashboard, with a new look and feel. This will also make it easier for us to improve functionality in the future.

## 6/18/2025

* Improve the playground to make it more functional (easier to test options)
* Significantly improve styling in the playground
* Add a public version of the playground to make marker easier to test

## 6/3/2025

* Initial launch of playground, for testing marker parsing configurations

## 5/27/2025

* New OCR model which benchmarks better overall, handles inline math, gives detailed character bboxes.
* Add `format_lines` flag to marker to add inline math and formatting to lines. (This also automatically OCRs lines that need it.)

## 3/26/2025

* Add support for multiple file formats - spreadsheets, epub, html, in addition to existing document, image, pdf, and presentation formats.
* Improve inline math and formatting when passing `use_llm`.
* `use_llm` (the high accuracy mode) now costs the same as regular inference. ## 1/30/2025 Marker: * Integrate a new table recognition model, which handles rowspans and colspans better. This is a significant improvement on the old model. * Improve the `--use_llm` option to merge tables across pages, OCR handwriting, OCR forms, and generally have much higher quality than before. * Integrate a new LaTeX OCR model that is significantly more accurate. * Add links and references to the markdown - the references include internal links. General: * Speed up inference time. * Remove the line detection endpoint - it had low usage. * Improve the `table_rec` endpoint - it now takes the `--use_llm` flag, and should run much faster. ## 1/3/2025 * Add the `use_llm` option to the marker API - this uses an LLM to make conversion much more accurate for tables, forms, inline math, and complex pages. It's a beta feature, and will currently double the cost per request. * Added other options to the marker endpoint. * Use `disable_image_extraction` to disable image extraction for marker. * Use `strip_existing_ocr` to strip all existing OCR text and re-OCR (if it was added by something like tesseract) * Better automatic heuristics for when to OCR with marker. * Better text extraction and layout detection for marker. * Speed up the marker and OCR endpoints by \~30%. ## 12/4/2024 * Uploaded files can now be up to 200MB in size. * Improved speed by optimizing file handling on the backend. 
## 12/3/2024 * We now offer \$5 in free credits to new signups * Additional bugfixes to improve markdown output quality ## 12/2/2024 * We sped up file operations internally, which should result in a decent API speed boost * We now handle blockquotes and nested lists with the marker endpoint ## 11/27/2024 * Marker is now at v1, with a lot of improvements - it's 4x faster than a month ago, and quality is much higher across all document types * The layout model has been upgraded to a new version, with more potential prediction types ## 10/31/2024 * More API speedups, on the order of 15-20% for marker. * Bump concurrency/rate limits to 200. * Improve stability of service under load. * If you cancel, you will now retain your credits until the end of the month. * Visual improvements on the marketing site. ## 10/28/2024 * Significant API speedups, on the order of 40% faster. ## 10/25/24 * Flatten form fields into pdf when extracting tables and markdown * Fix page separators, they now appear at the start of every page, and include a page number ## 10/23/24 * Speed up marker, layout, and detection by 20-30% * Fix various bugs that cause edge case errors in conversion * Increase concurrent request limit to 100 ## 10/21/24 * Significantly improve marker output quality * Include header levels like h1, h2, etc. * Parse tables very accurately * Improve block type detection and markdown quality * Fix many output bugs * Add in new table recognition model at the /table\_rec endpoint * This will detect and convert tables into a given format * Improve OCR, layout, text detection quality * Fix memory leaks and improve performance * Fix bugs with pagination and marker ## 8/19/2024 * Add in new OCR model with better accuracy across the board * Language is now optional for marker and OCR model * Increase max page count and max pixel width ## 7/20/2024 * Drop prices for marker and surya inference. ## 7/12/2024 * Significant speedup for marker and surya text detection/layout. 
10-15% faster. ## 7/10/2024 * Increase concurrent request limit to 50. ## 7/6/2024 * Major infrastructure stability improvements. ## 7/3/2024 * Added response caching for up to 1 hour. If you send the same document to the same endpoint, with the same options, within that time, you'll get a cache hit and won't be billed again. ## 7/2/2024 * Improved parsing for Powerpoint presentations and Word documents. * Add status page and changelog. ## 6/26/2024 * Increase concurrency limits for all users ## 6/25/2024 * Return page count from all endpoints * Users can now disable marker image extraction * Webhooks are now supported instead of polling. Webhooks will ping a given URL when inference is complete. ## 6/21/2024 * Initial support for Microsoft Word and Microsoft Powerpoint documents (docx/doc/pptx/ppt). ## 6/18/2024 * Enable paginating marker output. ## 5/31/2024 * Initial launch of marker and surya APIs. # Error Codes Source: https://documentation.datalab.to/platform/errors HTTP error codes, response formats, and retry guidance. 
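As orientation for the retry guidance on this page: the HTTP Error Codes table marks 429, 500, and 529 as retryable, and the Python SDK already retries 429 and 500 automatically. For clients that call the REST API directly, a retry loop with exponential backoff might look like the sketch below. The helper names, the delay values, and the `send` callable are all hypothetical, and the responses are simulated so the example runs without network access.

```python
import time

# Status codes marked "Retryable: Yes" in the HTTP Error Codes table.
RETRYABLE_STATUS_CODES = {429, 500, 529}


def is_retryable(status_code):
    return status_code in RETRYABLE_STATUS_CODES


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Exponential backoff: base * 2**attempt seconds, capped at `cap`."""
    return min(cap, base * (2 ** attempt))


def call_with_retries(send, max_retries=4, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or retries run out.

    `send` is a hypothetical zero-argument callable returning (status_code, body).
    """
    for attempt in range(max_retries + 1):
        status_code, body = send()
        if not is_retryable(status_code) or attempt == max_retries:
            return status_code, body
        sleep(backoff_delay(attempt))


# Simulated transport: two retryable failures, then success.
responses = iter([
    (429, {"detail": "Rate limit exceeded"}),
    (529, {"detail": "Overloaded"}),
    (200, {"success": True}),
])
status, body = call_with_retries(lambda: next(responses), sleep=lambda seconds: None)
print(status, body)
```

Non-retryable codes such as 400, 401, 402, 403, 404, and 413 are returned to the caller immediately, since repeating the same request would not change the outcome.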
## Error Response Format

All API errors return a JSON response with a `detail` field:

```json theme={null}
{
  "detail": "Error message describing what went wrong"
}
```

For validation errors (malformed request body), the response includes field-level details:

```json theme={null}
{
  "detail": [
    {
      "type": "validation_error",
      "loc": ["body", "field_name"],
      "msg": "Field validation message",
      "input": "provided_value"
    }
  ]
}
```

## HTTP Error Codes

| Code | Type                    | Retryable | Description                                                                                      |
| ---- | ----------------------- | --------- | ------------------------------------------------------------------------------------------------ |
| 400  | `invalid_request_error` | No        | Issue with the format or content of your request                                                 |
| 401  | `authentication_error`  | No        | Invalid or missing API key                                                                       |
| 402  | `spend_cap_error`       | No        | API key has reached its configured 30-day spend cap; increase the limit in your billing settings |
| 403  | `permission_error`      | No        | API key lacks permission or subscription issue                                                   |
| 404  | `not_found_error`       | No        | Requested resource not found or expired                                                          |
| 413  | `request_too_large`     | No        | File exceeds the maximum allowed size                                                            |
| 429  | `rate_limit_error`      | **Yes**   | Rate limit exceeded; wait and retry                                                              |
| 500  | `api_error`             | **Yes**   | Internal server error; wait and retry                                                            |
| 529  | `overloaded_error`      | **Yes**   | API temporarily overloaded; wait and retry                                                       |

## SDK Exception Mapping

The Python SDK maps HTTP errors to specific exception classes:

| HTTP Code  | SDK Exception            | Description                                               |
| ---------- | ------------------------ | --------------------------------------------------------- |
| 400        | `DatalabAPIError`        | Check the `response_data` field for details               |
| 401        | `DatalabAPIError`        | Invalid API key                                           |
| 402        | `DatalabAPIError`        | Spend cap exceeded; check `response_data` for the message |
| 403        | `DatalabAPIError`        | Subscription or permission issue                          |
| 404        | `DatalabAPIError`        | Resource not found or expired                             |
| 413        | `DatalabAPIError`        | File too large                                            |
| 429        | Auto-retried             | SDK retries automatically with exponential backoff        |
| 500        | Auto-retried             | SDK retries automatically                                 |
| Timeout    | `DatalabTimeoutError`    | Request timed out                                         |
| File error | `DatalabFileError`       | File not found or empty                                   |
| Validation | `DatalabValidationError` | Invalid input parameters                                  |

```python theme={null}
from datalab_sdk import DatalabClient
from datalab_sdk.exceptions import (
    DatalabAPIError,
    DatalabTimeoutError,
    DatalabFileError,
    DatalabValidationError,
)

client = DatalabClient()

try:
    result = client.convert("document.pdf")
except DatalabFileError as e:
    print(f"File issue: {e}")
except DatalabTimeoutError as e:
    print(f"Timed out: {e}")
except DatalabAPIError as e:
    print(f"API error (HTTP {e.status_code}): {e}")
    if e.response_data:
        print(f"Details: {e.response_data}")
```

## Common Error Messages

### 400 Bad Request

```json theme={null}
{"detail": "Invalid file type. Only PDF files, word documents, spreadsheets, powerpoints, HTML, and PNG, JPG, GIF, TIFF, and WEBP images are accepted."}
```

```json theme={null}
{"detail": "File size exceeds upload limit of 209715200 bytes."}
```

### 401 Unauthorized

```json theme={null}
{"detail": "Invalid API key provided. Set the X-API-Key header to your API key."}
```

### 403 Forbidden

```json theme={null}
{"detail": "You need an active, paid subscription to use this API."}
```

```json theme={null}
{"detail": "Your subscription has expired. You may need to re-enable your plan, or pay an unpaid invoice."}
```

```json theme={null}
{"detail": "Your payment has failed. Please pay any unpaid invoices to continue using the API."}
```

### 429 Too Many Requests

```json theme={null}
{"detail": "Rate limit exceeded for endpoint /api/v1/convert. You can make 200 requests every 60 seconds. Please try again later, or reach out to support@datalab.to if you need a higher limit."}
```

```json theme={null}
{"detail": "Concurrency exceeded for endpoint /api/v1/convert. You can have 400 concurrent requests running at once. Please try again later, or reach out to support@datalab.to if you need a higher limit."}
```

### Page Concurrency Limit (returned in results, not as an HTTP error)

The page concurrency limit is enforced during processing, not at submission time. Instead of a `429` response, the result will return with `success` set to `false`:

```json theme={null}
{"success": false, "error": "Page rate limit exceeded. Your team has {current_pages} pages in flight and this request adds {page_count} more ({total} total, limit: 5,000). Please wait for some requests to complete before submitting more, or contact support@datalab.to for a higher limit."}
```

See [API Limits](/docs/common/limits#page-concurrency-limit) for details.

## Subscription and Access Errors

When making API requests, you may encounter 403 errors related to your subscription status:

### No Active Subscription

**Error**: `"You need an active, paid subscription to use this API."`

This occurs when you don't have an active subscription and have exhausted your free credits. To resolve:

* Subscribe to a paid plan in the [dashboard](https://www.datalab.to/app/billing)
* New accounts include free credits; verify your email to claim them

### Expired Subscription

**Error**: `"Your subscription has expired. You may need to re-enable your plan, or pay an unpaid invoice."`

Your subscription has passed its end date and grace period. To resolve:

* Renew your subscription in the [dashboard](https://www.datalab.to/app/billing)
* Pay any outstanding invoices

### Payment Failed

**Error**: `"Your payment has failed. Please pay any unpaid invoices to continue using the API."`

A payment for your subscription has failed and you've exceeded the grace period. To resolve:

* Update your payment method in the [dashboard](https://www.datalab.to/app/billing)
* Pay any unpaid invoices

### Inactive Subscription

**Error**: `"Your subscription is not active. You may need to re-enable your plan or pay an unpaid invoice."`

Your subscription is canceled or inactive. To resolve:

* Reactivate your subscription in the [dashboard](https://www.datalab.to/app/billing)
* Subscribe to a new plan

## Next Steps

* Detailed debugging guide for common issues
* Per-page pricing, payment failures, and grace periods
* File size limits, page limits, and rate limiting
* Python SDK with automatic retries and error handling

# Migration Guide

Source: https://documentation.datalab.to/platform/migration

Migrate from deprecated endpoints to the current API.

This guide helps you migrate from deprecated Datalab API endpoints to their current replacements.

## Marker → Dedicated Endpoints

The `/api/v1/marker` endpoint is deprecated. Migrate to the new dedicated endpoints below.

The monolithic `/api/v1/marker` endpoint has been replaced with dedicated endpoints for each operation:

| Old Usage                             | New Endpoint                    | SDK Method                      |
| ------------------------------------- | ------------------------------- | ------------------------------- |
| `/marker` (basic conversion)          | `POST /api/v1/convert`          | `client.convert()`              |
| `/marker` with `page_schema`          | `POST /api/v1/extract`          | `client.extract()`              |
| `/marker` with `segmentation_schema`  | `POST /api/v1/segment`          | `client.segment()`              |
| `/marker` with `extras=track_changes` | `POST /api/v1/track-changes`    | `client.track_changes()`        |
| `/marker` with `pipeline_id`          | `POST /api/v1/custom-processor` | `client.run_custom_processor()` |

### SDK upgrade

Update to the latest SDK for the new dedicated methods:

```bash theme={null}
pip install --upgrade datalab-python-sdk
```

SDK users who only use `client.convert()` do not need to change code; it continues to work and now calls `/api/v1/convert` internally.
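As a rough sketch of the mapping table above, the helper below picks a replacement endpoint from the parameters an old `/marker` call used. The function name `replacement_endpoint` and the precedence order when several parameters are present are assumptions for illustration and are not part of the SDK; only the endpoint paths come from the table.

```python
# Hypothetical helper: maps legacy /marker parameters to the replacement
# endpoint from the migration table. The precedence order is an assumption.
def replacement_endpoint(params: dict) -> str:
    if params.get("pipeline_id"):
        return "/api/v1/custom-processor"
    if params.get("segmentation_schema"):
        return "/api/v1/segment"
    if params.get("page_schema"):
        return "/api/v1/extract"
    # `extras` is treated here as a comma-separated string of flags.
    extras = params.get("extras") or ""
    if "track_changes" in [flag.strip() for flag in extras.split(",")]:
        return "/api/v1/track-changes"
    # Plain conversion is the fallback when no special parameter is set.
    return "/api/v1/convert"
```

A helper like this can be useful when auditing an existing codebase for `/marker` calls before changing each call site.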
### Document Conversion

```python Python SDK theme={null}
# No changes needed: convert() works the same
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()
result = client.convert("document.pdf")
print(result.markdown)
```

```bash cURL (before) theme={null}
# Old
curl -X POST https://www.datalab.to/api/v1/marker \
  -H "X-API-Key: $DATALAB_API_KEY" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
```

```bash cURL (after) theme={null}
# New
curl -X POST https://www.datalab.to/api/v1/convert \
  -H "X-API-Key: $DATALAB_API_KEY" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
```

### Structured Extraction

```python Python SDK (before) theme={null}
# Old: page_schema on ConvertOptions
from datalab_sdk import DatalabClient, ConvertOptions

options = ConvertOptions(page_schema=schema)
result = client.convert("invoice.pdf", options=options)
```

```python Python SDK (after) theme={null}
# New: Dedicated extract() method with ExtractOptions
from datalab_sdk import DatalabClient, ExtractOptions
import json

client = DatalabClient()
options = ExtractOptions(
    page_schema=json.dumps(schema)
)
result = client.extract("invoice.pdf", options=options)
extracted = json.loads(result.extraction_schema_json)
```

### Document Segmentation

```python Python SDK (before) theme={null}
# Old: segmentation_schema on ConvertOptions
options = ConvertOptions(segmentation_schema=json.dumps(schema))
result = client.convert("document.pdf", options=options)
```

```python Python SDK (after) theme={null}
# New: Dedicated segment() method with SegmentOptions
from datalab_sdk import DatalabClient, SegmentOptions
import json

client = DatalabClient()
options = SegmentOptions(
    segmentation_schema=json.dumps(schema)
)
result = client.segment("document.pdf", options=options)
segments = result.segmentation_results
```

### Track Changes

```python Python SDK (before) theme={null}
# Old: extras parameter on ConvertOptions
options = ConvertOptions(extras="track_changes", output_format="html")
result = client.convert("contract.docx", options=options)
```

```python Python SDK (after) theme={null}
# New: Dedicated track_changes() method
from datalab_sdk import DatalabClient, TrackChangesOptions

client = DatalabClient()
options = TrackChangesOptions(output_format="markdown,html,chunks")
result = client.track_changes("contract.docx", options=options)
```

### Checkpoint reuse

The new endpoints support a checkpoint system to avoid re-parsing documents. Convert once, then extract or segment multiple times:

```python theme={null}
from datalab_sdk import DatalabClient, ConvertOptions, ExtractOptions, SegmentOptions
import json

client = DatalabClient()

# Step 1: Convert and save checkpoint
options = ConvertOptions(save_checkpoint=True)
result = client.convert("document.pdf", options=options)
checkpoint_id = result.checkpoint_id

# Step 2: Extract using checkpoint (no re-parsing)
extract_opts = ExtractOptions(
    checkpoint_id=checkpoint_id,
    page_schema=json.dumps({"invoice_number": {"type": "string"}})
)
extracted = client.extract(options=extract_opts)

# Step 3: Segment using same checkpoint
segment_opts = SegmentOptions(
    checkpoint_id=checkpoint_id,
    segmentation_schema=json.dumps({"sections": ["Header", "Body", "Footer"]})
)
segmented = client.segment(options=segment_opts)
```

## Workflows → Pipelines

The Workflows API (`/api/v1/workflows`) is deprecated. Use [Pipelines](/docs/recipes/pipelines/pipeline-overview) for all new integrations and migrate existing workflows.

Pipelines replace Workflows with a simpler API, per-step status tracking, versioning, and a visual editor in Forge.

| Workflows                                       | Pipelines                                                                            |
| ----------------------------------------------- | ------------------------------------------------------------------------------------ |
| `POST /api/v1/workflows/workflows`              | `POST /api/v1/pipelines` (via SDK: `client.create_pipeline()`)                       |
| `POST /api/v1/workflows/workflows/{id}/execute` | `POST /api/v1/pipelines/{id}/run` (via SDK: `client.run_pipeline()`)                 |
| `GET /api/v1/workflows/executions/{id}`         | `GET /api/v1/pipelines/executions/{id}` (via SDK: `client.get_pipeline_execution()`) |

See [Pipelines](/docs/recipes/pipelines/pipeline-overview) for a full walkthrough.

## Custom Pipeline → Custom Processor

`POST /api/v1/custom-pipeline` is deprecated (sunset: September 30, 2026). Migrate to `POST /api/v1/custom-processor`. The management routes `/api/v1/custom_pipelines/*` are also deprecated; use `/api/v1/custom_processors/*` instead.

```python Python SDK (before) theme={null}
from datalab_sdk import DatalabClient, CustomProcessorOptions

client = DatalabClient()
options = CustomProcessorOptions(pipeline_id="cp_XXXXX")
result = client.run_custom_pipeline("document.pdf", options=options)
```

```python Python SDK (after) theme={null}
from datalab_sdk import DatalabClient, CustomProcessorOptions

client = DatalabClient()
options = CustomProcessorOptions(pipeline_id="cp_XXXXX")
result = client.run_custom_processor("document.pdf", options=options)
```

```bash cURL (before) theme={null}
curl -X POST https://www.datalab.to/api/v1/custom-pipeline \
  -H "X-API-Key: $DATALAB_API_KEY" \
  -F "file=@document.pdf" \
  -F "pipeline_id=cp_XXXXX"
```

```bash cURL (after) theme={null}
curl -X POST https://www.datalab.to/api/v1/custom-processor \
  -H "X-API-Key: $DATALAB_API_KEY" \
  -F "file=@document.pdf" \
  -F "pipeline_id=cp_XXXXX"
```

The response format is identical. `CustomPipelineOptions` remains as a backward-compatible alias for `CustomProcessorOptions`.

## Table Recognition → Document Conversion

The standalone Table Recognition endpoint (`/api/v1/table_rec`) is deprecated. Use the Document Conversion endpoint with JSON output instead.

### Before (deprecated)

```python theme={null}
# Old: Dedicated table recognition endpoint
import os
import requests

API_KEY = os.environ["DATALAB_API_KEY"]

with open("doc.pdf", "rb") as f:
    response = requests.post(
        "https://www.datalab.to/api/v1/table_rec",
        files={"file": ("doc.pdf", f, "application/pdf")},
        headers={"X-API-Key": API_KEY},
    )
```

### After (current)

```python Python SDK theme={null}
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()
options = ConvertOptions(
    output_format="json",
    mode="balanced"
)
result = client.convert("document.pdf", options=options)

# Tables are in the JSON output with block_type "Table"
for block in result.json.get("children", []):
    if block.get("block_type") == "Table":
        print(f"Table: {block['id']}")
        print(f"Bounding box: {block['bbox']}")
        # Access cells in block['children']
```

```bash cURL theme={null}
curl -X POST https://www.datalab.to/api/v1/convert \
  -H "X-API-Key: $DATALAB_API_KEY" \
  -F "file=@document.pdf" \
  -F "output_format=json" \
  -F "mode=balanced"
```

## OCR → Document Conversion

The standalone OCR endpoint (`/api/v1/ocr`) is deprecated. Use the Document Conversion endpoint instead, which includes OCR as part of its processing pipeline.

### Before (deprecated)

```python theme={null}
# Old: Dedicated OCR endpoint
import os
import requests

API_KEY = os.environ["DATALAB_API_KEY"]

with open("doc.pdf", "rb") as f:
    response = requests.post(
        "https://www.datalab.to/api/v1/ocr",
        files={"file": ("doc.pdf", f, "application/pdf")},
        headers={"X-API-Key": API_KEY},
    )
```

### After (current)

```python Python SDK theme={null}
from datalab_sdk import DatalabClient, ConvertOptions

client = DatalabClient()

# For text extraction, use markdown output
result = client.convert("document.pdf")
print(result.markdown)

# For page-level text, use JSON output
options = ConvertOptions(output_format="json")
result = client.convert("document.pdf", options=options)
```

```bash cURL theme={null}
curl -X POST https://www.datalab.to/api/v1/convert \
  -H "X-API-Key: $DATALAB_API_KEY" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
```

## Next Steps

* Full guide to the current conversion API
* Extract structured data using JSON schemas
* See all API changes and deprecations
* Use the SDK for the simplest migration path

# Security Best Practices

Source: https://documentation.datalab.to/platform/security

Keep your Datalab integration secure with these best practices.

Follow these practices to keep your Datalab integration secure in production.

## API Key Management

### Store keys in environment variables

Never hardcode API keys in source code. Use environment variables:

```bash theme={null}
export DATALAB_API_KEY="your-api-key"
```

```python theme={null}
# The SDK reads DATALAB_API_KEY automatically
from datalab_sdk import DatalabClient

client = DatalabClient()  # Uses env var
```

Never commit API keys to version control. Add `.env` files to your `.gitignore`.

### Use per-key spend limits

Create separate API keys for different environments and set spend limits on each:

* **Development key**: low spend limit for testing
* **Staging key**: moderate limit for integration testing
* **Production key**: appropriate limit for your expected usage

Manage keys at [datalab.to/app/keys](https://www.datalab.to/app/keys).
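
One way to wire up per-environment keys is to resolve the key from an environment-specific variable at startup. This is a minimal sketch, assuming hypothetical variable names (`DATALAB_API_KEY_DEV`, `DATALAB_API_KEY_STAGING`, `DATALAB_API_KEY_PROD`); Datalab does not define these names, and the helper `resolve_api_key` is illustrative only.

```python
import os

# Hypothetical variable names; adjust to your own deployment conventions.
KEY_VARS = {
    "development": "DATALAB_API_KEY_DEV",
    "staging": "DATALAB_API_KEY_STAGING",
    "production": "DATALAB_API_KEY_PROD",
}


def resolve_api_key(environment: str, env=None) -> str:
    """Return the API key for the given environment, failing loudly if unset."""
    env = os.environ if env is None else env
    if environment not in KEY_VARS:
        raise ValueError(f"Unknown environment: {environment!r}")
    var_name = KEY_VARS[environment]
    key = env.get(var_name)
    if not key:
        # Refuse to fall back to another environment's key.
        raise RuntimeError(f"{var_name} is not set")
    return key
```

The resolved key can then be passed to the client through its `api_key` argument, so a development process never picks up the production key by accident.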

### Rotate keys regularly

If you suspect a key has been compromised:

1. Create a new API key at [datalab.to/app/keys](https://www.datalab.to/app/keys)
2. Update your application to use the new key
3. Revoke the old key

Create the new key before revoking the old one to avoid downtime.

## Webhook Security

### Always use HTTPS

Configure your webhook endpoint to use HTTPS. Webhook payloads contain request data that should be encrypted in transit.

### Verify webhook signatures

Always verify the webhook signature before processing the payload:

```python theme={null}
import hashlib
import hmac

from fastapi import FastAPI, Request, HTTPException

app = FastAPI()
WEBHOOK_SECRET = "your-webhook-secret"

@app.post("/webhook")
async def handle_webhook(request: Request):
    body = await request.body()
    # Default to an empty string so a missing header fails verification
    # instead of raising a TypeError in compare_digest.
    signature = request.headers.get("X-Webhook-Signature", "")

    expected = hmac.new(
        WEBHOOK_SECRET.encode(),
        body,
        hashlib.sha256
    ).hexdigest()

    if not hmac.compare_digest(signature, expected):
        raise HTTPException(status_code=401, detail="Invalid signature")

    # Process the webhook payload
    payload = await request.json()
    return {"status": "ok"}
```

### Handle duplicate events

Webhook deliveries may be retried on 5xx errors or timeouts. Use the `request_id` field to deduplicate:

```python theme={null}
processed_ids = set()  # Use a database in production

@app.post("/webhook")
async def handle_webhook(request: Request):
    payload = await request.json()
    request_id = payload["request_id"]

    if request_id in processed_ids:
        return {"status": "already processed"}

    processed_ids.add(request_id)
    # Process the payload
```

Do not log webhook secrets or full webhook payloads containing sensitive document data.

## Data Handling

### Results expiration

Conversion results are automatically deleted from Datalab servers **one hour** after processing completes. Retrieve and store results in your own infrastructure promptly.
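
Because results expire after an hour, it can help to write each result to your own storage as soon as it is retrieved. This is a minimal sketch using only the standard library; the function name `persist_result`, the one-file-per-request layout, and the assumption that `result` is a JSON-serializable dict are all illustrative choices, not SDK behavior.

```python
import json
import os
import tempfile
from pathlib import Path


def persist_result(request_id: str, result: dict, directory: str) -> Path:
    """Atomically write a retrieved result to <directory>/<request_id>.json."""
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{request_id}.json"

    # Write to a temporary file first, then rename it into place, so a crash
    # never leaves a half-written result behind.
    fd, tmp_name = tempfile.mkstemp(dir=str(target_dir), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(result, tmp)
    os.replace(tmp_name, target)
    return target
```

In production the same idea applies to object storage or a database; the point is that the copy you rely on lives outside the one-hour window.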

### Data retention consent

You can control whether your documents are used to improve Datalab's models. This is an opt-in setting configurable in your team settings. Teams that opt in receive discounted rates.

### Minimize data exposure

* Only send documents that need to be processed; avoid sending unnecessary files
* Use `page_range` to process only the pages you need rather than entire documents
* Download and delete results as soon as they're available

## Network Security

### For on-premises deployments

* Place the Datalab container behind a reverse proxy with TLS termination
* Restrict network access to the container's port (8000) to trusted clients only
* The on-premises container does not require API key authentication by default; implement authentication at the network or reverse proxy level
* See [On-Premises Overview](/docs/on-prem/overview) for deployment details

### IP restrictions

For additional security, consider restricting API access to known IP addresses using your infrastructure's firewall or WAF rules.

## Next Steps

* Configure and verify webhook signatures
* Understand rate limits and quotas
* Manage spend limits and usage
* Self-hosted deployment security

# Troubleshooting

Source: https://documentation.datalab.to/platform/troubleshooting

Common issues and solutions when using the Datalab API.

This page covers the most common issues you may encounter when using the Datalab API, organized by the error messages you'll see.

## Authentication Errors

### "Invalid API key provided"

```json theme={null}
{"detail": "Invalid API key provided. Set the X-API-Key header to your API key."}
```

**Status:** 401 Unauthorized

**Cause:** The `X-API-Key` header is missing or contains an invalid key.

**Solution:**

1. Check that you're passing the header: `X-API-Key: YOUR_KEY`
2. Verify your key at [datalab.to/app/keys](https://www.datalab.to/app/keys)
3. If using the SDK, set the `DATALAB_API_KEY` environment variable or pass `api_key` to the client
4. If the key was recently created, wait a few seconds and retry

### "You need an active, paid subscription"

```json theme={null}
{"detail": "You need an active, paid subscription to use this API."}
```

**Status:** 403 Forbidden

**Cause:** Your team does not have an active subscription or has exhausted free credits.

**Solution:**

1. Sign up for a plan at [datalab.to/pricing](https://www.datalab.to/pricing)
2. If you recently signed up, check that payment was processed successfully
3. Contact [support@datalab.to](mailto:support@datalab.to) if you believe this is in error

### "Your subscription has expired"

```json theme={null}
{"detail": "Your subscription has expired. You may need to re-enable your plan, or pay an unpaid invoice."}
```

**Status:** 403 Forbidden

**Solution:** Check your billing dashboard for unpaid invoices and update your payment method if needed.

### "Your payment has failed"

```json theme={null}
{"detail": "Your payment has failed. Please pay any unpaid invoices to continue using the API."}
```

**Status:** 403 Forbidden

**Solution:** Update your payment method and pay any outstanding invoices at [datalab.to/app/billing](https://www.datalab.to/app/billing).

***

## Rate Limiting

### "Rate limit exceeded"

```json theme={null}
{"detail": "Rate limit exceeded for endpoint /api/v1/convert. You can make 200 requests every 60 seconds. Please try again later, or reach out to support@datalab.to if you need a higher limit."}
```

**Status:** 429 Too Many Requests

**Cause:** You've exceeded the request rate limit for your plan.

**Solution:**

* Wait and retry with exponential backoff (the SDK does this automatically)
* Reduce request frequency or spread requests over time
* For higher limits, contact [support@datalab.to](mailto:support@datalab.to)

```python theme={null}
# The SDK handles retries automatically with exponential backoff.
# For the REST API, implement retry logic:
import time

import requests

def request_with_retry(url, headers, max_retries=5):
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers)
        if response.status_code == 429:
            # Back off 5s, 10s, 20s, ... capped at 120s
            wait = min(2 ** attempt * 5, 120)
            time.sleep(wait)
            continue
        return response
    raise Exception("Max retries exceeded")
```

### "Concurrency exceeded"

```json theme={null}
{"detail": "Concurrency exceeded for endpoint /api/v1/convert. You can have 400 concurrent requests running at once. Please try again later, or reach out to support@datalab.to if you need a higher limit."}
```

**Status:** 429 Too Many Requests

**Cause:** Too many requests are being processed simultaneously.

**Solution:** Queue your requests and limit the number of concurrent submissions. See [Batch Processing](/docs/recipes/conversion/batch-documents) for patterns.

### "Page rate limit exceeded"

```json theme={null}
{"success": false, "error": "Page rate limit exceeded. Your team has {current_pages} pages in flight and this request adds {page_count} more ({total} total, limit: 5,000). Please wait for some requests to complete before submitting more, or contact support@datalab.to for a higher limit."}
```

**Status:** Not an HTTP error; returned in the result payload with `success: false`

**Cause:** Your team has too many pages being processed concurrently across all requests. The default limit is 5,000 concurrent pages.

**Solution:**

* Wait for in-flight requests to complete before submitting more documents
* If you're polling for results, back off when you see this error and retry after some results return
* If you're using webhooks, wait for completion notifications before submitting more
* For a higher page limit, contact [support@datalab.to](mailto:support@datalab.to)

This limit is **not** enforced at submission time. Your request will be accepted, but the result will come back with `success: false`. Always check the `success` field when retrieving results.
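
Since this error arrives inside the result payload rather than as an HTTP status, the client has to inspect `success` and `error` itself. This is a minimal sketch, assuming a completed result has already been parsed into a dict with the `success` and `error` fields shown above; the function name `classify_result` and the prefix match on the error text are illustrative assumptions, not SDK behavior.

```python
def classify_result(result: dict) -> str:
    """Return "ok", "page_rate_limited", or "failed" for a completed result."""
    if result.get("success", False):
        return "ok"
    error = result.get("error") or ""
    # Assumption: the message keeps the "Page rate limit exceeded" prefix
    # shown in the example payload above.
    if error.startswith("Page rate limit exceeded"):
        return "page_rate_limited"
    return "failed"
```

A caller can treat `"page_rate_limited"` as a signal to pause submissions until some in-flight requests finish, and treat `"failed"` as a non-retryable processing error.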

***

## File Errors

### "Invalid file type"

```json theme={null}
{"detail": "Invalid file type. Only PDF files, word documents, spreadsheets, powerpoints, HTML, and PNG, JPG, GIF, TIFF, and WEBP images are accepted."}
```

**Status:** 400 Bad Request

**Cause:** The uploaded file's content type is not supported.

**Solution:**

1. Check [Supported File Types](/docs/common/supportedfiletypes) for the full list
2. Ensure the file extension matches the actual content type
3. If uploading via cURL, the content type may be auto-detected from the extension

### "File size exceeds upload limit"

```json theme={null}
{"detail": "File size exceeds upload limit of 209715200 bytes."}
```

**Status:** 400 Bad Request

**Cause:** The file exceeds the 200 MB size limit.

**Solution:**

* Split large PDFs into smaller files using page ranges
* Use the `page_range` parameter to process specific pages
* See [API Limits](/docs/common/limits) for current limits

### "File too large"

```json theme={null}
{"detail": "File too large. Maximum size: 200MB"}
```

**Status:** 413 Payload Too Large

**Solution:** Same as above: reduce file size or use page ranges.

***

## Request Errors

### "Request not found"

```json theme={null}
{"detail": "Request not found."}
```

**Status:** 404 Not Found

**Cause:** The request ID doesn't exist or has expired.

**Solution:**

* Results are deleted one hour after processing completes; retrieve them promptly
* Verify the request ID is correct
* Submit a new request if the results have expired

### "This resource has expired"

```json theme={null}
{"detail": "This resource has expired."}
```

**Status:** 404 Not Found

**Cause:** The conversion results have been cleaned up (1 hour after completion).

**Solution:** Submit a new conversion request. Consider using [webhooks](/platform/webhooks) to be notified immediately when results are ready.
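
The two 404 cases above call for slightly different follow-ups: an expired resource can only be resubmitted, while an unknown request ID is worth double-checking first. This is a minimal sketch of that decision, assuming you have the HTTP status code and the `detail` string from a failed retrieval; `next_action` and its return values are hypothetical names chosen for illustration.

```python
def next_action(status_code: int, detail: str) -> str:
    """Suggest a follow-up for a failed result retrieval."""
    if status_code != 404:
        return "not_a_missing_result"
    # Assumption: expired results carry the word "expired" in `detail`,
    # as in the "This resource has expired." message above.
    if "expired" in detail.lower():
        return "resubmit"
    # "Request not found." may mean a wrong ID rather than an expired result.
    return "verify_request_id_then_resubmit"
```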

### "This request was not made by you"

```json theme={null}
{"detail": "This request was not made by you."}
```

**Status:** 403 Forbidden

**Cause:** You're trying to retrieve results for a request made by a different team.

**Solution:** Ensure you're using the same API key that submitted the original request.

***

## Webhook Issues

### Webhook not firing

**Possible causes:**

1. Webhook URL is not configured; set it in the [dashboard](https://www.datalab.to/app/settings) or per-request via `webhook_url`
2. Your server is not reachable from Datalab's servers
3. Your server is returning 4xx errors (webhooks are not retried for client errors)

**Debugging steps:**

1. Check that your webhook URL is accessible from the internet
2. Verify HTTPS is properly configured (self-signed certificates may cause issues)
3. Check your server logs for incoming requests
4. Use a tool like [webhook.site](https://webhook.site) to test webhook delivery

### Webhook signature verification failing

**Possible causes:**

1. Using the wrong webhook secret
2. Request body is being modified by middleware before verification
3. Encoding issues (verify UTF-8 encoding)

**Solution:** See [Webhook Verification](/platform/webhooks#verifying-webhook-signatures) for the correct verification implementation.

***

## Processing Issues

### Conversion returns empty or poor results

**Possible causes:**

1. Scanned PDF with no OCR layer: use `mode: "accurate"` for better OCR
2. Very complex layout: try `mode: "accurate"`
3. File is corrupted or password-protected

**Solution:**

* Try a different processing mode (fast → balanced → accurate)
* Check the `parse_quality_score` in the response (0-5 scale) to assess output quality
* For scanned documents, `accurate` mode provides the best OCR

### Conversion is slow

**Possible causes:**

1. Using `accurate` mode on large documents
2. Large file with many pages

**Solution:**

* Use `fast` or `balanced` mode for lower latency
* Use `page_range` to process only the pages you need
* Use `max_pages` to limit processing

***

## Server Errors

### "Database error" / "Redis error"

```json theme={null}
{"detail": "Database error"}
```

**Status:** 500 Internal Server Error

**Cause:** Temporary infrastructure issue on Datalab's side.

**Solution:** Wait a moment and retry. If the issue persists, contact [support@datalab.to](mailto:support@datalab.to).

### 529 Service Overloaded

**Cause:** Datalab's servers are temporarily overloaded.

**Solution:** Wait and retry with exponential backoff. The SDK handles this automatically.

***

## Next Steps

* Complete HTTP error code reference
* Rate limits and file size limits
* Set up webhook notifications
* SDK with built-in error handling and retries

# Version Policies

Source: https://documentation.datalab.to/platform/versioning

Datalab is designed to be enterprise-ready. This means that we will not introduce breaking changes for top-level versions and provide a clear upgrade path for version changes.

## API Policy

For any given API version, we will preserve:

* Existing input parameters
* Existing output parameters

However, we may do the following:

* Add additional optional inputs
* Add additional values to the output
* Change conditions for specific error types
* Add new variants to enum-like output values (for example, streaming event types)

## SDK Policy

The SDK is built on top of the API, so it will follow the same versioning principles.

## Model Output

We frequently update our models to improve accuracy and performance. This can introduce subtle changes in outputs. At this time, we do not support version pinning outside of enterprise plans. Contact us at [support@datalab.to](mailto:support@datalab.to) for information.

## Next Steps

* See the latest updates and changes to the Datalab platform.
* Full Python SDK documentation with typed clients and async support.
* REST API reference for document conversion, form filling, and file management.
* Understand HTTP error codes and subscription errors.

# Webhooks

Source: https://documentation.datalab.to/platform/webhooks

Webhooks provide real-time notifications when your document processing jobs complete, eliminating the need for continuous polling. This event-driven approach improves efficiency and reduces unnecessary API calls.

## Setting Up Webhooks

1. Navigate to the Settings Panel
2. Locate the "Webhooks" section
3. Enter your webhook endpoint URL with an optional secret

We currently only support a single webhook per account.

**Webhook reliability notes:**

* Webhooks are retried on 5xx errors and timeouts, but **not** on 4xx errors
* Always implement idempotent webhook handlers using `request_id` to deduplicate
* Set a reasonable server timeout: Datalab waits up to 30 seconds for your endpoint to respond

### Per-Request Webhook Override

You can override the default webhook URL for specific API requests by including the `webhook_url` parameter:

```python theme={null}
import requests

url = "https://www.datalab.to/api/v1/convert"
headers = {"X-API-Key": "YOUR_API_KEY"}

# Open the file in a context manager so the handle is closed after the upload
with open('document.pdf', 'rb') as f:
    form_data = {
        'file': ('document.pdf', f, 'application/pdf'),
        'output_format': (None, 'markdown'),
        'webhook_url': (None, 'https://your-custom-webhook.com/endpoint')
    }
    response = requests.post(url, files=form_data, headers=headers)
```

This is useful when:

* Different projects need different webhook endpoints
* You want to route notifications to specific services
* Testing webhook integrations without changing account settings

The per-request webhook URL will be used instead of your account's default webhook URL for that specific request only.

## Webhook Payload

When a webhook is triggered, Datalab sends a POST request to your configured endpoint with a JSON payload containing the following fields:

```json theme={null}
{
  "request_id": "abc123",
  "request_check_url": "https://api.datalab.to/api/v1/convert/abc123",
  "webhook_secret": "your_configured_secret"
}
```

| Field               | Description                                                |
| ------------------- | ---------------------------------------------------------- |
| `request_id`        | The unique identifier for the processing request           |
| `request_check_url` | URL to retrieve the full results of the processed document |
| `webhook_secret`    | Your configured webhook secret (if set)                    |

## Webhook Secret Verification

The webhook secret is included in the JSON request body, allowing you to verify that incoming webhooks are authentic requests from Datalab.

### Verifying Webhooks on Your Server

Here's an example of how to verify the webhook secret in your receiving endpoint:

```python theme={null}
from fastapi import FastAPI, Request, HTTPException
import os

app = FastAPI()

@app.post("/my-webhook")
async def receive_webhook(request: Request):
    data = await request.json()

    # Verify the webhook secret
    expected_secret = os.environ["DATALAB_WEBHOOK_SECRET"]
    received_secret = data.get("webhook_secret")
    if received_secret != expected_secret:
        raise HTTPException(status_code=401, detail="Invalid webhook secret")

    # Process the webhook
    request_id = data["request_id"]
    check_url = data["request_check_url"]
    # Fetch the full results using check_url...
```

The webhook secret is transmitted in plaintext within the request body. Ensure your webhook endpoint uses HTTPS to encrypt the data in transit. Avoid logging the full request body in production to prevent secret exposure.

## Troubleshooting

**Not Receiving Events**

If your webhook is not receiving events, try the following:

* Verify that the URL is publicly accessible
* Validate that the webhook secret matches
* Check your server logs for 4xx errors (authentication, invalid endpoint, etc.)
* Ensure your endpoint responds within 30 seconds

**Duplicate Events**

We may send duplicate responses to a webhook endpoint. To handle this, implement idempotency checks so that each event is processed only once.

## Coming Soon

* Project-specific webhooks