# Saved Schemas

> Create and manage reusable extraction schemas in the Datalab UI or through the API. Reference saved schemas by ID instead of sending the full schema with every request.

**Before you begin**, make sure you have:

1. A [Datalab account](https://www.datalab.to/auth/sign_up) with an [API key](https://www.datalab.to/app/keys) (new accounts include \$5 in free credits)
2. Python 3.10+ installed
3. The Datalab SDK: `pip install datalab-python-sdk`
4. Your `DATALAB_API_KEY` environment variable set

## Overview

Saved Schemas let you store extraction schemas in Datalab and reference them by a stable ID (`schema_id`) when calling `/api/v1/extract`. You save a schema once and reuse it, instead of sending the full JSON schema with every request.

Saved schemas also support **versioning** — you can update a schema while keeping a history of previous versions and pin extractions to a specific version using `schema_version`.

## Create a Schema

Create schemas via the SDK or the [Datalab UI](https://www.datalab.to/app/schemas). Each schema is assigned a `schema_id` (e.g. `sch_k8Hx9mP2nQ4v`) that you can reference in extraction requests.

<CodeGroup>
  ```python Python SDK theme={null}
  from datalab_sdk import DatalabClient

  client = DatalabClient()

  schema = client.create_extraction_schema(
      name="Invoice Schema",
      description="Extracts key fields from invoices",
      schema_json={
          "properties": {
              "invoice_number": {"type": "string", "description": "Invoice ID"},
              "total_amount": {"type": "number", "description": "Total amount due"},
              "vendor_name": {"type": "string", "description": "Vendor or supplier name"},
              "due_date": {"type": "string", "description": "Payment due date"},
          }
      },
  )
  print(schema.schema_id)  # e.g. sch_k8Hx9mP2nQ4v
  ```

  ```bash cURL theme={null}
  curl -X POST https://www.datalab.to/api/v1/extraction_schemas \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "name": "Invoice Schema",
      "description": "Extracts key fields from invoices",
      "schema_json": {
        "properties": {
          "invoice_number": {"type": "string", "description": "Invoice ID"},
          "total_amount": {"type": "number", "description": "Total amount due"},
          "vendor_name": {"type": "string", "description": "Vendor or supplier name"},
          "due_date": {"type": "string", "description": "Payment due date"}
        }
      }
    }'
  ```

  ```python Python (requests) theme={null}
  import os, requests

  resp = requests.post(
      "https://www.datalab.to/api/v1/extraction_schemas",
      headers={"X-API-Key": os.getenv("DATALAB_API_KEY")},
      json={
          "name": "Invoice Schema",
          "schema_json": {
              "properties": {
                  "invoice_number": {"type": "string"},
                  "total_amount": {"type": "number"},
              }
          },
      },
  )
  resp.raise_for_status()  # Surface HTTP errors before reading the body
  schema_id = resp.json()["schema_id"]
  print(schema_id)
  ```
</CodeGroup>
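
Checking the payload locally before sending a create request can catch simple mistakes early. The sketch below covers only the two constraints listed in the Schema Object table on this page (a `properties` key in `schema_json` and a name of at most 200 characters). The helper name `build_schema_payload` is hypothetical and not part of the Datalab SDK, and the API may enforce rules beyond these.

```python
from typing import Any, Optional

MAX_NAME_LENGTH = 200  # "max 200 chars" per the Schema Object table


def build_schema_payload(
    name: str,
    schema_json: dict[str, Any],
    description: Optional[str] = None,
) -> dict[str, Any]:
    """Build a JSON body for POST /api/v1/extraction_schemas (hypothetical helper)."""
    if len(name) > MAX_NAME_LENGTH:
        raise ValueError(f"name must be at most {MAX_NAME_LENGTH} characters")
    if not isinstance(schema_json.get("properties"), dict):
        raise ValueError("schema_json must contain a 'properties' object")

    payload: dict[str, Any] = {"name": name, "schema_json": schema_json}
    if description is not None:
        payload["description"] = description
    return payload
```

The returned dictionary can be passed as `json=` to `requests.post`, as in the Python (requests) tab above.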

## Extract Using a Saved Schema

Pass `schema_id` to `/api/v1/extract` instead of `page_schema`:

<CodeGroup>
  ```python Python SDK theme={null}
  from datalab_sdk import DatalabClient, ExtractOptions
  import json

  client = DatalabClient()

  result = client.extract(
      "invoice.pdf",
      options=ExtractOptions(
          schema_id="sch_k8Hx9mP2nQ4v",
          mode="balanced",
      ),
  )
  extracted = json.loads(result.extraction_schema_json)
  print(extracted)
  ```

  ```bash cURL theme={null}
  curl -X POST https://www.datalab.to/api/v1/extract \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -F "file=@invoice.pdf" \
    -F "schema_id=sch_k8Hx9mP2nQ4v" \
    -F "mode=balanced"

  # Poll request_check_url from response until status is "complete"
  ```

  ```python Python (requests) theme={null}
  import json, os, time, requests

  headers = {"X-API-Key": os.getenv("DATALAB_API_KEY")}

  # Submit the document together with the saved schema ID
  with open("invoice.pdf", "rb") as f:
      resp = requests.post(
          "https://www.datalab.to/api/v1/extract",
          files={"file": ("invoice.pdf", f, "application/pdf")},
          data={"schema_id": "sch_k8Hx9mP2nQ4v", "mode": "balanced"},
          headers=headers,
      )

  check_url = resp.json()["request_check_url"]

  # Poll until the extraction completes or fails
  while True:
      result = requests.get(check_url, headers=headers).json()
      if result["status"] == "complete":
          extracted = json.loads(result["extraction_schema_json"])
          print(extracted)
          break
      elif result["status"] == "failed":
          print(f"Error: {result.get('error')}")
          break
      time.sleep(2)
  ```
</CodeGroup>
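
The polling loop in the requests example runs until the job reports `complete` or `failed`. In a long-running service you may also want an upper time limit. The following is a minimal sketch of that idea: `poll_until_done`, its `fetch` callback, and the default timeout and interval are hypothetical choices, while the `complete` and `failed` status values come from the example above.

```python
import time
from typing import Any, Callable


def poll_until_done(
    fetch: Callable[[], dict[str, Any]],
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
) -> dict[str, Any]:
    """Call fetch() until status is "complete"; raise on "failed" or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch()
        status = result.get("status")
        if status == "complete":
            return result
        if status == "failed":
            raise RuntimeError(f"Extraction failed: {result.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"No result after {timeout_s} seconds")
        time.sleep(interval_s)
```

With `requests`, the callback could be `lambda: requests.get(check_url, headers=headers).json()`. Passing the fetch step in as a function keeps the loop easy to test without network access.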

<Warning>
  `page_schema` and `schema_id` are mutually exclusive — provide exactly one. If you pass both, the API returns a `400` error.
</Warning>
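
The same rule can be enforced on the client before a request is sent. This is a sketch only: `build_extract_fields` is a hypothetical helper, and sending an inline `page_schema` as a JSON-encoded string is an assumption not confirmed on this page. The check that `schema_version` requires `schema_id` follows the `/extract` parameters table below.

```python
import json
from typing import Any, Optional


def build_extract_fields(
    page_schema: Optional[dict[str, Any]] = None,
    schema_id: Optional[str] = None,
    schema_version: Optional[int] = None,
) -> dict[str, str]:
    """Return schema-related form fields for /api/v1/extract (hypothetical helper)."""
    # Exactly one of the two schema sources must be given.
    if (page_schema is None) == (schema_id is None):
        raise ValueError("Provide exactly one of page_schema or schema_id")
    if schema_version is not None and schema_id is None:
        raise ValueError("schema_version is only valid with schema_id")

    if schema_id is not None:
        fields = {"schema_id": schema_id}
        if schema_version is not None:
            fields["schema_version"] = str(schema_version)
        return fields

    # Assumption: an inline schema is sent as a JSON-encoded string.
    return {"page_schema": json.dumps(page_schema)}
```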

## Schema Versioning

When you update a schema, either in the [Datalab UI](https://www.datalab.to/app/schemas) or through the API with `create_new_version`, you can choose to create a new version. This saves the current state to version history and increments the version number.

### Pin to a specific version

Pass `schema_version` alongside `schema_id` to use a specific version:

```bash theme={null}
curl -X POST https://www.datalab.to/api/v1/extract \
  -H "X-API-Key: $DATALAB_API_KEY" \
  -F "file=@invoice.pdf" \
  -F "schema_id=sch_k8Hx9mP2nQ4v" \
  -F "schema_version=1"
```

Omitting `schema_version` always uses the latest version.

<Tip>
  We recommend always specifying `schema_version` alongside `schema_id`. This ensures your extractions produce consistent results even if the schema is updated later.
</Tip>
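
One way to follow this recommendation is to read the `version` field of a schema object once, store it, and send it with every extraction. The sketch below assumes a schema object shaped like the Schema Object table on this page; both function names are hypothetical.

```python
from typing import Any


def pinned_extract_fields(schema: dict[str, Any]) -> dict[str, str]:
    """Form fields that pin an extraction to the schema's current version."""
    return {
        "schema_id": schema["schema_id"],
        "schema_version": str(schema["version"]),
    }


def newer_version_available(pinned_version: int, schema: dict[str, Any]) -> bool:
    """True if the schema has moved past the version you pinned."""
    return schema["version"] > pinned_version
```

Checking `newer_version_available` periodically lets you decide when to move the pin forward, rather than picking up schema changes implicitly.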

## List Schemas

<CodeGroup>
  ```python Python SDK theme={null}
  from datalab_sdk import DatalabClient

  client = DatalabClient()
  result = client.list_extraction_schemas(limit=50, include_archived=False)
  for s in result["schemas"]:
      print(f"{s.schema_id}: {s.name} (v{s.version})")
  ```

  ```bash cURL theme={null}
  # List active schemas
  curl "https://www.datalab.to/api/v1/extraction_schemas" \
    -H "X-API-Key: $DATALAB_API_KEY"

  # Include archived schemas
  curl "https://www.datalab.to/api/v1/extraction_schemas?include_archived=true" \
    -H "X-API-Key: $DATALAB_API_KEY"
  ```

  ```python Python (requests) theme={null}
  import os, requests

  resp = requests.get(
      "https://www.datalab.to/api/v1/extraction_schemas",
      headers={"X-API-Key": os.getenv("DATALAB_API_KEY")},
  )
  for s in resp.json()["schemas"]:
      print(s["schema_id"], s["name"])
  ```
</CodeGroup>

The response includes `schemas` (array) and `total` (count). Schemas are ordered by creation date, newest first.
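
As an illustration of working with this response shape, the sketch below looks up a schema by name in a raw JSON listing. `find_schema_by_name` is a hypothetical helper, the sample data is made up (including the second ID), and this page does not say whether names are unique. Because results are ordered newest first, the first match is the most recently created schema with that name.

```python
from typing import Any, Optional


def find_schema_by_name(
    listing: dict[str, Any], name: str
) -> Optional[dict[str, Any]]:
    """Return the first schema in the listing with the given name, or None."""
    for schema in listing.get("schemas", []):
        if schema.get("name") == name:
            return schema
    return None


# Illustrative data shaped like the list response described above.
listing = {
    "schemas": [
        {"schema_id": "sch_k8Hx9mP2nQ4v", "name": "Invoice Schema", "version": 2},
        {"schema_id": "sch_exampleOnly01", "name": "Receipt Schema", "version": 1},
    ],
    "total": 2,
}

match = find_schema_by_name(listing, "Invoice Schema")
print(match["schema_id"] if match else "not found")
```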

## Get a Schema

<CodeGroup>
  ```python Python SDK theme={null}
  from datalab_sdk import DatalabClient
  client = DatalabClient()
  schema = client.get_extraction_schema("sch_k8Hx9mP2nQ4v")
  print(schema.name, schema.version)
  ```

  ```bash cURL theme={null}
  curl "https://www.datalab.to/api/v1/extraction_schemas/sch_k8Hx9mP2nQ4v" \
    -H "X-API-Key: $DATALAB_API_KEY"
  ```

  ```python Python (requests) theme={null}
  import os, requests

  resp = requests.get(
      "https://www.datalab.to/api/v1/extraction_schemas/sch_k8Hx9mP2nQ4v",
      headers={"X-API-Key": os.getenv("DATALAB_API_KEY")},
  )
  print(resp.json())
  ```
</CodeGroup>

## Update a Schema

Update a schema's fields by ID. Set `create_new_version` to true (`True` in Python, `true` in JSON) to save the current state to version history before the update is applied:

<CodeGroup>
  ```python Python SDK theme={null}
  from datalab_sdk import DatalabClient

  client = DatalabClient()

  # Update schema fields and create a new version
  schema = client.update_extraction_schema(
      "sch_k8Hx9mP2nQ4v",
      schema_json={
          "properties": {
              "invoice_number": {"type": "string"},
              "total_amount": {"type": "number"},
              "line_items": {"type": "array", "items": {"type": "string"}},  # New field
          }
      },
      create_new_version=True,
  )
  print(f"Now at v{schema.version}")
  ```

  ```bash cURL theme={null}
  curl -X PUT "https://www.datalab.to/api/v1/extraction_schemas/sch_k8Hx9mP2nQ4v" \
    -H "X-API-Key: $DATALAB_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "schema_json": {
        "properties": {
          "invoice_number": {"type": "string"},
          "total_amount": {"type": "number"},
          "line_items": {"type": "array", "items": {"type": "string"}}
        }
      },
      "create_new_version": true
    }'
  ```

  ```python Python (requests) theme={null}
  import os, requests

  resp = requests.put(
      "https://www.datalab.to/api/v1/extraction_schemas/sch_k8Hx9mP2nQ4v",
      headers={"X-API-Key": os.getenv("DATALAB_API_KEY")},
      json={
          "schema_json": {"properties": {"invoice_number": {"type": "string"}}},
          "create_new_version": True,
      },
  )
  print(resp.json()["version"])
  ```
</CodeGroup>
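
The examples above send a complete `schema_json` in each update. Assuming the update replaces the stored schema with whatever you send (this page does not state it explicitly), a safe pattern is to merge new fields into the current schema locally and send the merged result. The helper names below are hypothetical.

```python
import copy
from typing import Any


def add_fields(
    schema_json: dict[str, Any], new_properties: dict[str, Any]
) -> dict[str, Any]:
    """Return a copy of schema_json with new_properties merged into 'properties'."""
    updated = copy.deepcopy(schema_json)  # leave the caller's schema untouched
    updated.setdefault("properties", {}).update(new_properties)
    return updated


def build_update_body(
    schema_json: dict[str, Any], create_new_version: bool = True
) -> dict[str, Any]:
    """Build a JSON body for PUT /api/v1/extraction_schemas/{schema_id}."""
    return {"schema_json": schema_json, "create_new_version": create_new_version}


current = {"properties": {"invoice_number": {"type": "string"}}}
merged = add_fields(
    current, {"line_items": {"type": "array", "items": {"type": "string"}}}
)
print(sorted(merged["properties"]))
```

In practice, `current` would come from the `schema_json` field of a fetched schema, and the result of `build_update_body(merged)` would be sent as the request body.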

## Archive a Schema

Archiving soft-deletes a schema — it no longer appears in list results (unless `include_archived=true`) and cannot be used for new extractions:

<CodeGroup>
  ```python Python SDK theme={null}
  # `client` is a DatalabClient(), created as in the earlier examples
  client.delete_extraction_schema("sch_k8Hx9mP2nQ4v")
  ```

  ```bash cURL theme={null}
  curl -X DELETE "https://www.datalab.to/api/v1/extraction_schemas/sch_k8Hx9mP2nQ4v" \
    -H "X-API-Key: $DATALAB_API_KEY"
  ```

  ```python Python (requests) theme={null}
  import os, requests

  requests.delete(
      "https://www.datalab.to/api/v1/extraction_schemas/sch_k8Hx9mP2nQ4v",
      headers={"X-API-Key": os.getenv("DATALAB_API_KEY")},
  )
  ```
</CodeGroup>

## API Reference

### Schema Object

| Field             | Type         | Description                                             |
| ----------------- | ------------ | ------------------------------------------------------- |
| `schema_id`       | string       | Stable string ID (e.g. `sch_k8Hx9mP2nQ4v`)              |
| `name`            | string       | Human-readable name (max 200 chars)                     |
| `description`     | string\|null | Optional description                                    |
| `schema_json`     | object       | JSON schema with a `properties` key                     |
| `version`         | int          | Current version number (starts at 1)                    |
| `version_history` | array        | Previous versions saved with `create_new_version: true` |
| `archived`        | bool         | Whether the schema is archived                          |
| `created`         | datetime     | Creation timestamp                                      |
| `updated`         | datetime     | Last update timestamp                                   |
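
If you work with the raw JSON responses rather than the SDK, a small typed container can make these fields easier to handle. The sketch below mirrors the table above. `SavedSchema` is a hypothetical class, not an SDK type, and the timestamps are kept as raw strings because their exact format is not specified here.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Optional


@dataclass
class SavedSchema:
    """Hypothetical container mirroring the Schema Object table."""

    schema_id: str
    name: str
    schema_json: dict[str, Any]
    version: int = 1
    description: Optional[str] = None
    version_history: list[Any] = field(default_factory=list)
    archived: bool = False
    created: Optional[str] = None  # timestamp as returned; format not assumed
    updated: Optional[str] = None  # timestamp as returned; format not assumed

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "SavedSchema":
        """Build from a response dict, ignoring keys this class does not know."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})
```

For example, `SavedSchema.from_dict(resp.json())` would wrap the response from the Get a Schema request shown earlier.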

### `/extract` Parameters (schema-related)

| Parameter        | Type   | Description                                                      |
| ---------------- | ------ | ---------------------------------------------------------------- |
| `schema_id`      | string | ID of a saved schema. Mutually exclusive with `page_schema`.     |
| `schema_version` | int    | Version to use. Only valid with `schema_id`. Defaults to latest. |

## Next Steps

<CardGroup cols={2}>
  <Card title="Structured Extraction" icon="table" href="/docs/recipes/structured-extraction/api-overview">
    Full guide to extraction with inline schemas, checkpoints, and options.
  </Card>

  <Card title="Confidence Scoring" icon="chart-bar" href="/docs/recipes/structured-extraction/confidence-scoring">
    Score extraction results with per-field confidence ratings.
  </Card>

  <Card title="Forge Evals" icon="chart-bar" href="/docs/recipes/forge-evals/overview">
    Compare extraction results across configurations using saved schemas.
  </Card>

  <Card title="Handling Long Documents" icon="file-lines" href="/docs/recipes/structured-extraction/handling-long-documents">
    Strategies for extracting from 100+ page documents.
  </Card>
</CardGroup>
