Handling Long Documents

This post continues the Extraction API Overview, where we got extraction working with a simple document and a straightforward schema. What if we’ve got a multi-hundred-page PDF? This can happen when:

We just have legitimately long documents
Our PDFs are composed of multiple different types of documents, each needing their own extraction schema

We’re always looking at improving inference speeds and adding features to support this, but let’s go through a few tips as you integrate Structured Extraction into your document ingestion pipelines.

Restrict to Specific Page Range

If your extraction schema is typically constrained to a set of pages within your document and you know this upfront, use the page_range parameter in the API to ensure we only process the relevant pages. You’ll only be charged for those (even if your document is much longer). When you submit your marker request, set page_range to the right values. For example: 0,2-4 will process pages 0, 2, 3, and 4. Note that this overrides max_pages if you set that too, and that our page ranges are 0-indexed (so 0 is the first page).
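To sanity-check a page_range value before submitting it, a small local helper can expand the string into the pages it covers. This is a minimal sketch, assuming the format is exactly what is described above (comma-separated single pages and inclusive, 0-indexed ranges). The function name expand_page_range is hypothetical and not part of the API, whose own validation may differ.

```python
def expand_page_range(spec: str) -> list[int]:
    """Expand a spec such as "0,2-4" into the 0-indexed pages it covers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            if end < start:
                raise ValueError(f"Invalid range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


print(expand_page_range("0,2-4"))  # [0, 2, 3, 4]
```

The length of the returned list is also the number of pages the request would cover, which is handy for estimating cost up front.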

Segment and Chain Extractions

Let’s say you have a massive 100-page file with lots of different sections. If you don’t know what pages they’re on, but do know the specific extraction schema you’d use for each section, here’s one way to speed up inference and improve accuracy.

Table of Contents Segmentation

Submit the whole PDF, but set page_range to 0-6 (whichever range includes the entire Table of Contents). Run it with an Extraction schema that’s designed to pull out a table of contents.
Then, dynamically construct page_range values for each section from the extracted table of contents.
Submit separate requests to marker using each page_range and the corresponding extraction schema for the info you know is in them.
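The second step above could be sketched as follows. It assumes each ToC entry gives a section name and the printed page where the section starts, and that a section runs until the page before the next one begins. The function build_section_ranges and its page_offset parameter are hypothetical: page_offset converts printed page numbers to the 0-indexed pages that page_range expects, and the right value depends on your document's front matter.

```python
def build_section_ranges(toc_entries, total_pages, page_offset=0):
    """
    toc_entries: list of (section_name, printed_start_page) in document order.
    total_pages: number of pages in the PDF.
    page_offset: shift from printed page numbers to 0-indexed PDF pages,
                 e.g. -1 when printed page 1 is the first page of the PDF.
    Returns {section_name: page_range string}.
    """
    ranges = {}
    for i, (name, start) in enumerate(toc_entries):
        first = start + page_offset
        if i + 1 < len(toc_entries):
            # End on the page before the next section starts.
            last = toc_entries[i + 1][1] + page_offset - 1
        else:
            # The last section runs to the end of the document.
            last = total_pages - 1
        last = max(first, last)  # two sections may start on the same page
        ranges[name] = str(first) if first == last else f"{first}-{last}"
    return ranges


toc = [("Summary", 8), ("Financials", 12), ("Appendix", 40)]
print(build_section_ranges(toc, total_pages=100, page_offset=-1))
# {'Summary': '7-10', 'Financials': '11-38', 'Appendix': '39-99'}
```

If sections can end partway down the page where the next one starts, extending each range by one page so neighbours overlap may be the safer choice.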

You’ll only be charged for pages specified in page_range (even if you submit the entire document with each request). If your document doesn’t have a table of contents but you’d find segmentation valuable, send us a sample at support@datalab.to so we can give you a working example / strategy to handle it!

Full Code Sample

Here’s a complete example including marker submission, polling, and dynamic page range extraction.

import requests
import time
import json

API_URL = "https://www.datalab.to/api/v1/marker"
API_KEY = "YOUR_API_KEY"
HEADERS = {"X-Api-Key": API_KEY}

SAMPLE_TOC_EXTRACTION_SCHEMA = {
  "type": "object",
  "title": "ToCExtractionSchema",
  "description": "Schema to pull out table of contents",
  "properties": {
    "table_of_contents": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "section_name": {
            "type": "string",
            "description": "the name of the section from table of contents"
          },
          "page_range": {
            "type": "string",
            "description": "the page range or page number of the item from the table of contents"
          }
        }
      }
    }
  },
  "required": [
    "table_of_contents"
  ]
}

def run_marker_extraction(pdf_path, schema_json, page_range=None):
    """
    Submit a marker request with a schema (serialized as a JSON string) and an optional page range.
    Poll until complete, then return the parsed extraction result as a dict.
    """
    with open(pdf_path, "rb") as f:
        files = {
            'file': ('document.pdf', f, 'application/pdf'),
            'page_schema': (None, schema_json),
            'use_llm': (None, True)
        }
        if page_range:
            files['page_range'] = (None, page_range)

        # Submit request
        response = requests.post(API_URL, files=files, headers=HEADERS)
        response.raise_for_status()  # surface HTTP errors (bad key, bad request) early
        data = response.json()
        check_url = data["request_check_url"]

    # Poll until complete
    max_polls = 300
    for _ in range(max_polls):
        time.sleep(2)
        poll = requests.get(check_url, headers=HEADERS).json()

        if poll.get("status") == "failed":
            raise RuntimeError(f"Extraction failed: {poll.get('error')}")

        if poll.get("status") == "complete":
            return json.loads(poll.get('extraction_schema_json'))

    raise TimeoutError("Extraction job did not complete in time.")

def dynamic_page_range_extraction(pdf_path, toc_schema, schemas_by_section):
    """
    1. Extract TOC from first few pages.
    2. Parse TOC into section -> page_range mappings.
    3. Run marker again per section using its schema + page range.
    4. Merge results into a single dict.
    """
    # Step 1: Extract TOC
    toc_result = run_marker_extraction(pdf_path, schema_json=toc_schema, page_range="0-6")

    # Step 2: Parse TOC into usable mapping (customize parser as needed)
    section_page_ranges = parse_toc(toc_result)

    # Step 3: Extract per-section
    all_results = {}
    for section, page_range in section_page_ranges.items():
        schema_json = schemas_by_section.get(section)
        if schema_json:
            section_result = run_marker_extraction(pdf_path, schema_json=schema_json, page_range=page_range)
            all_results[section] = section_result

    return all_results

def parse_toc(toc_dict):
    """
    Example TOC parser: converts the TOC dict into {section: page_range}.
    In practice you'd implement parsing logic based on your schema design.
    """
    page_map = {}
    for item in toc_dict.get("table_of_contents", []):
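        # Keys must match the property names defined in the ToC extraction schema
        section = item.get("section_name")
        page = item.get("page_range")
        if section and page:
            page_map[section] = page  # used as-is; adjust here if ToC page numbers need converting to 0-indexed ranges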
    return page_map
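Before calling dynamic_page_range_extraction, the per-section schemas need to be prepared. Here is a sketch of what that might look like. The section names and schema fields below are hypothetical placeholders; the keys have to match the section names your table of contents extraction returns. Because each schema is sent as a form field, the dicts are serialized to JSON strings first.

```python
import json

# Hypothetical schemas keyed by section name (placeholders, not real API values).
section_schemas = {
    "Financial Statements": {
        "type": "object",
        "title": "FinancialStatementsSchema",
        "properties": {
            "total_revenue": {
                "type": "string",
                "description": "total revenue as printed in the document",
            },
        },
        "required": ["total_revenue"],
    },
    "Risk Factors": {
        "type": "object",
        "title": "RiskFactorsSchema",
        "properties": {
            "risks": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["risks"],
    },
}

# Serialize each schema dict to the JSON string the request expects.
schemas_by_section = {
    name: json.dumps(schema) for name, schema in section_schemas.items()
}

# With a real file and API key, the call would then look like this,
# where toc_schema_json is the ToC schema serialized the same way:
# results = dynamic_page_range_extraction("report.pdf", toc_schema_json, schemas_by_section)
```

Sections that appear in the table of contents but have no entry in schemas_by_section are simply skipped by the loop in dynamic_page_range_extraction.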

Try it out

Sign up for Datalab and try out Marker - it’s free, and we’ll include credits. If you need a self-hosted solution, you can directly purchase an on-prem license, no crazy sales process needed, or reach out for custom enterprise quotes / contracts. As always, write to us at support@datalab.to if you want credits or have any specific questions / requests!
