Skip to main content
In this tutorial we’ll learn how to take one PDF that’s composed of multiple separate documents and get the page boundaries for each one.

Context

Why on earth would this ever happen, you rightfully ask? Let’s see, it’s:
  • more convenient to attach one file into an email
  • a common workaround when batch scanning papers when archiving content
  • possible some people thrive in chaos
These ‘digitally stapled PDFs’, as we’ve come to call them, pose many problems when you’re doing downstream data processing. What if you need to do extraction and have a schema for each of the individual document types within? Running it on the entire PDF can create latency issues, and maybe you don’t want to spend extra time and money processing pages unnecessarily. Our Segmentation functionality lets you auto-magically get the page numbers you need for each type of document contained within your PDF, making it easier to post-process archival records efficiently, properly archive documents, and more.

Running a Segmentation

This set up is similar to our Marker example from earlier:
  • Upload your file for processing, along with a few settings (we’ll cover them below)
  • Poll to see if your request is done
  • voila!

Making an API Call

Segmentation uses our marker endpoint which is available at /api/v1/marker, and will automatically run if you pass in a segmentation_schema parameter. Let’s say we had a file of 5 pages that is actually composed of two different documents. Here is an example request in Python:

import requests

url = "https://www.datalab.to/api/v1/marker"

form_data = {
    'file': ('test.pdf', open('~/pdfs/test.pdf', 'rb'), 'application/pdf'),
    "force_ocr": (None, False),
    "paginate": (None, False),
    'output_format': (None, 'markdown'),
    "use_llm": (None, False),
    "strip_existing_ocr": (None, False),
    "disable_image_extraction": (None, False),
    "segmentation_schema": (None, '{"segments": []}')   # This is the magic
}

headers = {"X-Api-Key": "YOUR_API_KEY"}

response = requests.post(url, files=form_data, headers=headers)
data = response.json()
Your response will look something like this
{
	"success": true,
	"error": null,
	"request_id": "<your_request_id>",
	"request_check_url": "<https://www.datalab.to/api/v1/marker/<your_request_id>",
	"versions": null
}
You can then poll for completion by using request_check_url every few seconds.
import time

# Use request_check_url to poll for job completion

max_polls = 300
check_url = data["request_check_url"]

for i in range(max_polls):
    time.sleep(2)
    response = requests.get(check_url, headers=headers) # Don't forget to send the auth headers
    data = response.json()

    if data["status"] == "complete":
        print("Job complete!")
        segmentation_results = data["segmentation_results"]
        #
        # Do something with your results
        #
        break
Note that status will be processing until it’s done (at which point it changes to complete. When it’s done, your response will look something like this:
{
	"status": "complete",
	"segmentation_results": {
		"segments": [
			{
				"name": "Research Paper",
				"pages": [
					0,
					1,
                    2
				],
				"confidence": "medium"
			},
			{
				"name": "Document 2",
				"pages": [
					3,
					4
				],
				"confidence": "medium"
			}
		],
		"metadata": {
			"total_pages": 5,
			"segmentation_method": "auto_detected",
			"strategy": "AutoSegmentationStrategy"
		}
      ]
    },
	...
}
Note that your original document of 5 pages was automatically segmented into two documents with their corresponding pages. You can then use the page_range parameter in Marker to process each segment of the document as you wish.

Try it out

Sign up for Datalab and try out Marker - it’s free, and we’ll include credits. If you need a self-hosted solution, you can directly purchase an on-prem license, no crazy sales process needed, or reach out for custom enterprise quotes / contracts. As always, write to us at support@datalab.to if you want credits or have any specific questions / requests!