Parsing a Batch of Documents

Our earlier walkthrough showed you how to submit one file to Marker for PDF conversion. What if you had thousands of documents, locally or in a Dropbox / Google Drive directory? The core logic to process one document is going to be the same as before (submit and poll with your desired settings), but now:

We’ll modify the entrypoint to retrieve all our files
We’ll add multithreading to improve throughput on our processing
We’ll make sure we respect API rate limits

Let’s see how we can extend our submit_and_poll_pdf_conversion(...) function to support all this!

Reading multiple files

In the base case with one file, we would have just called

submit_and_poll_pdf_conversion(path_to_pdf)

And our function from earlier would have submitted that file for conversion and polled for completion. If we had a local directory with thousands of PDFs, we could do something like this to read them in and submit:

from pathlib import Path

def batch_convert_pdfs(
  document_directory: str
):
    doc_dir = Path(document_directory)
    if not doc_dir.exists():
      print("Couldn't find your directory, exiting early...")
      raise FileNotFoundError(f"Couldn't find {document_directory}")

    # Collect all PDF files
    docs_to_process = list(doc_dir.glob("*.pdf"))
    print(f"Found {len(docs_to_process)} PDFs to convert...")

    for doc in docs_to_process:
      #
      # Process each file in order
      #
      submit_and_poll_pdf_conversion(doc)
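
If your PDFs live in nested folders (as a synced Dropbox or Google Drive directory often does), glob("*.pdf") only sees the top level. Here is a minimal sketch, assuming you want every PDF under the directory regardless of how the extension is capitalized. collect_pdfs is a hypothetical helper name, not part of the earlier walkthrough:

```python
from pathlib import Path
from typing import List


def collect_pdfs(document_directory: str) -> List[Path]:
    """Return every PDF under document_directory, including subfolders."""
    doc_dir = Path(document_directory)
    if not doc_dir.is_dir():
        raise FileNotFoundError(f"Couldn't find {document_directory}")

    # rglob("*") walks subdirectories; comparing the lowercased suffix
    # also picks up files named like "REPORT.PDF".
    return sorted(
        path for path in doc_dir.rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

You could then loop over collect_pdfs(document_directory) instead of the glob result.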

This is still rather slow, since you’re submitting each document one at a time. Wouldn’t it be nice to process several at once?

Simultaneous Requests

Let’s modify batch_convert_pdfs(...) to submit multiple files at once. We can do this with multithreading in Python: we’ll add a max_workers parameter that controls how many documents are processed in parallel.

from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

def batch_convert_pdfs(
  document_directory: str,
  max_workers: int = 3
):
    doc_dir = Path(document_directory)
    if not doc_dir.exists():
      print("Couldn't find your directory, exiting early...")
      raise FileNotFoundError(f"Couldn't find {document_directory}")

    # Collect all PDF files
    docs_to_process = list(doc_dir.glob("*.pdf"))
    print(f"Found {len(docs_to_process)} PDFs to convert...")

    #
    # Process multiple files at once
    #
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_file = {
            executor.submit(submit_and_poll_pdf_conversion, pdf_path): pdf_path.name
            for pdf_path in docs_to_process
        }

        for future in as_completed(future_to_file):
            filename = future_to_file[future]
            try:
                future.result()
            except Exception as e:
                print(f"✗ Error processing {filename}: {e}")

Don’t set max_workers too high: we have API rate limits that you need to respect. By default, we cap requests at 200 per minute for your team. We can increase this for enterprise customers if needed; reach out to us at support@datalab.to.
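
To pick a value for max_workers, it helps to estimate how many requests your workers generate. Here is a rough sketch, assuming each worker polls once every 2 seconds (as in our polling loop) and ignoring request latency, so the result is an upper bound. estimate_requests_per_minute, max_safe_workers and the headroom parameter are hypothetical, introduced only for illustration:

```python
def estimate_requests_per_minute(max_workers: int, poll_interval_s: float = 2.0) -> float:
    """Upper bound on polling requests per minute across all workers.

    Assumes every worker is polling continuously; submissions add a
    little on top (one POST per document).
    """
    polls_per_worker = 60.0 / poll_interval_s
    return max_workers * polls_per_worker


def max_safe_workers(rate_limit_per_minute: int = 200,
                     poll_interval_s: float = 2.0,
                     headroom: float = 0.8) -> int:
    """Largest worker count that stays under `headroom` of the rate limit."""
    budget = rate_limit_per_minute * headroom
    polls_per_worker = 60.0 / poll_interval_s
    return max(1, int(budget // polls_per_worker))


print(estimate_requests_per_minute(3))   # 90.0
print(max_safe_workers())                # 5
```

With 3 workers and 2-second polling, that is at most about 90 requests per minute, comfortably under 200. Lengthening the poll interval leaves room for more workers.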

Handling Rate Limits

We need to modify our submit_and_poll_pdf_conversion(...) function from the previous example to respect rate limits and retry as needed. Let’s use urllib3's retry system with requests.adapters.HTTPAdapter to handle retrying different status codes.

import requests
from requests.adapters import HTTPAdapter, Retry

#
# Configure a session with retries, customize retry behavior
#   for your usage needs. Our default rate limit is 200 per minute
#   per account (not per API key).
#
session = requests.Session()
retries = Retry(
    total=20,
    backoff_factor=4,
    status_forcelist=[429],
    allowed_methods=["GET", "POST"],
    raise_on_status=False,
)
adapter = HTTPAdapter(max_retries=retries)
session.mount("http://", adapter)
session.mount("https://", adapter)
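
It’s worth knowing how long these settings can keep a request waiting. As we understand urllib3’s documented behaviour, the sleep before a retry is backoff_factor * 2 ** (consecutive failures - 1), with no sleep before the first retry and a default cap of 120 seconds; a Retry-After header on a 429 response takes precedence when present. The sketch below reimplements that formula in plain Python to preview the schedule. It does not call urllib3, so check it against your installed version, and note that backoff_schedule is a hypothetical helper:

```python
from typing import List


def backoff_schedule(total: int, backoff_factor: float, backoff_max: float = 120.0) -> List[float]:
    """Approximate sleep (seconds) before each of `total` retries.

    Assumption: mirrors urllib3's documented exponential backoff, where the
    first retry happens immediately and later sleeps are capped at backoff_max.
    """
    schedule = []
    for attempt in range(1, total + 1):
        if attempt == 1:
            schedule.append(0.0)
        else:
            schedule.append(float(min(backoff_max, backoff_factor * 2 ** (attempt - 1))))
    return schedule


sleeps = backoff_schedule(total=20, backoff_factor=4)
print(sleeps[:6])    # [0.0, 8.0, 16.0, 32.0, 64.0, 120.0]
print(sum(sleeps))   # 1920.0
```

Under these assumptions, a single request could spend about 32 minutes in backoff before giving up. Lower total or backoff_factor if you’d rather fail fast.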

And then modify submit_and_poll_pdf_conversion(...) to use session.post(...) and session.get(...) instead of requests.post(...) and requests.get(...).

def submit_and_poll_pdf_conversion(
  pdf_path: Path,
  output_format: Optional[str] = 'markdown',
  use_llm: Optional[bool] = True
):
  ...
  response = session.post(url, files=form_data, headers=headers)
  ...
  for i in range(max_polls):
    response = session.get(check_url, headers=headers) # Need to include headers for API key
  ...

That should do the trick. If you have multiple API keys, note that the limit of 200 requests per minute applies to your team / subscription, not to each individual API key. If you need a custom limit, reach out to us at support@datalab.to!
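
Retries handle 429s after the fact. If you’d rather avoid hitting the limit in the first place, one option is a small client-side throttle shared by all worker threads. This is a sketch of a sliding-window limiter, not something the Datalab API requires. The RateLimiter class and its acquire() method are hypothetical names, and the injectable clock and sleep arguments exist only to make it easy to test:

```python
import threading
import time
from collections import deque
from typing import Callable


class RateLimiter:
    """Allow at most `max_calls` calls in any `period` seconds, across threads."""

    def __init__(self, max_calls: int = 200, period: float = 60.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._calls = deque()  # timestamps of recent calls, oldest first

    def acquire(self) -> None:
        """Block until another call is allowed, then record it."""
        while True:
            with self._lock:
                now = self._clock()
                # Drop timestamps that have left the window.
                while self._calls and now - self._calls[0] >= self.period:
                    self._calls.popleft()
                if len(self._calls) < self.max_calls:
                    self._calls.append(now)
                    return
                # Time until the oldest call leaves the window.
                wait = self.period - (now - self._calls[0])
            # Sleep outside the lock so other threads are not blocked.
            self._sleep(wait)
```

You would create one limiter at module level, for example RateLimiter(max_calls=180), and call limiter.acquire() right before each session.post(...) and session.get(...).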

Full Code Sample

Okay, here’s a full code sample that extends our previous API call to handle batches of documents, with multithreading and retry handling.

import os
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests
from requests.adapters import HTTPAdapter, Retry
from pathlib import Path
from typing import Optional

API_URL = "https://www.datalab.to/api/v1/marker"
API_KEY = os.getenv("DATALAB_API_KEY")

#
# Configure a session with retries, customize retry behavior
#   for your usage needs. Our default rate limit is 200 per minute
#   per account (not per API key).
#
session = requests.Session()
retries = Retry(
    total=20,
    backoff_factor=4,
    status_forcelist=[429],
    allowed_methods=["GET", "POST"],
    raise_on_status=False,
)
adapter = HTTPAdapter(max_retries=retries)
session.mount("http://", adapter)
session.mount("https://", adapter)


def submit_and_poll_pdf_conversion(
  pdf_path: Path,
  output_format: Optional[str] = 'markdown',
  use_llm: Optional[bool] = True
):
  # Sent with every request, including the polling calls below
  headers = {"X-Api-Key": API_KEY}

  def submit_request():
    #
    # Submit initial request
    #
    with open(pdf_path, 'rb') as f:
      form_data = {
          'file': (pdf_path.name, f, 'application/pdf'),
          "force_ocr": (None, False),
          "paginate": (None, False),
          'output_format': (None, output_format),
          "use_llm": (None, use_llm),
          "strip_existing_ocr": (None, False),
          "disable_image_extraction": (None, False)
      }
      # Post while the file is still open; the session retries on 429
      return session.post(API_URL, headers=headers, files=form_data)

  response = submit_request()
  response.raise_for_status()
  data = response.json()

  #
  # Poll for completion
  #
  max_polls = 300
  check_url = data["request_check_url"]
  for i in range(max_polls):
    response = session.get(check_url, headers=headers) # Need to include headers for API key
    check_result = response.json()

    if check_result['status'] == 'complete':
      #
      # Your processing is finished, you can do your post-processing!
      #
      converted_document = check_result[output_format]  # the 'html', 'markdown', or 'json' field in the response will contain what you're looking for (maps to our initial `output_format`)
      #
      # .. do something with it!
      #
      return converted_document  # stop polling once the result is in hand
    elif check_result["status"] == "failed":
      # Raise so the batch loop can report which file failed
      raise RuntimeError(f"Failed to convert {pdf_path.name}")
    else:
      print("Waiting 2 more seconds to re-check conversion status")
      time.sleep(2)

  raise TimeoutError(f"Gave up on {pdf_path.name} after {max_polls} polls")


def batch_convert_pdfs(
  document_directory: str,
  max_workers: int = 3
):
    doc_dir = Path(document_directory)
    if not doc_dir.exists():
      print("Couldn't find your directory, exiting early...")
      raise FileNotFoundError(f"Couldn't find {document_directory}")

    # Collect all PDF files
    docs_to_process = list(doc_dir.glob("*.pdf"))
    print(f"Found {len(docs_to_process)} PDFs to convert...")

    #
    # Process multiple files at once, up to `max_workers`
    #
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_file = {
            executor.submit(submit_and_poll_pdf_conversion, pdf_path): pdf_path.name
            for pdf_path in docs_to_process
        }

        for future in as_completed(future_to_file):
            filename = future_to_file[future]
            try:
                future.result()
            except Exception as e:
                print(f"✗ Error processing {filename}: {e}")
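
The sample leaves post-processing up to you. One simple option is to write each converted document into an output folder. Here is a minimal sketch, assuming the converted result is a string for 'markdown' and 'html' and a JSON-serializable object for 'json'. save_converted_document, the extension mapping and the default "converted" folder are hypothetical, not part of the Datalab API:

```python
import json
from pathlib import Path
from typing import Any

# Assumed mapping from output_format to a file extension.
EXTENSIONS = {"markdown": ".md", "html": ".html", "json": ".json"}


def save_converted_document(pdf_path: Path, converted: Any,
                            output_format: str = "markdown",
                            output_dir: str = "converted") -> Path:
    """Write the converted output to output_dir, named after the source PDF."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    out_path = out_dir / (pdf_path.stem + EXTENSIONS[output_format])
    if isinstance(converted, str):
        text = converted
    else:
        # Non-string payloads (e.g. the 'json' format) are serialized.
        text = json.dumps(converted, ensure_ascii=False, indent=2)
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

Calling save_converted_document(pdf_path, converted_document, output_format) at the "do something with it" step would write the output for report.pdf to converted/report.md.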

Try it out

Sign up for Datalab and try out Marker - it’s free, and we’ll include credits. If you need a self-hosted solution, you can directly purchase an on-prem license, no crazy sales process needed, or reach out for custom enterprise quotes / contracts. As always, write to us at support@datalab.to if you want credits or have any specific questions / requests!
