In an earlier tutorial, we covered the accurate conversion of a PDF into a linearized format such as HTML, Markdown, or JSON using Marker. What if all we really cared about were the tables? We have an endpoint dedicated to table identification and extraction from documents. It works with our usual range of file types. If you’re pulling tables out of websites by URL, we recommend converting the page to PDF first.

Context

Let’s say we’re looking at this website with information on various public schools in Connecticut: https://www.publicschoolreview.com/top-ranked-public-schools/connecticut. What we really care about is pulling out the tables with all their original information preserved so we can do some post-processing, for example by loading them into a dataframe or database.

Preprocessing

The easiest way to work with a web page in Datalab is to convert it into a PDF or DOCX first. In Python, you can do this with WeasyPrint:
from weasyprint import HTML

url = "https://www.publicschoolreview.com/top-ranked-public-schools/connecticut"
output_file = "connecticut_schools.pdf"

HTML(url).write_pdf(output_file)

print(f"Saved PDF: {output_file}")

Running Table Recognition

Now, the setup is similar to our Marker example from earlier:
  • Upload your file for processing, along with a few settings (we’ll cover them below)
  • Poll to see if your request is done
  • Voilà!

PDF Submission

The Table Recognition endpoint is available at /api/v1/table_rec. Here is an example request in Python:
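import os
import requests

url = "https://www.datalab.to/api/v1/table_rec"

form_data = {
    # open() does not expand "~", so resolve the home directory explicitly
    'file': ('test.pdf', open(os.path.expanduser('~/pdfs/test.pdf'), 'rb'), 'application/pdf'),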
    "use_llm": (None, False),
    "force_ocr": (None, False),
    "paginate": (None, False),
    'output_format': (None, 'json'),
}

headers = {"X-Api-Key": "YOUR_API_KEY"}

response = requests.post(url, files=form_data, headers=headers)
data = response.json()
Everything is a form parameter. This is because we’re uploading a file, so the request body has to be multipart/form-data. A note on parameters:
  • By default, we disable use_llm, but depending on your accuracy / latency tradeoffs, it’s worth testing enabling it for the types of documents you care about. Enabling it can often fix issues with dense tables.
  • force_ocr will slow down inference, but can fix rendering issues, e.g. with ligatures in text.
  • output_format is useful to get your table out as html, json, or markdown. Note that if you use json, it’ll include Table and TableCell blocks where available, including their bounding boxes.
You can see a full list of parameters and descriptions in the Table Recognition API reference. The request will return the following response:
{
  'success': True,
  'error': None,
  'request_id': "PpK1oM-HB4RgrhsQhVb2uQ",
  'request_check_url': 'https://www.datalab.to/api/v1/table_rec/PpK1oM-HB4RgrhsQhVb2uQ'
}
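Before polling, it can be worth confirming that the submission was accepted. The sketch below relies only on the `success`, `error`, and `request_check_url` fields shown in the response above. The helper name `extract_check_url` is hypothetical (it is not part of any Datalab client), and the assumption that `error` carries a message on failure is ours.

```python
def extract_check_url(submit_response: dict) -> str:
    """Return the polling URL, or raise if the submission was rejected."""
    if not submit_response.get("success"):
        # Assumption: on failure, `error` holds a human-readable message.
        raise RuntimeError(f"Submission failed: {submit_response.get('error')}")
    return submit_response["request_check_url"]


# Example using the response shape shown above
example = {
    "success": True,
    "error": None,
    "request_id": "PpK1oM-HB4RgrhsQhVb2uQ",
    "request_check_url": "https://www.datalab.to/api/v1/table_rec/PpK1oM-HB4RgrhsQhVb2uQ",
}
check_url = extract_check_url(example)
```

Failing early here avoids a confusing `KeyError` later if the request was rejected and no `request_check_url` came back.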

Polling for Completion

You will then need to poll request_check_url to check on your Table Recognition result, like this:
import time

max_polls = 300
check_url = data["request_check_url"]

for i in range(max_polls):
    time.sleep(2)
    response = requests.get(check_url, headers=headers) # Don't forget to send the auth headers
    data = response.json()

    if data["status"] == "complete":
        break
Eventually, the status field will be set to complete, and you will get a response containing your identified tables. They’ll be within a key corresponding to your value for output_format, i.e. html | json | markdown.
{
    "status": "complete",
    "output_format": "json",
    "json": {
      "children": [
        {
          "id": "/page/0/Page/770",
          "block_type": "Page",
          "html": "<content-ref src='/page/0/Table/2'></content-ref>",
          "polygon": [
              ...
          ],
          "bbox": [
              0.0,
              0.0,
              530.0,
              433.0
          ],
          "children": [
              {
                  "id": "/page/0/Table/2",
                  "block_type": "Table",
                  ...
              }
          ]
        },
        ...
      ]
    }
}
Important: All response data is deleted from Datalab servers one hour after processing completes, so make sure to retrieve your results before then.
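With the json output format, the tables sit inside a tree of blocks, so a small recursive walk is one way to pull them out. This is a minimal sketch that assumes only the `block_type`, `id`, and `children` keys visible in the response above. The function name `iter_blocks` is hypothetical, and the sample data is a trimmed stand-in for a real response.

```python
from typing import Iterator


def iter_blocks(node: dict, block_type: str) -> Iterator[dict]:
    """Yield every block of the given type, depth-first."""
    if node.get("block_type") == block_type:
        yield node
    # Assumption: `children` may be missing or None on leaf blocks.
    for child in node.get("children") or []:
        yield from iter_blocks(child, block_type)


# Trimmed-down stand-in for the "json" key of a completed response
sample_json = {
    "children": [
        {
            "id": "/page/0/Page/770",
            "block_type": "Page",
            "children": [
                {"id": "/page/0/Table/2", "block_type": "Table", "children": None},
            ],
        }
    ]
}

table_ids = [block["id"] for block in iter_blocks(sample_json, "Table")]
print(table_ids)  # ['/page/0/Table/2']
```

The same walk works for `TableCell` blocks by passing a different `block_type`.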

Full Code Sample

import os
import time
import requests
from pathlib import Path
from typing import Optional

API_URL = "https://www.datalab.to/api/v1/table_rec"
API_KEY = os.getenv("DATALAB_API_KEY")

def submit_and_poll_table_recognition(
  pdf_path: Path,
  output_format: Optional[str] = 'json',
  use_llm: Optional[bool] = True
):
  url = API_URL

  #
  # Submit initial request
  #
  with open(pdf_path, 'rb') as f:
    form_data = {
        'file': (pdf_path.name, f, 'application/pdf'),
        "force_ocr": (None, False),
        "paginate": (None, False),
        'output_format': (None, output_format),
        "use_llm": (None, use_llm),
        "strip_existing_ocr": (None, False),
        "disable_image_extraction": (None, False)
    }
  
    headers = {"X-Api-Key": API_KEY}

    response = requests.post(url, files=form_data, headers=headers)
    data = response.json()

  #
  # Poll for completion
  #
  max_polls = 300
  check_url = data["request_check_url"]
  for i in range(max_polls):
    response = requests.get(check_url, headers=headers) # Need to include headers for API key
    check_result = response.json()

    if check_result['status'] == 'complete':
      #
      # Your processing is finished, you can do your post-processing!
      #
      # The 'html', 'markdown', or 'json' field in the response will contain
      # what you're looking for (maps to our initial `output_format`)
      extracted_tables = check_result[output_format]
      #
      # .. do something with it, or hand it back to the caller
      #
      return extracted_tables
    elif check_result["status"] == "failed":
      print("Failed to convert, uh oh...")
      break
    else:
      print("Waiting 2 more seconds to re-check conversion status")
      time.sleep(2)

  # Reached only on failure or after max_polls attempts
  return None
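Once you have the html output in hand, the post-processing mentioned at the start (loading tables into a dataframe or database) needs the cells as rows. Here is a minimal sketch using only the standard library. It assumes the html output contains ordinary `<table>`, `<tr>`, `<th>`, and `<td>` markup and does not handle nested tables. The names `TableRowParser` and `parse_tables` are hypothetical, and the sample HTML is made up for illustration.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect cell text grouped by table and row."""

    def __init__(self) -> None:
        super().__init__()
        self.tables = []  # list of tables; each table is a list of rows
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self.tables[-1][-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self._in_cell:
            self.tables[-1][-1][-1] += data


def parse_tables(html: str) -> list:
    """Return every table in `html` as a list of rows of stripped cell text."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return [[[cell.strip() for cell in row] for row in table] for table in parser.tables]


# Made-up sample, not real API output
sample_html = (
    "<table><tr><th>School</th><th>Rank</th></tr>"
    "<tr><td>School A</td><td>1</td></tr></table>"
)
tables = parse_tables(sample_html)
print(tables[0])  # [['School', 'Rank'], ['School A', '1']]
```

From here, if the first row is a header, something like `pandas.DataFrame(rows[1:], columns=rows[0])` gets you a dataframe, and the same rows can be written out with the `csv` module or inserted into a database.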

Try it out

Sign up for Datalab and try out Marker - it’s free, and we’ll include credits. If you need a self-hosted solution, you can directly purchase an on-prem license, no crazy sales process needed, or reach out for custom enterprise quotes / contracts. As always, write to us at support@datalab.to if you want credits or have any specific questions / requests!