NOTE: This example shows how to run a conversion with our API. If you have sensitive documents, our self-hosted, on-prem solution might be of interest; you can do exactly the same thing with it.

Let’s check out how Marker can be used to convert a PDF into Markdown, JSON, or HTML. This is a classic and necessary step in modern document processing pipelines: PDFs are great for humans but bad for machines. Worse, pulling information out of PDFs to store it in other ways is extremely hard (in the past, we’d rely on manual entry). By converting a PDF accurately to another format, we can reliably store it, extract from it, and do all sorts of useful things with it.

The Marker API is powered by our open source packages marker and surya, with a few enhancements for our platform / self-serve version.

Running a conversion

The setup looks like this:
  • Upload your file for processing, along with a few settings (we’ll cover them below)
  • Poll to see if your request is done
  • Voilà! The converted output is in the final response

PDF Submission

The marker endpoint is available at /api/v1/marker. Here is an example request in Python:
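import os
import requests

url = "https://www.datalab.to/api/v1/marker"

form_data = {
    # open() does not expand "~", so resolve the home directory explicitly
    'file': ('test.pdf', open(os.path.expanduser('~/pdfs/test.pdf'), 'rb'), 'application/pdf'),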
    "force_ocr": (None, False),
    "paginate": (None, False),
    'output_format': (None, 'markdown'),
    "use_llm": (None, False),
    "strip_existing_ocr": (None, False),
    "disable_image_extraction": (None, False)
}

headers = {"X-Api-Key": "YOUR_API_KEY"}

response = requests.post(url, files=form_data, headers=headers)
data = response.json()
Everything is a form parameter. This is because we’re uploading a file, so the request body has to be multipart/form-data. Parameters:
  • file, the input file.
  • file_url, a URL pointing to the input file. Provide either file or file_url.
  • output_format - one of json, html, or markdown.
  • force_ocr will force OCR on every page (ignore the text in the PDF). This is slower, but can be useful for PDFs with known bad text.
  • format_lines will partially OCR the lines to properly include inline math and styling (bold, superscripts, etc.). This is faster than force_ocr.
  • paginate - adds delimiters between pages in the output. See the API reference for details.
  • use_llm - setting this to True will use an LLM to enhance accuracy of forms, tables, inline math, and layout. It can be much more accurate, but carries a small hallucination risk. Setting use_llm to True will make responses slower.
  • strip_existing_ocr - setting to True will remove all existing OCR text from the file and redo OCR. This is useful if you know OCR text was added to the PDF by a low-quality OCR tool.
  • disable_image_extraction - setting to True will disable extraction of images. If use_llm is set to True, this will also turn images into text descriptions.
  • max_pages - the maximum number of pages to process, counted from the start of the file.
You can see a full list of parameters and descriptions in the API reference. The request will return the following response:
{'success': True, 'error': None, 'request_id': "PpK1oM-HB4RgrhsQhVb2uQ", 'request_check_url': 'https://www.datalab.to/api/v1/marker/PpK1oM-HB4RgrhsQhVb2uQ'}
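
The parameter list above maps directly onto multipart fields. As a minimal sketch, the helper below assembles those fields for either a local file or a file_url. The name build_marker_fields is hypothetical and not part of any Datalab client. The sketch assumes file_url is sent as an ordinary form field like the other settings, and it treats supplying both or neither of file and file_url as a caller error, which is one reading of "either file or file_url". Check the API reference before relying on either assumption.

```python
from pathlib import Path
from typing import Optional


def build_marker_fields(file_path: Optional[str] = None,
                        file_url: Optional[str] = None,
                        output_format: str = "markdown",
                        **flags: bool) -> dict:
    """Assemble multipart fields for a POST to /api/v1/marker (hypothetical helper)."""
    if (file_path is None) == (file_url is None):
        raise ValueError("provide exactly one of file_path or file_url")
    if output_format not in ("markdown", "json", "html"):
        raise ValueError(f"unsupported output_format: {output_format}")

    fields = {"output_format": (None, output_format)}
    # Boolean settings such as force_ocr, paginate, or use_llm go in as text fields.
    for name, value in flags.items():
        fields[name] = (None, str(bool(value)))

    if file_url is not None:
        # Assumption: file_url is a plain form field, like the other settings.
        fields["file_url"] = (None, file_url)
    else:
        path = Path(file_path).expanduser()  # open() does not expand "~" by itself
        fields["file"] = (path.name, path.read_bytes(), "application/pdf")
    return fields


# Usage with the url and headers from the request example:
# fields = build_marker_fields(file_url="https://example.com/test.pdf", use_llm=True)
# data = requests.post(url, files=fields, headers=headers).json()
```

Reading the file into memory keeps the sketch short and avoids leaving a file handle open; for very large PDFs, passing an open file object as in the original example may be preferable.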

Polling for Completion

You will then need to poll request_check_url, like this:
import time

max_polls = 300
check_url = data["request_check_url"]

for i in range(max_polls):
    time.sleep(2)
    response = requests.get(check_url, headers=headers) # Don't forget to send the auth headers
    data = response.json()

    if data["status"] == "complete":
        break
Eventually, the status field will be set to complete, and you will get an object that looks like this:
{
    "output_format": "markdown",
    "markdown": "...",
    "status": "complete",
    "success": True,
    "images": {...},
    "metadata": {...},
    "error": "",
    "page_count": 5
}
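
The polling loop above exits silently if max_polls is exhausted, leaving data set to the last "processing" response. A slightly more defensive variant is sketched below. The function name poll_until_complete and its parameters are hypothetical, and the sketch assumes a failed conversion is reported with status "complete" and success set to False, which the field descriptions suggest but do not state outright.

```python
import time
from typing import Callable


def poll_until_complete(get_status: Callable[[], dict],
                        max_polls: int = 300,
                        interval: float = 2.0,
                        sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call get_status() until status == "complete" or the poll budget runs out."""
    for _ in range(max_polls):
        sleep(interval)
        data = get_status()
        if data.get("status") == "complete":
            # Assumption: a failed job is reported as complete with success False.
            if not data.get("success"):
                raise RuntimeError(f"conversion failed: {data.get('error')}")
            return data
    raise TimeoutError(f"not complete after {max_polls} polls")


# Usage with the variables from the polling example:
# result = poll_until_complete(
#     lambda: requests.get(check_url, headers=headers).json()
# )
```

Passing the fetch step in as a callable keeps the retry logic independent of the HTTP library and makes it easy to exercise with canned responses.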

Response fields

  • output_format is the requested output format, json, html, or markdown.
  • markdown | json | html is the output from the file. It will be named according to the output_format. You can find more details on the json format here.
  • status - indicates the status of the request (complete, or processing).
  • success - indicates if the request completed successfully. True or False.
  • images - dictionary of image filenames (keys) and base64 encoded images (values). Each value can be decoded with base64.b64decode(value). Then it can be saved to the filename (key).
  • metadata - metadata about the conversion.
  • error - if there was an error, this contains the error message.
  • page_count - number of pages that were converted.
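
To make the field descriptions concrete, here is a rough sketch that writes a completed response to disk: it picks the output field named after output_format and decodes each entry of images with base64.b64decode. The helper name save_result and the output.<ext> filename are hypothetical choices, and the sketch assumes the json output may arrive as an already-parsed object rather than a string.

```python
import base64
import json
from pathlib import Path


def save_result(result: dict, out_dir: str) -> Path:
    """Write the converted document and its images under out_dir (hypothetical helper)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    fmt = result["output_format"]   # "markdown", "json", or "html"
    content = result[fmt]           # the output field is named after the format
    ext = {"markdown": "md", "json": "json", "html": "html"}[fmt]

    # Assumption: json output may be a parsed object, so serialize it if needed.
    text = content if isinstance(content, str) else json.dumps(content, indent=2)
    doc_path = out / f"output.{ext}"
    doc_path.write_text(text, encoding="utf-8")

    for filename, b64_data in (result.get("images") or {}).items():
        # Keep only the final path component so a key cannot escape out_dir.
        (out / Path(filename).name).write_bytes(base64.b64decode(b64_data))
    return doc_path


# Usage with a completed polling response:
# save_result(data, "converted/test")
```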
And boom, you have a PDF converted into Markdown! You can change output_format to get other response formats. If you have more downstream needs, like taking that content and structuring it into a specific schema, you may be interested in Structured Extraction.

Important: All response data is deleted from Datalab servers one hour after processing completes, so make sure to retrieve your results before then.

Try it out

Sign up for Datalab and try out Marker. If you need a self-hosted solution, you can directly purchase an on-prem license, no crazy sales process needed, or reach out for custom enterprise quotes / contracts. As always, write to us at support@datalab.to if you want credits or have any specific questions / requests!