Let’s see how Marker can be used to convert a PDF into Markdown, JSON, or HTML. This is a classic and necessary step in modern document processing pipelines: PDFs are great for humans but bad for machines. Worse, pulling information out of PDFs to store and use in different ways is extremely hard (in the past, we’d rely on manual entry). The Marker API is powered by our open source packages marker and surya, with a few enhancements for our platform / self-serve version.

Running a conversion

The setup looks like this:
  • Upload your file for processing, along with a few settings (we’ll cover them below)
  • Poll to see if your request is done
  • voila!

PDF Submission

The marker endpoint is available at /api/v1/marker. Here is an example request in Python:
import os
import requests

url = "https://www.datalab.to/api/v1/marker"

form_data = {
    # open() does not expand "~", so expand it explicitly
    'file': ('test.pdf', open(os.path.expanduser('~/pdfs/test.pdf'), 'rb'), 'application/pdf'),
    "force_ocr": (None, False),
    "paginate": (None, False),
    'output_format': (None, 'markdown'),
    "use_llm": (None, False),
    "strip_existing_ocr": (None, False),
    "disable_image_extraction": (None, False)
}

headers = {"X-Api-Key": "YOUR_API_KEY"}

response = requests.post(url, files=form_data, headers=headers)
data = response.json()
Everything is a form parameter. This is because we’re uploading a file, so the request body has to be multipart/form-data. Parameters:
  • file, the input file.
  • file_url, a URL pointing to the input file. Either file or file_url can be provided.
  • output_format - one of json, html, or markdown.
  • force_ocr will force OCR on every page (ignore the text in the PDF). This is slower, but can be useful for PDFs with known bad text.
  • format_lines will partially OCR the lines to properly include inline math and styling (bold, superscripts, etc.). This is faster than force_ocr.
  • paginate - adds delimiters to the output pages. See the API reference for details.
  • use_llm - setting this to True will use an LLM to enhance accuracy of forms, tables, inline math, and layout. It can be much more accurate, but carries a small hallucination risk. Setting use_llm to True will make responses slower.
  • strip_existing_ocr - setting to True will remove all existing OCR text from the file and redo OCR. This is useful if you know OCR text was added to the PDF by a low-quality OCR tool.
  • disable_image_extraction - setting to True will disable extraction of images. If use_llm is set to True, this will also turn images into text descriptions.
  • max_pages - the maximum number of pages to process, counted from the start of the file.
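As a rough sketch of how these parameters fit together when you submit a URL instead of uploading a file, the helper below builds the multipart form fields. The helper name build_marker_form and the example URL are assumptions made for illustration, not part of the API; only the field names come from the parameter list above.

```python
from typing import Dict, Optional, Tuple

# Multipart field in the (filename, content) shape accepted by requests' `files=`.
FormField = Tuple[Optional[str], object]


def build_marker_form(
    file_url: str,
    output_format: str = "markdown",
    max_pages: Optional[int] = None,
    **flags: bool,
) -> Dict[str, FormField]:
    """Hypothetical helper: build form fields for a URL-based submission."""
    if output_format not in {"json", "html", "markdown"}:
        raise ValueError(f"unsupported output_format: {output_format!r}")
    form: Dict[str, FormField] = {
        "file_url": (None, file_url),
        "output_format": (None, output_format),
    }
    if max_pages is not None:
        # Sent as text, like every other multipart form value.
        form["max_pages"] = (None, str(max_pages))
    for name, value in flags.items():
        # Boolean flags such as force_ocr or use_llm, passed as in the example above.
        form[name] = (None, value)
    return form


form_data = build_marker_form(
    "https://example.com/test.pdf",  # placeholder URL, replace with your own
    max_pages=5,
    force_ocr=True,
)
```

The resulting dictionary can be sent the same way as in the upload example, with requests.post(url, files=form_data, headers=headers).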
You can see a full list of parameters and descriptions in the Marker API reference. The request will return the following response:
{
  'success': True,
  'error': None,
  'request_id': "PpK1oM-HB4RgrhsQhVb2uQ",
  'request_check_url': 'https://www.datalab.to/api/v1/marker/PpK1oM-HB4RgrhsQhVb2uQ'
}
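Before polling, it may be worth checking that the submission was accepted. The sketch below assumes that a rejected submission comes back with success set to False and a message in error, which is how the fields above read; the function name get_check_url is made up for this example.

```python
def get_check_url(submit_response: dict) -> str:
    """Return the polling URL, or raise if the submission was not accepted."""
    if not submit_response.get("success"):
        # Assumption: `error` carries a human-readable message on failure.
        raise RuntimeError(f"Marker submission failed: {submit_response.get('error')}")
    return submit_response["request_check_url"]


# Sample payload copied from the response shown above.
sample_response = {
    "success": True,
    "error": None,
    "request_id": "PpK1oM-HB4RgrhsQhVb2uQ",
    "request_check_url": "https://www.datalab.to/api/v1/marker/PpK1oM-HB4RgrhsQhVb2uQ",
}
check_url = get_check_url(sample_response)
```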

Polling for Completion

You will then need to poll request_check_url, like this:
import time

max_polls = 300
check_url = data["request_check_url"]

for i in range(max_polls):
    time.sleep(2)
    response = requests.get(check_url, headers=headers) # Don't forget to send the auth headers
    data = response.json()

    if data["status"] == "complete":
        break
Eventually, the status field will be set to complete, and you will get an object that looks like this:
{
    "output_format": "markdown",
    "markdown": "...",
    "status": "complete",
    "success": True,
    "images": {...},
    "metadata": {...},
    "error": "",
    "page_count": 5
}
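If you want the polling loop to fail loudly instead of silently running out of attempts, one option is to wrap it in a small function. This is a sketch rather than an official client: poll_until_complete, fetch_status, and the injectable sleep argument are names invented here, and the fake responses only stand in for the real endpoint.

```python
import time
from typing import Callable


def poll_until_complete(
    fetch_status: Callable[[], dict],
    max_polls: int = 300,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status until status is "complete", or raise after max_polls."""
    for _ in range(max_polls):
        result = fetch_status()
        if result.get("status") == "complete":
            if not result.get("success"):
                # Finished, but the conversion itself reported a problem.
                raise RuntimeError(f"Conversion failed: {result.get('error')}")
            return result
        sleep(delay_seconds)
    raise TimeoutError(f"Request not complete after {max_polls} polls")


# Demo with canned responses in place of real HTTP calls.
_fake_responses = iter([
    {"status": "processing"},
    {"status": "complete", "success": True, "markdown": "# Title"},
])
result = poll_until_complete(lambda: next(_fake_responses), sleep=lambda _: None)
```

Against the real API, fetch_status would be something like lambda: requests.get(check_url, headers=headers).json().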

Response fields

  • output_format is the requested output format, json, html, or markdown.
  • markdown | json | html is the output from the file. It will be named according to the output_format. You can find more details on the json format here.
  • status - indicates the status of the request (complete, or processing).
  • success - indicates if the request completed successfully. True or False.
  • images - dictionary of image filenames (keys) and base64 encoded images (values). Each value can be decoded with base64.b64decode(value). Then it can be saved to the filename (key).
  • metadata - metadata about the markdown conversion.
  • error - if there was an error, this contains the error message.
  • page_count - number of pages that were converted.
And boom, you have a PDF converted into Markdown! You can change output_format to get other response formats. If you have more downstream needs like taking that content and structuring it into a specific schema, you may be interested in Structured Extraction.

Important: All response data will be deleted from Datalab servers an hour after processing is complete, so make sure to get your results by then.
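Since the images field holds base64 strings keyed by filename, saving them to disk takes only a few lines. The sketch below assumes the keys are plain filenames; the save_images helper and the sample filename are made up for illustration, and the demo bytes are not a real image.

```python
import base64
import tempfile
from pathlib import Path
from typing import Dict, List


def save_images(images: Dict[str, str], out_dir: Path) -> List[Path]:
    """Decode each base64 value and write it to out_dir under its filename."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, encoded in images.items():
        # Keep only the base name so a key can never write outside out_dir.
        target = out_dir / Path(filename).name
        target.write_bytes(base64.b64decode(encoded))
        written.append(target)
    return written


# Demo with a stand-in payload; a real response would supply check_result["images"].
demo_images = {
    "page_0_image_1.png": base64.b64encode(b"not a real image").decode("ascii"),
}
with tempfile.TemporaryDirectory() as tmp:
    paths = save_images(demo_images, Path(tmp))
    saved_bytes = paths[0].read_bytes()
```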

Full Code Sample

import os
import time
import requests
from pathlib import Path
from typing import Optional

API_URL = "https://www.datalab.to/api/v1/marker"
API_KEY = os.getenv("DATALAB_API_KEY")

def submit_and_poll_pdf_conversion(
  pdf_path: Path,
  output_format: Optional[str] = 'markdown',
  use_llm: Optional[bool] = True
):
  url = API_URL  # reuse the module-level constant

  #
  # Submit initial request
  #
  with open(pdf_path, 'rb') as f:
    form_data = {
        'file': (pdf_path.name, f, 'application/pdf'),
        "force_ocr": (None, False),
        "paginate": (None, False),
        'output_format': (None, output_format),
        "use_llm": (None, use_llm),
        "strip_existing_ocr": (None, False),
        "disable_image_extraction": (None, False)
    }
  
    headers = {"X-Api-Key": API_KEY}

    response = requests.post(url, files=form_data, headers=headers)
    data = response.json()

  #
  # Poll for completion
  #
  max_polls = 300
  check_url = data["request_check_url"]
  for i in range(max_polls):
    response = requests.get(check_url, headers=headers) # Need to include headers for API key
    check_result = response.json()

    if check_result['status'] == 'complete':
      #
      # Your processing is finished, you can do your post-processing!
      #
      converted_document = check_result[output_format]  # the 'html', 'markdown', or 'json' field in the response will contain what you're looking for (maps to our initial `output_format`)
      #
      # .. do something with it!
      #
      return converted_document  # stop polling once the result is in
    elif check_result["status"] == "failed":
      print("Failed to convert, uh oh...")
      break
    else:
      print("Waiting 2 more seconds to re-check conversion status")
      time.sleep(2)

  # Reached only if the conversion failed or never completed within max_polls
  return None

Try it out

Sign up for Datalab and try out Marker - it’s free, and we’ll include credits. If you need a self-hosted solution, you can directly purchase an on-prem license, no crazy sales process needed, or reach out for custom enterprise quotes / contracts. As always, write to us at support@datalab.to if you want credits or have any specific questions / requests!