Datalab supports the marker (document conversion) and ocr endpoints. All Datalab converters can be accessed via a /api/v1/{converter_name} endpoint.

Submit a Request

cURL
curl -X POST https://www.datalab.to/api/v1/{endpoint} \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@library/document.pdf"
This returns the following JSON response:
{
  "success": true,
  "error": null,
  "request_id": "<string>",
  "request_check_url": "<string>"
}
request_id is a unique identifier for the job; use it to get the result. request_check_url is the URL to poll for that result.

Get a Result

curl -X GET https://www.datalab.to/api/v1/{endpoint}/{request_id} \
  -H "X-API-Key: YOUR_API_KEY"
Each endpoint has its own options and response formats. See below for more information.

Authentication

All requests to the Datalab API must include an X-API-Key header with your API key. The API returns JSON in response bodies. Files are uploaded as a multipart/form-data request, with the file included as a file field and any other options sent as form fields in the same request.
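
As a minimal sketch of the header requirement above, the helper below builds the X-API-Key header from an environment variable. The variable name DATALAB_API_KEY and the function name auth_headers are illustrative assumptions, not part of the Datalab API.

```python
import os


def auth_headers(env_var="DATALAB_API_KEY", environ=None):
    """Return the X-API-Key header dict, reading the key from the environment.

    env_var is a hypothetical variable name chosen for this sketch.
    """
    environ = os.environ if environ is None else environ
    api_key = environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set {env_var} to your Datalab API key")
    return {"X-API-Key": api_key}
```

The returned dict can be passed as the headers argument of both the submit request and the later polling requests, which keeps the key out of source files.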

Marker

The marker endpoint is available at /api/v1/marker. Here is an example request in Python:
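import os
import requests

url = "https://www.datalab.to/api/v1/marker"

# open() does not expand "~", so resolve the home directory explicitly.
pdf_path = os.path.expanduser("~/pdfs/test.pdf")

form_data = {
    "file": ("test.pdf", open(pdf_path, "rb"), "application/pdf"),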
    "force_ocr": (None, False),
    "paginate": (None, False),
    'output_format': (None, 'markdown'),
    "use_llm": (None, False),
    "strip_existing_ocr": (None, False),
    "disable_image_extraction": (None, False)
}

headers = {"X-Api-Key": "YOUR_API_KEY"}

response = requests.post(url, files=form_data, headers=headers)
data = response.json()
Everything is a form parameter. This is because we’re uploading a file, so the request body has to be multipart/form-data.

Parameters:
  • file - the input file.
  • output_format - one of json, html, or markdown.
  • force_ocr - forces OCR on every page, ignoring the text in the PDF. This is slower, but can be useful for PDFs with known bad text.
  • format_lines - partially OCRs the lines to properly include inline math and styling (bold, superscripts, etc.). This is faster than force_ocr.
  • paginate - adds delimiters to the output pages. See the API reference for details.
  • use_llm - setting this to True will use an LLM to enhance accuracy of forms, tables, inline math, and layout. It can be much more accurate, but carries a small hallucination risk and makes responses slower.
  • strip_existing_ocr - setting this to True will remove all existing OCR text from the file and redo OCR. This is useful if you know OCR text was added to the PDF by a low-quality OCR tool.
  • disable_image_extraction - setting this to True will disable extraction of images. If use_llm is also set to True, images are turned into text descriptions.
  • max_pages - the maximum number of pages to process, counted from the start of the file.
You can see a full list of parameters and descriptions in the API reference. The request will return the following:
{'success': True, 'error': None, 'request_id': "PpK1oM-HB4RgrhsQhVb2uQ", 'request_check_url': 'https://www.datalab.to/api/v1/marker/PpK1oM-HB4RgrhsQhVb2uQ'}
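
The form parameters listed above can be assembled with a small helper, sketched here. The function name build_form_fields is hypothetical. The sketch assumes that options left as None should simply not be sent, and it converts every other value with str(); how the server parses those strings is not specified in this document.

```python
def build_form_fields(file_name, file_obj, content_type, **options):
    """Build a dict suitable for requests.post(url, files=...).

    Options set to None are skipped and not sent. All other values are
    converted with str() before being placed in the form.
    """
    fields = {"file": (file_name, file_obj, content_type)}
    for name, value in options.items():
        if value is None:
            continue
        # (None, value) marks a plain form field rather than a file part.
        fields[name] = (None, str(value))
    return fields
```

For example, build_form_fields("test.pdf", fh, "application/pdf", output_format="markdown", use_llm=False) produces the same kind of dictionary as the hand-written form_data above.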
You will then need to poll request_check_url, like this:
import time

max_polls = 300
check_url = data["request_check_url"]

for i in range(max_polls):
    time.sleep(2)
    response = requests.get(check_url, headers=headers) # Don't forget to send the auth headers
    data = response.json()

    if data["status"] == "complete":
        break
Eventually, the status field will be set to complete, and you will get an object that looks like this:
{
    "output_format": "markdown",
    "markdown": "...",
    "status": "complete",
    "success": True,
    "images": {...},
    "metadata": {...},
    "error": "",
    "page_count": 5
}
If success is False, you will get an error code along with the response. All response data will be deleted from Datalab servers an hour after processing is complete, so make sure to get your results by then.
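
The polling step and the success check can be combined into one helper, sketched below. The name wait_for_result is hypothetical. The sketch assumes that a failed job is reported with status complete and success False, which this document implies but does not state outright. The fetch argument is any zero-argument callable that returns the decoded JSON, so the helper does not depend on a particular HTTP client.

```python
import time


def wait_for_result(fetch, max_polls=300, delay=2.0, sleep=time.sleep):
    """Call fetch() until the job reports status "complete".

    fetch is any zero-argument callable returning the decoded JSON dict,
    e.g. lambda: requests.get(check_url, headers=headers).json().
    """
    for _ in range(max_polls):
        data = fetch()
        if data.get("status") == "complete":
            # Assumption: failures arrive as complete with success False.
            if not data.get("success"):
                raise RuntimeError(f"Conversion failed: {data.get('error')}")
            return data
        sleep(delay)
    raise TimeoutError(f"Job not complete after {max_polls} polls")
```

Raising on timeout and on failure means the caller only ever receives a completed, successful result.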

Response fields

  • output_format - the requested output format: json, html, or markdown.
  • markdown | json | html - the output from the file. The field is named according to output_format. See the API reference for details on the json format.
  • status - indicates the status of the request (complete or processing).
  • success - indicates if the request completed successfully. True or False.
  • images - dictionary of image filenames (keys) and base64 encoded images (values). Each value can be decoded with base64.b64decode(value). Then it can be saved to the filename (key).
  • metadata - metadata about the markdown conversion.
  • error - if there was an error, this contains the error message.
  • page_count - number of pages that were converted.
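
The images field described above can be written to disk as follows. This is a minimal sketch: the function name save_images and the choice to keep only the final path component of each key are assumptions of the example, not behavior defined by the API.

```python
import base64
from pathlib import Path


def save_images(result, out_dir):
    """Decode result["images"] (filename -> base64 string) into out_dir."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, encoded in (result.get("images") or {}).items():
        # Keep only the final path component so a key cannot escape out_dir.
        target = out_path / Path(filename).name
        target.write_bytes(base64.b64decode(encoded))
        written.append(target)
    return written
```

Calling save_images(data, "output/images") on a completed marker response returns the list of paths that were written.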

OCR

The OCR endpoint at /api/v1/ocr will run OCR on a given page and return detailed character and positional information. Here is an example request in Python:
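import os
import requests

url = "https://www.datalab.to/api/v1/ocr"

# open() does not expand "~", so resolve the home directory explicitly.
image_path = os.path.expanduser("~/images/test.png")

form_data = {
    "file": ("test.png", open(image_path, "rb"), "image/png"),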
}

headers = {"X-Api-Key": "YOUR_API_KEY"}

response = requests.post(url, files=form_data, headers=headers)
data = response.json()
The request will return the following:
{'success': True, 'error': None, 'request_id': "PpK1oM-HB4RgrhsQhVb2uQ", 'request_check_url': 'https://www.datalab.to/api/v1/ocr/PpK1oM-HB4RgrhsQhVb2uQ'}
You will then need to poll request_check_url, as seen above in the marker section. The final response will look like this:

{
    'status': 'complete',
    'pages': [
        {
            'text_lines': [{
                'polygon': [[267.0, 139.0], [525.0, 139.0], [525.0, 159.0], [267.0, 159.0]],
                'confidence': 0.99951171875,
                'text': 'Subspace Adversarial Training',
                'bbox': [267.0, 139.0, 525.0, 159.0]
            }, ...],
            'image_bbox': [0.0, 0.0, 816.0, 1056.0],
            'page': 12
        }
    ],
    'success': True,
    'error': '',
    'page_count': 5
}
If success is False, you will get an error code along with the response. All response data will be deleted from Datalab servers an hour after processing is complete, so make sure to get your results by then.

Response fields

  • status - indicates the status of the request (complete or processing).
  • success - indicates if the request completed successfully. True or False.
  • error - if there was an error, this contains the error message.
  • page_count - number of pages we ran OCR on.
  • pages - a list containing one dictionary per input page. The fields are:
    • text_lines - the detected text and bounding boxes for each line
      • text - the text in the line
      • confidence - the confidence of the model in the detected text (0-1)
      • polygon - the polygon for the text line in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are in clockwise order from the top left.
      • bbox - the axis-aligned rectangle for the text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
      • chars - the individual characters in the line
        • text - the text of the character
        • bbox - the character bbox (same format as line bbox)
        • polygon - the character polygon (same format as line polygon)
        • confidence - the confidence of the model in the detected character (0-1)
        • bbox_valid - if the character is a special token or math, the bbox may not be valid
    • page - the page number in the file
    • image_bbox - the bbox for the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.
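
The fields above are plain nested lists and dictionaries, so they can be processed without any client library. The sketch below shows two small helpers; the names page_text and polygon_to_bbox and the min_confidence threshold are illustrative assumptions. The second helper computes the axis-aligned box that encloses a polygon, which for the sample line shown earlier gives the same values as its bbox.

```python
def page_text(page, min_confidence=0.0):
    """Join the text of a page's lines, skipping low-confidence ones."""
    lines = [
        line["text"]
        for line in page.get("text_lines", [])
        if line.get("confidence", 0.0) >= min_confidence
    ]
    return "\n".join(lines)


def polygon_to_bbox(polygon):
    """Return the axis-aligned [x1, y1, x2, y2] box enclosing a polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return [min(xs), min(ys), max(xs), max(ys)]
```

A full document can then be reassembled with "\n\n".join(page_text(p) for p in data["pages"]).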