Datalab supports the marker (document conversion and structured extraction) and ocr endpoints. All Datalab converters can be accessed via a /api/v1/{converter_name} endpoint. (Note: for detailed tutorials, check out our Recipes section.)

Submit a Request

cURL
curl -X POST https://www.datalab.to/api/v1/{endpoint} \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@library/document.pdf"
This returns the following JSON response:
{
  "success": true,
  "error": null,
  "request_id": "<string>",
  "request_check_url": "<string>"
}
request_id is a unique identifier for the job. You can use it to get the result.

Get a Result

curl -X GET https://www.datalab.to/api/v1/{endpoint}/{request_id} \
  -H "X-API-Key: YOUR_API_KEY"
Each endpoint has its own options and response formats. See below for more information.
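
As a minimal sketch of handling the submission response shown above, the helper below checks the success field and returns request_check_url for later polling. The function name check_url_from_submission is hypothetical and not part of the Datalab API, and the payload values are placeholders.

```python
def check_url_from_submission(payload: dict) -> str:
    """Return the URL to poll for results, or raise if the submission failed."""
    if not payload.get("success"):
        # "error" carries the failure details when success is false
        raise RuntimeError(f"Submission failed: {payload.get('error')}")
    return payload["request_check_url"]


# Example payload with the same shape as the JSON response above (placeholder values)
submission = {
    "success": True,
    "error": None,
    "request_id": "abc123",
    "request_check_url": "https://www.datalab.to/api/v1/marker/abc123",
}
check_url = check_url_from_submission(submission)
```

Failing early here avoids polling a URL that was never issued.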

Authentication

All requests to the Datalab API must include an X-API-Key header with your API key. The API returns JSON in response bodies. Requests that send a JSON body need the Content-Type: application/json header. File uploads are sent as multipart/form-data requests instead, with the file included as a file field.
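
One way to avoid hard-coding the key is to read it from the environment and build the header once. This is only a sketch: the environment variable name DATALAB_API_KEY and the helper auth_headers are assumptions made for illustration, not something the API defines.

```python
import os


def auth_headers(env_var: str = "DATALAB_API_KEY") -> dict:
    """Build the X-API-Key header from an environment variable (name is an assumption)."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key")
    return {"X-API-Key": api_key}
```

The returned dictionary can be passed as the headers argument of requests.post and requests.get.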

Marker

You can submit PDFs for conversion using the Marker API. In general, the marker endpoint is available at /api/v1/marker. Here is an example request in Python:
import os
import requests

url = "https://www.datalab.to/api/v1/marker"

form_data = {
    # open() does not expand "~", so expand the path explicitly
    'file': ('test.pdf', open(os.path.expanduser('~/pdfs/test.pdf'), 'rb'), 'application/pdf'),
    "force_ocr": (None, False),
    "paginate": (None, False),
    'output_format': (None, 'markdown'),
    "use_llm": (None, False),
    "strip_existing_ocr": (None, False),
    "disable_image_extraction": (None, False)
}

headers = {"X-Api-Key": "YOUR_API_KEY"}

response = requests.post(url, files=form_data, headers=headers)
data = response.json()
Everything is a form parameter. This is because we’re uploading a file, so the request body has to be multipart/form-data. You can see a full list of parameters and descriptions in the API reference. The request will return the following:
{'success': True, 'error': None, 'request_id': "PpK1oM-HB4RgrhsQhVb2uQ", 'request_check_url': 'https://www.datalab.to/api/v1/marker/PpK1oM-HB4RgrhsQhVb2uQ'}
You will then need to poll request_check_url, like this:
import time

max_polls = 300
check_url = data["request_check_url"]

for i in range(max_polls):
    time.sleep(2)
    response = requests.get(check_url, headers=headers) # Don't forget to send the auth headers
    data = response.json()

    if data["status"] == "complete":
        break
Eventually, the status field will be set to complete, and you will get an object that looks like this:
{
    "output_format": "markdown",
    "markdown": "...",
    "status": "complete",
    "success": True,
    "images": {...},
    "metadata": {...},
    "error": "",
    "page_count": 5
}
If success is False, the error field will describe what went wrong. All response data is deleted from Datalab servers an hour after processing completes, so make sure to retrieve your results by then. We have a more detailed guide for Marker under Recipes.
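
The polling loop can be wrapped in a small helper that also handles the failure and timeout cases. This is a sketch under stated assumptions: poll_until_complete is a hypothetical name, and the fetch argument is any callable that returns the parsed JSON of request_check_url, which keeps the helper independent of the HTTP library.

```python
import time


def poll_until_complete(fetch, max_polls=300, interval=2.0, sleep=time.sleep):
    """Call fetch() until status is "complete", then return the parsed result.

    Raises RuntimeError if the finished request reports success as false,
    and TimeoutError if max_polls attempts pass without completion.
    """
    for _ in range(max_polls):
        data = fetch()
        if data.get("status") == "complete":
            if not data.get("success"):
                raise RuntimeError(f"Request failed: {data.get('error')}")
            return data
        sleep(interval)  # wait before asking again
    raise TimeoutError(f"Request not complete after {max_polls} polls")


# Usage with the variables from the example above:
# result = poll_until_complete(lambda: requests.get(check_url, headers=headers).json())
# markdown = result["markdown"]
```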

Structured Extraction

Structured Extraction is an output mode of marker that lets you specify the fields you want to extract from a PDF. After conversion, marker checks whether you passed a page_schema parameter and, if so, uses it to run extraction. Our Recipes include a detailed guide to running Structured Extraction with the API, with tips for handling long documents and other edge cases.
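
As a rough illustration of passing page_schema, the sketch below serializes a schema to a JSON string and wraps it as an extra multipart form field. Everything about the schema itself is an assumption: the JSON-Schema-like layout, the field names invoice_number and total_amount, and the helper extraction_fields are hypothetical, so check the Recipes guide for the format the API actually expects.

```python
import json

# Hypothetical schema: the field names and JSON-Schema-like layout are illustrative only
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
}


def extraction_fields(schema: dict) -> dict:
    """Return an extra form field carrying the schema as a JSON string (assumed format)."""
    return {"page_schema": (None, json.dumps(schema))}


extra_fields = extraction_fields(invoice_schema)
# Merge into the marker form data before posting, e.g. form_data.update(extra_fields)
```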

OCR

The OCR endpoint at /api/v1/ocr will run OCR on a given page and return detailed character and positional information. Here is an example request in Python:
import os
import requests

url = "https://www.datalab.to/api/v1/ocr"

form_data = {
    # open() does not expand "~", so expand the path explicitly
    'file': ('test.png', open(os.path.expanduser('~/images/test.png'), 'rb'), 'image/png'),
}

headers = {"X-Api-Key": "YOUR_API_KEY"}

response = requests.post(url, files=form_data, headers=headers)
data = response.json()
The request will return the following:
{'success': True, 'error': None, 'request_id': "PpK1oM-HB4RgrhsQhVb2uQ", 'request_check_url': 'https://www.datalab.to/api/v1/ocr/PpK1oM-HB4RgrhsQhVb2uQ'}
You will then need to poll request_check_url, as seen above in the marker section. The final response will look like this:

{
    'status': 'complete',
    'pages': [
        {
            'text_lines': [{
                'polygon': [[267.0, 139.0], [525.0, 139.0], [525.0, 159.0], [267.0, 159.0]],
                'confidence': 0.99951171875,
                'text': 'Subspace Adversarial Training',
                'bbox': [267.0, 139.0, 525.0, 159.0]
            }, ...],
            'image_bbox': [0.0, 0.0, 816.0, 1056.0],
            'page': 12
        }
    ],
    'success': True,
    'error': '',
    'page_count': 5
}
If success is False, the error field will describe what went wrong. All response data is deleted from Datalab servers an hour after processing completes, so make sure to retrieve your results by then.
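
Once the final response arrives, the recognized text can be collected per page. The sketch below assumes only the response shape shown above; page_texts and its min_confidence threshold are hypothetical conveniences, not part of the API.

```python
def page_texts(result: dict, min_confidence: float = 0.0) -> dict:
    """Map each page number to its text lines joined by newlines.

    Lines whose confidence is below min_confidence are skipped.
    """
    texts = {}
    for page in result.get("pages", []):
        lines = [
            line["text"]
            for line in page.get("text_lines", [])
            if line.get("confidence", 0.0) >= min_confidence
        ]
        texts[page["page"]] = "\n".join(lines)
    return texts


# Trimmed copy of the example response above
sample = {
    "status": "complete",
    "pages": [{
        "text_lines": [{
            "polygon": [[267.0, 139.0], [525.0, 139.0], [525.0, 159.0], [267.0, 159.0]],
            "confidence": 0.99951171875,
            "text": "Subspace Adversarial Training",
            "bbox": [267.0, 139.0, 525.0, 159.0],
        }],
        "image_bbox": [0.0, 0.0, 816.0, 1056.0],
        "page": 12,
    }],
    "success": True,
}
texts = page_texts(sample, min_confidence=0.9)
```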

Response fields

  • status - indicates the status of the request (complete, or processing).
  • success - indicates if the request completed successfully. True or False.
  • error - If there was an error, this is the error message.
  • page_count - number of pages we ran OCR on.
  • pages - a list containing one dictionary per input page. The fields are:
    • text_lines - the detected text and bounding boxes for each line
      • text - the text in the line
      • confidence - the confidence of the model in the detected text (0-1)
      • polygon - the polygon for the text line in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are in clockwise order from the top left.
      • bbox - the axis-aligned rectangle for the text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
      • chars - the individual characters in the line
        • text - the text of the character
        • bbox - the character bbox (same format as line bbox)
        • polygon - the character polygon (same format as line polygon)
        • confidence - the confidence of the model in the detected character (0-1)
        • bbox_valid - if the character is a special token or math, the bbox may not be valid
    • page - the page number in the file
    • image_bbox - the bbox for the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.
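
The coordinate formats described above can be worked with directly. As a small sketch, the helpers below derive the axis-aligned bbox from a polygon and rescale a bbox to fractions of the page using image_bbox. Both function names are hypothetical, and the normalization is just one convenient convention rather than anything the API returns.

```python
def polygon_to_bbox(polygon):
    """Return the axis-aligned [x1, y1, x2, y2] box that encloses a 4-point polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return [min(xs), min(ys), max(xs), max(ys)]


def normalize_bbox(bbox, image_bbox):
    """Express a bbox as fractions (0-1) of the page described by image_bbox."""
    ix1, iy1, ix2, iy2 = image_bbox
    width, height = ix2 - ix1, iy2 - iy1
    x1, y1, x2, y2 = bbox
    return [
        (x1 - ix1) / width,
        (y1 - iy1) / height,
        (x2 - ix1) / width,
        (y2 - iy1) / height,
    ]


# Values taken from the example response in the OCR section
line_polygon = [[267.0, 139.0], [525.0, 139.0], [525.0, 159.0], [267.0, 159.0]]
line_bbox = polygon_to_bbox(line_polygon)
relative = normalize_bbox(line_bbox, [0.0, 0.0, 816.0, 1056.0])
```

Normalized coordinates are handy when the page is later rendered at a different resolution.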