Datalab supports the marker (document conversion) and ocr endpoints. All Datalab converters can be accessed via a /api/v1/{converter_name} endpoint.

Submit a Request
cURL
curl -X POST https://www.datalab.to/api/v1/{endpoint} \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@library/document.pdf"
This returns the following JSON response:
{
  "success": true,
  "error": null,
  "request_id": "<string>",
  "request_check_url": "<string>"
}
request_id is a unique identifier for the job; you can use it to retrieve the result.

Get a Result
curl -X GET https://www.datalab.to/api/v1/{endpoint}/{request_id} \
  -H "X-API-Key: YOUR_API_KEY"
Each endpoint has its own options and response formats. See below for more information.

Authentication

All requests to the Datalab API must include an X-API-Key header with your API key. Requests that upload a document are sent as multipart/form-data, with the file included as a file field; your HTTP client will set the multipart Content-Type and boundary for you. All responses are JSON.
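A convenient way to handle this in Python is to attach the header once to a requests.Session, so every call is authenticated automatically (a minimal sketch; the key value is a placeholder):

```python
import requests

# Attach the API key once; every request made through this
# session will carry the X-API-Key header automatically.
session = requests.Session()
session.headers["X-API-Key"] = "YOUR_API_KEY"

# Subsequent calls inherit the header, e.g.:
# session.post("https://www.datalab.to/api/v1/marker", files=form_data)
# session.get(request_check_url)
```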

Marker

The marker endpoint is available at /api/v1/marker. Here is an example request in Python:
import os
import requests

url = "https://www.datalab.to/api/v1/marker"

# Note: open() does not expand "~", so expand the home directory explicitly.
form_data = {
    "file": ("test.pdf", open(os.path.expanduser("~/pdfs/test.pdf"), "rb"), "application/pdf"),
    "force_ocr": (None, False),
    "paginate": (None, False),
    "output_format": (None, "markdown"),
    "use_llm": (None, False),
    "strip_existing_ocr": (None, False),
    "disable_image_extraction": (None, False),
}

headers = {"X-Api-Key": "YOUR_API_KEY"}

response = requests.post(url, files=form_data, headers=headers)
data = response.json()
Everything is a form parameter. This is because we’re uploading a file, so the request body has to be multipart/form-data.

Parameters:
  • file - the input file.
  • file_url - a URL pointing to the input file. Either file or file_url can be provided.
  • output_format - one of json, html, or markdown.
  • force_ocr will force OCR on every page (ignore the text in the PDF). This is slower, but can be useful for PDFs with known bad text.
  • format_lines will partially OCR the lines to properly include inline math and styling (bold, superscripts, etc.). This is faster than force_ocr.
  • paginate - adds delimiters to the output pages. See the API reference for details.
  • use_llm - setting this to True will use an LLM to enhance accuracy of forms, tables, inline math, and layout. It can be much more accurate, but carries a small hallucination risk. Setting use_llm to True will make responses slower.
  • strip_existing_ocr - setting to True will remove all existing OCR text from the file and redo OCR. This is useful if you know OCR text was added to the PDF by a low-quality OCR tool.
  • disable_image_extraction - setting to True will disable extraction of images. If use_llm is set to True, this will also turn images into text descriptions.
  • max_pages - the maximum number of pages to process, counted from the start of the file.
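For instance, to submit by URL instead of uploading a file, send file_url as a plain form field (a sketch; the URL is a placeholder and only the form construction is shown, not the request itself):

```python
# Multipart form fields use (filename, value) tuples;
# (None, value) marks a plain field rather than a file upload.
form_data = {
    "file_url": (None, "https://example.com/sample.pdf"),
    "output_format": (None, "markdown"),
    "max_pages": (None, 10),
}

# Then submit as before:
# response = requests.post("https://www.datalab.to/api/v1/marker",
#                          files=form_data, headers=headers)
```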
You can see a full list of parameters and descriptions in the API reference. The request will return the following:
{'success': True, 'error': None, 'request_id': "PpK1oM-HB4RgrhsQhVb2uQ", 'request_check_url': 'https://www.datalab.to/api/v1/marker/PpK1oM-HB4RgrhsQhVb2uQ'}
You will then need to poll request_check_url, like this:
import time

max_polls = 300
check_url = data["request_check_url"]

for i in range(max_polls):
    time.sleep(2)
    response = requests.get(check_url, headers=headers) # Don't forget to send the auth headers
    data = response.json()

    if data["status"] == "complete":
        break
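If you want the loop to surface failures and timeouts instead of exiting silently, one option (a sketch, not part of any official client) is to wrap it in a helper; the get argument exists only to make the helper easy to test:

```python
import time
import requests

def poll_result(check_url, headers, max_polls=300, interval=2, get=requests.get):
    """Poll check_url until the job completes, fails, or we give up."""
    for _ in range(max_polls):
        data = get(check_url, headers=headers).json()
        if data.get("status") == "complete":
            if not data.get("success", True):
                # The job finished but reported an error.
                raise RuntimeError(data.get("error") or "conversion failed")
            return data
        time.sleep(interval)
    raise TimeoutError(f"job did not complete after {max_polls} polls")
```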
Eventually, the status field will be set to complete, and you will get an object that looks like this:
{
    "output_format": "markdown",
    "markdown": "...",
    "status": "complete",
    "success": True,
    "images": {...},
    "metadata": {...},
    "error": "",
    "page_count": 5
}
If success is False, you will get an error code along with the response. All response data is deleted from Datalab servers one hour after processing completes, so make sure to retrieve your results before then.

Response fields

  • output_format is the requested output format, json, html, or markdown.
  • markdown | json | html is the output from the file. It will be named according to the output_format. You can find more details on the json format in the API reference.
  • status - indicates the status of the request (complete, or processing).
  • success - indicates if the request completed successfully. True or False.
  • images - dictionary of image filenames (keys) and base64 encoded images (values). Each value can be decoded with base64.b64decode(value). Then it can be saved to the filename (key).
  • metadata - metadata about the markdown conversion.
  • error - if there was an error, this contains the error message.
  • page_count - number of pages that were converted.
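The images field described above can be written to disk by base64-decoding each value; a minimal sketch (the helper name is ours, not part of the API):

```python
import base64
import os

def save_images(images, out_dir="."):
    """Decode base64 image values and save them under their filename keys."""
    paths = []
    for filename, b64_data in images.items():
        path = os.path.join(out_dir, filename)
        with open(path, "wb") as f:
            f.write(base64.b64decode(b64_data))
        paths.append(path)
    return paths

# Usage: save_images(data["images"], out_dir="output")
```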

Structured Extraction

While marker generally lets you convert PDFs into HTML, JSON, or Markdown, Structured Extraction is a different output mode of marker that lets you go one step further and extract only the fields you care about. You can do this by setting the page_schema parameter in your marker request, which forces it to fill in your schema after PDF conversion finishes. The easiest way to generate this correctly is to use our editor in Forge Extract, or to define a Pydantic model and serialize its JSON schema (e.g. json.dumps(Model.model_json_schema())). For example, if you were extracting information from a research paper, you might upload the PDF and define a schema describing only the fields you’re interested in, e.g. the title, authors, clinical_trial_inclusion_criteria, or other fields related to the methodology or protocol. The description field is optional, but providing the right context helps improve extraction accuracy.
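If you prefer not to use Pydantic, page_schema is just JSON Schema, so you can also build it as a plain dictionary and serialize it yourself (the field names here are illustrative):

```python
import json

# A JSON Schema describing only the fields we want extracted.
schema = {
    "type": "object",
    "title": "PaperMetadata",
    "properties": {
        "title": {
            "type": "string",
            "description": "the title of the paper",
        },
        "authors": {
            "type": "array",
            "description": "the authors who wrote this paper",
            "items": {"type": "string"},
        },
    },
    "required": ["title", "authors"],
}

# page_schema must be sent as a string form field.
page_schema = json.dumps(schema)
```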

Making a Request

Here’s a representative example:
import time

import requests
url = "https://www.datalab.to/api/v1/marker"
schema = """{
  "type": "object",
  "title": "ExampleExtractionSchema",
  "description": "Example schema to demonstrate metadata extraction from a paper",
  "properties": {
    "title": {
      "type": "string",
      "description": "the title of the paper"
    },
    "authors": {
      "type": "array",
      "description": "the authors who wrote this paper",
      "items": {
        "type": "object",
        "properties": {
          "name": {
            "type": "string",
            "description": "the name of the author"
          }
        }
      }
    }
  },
  "required": [
    "title",
    "authors"
  ]
}"""

form_data = {
    'file': ('document.pdf', open('document.pdf', 'rb'), 'application/pdf'),
    'page_schema': (None, schema),
    'output_format': (None, 'json'),
    'use_llm': (None, True)
}

headers = {"X-Api-Key": "YOUR_API_KEY"}
response = requests.post(url, files=form_data, headers=headers)
data = response.json()

# Then you need to poll for completion like in the Marker example earlier by using `request_check_url` from the response, e.g.

max_polls = 300
check_url = data["request_check_url"]

for i in range(max_polls):
    time.sleep(2)
    response = requests.get(check_url, headers=headers) # Don't forget to send the auth headers
    data = response.json()

    if data["status"] == "complete":
        break

# Do something with `data` after processing finishes

#
# Note that:
#  - the "json" field in the response is the document tree rendered as json (with bounding boxes).
#  - the "extraction_schema_json" field in the response is a stringified json field containing your filled in schema.
#

Optimal Settings

  • Structured Extraction is essentially an “output mode” of Marker
  • By passing in page_schema, you force extraction to run after Marker finishes parsing your PDF into Markdown, etc.
  • We recommend the following settings when using Marker with extraction, though you may want to evaluate different combinations for your specific use case, as they impact accuracy and speed:
    • force_ocr: true
    • format_lines: false
    • use_llm: false

Response Handling

  • All extraction responses are available in the json field of our response. It comes as a string, but you should be able to safely parse it with json.loads(...) in Python (or JSON.parse(...) in JavaScript).
  • For each field that we fill in from the original schema, we include an extra [fieldname]_citations referencing any block IDs from the original document. NOTE: This is currently an experimental feature and may change in the future (into a different metadata object, with the full document tree).
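Putting the two fields together, parsing a finished extraction response might look like this (the sample payload below is fabricated and pared down for illustration; real responses carry full document data):

```python
import json

# A pared-down stand-in for a completed structured-extraction response.
data = {
    "status": "complete",
    "success": True,
    "json": "{\"children\": []}",
    "extraction_schema_json": json.dumps({
        "title": "Subspace Adversarial Training",
        "title_citations": ["/page/0/SectionHeader/1"],
        "authors": [{"name": "A. Author"}],
        "authors_citations": [],
    }),
}

document_tree = json.loads(data["json"])                # document tree, with bboxes
extracted = json.loads(data["extraction_schema_json"])  # your filled-in schema

print(extracted["title"])
print(extracted["title_citations"])  # experimental: block IDs backing the value
```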

OCR

The OCR endpoint at /api/v1/ocr runs OCR on a given document and returns detailed character and positional information. Here is an example request in Python:
import os
import requests

url = "https://www.datalab.to/api/v1/ocr"

# Note: open() does not expand "~", so expand the home directory explicitly.
form_data = {
    'file': ('test.png', open(os.path.expanduser('~/images/test.png'), 'rb'), 'image/png'),
}

headers = {"X-Api-Key": "YOUR_API_KEY"}

response = requests.post(url, files=form_data, headers=headers)
data = response.json()
The request will return the following:
{'success': True, 'error': None, 'request_id': "PpK1oM-HB4RgrhsQhVb2uQ", 'request_check_url': 'https://www.datalab.to/api/v1/ocr/PpK1oM-HB4RgrhsQhVb2uQ'}
You will then need to poll request_check_url, as seen above in the marker section. The final response will look like this:

{
    'status': 'complete',
    'pages': [
        {
            'text_lines': [{
                'polygon': [[267.0, 139.0], [525.0, 139.0], [525.0, 159.0], [267.0, 159.0]],
                'confidence': 0.99951171875,
                'text': 'Subspace Adversarial Training',
                'bbox': [267.0, 139.0, 525.0, 159.0]
            }, ...],
            'image_bbox': [0.0, 0.0, 816.0, 1056.0],
            'page': 12
        }
    ],
    'success': True,
    'error': '',
    'page_count': 5
}
If success is False, you will get an error code along with the response. All response data is deleted from Datalab servers one hour after processing completes, so make sure to retrieve your results before then.

Response fields

  • status - indicates the status of the request (complete, or processing).
  • success - indicates if the request completed successfully. True or False.
  • error - If there was an error, this is the error message.
  • page_count - number of pages we ran OCR on.
  • pages - a list containing one dictionary per input page. The fields are:
    • text_lines - the detected text and bounding boxes for each line
      • text - the text in the line
      • confidence - the confidence of the model in the detected text (0-1)
      • polygon - the polygon for the text line in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are in clockwise order from the top left.
      • bbox - the axis-aligned rectangle for the text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
      • chars - the individual characters in the line
        • text - the text of the character
        • bbox - the character bbox (same format as line bbox)
        • polygon - the character polygon (same format as line polygon)
        • confidence - the confidence of the model in the detected character (0-1)
        • bbox_valid - if the character is a special token or math, the bbox may not be valid
    • page - the page number in the file
    • image_bbox - the bbox for the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.
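Given the fields above, reconstructing plain text per page is straightforward; a sketch over a fabricated, pared-down response:

```python
def page_texts(result):
    """Map page number -> newline-joined line text from an OCR result."""
    return {
        page["page"]: "\n".join(line["text"] for line in page["text_lines"])
        for page in result["pages"]
    }

# Example with a pared-down response:
result = {
    "pages": [
        {
            "page": 12,
            "image_bbox": [0.0, 0.0, 816.0, 1056.0],
            "text_lines": [
                {"text": "Subspace Adversarial Training",
                 "bbox": [267.0, 139.0, 525.0, 159.0],
                 "confidence": 0.9995},
            ],
        }
    ],
}

print(page_texts(result))  # {12: 'Subspace Adversarial Training'}
```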