marker (document conversion), and ocr endpoints. All Datalab converters can be accessed via a /api/v1/{converter_name} endpoint.
Submit a Request
request_id is a unique identifier for the job. You can use it to get a result.
Get a Result
Authenticate each request with an X-API-Key header set to your API key.
The API always accepts JSON in request bodies and returns JSON in response bodies. You will need to send the content-type: application/json header in requests. Data is uploaded as a multipart/form-data request, with the file included as a file field.
The marker endpoint is available at /api/v1/marker.
Here is an example request in Python:
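The sketch below is one way such a request could look; it is not an official client. It assumes the third-party requests library. The base URL, the helper names (endpoint_url, build_marker_fields, submit_marker), and the subset of parameters exposed are illustrative assumptions. Only the /api/v1/marker path, the X-API-Key header, the file field, and the parameter names come from this page.

```python
from pathlib import Path

# Assumption: the base URL is not stated in this section. Replace it with
# the host given in your Datalab account or the API reference.
BASE_URL = "https://www.datalab.to"

VALID_FORMATS = {"json", "html", "markdown"}


def endpoint_url(converter_name, base_url=BASE_URL):
    """Build the /api/v1/{converter_name} URL for a converter."""
    return f"{base_url.rstrip('/')}/api/v1/{converter_name}"


def build_marker_fields(output_format="markdown", force_ocr=False,
                        use_llm=False, max_pages=None):
    """Return the non-file form fields as requests-style (None, value) tuples."""
    if output_format not in VALID_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")
    fields = {
        "output_format": (None, output_format),
        "force_ocr": (None, str(force_ocr)),
        "use_llm": (None, str(use_llm)),
    }
    if max_pages is not None:
        fields["max_pages"] = (None, str(max_pages))
    return fields


def submit_marker(pdf_path, api_key, **options):
    """POST the file to the marker endpoint and return the decoded JSON reply."""
    # Third-party dependency, imported lazily so the helpers above
    # can be used (and tested) without it.
    import requests

    fields = build_marker_fields(**options)
    with open(pdf_path, "rb") as fh:
        # The document itself goes in the "file" field of the multipart body.
        fields["file"] = (Path(pdf_path).name, fh, "application/pdf")
        response = requests.post(
            endpoint_url("marker"),
            files=fields,
            headers={"X-API-Key": api_key},
            timeout=60,
        )
    response.raise_for_status()
    return response.json()
```

Passing every field through files= makes requests encode the whole body as multipart/form-data and set the matching content type header itself.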
Each parameter is sent as a separate field in the multipart/form-data request.
Parameters:
file - the input file.
output_format - one of json, html, or markdown.
force_ocr - will force OCR on every page (ignore the text in the PDF). This is slower, but can be useful for PDFs with known bad text.
format_lines - will partially OCR the lines to properly include inline math and styling (bold, superscripts, etc.). This is faster than force_ocr.
paginate - adds delimiters to the output pages. See the API reference for details.
use_llm - setting this to True will use an LLM to enhance accuracy of forms, tables, inline math, and layout. It can be much more accurate, but carries a small hallucination risk. Setting use_llm to True will make responses slower.
strip_existing_ocr - setting this to True will remove all existing OCR text from the file and redo OCR. This is useful if you know OCR text was added to the PDF by a low-quality OCR tool.
disable_image_extraction - setting this to True will disable extraction of images. If use_llm is set to True, this will also turn images into text descriptions.
max_pages - from the start of the file, specifies the maximum number of pages to process.
You can then poll for the result using request_check_url, like this:
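A minimal polling sketch, under stated assumptions: the function name poll_result, the fetch_json callable, and the default of 300 polls at 2-second intervals are illustrative choices, not values documented here. The only behavior taken from this page is that the job is finished once the status field equals complete.

```python
import time


def poll_result(check_url, fetch_json, max_polls=300, interval=2.0,
                sleep=time.sleep):
    """Poll check_url until the job's status is "complete".

    fetch_json is any callable that takes a URL and returns the decoded
    JSON body as a dict, for example a small wrapper around an HTTP GET
    that sends the X-API-Key header.
    """
    for _ in range(max_polls):
        data = fetch_json(check_url)
        if data.get("status") == "complete":
            return data
        # Still processing: wait before asking again.
        sleep(interval)
    raise TimeoutError(f"no result after {max_polls} polls of {check_url}")
```

With the requests library, fetch_json could be as small as `lambda url: requests.get(url, headers={"X-API-Key": api_key}, timeout=30).json()`. Keeping the HTTP call outside the loop makes the polling logic easy to test without network access.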
Once processing finishes, the status field will be set to complete, and you will get an object that looks like this:
If success is False, you will get an error code along with the response.
All response data will be deleted from Datalab servers an hour after processing is complete, so make sure to get your results by then.
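Because results expire, it can help to write them to disk as soon as the job completes. The sketch below assumes a completed marker result already decoded into a Python dict, with the output_format, success, error, and images fields documented on this page. The function name save_marker_result, the file extensions, and the default file stem are hypothetical choices made for illustration.

```python
import base64
import json
from pathlib import Path

# Assumption: these file extensions are chosen for illustration only.
EXTENSIONS = {"markdown": "md", "html": "html", "json": "json"}


def save_marker_result(result, out_dir, stem="document"):
    """Write the converted document and its images under out_dir."""
    if not result.get("success"):
        raise RuntimeError(f"conversion failed: {result.get('error')}")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # The output lives under a key named after the requested format.
    fmt = result["output_format"]
    content = result[fmt]
    if not isinstance(content, str):
        # json output may already be a parsed structure; serialize it.
        content = json.dumps(content, indent=2)
    doc_path = out / f"{stem}.{EXTENSIONS[fmt]}"
    doc_path.write_text(content, encoding="utf-8")

    written = [doc_path]
    for name, encoded in (result.get("images") or {}).items():
        # Keep only the file name so a key cannot escape out_dir.
        image_path = out / Path(name).name
        image_path.write_bytes(base64.b64decode(encoded))
        written.append(image_path)
    return written
```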
output_format is the requested output format, json, html, or markdown.
markdown | json | html is the output from the file. It will be named according to the output_format. You can find more details on the json format here.
status - indicates the status of the request (complete, or processing).
success - indicates if the request completed successfully. True or False.
images - dictionary of image filenames (keys) and base64 encoded images (values). Each value can be decoded with base64.b64decode(value). Then it can be saved to the filename (key).
meta - metadata about the markdown conversion.
error - if there was an error, this contains the error message.
page_count - number of pages that were converted.
/api/v1/ocr will run OCR on a given page and return detailed character and positional information.
Here is an example request in Python:
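One possible shape for that request is sketched below. It assumes the third-party requests library and sends only the file field, since no other ocr parameters are listed in this section. The function name submit_ocr, the base_url argument, and the injectable post callable are hypothetical and exist mainly to keep the sketch testable offline.

```python
from pathlib import Path

OCR_PATH = "/api/v1/ocr"


def submit_ocr(file_path, api_key, base_url, post=None):
    """Upload file_path to the ocr endpoint and return the decoded JSON reply."""
    if post is None:
        # Third-party dependency, only needed when no custom sender is given.
        import requests
        post = requests.post
    with open(file_path, "rb") as fh:
        response = post(
            base_url.rstrip("/") + OCR_PATH,
            files={"file": (Path(file_path).name, fh)},
            headers={"X-API-Key": api_key},
            timeout=60,
        )
    response.raise_for_status()
    return response.json()
```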
You can then poll request_check_url, as seen above in the marker section.
The final response will look like this:
If success is False, you will get an error code along with the response.
All response data will be deleted from Datalab servers an hour after processing is complete, so make sure to get your results by then.
status - indicates the status of the request (complete, or processing).
success - indicates if the request completed successfully. True or False.
error - if there was an error, this is the error message.
page_count - number of pages we ran OCR on.
pages - a list containing one dictionary per input page. The fields are:
  text_lines - the detected text and bounding boxes for each line
    text - the text in the line
    confidence - the confidence of the model in the detected text (0-1)
    polygon - the polygon for the text line in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are in clockwise order from the top left.
    bbox - the axis-aligned rectangle for the text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
    chars - the individual characters in the line
      text - the text of the character
      bbox - the character bbox (same format as line bbox)
      polygon - the character polygon (same format as line polygon)
      confidence - the confidence of the model in the detected character (0-1)
      bbox_valid - if the character is a special token or math, the bbox may not be valid
  page - the page number in the file
  image_bbox - the bbox for the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.
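To make the nesting concrete, here is a small sketch that walks a completed ocr response. It assumes the response has already been decoded into a Python dict with the fields listed above. The helper names extract_lines and bbox_size and the min_confidence threshold are illustrative, not part of the API.

```python
def extract_lines(ocr_result, min_confidence=0.0):
    """Yield (page, text, bbox) for each line at or above min_confidence."""
    if not ocr_result.get("success"):
        raise RuntimeError(f"OCR failed: {ocr_result.get('error')}")
    for page in ocr_result["pages"]:
        for line in page["text_lines"]:
            if line["confidence"] >= min_confidence:
                yield page["page"], line["text"], tuple(line["bbox"])


def bbox_size(bbox):
    """Return (width, height) of an axis-aligned (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = bbox
    return x2 - x1, y2 - y1
```

For example, `[text for _, text, _ in extract_lines(result, min_confidence=0.8)]` would keep only the lines the model is reasonably sure about.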