Accurately parse a PDF into a more useful format!
It is built on `marker` and `surya`, with a few enhancements for our platform / self-serve version.
The endpoint is `/api/v1/marker`.
Here is an example request in Python:
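A minimal sketch of such a request follows. Several details are assumptions rather than facts stated on this page: the host name in `API_URL`, the `X-Api-Key` authentication header, the use of the third-party `requests` library, and the encoding of booleans as the strings `"True"` / `"False"`. The helper names `build_form` and `submit` are hypothetical. Check all of these against the API reference before relying on them.

```python
from pathlib import Path
from typing import Optional

# Assumption: only the /api/v1/marker path is documented here; the host is a guess.
API_URL = "https://www.datalab.to/api/v1/marker"

ALLOWED_FORMATS = {"json", "html", "markdown"}


def build_form(
    source: str,
    *,
    is_url: bool = False,
    output_format: str = "markdown",
    force_ocr: bool = False,
    use_llm: bool = False,
    max_pages: Optional[int] = None,
) -> dict:
    """Build multipart fields in the (filename, content) tuple form that `requests` accepts.

    A filename of None makes `requests` send the entry as a plain form field.
    """
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")

    form = {
        "output_format": (None, output_format),
        # Assumption: the server accepts booleans serialized as "True"/"False".
        "force_ocr": (None, str(force_ocr)),
        "use_llm": (None, str(use_llm)),
    }
    if max_pages is not None:
        form["max_pages"] = (None, str(max_pages))

    if is_url:
        form["file_url"] = (None, source)
    else:
        path = Path(source)
        form["file"] = (path.name, path.read_bytes())
    return form


def submit(form: dict, api_key: str) -> dict:
    """POST the form as multipart/form-data and return the decoded JSON reply."""
    import requests  # third-party; imported here so build_form works without it

    response = requests.post(
        API_URL,
        files=form,  # passing everything via `files` forces multipart encoding
        headers={"X-Api-Key": api_key},  # assumption: header name not stated here
        timeout=60,
    )
    response.raise_for_status()
    return response.json()
```

Usage would look like `submit(build_form("paper.pdf", use_llm=True), api_key="YOUR_API_KEY")`, or `build_form("https://example.com/paper.pdf", is_url=True)` to send `file_url` instead of `file`.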
The request body is sent as `multipart/form-data`.
Parameters:
- `file` - the input file.
- `file_url` - a URL pointing to the input file. Either `file` or `file_url` can be provided.
- `output_format` - one of `json`, `html`, or `markdown`.
- `force_ocr` - forces OCR on every page (ignoring the text in the PDF). This is slower, but can be useful for PDFs with known bad text.
- `format_lines` - partially OCRs the lines to properly include inline math and styling (bold, superscripts, etc.). This is faster than `force_ocr`.
- `paginate` - adds delimiters to the output pages. See the API reference for details.
- `use_llm` - setting this to `True` will use an LLM to enhance accuracy of forms, tables, inline math, and layout. It can be much more accurate, but carries a small hallucination risk. Setting `use_llm` to `True` will make responses slower.
- `strip_existing_ocr` - setting to `True` will remove all existing OCR text from the file and redo OCR. This is useful if you know OCR text was added to the PDF by a low-quality OCR tool.
- `disable_image_extraction` - setting to `True` will disable extraction of images. If `use_llm` is set to `True`, this will also turn images into text descriptions.
- `max_pages` - the maximum number of pages to process, counted from the start of the file.

Then poll `request_check_url`, like this:
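A minimal polling sketch, using only the standard library. The `X-Api-Key` header, the two-second delay, and the poll limit are assumptions, and `fetch_json` and `poll_until_complete` are hypothetical helper names. The only behavior taken from this page is that the reply carries a `status` field that becomes `complete`.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_json(url: str, api_key: str) -> dict:
    """GET `url` and decode the JSON body."""
    # Assumption: authentication uses an X-Api-Key header.
    request = urllib.request.Request(url, headers={"X-Api-Key": api_key})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def poll_until_complete(
    check_url: str,
    fetch: Callable[[str], dict],
    max_polls: int = 300,
    delay_seconds: float = 2.0,
) -> dict:
    """Call `fetch(check_url)` repeatedly until the reply has status == "complete"."""
    for _ in range(max_polls):
        result = fetch(check_url)
        if result.get("status") == "complete":
            return result
        # Still processing: wait before asking again.
        time.sleep(delay_seconds)
    raise TimeoutError(f"no completed result after {max_polls} polls")
```

In practice this would be called as `poll_until_complete(check_url, lambda url: fetch_json(url, api_key))`. Passing the fetch function in keeps the loop easy to test without network access. A `complete` status does not by itself mean the conversion worked, so check `success` and `error` on the returned object as well.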
Once processing finishes, the `status` field will be set to `complete`, and you will get an object that looks like this:
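The concrete payload depends on the request. As a rough sketch of its shape, the fields listed on this page can be modelled as a `TypedDict`. The value types are assumptions inferred from the field descriptions, and `MarkerResult` and `get_output` are hypothetical names, not part of the API.

```python
from typing import Any, Optional, TypedDict


class MarkerResult(TypedDict, total=False):
    """Assumed shape of a finished result; types are inferred, not documented."""

    output_format: str        # "json", "html", or "markdown"
    markdown: Optional[str]   # present when output_format == "markdown"
    html: Optional[str]       # present when output_format == "html"
    json: Any                 # present when output_format == "json"
    status: str               # "complete" or "processing"
    success: bool
    images: dict[str, str]    # filename -> base64-encoded image
    meta: dict[str, Any]
    error: Optional[str]
    page_count: int


def get_output(result: MarkerResult) -> Any:
    """Return the converted content, which is stored under the key named by output_format."""
    if result.get("status") != "complete":
        raise RuntimeError("result is not ready yet")
    if not result.get("success"):
        raise RuntimeError(result.get("error") or "conversion failed")
    return result[result["output_format"]]
```

`get_output` relies on the rule that the output key is named according to `output_format`, so the same call works for all three formats.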
- `output_format` - the requested output format: `json`, `html`, or `markdown`.
- `markdown` | `json` | `html` - the output from the file. It will be named according to the `output_format`. The `json` format is documented in more detail elsewhere in these docs.
- `status` - indicates the status of the request (`complete` or `processing`).
- `success` - indicates if the request completed successfully. `True` or `False`.
- `images` - dictionary of image filenames (keys) and base64 encoded images (values). Each value can be decoded with `base64.b64decode(value)`, then saved to the filename (key).
- `meta` - metadata about the markdown conversion.
- `error` - if there was an error, this contains the error message.
- `page_count` - number of pages that were converted.

Set `output_format` to get other response formats. If you have more downstream needs, like taking that content and structuring it into a specific schema, you may be interested in Structured Extraction.
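The `images` field can be written to disk by decoding each value with `base64.b64decode` and saving it under its key, as described in the field list. The sketch below does that. The helper name `save_images` and the choice to keep only the final path component of each key are illustrative decisions, not part of the API.

```python
import base64
from pathlib import Path


def save_images(images: dict[str, str], out_dir: str) -> list[Path]:
    """Decode each base64 value and write it to out_dir under its filename key."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)

    written = []
    for filename, encoded in images.items():
        # Keep only the final path component so a key cannot escape out_dir.
        path = target / Path(filename).name
        path.write_bytes(base64.b64decode(encoded))
        written.append(path)
    return written
```

For a finished result this would be called as `save_images(result["images"], "output")`, ideally right after polling completes, since results are only retained for a limited time.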
Important: All response data is deleted from Datalab servers one hour after processing completes, so make sure to retrieve your results before then.