Marker - Datalab Documentation

Marker

curl --request POST \
  --url https://www.datalab.to/api/v1/marker \
  --header 'Content-Type: multipart/form-data' \
  --header 'X-API-Key: <api-key>' \
  --form 'file_url=<string>' \
  --form max_pages=123 \
  --form 'page_range=<string>' \
  --form 'langs=<string>' \
  --form force_ocr=false \
  --form format_lines=false \
  --form paginate=false \
  --form strip_existing_ocr=false \
  --form disable_image_extraction=false \
  --form disable_ocr_math=false \
  --form use_llm=false \
  --form mode=fast \
  --form 'output_format=<string>' \
  --form skip_cache=false \
  --form save_checkpoint=false \
  --form 'block_correction_prompt=<string>' \
  --form 'page_schema=<string>' \
  --form 'segmentation_schema=<string>' \
  --form 'additional_config=<string>' \
  --form workflowstepdata_id=123 \
  --form file=@example-file

{
  "success": true,
  "error": "<string>",
  "request_id": "<string>",
  "request_check_url": "<string>",
  "versions": {}
}

POST

api

marker

Marker

curl --request POST \
  --url https://www.datalab.to/api/v1/marker \
  --header 'Content-Type: multipart/form-data' \
  --header 'X-API-Key: <api-key>' \
  --form 'file_url=<string>' \
  --form max_pages=123 \
  --form 'page_range=<string>' \
  --form 'langs=<string>' \
  --form force_ocr=false \
  --form format_lines=false \
  --form paginate=false \
  --form strip_existing_ocr=false \
  --form disable_image_extraction=false \
  --form disable_ocr_math=false \
  --form use_llm=false \
  --form mode=fast \
  --form 'output_format=<string>' \
  --form skip_cache=false \
  --form save_checkpoint=false \
  --form 'block_correction_prompt=<string>' \
  --form 'page_schema=<string>' \
  --form 'segmentation_schema=<string>' \
  --form 'additional_config=<string>' \
  --form workflowstepdata_id=123 \
  --form file=@example-file

{
  "success": true,
  "error": "<string>",
  "request_id": "<string>",
  "request_check_url": "<string>",
  "versions": {}
}

Authorizations

X-API-Key

string

header

required

Cookies

access_token

string

Body

multipart/form-data

file_url

string | null

Optional file URL (http/https). If provided, the server will download and process it.

max_pages

integer | null

The maximum number of pages in the PDF to convert.

page_range

string | null

The page range to parse, comma separated like 0,5-10,20. This will override max_pages if provided. Example: '0,2-4' will process pages 0, 2, 3, and 4.

langs

string | null

Note: This parameter has been deprecated, and will be ignored in the current version. The languages to use if OCR is needed, comma separated. Must be either the names or codes from https://github.com/datalab-to/surya/blob/master/surya/languages.py. Any other inputs will be ignored.

force_ocr

boolean

default:false

Force OCR on all pages of the PDF. Defaults to False. This can lead to worse results if you have good text in your PDFs (which is true in most cases).

format_lines

boolean

default:false

Format the lines in the output. Defaults to False. If set to True, the lines will be formatted to detect inline math and styles.

paginate

boolean

default:false

Whether to paginate the output. Defaults to False. If set to True, each page of the output will be separated by a horizontal rule that contains the page number (2 newlines, {PAGE_NUMBER}, 48 - characters, 2 newlines).

strip_existing_ocr

boolean

default:false

Strip existing OCR text from the PDF and re-run OCR. If force_ocr is set, this will be ignored. Defaults to False.

disable_image_extraction

boolean

default:false

Disable image extraction from the PDF. If use_llm is also set, then images will be automatically captioned. Defaults to False.

disable_ocr_math

boolean

default:false

Disable inline math recognition in OCR.

use_llm

boolean

default:false

Significantly improves accuracy by using an LLM to enhance tables, forms, inline math, and layout detection. Will increase latency. Defaults to False.

mode

string

default:fast

Which output mode to use - fast has the lowest latency and preserves the most positional information. Accurate is the slowest, and preserves the least.

output_format

string | null

The output format for the text. Can be 'json', 'html', 'markdown', or 'chunks'. Defaults to 'markdown'. You can comma separate multiple formats, like markdown,html.

skip_cache

boolean

default:false

Skip the cache and re-run the inference. Defaults to False. If set to True, the cache will be skipped and the inference will be re-run.

save_checkpoint

boolean

default:false

Save the checkpoint after processing. Defaults to False. This is only useful if you're applying custom rules iteratively.

block_correction_prompt

string | null

An optional prompt that marker will use to improve the output, and align it to specific requirements.

page_schema

string | null

The schema to use for structured extraction (only used with structured extraction endpoint). The ideal way to generate this is to create a Pydantic schema, then convert to JSON with .model_dump_json().

segmentation_schema

string | null

The schema to use for document segmentation. Should be a JSON string containing segment names and descriptions for identifying page ranges of different document sections.

additional_config

string | null

Additional configuration options for marker. This should be a JSON string with key-value pairs. For example, '{"key": "value"}'. This supports these keys: ['disable_links', 'keep_pageheader_in_output', 'keep_pagefooter_in_output', 'filter_blank_pages', 'drop_repeated_text', 'layout_coverage_threshold', 'merge_threshold', 'height_tolerance', 'gap_threshold', 'image_threshold', 'min_line_length', 'level_count', 'default_level', 'no_merge_tables_across_pages', 'force_layout_block']

workflowstepdata_id

integer | null

Optional workflow step data ID. If provided, this request will be associated with the specified workflow step execution.

file

file | null

Input PDF, word document, powerpoint, or image file, uploaded as multipart form data. Images must be png, jpg, or webp format.

Response

Successful Response

request_id

string

required

The ID of the request. This ID can be used to check the status of the request.

request_check_url

string

required

The URL to check the status of the request and get results.

success

boolean

default:true

Whether the request was successful.

error

string | null

If the request was not successful, this will contain an error message.

versions

A dictionary of the versions of the libraries used in the request.

Generate Extraction Schemas Table Recognition

⌘I

API Reference

Authorizations

Cookies

Body

Response