Major changes to the Datalab hosted service are listed here.
Added page_schema to the marker endpoint to extract structured data from documents. The schema should be a pydantic schema generated with .model_json_schema(), or another JSON schema format.
Added the chunks output format for marker, a simplified list of blocks with their full HTML, ideal for chunking/RAG.
Added block_correction_prompt to the marker endpoint to correct the output of marker with your custom logic.
Added the additional_config parameter. This is a JSON object whose keys are configuration options and whose values are the settings for those options. You can see the exact options in the API schema.
output_format for marker.
Added the format_lines flag to marker to add inline math and formatting to lines. (This will also automatically OCR lines that need it.) use_llm.
use_llm (the high accuracy mode) now costs the same as regular inference.
Improved the --use_llm option to merge tables across pages, OCR handwriting, OCR forms, and generally produce much higher quality output than before.
table_rec endpoint - it now takes the --use_llm flag, and should run much faster.
Added the use_llm option to the marker API - this uses an LLM to make conversion much more accurate for tables, forms, inline math, and complex pages. It’s a beta feature, and will currently double the cost per request.
Added disable_image_extraction to disable image extraction for marker.
Added strip_existing_ocr to strip all existing OCR text and re-OCR (if it was added by something like tesseract).
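
The page_schema and additional_config entries above both describe JSON values passed alongside a marker request. The following is a minimal sketch of how such values might be assembled in Python. It uses a hand-written JSON schema (the "another JSON schema format" case) so that it needs only the standard library. The helper build_marker_form, the invoice fields, and the example_option key are hypothetical, and serializing the JSON values to strings is an assumption; the API schema is the authority on the actual field encoding and on which options exist.

```python
import json

# Hand-written JSON schema for the data to extract.
# The invoice fields are hypothetical and only illustrate the shape.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["invoice_number", "total"],
}


def build_marker_form(page_schema, additional_config=None, use_llm=False):
    """Assemble request fields for the marker endpoint.

    Hypothetical helper: sending the JSON values as serialized strings
    is an assumption, not something the changelog specifies.
    """
    form = {
        "use_llm": use_llm,
        "page_schema": json.dumps(page_schema),
    }
    if additional_config:
        # Keys are configuration options, values are their settings.
        form["additional_config"] = json.dumps(additional_config)
    return form


form = build_marker_form(
    invoice_schema,
    additional_config={"example_option": True},  # hypothetical option name
    use_llm=True,
)
```

With pydantic v2 installed, the dictionary returned by a model's model_json_schema() class method could be passed in place of the hand-written schema.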