Datalab supports the following file extensions and MIME types, listed as extension/MIME type:
  • PDF
    • pdf/application/pdf
  • Spreadsheet
    • xls/application/vnd.ms-excel
    • xlsx/application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
    • ods/application/vnd.oasis.opendocument.spreadsheet
  • Word documents
    • doc/application/msword
    • docx/application/vnd.openxmlformats-officedocument.wordprocessingml.document
    • odt/application/vnd.oasis.opendocument.text
  • PowerPoint
    • ppt/application/vnd.ms-powerpoint
    • pptx/application/vnd.openxmlformats-officedocument.presentationml.presentation
    • odp/application/vnd.oasis.opendocument.presentation
  • HTML
    • html/text/html
  • EPUB
    • epub/application/epub+zip
  • Images
    • png/image/png
    • jpeg/image/jpeg
    • webp/image/webp
    • gif/image/gif
    • tiff/image/tiff
    • jpg/image/jpg
You can detect the MIME type automatically in Python by installing the filetype package and calling filetype.guess(FILEPATH).mime.
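
As a minimal sketch of that check, the snippet below guesses a file's MIME type and compares it against the list above. The helper names (guess_mime, is_supported) and the SUPPORTED_MIME_TYPES set are illustrative, not part of Datalab. Falling back to the standard-library mimetypes module when filetype is missing or cannot identify the file is an assumption added here. Note that filetype.guess returns None for content it does not recognize, so the .mime access needs a guard.

```python
import mimetypes

try:
    import filetype  # third-party package: pip install filetype
except ImportError:  # assumption: degrade to extension-based guessing
    filetype = None

# MIME types copied from the supported list above.
SUPPORTED_MIME_TYPES = {
    "application/pdf",
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.oasis.opendocument.spreadsheet",
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.oasis.opendocument.text",
    "application/vnd.ms-powerpoint",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "application/vnd.oasis.opendocument.presentation",
    "text/html",
    "application/epub+zip",
    "image/png",
    "image/jpeg",
    "image/webp",
    "image/gif",
    "image/tiff",
    "image/jpg",
}


def guess_mime(path):
    """Return the guessed MIME type for path, or None if it is unknown."""
    if filetype is not None:
        kind = filetype.guess(path)  # inspects the file's leading bytes
        if kind is not None:
            return kind.mime
    # Hypothetical fallback: guess from the file extension only.
    mime, _encoding = mimetypes.guess_type(path)
    return mime


def is_supported(path):
    """Return True if the guessed MIME type appears in the supported list."""
    return guess_mime(path) in SUPPORTED_MIME_TYPES
```

Calling is_supported(FILEPATH) before uploading lets you reject unsupported files early. filetype reads the file's leading bytes, so it may not recognize text-based formats such as HTML; the extension-based fallback covers that case.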

Troubleshooting

If you get bad output, setting format_lines or force_ocr to True is a good first step. Many PDFs contain a poor-quality embedded text layer. Marker attempts to detect this automatically and run OCR, but the detection is not always accurate. You can also pass the block_correction_prompt field if there are specific things you want to change about the output.
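
The options above can be collected into a small helper, sketched here. Only the field names (format_lines, force_ocr, block_correction_prompt) come from this page. The function name and the choice to return a plain dict are hypothetical, and how the fields are encoded and sent depends on the client you use for the request.

```python
def build_troubleshooting_options(format_lines=False, force_ocr=False,
                                  block_correction_prompt=None):
    """Collect the troubleshooting fields to send along with a request.

    Hypothetical helper: it only assembles the fields and does not
    call the API.
    """
    options = {}
    if format_lines:
        options["format_lines"] = True
    if force_ocr:
        options["force_ocr"] = True
    if block_correction_prompt:
        options["block_correction_prompt"] = block_correction_prompt
    return options


# First retry: force OCR because the embedded PDF text looks wrong.
retry_options = build_troubleshooting_options(force_ocr=True)
```

Only the fields you enable are included, so anything left unset stays out of the request.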