pdf    application/pdf
xls    application/vnd.ms-excel
xlsx   application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
ods    application/vnd.oasis.opendocument.spreadsheet
doc    application/msword
docx   application/vnd.openxmlformats-officedocument.wordprocessingml.document
odt    application/vnd.oasis.opendocument.text
ppt    application/vnd.ms-powerpoint
pptx   application/vnd.openxmlformats-officedocument.presentationml.presentation
odp    application/vnd.oasis.opendocument.presentation
html   text/html
epub   application/epub+zip
png    image/png
jpeg   image/jpeg
webp   image/webp
gif    image/gif
tiff   image/tiff
jpg    image/jpg
filetype, then using filetype.guess(FILEPATH).mime.
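The filetype lookup can be sketched as follows. This is a minimal example, not part of the original instructions: the helper name guess_mime and the fallback to the standard-library mimetypes module are assumptions added for illustration. The only call taken from the text is filetype.guess(FILEPATH).mime.

```python
import mimetypes

try:
    import filetype  # third-party package: pip install filetype
except ImportError:
    filetype = None  # fall back to extension-based guessing below


def guess_mime(path):
    """Return a MIME type string for `path`, or None if it cannot be determined.

    `guess_mime` is a hypothetical helper name used only in this sketch.
    """
    if filetype is not None:
        # filetype.guess inspects the file's leading bytes and
        # returns None when it does not recognise the signature.
        kind = filetype.guess(path)
        if kind is not None:
            return kind.mime
    # Assumption: when content sniffing is unavailable or inconclusive,
    # guess from the file extension using the standard library.
    mime, _encoding = mimetypes.guess_type(path)
    return mime
```

Note that filetype.guess returns None for files it cannot identify, so calling .mime on the result directly raises an AttributeError in that case; the check above avoids this.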
format_lines or force_ocr to True is a good first step. Many PDFs contain poor-quality embedded text. Marker attempts to auto-detect this and run OCR, but the auto-detection is not 100% accurate. You can also pass the block_correction_prompt field if you have specific things you want to change about the output.
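As a rough sketch, the three fields mentioned above could be collected into one set of request options. The helper name build_options and the use of a plain dict are assumptions for illustration; only the field names format_lines, force_ocr, and block_correction_prompt come from the text, and how they are actually submitted depends on the client you use.

```python
def build_options(force_ocr=False, format_lines=False, block_correction_prompt=None):
    """Collect the troubleshooting fields into a dict (hypothetical helper)."""
    options = {
        "force_ocr": force_ocr,        # run OCR even if auto-detection would skip it
        "format_lines": format_lines,  # field name taken from the text above
    }
    # Only include the prompt when one was provided.
    if block_correction_prompt:
        options["block_correction_prompt"] = block_correction_prompt
    return options


# Example: force OCR and describe a specific change wanted in the output.
options = build_options(
    force_ocr=True,
    block_correction_prompt="Render all dates in ISO 8601 format.",
)
```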