Preprocessor

The Preprocessor step converts files of several types into a cleaned text format. It accepts the following file types:

  • .pdf
  • .docx
  • .txt
  • .md
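
As a rough illustration of how input files might be screened against the list above before they reach the step, here is a minimal sketch. The names `SUPPORTED_EXTENSIONS`, `is_supported`, and `split_by_support` are hypothetical and not part of the Preprocessor itself, and treating extensions as case-insensitive is an assumption.

```python
from pathlib import Path

# The four extensions listed above. The constant name is hypothetical.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt", ".md"}


def is_supported(path: str) -> bool:
    """Return True if the file extension is one the step accepts.

    Assumption: matching is case-insensitive, so "REPORT.PDF" counts as a PDF.
    """
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def split_by_support(paths):
    """Partition paths into (accepted, rejected), preserving input order."""
    accepted = [p for p in paths if is_supported(p)]
    rejected = [p for p in paths if not is_supported(p)]
    return accepted, rejected


if __name__ == "__main__":
    ok, skipped = split_by_support(["notes.md", "scan.PDF", "table.csv"])
    print("accepted:", ok)
    print("rejected:", skipped)
```

Checking the extension up front lets a pipeline report unsupported files early rather than failing partway through a batch.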

If an ingested file contains images of text, they are detected automatically and converted to raw text using the Tesseract OCR engine.
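
The OCR fallback can be pictured with the sketch below. It is not the Preprocessor's actual detection logic, which is not documented here. It assumes a page counts as "an image of text" when its embedded text layer is empty, and it assumes the `pytesseract` wrapper (which needs the Tesseract binary installed) is used to call Tesseract. The function names `tesseract_ocr` and `page_to_text` and the `min_chars` threshold are hypothetical.

```python
from typing import Any, Callable


def tesseract_ocr(image: Any) -> str:
    """Run Tesseract on a PIL image through the pytesseract wrapper."""
    # Imported lazily so the rest of the sketch works without pytesseract.
    import pytesseract

    return pytesseract.image_to_string(image)


def page_to_text(
    text_layer: str,
    image: Any,
    ocr: Callable[[Any], str] = tesseract_ocr,
    min_chars: int = 1,
) -> str:
    """Return the embedded text, or OCR output when the text layer is too short.

    Assumption: fewer than `min_chars` non-whitespace-trimmed characters in the
    text layer means the page is a scanned image and should be sent to OCR.
    """
    stripped = text_layer.strip()
    if len(stripped) >= min_chars:
        return stripped
    return ocr(image).strip()
```

Passing the OCR function as a parameter keeps the fallback easy to exercise without Tesseract installed, for example `page_to_text("", page_image, ocr=lambda img: "stub text")`.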

More cleaning and preprocessing options are coming soon.