Preprocessor
The Preprocessor step converts multiple file types into a cleaned text format. This step accepts the following file types:
- .docx
- .txt
- .md
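
As a rough illustration of what this kind of conversion can look like, the sketch below routes a file by its extension and returns plain text. It is not the Preprocessor's actual implementation: the names `to_clean_text`, `read_docx_text`, and `SUPPORTED_EXTENSIONS` are hypothetical, the whitespace cleanup is only a stand-in for whatever cleaning the step really performs, and the .docx reader assumes a standard Word file with its body in `word/document.xml`.

```python
import re
import zipfile
from pathlib import Path
from xml.etree import ElementTree

# Assumption: mirrors the file types listed above; not taken from the real step.
SUPPORTED_EXTENSIONS = {".docx", ".txt", ".md"}

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_text(path: Path) -> str:
    """Extract paragraph text from a .docx file using only the standard library."""
    with zipfile.ZipFile(path) as archive:
        root = ElementTree.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{W_NS}p"):
        # A paragraph's text is spread over one or more <w:t> runs.
        paragraphs.append("".join(node.text or "" for node in para.iter(f"{W_NS}t")))
    return "\n".join(paragraphs)


def to_clean_text(path: str) -> str:
    """Hypothetical helper: route a file by extension and return cleaned text."""
    file_path = Path(path)
    suffix = file_path.suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    if suffix == ".docx":
        raw = read_docx_text(file_path)
    else:
        # .txt and .md are both read as UTF-8 text.
        raw = file_path.read_text(encoding="utf-8")
    # Illustrative cleaning only: collapse runs of spaces/tabs and trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in raw.splitlines()]
    return "\n".join(lines).strip()
```

Calling `to_clean_text("notes.md")` would return the file's text with extra spacing removed, while an unsupported extension such as `.pdf` raises a `ValueError` in this sketch.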
If ingested files contain images of text, the step detects them automatically and converts the images to raw text using the Tesseract OCR engine.
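
For readers who want a feel for the OCR step, here is a minimal sketch of turning images into text with Tesseract. It assumes the third-party `pytesseract` and `Pillow` packages plus a local Tesseract install; the Preprocessor may call Tesseract differently. The helper names `tesseract_ocr` and `images_to_text` are hypothetical, and the sketch does not cover how images are detected inside a file.

```python
from typing import Callable, Iterable, List


def tesseract_ocr(image_path: str) -> str:
    """Run Tesseract on one image via the pytesseract wrapper."""
    # Imported lazily so this module loads even without the OCR dependencies.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image)


def images_to_text(
    image_paths: Iterable[str],
    ocr: Callable[[str], str] = tesseract_ocr,
) -> str:
    """Hypothetical helper: OCR each image and join the non-empty results."""
    chunks: List[str] = []
    for image_path in image_paths:
        text = ocr(image_path).strip()
        if text:  # skip images where no text was recognized
            chunks.append(text)
    return "\n\n".join(chunks)
```

Passing the OCR function as a parameter is a design choice for this sketch only: it lets the joining logic be exercised with a stub when Tesseract is not installed.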
More cleaning and preprocessing options are coming soon.