Tesseract is an open source optical character recognition (OCR) engine. OCR extracts text from images and scanned documents that have no text layer. Tesseract can write the result as a plain text file, a searchable PDF, or several other common formats. It is highly customizable and supports more than 100 languages, including multilingual documents and vertical text.
Running tesseract on RCC Resources
To run tesseract on HPC, run the command directly from the terminal; it does not require a modulefile. In the example below, replace imagename and outputbase with your own filenames. The available options and config files are listed in the Tesseract documentation: https://tesseract-ocr.github.io/tessdoc/
tesseract imagename outputbase [options...] [configfile...]
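As a minimal sketch of the syntax above, the following turns one image into a searchable PDF. The filenames scan.png and scan_ocr are hypothetical, and the example assumes the English language data (eng) is installed on the system; check the Tesseract documentation for the options available in your version.

```shell
# Hypothetical filenames, chosen only for illustration.
image="scan.png"        # input image with no text layer
outputbase="scan_ocr"   # tesseract appends the file extension itself
lang="eng"              # assumes the English language data is installed

# Show the command that will be run.
echo "tesseract $image $outputbase -l $lang pdf"

# Run it only when tesseract is on the PATH and the input file exists.
# "-l" selects the language; "pdf" is a config name requesting PDF output.
if command -v tesseract >/dev/null 2>&1 && [ -f "$image" ]; then
    tesseract "$image" "$outputbase" -l "$lang" pdf
else
    echo "skipping: tesseract or $image not found" >&2
fi
```

With these names, a successful run would write scan_ocr.pdf next to the input. Running tesseract --list-langs shows which language data are actually installed.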