How to build tesseract 4 beta on macOS

Posted on 2018-05-06, edited on 2023-09-29

brew info tesseract

tesseract: stable 3.05.01 (bottled), HEAD
OCR (Optical Character Recognition) engine

With the stable 3.05.01 release, recognition of Simplified Chinese is rather poor.
I noticed that version 4.0.0 added a new neural network engine based on LSTM.
However, on macOS it has to be built from source.
Fortunately, the project's README.md has detailed build instructions.

Install dependencies

brew install automake autoconf autoconf-archive libtool
brew install pkgconfig
brew install icu4c
brew install leptonica
brew install gcc
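
Before compiling, it can help to confirm that the build tools are actually on the PATH. The following is a minimal sketch, not part of the official instructions: missing_tools is a hypothetical helper, and the command names in the example call are assumptions about what the formulas above install (Homebrew's libtool, for instance, provides glibtoolize rather than libtoolize).

```shell
# Print every command from the argument list that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example call (command names assumed from the Homebrew formulas above):
# missing_tools automake autoconf glibtoolize pkg-config
```

If the call prints nothing, all listed tools were found. Libraries such as icu4c and leptonica do not install a command, so this check does not cover them.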

Compile

git clone https://github.com/tesseract-ocr/tesseract/
cd tesseract
./autogen.sh
./configure CC=gcc CXX=g++ CPPFLAGS=-I/usr/local/opt/icu4c/include LDFLAGS=-L/usr/local/opt/icu4c/lib
make -j
make install
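
The configure line above hardcodes /usr/local/opt/icu4c, which is where Homebrew puts icu4c on Intel Macs. On Apple Silicon, Homebrew lives under /opt/homebrew, so the path may differ. As a rough sketch, the flags can be derived from whatever prefix Homebrew reports; icu_flags is a hypothetical helper name, and the example call assumes that brew --prefix icu4c resolves on your machine.

```shell
# Build the ICU-related configure arguments from an icu4c install prefix.
icu_flags() {
  prefix=$1
  printf 'CPPFLAGS=-I%s/include LDFLAGS=-L%s/lib\n' "$prefix" "$prefix"
}

# Example call, assuming Homebrew is installed:
# ./configure CC=gcc CXX=g++ $(icu_flags "$(brew --prefix icu4c)")
```

The command substitution is left unquoted on purpose so that the two flags are passed as separate arguments; this would break if the prefix contained spaces. After make install, running tesseract --version is a quick way to confirm that the new build is the one on the PATH.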

From their best trained models, download the Simplified Chinese language file chi_sim.traineddata and put it under tesseract/4.0.0.1/tessdata/
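
The download step can also be done from the terminal. This is only a sketch under stated assumptions: it assumes the best models are served from the tessdata_best repository on GitHub and that its default branch is main; model_url is a hypothetical helper, and TESSDATA_DIR is a hypothetical variable that you set to your own tessdata directory.

```shell
# Compose the download URL for one language file.
# Assumption: the models live in the tessdata_best repository on its main branch.
model_url() {
  printf 'https://github.com/tesseract-ocr/tessdata_best/raw/main/%s.traineddata\n' "$1"
}

# Example call; TESSDATA_DIR is a hypothetical variable pointing at your tessdata directory:
# curl -L -o "$TESSDATA_DIR/chi_sim.traineddata" "$(model_url chi_sim)"
# tesseract --list-langs   # chi_sim should now be listed
```

If chi_sim does not show up in the language list, the file most likely landed in a directory that tesseract is not reading.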

Usage

tesseract image.png image -l chi_sim
cat image.txt
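
The second argument to tesseract is an output base, to which it appends .txt. For more than one image, the same two steps can be wrapped in a loop. The sketch below uses two hypothetical helpers, output_base and ocr_all, and assumes every image path ends in a file extension.

```shell
# Derive tesseract's output base from an image path by dropping the extension.
output_base() {
  printf '%s\n' "${1%.*}"
}

# Run OCR on each given image; tesseract appends .txt to the output base.
ocr_all() {
  for img in "$@"; do
    tesseract "$img" "$(output_base "$img")" -l chi_sim
  done
}

# Example call:
# ocr_all scans/*.png
```

Each recognized text file ends up next to its image, for example scans/page1.txt for scans/page1.png.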

Okay, the results are still terrible for text set in the Song typeface. I would need to train the new model myself.

In the end, I gave up on tesseract. I found that dragging the image into OneNote, then using Ctrl + click -> Copy Text from Picture, gives higher accuracy. 😓

Translated by gpt-3.5-turbo