You never realize how similar characters are until you start an OCR project. ec ...

magicalhippo · on Dec 21, 2019

I too have been very disappointed in Tesseract for "simple" OCR (converting subtitles).

In communications, Turbo Codes[1] for example have the decoder produce an integer value for each bit, rather than just a bit. The value is a measure of how likely it is that the bit is 0 or 1.

This is then used together with previous bit values, which include parity data, to make a "hard" decision.
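
To make the soft-vs-hard idea concrete, here is a toy sketch. It is not a turbo decoder, just a single even-parity check where the least confident bit gets flipped if the parity doesn't add up. The sign convention (positive means "probably 0") and the function names are my own assumptions for illustration.

```python
def hard_decode(soft_values):
    """Threshold each soft value on its own: >= 0 means bit 0, < 0 means bit 1."""
    return [0 if v >= 0 else 1 for v in soft_values]


def soft_decode_even_parity(soft_values):
    """Use the confidences: if the even-parity check fails,
    flip the bit whose soft value is closest to zero (least confident)."""
    bits = hard_decode(soft_values)
    if sum(bits) % 2 != 0:
        weakest = min(range(len(soft_values)), key=lambda i: abs(soft_values[i]))
        bits[weakest] ^= 1
    return bits


# Sent: 1 0 1 plus even-parity bit 0. Noise pushed the third value to the wrong side.
received = [-7, 5, 1, 6]
print(hard_decode(received))              # third bit comes out wrong
print(soft_decode_even_parity(received))  # parity + confidence repairs it
```

With hard decisions only, the parity check could tell you something is wrong but not which bit; the magnitudes are what point at the culprit.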

I wonder if something similar has been tried for OCR? I imagine the OCR front-end could feed several probable hits per character, each with a confidence, into a spellchecker. Or something along those lines.
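
Roughly what I have in mind, as a minimal sketch. Everything here is hypothetical: the candidate lists, the confidences, the tiny word list, and the `best_word` helper are made up for illustration and are not Tesseract's API. A real system would want a smarter search than brute force over all combinations.

```python
from itertools import product
from math import log


def best_word(candidates, dictionary):
    """candidates: one list per character position of (char, confidence) pairs,
    with confidence > 0. Returns the dictionary word with the highest combined
    confidence, or None if no combination is a dictionary word."""
    best, best_score = None, float("-inf")
    for combo in product(*candidates):
        word = "".join(ch for ch, _ in combo)
        if word not in dictionary:
            continue
        # Sum of logs == log of the product of per-character confidences.
        score = sum(log(conf) for _, conf in combo)
        if score > best_score:
            best, best_score = word, score
    return best


# Made-up OCR output: 'e' narrowly beats 'c' in the first position.
candidates = [
    [("e", 0.55), ("c", 0.45)],
    [("a", 0.90), ("o", 0.10)],
    [("t", 0.80), ("l", 0.20)],
]
dictionary = {"cat", "cot", "car"}

print(best_word(candidates, dictionary))
```

Taking the top hit per position gives "eat", which isn't in this word list, while keeping the runner-up candidates lets the dictionary pull it back to "cat".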

[1]: https://en.wikipedia.org/wiki/Turbo_code