You never realize how similar characters are until you start an OCR project.
ec
ij
tf
Il|1
hb
OQ
etc.
Even the tiniest addition or subtraction of printed ink can turn one character in any of the rows above into any other character in the same row. Throw in page tilt, warp, etc., and OCR can frequently confuse them unless you train it specifically on your text. The pipeline I've found that works best is:
image -> upscale -> dewarp -> OCR -> spellcheck -> grammar check
I too have been very disappointed in Tesseract for "simple" OCR (converting subtitles).
In communications, Turbo Codes[1], for example, have the decoder produce an integer value for each bit rather than just a bit. That value is a measure of how likely the bit is to be 0 or 1.
It is then combined with earlier bit values, including parity data, to make a "hard" decision.
I wonder if something similar has been tried for OCR? I imagine the OCR front-end could feed a number of probable hits, along with their confidences, into a spellchecker. Or something along those lines.
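To make the soft-decision idea from this thread a bit more concrete, here is a minimal Python sketch of a spellcheck stage that takes confidences instead of a single guess per character. It assumes a hypothetical OCR front-end that returns, for each character position, a dict of candidate characters and their confidences. The names `hard_decision` and `best_word`, the confidence values, and the tiny word list are all made up for illustration; no real OCR engine's API is implied.

```python
import math

def hard_decision(candidates):
    """Pick the single most confident character at each position."""
    return "".join(max(options, key=options.get) for options in candidates)

def best_word(candidates, dictionary):
    """Return the dictionary word with the highest summed log-confidence.

    Only words of the same length are considered; returns None if no
    dictionary word is consistent with the candidates.
    """
    best, best_score = None, -math.inf
    for word in dictionary:
        if len(word) != len(candidates):
            continue
        score = 0.0
        for ch, options in zip(word, candidates):
            p = options.get(ch, 0.0)
            if p <= 0.0:
                # This word needs a character the OCR never proposed here.
                score = -math.inf
                break
            score += math.log(p)
        if score > best_score:
            best, best_score = word, score
    return best

# Made-up soft output for a smudged three-letter word, using the
# t/f, h/b and e/c confusions listed earlier in the thread.
candidates = [
    {"t": 0.55, "f": 0.45},
    {"h": 0.40, "b": 0.60},
    {"e": 0.45, "c": 0.55},
]
words = ["the", "fee", "tic", "cat"]

print(hard_decision(candidates))      # per-character argmax
print(best_word(candidates, words))   # dictionary-constrained choice
```

With these numbers the per-character argmax gives "tbc", while the dictionary-constrained choice recovers "the", because the lower-confidence candidates are kept around until the word-level decision. A real version would presumably also need to handle merged or split characters and weight words by how common they are; this sketch does neither.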