
You never realize how similar characters are until you start an OCR project.

ec

ij

tf

Il|1

hb

OQ

etc.

Even the tiniest addition or subtraction of printed ink can turn one character in any of the rows above into another character from the same row. Throw in page tilt, warp, etc., and OCR will frequently confuse them unless you train it specifically on your text. The pipeline I've found works best is:

image -> upscale -> dewarp -> OCR -> spellcheck -> grammar check
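As a rough illustration of how the spellcheck stage could lean on the confusion rows above, here is a minimal sketch. The function name `correct_word` and the tiny word set are made up for the example, and the groups are only the rows listed above, not a complete confusion table.

```python
from itertools import product

# Confusion groups copied from the rows listed above (not exhaustive).
CONFUSABLE_GROUPS = ["ec", "ij", "tf", "Il|1", "hb", "OQ"]

# Map each character to its whole group, itself included.
ALTERNATIVES = {ch: group for group in CONFUSABLE_GROUPS for ch in group}


def correct_word(word, dictionary):
    """Return a dictionary word reachable by swapping confusable characters.

    Hypothetical helper: falls back to the input when nothing matches.
    """
    if word in dictionary:
        return word
    # For each position, the characters the OCR might have meant.
    options = [ALTERNATIVES.get(ch, ch) for ch in word]
    for candidate in product(*options):
        candidate = "".join(candidate)
        if candidate in dictionary:
            return candidate
    return word


words = {"the", "bit", "Queen"}
print(correct_word("fhe", words))
print(correct_word("Oueen", words))
```

The search is exponential in the number of confusable characters per word, so this only makes sense for short words or as a fallback after a normal spellcheck.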



I too have been very disappointed in Tesseract for "simple" OCR (converting subtitles).

For communication, Turbo Codes[1], for example, have the decoder produce an integer value for each bit rather than just a bit. The value is a measure of how likely the bit is to be 0 or 1.

This is then used together with the values of previous bits, which include parity data, to make a "hard" decision.

I wonder if something similar has been tried for OCR? I imagine the OCR front-end could feed a number of probable hits, along with confidence, into a spellchecker. Or something along those lines.

[1]: https://en.wikipedia.org/wiki/Turbo_code
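To make the idea concrete, here is a minimal sketch of what such a "soft decision" hand-off could look like. Everything in it is an assumption for illustration: the per-character confidence dicts are invented data rather than the output of any real OCR engine, `best_word` is a made-up name, and the floor value for unproposed characters is arbitrary.

```python
import math


def best_word(position_candidates, dictionary, floor=1e-6):
    """Pick the same-length dictionary word with the highest total log-confidence.

    position_candidates: one dict per character position, mapping a
    candidate character to a confidence in (0, 1].
    floor: assumed confidence for a character the OCR did not propose.
    """
    best, best_score = None, -math.inf
    for word in dictionary:
        if len(word) != len(position_candidates):
            continue
        score = sum(
            math.log(cands.get(ch, floor))
            for ch, cands in zip(word, position_candidates)
        )
        if score > best_score:
            best, best_score = word, score
    return best


# Invented soft output for a three-letter word.
soft = [
    {"h": 0.55, "b": 0.45},
    {"i": 0.60, "j": 0.40},
    {"t": 0.70, "f": 0.30},
]

# A hard decision per character gives "hit", which is not in the word list.
hard = "".join(max(c, key=c.get) for c in soft)
print(hard)

# Deferring the decision lets the word list pull the result to "bit".
print(best_word(soft, {"bit", "hat", "fit"}))
```

This keeps the decision per word rather than using earlier context the way a turbo decoder does; a language model over neighbouring words would be the closer analogue.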



