[Techtalk] home server and digitizing documents

Miriam English mim at miriam-english.org
Thu May 8 01:26:29 UTC 2014

I use tesseract for OCR. It's free and very reliable.

It was originally proprietary software developed by HP Labs. HP 
open-sourced it in 2005, and Google has sponsored its development since 
2006. It's gone from strength to strength.
In its early days it couldn't cope with multiple columns, but the 
version I have (v3.00) does. They're up to version 3.03 so I should 
upgrade again.

It tends to be a bit sensitive to resolution, with an optimum range of 
about 300-600 dpi for ordinary print. And I've found it works best 
with grayscale images, though it works well with color too.
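
As a rough illustration of the resolution point, here is a minimal Python sketch that estimates the dpi of a scan from its pixel width and the width of the paper, then checks it against the 300-600 dpi range mentioned above. The function names and the 8.5 inch (US Letter) page width are assumptions made for the example, not anything Tesseract provides.

```python
# Sketch only: scan_dpi and in_ocr_range are hypothetical helper names.

def scan_dpi(pixel_width, page_width_inches):
    # Effective resolution: pixels across the image per inch of paper.
    return pixel_width / page_width_inches

def in_ocr_range(dpi, low=300, high=600):
    # True when the resolution falls inside the range quoted above.
    return low <= dpi <= high

# Assumed example: a US Letter page (8.5 inches wide) scanned 2550 pixels wide.
dpi = scan_dpi(2550, 8.5)
print(dpi, in_ocr_range(dpi))
```

A scan that falls below the range is usually better rescanned at a higher setting than scaled up afterwards.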

Tesseract does just basic OCR. It doesn't fix hyphenation and it can 
make occasional stupid mistakes like putting the number "1" in a word 
instead of the letter "l". It would be nice if it learned from its 
mistakes, and one day it will, but it's not there yet. (You can train it 
for certain fonts, but I haven't tried that yet.) At the moment I find 
it easy enough to run a spell-check on the result. Formatting can be a 
bit of a pain because it (well, the version I have) ignores the indent 
beginning paragraphs, which makes it difficult to reformat. But it is 
the right price: free. And it is improving all the time, as you'd expect 
with an open-source project.
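
The two clean-up problems above (hyphenation left in place, and a "1" turning up inside a word) can be partly handled before the spell-check. The following is a minimal Python sketch, not part of Tesseract; both function names are made up for the example. It assumes that a hyphen at the end of a line followed by a lowercase letter is a split word, which will wrongly join genuine compounds such as "well-known" when they happen to break across lines.

```python
import re

def join_hyphenated(text):
    # Rejoin a word split by a hyphen at the end of a line.
    # Assumption: hyphen + newline + lowercase letter means a split word.
    return re.sub(r"([A-Za-z])-\n([a-z])", r"\1\2", text)

def suspicious_words(text):
    # List words that mix letters with the digit 1 (e.g. "he1lo"),
    # so they can be checked by hand. Pure numbers are left alone.
    words = re.findall(r"\b\w+\b", text)
    return [w for w in words if "1" in w and re.search(r"[A-Za-z]", w)]

sample = "The docu-\nment says he1lo to the wor1d in 2014."
cleaned = join_hyphenated(sample)
print(cleaned)
print(suspicious_words(cleaned))
```

The second function only flags words rather than changing them, since a "1" next to letters is sometimes correct.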

Tesseract is really easy to set up with scripts. Some of the people who 
use it to convert books for Project Gutenberg have made some pretty cool 
scripts, but I tend to roll my own.
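
For anyone who wants a starting point, here is a minimal Python sketch of such a script. It relies on the basic command-line form `tesseract imagename outputbase -l lang`, where Tesseract adds ".txt" to the output base itself. The function names, the "scans" folder and the "*.tif" pattern are assumptions for the example.

```python
import subprocess
from pathlib import Path

def tesseract_command(image, lang="eng"):
    # Build the command for one image; output goes next to the image,
    # e.g. page001.tif -> page001.txt (tesseract appends ".txt").
    image = Path(image)
    out_base = image.with_suffix("")
    return ["tesseract", str(image), str(out_base), "-l", lang]

def ocr_folder(folder, pattern="*.tif", lang="eng"):
    # Run tesseract on every matching image in the folder, in name order.
    # Requires the tesseract program to be installed and on the PATH.
    for image in sorted(Path(folder).glob(pattern)):
        subprocess.run(tesseract_command(image, lang), check=True)

# Show the command that would be run for a single (hypothetical) page.
print(tesseract_command("page001.tif"))
```

Calling `ocr_folder("scans")` would then work through a whole folder of page images one at a time.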

I'd love to one day build a mass-scanning setup like this:

This one is probably beyond me, but would be lovely to have:


	- Miriam

Chris Hardy wrote:
> I'm rethinking my household server/network and one thing that I need to address is digitizing archival documents (tax receipts and handwritten journals).
> I have a freeNAS box and an ESXi box running a variety of linux and windoze virtual machines (unfortunately freeNAS isn't really stable on a VM, hence the separation).  I also have a mac laptop, but am reluctant to invest in mac-dependent archival gear since I doubt that I'll stick with mac for the next decade.
> What setup works for you?  what hardware are you using?  Is ScanSnap really the only document scanner solution? What software are you using for OCR? What are you using for document management like indexing and search?
> Thanks for any thoughts and experiences!
> chris
> _______________________________________________
> Techtalk mailing list
> Techtalk at linuxchix.org
> http://mailman.linuxchix.org/mailman/listinfo/techtalk

If you don't have any failures then you're not trying hard enough.
  - Dr. Charles Elachi, director of NASA's Jet Propulsion Laboratory
Website: http://miriam-english.org
Blogs:   http://miriam-e.dreamwidth.org
