[Techtalk] home server and digitizing documents
Miriam English
mim at miriam-english.org
Thu May 8 01:26:29 UTC 2014
I use tesseract for OCR. It's free and very reliable.
http://code.google.com/p/tesseract-ocr/
It was originally proprietary software developed at HP Labs. HP released
it as open source in 2005, and Google has sponsored its development
since 2006. It's gone from strength to strength.
In its early days it couldn't cope with multiple columns, but the
version I have (v3.00) does. They're up to version 3.03 so I should
upgrade again.
It tends to be a bit sensitive to resolution, with an optimum range of
about 300-600 dpi for ordinary print. And I've found it works best
with grayscale images, though it works well with color too.
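
To make the resolution point concrete, here is a minimal sketch of how
you might check whether a scan falls in that 300-600 dpi range before
feeding it to tesseract. The function names and the example page width
(US Letter, 8.5 inches) are my own assumptions for illustration, not
anything tesseract provides.

```python
# Rough check of a scan's resolution against the 300-600 dpi range
# mentioned above. Names and the example page size are illustrative.

OPTIMUM_DPI = (300, 600)  # range quoted in the message for ordinary print


def effective_dpi(pixel_width: int, page_width_inches: float) -> float:
    """Resolution implied by the image width and the physical page width."""
    if page_width_inches <= 0:
        raise ValueError("page width must be positive")
    return pixel_width / page_width_inches


def in_optimum_range(dpi: float) -> bool:
    """True if dpi lies within the quoted optimum range (inclusive)."""
    low, high = OPTIMUM_DPI
    return low <= dpi <= high


# Example: a scan 2550 pixels wide of an 8.5 inch wide page.
dpi = effective_dpi(2550, 8.5)
print(dpi, in_optimum_range(dpi))
```

A scan that falls below the range is usually better rescanned than
upscaled, since upscaling adds no real detail.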
Tesseract does just basic OCR. It doesn't fix hyphenation and it can
make occasional stupid mistakes like putting the number "1" in a word
instead of the letter "l". It would be nice if it learned from its
mistakes, and one day it will, but it's not there yet. (You can train it
for certain fonts, but I haven't tried that yet.) At the moment I find
it easy enough to run a spell-check on the result. Formatting can be a
bit of a pain because it (well, the version I have) ignores the indent
at the start of each paragraph, which makes it difficult to reformat.
But it is
the right price: free. And it is improving all the time, as you'd expect
with an open-source project.
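
The two cleanup jobs described above (hyphenation and a "1" landing
inside a word) can be partly automated before the spell-check. This is
only a sketch using Python's standard re module; the function names are
hypothetical, and the rules are deliberately crude, so the output still
needs a read-through.

```python
import re
from typing import List

# A hyphen at the end of a line with word characters on both sides,
# e.g. "recog-\nnition". Removing it also joins genuinely hyphenated
# words that happen to break at a line end.
_HYPHEN_BREAK = re.compile(r"(?<=\w)-\n(?=\w)")

# A token that contains at least one letter and the digit 1,
# e.g. "he1lo". Legitimate tokens such as "1st" are flagged too.
_SUSPECT = re.compile(r"\b(?=\w*[A-Za-z])\w*1\w*\b")


def join_hyphenated(text: str) -> str:
    """Rejoin words that were split by a hyphen at a line break."""
    return _HYPHEN_BREAK.sub("", text)


def suspicious_words(text: str) -> List[str]:
    """List tokens where a "1" may have been read in place of an "l"."""
    return _SUSPECT.findall(text)


sample = "Tesseract does recog-\nnition we1l enough in 2014."
cleaned = join_hyphenated(sample)
print(cleaned)
print(suspicious_words(cleaned))
```

The flagged words are only candidates for review; the script does not
try to correct them automatically.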
Tesseract is really easy to set up with scripts. Some of the people who
use it to convert books for Project Gutenberg have made some pretty cool
scripts, but I tend to roll my own.
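
As an illustration of the kind of script meant here, the sketch below
runs tesseract over every scan in a directory. It relies only on the
basic command form "tesseract IMAGE OUTPUTBASE -l LANG", where tesseract
writes OUTPUTBASE.txt. The function names, the list of file suffixes,
and the default language are assumptions, and which image formats
actually work depends on how your tesseract was built.

```python
import subprocess
from pathlib import Path
from typing import List

# Assumed scan formats; adjust to whatever your tesseract build reads.
IMAGE_SUFFIXES = {".tif", ".tiff", ".png", ".jpg"}


def build_command(image: Path, lang: str = "eng") -> List[str]:
    """Build: tesseract <image> <output base> -l <lang>.

    tesseract appends ".txt" to the output base itself.
    """
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


def ocr_directory(directory: Path, lang: str = "eng") -> List[Path]:
    """OCR every scan in directory, in name order; return the text files."""
    outputs = []
    for image in sorted(directory.iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # check=True stops the batch if tesseract reports an error.
        subprocess.run(build_command(image, lang), check=True)
        outputs.append(image.with_suffix(".txt"))
    return outputs


# Show the command that would be run for one page, without running it.
print(build_command(Path("page001.tif")))
```

Calling ocr_directory(Path("scans")) would then process a whole folder
of page images, assuming tesseract is installed and on the PATH.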
I'd love to one day build a mass-scanning setup like this:
http://www.instructables.com/id/DIY-High-Speed-Book-Scanner-from-Trash-and-Cheap-C/?ALLSTEPS
This one is probably beyond me, but would be lovely to have:
http://www.theverge.com/2012/11/13/3639016/google-books-scanner-vacuum-diy
Cheers,
- Miriam
Chris Hardy wrote:
> I'm rethinking my household server/network and one thing that I need to address is digitizing archival documents (tax receipts and handwritten journals).
>
> I have a freeNAS box and an ESXi box running a variety of linux and windoze virtual machines (unfortunately freeNAS isn't really stable on a VM, hence the separation). I also have a mac laptop, but am reluctant to invest in mac-dependent archival gear since I doubt that I'll stick with mac for the next decade.
>
> What setup works for you? What hardware are you using? Is ScanSnap really the only document scanner solution? What software are you using for OCR? What are you using for document management, such as indexing and search?
>
> Thanks for any thoughts and experiences!
>
> chris
--
If you don't have any failures then you're not trying hard enough.
- Dr. Charles Elachi, director of NASA's Jet Propulsion Laboratory
-----
Website: http://miriam-english.org
Blogs: http://miriam-e.dreamwidth.org
http://miriam-e.livejournal.com