
Scanning Text -- details


Options:
Scan:
You can scan a printed page 1) as an image, or 2) with OCR (Optical Character Recognition) software that converts it to text.
A single sheet is easy to scan. A page in a thin magazine or paperback is fairly easy to place correctly -- break the spine and hold down the scanner cover so the page touches the scanner glass as evenly as possible. A page in a hardcover book is hard to scan, as you may not want to damage the book -- have each page photocopied in a copy shop, so that you have a flat sheet for each page that you can then scan successfully.

Edit the images: they may need cropping (to get rid of the dark margin where the copy did not touch the scanner glass), straightening, sharpening, better contrast, spot removal, and so on.
OR convert each page to text with OCR software if it has text only. The resulting text will need careful editing, especially when the copy is of poor quality, the font is very small, or the text contains italics or accents. Remember, for example, that 1, !, l and i look very similar, as do g and q. Use your imagination, or try to decipher an old handwritten document, to find out how difficult this can be.

Then: For use on Bb or elsewhere on the web, you have alternatives:
You can use each page as an image, OR you can use the converted-to-text and edited version.
In either case, add the images or text to a Word document or, preferably, to an HTML document that will open automatically on the web.
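
As a rough illustration of the HTML option, the following Python sketch builds a very plain web page that shows each scanned page image in order. The function name `build_page` and the file names `page1.jpg` and `page2.jpg` are made up for this example; it is one possible way to do this, not a required method.

```python
from html import escape

def build_page(title, image_files):
    """Return a minimal HTML document that shows each page image in order."""
    parts = [
        "<!DOCTYPE html>",
        "<html>",
        f'<head><meta charset="utf-8"><title>{escape(title)}</title></head>',
        "<body>",
        f"<h1>{escape(title)}</h1>",
    ]
    # One <img> per scanned page; the alt text gives the page number.
    for number, name in enumerate(image_files, start=1):
        parts.append(f'<p><img src="{escape(name, quote=True)}" alt="Page {number}"></p>')
    parts += ["</body>", "</html>"]
    return "\n".join(parts)

# Hypothetical file names, for illustration only.
page = build_page("Scanned article", ["page1.jpg", "page2.jpg"])
print(page)
```

Saving the printed text in a file whose name ends in .html, in the same folder as the images, gives a page that a browser can open directly.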

Needed:
You need a scanner -- very inexpensive.

OCR software -- Needed for conversion of printed text to digitized text:
Most scanners include basic software for image editing and text conversion. For professional OCR (Optical Character Recognition) software, TextBridge and OmniPage are recommended, especially for foreign-language text. The latest version of OmniPage includes more than 100 languages and dictionaries.

The scanner makes an image of a page. Then the software tries to recognize the text in the image.
This is hard for the software and leads to mistakes. While we “read” words in context, the poor stupid software must recognize dot combinations individually.

Think of a handwritten or printed 1 and l and i and I and ! and |  -- or think of g and p and q and j. They look very much alike – especially when the print is tiny or the copy is bad.

Or visualize the same characters in italics (1liI!| and gpqj), especially when the print is very small -- or underlined (1liI!| and gpqj).

Be ready to proofread all scanned text very carefully!

Make sure that what you post is accurate and easily legible!
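
The look-alike characters described above suggest one small aid for proofreading: a short Python sketch that lists words in which letters are mixed with digits or symbols such as 1, 0, ! and |. The function name `suspicious_words` and the set of characters checked are assumptions chosen for this example. It does not replace careful reading, only points to some likely trouble spots.

```python
# Characters that OCR often confuses with letters (an illustrative, not complete, set).
SUSPECT = set("01!|")

def suspicious_words(text):
    """Return words that mix letters with digits or symbols from SUSPECT."""
    flagged = []
    for token in text.split():
        word = token.strip(".,;:?!\"'()")   # drop ordinary punctuation at the edges
        has_letter = any(ch.isalpha() for ch in word)
        has_suspect = any(ch in SUSPECT for ch in word)
        if has_letter and has_suspect:
            flagged.append(word)
    return flagged

sample = "The sma1l print was i!legible, but page 12 was fine!"
print(suspicious_words(sample))   # ['sma1l', 'i!legible']
```

A check like this cannot catch a letter misread as another letter (l for I, or g for q); those errors still need a spelling checker or a human reader.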


Tips for OCR:

  1. It is faster to type short text than to scan and proofread it.
  2. It is important to have good clean pages for scanning.
  3. Computer-generated pages scan very accurately.
  4. To scan pages or articles from a book or magazine, first have a good copy of each page made in a copy shop, so that the pages are straight and free of the dark shadows that appear where a page lifts off the glass near the spine.
  5. On a page with an illustration or printed in multiple columns, select the text areas before scanning.
  6. Usually, scan text at 100% -- but you should enlarge a page or part of a page to, say, 200%, when it has very small print, to get better character recognition results.
  7. If your copy is dark or crooked, scan it as an image and then edit the image before you use character recognition.
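
The advice on enlarging small print can be made concrete with a little arithmetic. A printer's point is 1/72 of an inch, so the height of a line of type in the scan is roughly the point size divided by 72, times the scanner resolution, times the scale. The sketch below assumes a resolution of 300 dots per inch and 6-point type; both figures are examples, not recommendations from any scanner manual, and the function name `char_height_pixels` is made up here.

```python
def char_height_pixels(point_size, dpi, scale_percent=100):
    """Approximate height, in pixels, of type of a given point size in a scan."""
    inches = point_size / 72              # 72 points per inch
    return inches * dpi * scale_percent / 100

# 6-point type at an assumed 300 dpi, scanned at 100% and at 200%.
for scale in (100, 200):
    print(scale, round(char_height_pixels(6, 300, scale)))
```

This prints 25 pixels at 100% and 50 pixels at 200%: doubling the scale gives the software twice as many dots per character to work with.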

Finally:

If you want to use an illustrated page as is, scan the whole page as an image. If it is in a book, get a professional copy of each page from a copy shop – that way it is straight and free of dark shadows. Scan and then crop the image to get rid of shadows and, if needed, edit it for contrast and sharpness.
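
To show what cropping away the shadows amounts to, here is a toy Python sketch. It treats a scan as a grid of brightness values from 0 (black) to 255 (white) and trims rows and columns that are dark on average. The function name `crop_dark_margins` and the threshold of 128 are assumptions for this example; in practice the imaging software does this job, usually by hand with a crop tool.

```python
def crop_dark_margins(pixels, threshold=128):
    """Trim outer rows and columns whose average brightness is below threshold.

    pixels is a list of rows; each value runs from 0 (black) to 255 (white).
    """
    def bright(values):
        return sum(values) / len(values) >= threshold

    rows = [i for i, row in enumerate(pixels) if bright(row)]
    cols = [j for j, col in enumerate(zip(*pixels)) if bright(col)]
    if not rows or not cols:
        return []                         # the whole scan is dark; nothing to keep
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [list(row[left:right + 1]) for row in pixels[top:bottom + 1]]

# A tiny made-up scan: dark shadow along the top and left, print (0) on white (255).
scan = [
    [20,  20,  20,  20,  20,  20],
    [20, 255, 255, 255, 255, 255],
    [20, 255,   0, 255,   0, 255],
    [20, 255, 255, 255, 255, 255],
]
for row in crop_dark_margins(scan):
    print(row)
```

The result keeps the three bright rows and five bright columns, so the shadow is gone while the dark dots of the print itself remain.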

If you want to scan handwritten text, such as a letter or an ancestor’s birth certificate, scan it as an image.
If you want to scan just an image from an illustrated page, select it, scan it, then crop and edit it with the imaging software.

My preference is for posting pages as images, in Acrobat .pdf format.


Hoffmann, Nov. 2006