Chinese Optical Character Recognition (OCR)
As sent to the Fanyi-L Discussion List
Below is a summary of postings to the Fanyi-L forum regarding OCR. If you have anything you would like to contribute to this page, please send me an email.
Date: Wed, 4 Mar 1999
From: Bin Zhang <bzhang@hawaii.edu>
Subject: Chinese OCR software?
Can anyone recommend good Chinese OCR software? Thanks.
Date: Wed, 3 Mar 1999
From: Vic Dickson/Transco <vic@transcogroup.com>
Subject: Re: Chinese OCR software?
I use TH-OCR V7.0, developed by the Electronics Department, Tsinghua Univ. (Beijing). Support numbers: (86-10)62577418/19/20/21. It works well, but I can't say if it's better than others or not 'cause I haven't tried any others.
Date: Thu, 24 Feb 2000
From: Terry L. Thatcher <ironlady@fanyi.com>
Subject: [FANYI-L:1645] Re: TRADOS (long)
I scan **all** my hard-copy Chinese files and routinely translate them into English using Trados. I'm using a garden-variety cheapie Taiwanese scanner, plus a Chinese OCR program recommended to me by the charming young girls at the computer store that morning (which actually has turned out to be quite good!!). It's called 丹青中英文文件辦事系統. It handles both simplified and traditional characters, and although I have to do some post-editing on the output, it's not usually enough to be really annoying. Anyhow, I'm the type of person who never wanted to read the project before translating it, so this kind of forces me to do so!!
If you need any more information, or want to think about sharing Trados resources, please get in touch with me off list.
A couple of tips if you want to do this:
- The size of the characters doesn't seem to matter much, but I had one
document which simply wouldn't scan, and it turned out to be because
the lines were too close together. Even using a higher resolution didn't
help. The company (OCR product company) was very helpful and did
their best to find a solution, however.
- Fax naturally doesn't work as well as nice crisp hard copy, but it's been OK.
- It's worth taking a minute or so and defining the exact areas of each page
to be OCR'd using their little drag-and-drop tool. That way, you don't get
into problems with the machine doing its darndest to see which Classical
character those fax dots really represent. ;-)
- Or, you can try what I've been doing (thanks to various credit-card
merchants over the years for the idea): a "discount" for clients sending
computer files to me -- which really means that I've raised my price
for those who send hard copy, but it sounds better this way, right??
Remember, Trados is NOT machine translation -- it's just a way to organize your past translation work in small units which the computer can "remember" for you at appropriate times. You can also use a Concordance type search which works quite well. Other features of Trados, like automatic term lookup, are not yet functional from double-byte source language documents, but it hasn't been much of a problem for me.
Let me know if you spring for the package -- I'd love to share glossaries, at least. For that matter, anyone who has glossaries in any format (CSV, etc.) - we could interchange them somehow -- I have storage space on my Web site if anyone is interested.
Last updated: Monday, August 12, 2002.