Perhaps the hardest thing to get right in the TILT design is reliably recognising lines on a page where the division into lines may be irregular. For example, you can easily have uneven line-spacing, or inserted, warped and tilted lines, but in order to recognise the words on a page you first have to work out roughly where they are. TILT has shown, early in its life, that the basic idea behind its line-recognition method works. There is a live demo here. It is slow only because the server is slow; TILT itself is actually pretty fast.
Once you've loaded a page you can click on some buttons to see how TILT processes a page-image. First it reduces the image to greyscale, then to pure black and white, then it removes residual borders (all ordinary OCR steps). Finally it searches the page for lines, using a grid of rectangles, about 25 across and 200 down. The reason for this strange proportion is that lines of text are shaped much the same way: wide and shallow. So if lines slant down or up, or curve, TILT should be able to track their progress across the page.
So far it has demonstrated that it can discover small lines in between the main ones. Ordinary OCR programs can't do this, because they assume that text has evenly-spaced lines. TILT's test interface draws a line over the top of each line of text just for the demonstration, but in the real product these lines will be invisible. Along each such line TILT will later attempt to recognise words, and to align those words with the already transcribed text. That is still to come, but this step brings it that much closer.
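To make the steps above concrete, here is a minimal sketch in Python of the same sequence: greyscale, black and white, then a grid of cells that is used to follow lines across the page. It is not TILT's actual code. The function names, the fixed threshold of 128, the `max_jump` tolerance and the rule for linking cells are all assumptions made for illustration, and the border-removal step is left out.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def to_greyscale(rgb: List[List[Pixel]]) -> List[List[int]]:
    """Convert rows of (r, g, b) pixels to 0..255 luminance values."""
    return [[(299 * r + 587 * g + 114 * b) // 1000 for (r, g, b) in row]
            for row in rgb]


def to_black_and_white(grey: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """1 = ink (dark pixel), 0 = paper. The threshold is an assumed value."""
    return [[1 if v < threshold else 0 for v in row] for row in grey]


def grid_ink_counts(bw: List[List[int]], cols: int = 25, rows: int = 200) -> List[List[int]]:
    """Divide the page into cols x rows cells and count ink pixels in each."""
    height, width = len(bw), len(bw[0])
    counts = [[0] * cols for _ in range(rows)]
    for y, row in enumerate(bw):
        cell_row = y * rows // height
        for x, v in enumerate(row):
            if v:
                counts[cell_row][x * cols // width] += 1
    return counts


def column_peaks(counts: List[List[int]], col: int, min_ink: int = 1) -> List[int]:
    """Centre row of each vertical run of inked cells in one grid column."""
    peaks, start, n = [], None, len(counts)
    for r in range(n + 1):
        inked = r < n and counts[r][col] >= min_ink
        if inked and start is None:
            start = r
        elif not inked and start is not None:
            peaks.append((start + r - 1) // 2)
            start = None
    return peaks


def trace_lines(counts: List[List[int]], max_jump: int = 2) -> List[List[Tuple[int, int]]]:
    """Link peaks in adjacent columns into lines of (col, row) cells.

    A peak extends the nearest line that ended in the previous column,
    provided it is no more than max_jump cells up or down; otherwise it
    starts a new line. This is how a short inserted line can be found.
    """
    lines: List[List[Tuple[int, int]]] = []
    for col in range(len(counts[0])):
        for row in column_peaks(counts, col):
            best = None
            for line in lines:
                last_col, last_row = line[-1]
                if last_col == col - 1 and abs(last_row - row) <= max_jump:
                    if best is None or abs(last_row - row) < abs(best[-1][1] - row):
                        best = line
            if best is not None:
                best.append((col, row))
            else:
                lines.append([(col, row)])
    return lines


def demo_page() -> List[List[Pixel]]:
    """A 40x40 synthetic page: one slanting line and one short inserted line."""
    white, black = (255, 255, 255), (0, 0, 0)
    page = [[white] * 40 for _ in range(40)]
    for x in range(40):
        page[4 + 4 * (x // 10)][x] = black  # main line, slanting downwards
    for x in range(20, 40):
        page[30][x] = black                 # short line inserted lower down
    return page


if __name__ == "__main__":
    bw = to_black_and_white(to_greyscale(demo_page()))
    for line in trace_lines(grid_ink_counts(bw, cols=4, rows=10)):
        print(line)
```

The demo uses a tiny 4 by 10 grid so the result is easy to read; on a real page-image the default of 25 by 200 matches the proportions described above. Because each column is examined separately, a line that drifts up or down by a cell or two from one column to the next is still followed, and a short line that appears only part-way across the page shows up as a separate trace.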
What this makes possible is the offline processing of large numbers of page-images, creating links between page-images and text, which can then be uploaded. The links won't be perfect without editing (which is what the graphical user interface is needed for), but as a first pass they will suffice.