TILT has now been thoroughly revised at every stage except the final linking. The result is better separation of text from the image, and better line and word recognition. It is almost at the point where, for many manuscripts, the automatic output will be good enough for a first-pass set of text-to-image links. Here are the automatic results for the difficult De Roberto V2 manuscript of The Viceroys. Note the good automatic word identification. This will be refined further in the linking step. By reference to the real text that these shapes represent, it will be possible to split some of the merged words, though not those where there is overwriting, of course. The main difference is seen in the way that words are restricted to their own line-region, which is now polygonal rather than rectangular. The rectangular strategy was too simple, and led to words being recognised that crossed line boundaries. Joined ascenders and descenders are all too common in manuscripts.
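To illustrate why a polygonal line-region is stricter than a rectangular one, here is a minimal sketch in Python. It is not TILT's actual code: the function names, the sample coordinates and the use of a ray-casting test are all assumptions made for the example. It shows a pixel that falls inside the bounding rectangle of a sloping line but outside the polygon that follows the line itself, which is the kind of stray ink from a neighbouring line that a rectangle would wrongly admit.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(p: Point, poly: List[Point]) -> bool:
    """Ray-casting test: cast a ray to the right of p and count edge crossings."""
    x, y = p
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Only edges that straddle the horizontal line through p can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge meets that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def point_in_bounding_box(p: Point, poly: List[Point]) -> bool:
    """The older, simpler test: is p inside the polygon's bounding rectangle?"""
    xs = [px for px, _ in poly]
    ys = [py for _, py in poly]
    return min(xs) <= p[0] <= max(xs) and min(ys) <= p[1] <= max(ys)


# Hypothetical line-region for a line of writing that slopes upwards to the
# right (image coordinates, y increasing downwards).
line_region = [(0, 10), (100, 0), (100, 20), (0, 30)]

stray_descender = (10, 5)   # ink hanging down from the line above
own_ink = (10, 15)          # ink belonging to this line

in_box_stray = point_in_bounding_box(stray_descender, line_region)
in_poly_stray = point_in_polygon(stray_descender, line_region)
in_poly_own = point_in_polygon(own_ink, line_region)
```

With these sample values the stray pixel passes the rectangular test but fails the polygonal one, while the line's own ink passes both. A real implementation would work on connected components or pixel masks rather than single points, but the containment test is the same idea.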