Monday, September 15, 2014

Defining sub-regions of images

It often happens in manuscripts that only certain areas bear writing and the rest of the image can be more or less ignored. In some files I was sent recently by a group interested in TILT, most of the page, handwritten on the verso, is taken up by writing that shows through from the recto. Other cases include decorative borders or letterheads that ideally need to be excluded. Doubtless there are automatic methods to deal with some of these. The question is, how often do they go wrong? And are they not designed to work on regularly laid-out pages in printed books? What about manuscripts where the writing goes right into the margin? In such cases any automatic technique is likely to do more damage than it repairs. Manually specifying these active regions is worth the small effort it costs if the result is much improved word-recognition and text-to-image linking. All that was needed was to 'white-out' the regions of no interest, so that the same procedure as before ignores those areas.
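The white-out idea can be sketched in a few lines of Python. This is only an illustration, not TILT's actual code: it assumes a greyscale image held as a list of rows, and active regions given as simple rectangles, whereas a real tool might well use arbitrary polygons. The names `white_out` and `active_regions` are made up for the example.

```python
WHITE = 255  # grey value used for the areas to be ignored


def white_out(pixels, active_regions):
    """Return a copy of a greyscale image in which everything outside
    the active regions is set to white.

    pixels: list of rows, each a list of grey values (0-255).
    active_regions: list of (left, top, right, bottom) rectangles, with
    right and bottom exclusive. Rectangles are an assumption made here.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    # Start from an all-white page, then copy back only the active areas.
    result = [[WHITE] * width for _ in range(height)]
    for left, top, right, bottom in active_regions:
        # Clamp each rectangle to the image bounds.
        for y in range(max(0, top), min(height, bottom)):
            for x in range(max(0, left), min(width, right)):
                result[y][x] = pixels[y][x]
    return result


# A tiny 4x4 "page": a dark border around a 2x2 block of writing.
page = [
    [0, 0, 0, 0],
    [0, 10, 20, 0],
    [0, 30, 40, 0],
    [0, 0, 0, 0],
]
cleaned = white_out(page, [(1, 1, 3, 3)])
```

Because the result is an ordinary image, whatever word-recognition step follows needs no changes: it simply finds nothing in the whitened areas.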

A case in point was my Harpur typescript, pasted into a scrapbook. Word recognition worked poorly because of interference between the text and residual elements on the border. Now it is as good as the best cases. But this does mean that the process is not entirely automatic.