tag:blogger.com,1999:blog-26294120921407227572024-03-14T00:12:13.123-07:00TILT – text to image linking tooldesmondhttp://www.blogger.com/profile/01722159590093138289noreply@blogger.comBlogger14125tag:blogger.com,1999:blog-2629412092140722757.post-40055318804035184552015-10-30T12:40:00.001-07:002015-10-30T12:40:07.383-07:00Further development of TILT<p>Funding from a US university has been secured to complete the TILT tool to a finished state. In the next few months I will be completing the editing GUI and also adding a viewing tool for pages recognised using TILT. This will allow texts to be scrolled in sync with the source images, linked at the word-level, which should allow very smooth scrolling. I will also be writing up a publication for an HCI-type journal to explain the design in preparation for user-testing.</p>desmondhttp://www.blogger.com/profile/01722159590093138289noreply@blogger.com1tag:blogger.com,1999:blog-2629412092140722757.post-61241448545084003302015-05-07T21:10:00.000-07:002015-05-07T21:18:38.389-07:00TILT continues to improve<p>TILT has now been thoroughly revised at every stage except the final linking. The result is better separation of text from the image, and better line- and word-recognition. It is almost at the point where, for many manuscripts, the automatic output will be good enough for a first-pass set of text-to-image links. Here are the automatic results for the difficult De Roberto V2 manuscript of <i>The Viceroys</i>. Note the good automatic word-identification. This will be refined further in the linking step. By reference to the real text these shapes represent, it will be possible to split some of the merged words, though not those where there is overwriting, of course. The main difference is in the way that words are now restricted to their own line-region, which is polygonal rather than rectangular. The rectangular strategy was too simple, and led to recognised words that crossed line-boundaries.
Joined ascenders and descenders are all too common in manuscripts.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjP05E4ky4dRWsAjcQCq_z6O4Mdh3RmsVdCt7vpoVEWC0Xn71EeY6ipSaPFmf-Llyofc2cVLEcULr9J-qFKL0UdogUJBtFHZ-dRwAFMAKOFf0lcWBpmcZ17Vct8yYUkkNQE338tnXlAKSYX/s1600/derobertov2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjP05E4ky4dRWsAjcQCq_z6O4Mdh3RmsVdCt7vpoVEWC0Xn71EeY6ipSaPFmf-Llyofc2cVLEcULr9J-qFKL0UdogUJBtFHZ-dRwAFMAKOFf0lcWBpmcZ17Vct8yYUkkNQE338tnXlAKSYX/s320/derobertov2.png" /></a></div>desmondhttp://www.blogger.com/profile/01722159590093138289noreply@blogger.com0tag:blogger.com,1999:blog-2629412092140722757.post-72277230418739027762015-04-13T03:12:00.000-07:002015-04-13T10:24:04.512-07:00Binarising manuscript images<p>The first step in trying to recognise words on a page is to obtain a good black and white representation of the colour original. Typically, when you apply the standard binarisation techniques (like Sauvola's) to manuscript images, thin pen-strokes disappear and the writing breaks up. This makes recognition of lines and words very difficult. Here's part of the Brewster journal (<a href="http://biodiversitylibrary.org/pageimage/43746028">Biodiversity Heritage Library</a>) rendered using the <a href="https://github.com/tmbdev/ocropy">Ocropus toolset</a>:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMWBBLaCzH7a4SeYQE4lR-ZZpSrsbO7WMZTY5lO7g9w9IQpijWqK71hj_vp71Q8yCwA-DFYzkz6yw0Tblvh-fjn5NKc8PUcQPSrXovHAvne0Ro-ma2URqqRa4U0lmLIXt1tw-NRY8b5nj0/s1600/binarized.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMWBBLaCzH7a4SeYQE4lR-ZZpSrsbO7WMZTY5lO7g9w9IQpijWqK71hj_vp71Q8yCwA-DFYzkz6yw0Tblvh-fjn5NKc8PUcQPSrXovHAvne0Ro-ma2URqqRa4U0lmLIXt1tw-NRY8b5nj0/s320/binarized.png" /></a></div>
<p>As you can see, the thin pen-strokes have broken up and the text is difficult to read. But then I realised that the information about these thin strokes was still in the original greyscale image. By comparing the broken characters with the corresponding greyscale pixels I should be able to extend them, so long as those pixels were darker than the local average. For this to work I had to create several copies of the image: </p>
<ol>
<li>A gaussian blurred version of the original greyscale, with a blur radius of 1/80th of the image height.</li>
<li>The greyscale image</li>
<li>The regular binarised image</li>
<li>A mask, generated by blurring the binarised image, and rendering all the blurred pixels as pure black</li></ol>
<p>I then examined each pixel in 3; if it was black I recomputed the "blob" of connected pixels directly from the greyscale image at the same coordinates. To decide whether a pixel should be black or white I used the value at the same coordinates in 1, which is effectively the local average pixel value. Since both text and background are included in that calculation, the average at each point is likely to be somewhat darker than the background alone, so any pixels at least that dark and <em>in the vicinity</em> of the originally recognised text are very likely to be the missing fragments of letters. Judge for yourself:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjarJSyYo5_9U8SRtaojyspXxgOX-Nam1SfDRdzGoKaz_Tqzj5A726hrHwHYWEy05nE0H7plJP0A7TouXuXEpXKQ_x-61-Pabz0XL8CbeCrEMQazptIJuxsIY4EcUe5A2T4f5jCka_80lvs/s1600/reconstructed.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjarJSyYo5_9U8SRtaojyspXxgOX-Nam1SfDRdzGoKaz_Tqzj5A726hrHwHYWEy05nE0H7plJP0A7TouXuXEpXKQ_x-61-Pabz0XL8CbeCrEMQazptIJuxsIY4EcUe5A2T4f5jCka_80lvs/s320/reconstructed.png" /></a></div>
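<p>In outline the recipe can be sketched in a few lines of Python (not TILT's own code). This is a per-pixel approximation of the blob-regrowing described above, assuming numpy and scipy; the function name is mine, and the blur radii are translated loosely into gaussian sigmas:</p>

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def enhance_binarisation(grey, binary):
    """Recover thin strokes lost in thresholding.

    grey   : 2-D float array, 0 = black, 255 = white (the greyscale copy, 2.)
    binary : 2-D bool array, True = black ink (the regular binarised copy, 3.)
    """
    h = grey.shape[0]
    # 1. gaussian blur of the greyscale: the local average pixel value
    local_avg = gaussian_filter(grey, sigma=h / 80.0)
    # 4. mask: blur the binarised image and treat every touched pixel as
    #    "in the vicinity" of already-recognised text
    mask = gaussian_filter(binary.astype(float), sigma=h / 200.0) > 0.0
    # a pixel turns black if it is darker than the local average
    # and lies inside the mask around existing strokes
    return binary | ((grey < local_avg) & mask)
```
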
<p>To stop this extension bleeding into the surrounding parts of the greyscale image I used a mask to restrict how far the local extensions would go. The blur radius I used was 1/200th of the image height. I've tried it on printed books, typescripts and manuscripts and it seems to work quite well. The beauty of this is how simple it is: no machine learning, no fancy transformations. The small lines under the words could be removed by a blue-filter on the original colour image. But I'm only interested in word-recognition, not recognition of letters, so mostly these lines do no damage.</p>desmondhttp://www.blogger.com/profile/01722159590093138289noreply@blogger.com2tag:blogger.com,1999:blog-2629412092140722757.post-76791383575093609492015-02-25T23:52:00.001-08:002015-02-25T23:53:13.055-08:00A real editing interface<p>TILT finally has an <a href="http://ecdosis.net/polygon/tilt.html">editing GUI</a> of sorts. Although it doesn't do much yet, over the next few days it should acquire all the new parts that have been created in the past two weeks, and incorporate the TILT back-end service to recognise pages dynamically. It will also allow the user to edit and save alignments, which have mostly been produced automatically. So far all it can do is switch between justified view of the text and line-by-line mode, and zoom the image. But as can be seen from the buttons on the left, there is plenty to come.</p>desmondhttp://www.blogger.com/profile/01722159590093138289noreply@blogger.com0tag:blogger.com,1999:blog-2629412092140722757.post-48040475267110921812015-02-20T01:45:00.001-08:002015-02-20T13:00:51.793-08:00Slicing and having multiple polygons on a page<p>One of the problems I described in the last entry was slicing a polygon in two. Doing this required a pretty good understanding of high-school geometry. But it now works in the demo on this page.</p>
<p>But having more than one polygon creates a new problem. As you move your mouse over the page you have to decide which polygon you are moving over, or which corner-point you are clicking on. And there might be hundreds of polygons or points on a page, and you have only a few milliseconds to decide between them. Imagine that you have some way to test "is the mouse inside this polygon?" or "is the mouse over a corner point of this polygon?" If you have, say, 100 polygons, you will have to call those two tests 100 times whenever the mouse moves even a small distance. And that will be much too slow if you want the interface to be responsive.</p>
<p>So I decided to divide the page-image up into four rectangles (NW, NE, SW, SE) containing either at most four points, or nothing. If there is nothing then the rectangle acts as a container for four smaller rectangles that are inside it. And as the mouse can only be in one place at a time it can only ever be over one rectangle that has any points. As well as the points each such rectangle also contains a list of polygons that overlap with some part of it. Deciding which rectangle you are in is easy because they are nested. So now it is a simple matter to test "is the mouse currently over a polygon or a corner-point?" because there will only be a few of them in each rectangle. The only problem with this is that as you edit the polygons and the points the rectangles must be kept up to date. But that is a solvable problem. Click on the image below to see it in action:</p>
<a href="http://ecdosis.net/polygon/canvas.html"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhAXq9w8WWOpuBl4-a19RbP8VeeRFiieeOhbhO0X_ur_kNI6fL98tMK_fjmb4KfsANX0evpTlrutJMAuOs_b2GQ5tbnPUdYd9SR37dyziAuPYoo6NDJ3Z6_O0AqP_jcoiB5QR-w8Qxudfgt/s320/polygon.png" /></a>
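<p>The nested rectangles described above are essentially a quadtree. Here is a minimal Python sketch of the point-index part only (the real tool also keeps, in each rectangle, a list of the polygons overlapping it; the class and method names are mine):</p>

```python
class QuadTree:
    """A rectangle holding at most four points, or four sub-rectangles."""
    CAPACITY = 4

    def __init__(self, x, y, w, h):
        self.x, self.y, self.w, self.h = x, y, w, h
        self.points = []
        self.children = None

    def contains(self, px, py):
        return (self.x <= px < self.x + self.w and
                self.y <= py < self.y + self.h)

    def insert(self, px, py):
        if not self.contains(px, py):
            return False
        if self.children is None:
            if len(self.points) < self.CAPACITY:
                self.points.append((px, py))
                return True
            self._subdivide()
        return any(c.insert(px, py) for c in self.children)

    def _subdivide(self):
        hw, hh = self.w / 2, self.h / 2
        self.children = [QuadTree(self.x, self.y, hw, hh),            # NW
                         QuadTree(self.x + hw, self.y, hw, hh),       # NE
                         QuadTree(self.x, self.y + hh, hw, hh),       # SW
                         QuadTree(self.x + hw, self.y + hh, hw, hh)]  # SE
        for p in self.points:              # push existing points down a level
            for c in self.children:
                if c.insert(*p):
                    break
        self.points = []

    def query(self, px, py):
        """Return the few points in the leaf rectangle under the mouse."""
        if not self.contains(px, py):
            return []
        if self.children is not None:
            for c in self.children:
                if c.contains(px, py):
                    return c.query(px, py)
        return self.points
```

Hit-testing then only has to examine the handful of candidates returned by `query`, instead of all the points on the page.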
<p>So now what I have is point-delete, point-add, point-drag, and polygon-anchor (freeze and highlight points), polygon-highlight and polygon-split. That's enough to try to create a usable GUI for editing the output of the TILT recognition process. And that, of course, is the next step.</p>desmondhttp://www.blogger.com/profile/01722159590093138289noreply@blogger.com0tag:blogger.com,1999:blog-2629412092140722757.post-41394275705879314932015-02-15T12:30:00.001-08:002015-02-15T13:00:48.538-08:00Second steps with user interface<p>I have refined the test program in the previous post to add points. So now you can both add and delete as well as move points. Some would argue that this is enough. But there are some tools that would greatly speedup editing that no one else seems to have thought about yet:</p>
<ol><li>Slicing a polygon in two. Imagine that you have a polygon that covers several words. You need to cut it quickly into two. With just the ability to delete and add points to existing polygons (no new polygons) how else can you do that? With the mouse all you would need to do would be to drag a line over an anchored polygon, then release the mouse to slice it along that line. <div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQUPOGKJToJMe0s5_vHp7o8b_DCr-iBG-SK-Wy3UvlWDwCnDoCkyekz1TKgAxjIPNjkxhv3nj8z-f0VGHm8DtqDwEakOt38-BTL1j-PCi9EUUs-5HMPorZVjwKtyyQCtYo3Q2M6z3qPv09/s1600/split.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQUPOGKJToJMe0s5_vHp7o8b_DCr-iBG-SK-Wy3UvlWDwCnDoCkyekz1TKgAxjIPNjkxhv3nj8z-f0VGHm8DtqDwEakOt38-BTL1j-PCi9EUUs-5HMPorZVjwKtyyQCtYo3Q2M6z3qPv09/s320/split.png" /></a></div></li>
<li>Merge two or more polygons. If you have fragmented polygons it would be great to just shift-click them and then merge them in a single stroke of the mouse. This could be done by dragging from inside one polygon, across the ones to be merged, and ending inside another polygon. Then all dragged-over polygons could be merged. <div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibopNSCQ6THQ0-IhDo3MzZlmQ5vjOx0HZQGeY5TIcb4Z_-zzCQV-MwCTtsZTAw0tEINUjeGJbQLx_Aw-HwGMUtoc4UhSsVKQiW2HbBe0aLKcTip4bAa95qB59DUysYGh-atXh_z0awSCKa/s1600/merge.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibopNSCQ6THQ0-IhDo3MzZlmQ5vjOx0HZQGeY5TIcb4Z_-zzCQV-MwCTtsZTAw0tEINUjeGJbQLx_Aw-HwGMUtoc4UhSsVKQiW2HbBe0aLKcTip4bAa95qB59DUysYGh-atXh_z0awSCKa/s320/merge.png" /></a></div></li>
<li>Create blobs. By clicking on a region that has no polygon you could send a message to the service to try to recognise a word in one go. <div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVwMoZiSLy1Hr0OmD72hdFK3ys9oQ05h948hUoyRfn5LfDV-zz9DfMhM1UUvS98AGOtBPQ5XPc6ccjg6X7VbVqK3xwkDz8td9VAqNrqu3UdNtc3T_NsdeUjhstjSFBydKigSuGpXXxqRT2/s1600/blob.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVwMoZiSLy1Hr0OmD72hdFK3ys9oQ05h948hUoyRfn5LfDV-zz9DfMhM1UUvS98AGOtBPQ5XPc6ccjg6X7VbVqK3xwkDz8td9VAqNrqu3UdNtc3T_NsdeUjhstjSFBydKigSuGpXXxqRT2/s320/blob.png" /></a></div></li></ol>
<p>I've nearly got 1 to work. Both 1 and 2 are a bit counter-intuitive, because dragging in drawing programs is supposed to draw a square marquee. But marquees are just not very useful in this case, so I think overriding the default is a good idea. We need to facilitate the operations that the editor of a set of polygons will use all the time, or it will quickly become tedious.</p>desmondhttp://www.blogger.com/profile/01722159590093138289noreply@blogger.com0tag:blogger.com,1999:blog-2629412092140722757.post-58792240456988221742014-11-09T15:02:00.001-08:002014-11-09T15:10:25.583-08:00First steps with user interface for TILT<p>One of the key requests that emerged from the BL Labs seminar last Monday (3/11/14) was the need for a usable GUI to control TILT. I wanted something efficient and light-weight that would last more than a few months, so I set about trying to build the GUI using only standard Web technologies – just jQuery and JavaScript – no 'plugins'. My experience with the latter is that, once their creators have moved on, they don't tend to be updated, and are quickly replaced by the next latest fad. Also, learning how to use them, and discovering half-way through, as often as not, that they are missing some key feature, often makes more work than just writing your own solution from scratch. So I thought I would solve the most difficult problem first: how to represent and edit polygons on screen.</p>
<a href="http://ecdosis.net/polygon/canvas.html"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhAXq9w8WWOpuBl4-a19RbP8VeeRFiieeOhbhO0X_ur_kNI6fL98tMK_fjmb4KfsANX0evpTlrutJMAuOs_b2GQ5tbnPUdYd9SR37dyziAuPYoo6NDJ3Z6_O0AqP_jcoiB5QR-w8Qxudfgt/s320/polygon.png" /></a>
<p>What it currently lacks is the ability to add and subtract points. After that is added, I need to make it load an entire page of polygons generated by TILT. But one step at a time.</p>desmondhttp://www.blogger.com/profile/01722159590093138289noreply@blogger.com0tag:blogger.com,1999:blog-2629412092140722757.post-59785291586095153472014-10-14T19:46:00.000-07:002014-10-14T20:30:57.091-07:00Improved Line recognition<p>I managed to improve line-recognition, which has been plaguing the effectiveness of later stages. Until you can do that in manuscripts, with lines that are not horizontal, evenly spaced or straight, you have no chance of even recognising words reliably. My previous method, which first divided the page into small rectangles, could detect the same line two or three times over – for example, once each for the ascenders, the descenders and the main body of the line. The new method first blurs the image, then subdivides it into narrow <em>vertical</em> strips. The strips are then reduced to a single pixel in width by averaging them horizontally. This produces a graph that indicates the rise and fall of blackness within the strip. But since the data has many small peaks and troughs that aren't really interesting, I first apply a smoothing function before trying to detect the main peaks of blackness. These will very likely correspond to the black lines of type or writing. The final step is to join up the lines detected in the strips by horizontally aligning them as before. The result is very good line-recognition on most of the examples. Here's how the De Roberto manuscript, which is fairly average in difficulty, looks with the lines recognised on top of the blurred image:</p>
<p align="center"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhngabqd5pNqsHeRmIwohPdvdLlR6W24Okyb8aOtc_jYA6pCLhd-Ksj5QcNrkRAunRSnbstvUxPLXo5Bxhb-36uAjEg6nySuXwWa6p91MVpt3hJW9QJTViAR8_uQVo8P6YIZlisaxuiSwo1/s1600/lines.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhngabqd5pNqsHeRmIwohPdvdLlR6W24Okyb8aOtc_jYA6pCLhd-Ksj5QcNrkRAunRSnbstvUxPLXo5Bxhb-36uAjEg6nySuXwWa6p91MVpt3hJW9QJTViAR8_uQVo8P6YIZlisaxuiSwo1/s320/lines.png" /></a></p>
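<p>The strip-profile recipe can be sketched roughly as follows, with numpy and scipy standing in for TILT's own smoothing and peak-finding; apart from the 25 strips mentioned above, the parameter values are illustrative guesses:</p>

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import find_peaks

def line_centres(blackness, n_strips=25, smooth_sigma=8):
    """Find candidate line positions in each vertical strip.

    blackness : 2-D float array, higher = more ink (e.g. 255 - greyscale).
    Returns one list of row indices (peaks of blackness) per strip.
    """
    h, w = blackness.shape
    edges = np.linspace(0, w, n_strips + 1).astype(int)
    peaks_per_strip = []
    for a, b in zip(edges[:-1], edges[1:]):
        # reduce the strip to a single pixel in width by averaging
        profile = blackness[:, a:b].mean(axis=1)
        # smooth away the small, uninteresting peaks and troughs
        profile = gaussian_filter1d(profile, smooth_sigma)
        # the main peaks of blackness: very likely lines of writing
        rows, _ = find_peaks(profile, height=profile.mean())
        peaks_per_strip.append(list(rows))
    return peaks_per_strip
```

Joining up the per-strip peaks into continuous, possibly slanting lines is the horizontal-alignment step described above.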
<p>A side-effect of this approach is that it should improve word-recognition, not only by helping to locate words, but also by joining up word-fragments through blurring. However, I'm running out of time now as the deadline of November 3 looms.</p>desmondhttp://www.blogger.com/profile/01722159590093138289noreply@blogger.com0tag:blogger.com,1999:blog-2629412092140722757.post-40231156730324512752014-09-15T23:14:00.001-07:002014-09-15T23:32:05.936-07:00Defining sub-regions of images<p>One of the things that often happens in manuscripts is that certain areas bear writing and the rest of the image can be more or less ignored. In some files I was sent recently by a group interested in TILT, most of the page, handwritten on the verso, is taken up by writing that <a href="http://biodiversitylibrary.org/pageimage/44690726">shows through from the recto.</a> Other cases include decorative borders or letterheads that ideally need to be excluded. Doubtless there are automatic methods to deal with some of these. The question is, how often do they go wrong? And are they not designed to work on regularly laid-out pages in printed books? What about manuscripts where the writing goes right into the margin? In such cases any automatic technique is likely to do more damage than it repairs. Manually specifying these active regions is worth the small effort it costs if the result is much improved word-recognition and text-to-image linking. All that was needed was to 'white-out' the regions of no interest, and so the same procedure as before would ignore those areas.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhrjDMGW7WtqNJ3hynE2U6KypGlueI95YO10mPshNgsjoYe_FApH7cbytQLudLFB0yTOJ2gG43G9doZ4LAVUPUhqJOCaF3n7SxGgRsVMYb3XO_k2GE9pcSrudc_JKnxz3mmzbgDyuacmcYq/s1600/colour.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhrjDMGW7WtqNJ3hynE2U6KypGlueI95YO10mPshNgsjoYe_FApH7cbytQLudLFB0yTOJ2gG43G9doZ4LAVUPUhqJOCaF3n7SxGgRsVMYb3XO_k2GE9pcSrudc_JKnxz3mmzbgDyuacmcYq/s400/colour.gif" /></a></div>
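<p>The white-out step itself is trivial. A sketch, assuming numpy and rectangular regions for simplicity (as the animation shows, the regions drawn in the tool can be arbitrary shapes), with names of my own invention:</p>

```python
import numpy as np

def whiteout_outside(grey, active_rects, white=255):
    """Keep only the hand-specified active regions; the rest becomes white.

    active_rects : list of (x, y, w, h) rectangles that bear real writing.
    """
    keep = np.zeros(grey.shape, dtype=bool)
    for x, y, w, h in active_rects:
        keep[y:y + h, x:x + w] = True
    out = np.full_like(grey, white)   # start from an all-white page
    out[keep] = grey[keep]            # copy back only the active regions
    return out
```
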
<p>A case in point was my Harpur typescript, pasted into a scrapbook. Word recognition worked poorly because of interference between the text and residual elements on the border. Now it is as good as the best cases. But this does mean that it is not entirely automatic.</p>desmondhttp://www.blogger.com/profile/01722159590093138289noreply@blogger.com0tag:blogger.com,1999:blog-2629412092140722757.post-64073407125919888352014-08-01T03:47:00.000-07:002014-08-11T14:33:38.424-07:00Woo-hoo! It works!<p>I always like to capture the joyful moment when something finally 'works'. It makes all the labour seem worthwhile, even if, to an untrained eye, it might still appear to suffer from many deficiencies. So to cut to the chase: <a href="http://ecdosis.net/tilt/test/post">the demo site</a> now links the page image one word at a time to the text on the right, and vice versa. Just select an example and click on "upload", then "link". The best example is the Harpur Sonnets. There are problems with splitting shapes that belong to several words (try splitting a polygon some time), and there are many other deficiencies: for example, I don't like the shape of the polygons – they're ugly convex ones and I want concave ones that surround the word elegantly. And I am painfully aware that my word- and line-recognition modules still need some work. But all these improvements and others can be comfortably consigned to 'future work'. </p>
<p>The total development time from start to this point has been around 6 and a half weeks, part-time work for one programmer. When you compare other projects that have worked on this same problem, and didn't get as far, and how much they cost, that's pretty damn fast.</p>
<p>Addendum: There is now a better version but it is much slower. The problem is that words get recognised on the wrong lines. Once that is fixed it should work OK on all the examples. But I won't upload a new version until I'm happy with the speed.</p>desmondhttp://www.blogger.com/profile/01722159590093138289noreply@blogger.com1tag:blogger.com,1999:blog-2629412092140722757.post-89277122244122980912014-07-09T17:51:00.002-07:002014-07-10T03:04:23.354-07:00Getting word-spacing right<p>As mentioned in the previous post the hardest thing to get right in TILT is to accurately estimate the minimum space between words. A little reflection will show that manuscripts, typescripts and printed texts all employ very different conventions on the use of spaces between words. How can you estimate word-spaces in manuscripts? It's hopeless, surely?</p>
<p>In fact, there is a trivial solution. A page-image is made up of 'blobs', that is, groups of connected pixels. Wrapping each such blob in a polygon allows you to compute the distance between blobs on a line. In a printed text there will be one blob per letter. In a manuscript, because of joined-up writing, there will be many characters per blob. And every now and again there will be a gap between blobs that is not a word-space. So how can these informal gaps be distinguished from real word-spaces? Another problem is that there are column-gaps where the space between blobs is measured in hundreds of pixels. Just measuring gaps in a line and averaging the result thus has no hope of success. How can these huge gaps be excluded? But then I realised that the number of words on a page is already known, because TILT needs the text of each page to align it with the image. So all I had to do was to find all the gaps on a page and sort them by decreasing size. By assigning one gap to each word in the text, the last (smallest) one chosen gives the width of the minimum word-gap.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTe82nH9Tqo_d18sDx0ls5qzA0bCHVUl8E97dsOnGRWJbdAiM6dCrFy5k1QC8CtD_7Bg458SfTVKWx3-I58WPMGVxNbhSqh43NIFJ8d2M4QOez5E7f5RsbsPKX-i_kyv2jrzwQD0qdox-z/s1600/gaps.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTe82nH9Tqo_d18sDx0ls5qzA0bCHVUl8E97dsOnGRWJbdAiM6dCrFy5k1QC8CtD_7Bg458SfTVKWx3-I58WPMGVxNbhSqh43NIFJ8d2M4QOez5E7f5RsbsPKX-i_kyv2jrzwQD0qdox-z/s320/gaps.png" /></a></div>
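<p>The whole trick fits in a couple of lines. A Python sketch, assuming the gaps between blobs have already been measured; the names are mine, not TILT's:</p>

```python
def minimum_word_gap(gaps, n_words):
    """Estimate the smallest true inter-word space on a page.

    gaps    : widths (px) of all horizontal gaps between blobs on the page
    n_words : number of words, known from the page's transcription
    """
    # one gap per word: the n_words widest gaps are taken to be word-spaces,
    # and the narrowest of those is the minimum word-gap
    widest = sorted(gaps, reverse=True)[:n_words]
    return widest[-1] if widest else 0
```

The huge column-gaps are harmless here: being the widest, they simply use up a few of the n_words slots.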
<p>This works so well, I have updated the <a href="http://ecdosis.net/test/post">test script</a> to show it. It will work for <em>any</em> page from a manuscript, typescript, inscription or printed book. The only gotcha is that you must know the number of words, or use an estimate. Also it can never be perfect. There is no one setting for a minimal word-gap, since an author can write two words with less than this separation and separate two halves of one word with greater than this. But it's still an optimal solution.</p>
<p>Just to give you some idea of how much the minimum word-gap varies between the test examples:</p>
<table border="1"><tr><th>Type</th><th>Author</th><th>Number of words</th><th>Minimum word-space (pixels)</th></tr>
<tr><td>Typescript</td><td>Harpur</td><td>150</td><td>4</td></tr>
<tr><td>Printed</td><td>De Roberto</td><td>291</td><td>12</td></tr>
<tr><td>Manuscript</td><td>De Roberto</td><td>353</td><td>6</td></tr>
<tr><td>Manuscript</td><td>Capuana</td><td>205</td><td>7</td></tr></table>
<p>Now for the text-to-word alignment. That's the last stage.</p>desmondhttp://www.blogger.com/profile/01722159590093138289noreply@blogger.com0tag:blogger.com,1999:blog-2629412092140722757.post-53343759502338577072014-07-06T02:05:00.001-07:002014-07-07T04:06:02.908-07:00TILT recognises words in manuscripts, typescripts, books<p>The next milestone has been reached. <a href="http://github.com/AustESE-Infrastructure/TILT2">TILT</a> can now recognise words on the lines identified earlier with reasonable accuracy. What it does is pretty simple: it just looks for black text in a strip on either side of the lines identified in the previous step and then extends any black lines discovered outwards. It finally draws a polygon around the discovered word.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbzluJf-Y9U2Nv0o5EwQJKgICrML7joars8Os0oWBR37peYgdQn9BZjIaGGfQPlKl0vFCEvQ3wB525HfMv3osM6JR0BD3MHyvraqQ2h8w9Wcx7uD9lH_0kJZvNLzq0fwevAtkbCTry8jBB/s1600/small.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbzluJf-Y9U2Nv0o5EwQJKgICrML7joars8Os0oWBR37peYgdQn9BZjIaGGfQPlKl0vFCEvQ3wB525HfMv3osM6JR0BD3MHyvraqQ2h8w9Wcx7uD9lH_0kJZvNLzq0fwevAtkbCTry8jBB/s400/small.png" /></a></div>
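<p>In outline, the strip search might be sketched like this, assuming a binarised image and a known minimum word-gap. It is a simplification: the real step extends the black shapes it finds outwards and wraps each word in a polygon, whereas this just returns horizontal extents:</p>

```python
import numpy as np

def words_on_line(ink, line_y, strip, min_gap):
    """Find word extents along one recognised line (simplified sketch).

    ink     : 2-D bool array, True = black pixel
    line_y  : row of the recognised line
    strip   : half-height of the search band around the line
    min_gap : smallest gap (px) treated as a word-space
    Returns a list of (start_col, end_col) pairs, one per word.
    """
    band = ink[max(0, line_y - strip):line_y + strip + 1, :]
    cols = band.any(axis=0)            # columns containing any ink
    words, start, gap = [], None, 0
    for x, black in enumerate(cols):
        if black:
            if start is None:
                start = x              # a new word begins
            end, gap = x, 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:         # a real word-space: close the word
                words.append((start, end))
                start = None
    if start is not None:              # close a word running to the edge
        words.append((start, end))
    return words
```
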
<p>The next step will be to refine these words so they represent as closely as possible the words in the transcribed text. Then it should be easy to align them with the words in the transcription of the page (which we often already have) and hand over to Anna for development of the GUI. Here's a sample of TILT's current performance using polygons. These can be reduced to rectangles easily if desired.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhM3ZaSeghN7JDomwhpyqUYEXY8pembp6wpY2ZW-vIZJyYb5qjZyYO8BNJu3Lt4aXpiMEtQc7qI6370l8-jDzf98OCXzYovz_viJKWkK0EGHJQ4xnK_zhLlctqcvoPh5JtpdWY76wF7BCWi/s1600/manuscript.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhM3ZaSeghN7JDomwhpyqUYEXY8pembp6wpY2ZW-vIZJyYb5qjZyYO8BNJu3Lt4aXpiMEtQc7qI6370l8-jDzf98OCXzYovz_viJKWkK0EGHJQ4xnK_zhLlctqcvoPh5JtpdWY76wF7BCWi/s400/manuscript.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhU6aOw-aBDP15Hp23jutittZSgeskw_rakeeowt24tGItR68UEvB5sd8U7q7ldalrCxIMg5tjrmJtiIlFulOqDIpzq9us4qpqJ-vLfIxu9Dcv8XAs4REIOC1b4XxSKOcc5Kn0yiTW9545w/s1600/words.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhU6aOw-aBDP15Hp23jutittZSgeskw_rakeeowt24tGItR68UEvB5sd8U7q7ldalrCxIMg5tjrmJtiIlFulOqDIpzq9us4qpqJ-vLfIxu9Dcv8XAs4REIOC1b4XxSKOcc5Kn0yiTW9545w/s400/words.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpwrMhCtGimDbwlTCRDq9Nn2r-pNvnf1_Z0no7DAGDX7ct6LWv1zSzjxZUtOzBzWvekSMgMNR7-Bt5QFO6bw4BsvKDaOAM2eqLImYk56J7iB5Ulp0VCRhYv48PzCSRImr0_TwmmKXAEwz8/s1600/typescript.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpwrMhCtGimDbwlTCRDq9Nn2r-pNvnf1_Z0no7DAGDX7ct6LWv1zSzjxZUtOzBzWvekSMgMNR7-Bt5QFO6bw4BsvKDaOAM2eqLImYk56J7iB5Ulp0VCRhYv48PzCSRImr0_TwmmKXAEwz8/s400/typescript.png" /></a></div>
<p>The main problem in recognising words turned out to be the different ways that spaces are used in printed and handwritten texts. In the former there are lots of little inter-character gaps that mostly aren't present in manuscripts. Try as I might, I couldn't find a single setting that worked well for both. These images show the performance on a typescript, a manuscript and a printed page. The colours alternate to show where word-divisions have been recognised. To get this performance in practice, the GUI will have to specify the image type.</p>
<p>The next stage has some ability to split/merge words, based on their alignment with a known text, but it would be better if good word-identification could be attained at this stage.</p>desmondhttp://www.blogger.com/profile/01722159590093138289noreply@blogger.com0tag:blogger.com,1999:blog-2629412092140722757.post-58269793125633493902014-06-27T13:49:00.001-07:002014-06-27T14:05:55.262-07:00TILT recognises lines in manuscript/print books<p>Perhaps the hardest thing to get right in the TILT design is to reliably recognise lines on a page where division into lines may be irregular. For example, you can easily have uneven line-spacing, or inserted, warped and tilted lines, but in order to recognise the words on a page you first have to work out roughly where they are. TILT has shown, early in its life, that the basic idea for its line-recognition method works. There is a live demo <a href="http://ecdosis.net/test/post">here.</a> It is slow only because the server is slow. TILT is actually pretty fast. Once you've loaded a page you can click on some buttons to see how TILT processes a page-image. First it reduces it to greyscale, then to pure black and white, then it removes residual borders (which are ordinary OCR steps). Finally it searches the page for lines, using a grid of rectangles, about 25 across and 200 down. The reason for this strange proportion is that lines are pretty much shaped that way. So if lines slant down or up, or curve, it should be able to track their progress across the page. So far it has demonstrated that it can discover small lines in between the main ones. Ordinary OCR programs can't do this: they assume that text has evenly-spaced lines. TILT's test interface draws a line over the top of each line of text just for the demonstration, but in the real product these lines will be invisible. Along each line it will later attempt to recognise words, and to align those words with the already transcribed text. But this step brings that much closer.</p>
<p>What this makes possible is the offline processing of large numbers of page-images, creating page-image to text links, which can then be uploaded. They won't be perfect without editing (which is what the graphical user interface is needed for) but for a first pass it will suffice for now.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyGR3HDABNverSDbFmhp032SOM2QSggMEmG9Nh-QuU7G50mBpdPxVFEF-GVCjsrF1dIbpyXXZ6IhPrYau1U7hTZJu1bP0zVsmB-GDoZJUOq4XklRFN6rvMX_CefyO88BKc8lL9zr9jb9SI/s1600/line-algorithm.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyGR3HDABNverSDbFmhp032SOM2QSggMEmG9Nh-QuU7G50mBpdPxVFEF-GVCjsrF1dIbpyXXZ6IhPrYau1U7hTZJu1bP0zVsmB-GDoZJUOq4XklRFN6rvMX_CefyO88BKc8lL9zr9jb9SI/s400/line-algorithm.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0uBiPmNl8lvf8kCGrO23nqFzfNt9DdrHzyBCHAI8SYytJQ-GXYNPByLoJHQ7wrvnA6rQ7QscAcp8wnDaXoorb58UUoLZeS4BnZszXjLO9DJQFjyd2YkkHfH-zB3k3q26fzKPGeXX6lsaZ/s1600/harpur-lines.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0uBiPmNl8lvf8kCGrO23nqFzfNt9DdrHzyBCHAI8SYytJQ-GXYNPByLoJHQ7wrvnA6rQ7QscAcp8wnDaXoorb58UUoLZeS4BnZszXjLO9DJQFjyd2YkkHfH-zB3k3q26fzKPGeXX6lsaZ/s400/harpur-lines.png" /></a></div>desmondhttp://www.blogger.com/profile/01722159590093138289noreply@blogger.com0tag:blogger.com,1999:blog-2629412092140722757.post-90118342990087359052014-06-11T13:00:00.000-07:002014-06-12T13:12:55.343-07:00TILT is born again<p>This is a fresh start for the text-to-image linking tool (TILT). TILT is a tool for linking areas on a page-image taken from an old book, be it manuscript or print, and a clear transcription of its contents. As we rely more and more on the Web there is a danger that we will leave behind the great achievements of our ancestors in written form over the past 4,000 years. On the Web what happens to all those printed books, handwritten manuscripts on paper, vellum, papyrus, stone, or clay tablets etc.? Can we only see and study them by actually visiting a library or museum? Or is there some way that they can come to us, so they can be properly searched and studied, commented on and examined by anyone with a computer and an Internet link?</p>
<p>So how do we go about that? Haven't Google and others already put lots of old books onto the Web by scanning page-images and reading their contents with OCR (optical character recognition)? Sure they have, and I don't mean to play down the significance of that, but for objects of greater than usual interest you need a lot more than mere page-images and unchecked OCR of their contents. For a start you can't OCR manuscripts, or at least not well enough. And OCR of even old printed books produces lots of errors. Laying the text directly on top of the page-images means that you can't see the transcription to verify its accuracy; although you can search it, you can't comment on it, format it or edit it. And in an electronic world, where we expect so much more of a Web page than for it merely to sit there dumbly to be stared at, the first step in making the content more useful and interactive is to separate the transcription from the page-images.</p>
<h2>Page-image and content side by side</h2>
<p>Page images are useful because they show the true nature of the original artefact. Transcriptions do not: they are composed of mere symbols chosen, by convention, to represent the content of the writing. Text alone cannot represent complex mathematical formulae, drawings or wood-cuts, the typography, the layout, or the underlying medium. So you still need an image of the original to supply this information, not least because you may want to verify that the transcription is a true representation of it. The only practical way to do this is to put the transcription next to the image.</p>
<p>Now the problems start. One of the principles of HCI (human-computer interaction) design is that you have to minimise the effort, or ‘excise’, as the user goes about his or her tasks. And putting the text next to the image creates a host of problems that increase excise dramatically.</p>
<p>As the user scrolls down the transcription, reading it, at some point the page-image will need refreshing. And likewise if the user moves on to another page image, the transcription will have to move down also. So some linkage between the two is already needed even at the page-level of granularity.</p>
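A page-level linkage like the one just described can be sketched in a few lines of JavaScript. This is purely illustrative, not TILT's actual code: <code>pageForScroll</code>, the offset table and the wiring shown in the comment are all hypothetical names.

```javascript
// Hypothetical sketch of page-level linkage: pageOffsets[i] is the pixel
// offset in the transcription pane at which page i's text begins.
function pageForScroll(pageOffsets, scrollTop) {
  // Pick the last page whose start lies at or above the scroll position.
  let page = 0;
  for (let i = 0; i < pageOffsets.length; i++) {
    if (pageOffsets[i] <= scrollTop) page = i;
    else break;
  }
  return page;
}

// In a real GUI this would hang off the pane's scroll event, e.g.:
//   pane.onscroll = () => showPageImage(pageForScroll(offsets, pane.scrollTop));
```

So if the user has scrolled 600 pixels into a transcription whose pages begin at 0, 500 and 1200 pixels, the second page's image would be shown.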
<p>And if the text is reformatted for the screen, perhaps on a small device like a tablet or a mobile phone, the line-breaks will be different from the original. So even if the printed text is perfectly clear, it won't be clear, as you read the transcription, where the corresponding part of the image is. You may say that this is easily solved by enforcing line-breaks exactly as they are in the original. But if you do that and the lines don't fit in the available width – and remember that half the screen is already taken up with the page-image – then the ends of each enforced line must wrap around onto the next line, or else they will become invisible off to the right. Either way it is pretty ugly and not at all readable. And consider also that the line height, or distance between lines in the transcription can never match that of the page-image. So at best you'll struggle to align even one line at a time in both halves of the display.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_TQO6KVBQw_2X0yupKCdXrs6rW2WzCByKYqTrG75TEtXkcFKQPN31sIvaAyO4kn_uSPKrli-XRi6FK1CHnEytot6T028kScOBoM7KHb9LjtBV9tjmtpM2d9vCJzSNSznBk0DrZdlAduar/s1600/scrolling1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_TQO6KVBQw_2X0yupKCdXrs6rW2WzCByKYqTrG75TEtXkcFKQPN31sIvaAyO4kn_uSPKrli-XRi6FK1CHnEytot6T028kScOBoM7KHb9LjtBV9tjmtpM2d9vCJzSNSznBk0DrZdlAduar/s400/scrolling1.png" /></a></div>
<p>So what's the answer? It is, as several others have already pointed out (e.g. <a href="http://mith.umd.edu/tile/">TILE</a>, <a href="http://llc.oxfordjournals.org/content/28/2/190.short?rss=1">TBLE</a>, <a href="http://www.ieee-tcdl.org/Bulletin/v2n1/dekhtyar/dekhtyar.html">EPT</a>), to link the transcription to the page-image at the word-level. As the user moves the mouse over, or taps on, a word in the image or in the transcription, the corresponding word can be highlighted in the other half of the display, even when it is split across a line-break. And if needed the transcription can be scrolled up or down so that it automatically aligns with the word on the page. Now the ‘excise’ drops back to a low level.</p>
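For word-level highlighting to work, each link must record every image region a word occupies, since a word split across a line-break owns two regions. Here is a minimal sketch, assuming a simple rectangle-based link table; TILT's real data model may well differ.

```javascript
// Hypothetical link table: one entry per transcription word, each with
// one or more rectangles on the page-image (two when the word wraps).
const links = [
  { word: "w1", rects: [{ x: 10, y: 20, w: 40, h: 12 }] },
  { word: "w2", rects: [{ x: 55, y: 20, w: 30, h: 12 },   // end of one line
                        { x: 10, y: 36, w: 18, h: 12 }] }  // start of the next
];

// Return every image region belonging to a word, so that all of them can
// be highlighted together when the user hovers over or taps that word.
function regionsForWord(links, wordId) {
  const link = links.find(l => l.word === wordId);
  return link ? link.rects : [];
}
```

A hover handler would then draw each returned rectangle over the page-image, so that both halves of a wrapped word light up at once.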
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLJj5JwSeUCqUKrhJDh2V_U8LFPvFKkrkjLkdIJi7TuE7Cj_HBKR1yQA6meaYgc00wZ6TUv98JSX-507LzUVK94gl6ooMIGTL3W9pNELfvMatzbJzl9g3Q9FiJvU0dbaAo8s0K8M7w5NUH/s1600/text-image2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLJj5JwSeUCqUKrhJDh2V_U8LFPvFKkrkjLkdIJi7TuE7Cj_HBKR1yQA6meaYgc00wZ6TUv98JSX-507LzUVK94gl6ooMIGTL3W9pNELfvMatzbJzl9g3Q9FiJvU0dbaAo8s0K8M7w5NUH/s400/text-image2.png" /></a></div>
<h2>Making it practical</h2>
<p>The technology to make these links already exists; the problem is how to create them. Making them by hand is incredibly time-consuming and also very dull work, so automation is the key to making the approach work in practice. The idea of TILT is to make this task as easy and fast as possible, so that we can create hundreds or thousands of such text-to-image linked pages at low cost, and make all this material truly accessible and usable. The old TILT was written at great speed for <a href="http://dh2013.unl.edu/abstracts/ab-112.html">a conference in 2013</a>. What it did well was outline how the process could be automated, but it had a number of drawbacks that, now that they are properly understood, can be remedied in the next version. So this blog will be a record of our attempts to make TILT into a practical tool. The British Library Labs recently ran a competition and we were one of the two winners; they are providing us with support, materials and some publicity for the project. We aim to have TILT finished in demonstrable and usable form by October 2014.</p>