This thoughtful article in the Guardian got me thinking about the fact that the choices we make about representation of Scottish Gaelic in new language tech tools are far from trivial. The balance of ages, genders and dialects, for instance, on a tool like Duolingo can impact the future of the language in ways that are hard to anticipate.
Developing something like an artificial voice is a resource-intensive endeavour, especially for a small language like Scottish Gaelic. How do we decide which dialects – which voices – survive digitally in the years to come? When a project is funded via public monies (unlike Duolingo), whose choice should it be?
With funding from the UoE Challenge Investment Fund (Aug 2019), a small team of us have been busy developing the first handwriting recogniser for Scottish Gaelic. To do this, we have used Transkribus, a sophisticated machine-learning-based platform and online text repository.
Automatic transcription of Gaelic handwriting using Transkribus
The work began with the Digital Imaging Unit scanning about 2500 pages of handwritten manuscripts from the School of Scottish Studies Archives, supplemented by some additional scanning at the Centre for Research Collections.
Scanning manuscripts at the Centre for Research Collections
Once we received the texts, research assistant Michael Bauer manually transcribed about 18,000 words, which we used to train our first Gaelic handwriting model. This achieved an impressive Character Error Rate (CER) of 2.53%, or roughly 97.5% character-level accuracy, although the model was both trained and tested on a single writer's hand. We used this model to help transcribe a further 18,000 words and trained a second model. Again, this involved only one hand, but it achieved a CER of 1.90%.
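For readers unfamiliar with the metric, CER is commonly defined as the number of character edits (insertions, deletions and substitutions) needed to turn the machine's output into the correct transcription, divided by the length of the correct transcription. The Python sketch below illustrates that common definition. It is not Transkribus's own code, and the function names and the example sentence are made up for illustration; Transkribus may differ in details such as how it treats whitespace or punctuation.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character edits turning hypothesis into reference."""
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # drop a reference character
                curr[j - 1] + 1,                  # drop a hypothesis character
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: one wrong character in a 23-character line.
correct = "Tha mi a' dol dhachaigh"
recognised = "Tha mi a' dol dhachaidh"
print(f"CER: {character_error_rate(correct, recognised):.2%}")
```

On this definition, a CER of 2.53% means roughly one character in forty needs correcting, which is where the accuracy figure of about 97.5% comes from.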
Using the updated model, we are moving towards our target of 500k words. We have focussed the transcription efforts recently on increasing the number of hands involved, so that our next model is more generalisable and useful. The project will finish in July 2020, when we intend to make the Gaelic handwriting recogniser available to the public through Transkribus.