
Rannsachadh digiteach air a' Ghàidhlig ~ Goireasan digiteach airson nan Gàidheal (Digital research on Gaelic ~ Digital resources for the Gaels)

Month: February 2020

Scottish Gaelic and its representation in language tech tools

This thoughtful article in the Guardian got me thinking about how far from trivial our choices are when we represent Scottish Gaelic in new language tech tools. The balance of ages, genders and dialects on a tool like Duolingo, for instance, can affect the future of the language in ways that are hard to anticipate.

Developing something like an artificial voice is a resource-intensive endeavour, especially for a small language like Scottish Gaelic. How do we decide which dialects – which voices – survive digitally in the years to come? When a project is funded via public monies (unlike Duolingo), whose choice should it be?

 

Building a Handwriting Recogniser for Scottish Gaelic

With funding from the UoE Challenge Investment Fund (Aug 2019), a small team of us has been busy developing the first handwriting recogniser for Scottish Gaelic. To do this, we have used Transkribus, a sophisticated, machine-learning-based platform and online text repository.


Automatic transcription of Gaelic handwriting using Transkribus

The work began with the Digital Imaging Unit scanning about 2500 pages of handwritten manuscripts from the School of Scottish Studies Archives, supplemented by some additional scanning at the Centre for Research Collections.


Scanning manuscripts at the Centre for Research Collections

Once we received the texts, research assistant Michael Bauer manually transcribed about 18,000 words, which we used to train our first Gaelic handwriting model. This achieved an impressive Character Error Rate (CER) of 2.53% (roughly 97.5% character accuracy), although the model was both trained and tested on a single writer’s hand. We used this model to help transcribe a further 18,000 words and trained a second model. Again, this involved only one hand, but it achieved a CER of 1.90%.
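For readers unfamiliar with the metric, CER is usually defined as the number of character insertions, deletions and substitutions needed to turn the recogniser's output into the manual transcription, divided by the length of the manual transcription. The sketch below illustrates that general definition only; it is not Transkribus's own implementation, and the function names and the sample strings are made up for illustration. It also assumes that both texts are normalised to Unicode NFC first, so that a precomposed "à" and an "a" followed by a combining grave accent are not counted as different characters.

```python
import unicodedata


def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    # Assumption: normalise to NFC so accented Gaelic vowels compare consistently.
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: the recogniser misreads one letter and drops one accent.
manual = "Tha a' Ghàidhlig beò"
automatic = "Tha a' Ghaidhlig bed"
print(f"CER: {character_error_rate(manual, automatic):.2%}")
```

On this reading, a CER of 2.53% corresponds to roughly two or three character-level corrections per hundred characters of transcribed text.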

Using the updated model, we are moving towards our target of 500,000 words. We have recently focussed our transcription efforts on increasing the number of hands involved, so that our next model is more generalisable and useful. The project will finish in July 2020, when we intend to make the Gaelic handwriting recogniser available to the public through Transkribus.

Michael Bauer cataloguing the manuscripts

Project team

Dr William Lamb (PI): Celtic and Scottish Studies, LLC

Dr Beatrice Alex (Co-I): Edinburgh Futures Institute and LLC

Prof James Loxley (Co-I): English Literature, LLC

Dr Mark Sinclair (Consultant): Centre for Speech Technology Research (CSTR)

Mag Dr Muehlberger (Advisor): Transkribus (Innsbruck University)

Mr Michael Bauer (Research Assistant): Akerbeltz
