The Celtic Linguistics Group at the University of Arizona invited Dr Will Lamb to speak to them about ‘Emerging NLP for Scottish Gaelic’ on 26 March 2021, as part of their Formal Approaches to Celtic Linguistics lecture series. The talk went out on Zoom and was recorded and uploaded to YouTube (provided below). About 43 minutes into the video, there is a short demonstration of the prototype ASR system as it stood at the time. Since then, we have improved the system further, incorporating enhanced acoustic and language models and a post-processing stage that re-inserts much of the punctuation into the output.
Since September 2020, a collaborative team from the University of Edinburgh (UoE), the University of the Highlands and Islands (UHI), and Quorate Technology has been working towards building an Automatic Speech Recognition (ASR) system for Scottish Gaelic: a system that can automatically transcribe Gaelic speech into writing.
The applications for a Gaelic ASR system are vast, as demonstrated by those already in use for other languages, such as English. Examples of applications include voice assistants (Alexa, Siri), video subtitling, automatic transcription, and so on. Our goal for this project is to build a full working system for Gaelic in order to facilitate these types of use-cases. In the long term, for example, we hope to enable the automatic generation of transcripts and/or subtitles for pre-existing Gaelic recordings and videos. This would add value to these resources by rendering them searchable by word or topic. In this blog post, we describe our progress so far.
Data and Resources
There are three main components needed to construct a full ASR system: the lexicon, which maps words to their component phonemes (e.g. hello = hh ah l ow); the language model, which identifies likely sequences of words in the target language; and the acoustic model, which learns to recognise the component phonemes making up a segment of speech. Together, these three components enable the ASR system to pick up on a sequence of phonemes in the input speech, map those phonemes to written words, and output a full predicted transcription of the recording.
[Figure: the components in action – the language model predicts likely word sequences (e.g. “The United States of <?>”), while the acoustic model maps audio of a speaker saying “Good Morning” to the phone sequence g uh d m ao r n ih ng.]
Of course, building these components requires resources. In terms of the lexicon, we are fortunate enough to have this resource already available to us. Am Faclair Beag is a digital Gaelic dictionary, developed by Michael Bauer, which includes phonetic transcriptions for over 30,000 Gaelic words. We simply pulled each word and pronunciation from this dictionary and combined them into a list to serve as our initial lexicon.
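Assembling the lexicon can be sketched in a few lines. This is an illustrative sketch only: the tab-separated input format and the `build_lexicon` name are assumptions, and the actual export from Am Faclair Beag may look quite different.

```python
def build_lexicon(lines):
    """Parse 'word<TAB>pronunciation' lines into a pronunciation lexicon.

    The tab-separated format is assumed for illustration; the real
    dictionary export could use any delimiter.
    """
    lexicon = {}
    for line in lines:
        word, _, pron = line.strip().partition("\t")
        if word and pron:
            # a word may have more than one attested pronunciation
            lexicon.setdefault(word, []).append(pron.split())
    return lexicon
```

For instance, the example entry from above, `"hello\thh ah l ow"`, would map `hello` to the phone list `["hh", "ah", "l", "ow"]`.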
For training our language model (LM), we required a large corpus of Gaelic text. An LM counts occurrences of every 4-word sequence present in this text corpus, so as to learn which phrases are common in Gaelic. The following resources were drawn upon to build it:
The gd Corpus, which is a web-scraped text corpus assembled as part of the An Crúbadán project. This project aims to build corpora and other language technology resources for minority languages
Tobar an Dualchais/Kist o Riches, a collaborative project which aims to “preserve, digitise, catalogue and make available online several thousand hours of Gaelic and Scots recordings”. They supplied several hundred transcriptions of archive material from the School of Scottish Studies Archives
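The 4-gram counting at the heart of the LM can be sketched as a sliding window over the token stream. This is a minimal illustration; a real LM toolkit also normalises the counts into probabilities and applies smoothing and back-off, which this omits.

```python
from collections import Counter

def count_ngrams(tokens, n=4):
    """Count every n-word sequence in a token stream."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
```

A five-word sentence thus yields two distinct 4-grams, one starting at each of the first two words.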
Finally, for training the acoustic model, we required a large number of speech recordings along with their corresponding transcriptions. This is so that the model can learn (with help from the lexicon) how the different speech sounds map to written words. We used recordings and transcriptions from the following sources to construct this dataset:
The School of Scottish Studies Archives (UoE) – see above
Clilstore, an educational website that provides Gaelic language videos at various different CEFR levels
A note on alignment
In order to train our ASR system to map speech sounds to written words, we must time-align each transcription to its corresponding recording. In other words, the transcriptions must be given time-stamps, specifying when each transcribed word occurs in the recording.
Time-aligning the transcriptions manually is lengthy and expensive, so we generally rely on automatic methods. In fact, we use a method very similar to speech recognition to generate these alignments. The issue here is that the automatic aligner also requires time-aligned speech data for training, which we don’t have for Gaelic.
We are fortunate in that we have been able to use a pre-built English speech aligner from Quorate Technology to carry out our Gaelic alignment task. As this was trained on English speech, it may be surprising that it is still effective for aligning our Gaelic data. However, despite noticeable high-level differences between the two languages (words, grammar etc.), the aligner is able to pick up on the lower-level features of speech (pitch, tone etc.), which are global across different languages. This means it can make a good guess at when specific words occur in each recording.
The alignment process – mapping text to audio.
Adapting the Lexicon
1. Mapping from IPA to the Aligner Phoneset
Because we are using a pre-built aligner on our speech data, we must ensure that the set of phones used to phonetically transcribe the words in our lexicon is the same as the set of phones recognised by the aligner’s acoustic model. Our lexicon, from Am Faclair Beag, uses a form of Gaelic-adapted IPA, whereas the Quorate aligner recognises a special, computer-readable set of English phones. For this reason, our first task was to map each phone in the lexicon’s phoneset to its equivalent (or closest) phone used in the aligner’s phoneset.
We first standardised the lexicon phoneset, mapping each specialised Gaelic IPA phone back to its standard IPA equivalent. We next mapped this standard IPA phoneset to ARPABET, an American-English phoneset that is widely used in language technology. This is the foundation of the aligner’s phoneset. We had to draw on our phonetic knowledge of Gaelic to create the mapping from IPA to ARPABET, because the set of phones used in English speech differs from that used in Gaelic: some Gaelic phones do not exist in English. For each additional Gaelic phone, we therefore selected the ARPABET phone that was deemed its ‘closest match’. Take the following Gaelic distinction between a non-aspirated, palatalised ( kʲ ) and non-aspirated non-palatalised ( k ) stop consonant, for example:
Our final mapping was from ARPABET to the aligner’s phoneset. Considering both of these phonesets are based on English, this was a fairly easy process; each ARPABET phone had an exact equivalent in the aligner phoneset. Once we had our final phoneset mapping, we converted all the phonetic transcriptions in the lexicon to their equivalent in the aligner’s phoneset, for example:
Lexicon (Gaelic IPA): ɯ ʃ gʲ ə
Standard IPA: ɯ ʃ kʲ ə
ARPABET: UX SH K AX
Aligner phoneset: uh sh k ax

Lexicon (Gaelic IPA): g ɔ r ɔ m
Standard IPA: k ɔ ɾ ɔ m
ARPABET: K AO DX AO M
Aligner phoneset: k ao r ao m
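The conversion itself then reduces to a lookup over each phone. The table below contains only the handful of entries visible in the examples above, folded into a single IPA-to-aligner step for brevity; the real mapping covered the entire Gaelic phoneset and went via ARPABET as an intermediate stage.

```python
# Partial lookup table assembled from the examples above; illustrative only.
IPA_TO_ALIGNER = {
    "ɯ": "uh", "ʃ": "sh", "kʲ": "k", "k": "k",
    "ə": "ax", "ɔ": "ao", "ɾ": "r", "m": "m",
}

def convert_pronunciation(phones):
    # fall back to the original symbol when no mapping is known yet
    return [IPA_TO_ALIGNER.get(p, p) for p in phones]
```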
2. Adding new pronunciations
For our ASR system to learn to recognise the component phones of spoken words, we need to ensure that every word that appears in our training corpus is included in the lexicon.
Our initial phoneticised lexicon stood at an impressive 30,000 Gaelic words; however, the number of distinct words in our training corpus exceeds 150,000. This leaves some 120,000 missing pronunciations, many of which will simply be morphological variations on the dictionary entries. If our model were to come across any of these words in training, it would be unable to map the acoustics of that word to its component phoneme labels.
The ASR system maps the phones recognised by the acoustic model to words, using the pronunciations in the lexicon.
A solution to this is to train a Grapheme-to-Phoneme (G2P) model, which, given a written word as input, can predict a phonetic transcription for that word, based solely on the letters (graphemes) it contains. For example:
Predicted pronunciations (in the aligner phoneset):
hh uh sh k ih n aa n
k aa el ax k aa n
f uw ax iy m aa en aa n
We trained a G2P model using all the words and pronunciations already in our lexicon. The model learns typical patterns of Gaelic grapheme-to-phoneme mappings using these as examples. Our model achieved a symbol error rate of 3.82%, which equates to an impressive 96.18% accuracy. We subsequently used this model to predict the pronunciation for the 120,000 missing words, and added them to our lexicon.
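Symbol error rate is the edit (Levenshtein) distance between the predicted and reference phone sequences, divided by the number of reference symbols. A minimal sketch of that computation (the helper names are illustrative, not the project's actual evaluation code):

```python
def edit_distance(pred, ref):
    """Levenshtein distance between two phone sequences (dynamic programming)."""
    prev = list(range(len(ref) + 1))
    for i, p in enumerate(pred, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (p != r)))  # substitution / match
        prev = cur
    return prev[-1]

def symbol_error_rate(pairs):
    """pairs: iterable of (predicted_phones, reference_phones) sequences."""
    pairs = list(pairs)
    errors = sum(edit_distance(p, r) for p, r in pairs)
    return errors / sum(len(r) for _, r in pairs)
```

A single substitution in a five-phone reference, for example, gives a symbol error rate of 1/5 = 20%.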
Normalising the Text Corpus
1. Punctuation, Capitalisation, and Other Junk
Our next tasks focused on normalising our text corpus. We want to ensure that any text we input to our language model is free from punctuation and capitalisation, so that the model does not distinguish between, for example, a capitalised and a lowercase word (e.g. ‘Hello’ vs. ‘hello’) when the meaning of these tokens is the same. A simple Python program was written for this purpose which, along with punctuation and capitalisation, also stripped out any junk, such as turn-taking indicators. Here is an example of the program at work:
A’ cur uèirichean ri pluga.
a cur uèirichean ri pluga
An ann ro theth a bha e?
an ann ro theth a bha e
EC―00:05: Dè bha ceàrr air, air obair a’ bhanca?
dè bha ceàrr air air obair a bhanca
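A sketch of this kind of normalisation, modelled on the examples above; the regular expressions (in particular the turn-taking-marker pattern) are assumptions for illustration, not the project's actual program.

```python
import re

def normalise(line):
    # drop turn-taking markers such as "EC―00:05:" (pattern is an assumption
    # based on the example above)
    line = re.sub(r"^[A-Z]+―\d{2}:\d{2}:\s*", "", line)
    line = line.lower()
    # strip punctuation; \w matches accented Gaelic letters by default in Python 3
    line = re.sub(r"[^\w\s-]", "", line)
    return re.sub(r"\s+", " ", line).strip()
```

Run on the third example above, this yields `dè bha ceàrr air air obair a bhanca`.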
2. Digit Verbalisation
Another useful type of text normalisation is the verbalisation of digits. Put simply, this involves converting any digits in our corpus into words, for example ‘42’ -> ‘forty-two’. An easy way of doing this is with a Python tool called num2words. The tool can verbalise digits in numerous languages, but unfortunately does not support Gaelic. For this reason, we coded our own Gaelic digit verbaliser to handle the digits present in our text corpus. As the num2words project welcomes contributions, we also hope to contribute our code, so as to make the tool accessible to others.
Our digit verbaliser is currently functional for the numbers 0-100, and for the years 1100-2099. Also, as Gaelic uses both the decimal (10s) and vigesimal (20s) numbering systems, we ensured that our tool is able to verbalise each digit using either system, as specified by the user. We hope to eventually extend this to a wider range of numbers. The following examples show our digit verbaliser at work:
Uill, tha, tha messages na seachdaine a chaidh agam ri phàigheadh agus bidh e timcheall air mu 80 pounds.
Uill, tha, tha messages na seachdaine a chaidh agam ri phàigheadh agus bidh e timcheall air mu ceithir fichead pounds.
Uill, tha, tha messages na seachdaine a chaidh agam ri phàigheadh agus bidh e timcheall air mu ochdad pounds.
Bha, bha e ann am Poll a’ Charra ann an 1860.
Bha, bha e ann am Poll a’ Charra ann an ochd ceud deug, trì fichead.
Bha, bha e ann am Poll a’ Charra ann an ochd ceud deug ‘s a seasgad.
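The verbaliser's interface can be sketched as a regex substitution over the text, with the numbering system chosen by the caller. The lookup table below contains only the renderings attested in the examples above; the real tool covers 0-100 and the years 1100-2099 programmatically rather than by table.

```python
import re

# Only the values attested in the examples above; illustrative, not exhaustive.
GAELIC_NUMBERS = {
    "vigesimal": {"80": "ceithir fichead", "1860": "ochd ceud deug, trì fichead"},
    "decimal":   {"80": "ochdad",          "1860": "ochd ceud deug 's a seasgad"},
}

def verbalise_digits(text, system="decimal"):
    table = GAELIC_NUMBERS[system]
    # leave any digit sequence we cannot verbalise untouched
    return re.sub(r"\d+", lambda m: table.get(m.group(), m.group()), text)
```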
Current Work and Next Steps
After carrying out all the data and lexicon preparation, we were able to align our Gaelic speech data using Quorate’s English aligner. We have started using this to train our first acoustic models, and will soon be able to build our first full speech recognition system – keep an eye out for our next update!
Automatically subtitled video (using provided script)
Aside from creating acoustic model training data, however, alignment can be useful for other purposes: it enables us to create video subtitles, for example. This use case lets us present our first observable results, which have been extremely encouraging. The videos in the link below exhibit our time-aligned subtitles, generated from simple transcriptions that were originally separate from the videos: click here to see examples of our work so far!
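Once the aligner has attached time-stamps to the transcription, emitting subtitles is mostly a formatting exercise. A sketch of converting aligned segments to the common SRT subtitle format; the `(start, end, text)` segment structure is an assumption, and our pipeline's actual output format may differ.

```python
def srt_timestamp(seconds):
    """Format seconds as the SRT 'HH:MM:SS,mmm' timestamp."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    """segments: list of (start_sec, end_sec, text) tuples from the aligner."""
    blocks = [f"{i}\n{srt_timestamp(a)} --> {srt_timestamp(b)}\n{text}"
              for i, (a, b, text) in enumerate(segments, 1)]
    return "\n\n".join(blocks) + "\n"
```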
English speakers never have to worry about grammatical gender – nouns are just nouns. When I began learning Gaelic in my early twenties, getting to grips with grammatical gender was a challenge. Until I learnt some of the patterns intuitively, I had to look up every new noun in the dictionary to determine its gender, and add this information to my stack of flash cards. As it happens, computers also struggle with identifying gender. When we built the first part-of-speech tagger for Gaelic a few years ago, gender was one of the things that our statistically-based model often got wrong.
Some grammars supply a list of suffixes that are typically feminine or masculine, and these can be helpful to new students of Gaelic. For instance, once you know that just about all nouns ending in -chd are feminine, you can take a new noun with that suffix and be relatively confident about how to use it. In the grammar at the end of my 2008 book, Scottish Gaelic Speech and Writing, I list (pp. 206-207) the suffixes provided by Calder (1923: 76-77):
Masc: -adh, -an/-ean, -as, -ach, -aiche, and -air.
But how reliable are these endings for predicting gender? And what proportion of nouns ending in a particular suffix takes the expected gender? Furthermore, are there any Gaelic suffixes that are useful for predicting gender that Gaelic grammarians haven’t noticed already?
Over the last few months, I’ve been assembling some code and resources that will allow me to do some new research on Gaelic grammar. Thanks to Michael Bauer of Am Faclair Beag, I have a large list of Gaelic words accompanied by useful lexicographical info. Last night, I wondered how well a machine learning algorithm could model the relationships between Gaelic orthography and gender.
I began by extracting all the nouns from the lexicon (17,207 nouns in total) along with their gender, and put them into a Python list of tuples, like this:
Then I randomised the noun list – which simply included the root form and its gender – and divided it into a training set (90%) and a test set (10%). I defined three features: 1) the last letter; 2) the last two letters; and 3) the last three letters. I then built a model using a Naive Bayes classifier from the Python package NLTK.
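The steps above can be sketched as follows. The four toy nouns are illustrative stand-ins (the real experiment used all 17,207 nouns from the lexicon), and the NLTK training calls are shown in comments for readers who have the package installed.

```python
import random

def gender_features(noun):
    # the three features described in the post: last one, two, and three letters
    return {"suffix1": noun[-1:], "suffix2": noun[-2:], "suffix3": noun[-3:]}

# toy stand-in data; the real list held 17,207 (noun, gender) tuples
nouns = [("caileag", "f"), ("obair", "f"), ("balach", "m"), ("doras", "m")]
random.shuffle(nouns)
split = int(len(nouns) * 0.9)
train, test = nouns[:split], nouns[split:]

# with NLTK installed, training then mirrors the post:
#   import nltk
#   train_set = [(gender_features(n), g) for (n, g) in train]
#   classifier = nltk.NaiveBayesClassifier.train(train_set)
#   nltk.classify.accuracy(classifier,
#                          [(gender_features(n), g) for (n, g) in test])
```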
When applied to the test set, the model was about 83% accurate. So, knowing the ending of a Gaelic noun can definitely help if you are trying to determine its gender. Putting this in a POS-tagging context: if your tagger can’t guess the gender of a word because it hasn’t seen the word before, you could use a model like this to hazard a guess and be accurate most of the time.
Calling up the most informative features from the model confirmed many expectations, but also revealed some patterns that I didn’t expect. These are the 30 most informative features of the model – all with ratios of 10:1 or more (i.e. these endings are at least 10 times more likely to be one gender than the other):
suffix2 = 'ag' f : m = 84.9 : 1.0
suffix3 = 'eag' f : m = 72.9 : 1.0
suffix3 = 'adh' m : f = 72.6 : 1.0
suffix3 = 'has' m : f = 70.6 : 1.0
suffix3 = 'nag' f : m = 54.2 : 1.0
suffix3 = 'tag' f : m = 40.2 : 1.0
suffix3 = 'rag' f : m = 37.1 : 1.0
suffix3 = 'eid' f : m = 36.3 : 1.0
suffix3 = 'gan' m : f = 34.3 : 1.0
suffix2 = 'an' m : f = 24.7 : 1.0
suffix3 = 'ear' m : f = 24.7 : 1.0
suffix3 = 'lag' f : m = 21.9 : 1.0
suffix3 = 'chd' f : m = 21.6 : 1.0
suffix2 = 'hd' f : m = 21.0 : 1.0
suffix2 = 'on' m : f = 20.3 : 1.0
suffix3 = 'ilt' f : m = 19.9 : 1.0
suffix2 = 'ar' m : f = 17.6 : 1.0
suffix3 = 'ing' f : m = 16.1 : 1.0
suffix3 = 'tan' m : f = 15.2 : 1.0
suffix3 = 'ait' f : m = 14.7 : 1.0
suffix3 = 'ean' m : f = 14.3 : 1.0
suffix2 = 'as' m : f = 14.2 : 1.0
suffix3 = 'oil' f : m = 14.2 : 1.0
suffix3 = 'lan' m : f = 13.7 : 1.0
suffix3 = 'ith' f : m = 12.7 : 1.0
suffix2 = 'am' m : f = 12.2 : 1.0
suffix3 = 'tar' m : f = 11.3 : 1.0
suffix3 = 'ram' m : f = 11.0 : 1.0
suffix2 = 'al' m : f = 10.9 : 1.0
suffix3 = 'oin' f : m = 10.8 : 1.0
It is easier to view these in bar plots. Here are the 14 most typically feminine suffixes (the y-axis shows the ratio ‘x:1’):
And here are the 15 most typically masculine ones:
There is clearly some crossover here (e.g. –eag, -nag, -tag, -rag and -lag are all forms of the feminine diminutive suffix –ag), so the model could be better specified. But these results tell us that gender in Gaelic is well encoded in the suffix. For example, if you see a noun that ends with -adh, it is about 73 times more likely to be masculine than feminine. Indeed, there are very few feminine nouns in the lexicon that end with -adh:
>>> [noun for (noun,gender) in nounslow if noun.endswith('adh') and gender == 'f']
['cneadh', 'dearg-chriadh', 'leasradh', 'stuadh', 'speireag-ruadh', 'muirgheadh', 'buadh', 'riadh', 'criadh', 'ealadh', 'pìob-chriadh', 'ceòlradh', 'roinn-phàigheadh']
These results can be generalised as:
Masc nouns tend to end in: -adh, -as, -an, -ar, -am, -al, and broad consonants or clusters, except for a vowel + –g
Fem nouns tend to end in: –ag, -chd and slender consonants or clusters (e.g. –ilt, -ing, -in, -il)
As any intermediate Gaelic learner knows, a good rule to follow is that masc nouns end broad and feminine nouns end slender. But are there any exceptions? Well, we already saw that nouns ending in -ag and -chd are largely feminine. Are there any others? Digging a little deeper into the model, we find the following (ratios rounded to whole numbers):
Fem: –ìob (7:1), -ng (7:1), –lb (6:1)
Masc: –che (6:1)
Calder had the last one already (e.g. fulangaiche), but he didn’t notice that combinations of a sonorant (l, n, r) and the non-aspirated stops (b and g) tend to be feminine – words like fang and sgealb.
With the model, we can check what it would make of a made-up word — how it might classify a nonce word or unusual dialectal form, for instance — if we used it as part of a part-of-speech tagger:
All in all, this is useful and — if you are a language geek — pretty interesting stuff. What is exciting about doing NLP with Gaelic is that, while this type of work is old hat for many languages now, it is brand new for Scottish Gaelic.
So, by generating this model and testing it on a held-out 10% of the nouns in the lexicon, we have shown that it confirms certain expectations, uncovers some unexpected patterns in the language, allows us to quantify relationships, and provides a pragmatic solution to the problem of guessing the gender of unknown words in an NLP pipeline.