Gaelic Algorithmic Research Group

Rannsachadh digiteach air a' Ghàidhlig ~ Goireasan digiteach airson nan Gàidheal


Agallamh le Roibeart MacThòmais / An interview with Robert Thomas

Anns an t-sreath seo, tha sinn a’ toirt sùil air laoich a rinn adhartas cudromach ann an teicneolas nan cànanan Gàidhealach. Airson a’ cheathramh agallaimh, cluinnidh sinn bho Roibeart MacThòmais. Coltach ri Lucy Evans, tha Rob air ùr thighinn gu saoghal na Gàidhlig. Chaidh fhastadh airson còig mìosan ann an 2021 mar phàirt de phròiseact a mhaoinich Data-Driven Innovations (DDI), far an robh an sgioba a’ cruthachadh teicneolas aithneachadh labhairt airson na Gàidhlig. Dh’obraich Rob air inneal coimpiutaireachd ùr-nòsach eile, An Gocair.

Nuair a bhios tu a’ feuchainn ri teicneòlas cànain a chruthachadh airson mhion-chànain, ’s e an trioblaid as bunasaiche ach dìth dàta. Chan eil an suidheachadh a thaobh na Gàidhlig buileach cho truagh ri cuid a mhion-chànanan eile, ach tha deagh chuid dhen dàta seann-fhasanta a thaobh dhòighean-sgrìobhaidh. Tha sin a’ fàgail nach gabh e cleachdadh gus modailean Artificial Intelligence a thrèanadh gun a bhith a’ cosg airgead mòr air ath-litreachadh.

Bidh An Gocair ag ath-litreachadh theacsaichean gu fèin-obrachail – tha e glè choltach ri dearbhadair-litrichidh. Chan eil ann ach ro-shamhla (prototype) an-dràsta agus tha sinn a’ sireadh taic a bharrachd airson a leasachadh. Aon uair ’s gum bi e deiseil, b’ urrainnear a chur gu feum ann an iomadach suidheachadh, leithid foillseachadh, foghlam aig gach ìre, prògraman coimpiutaireachd eile agus rannsachadh sgoileireil. Cuiridh e gu mòr cuideachd ri pròiseact rannsachaidh ùr a tha a’ tòiseachadh an-dràsta eadar còig oilthighean ann am Breatainn, Ameireaga agus Èirinn: ‘Decoding Hidden Heritages in Gaelic Traditional Narrative with Text-mining and Phylogenetics’.

In this interview series, we are looking at individuals who have significantly advanced the field of Gaelic, Irish and Manx language technology. For the fourth interview, we hear from Mr Rob Thomas. Like Lucy Evans, whom we interviewed a few months ago, Rob has come to the world of Gaelic language technology only recently. He was chosen from a strong field to work with us on a project funded by Data-Driven Innovations (DDI), in which we were developing the world’s first automatic speech recogniser for Scottish Gaelic. Rob worked on an important strand of this project: developing a brand-new piece of software called An Gocair.

When trying to develop language technology for minority languages, the most fundamental problem is data sparsity. The situation for Gaelic is not as dire as for some other minority languages, but much of the textual data available is outdated in terms of orthography. That makes it impossible to train machine learning models – at least without spending a lot of money on editing spelling.

An Gocair re-spells texts automatically – it’s basically an unsupervised spell-checker with some extra bells and whistles. It is currently only a prototype, however, and we are seeking additional support for its development. Once completed, it will be able to be used in a wide range of contexts, including publishing, education at all levels, as part of other computer programs and within academic research. It will also make a significant contribution to a new research project currently underway between five universities in Britain, America and Ireland: ‘Decoding Hidden Heritages in Gaelic Traditional Narrative with Text-mining and Phylogenetics’.

Interview with Rob Thomas

Agallamh le Roibeart MacThòmais

Tell us a little bit about your background. For instance, where are you from, and what got you into language technology work?

Hello! I’m from a small town in South Wales called Monmouth. I grew up mostly in the countryside, quite far from civilisation. My interest in linguistics probably stems from having a fantastic English teacher in my high school. (Shout out to Mr Jones.) I don’t know if it was the content or how he taught it, but I remember at the time really enjoying the subject and his lessons.

Rob Thomas

I went on to study English Language and Linguistics at the University of Portsmouth. After graduating, I worked for a while at Marks and Spencer, as I was not yet sure what kind of career I was looking for. Still somewhat directionless, I spent a year and a bit travelling, and on my return began working in tech support. I then found a course in Language Technology at the University of Gothenburg; I had recently developed an interest in programming, and this was a great way to merge that new interest with my academic foundation. After a few years living, studying and working in Sweden, I returned to the UK, began the job hunt, and was lucky to find the position at the University of Edinburgh.

You mention studying language technology at the University of Gothenburg. What did you find most interesting about the course? Do you have any advice for someone who is thinking about studying language technology?

The course was fascinating, and it attracted students from quite a broad range of backgrounds. The first meeting was like The Time Machine by H. G. Wells: we were all introduced as the linguist, the mathematician, the cognitive scientist, the computer scientist, the philosopher, and so on. I think what stood out is that language technology, as a field, relies on input and experience from a multitude of academic backgrounds. This is due to the complex nature of language. I would advise anyone who is not from a technical or STEM background to think about how important your knowledge and perspective is for the future of language-based AIs, systems and services. But if, like me, you do come from a humanities background, be prepared to dive straight back into the maths that you thought you had managed to escape after you completed your GCSEs.

You are developing a tool for Scottish Gaelic that automatically corrects misspelled words and makes text conform to a Gaelic orthographical standard. That’s impressive for someone with Gaelic, and even more so for someone who doesn’t speak it. How did you manage to do this?

I am quite lucky to be supported by Gaelic linguists and other programmers. I found a way to integrate Am Faclair Beag, an online Gaelic dictionary developed by our resident Gaelic domain expert, Michael Bauer. Alongside the dictionary, we translated complicated linguistic rules into something a computer could understand. We have managed to develop a program that takes a text and, line by line, attempts to identify spellings that don’t belong to the modern orthography, searching for the right word in our dictionary. If it has no luck, it then attempts to resolve the issue algorithmically. From the start, I knew it was important to be able to compare the program’s output to work done by Gaelic experts, so that I could see whether I was improving the tool or just breaking it.

An Gocair

In your lifetime, you’ve seen language technology change and permeate how we work and live. What’s been your own experience of the changes it has brought?

It has been very interesting witnessing the exponential growth of language technology in the mainstream. It wasn’t until I studied it that I realised how much it was already embedded in websites and services that I’ve been using for years. The more visible applications, such as smart assistants, are becoming much more normalised in our society. Even my grandma uses her smart assistant to turn on Classic FM and set timers, which I think is really cool. My grandma is pretty tech savvy, to be fair!

With the dominance of world languages in mass media and on the internet, some would say that technology is an existential threat to minority languages like Gaelic and Welsh. What do you think about this? Are there ways for minority languages to survive or even thrive today?

I think one of the issues in language technology is that most of the work is dedicated to languages that already have huge amounts of resources, for example English. Most of the breakthroughs are being made by large companies that ultimately aim to increase the value of their services. There are a lot of companies that sell language technology as a service (e.g. machine translation) rather than serving communities per se. The latter may not have direct monetary value, but it’s essential to keep that focus in order to allow minority languages to gain access to state-of-the-art technology.

What are your predictions for language technology in the year 2050? If you had your own way, what would you like to see by that time?

I imagine smart assistants will be present in more spaces in society, perhaps even in a more official capacity. The county council in Monmouthshire already uses a smart chatbot for questions about which days your bins are collected. Imagine if they were given greater powers, such as being able to make important decisions (scary thought). The more time goes on, the more I think we are going to end up with malevolent AIs like HAL from 2001: A Space Odyssey rather than ones like C-3PO from Star Wars.

I’m not sure what I would like to see. It would be nice if there were more community-developed, open-source alternatives to what the main large tech companies provide, so that consumers could be sure their data was being used in a safe and respectful way.

New AHRC-funded project on Gaelic & Irish folktales and the Digital Humanities

Decoding Hidden Heritages in Gaelic Traditional Narrative with Text-Mining and Phylogenetics

This exciting new three-year study is funded by the AHRC and IRC jointly under the UK–Ireland collaboration in digital humanities programme. It brings together five international universities, two folklore archives and two online folklore portals.

October 2021–Sept 2024

‘Morraha’ by John Batten. From Celtic Fairy Tales (Jacobs 1895)

Summary

This project will fuse deep, qualitative analysis with cutting-edge computational methodologies to decode, interpret and curate the hidden heritages of Gaelic traditional narrative. In doing so, it will provide the most detailed account to date of convergence and divergence in the narrative traditions of Scotland and Ireland and, by extension, a novel understanding of their joint cultural history. Leveraging recent advances in Natural Language Processing, the consortium will digitise, convert and help to disseminate a vast corpus of folklore manuscripts in Irish and Scottish Gaelic.

The project team will create, analyse and disseminate a large text corpus of folktales from the Tale Archive of the School of Scottish Studies Archives and from the Main Manuscript Collection of the Irish National Folklore Collection. The creation of this corpus will involve the scanning of c.80k manuscript pages (and will also include pages scanned by the Dúchas digitisation project), the recognition of handwritten text on these pages (as well as some audio material in Scotland), the normalisation of non-standard text, and the machine translation of Scottish Gaelic into Irish. The corpus will then be annotated with document-level and motif-level metadata.

Analysis of the corpus will be carried out using data mining and phylogenetic techniques. Both the data mining and phylogenetic workstreams will encompass the entire corpus; however, the phylogenetic workstream will also focus on three folktale types as case studies, namely Aarne–Thompson–Uther (ATU) 400 ‘The Search for the Lost Wife’, ATU 425 ‘The Search for the Lost Husband’, and ATU 503 ‘The Gifts of the Little People’. The results of these analyses will be published in a series of articles and in a book entitled Digital Folkloristics. The corpus will be disseminated via Dúchas and Tobar an Dualchais, and via a new aggregator website (under construction) that will include map and graph visualisations of corpus data and of the results of our analysis.

Project team

UK

  • Principal Investigator Dr William Lamb, The University of Edinburgh (School of Literatures, Languages and Cultures)
  • Co-Investigator Prof. Jamshid Tehrani, Durham University (Department of Anthropology)
  • Co-Investigator Dr Beatrice Alex, The University of Edinburgh (School of Literatures, Languages and Cultures)

Ireland

  • Co-Principal Investigator Dr Brian Ó Raghallaigh, Dublin City University (Fiontar & Scoil na Gaeilge)
  • Co-Investigator Dr Críostóir Mac Cárthaigh, University College Dublin (National Folklore Collection)
  • Co-Investigator Dr Barbara Hillers, Indiana University (Folklore and Ethnomusicology)

Contact

 

‘An Gocair’: Gaelic Normalisation at a Click

By Rob Thomas

While some of our research group have been busy creating the world’s first Scottish Gaelic speech recognition system, others have been creating the world’s first Scottish Gaelic text normaliser. Although it might not turn the heads of AI enthusiasts and smart-device lovers in the same way, the normaliser is an invaluable tool for unlocking historical Gaelic, enhancing its use for machine learning and giving people a way to correct Gaelic spelling with no hassle.

Rob Thomas

Why do we need a Gaelic text normaliser? Well, this program takes pre-standardised texts, which can vary in their orthography, and rewrites them in the modern Gaelic Orthographic Conventions (GOC). GOC is a document published by the SQA which details the modern standards for writing in Gaelic. Text normalisation is an important step in text pre-processing for machine learning applications. It’s also useful when reprinting older texts for modern readers, or if you just want to quickly spellcheck something in Gaelic.

I joined the project towards the end and have been hard at work trying to understand Gaelic orthography, how it has developed over the centuries, and what is possible with regard to automated normalisation. I have been working alongside Michael ‘Akerbeltz’ Bauer, a Gaelic linguist with extensive credentials. He has literally written the dictionary on Gaelic, as well as a book on Gaelic phonology: it is safe to say I am in good hands. We have been working together to find a way of teaching a program exactly how to normalise Gaelic text. Whereas a human can explain why a word should be spelt a specific way, programming this takes quite a bit of figuring out.

An early ancestor to Scottish Gaelic (Archaic Irish) was written in Ogham, and interestingly enough was carved vertically into stone.

Luckily, historical text normalisation is a well-trodden path, and there are plenty of papers and theses online to help. In her thesis, Eva Pettersson describes four main methods for normalising text and, inspired by these, we got started. The first method relies on possessing an extensive lexicon of the target language, which we happen to have, thanks to Michael.

Lexicon Based Normalisation

This method relies upon having a large stored lexicon that covers the majority of words in the target language. Using it, you can check whether a word is spelt correctly, is written in a traditional spelling, or contains a mistake.

The advantage of this method is that you do not have to be an expert in the language yourself (lucky for me!). Our first step was finding a way to integrate the world’s most comprehensive digital Scottish Gaelic dictionary, Am Faclair Beag. The dictionary contains traditional and misspelt words mapped to their correct spellings. This means the program can go through a text and swap in the correct spelling whenever it identifies a word that needs correcting.

The table above shows some modern words with pre-GOC variants or misspellings. Michael has been collecting Gaelic words and their spelling variants for decades. If our program finds a word that is ‘out of dictionary’, we pass it on to the next stage of normalisation, which involves hand-crafted linguistic rules.
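The lookup-and-swap stage can be sketched as a simple dictionary substitution. The two mappings below are illustrative pre-GOC variants; the real tool draws its mappings from Am Faclair Beag:

```python
# A minimal sketch of lexicon-based normalisation. The mappings here are
# two well-known pre-GOC spellings, shown for illustration only.
LEXICON = {
    "tigh": "taigh",  # pre-GOC spelling of 'house'
    "so": "seo",      # pre-GOC spelling of 'this/here'
}

def normalise_token(token: str, lexicon: dict) -> str:
    """Swap in the modern spelling if the token is a known variant;
    otherwise pass it through to the later, rule-based stages."""
    return lexicon.get(token, token)

def normalise_line(line: str, lexicon: dict) -> str:
    return " ".join(normalise_token(t, lexicon) for t in line.split())
```

Out-of-dictionary tokens fall through unchanged, which is what lets the later rule-based stages pick them up.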

‘An Gocair’

Rule-based Text Normalisation

Once we have filtered out all of the words that can be handled by our lexicon alone, we try to make use of linguistic rules. It’s not always easy to program a rule so that a computer can understand it. For example, we all know the English rule ‘i before e except after c’ (which is, of course, famously inconsistent in English). We can program this by having the computer catch every ‘i’ before an ‘e’ and check that the pair doesn’t come after a ‘c’.
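As a toy illustration, that English rule can be encoded with a regular-expression lookbehind (a hypothetical helper for the English example, not part of An Gocair itself):

```python
import re

# Flag "ei" sequences that do not follow a "c" -- a toy encoding of the
# English "i before e except after c" rule.
PATTERN = re.compile(r"(?<!c)ei")

def violates_rule(word: str) -> bool:
    """Return True if the word contains 'ei' not preceded by 'c'."""
    return bool(PATTERN.search(word))
```

Of course, English words like "weird" show why the rule itself is unreliable, even when the check is programmed correctly.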

With guidance from Michael, we went about identifying rules in Gaelic that can be intuitively programmed. One common feature of traditional Gaelic is the replacement of vowels with apostrophes at the end of words if the following word begins with a vowel. This is called elision and reflects the fact that, in speech, one wouldn’t pronounce both vowels: the writer is simply writing as they would speak. For example, native Gaelic speakers wouldn’t say is e an cù a tha ann ‘it is the dog’: they would say ’s e ’n cù a th’ ann, dropping three vowels. But in writing, we want these vowels to appear – at least for most machine learning situations.

It is not always straightforward to work out which vowel an apostrophe replaces, but we can use a rule to help us. Gaelic vowels come in two categories: broad (a, o, u) and slender (e, i). In writing, vowels conform to the ‘broad to broad and slender to slender’ rule, so when reinstating a vowel at the end of a word, we check the first vowel to the left of the apostrophe and add a vowel of the same category.
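A minimal sketch of this vowel-restoration rule, assuming ‘a’ and ‘e’ as default broad and slender vowels and ignoring the lexicon check that the real tool performs on the result:

```python
BROAD = set("aouàòù")
SLENDER = set("eièì")

def restore_final_vowel(word: str) -> str:
    """Replace a final apostrophe with a vowel that agrees with the
    nearest vowel to its left (broad with broad, slender with slender).
    A simplification: the real tool also validates the resolved form
    against the lexicon."""
    if not word.endswith(("'", "’")):
        return word
    stem = word[:-1]
    for ch in reversed(stem):
        if ch in BROAD:
            return stem + "a"   # broad vowel to the left: add a broad vowel
        if ch in SLENDER:
            return stem + "e"   # slender vowel to the left: e.g. mis' -> mise
    return word  # no vowel found to agree with; leave for later stages
```

Words with no vowel in the stem (like th’) are passed through untouched, since the rule alone cannot decide which vowel to restore.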

Pattern Matching with Regular Expressions

For this method of normalisation, we use regular expressions to catch common cases that require normalisation but are not covered by the lexicon or our previous rules. For example, consider the following instance of hyper-phonetic spelling, where a person writes as they speak:

Tha sgian ann a sheo tha mis’ a’ toir dhu’-sa.

Here, the word mis’ ends in an apostrophe because the following word begins with a vowel. GOC suggests that we restore the final vowel. To do so, we are helped by the regularity of Gaelic orthography: a form of vowel harmony whereby each consonant must be surrounded either by slender letters (e, i) or broad letters (a, o, u). So in the example above, we need to make sure the final vowel of mis’ is slender (mise), because the first vowel to its left is also slender. We have managed to program this and, using a nifty algorithm, we can then decipher what the correct word should be. When the word is resolved, we check whether the resolved form is in the lexicon; if it is, we save it and move on to the next word.
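The pattern-matching step can be sketched with a regular expression that flags an apostrophe-final token followed by a vowel-initial word. This is an illustrative pattern only, far simpler than the tool’s real rule set:

```python
import re

# Catch a token ending in an apostrophe when the next word begins with a
# vowel, as in "mis' a'" -- the hyper-phonetic spelling discussed above.
ELIDED = re.compile(r"\b(\w+)['’](?=\s+[aeiouàèìòù])", re.IGNORECASE)

def find_elided(sentence: str) -> list:
    """Return the stems of apostrophe-final tokens that precede a
    vowel-initial word, i.e. candidates for vowel restoration."""
    return ELIDED.findall(sentence)
```

Each returned stem would then be handed to the vowel-restoration rule and checked against the lexicon.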

Evaluation

Now you might be wondering how I managed to learn Scottish Gaelic so comprehensively in five months that I was able to write a program that corrects spelling and also confirm that it is working properly. Well, I didn’t. From the start of the task, I knew there was no way I would be able to gain enough knowledge about the language that I could confidently assess how well the tool was performing. Luckily I did have a large amount of text that was corrected by hand, thanks to Michael’s hard work.

To verify that the tool is working, I had to write some code that automatically compares the output of the tool to the gold standard that Michael created, and then provides me with useful metrics. Eva Pettersson describes in her thesis on historical text normalisation two such metrics: error reduction and accuracy. Error reduction gives the percentage of errors in a text that are successfully corrected, using the following formula:

error reduction = (errors before normalisation − errors remaining after normalisation) / errors before normalisation × 100

Accuracy simply measures the number of words in the gold-standard text that have an identical spelling in the normalised version. Below you can see the results of normalisation on a test set of sentences. The green line shows the percentage of errors that are corrected, whilst the red and blue lines show the accuracy before and after normalisation, respectively. As you can see, the normaliser successfully improves the accuracy, sometimes even to 100%.
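The two metrics can be sketched as follows, assuming the gold, original and normalised texts are token-aligned:

```python
def accuracy(gold: list, system: list) -> float:
    """Proportion of gold-standard tokens whose spelling is identical
    in the system output (assumes the two lists are aligned)."""
    matches = sum(g == s for g, s in zip(gold, system))
    return matches / len(gold)

def error_reduction(gold: list, original: list, normalised: list) -> float:
    """Share of the original errors that normalisation corrected."""
    before = sum(g != o for g, o in zip(gold, original))
    after = sum(g != n for g, n in zip(gold, normalised))
    return (before - after) / before if before else 0.0
```

For instance, if a line held two spelling errors and the normaliser fixed one of them, error reduction would be 0.5 even though accuracy is still below 100%.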

From GOC to ‘An Gocair’

In a play on words on GOC, we have named the program An Gocair (‘The Un-hooker’). We have tried to make it as easy as possible to update it with new rules, and we hope to have the opportunity to create more ourselves in the future. The program will also improve with the next iteration of Michael’s fabulous dictionary. We hope to release the first version of An Gocair to the world by the end of October 2021. Keep posted!

Acknowledgement

This program was funded by the Data-Driven Innovation initiative (DDI), delivered by the University of Edinburgh and Heriot-Watt University for the Edinburgh and South East Scotland City Region Deal. DDI is an innovation network helping organisations tackle challenges for industry and society by doing data right to support Edinburgh in its ambition to become the data capital of Europe. The project was delivered by the Edinburgh Futures Institute (EFI), one of five DDI innovation hubs which collaborates with industry, government and communities to build a challenge-led and data-rich portfolio of activity that has an enduring impact.

References

Pettersson, E. (2016). Spelling Normalisation and Linguistic Analysis of Historical Text for Information Extraction. Doctoral thesis, Uppsala University.

The Acoustic Model and Scottish Gaelic Speech Recognition Results

By Lucy Evans

In our last blog post, we outlined some of the data preparation that is necessary to train the acoustic model for our Scottish Gaelic speech recognition system. This includes normalisation and alignment. Normalisation is where speech transcriptions are stripped of punctuation, casing and any unspoken text. Alignment is where each word in a transcription is stamped with a start and end time to show where it occurs in an audio recording.

After these steps, speech data can be used to train an acoustic model. Once combined with our lexicon and language model (as described in our last blog post), this forms the full speech recognition system. In this blog post, we explain the function of the acoustic model and outline two common forms. We also report on our most recent Gaelic speech recognition results.

The Acoustic Model

The acoustic model is the component of a speech recogniser that recognises short speech sounds. Given an audio input where a speaker says, “She said hello”, for example, the acoustic model will try to predict which phonemes make up that utterance:

Audio Input                     | Acoustic Model Output
Speaker says “She said hello”   | sh iy s eh d hh ah l ow

The acoustic model is able to recognise speech sounds by relying on its component phoneme models. Each phoneme model provides information about the expected range of acoustic features for one particular phoneme in the target language. For example, the ‘sh’ model will capture the typical pitch, energy, or formant structure of the ‘sh’ phoneme. The acoustic model uses the knowledge from these models to recognise the phonemes in an input stream of speech, based on its acoustic features. Combining this prediction with the lexicon, as well as the prediction of the language model, the system can transcribe the input sentence:

ASR System Component(s)       | Output, given a speaker saying “She said hello”
Acoustic Model Prediction     | sh iy s eh d hh ah l ow
+ Lexicon                     | sh iy = she; s eh d = said; hh ah l ow = hello
+ Language Model Prediction   | She said hello

Training the Acoustic Model

In order to train our acoustic model, we feed it a large quantity of recorded speech in the target language. The recordings are split up into sequences of 10 ms ‘chunks’, or frames. Alongside the recordings, we also feed in their corresponding time-aligned transcriptions:

Aligned Gaelic speech

Using the lexicon, the system maps each word in the transcript to its component phonemes. Then, according to the start and end times of that word, it can estimate which phoneme is being pronounced during each 10ms frame where the word is being spoken. By gathering acoustic information from every frame in which each particular phoneme is pronounced, the set of phoneme models can be generated. 
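The frame-labelling step described above can be sketched as follows, under the simplifying assumption that a word’s duration is divided evenly among its phonemes (a real aligner estimates each phoneme boundary from the audio):

```python
FRAME = 0.01  # one frame = 10 ms

def frame_labels(word: str, start: float, end: float, lexicon: dict) -> list:
    """Assign a phoneme label to every 10 ms frame of a time-aligned word.
    Simplification: the word's duration is split evenly across its
    phonemes, whereas a real aligner estimates each boundary."""
    phones = lexicon[word]
    n_frames = int(round((end - start) / FRAME))
    labels = []
    for i in range(n_frames):
        # Map the frame index proportionally onto the phoneme sequence.
        idx = min(i * len(phones) // n_frames, len(phones) - 1)
        labels.append(phones[idx])
    return labels
```

Pooling the acoustic features of all frames that share a label is what gives each phoneme model its training data.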

Training procedure for the Acoustic Model

Types of Acoustic Model: Gaussian Mixture Models vs Deep Neural Networks 

Early acoustic modelling approaches incorporated the Gaussian Mixture Model (GMM) for building phoneme models. This is a generative type of model, meaning that it recognises the phonemes in a spoken utterance by estimating, for every 10ms frame, how likely each phoneme model is to generate that frame. For each frame, the phoneme label of the model with the highest likelihood is output.
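As a toy illustration of this generative scoring, here is a sketch that assumes a single one-dimensional Gaussian per phoneme with made-up parameters; a real GMM mixes several multivariate Gaussians over full acoustic feature vectors:

```python
import math

def gaussian_loglik(x: float, mean: float, var: float) -> float:
    """Log-likelihood of observation x under a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

# Toy one-Gaussian-per-phoneme models over a single acoustic feature.
# The (mean, variance) values are illustrative, not real estimates.
MODELS = {
    "sh": (6.5, 1.0),
    "ah": (1.2, 0.8),
}

def classify_frame(feature: float) -> str:
    """Output the phoneme whose model gives the frame the highest
    likelihood -- the generative decision rule described above."""
    return max(MODELS, key=lambda p: gaussian_loglik(feature, *MODELS[p]))
```

In training, the means and variances would be estimated from all frames labelled with each phoneme.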

More recent, state-of-the-art approaches use the Deep Neural Network (DNN) model. This is a discriminative model. The model directly classifies each input frame of speech with a predicted phoneme label, based on the discriminatory properties of that frame (such as its pitch or formant structure). The outputs of the two models are therefore the same – a sequence of phoneme labels – but generated in different ways. 

The reason that the DNN has overtaken the GMM in speech recognition applications is largely due to its modelling power. DNNs are models with a number of different ‘layers’, and consequently a larger number of parameters. Parameters are variables contained within the model, whose values are estimated from the training data. Put simply, having more parameters enables DNNs to retain much more information about each phoneme than GMMs, and as such, they perform better on speech recognition tasks.

Another key difference between the two types of acoustic model is the training data they require. For GMMs, we can simply input recordings with their time-aligned transcriptions, as we already prepared using Quorate’s English aligner. On the other hand, training the DNN requires that every frame of each recording is classified with its corresponding Gaelic phoneme label. We obtain these labels by training a GMM acoustic model, which, once trained on the Gaelic recordings and time-aligned transcriptions, can be used for forced alignment. During forced alignment, each frame of the speech data is aligned to a ‘gold standard’ phoneme label. This output can then be used to train the DNN model directly.

Speech Recognition Results

Having carried out the training of our GMM and DNN acoustic models, we are now in a position to report our first speech recognition results. We initially trained our models using only the Clilstore data, which amounted to 21 hours of speech training data. Next, we added the Tobar an Dualchais data to our training set, which increased the size of the dataset to 39.9 hours of speech (NB: the texts in this data are transcriptions of traditional narrative from the School of Scottish Studies Archives, made by Tobar an Dualchais staff). Finally, we added data from the School of Scottish Studies Archives via the Automatic Handwriting Recognition Project to train our third, most recent model, on 63.5 hours of speech. 

We evaluated our models on a subset of the Clilstore data, which was excluded from the training data. This evaluation set comprises 54 minutes of speech, from 21 different speakers. Each recording was passed through the speech recogniser to produce a predicted transcription. We then measured the system’s performance using Word Error Rate (WER). The WER value is the proportion of words that the speech recogniser transcribes incorrectly for each input recording. The measure can also be inverted to reflect accuracy. 
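WER itself is straightforward to compute as the Levenshtein (edit) distance between the reference and predicted word sequences, divided by the reference length; a minimal sketch:

```python
def wer(reference: list, hypothesis: list) -> float:
    """Word Error Rate: (substitutions + insertions + deletions)
    divided by the number of reference words, via edit distance."""
    d = [[0] * (len(hypothesis) + 1) for _ in range(len(reference) + 1)]
    for i in range(len(reference) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hypothesis) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(reference) + 1):
        for j in range(1, len(hypothesis) + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[-1][-1] / len(reference)
```

Accuracy, as used in the table below, is simply 1 minus this value.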

As can be seen from the table below, our results have been encouraging, especially considering that DNN models perform best when trained on much larger quantities (100s of hours) of data. We are particularly pleased to report that our latest model passed below 30% WER (i.e. > 70% accuracy), an initial goal of our Gaelic speech recognition project. 

Model | Training Corpus (hours of speech)                    | Word Error Rate (WER) | Accuracy | WER Reduction (from previous model)
A     | Clilstore (21)                                       | 35.8%                 | 64.2%    | –
B     | Clilstore + Tobar an Dualchais (39.9)                | 31.0%                 | 69.0%    | 4.8%
C     | Clilstore + Tobar an Dualchais + Handwriting (63.5)  | 28.2%                 | 71.8%    | 2.8%

To showcase our speech recogniser’s current performance, we have put together some demo videos. These are subtitled with the speech recogniser’s predicted transcription for each video. Please note that the subtitles will have imperfections, given that we are using our speech recogniser (with 71.8% accuracy) to generate them. Take a look by clicking this link!

Demo video screenshot

Next Steps…

With just 2 months left of the project, the countdown is on! We plan to spend this time adding a final dataset to the model’s training data, with the hopes of further reducing the WER of our system. After this, we plan to experiment with speech recognition techniques, such as data augmentation, to maximise the performance of the system on the data we have collected thus far. Make sure to look out for further updates coming soon!

Acknowledgements

With thanks to the Data-Driven Innovation initiative for funding this part of the project within their ‘Building Back Better’ open funding call.

Emerging NLP for Scottish Gaelic: Lecture

The Celtic Linguistics Group at the University of Arizona invited Dr Will Lamb to speak to them about ‘Emerging NLP for Scottish Gaelic’ on 26 March 2021, as part of their Formal Approaches to Celtic Linguistics lecture series. The talk went out on Zoom and was recorded and uploaded to YouTube (provided below). About 43 minutes into the video, there is a short demonstration of the prototype ASR system as it stood at the time. Since then, we have improved the system further, incorporating enhanced acoustic and language models and a post-processing stage that re-inserts much of the punctuation into the output.

 

Automatic Speech Recognition for Scottish Gaelic: Background and Update

By Lucy Evans

Since September 2020, a collaborative team from the University of Edinburgh (UoE), the University of the Highlands and Islands (UHI), and Quorate Technology, has been working towards building an Automatic Speech Recognition (ASR) system for Scottish Gaelic. This is a system that is able to automatically transcribe Gaelic speech into writing.

The applications for a Gaelic ASR system are vast, as demonstrated by those already in use for other languages, such as English. Examples of applications include voice assistants (Alexa, Siri), video subtitling, automatic transcription, and so on. Our goal for this project is to build a full working system for Gaelic in order to facilitate these types of use-cases. In the long term, for example, we hope to enable the automatic generation of transcripts and/or subtitles for pre-existing Gaelic recordings and videos. This would add value to these resources by rendering them searchable by word or topic. In this blog post, we describe our progress so far.

Data and Resources

There are three main components needed to construct a full ASR system: the lexicon, which maps words to their component phonemes (e.g. hello = hh ah l ow); the language model, which identifies likely sequences of words in the target language; and the acoustic model, which learns to recognise the component phonemes making up a segment of speech. Together, these three components enable the ASR system to pick up on a sequence of phonemes in the input speech, map those phonemes to written words, and output a full predicted transcription of the recording.

Component       | Input                                 | Output Prediction
Language Model  | The United States of <?>              | America
Acoustic Model  | Audio (speaker says “Good Morning”)   | g uh d m ao r n ih ng

Of course, building these components requires resources. In terms of the lexicon, we are fortunate enough to have this resource already available to us. Am Faclair Beag is a digital Gaelic dictionary, developed by Michael Bauer, which includes phonetic transcriptions for over 30,000 Gaelic words. We simply pulled each word and pronunciation from this dictionary and combined them into a list to serve as our initial lexicon.
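The extraction step described here amounts to building a simple word-to-pronunciation lookup table. Below is a sketch using the English example entries from this post; the real lexicon holds phonetic transcriptions for over 30,000 Gaelic words:

```python
# Illustrative lexicon entries in the common "word phone phone ..." layout.
RAW = """\
hello hh ah l ow
she sh iy
said s eh d
"""

def load_lexicon(text: str) -> dict:
    """Parse lexicon lines into a mapping from word to phoneme list."""
    lexicon = {}
    for line in text.splitlines():
        word, *phones = line.split()
        lexicon[word] = phones
    return lexicon
```

Both the aligner and the decoder can then look up any word’s pronunciation in constant time.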

For training our language model (LM), we required a large corpus of Gaelic text. An LM counts occurrences of every 4-word sequence present in this text corpus, so as to learn which phrases are common in Gaelic. The following resources were drawn upon to build it:

  • The School of Scottish Studies Archives (UoE), which has provided hundreds of digitised manuscripts (via the earlier project, Building a Handwriting Recogniser for Scottish Gaelic)
  • The gd Corpus, which is a web-scraped text corpus assembled as part of the An Crúbadán project. This project aims to build corpora and other language technology resources for minority languages
  • Tobar an Dualchais/Kist o Riches, a collaborative project which aims to “preserve, digitise, catalogue and make available online several thousand hours of Gaelic and Scots recordings”. They supplied several hundred transcriptions of archive material from the School of Scottish Studies Archives
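The counting step an LM performs can be sketched in a few lines (the corpus here is invented, and a real LM would also apply smoothing, which we omit):

```python
from collections import Counter

def four_gram_counts(tokens):
    """Count every 4-word sequence in a token list."""
    return Counter(tuple(tokens[i:i + 4]) for i in range(len(tokens) - 3))

# Invented toy corpus; the real corpus draws on the sources listed above.
corpus = "tha e ann an dùn èideann tha e ann an glaschu".split()
counts = four_gram_counts(corpus)
print(counts[("tha", "e", "ann", "an")])  # -> 2
```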

Finally, for training the acoustic model, we required a large number of speech recordings along with their corresponding transcriptions. This is so that the model can learn (with help from the lexicon) how the different speech sounds map to written words. We used recordings and transcriptions from the following sources to construct this dataset:

  • The School of Scottish Studies Archives (UoE) – see above
  • Clilstore, an educational website that provides Gaelic language videos at various different CEFR levels

A note on alignment

In order to train our ASR system to map speech sounds to written words, we must time-align each transcription to its corresponding recording. In other words, the transcriptions must be given time-stamps, specifying when each transcribed word occurs in the recording.

Time-aligning the transcriptions manually is lengthy and expensive, so we generally rely on automatic methods. In fact, we use a method very similar to speech recognition to generate these alignments. The issue here is that the automatic aligner also requires time-aligned speech data for training, which we don’t have for Gaelic.

We are fortunate in that we have been able to use a pre-built English speech aligner from Quorate Technology to carry out our Gaelic alignment task. As this aligner was trained on English speech, it may be surprising that it is still effective on our Gaelic data. However, despite the obvious high-level differences between the two languages (words, grammar, etc.), the aligner picks up on lower-level features of speech (pitch, tone, etc.) that are broadly shared across languages. This means it can make a good guess at when specific words occur in each recording.

The alignment process – mapping text to audio.
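Time-aligned output is commonly represented as (word, start, end) tuples; the exact format used by the Quorate aligner is not specified here, so this sketch is illustrative only. It groups aligned words into subtitle-length segments:

```python
def to_segments(aligned_words, max_len=3.0):
    """Group time-aligned (word, start, end) tuples into segments
    no longer than max_len seconds."""
    segments, current = [], []
    for word, start, end in aligned_words:
        if current and end - current[0][1] > max_len:
            segments.append(current)
            current = []
        current.append((word, start, end))
    if current:
        segments.append(current)
    return [(" ".join(w for w, _, _ in seg), seg[0][1], seg[-1][2])
            for seg in segments]

# Hypothetical alignment of a short recording.
aligned = [("madainn", 0.0, 0.5), ("mhath", 0.5, 0.9),
           ("ciamar", 3.2, 3.6), ("a", 3.6, 3.7), ("tha", 3.7, 3.9)]
print(to_segments(aligned))
# -> [('madainn mhath', 0.0, 0.9), ('ciamar a tha', 3.2, 3.9)]
```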

Adapting the Lexicon

1. Mapping from IPA to the Aligner Phoneset

Because we are using a pre-built aligner on our speech data, we must ensure that the set of phones used to phonetically transcribe the words in our lexicon is the same as the set of phones recognised by the aligner’s acoustic model. Our lexicon, from Am Faclair Beag, uses a form of Gaelic-adapted IPA, whereas the Quorate aligner recognises a special, computer-readable set of English phones. For this reason, our first task was to map each phone in the lexicon’s phoneset to its equivalent (or closest) phone used in the aligner’s phoneset.

We first standardised the lexicon phoneset, mapping each specialised Gaelic IPA phone back to its standard IPA equivalent. We next mapped this standard IPA phoneset to ARPABET, an American-English phoneset that is widely used in language technology and forms the foundation of the aligner's phoneset. We had to draw on our phonetic knowledge of Gaelic to create the mapping from IPA to ARPABET, because the set of phones used in English speech differs from that used in Gaelic: some Gaelic phones do not exist in English. For each such Gaelic phone, we therefore selected the ARPABET phone deemed its ‘closest match’. Take, for example, the Gaelic distinction between a non-aspirated, palatalised stop consonant ( kʲ ) and a non-aspirated, non-palatalised one ( k ):

Gaelic IPA          Standard IPA        ARPABET
(Gaelic phoneset)   (global phoneset)   (English phoneset)
g                   k                   K
gʲ                  kʲ                  K

Our final mapping was from ARPABET to the aligner’s phoneset. Considering both of these phonesets are based on English, this was a fairly easy process; each ARPABET phone had an exact equivalent in the aligner phoneset. Once we had our final phoneset mapping, we converted all the phonetic transcriptions in the lexicon to their equivalent in the aligner’s phoneset, for example:

Word    Original (Gaelic IPA)   Standard IPA   ARPABET        Aligner
uisge   ɯ ʃ gʲ ə                ɯ ʃ kʲ ə       UX SH K AX     uh sh k ax
gorm    g ɔ r ɔ m               k ɔ ɾ ɔ m      K AO DX AO M   k ao r ao m
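Once the phoneset mapping is fixed, converting the lexicon is mechanical. A sketch covering only the phones that appear in the two example words above (the real table spans the whole phoneset):

```python
# Subset of the Gaelic IPA -> aligner phone mapping, taken from the
# example rows above; all other phones are omitted for brevity.
IPA_TO_ALIGNER = {
    "ɯ": "uh", "ʃ": "sh", "gʲ": "k", "ə": "ax",
    "g": "k", "ɔ": "ao", "r": "r", "m": "m",
}

def convert(pron):
    """Convert a space-separated Gaelic IPA pronunciation to aligner phones."""
    return " ".join(IPA_TO_ALIGNER[p] for p in pron.split())

print(convert("ɯ ʃ gʲ ə"))   # uisge -> uh sh k ax
print(convert("g ɔ r ɔ m"))  # gorm  -> k ao r ao m
```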

2. Adding new pronunciations

For our ASR system to learn to recognise the component phones of spoken words, we need to ensure that every word that appears in our training corpus is included in the lexicon.

Our initial phoneticised lexicon stood at an impressive 30,000 Gaelic words; however, the number of distinct words in our training corpus exceeds 150,000. This leaves 120,000 missing pronunciations, many of which are simply morphological variants of the dictionary entries. If our model were to come across any of these words in training, it would be unable to map the acoustics of that word to its component phoneme labels.

The ASR system maps the phones recognised by the acoustic model to words, using the pronunciations in the lexicon.

A solution to this is to train a Grapheme-to-Phoneme (G2P) model, which, given a written word as input, can predict a phonetic transcription for that word, based solely on the letters (graphemes) it contains. For example:

Input         Output Prediction
h-uisgeanan   hh uh sh k ih n aa n
galachan      k aa el ax k aa n
fuaimeannan   f uw ax iy m aa en aa n

We trained a G2P model using all the words and pronunciations already in our lexicon. The model learns typical patterns of Gaelic grapheme to phoneme mappings using these as examples. Our model achieved a symbol error rate of 3.82%, which equates to an impressive 96.18% accuracy. We subsequently used this model to predict the pronunciation for the 120,000 missing words, and added them to our lexicon.
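G2P models are trained statistically (toolkits such as Phonetisaurus or Sequitur G2P are common choices; the post does not say which was used here). Purely to illustrate the grapheme-to-phoneme idea, here is a toy rule table applied first-match with the multi-letter grapheme listed first; the rules and phone choices are invented, not the trained model's:

```python
# Invented first-match grapheme rules (multi-letter graphemes first).
RULES = [
    ("ch", ["k"]),
    ("g", ["k"]),
    ("a", ["aa"]),
    ("l", ["l"]),
    ("n", ["n"]),
]

def g2p(word):
    """Map a spelling to phones by applying RULES left to right."""
    phones, i = [], 0
    while i < len(word):
        for graph, ph in RULES:
            if word.startswith(graph, i):
                phones += ph
                i += len(graph)
                break
        else:
            i += 1  # grapheme not covered by the toy rules: skip it
    return phones

print(g2p("galachan"))  # -> ['k', 'aa', 'l', 'aa', 'k', 'aa', 'n']
```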

Text Normalisation

1. Punctuation, Capitalisation, and other Junk

Our next tasks focused on normalising our text corpus. We want to ensure that any text we input to our language model is free from punctuation and capitalisation, so that the model does not distinguish between, for example, a capitalised and lowercase word (e.g. ‘Hello’ vs. ‘hello’), where the meaning of these tokens is actually the same. A simple Python programme was written for this purpose which, along with punctuation and capitalisation, also stripped out any junk, such as turn-taking indicators. Here is an example of the programme at work:

Input                                               Output
A’ cur uèirichean ri pluga.                         a cur uèirichean ri pluga
An ann ro theth a bha e?                            an ann ro theth a bha e
EC―00:05: Dè bha ceàrr air, air obair a’ bhanca?    dè bha ceàrr air air obair a bhanca
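A minimal sketch of such a cleaning function, assuming (as in the third example) that turn-taking markers look like ‘EC―00:05:’; the project's actual programme may differ:

```python
import re

def normalise(line):
    """Lowercase a line and strip punctuation and turn-taking markers."""
    line = re.sub(r"^[A-Z]+―\d{2}:\d{2}:\s*", "", line)  # e.g. "EC―00:05: "
    line = line.lower()
    line = re.sub(r"[.,!?;:\"'’‘]", "", line)  # punctuation, incl. ’
    return re.sub(r"\s+", " ", line).strip()

print(normalise("EC―00:05: Dè bha ceàrr air, air obair a’ bhanca?"))
# -> dè bha ceàrr air air obair a bhanca
```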

2. Digit Verbalisation

Another useful type of text normalisation is the verbalisation of digits. Put simply, this involves converting any digits in our corpus into words, for example, ‘42’ -> ‘forty-two’. An easy way of doing this is with a Python tool called num2words. The tool can verbalise digits in numerous languages, but unfortunately it did not support Gaelic. For this reason, we coded our own Gaelic digit verbaliser to handle the digits present in our text corpus. As the num2words project welcomes contributions, we also hope to contribute our code, so as to make the tool accessible to others.

Our digit verbaliser is currently functional for the numbers 0-100, and for the years 1100-2099. Also, as Gaelic uses both the decimal (10s) and vigesimal (20s) numbering systems, we ensured that our tool is able to verbalise each digit using either system, as specified by the user. We hope to eventually extend this to a wider range of numbers. The following examples show our digit verbaliser at work:

a) Numbers
Original Uill, tha, tha messages na seachdaine a chaidh agam ri phàigheadh agus bidh e timcheall air mu 80 pounds.
Vigesimal Uill, tha, tha messages na seachdaine a chaidh agam ri phàigheadh agus bidh e timcheall air mu ceithir fichead pounds.
Decimal Uill, tha, tha messages na seachdaine a chaidh agam ri phàigheadh agus bidh e timcheall air mu ochdad pounds.
b) Years
Original Bha, bha e ann am Poll a’ Charra ann an 1860.
Vigesimal Bha, bha e ann am Poll a’ Charra ann an ochd ceud deug, trì fichead.
Decimal Bha, bha e ann am Poll a’ Charra ann an ochd ceud deug ‘s a seasgad.
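The tens illustrate the two systems well. The sketch below hard-codes a few forms: those for 60 and 80 are taken from the examples above, while the entries for 20 and 40 are our own additions, unverified against the project's verbaliser (which covers 0-100 and the years 1100-2099):

```python
# Gaelic tens in the two numbering systems (partial, illustrative table).
DECIMAL_TENS = {20: "fichead", 40: "ceathrad", 60: "seasgad", 80: "ochdad"}
VIGESIMAL_TENS = {20: "fichead", 40: "dà fhichead",
                  60: "trì fichead", 80: "ceithir fichead"}

def verbalise_ten(n, system="decimal"):
    """Verbalise a multiple of twenty in the chosen numbering system."""
    table = DECIMAL_TENS if system == "decimal" else VIGESIMAL_TENS
    return table[n]

print(verbalise_ten(80, "vigesimal"))  # -> ceithir fichead
print(verbalise_ten(80, "decimal"))    # -> ochdad
```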

Current Work and Next Steps

After carrying out all the data and lexicon preparation, we were able to align our Gaelic speech data using Quorate’s English aligner. We have started using this to train our first acoustic models, and will soon be able to build our first full speech recognition system – keep an eye out for our next update!

Automatically subtitled video (using provided script)

Aside from creating acoustic model training data, alignment is useful for other purposes: it enables us to create video subtitles, for example. This use case also allows us to present our first observable results, which have been extremely encouraging. The videos linked below exhibit our time-aligned subtitles, generated from a plain transcription that was originally separate from the video: click here to see examples of our work so far!

Agallamh le Lucy Evans / An interview with Lucy Evans

Anns an t-sreath seo, tha sinn a’ toirt sùil air laoich a rinn adhartas cudromach ann an teicneolas nan cànanan Gàidhealach. Airson an treasamh agallaimh, cluinnidh sinn bho thè Lucy Evans. Tha Lucy air ùr thighinn gu saoghal na Gàidhlig agus gu saoghal teicneolas cànain, ach tha i an sàs ann am pròiseact a bhios glè chudromach san àm ri teachd, thathar an dòchas. Chuir i crìoch san Lùnastal 2020 air MSc ann an Pròiseasadh Cànan is Cainnt aig Oilthigh Dhùn Èideann. Goirid an dèidh sin, thòisich i mar phàirt de sgioba rannsachaidh a bhios a’ feuchainn ris a’ chiad aithneachar cainnt a chruthachadh dhan Ghàidhlig. Thòisich am pròiseact san t-Sultain 2020 le maoineachas bho Shoillse, an lìonradh nàiseanta rannsachaidh airson glèidheadh agus ath-bheothachadh na Gàidhlig. Tha am pròiseact rannsachaidh na chom-pàirteachas eadar Oilthigh Dhùn Èideann, Oilthigh na Gàidhealtachd is nan Eilean (OGE) agus Quorate Technology Ltd. Anns a’ phìos seo, innsidh Lucy dhuinn ciamar a ghabh i ùidh anns a’ chuspair agus ciamar a bhios cuideigin aig nach eil ach glè bheag de Ghàidhlig ag obair air pròiseact toinnte mar seo.

In this series, we look at people who have significantly advanced the field of Gaelic, Irish and Manx language technology. For the third interview, we hear from Ms Lucy Evans. Lucy has only recently come to the worlds of Gaelic and language technology, but she is involved in a project that, it is hoped, will come to have great importance in the future. In August 2020, she finished her MSc in Speech and Language Processing at the University of Edinburgh. Shortly after that, she joined a research team that is working to develop the first working speech recogniser for Scottish Gaelic. The project began in September 2020 with funding from Soillse, the national research network for the maintenance and revitalisation of Gaelic language and culture. The research project is a collaboration between the University of the Highlands and Islands, the University of Edinburgh and Quorate Technology. In the interview, Lucy tells us how she took an interest in speech and language technology and how someone who, at present, has little Gaelic is able to work on such a complicated project.

Interview with Lucy Evans

Agallamh le Lucy Evans

“You’ve recently joined the research team developing an automatic speech recogniser for Scottish Gaelic. Tell us a little bit about your background. For example, where are you from, and what got you into language technology work?”

Lucy Evans

I grew up bilingually in Switzerland, speaking English and Italian, before moving to the UK for secondary school. Being bilingual at a young age definitely sparked a curiosity about language, and I went on to study French and Linguistics at the University of Leeds. There, I absolutely loved studying linguistics, so started looking for jobs where I could apply my knowledge from the subject. This led me to discover the field of computational linguistics, and through this I found the MSc in Speech and Language Processing. The MSc encompasses all aspects of language technology, and so was a perfect introduction to the field!

“You’ve just finished the MSc in Speech and Language Processing at the University of Edinburgh. What did you find particularly interesting about the course? Do you have any advice for someone who is thinking about doing it in the future?”

Honestly, I found the whole course really interesting! I was constantly in awe of what I was learning –  the interface between computer science and linguistics is niche, and so the techniques used are really specialised. I just find the ability of computers to pick up on all the complexities of language so interesting.

My advice for anyone taking the MSc in the future is simply to be prepared for a really intense year – you’ll be challenged constantly, not only academically, but with time management too. Having said this, the stress is definitely worth it! The course covers a huge amount of content in such a short period of time, which means you’ll be left with a really strong background in the field. A second piece of advice is to get friendly with your peers – there is such a sense of community within the course, and this is undoubtedly one of the loveliest aspects of the MSc. You’ll also get a huge amount of support from Simon King, the course director – make the most of this. Everyone really is there to help and support you, and there is so much more to the MSc than just the course content.

“For those not involved in speech technology, it might seem incredible that someone without Gaelic could develop a speech recogniser for the language. Can you explain how this is possible? And how is working with a minority language going to be different from working with a large language like English?”

As long as you have the necessary resources, it’s only the computer that has to do the language learning! One of the resources I’m talking about here is the dictionary – which essentially maps any written Gaelic word to its phonetic pronunciation. Using this and some transcribed speech data, we can split the speech into its smaller phonetic units, depending on the words in the transcription. Then we train the speech recogniser to learn what these smaller units generally sound like. When new speech is input to the speech recogniser, it can use this lower-level acoustic knowledge to predict which phones (and consequent words) make up the input speech. In this way, as long as you have appropriate (and high-quality) resources, you don’t actually need to learn the language you’re working on – the computer can do that itself!

Working with a minority language adds a challenge in that we won’t necessarily have these resources available. Luckily, for Scottish Gaelic, a digital dictionary has already been created. But this is definitely not the case for most minority languages, making the task significantly harder for non-native speakers to attempt. Furthermore, good quality, transcribed speech data is generally not so easy to come by in minority languages. In the world of machine learning, the general pattern is that the more data you have, the better your system will be. So, with less data available for these languages, it’s harder to get a better system up and running. But there are many mediating methods we can use to boost the performance of a low-resource system – it’s really about finding what works best for the dataset.

“In your own lifetime, you’ve seen language technology change and permeate how we work and live. What’s been your own experience of the changes that it has brought?”

When I was younger, I used language technology but was never really aware of what was going on in the background. Take something like a sat-nav: this is probably one of the first speech technologies I came across, and I remember just laughing about the robotic quality of the synthesised speech – I had no idea how complex the problem actually is! But the amount this has progressed in the last 10 years is crazy – it’s really impressive to see how far things have come in such a short time. For example, we can now ask a mobile phone any question and have it answer us instantly, in near-perfect speech. Things like predictive text and spell-check are other language technologies that are now so embedded in my day-to-day life that I almost forget the complex things they’re doing behind the scenes.

“What are your predictions for language technology in the year 2050? If you had your own way, what would you like to see by that time?”

This is a tricky question – considering just the changes in my lifetime, who knows where we’ll be in 30 years from now! In an ideal world, I’d love to see language tech being used more to help people and cultures. This project is an example of that – creating modern technology for endangered languages is an important way to revitalise and preserve those languages! Something I’m also really interested in is using technology to help people with speech disorders, which is definitely something that’s gaining momentum at the moment – it’ll be interesting to see how this can be further improved in years to come.


Agallamh le Mìcheal Bauer

Anns an t-sreath seo, tha sinn a’ toirt sùil air laoich a rinn adhartas mòr ann an teicneolas nan cànanan gàidhealach. Airson an dàrna agallaimh, cluinnidh sinn bho fhear a tha cho cudromach san 21mh linn ri Eideard Dwelly: Mìcheal Bauer. Tha Mìcheal aithnichte airson na h-obrach ealanta a rinn e le Uilleam MacDhunnchaidh airson Am Faclair Beag–faclair air loidhne a thòisich e o chionn còrr is 20 bliadhna is e na oileanach aig Oilthigh Dhùn Èideann. Chan b’ urrainn cus a ràdh air cho feumail agus cho cudromach ’s a tha am faclair seo. Ach tha e air a bhith an sàs ann an iomadach pròiseact eile an lùib teicneolas a’ chànain on a thòisich e air AFB, leithid inneal-bruidhinn Gàidhlig agus aithnichear làmh-sgrìobhainn. Tha e air leabhraichean feumail a chur a-mach leithid Blas na Gàidhlig, a tha a’ teagasg fhuaimean na Gàidhlig. A bharrachd, tha fèill mhòr air na sgilean eadar-theangachaidh aige, gu h-àraid ann an riaghaltas agus saoghal a’ ghnìomhachais. Mòran taing do Mhìcheal airson a bhith deònach an t-agallamh seo a dhèanamh.

In this series, we look at heroes of Gaelic, Irish and Manx language technology. For our second interview, we hear from someone who is perhaps as important to the Gaelic world in the 21st century as the famous lexicographer, Edward Dwelly: Michael Bauer. Michael is best known for the work he did with Will Robertson on Am Faclair Beag, the important online Gaelic dictionary that he began when still a student at Edinburgh University, over 20 years ago. But he has been involved in a wide variety of projects connected to Gaelic language technology since then. For instance, he has been instrumental in the recent development of a Gaelic speech synthesiser and handwriting recogniser. He has also produced a number of excellent Gaelic-related books, such as Blas na Gàidhlig, a superb, linguistically informed guide to Gaelic pronunciation. He is also in high demand as a translator, especially in the government and commercial sectors. Many thanks to Michael for taking the time out to do this interview with us.

(NB: We’re presenting some of these interviews in a Gaelic or Irish only format. If required, they can be translated to English using Google Translate.)  

Agallamh le Mìcheal Bauer

Interview with Michael Bauer
“Cò às a tha thu is ciamar a chaidh thu an lùib saoghal na Gàidhlig an toiseach?”

’S ann às a’ Ghearmailt a tha mi, taobh a deas na dùthcha. ’S e co-thuiteamas a thug an-seo mi–bha mi aig Oilthigh LMU mun bhliadhna 1997 agus thachair mi ri cuideigin a bha a’ fuireach faisg air Inbhir Nis. ’S ann air an eadar-lìon a bha sin.

Mìcheal Bauer (Akerbeltz)

Thàinig mi an-seo air saor-làithean fada an uairsin agus rinn mi imrich an ath-bhliadhna an dèidh dha Oilthigh Dhùn Èideann àite a thairgsinn dhomh. ’S e cànanachas agus fòn-eòlas a bha mi a’ dèanamh aig an LMU an uairsin agus bha e ’na rud nàdarra dhomh-sa m’ ainm a chur sìos airson Gàidhlig a bharrachd air cànanachas. Sin mar a thachair.

“Dè thug ort a bhith ag obair le teicneolas a’ chànain? Ciamar a thòisich thu san raon seo?”

Ag innse na fìrinn, co-thuiteamas eile. Cha robh mi cho dèidheil–no math–air teicneolas nuair a bha mi òg. Chan urrainn dhomh spot a ruitheas air sgrìn a phrògramachadh fiù an-diugh agus b’ fheudar dha m’ athair maoidheadh orm an aiste mhòr a nì oileanach sa bhliadhna mu dheireadh san àrd-sgoil a sgrìobhadh air a’ PC seach air clò-sgrìobhadair. Mean air mhean, dh’fhàs mi eòlach air an eadar-lìon is rudan mar sin. Bha mi sa chiad bhliadhna aig Oilthigh Dhùn Èideann nuair a tharraing caraid m’ aire do phròiseact a bha a’ dol aig an àm air an robh Google in Your Language. Chuir Google às dhan phròiseact ud beagan bhliadhnaichean air ais ach fad grunn bhliadhnaichean, b’ urrainn dhut d’ ainm a chur sìos mar eadar-theangadair saor-thoileach agus do chànan a chur air na goireasan a bha fosgailte aca airson eadar-theangachadh, mar an search interface aca. Bha mi air mo bheò-ghlacadh leis an nòisean sin, gun robh e nas fhasa–gu ìre–san t-saoghal digiteach ceàrn a dhèanamh airson cànain bheaga leis gun robh bits agus bytes nas saoire na soidhnichean-rathaid no leabhraichean clò-bhuailte. Agus cha do leig am beò-ghlacadh às mi on àm sin.

Bha mi air mo bheò-ghlacadh leis an nòisean sin, gun robh e nas fhasa–gu ìre–san t-saoghal digiteach ceàrn a dhèanamh airson cànain bheaga leis gun robh bits agus bytes nas saoire na soidhnichean-rathaid no leabhraichean clò-bhuailte. Agus cha do leig am beò-ghlacadh às mi on àm sin.

“Am measg nam pròiseactan teicneolais san robh thu an sàs, cò am fear bu chudromaiche no bu thlachdmhoire a bh’ ann dhut fhèin?”

Am faod mi a dhà dhiubh ainmeachadh? [d. Dall ort!] A’ chiad fhear, sin na gleusan airson teacsadh ro-innseach san robh mi an sàs, predictive texting. Bha mi airson sin a dhèanamh fad bhliadhnaichean on chiad turas a chunnaic mi dè cho luath ’s a bha sgrìobhadh air uidheaman mobile le gleus mar sin, seach a bhith sgrìobhadh rudan litir air litir. Ach cha robh comas prògramachaidh sam bith agam mar a thuirt mi roimhe agus an dèidh mar a thachair dha na h-Èireannaich, cha robh mi airson an aon mhearachd a dhèanamh às ùr. ’S e na thachair ann an Èirinn gun do stèidhich Foras na Gaeilge pròiseact Téacs, app airson teacsadh ro-innseach airson na Gaeilge. Dh’obraich sin math ge leòr fad bhliadhnaichean ach cha robh iad ’ga nuadhachadh agus cha do dh’obraich e ach air grunn handsets agus bhàsaich e mu dheireadh thall. Bha mi a’ sireadh pròiseact mòr le iomadh cànan ’na lùib agus sgioba de luchd-leasachaidh a chumadh air dol e. Ach cha robh a leithid idir furasta ri lorg. Ach mu dheireadh thall, thachair mi ri Adaptxt agus le taic o Kevin Scannell, gaisgeach-d nan cànan beaga, chaidh agam air an dàta air an robh feum a chruinneachadh agus chuir Adaptxt Gàidhlig, Gaelg agus Gaeilge ris na cànain aca. B’ fheudar dhuinn gluasad gu gleus eile bliadhnaichean an dèidh sin, Swiftkey, agus tha Gàidhlig air nochdadh ann an gleus eile no dhà on àm sin. Ach bha mi cho sona ri sagart is eallach leabhraichean air nuair a thàinig Adaptxt a-mach. Bha gleusan eile, mar Firefox, air nochdadh sa Ghàidhlig roimhe sin ach bha–agus tha–e doirbh daoine a thàladh air falbh o coimpiutairean làn-Bheurla. Bidh a’ chuid as motha ’gan cleachdadh dìreach mar a thàinig iad às a’ Bhùth agus a ghnàth, tha sin a’ ciallachadh Beurla, Beurla, Beurla. Ach bha uiread a dhaoine deònach Adaptxt a chur air na fònaichean is tablaidean aca gun robh mi fo iongnadh mòr–agus cho toilichte ’s a ghabhas.

Chan eil dad nas fheàrr na a bhith ag obair air seann chlàradh no teacsa le bodach no cailleach a chaochail deicheadan air ais agus dàta a chur ris na mapaichean, a dh’innseadh gur e, aig àm, ponach am facal a bha aig daoine air balach ann am baile Inbhir Nis. Tha e cha mhòr mar séance beag, a’ bruidhinn ris na linntean a dh’aom.

An rud eile, sin fo-phròiseact aig an Fhaclair Bheag, gleus nam mapaichean. Tha sinn uile eòlach air na deasbadan ud a thaobh faclan “nach canadh duine air eilean seo no siud”. Cha robh mi riamh deònach pàirt a ghabhail annta, ged nach eil mi nam matamataigear, tha mi a’ tuigsinn na th’ ann an representative sample agus chan eil aonan, ge be dè cho eòlach ’s a tha iad air cànan, na representative sample. Bhuail na mapaichean a thug Rob Ó Maolalaigh dhuinn sa chùrsa aige air dual-chainntean na Gàidhlig a thaobh na diofar sgìrean a chleachd, can, siobhag seach buaic agus bha guth olc ’nam cheann ag innse dhomh gum biodh rud mar sin snasail san Fhaclair Bheag. Agus ri linn sin, gu math tràth ann am beatha an fhaclair, chuir sinn gleus ris a chumadh dàta mu na h-àitichean ris an robh faclan a’ buntainn. Chan eil dad nas fheàrr na a bhith ag obair air seann chlàradh no teacsa le bodach no cailleach a chaochail deicheadan air ais agus dàta a chur ris na mapaichean, a dh’innseadh gur e, aig àm, ponach am facal a bha aig daoine air balach ann am baile Inbhir Nis. Tha e cha mhòr mar séance beag, a’ bruidhinn ris na linntean a dh’aom. Agus tha e a’ cur solas beag, mu dheireadh thall, air cuid dhe na faclan ann am faclairean mar Dwelly a dh’fhàgadh thu a’ sgròbadh do chinn roimhe a thaobh cò às a thàinig am facal annasach seo no siud.

Mapa airson ‘mand’ (Am Faclair Beag)

“Dè na duilgheadasan a th’ ann ceangailte ri bhith a’ leasachadh teicneolas airson mion-chànan mar a’ Ghàidhlig?”

Tha iomadh rud ann a tha ’ga fhàgail doirbh ach aig deireadh an latha, an dèidh dhomh a bhith an sàs ann an iomairtean teicneolais d’ an leithid fad fichead bliadhna, chanainn gur e gleus sgaoilidh an rud as motha a tha a dhìth oirnn. Innsidh mi dhut carson. Feuch na stràcan. Chan eil e doirbh PC no Mac a chur air dòigh airson ’s gun toireadh iad dhut na stràcan anns gach prògram, gun a bhith a’ tionndadh gu na gleusan àrsaidh ’s toinnte mar ‘Alt 0224’ airson ‘à’. Ach mur eil earbsa annad ann a bhith a’ fiolcadh leis a’ choimpiutair agad, mar is trice cuiridh e eagal do bheatha ort ma mholas cuideigin dhut a dhol a-steach dha na settings. Air an làimh eile, tha daoine a bu chòir a bhith eòlach air rudan mar sin, can muinntir tech supp, no na daoine a dhèiligeas ri riarachadh coimpiutaireachd sna sgoiltean, cho aineolach d’ a thaobh iad fhèin. Nach pailt na litrichean a sgrìobh mi gu comhairlean a thaobh rudan mar an “UK Extended keyboard layout” air coimpiutairean nan sgoiltean agus shaoileadh tu gun do dh’iarr mi orra an space shuttle a phrògramachadh… ’S e na tha a dhìth oirnn buidheann a thèid mun cuairt nan coimhearsnachdan Gàidhlig–agus oifisean nan daoine a nì co-dhùnaidhean a bhuineas ri saoghal digiteach na Gàidhlig–a bheir taic dhaibh leis an teicneolas Gàidhlig a th’ ann an-diugh eadar keyboard layouts agus Firefox ann an Gàidhlig agus a sgaoileas fiosrachaidh mu an dèidhinn. Ach a-rèir coltais, chan eil sin sexy gu leòr airson nam buidhnean stèidhichte… agus ri linn sin, tha aonadan Gàidhlig againn fhathast aig a bheil coimpiutairean air nach urrainn dhut à a sgrìobhadh gun copypaste no rud gòrach mar sin.

Nach pailt na litrichean a sgrìobh mi gu comhairlean a thaobh rudan mar an “UK Extended keyboard layout” air coimpiutairean nan sgoiltean agus shaoileadh tu gun do dh’iarr mi orra an space shuttle a phrògramachadh…

“Anns an làimh eile, bheil cothroman sam bith ann ma bhios tu ag obair le mion-cànan? Cò iad?”

Tha agus chan eil. Aig amannan tha e mar a bhith ’nad shuidhe air dùn-gainmhich. Chan eil stèidh dhaingeann fodhad idir agus an rud a sheas an-dè, falbhaidh e a-màireach. Can Google in Your Language--chaidh a chur ann gun làmh a bhith aig a’ choimhearsnachd ann agus chaidh a spìonadh air falbh gun làmh aig a’ choimhearsnachd. No can rudan mar Adaptxt agus Swiftkey–dìreach nuair a thug sinn ceum air adhart, tha Amazon is Google a’ cur bogsa ’nar dachaighean nach bruidhinn ach Beurla. Agus ma bhruidhneas tu ri teaghlaichean sa Chuimrigh nach bruidhinn dad ach Cuimris aig an taigh, chan e deagh-bhuaidh a th’ aig na h-innealan ud. Tha iomadh cothrom ann ach feumaidh sinn stèidh beagan nas co-ionnan. Feumaidh sinn seasamh còmhla ris na cànain bheaga eile–agus tha mi a’ gabhail feadhainn mar Eastoinis agus Catalanais a-staigh an-sin–agus cothachadh airson stèidh laghail aig ìre an Aonaidh Eòrpaich a sparras air companaidhean mòra cothrom a thoirt do chànain mar a’ Ghàidhlig agus a’ Lugsamburgais ceum a chumail ri ruith nan teicneolasan ùra.

“Nad bheachd fhèin, dè an dùbhlan as motha a th’ ann airson teicneolas na Gàidhlig anns a’ chòig bhliadhna ri teachd?”

Sasamach [d. facal snasail airson ‘Brexit’] na mallachd. Cha suarach an t-airgead a thàinig à diofar sporan an Aonaidh Eòrpaich a chur taic ri pròiseactan teicneolais Ghàidhlig thairis air na bliadhnaichean, eadar maoineachadh acadaimigeach agus maoineachadh nan roinnean, can. Cuiridh mi mo cheann an geall nach cùm Lunnainn an aon taic rinn.

“Dè an fhàisneachd a th’ agad airson teicneolas cànain anns a’ bhliadhna 2050? Dè bu mhath leat fhaicinn airson teicneolas na Gàidhlig ron àm sin?”

An lagh ud a mhol mi gu h-àrd! Ach mas e gleus teicneolais fhèin a bha thu faighneachd, bhiodh e math gleus a nì sgrìobhadh de chainnt math, leis cho dona ’s a tha daoine air sgrìobhadh na Gàidhlig san fharsaingeachd. Ach air an làimh eile, nan cuireamaid sgoil Ghàidhlig anns gach clachan sna h-Eileanan mar a bha againn roimhe, bhiodh sin a cheart cho math, nach biodh?

Sgoil Staoineabrig: an sgoil mu dheireadh ann an Uibhist far an robh a’ chlann uileag ag ionnsachadh tro mheadhan na Gàidhlig. Chaidh a dùnadh ann an 2010 (© Ailean Dòmhnallach 2010)

Ceanglaichean

Agallamh leis an Ollamh Kevin Scannell

Anns an t-sreath seo, bidh sinn a’ coimhead air sàr-laoich a rinn adhartas mòr ann an teicneolas nan cànanan gàidhealach. Airson a’ chiad agallaimh, cha b’ urrainn dhuinn na b’ fheàrr fhaighinn na ‘n t-Ollamh Kevin Scannell à Oilthigh San Louis, anns na Stàitean Aonaichte. Tha Kevin air an t-uabhas de ghoireasan a chur a-mach airson nan trì cànanan Gàidhlig, agus tha e o chionn ghoirid air duais Fulbright fhaighinn gus goireasan airson Gàidhlig na h-Èireann a chruthachadh a chleachdas teicneolas niùrail agus ionnsachadh domhainn. Mòran taing do Kevin a bhith deònach an t-agallamh seo a dhèanamh.

In this series, we look at heroes of language technology who have made significant progress for the Gaelic languages. For the first interview, we couldn’t do better than Professor Kevin Scannell of St. Louis University (USA). Kevin has produced a vast number of resources for the three Gaelic languages (Gaelic, Irish and Manx), and has recently been awarded a Fulbright Award (2019) to develop tools for Irish Gaelic that utilise neural networks and deep learning techniques. Many thanks to Kevin for agreeing to do this interview with us. 

We’re presenting some of these interviews in a Gaelic or Irish only format. If required, they can be translated to English using Google Translate.  

Agallamh leis an Ollamh Kevin Scannell

Interview with Professor Kevin Scannell

An tOllamh Kevin Scannell

Tá Kevin Scannell ina Ollamh le Matamaitic agus Ríomheolaíocht in Ollscoil San Louis, Missouri. Oibríonn sé i gcomhar le grúpaí ar fud an domhain le hacmhainní ríomhaireachta a fhorbairt a chuidíonn leo a dteanga dhúchais a úsáid ar líne. Tá suim ar leith aige sa Ghaeilge agus sna teangacha Ceilteacha eile; tá gramadóir, litreoir, agus teasáras Gaeilge forbartha aige, chomh maith le foclóirí agus inneall aistriúcháin Gàidhlig-Gaelg-Gaeilge.  Glacann sé páirt i dtogra a sholáthraíonn leaganacha Gaeilge de roinnt táirgí ríomhaireachta mór-le-rá: Mozilla Firefox, LibreOffice, Gmail, agus Twitter mar shampla. I 2011, bhunaigh sé an suíomh Indigenous Tweets chun mionteangacha agus teangacha dúchasacha a chur chun cinn sna meáin shóisialta.

“Cá as tú agus cá bhfuair tú Gaeilge ar dtús?”

Is as Bostún Mheiriceá mé ó dhúchas. Thosaigh mé ag foghlaim na Gaeilge i Meiriceá sna 1990idí, i m’aonar, ó leabhair agus ó fhoclóirí. Bhí go leor eolais agam ar litríocht na Gaeilge agus gramadach na Gaeilge ach ní raibh mé compordach leis an teanga labhartha ar feadh blianta fada. Thosaigh mé ag teacht go hÉirinn thart ar 2006 agus tháinig feabhas ar mo chumas labhartha de réir a chéile.

“Cad a thug ort oibriú le teicneolaíocht na teanga? Conas a thosaigh tú sa réimse seo?”

Go bunúsach, thosaigh mé ar an obair seo mar gheall ar na riachtanais a bhí ormsa féin mar fhoghlaimeoir. Sna 1990idí, ghlac mé páirt sna liostaí r-phoist Gaelic-L agus Gaeilge-A agus bhí díomá orm nach raibh seiceálaí litrithe ar fáil. Mar a tharlaíonn sé, bhí mé ag bailiú bunachar sonraí foclóireachta mar chuid de mo phróiseas foghlamtha. Ní raibh mórán oibre i gceist seiceálaí litrithe a chruthú as sin — ba é sin GaelSpell — foilsíodh an chéad leagan 20 bliain ó shin. Ní raibh aon saineolas agam ar an réimse seo ag an am — bhí mé i mo mhatamaiticeoir, ach bhí scileanna ríomhaireachta sách maith agam. Agus ba léir dom ag an am gurbh fhiú corpas a thógáil chun cabhrú liom an bunachar foclóireachta a thógáil níos sciobtha, agus le bheith cinnte go raibh na focail is coitianta agam. Bhailigh mé b’fhéidir milliún focal Gaeilge ón Idirlíon sna 90idí, agus lean mé ar aghaidh leis an obair sin (i dteangacha eile freisin), agus anois tá níos mó ná 200 milliún focal sa gcorpas Gaeilge ar mo ríomhaire!

“I measc na dtionscadal teicneolaíochta a raibh tú páirteach iontu, cé acu ceann ba thábhachtaí nó ba thaitneamhaí duit?”

Creid nó ná creid, déarfainn gurb é GaelSpell an tionscadal is tábhachtaí (de réir líon daoine atá ag baint úsáid as) cé nach bhfuil sé róspéisiúil ó thaobh cúrsaí teicneolaíochta. Rinne mé seiceálaí gramadaí darb ainm An Gramadóir freisin, agus bíonn go leor daltaí scoile agus mac léinn ollscoile á úsáid chun aistí a sheiceáil. Ach an ceann is tábhachtaí dar liomsa ná “An Caighdeánaitheoir”, tionscadal nach bhfuil i mbéal an phobail ar chor ar bith. Rud thar a bheith simplí atá ann — déanann sé caighdeánú ar litriú agus ar ghramadach téacsanna Gaeilge a bhí scríofa roimh an gCaighdeán Oifigiúil. D’fhoilsigh Rialtas na hÉireann mórán leabhar Gaeilge sna 1930idí, ach úsáidtear an seanlitriú iontu (agus an seanchló chomh maith). Mar sin, tá sé i bhfad níos deacra tairbhe a bhaint astu i gcúrsaí NLP, mar shampla, agus bíonn fadhbanna ag na foclóirithe Gaeilge cuardach a dhéanamh sna téacsanna seo.

Leaganacha den fhocal “Gaeilge” sa chorpas

Tá an tionscadal foclóireachta focloir.ie (An Gúm) agus Foclóir na Nua-Ghaeilge (Acadamh Ríoga na hÉireann) ag baint úsáid as an gCaighdeánaitheoir. Agus is féidir é a úsáid chun seantéacsanna a réiteach do lucht léitheoireachta sa lá atá inniu ann, daoine nach bhfuil cleachta leis an seanlitriú. Tá sé sin déanta agam le roinnt seanleabhar.

“Cad iad cuid de na fadhbanna le teicneolaíocht a fhorbairt do mhionteanga mar an Ghaeilge?”

Ba mhaith liom tuilleadh daoine óga a mhealladh chun obair a dhéanamh sa réimse. Tá grúpaí taighde ann in áiteanna éagsúla in Éirinn (DCU, Trinity, NUIG go háirithe) agus bíonn mic léinn máistreachta/PhD acu anois is arís, ach ní leor é sin chun an obair a chur ar bhonn slán fadtéarmach. Ba chóir do Rialtas na hÉireann infheistíocht mhór a dhéanamh sna grúpaí sin (agus i gcinn eile nach iad!); níos mó mac léinn, léachtóirí, ollúna, srl. Tá na daoine céanna i mbun oibre ar phlean teicneolaíochta don Ghaeilge anois, faoi scáth Roinn na Gaeltachta, agus le cúnamh Dé tiocfaidh tuilleadh airgid chun cinn mar thoradh ar an bplean.

An rud eile atá ag teastáil ná comhoibriú níos fearr leis na mórchomhlachtaí teicneolaíochta. Tá saineolas teicneolaíochta agus sonraí againne nach bhfuil ag Google, mar shampla, agus bheadh sé an-éasca feabhas mór a chur ar tháirgí Google cosúil le Google Translate. Agus tá an t-ardán atá acu ag teastáil uainne! Mar shampla, rinne mé aistritheoir Gàidhlig > Gaeilge agus Gaelg > Gaeilge roinnt blianta ó shin, ach is annamh a bhaineann éinne úsáid as; tá sé i bhfad níos éasca rudaí a aistriú go díreach in Chrome.

“Ar an láimh eile, an bhfuil aon deiseanna ann má oibríonn tú le mionteanga? Cé hiad?”

Tá! Tá pobal na Gaeilge an-díograiseach maidir leis an teanga, agus den chuid is mó bíonn siad réidh troid a dhéanamh ar son na teanga nó ar son cearta teanga. Sna cásanna ina rabhamar in ann comhoibriú a dhéanamh leis na comhlachtaí teicneolaíochta, mar shampla an t-aistriúchán a rinneamar ar GMail nó ar WhatsApp, obair dheonach a bhí ann. Ach mar sin féin, bhí sé an-éasca grúpa mór daoine a earcú chun an obair a dhéanamh; thuig siad láithreach an tábhacht a bhaineann leis na táirgí seo a bheith ar fáil i nGaeilge.

“I do thuairim, cad é an dúshlán is mó don teicneolaíocht Ghaelach do na cúig bliana amach romhainn?”

Gan a bheith fágtha as an rás chun samhlacha móra néaracha a chruthú. Tá mé i mbun oibre ar na cúrsaí seo faoi láthair, agus feicim an chumhacht agus na féidearthachtaí atá ann. Ach an taighde atá ar siúl in Google, Facebook, NVIDIA, srl., tá sé dírithe céad faoin gcéad ar Bhéarla. Agus ciallaíonn sé sin go bhfuil an taighde “overfit” ar theangacha gan mórán moirfeolaíochta mar shampla, agus (níos measa) ar theangacha a bhfuil na céadta billiún focal acu le haghaidh traenála. Caithfimid ár dtaighde féin a dhéanamh: cad iad na teicnící is fearr nuair nach bhfuil mórán sonraí traenála agat? Conas is féidir tairbhe a bhaint as na hacmhainní eile atá againn: foclóirí den chéad scoth, saineolas teangeolaíochta, srl.

“Cén fhís atá agat maidir le teicneolaíocht teanga sa bhliain 2050? Cad ba mhaith leat a fheiceáil don teicneolaíocht Ghaelach roimh sin?”

Comhéadain ghutha ar gach gléas/ríomhaire/táirge leictreonach i mbeagnach teanga ar bith. Is é sin an scoilt dhigiteach nua a bheidh ann; beidh an teicneolaíocht ghutha ar fáil i dteangacha áirithe, agus ní bheidh sí ar fáil i dteangacha eile. An chontúirt atá ann ná nach mbeidh daoine sásta cloí leis na teangacha sa dara grúpa. Is é mo thuairim go bhfuil an Ghaeilge go díreach ar an teorainn faoi láthair. Le fís fhadtéarmach, tuilleadh infheistíochta ón Rialtas, agus comhoibriú le comhlachtaí teic, beimid in ann an sliabh a dhreapadh. Ach níl sé deacair malairt an scéil a shamhlú ach oiread.

Naisc

  • Cadhan: na hacmhainní Gaeilge go léir ag Kevin
  • Intergaelic: Aistriúchán meaisín idir Gàidhlig agus Gaeilge, agus Gaelg agus Gaeilge
  • Léacht a thug Kevin faoi “ailtireacht seirbhís-bhunaithe” do Theicneolaíochtaí Gaeilge

New Gaelic language technology website launched

A Linguistic Toolkit for Scottish Gaelic

Dr Loïc Boizou (Vytautas Magnus University) and Dr William Lamb (University of Edinburgh) have collaborated on a new bilingual website that provides a linguistic toolkit for Scottish Gaelic. Called Mion-sgrùdaiche Cànanachais na Gàidhlig or the Gaelic Linguistic Analyser, the site provides users with tools for analysing the words and structures of Gaelic sentences. The information provided by these tools can be used for additional natural language processing (NLP) tasks, or just for exploring the language further. This new website presents the tools together for the first time and provides users with two ways of interacting with them: a graphical interface and a command line method.

‘Like black magic’

The website’s development goes back to the late 1990s, when Lamb was working on his PhD. In order to investigate grammatical variation in Gaelic, Lamb constructed the first linguistically annotated corpus of Scottish Gaelic, spending over a year annotating 80,000 words of Gaelic by hand. He says, ‘It was a slog. Typing in 100,000 tags by hand… just don’t do it. I developed a nasty case of repetitive strain injury and vowed never to do this sort of thing by hand again.’ After returning to the University of Edinburgh in 2010, after 10 years at Lews Castle College Benbecula, he revisited his corpus to develop an automatic part-of-speech tagger and make the corpus available to other researchers. Today, the corpus is known as the ‘Annotated Reference Corpus of Scottish Gaelic’ or ARCOSG and is available freely online.

The corpus forms the backbone of two of the tools on the new website: the part-of-speech tagger and the syntactic parser. They were created using machine learning techniques, modelling the kinds of patterns that you find in Gaelic speech and writing. Lamb said, ‘What you can do today even with a relatively small amount of text is tremendously exciting. When we looked at developing a POS tagger in the 90s, we would have had to program each type of pattern manually to enable the computer to recognise it properly. Now, you can just run the corpus through a set of algorithms and the computer works the patterns out itself. It’s like black magic’.

Dr Will Lamb

The lemmatiser was developed in a different way, using a form of the popular online dictionary, Am Faclair Beag. Lamb explains: ‘When we were working on the part-of-speech tagger in 2013 or 14, Sammy Danso and I got in touch with Michael Bauer and Will Robertson, who put together the fantastic Am Faclair Beag. We were going to try to leverage some of the information in the dictionary, and they generously offered their data for this purpose. While that plan didn’t materialise, I was able to create a root finder or lemmatiser with it years later, which we used to help create the first neural network for Gaelic. The lemmatiser sat in the virtual cupboard for a while, until I was contacted by Loïc in 2017. Loïc wanted to create a proper Gaelic lemmatiser, and I was onboard.’

Dr Loïc Boizou

Dr Loïc Boizou is a Swiss French NLP specialist working in Lithuania (Vytautas Magnus University) who is interested in computational tools for under-resourced languages. He received his PhD in Natural Language Processing at Inalco (Institute of Eastern Languages and Civilisations) in Paris. About the project, he said, ‘I am very supportive of cultural diversity and Gaelic is one of the few endangered languages that provides serious opportunities for distance learning, thanks to Sabhal Mòr Ostaig. I really enjoyed learning the language and I decided to use my NLP skills to give it a bit of a boost. I learned about Will’s corpus and found we could cooperate very nicely.’

Roots, Trees and Tags

The website provides different ways of exploring Gaelic text. Lemmatisation is the simplest of the tools and involves retrieving a word’s root form. If you were to input a sentence like tha na coin mhòra ann (‘the big dogs are here’), the website would return ‘bi’, ‘cù’ and ‘mòr’ as the lemmas (root forms) of tha, coin and mhòra. The website also offers part-of-speech tagging, which provides grammatical information about words in a sentence. Using the previous example, the website’s algorithms would assign ‘POS tags’ to each word, as in the third tab-separated value in each line below (glossed in inverted commas):

tha	bi	V-p       'Verb: present tense'
na	na	Tdpm      'Article: pl masc def'
coin	cù	Ncpmn     'Noun: common pl masc nom'
mhòra	mòr	Aq-pmn    'Attributive adjective: plur masc nom'
ann	e	Pr3sm     'Prep pronoun: 3rd person sing masc'
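Output in this tab-separated shape is easy to consume downstream. As a minimal sketch (the rows are copied from the example above; the parsing code is illustrative and not part of the website itself), one might read it like this:

```python
# Each line of the tagger's output: token <TAB> lemma <TAB> POS tag.
output = [
    "tha\tbi\tV-p",
    "na\tna\tTdpm",
    "coin\tcù\tNcpmn",
    "mhòra\tmòr\tAq-pmn",
    "ann\te\tPr3sm",
]

# Split each line into a (token, lemma, tag) triple.
rows = [tuple(line.split("\t")) for line in output]

# Extract just the lemmas, e.g. to build a frequency list or search index.
lemmas = [lemma for _, lemma, _ in rows]
print(lemmas)  # ['bi', 'na', 'cù', 'mòr', 'e']
```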

The grammatical information in this example is quite precise, but such precision comes at a cost: the default tagger is subject to error about 9% of the time. For users who want simpler POS tags and more accurate tagging, the website also offers a ‘simplified tagset’ option, which achieves 95% accuracy. The same sentence, submitted with this option, would return the following:

tha	bi	Vp    'Verb: present tense'
na	na	Td    'Article: definite'
coin	cù	Nc    'Noun: common'
mhòra	mòr	Aq    'Adjective: attributive'
ann	e	Pr    'Prepositional pronoun'
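Judging from these two examples, each simplified tag looks like the leading category letters of the detailed tag with hyphens dropped (V-p becomes Vp, Ncpmn becomes Nc). The sketch below encodes that pattern; it is a speculative reconstruction for illustration, not the site’s actual implementation:

```python
def simplify_tag(full_tag: str) -> str:
    """Collapse a detailed tag to a coarse two-character tag.

    Mirrors the pattern visible in the examples above (V-p -> Vp,
    Aq-pmn -> Aq); an illustrative guess, not the website's own code.
    """
    return full_tag.replace("-", "")[:2]

for full in ["V-p", "Tdpm", "Ncpmn", "Aq-pmn", "Pr3sm"]:
    print(full, "->", simplify_tag(full))
# V-p -> Vp, Tdpm -> Td, Ncpmn -> Nc, Aq-pmn -> Aq, Pr3sm -> Pr
```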

In addition to lemmatisation and POS-tagging, the site also offers syntactic parsing, using a syntactically annotated corpus developed by Dr Colin Batchelor (Royal Society of Chemistry). Again, using the same sentence, the website returns the following if parsing is selected:

1	tha	bi	V-p	0	root
2	na	na	Tdpm	3	det
3	coin	cù	Ncpmn	1	nsubj
4	mhòra	mòr	Aq-pmn	3	amod
5	ann	e	Pr3sm	1	xcomp:pred

The number in the 4th column indicates which element in the sentence the word is governed by. In the case of tha, the number is 0, because it is the syntactic root. Both na and mhòra, on the other hand, are parts of a noun phrase governed by element 3, coin. This is a numerical way of displaying the kind of information that is often conveyed in a syntactic tree, such as in the example below. The information in column 5 indicates the function of the element in the sentence. For example, the function of coin is nsubj or ‘nominal subject’. More information on Dr Batchelor’s parser can be found here.
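Because each row names its own index and its head’s index, the head and dependent of every arc can be recovered programmatically. A short illustrative sketch using the rows above (the traversal code is ours, not the parser’s):

```python
# Rows mirror the parser output above:
# (index, form, lemma, tag, head index, relation).
parse = [
    (1, "tha",   "bi",  "V-p",    0, "root"),
    (2, "na",    "na",  "Tdpm",   3, "det"),
    (3, "coin",  "cù",  "Ncpmn",  1, "nsubj"),
    (4, "mhòra", "mòr", "Aq-pmn", 3, "amod"),
    (5, "ann",   "e",   "Pr3sm",  1, "xcomp:pred"),
]

# Map each index to its surface form; index 0 is the artificial root.
forms = {i: form for i, form, *_ in parse}
forms[0] = "ROOT"

# Recover every head -> dependent arc with its relation label.
deps = [(forms[head], form, rel) for i, form, _, _, head, rel in parse]
for head, dep, rel in deps:
    print(f"{head} -> {dep} ({rel})")
# ROOT -> tha (root), coin -> na (det), tha -> coin (nsubj), ...
```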

Syntactic tree for tha na coin mhòra ann

Next Steps

When asked what the next steps are for the language, Lamb explains that it’s an exciting time: ‘Well, this is really just an interim step and there is a lot to do. For a start, we hope to improve the accuracy of the tools gradually and perhaps augment them. Gaelic is, in some ways, in a very fortunate position when it comes to language technology. Advanced tools are starting to come online — like Google Translate, a handwriting recogniser and speech synthesiser — and we can exploit great resources like DASG, ARCOSG and recordings from the School of Scottish Studies Archives to push into territory that would have seemed like science fiction a few years ago.’

The dream is artificial general intelligence. ‘Elon Musk is famous for saying that one day, he’d like to die on Mars – just not on impact. Before I kick the proverbial bucket, I’d like to chat with a computer that has better Gaelic than I do’.
