(English Synopsis: Musings about what the words fith fath fuathagaich /fi fa fuəgɪç/, which are spoken by giants in certain tales such as Gille an Fheadain Duibh ‘The Lad of the Black Whistle’, could mean and whether there might possibly be a link to the fee-fi-fo-fum from Jack and the Beanstalk.)
In old tales, it happens quite often that an unusual word, phrase or idiom turns up. But that is no great surprise: Gaelic was the first language of those who told these stories. They pictured the world in Gaelic first and foremost, and they had a nimble, supple, fluent Gaelic, the like of which is as scarce today as stars on a showery night. But because these stories came down to them from generation to generation, something occasionally appears in them that looks truly ancient and is not at all easy to understand.
There is a tale that turns up in various forms, but at its heart is a young boy who fights three giants and their mother. The boy gets work herding for an old woman in some township or other, and he goes out with the old woman’s goats. Although the old woman forbade him to go the way of the giants, that is exactly what he does. And when he reaches the wall around the first giant’s house, he makes a hole in it and lets the goats in. This naughty boy (well, he has just damaged a poor giant’s wall, and the goats are now eating his crops!) then climbs up a tree and plays a whistle there. Then a giant comes, wanting a polite word with this rascal up in the tree, and the giant has a particular rhyme as he comes:
Air fith fath fuathagaich¹ air barraibh an albhagaich,²
’S fhada bha mo chorp air feadh ga meirgeadh ’s tolladh
a’ feitheamh air greim dhe d’ fheòil is
balgam dhe d’ fhuil, a mhic an Albannaich.³
¹ or fuagaich/fuamhaich
² or almhagaich/all(a)mharaich, and sometimes even air baile nan Albannaich
³ or rìgh
Now, the second, third and fourth lines are easy enough to understand, bloody though they are. But the first line has always puzzled me. What is an albhaga(i)ch? And what exactly does air fith fath fuathagaich mean? I must admit I do not know, although I have a suspicion of sorts. (If you would like to listen to it, here is one of the recordings held by the School of Scottish Studies. The fith fath fuathagaich appears in the second recording, perhaps two minutes after the start of the recording.)
To begin with, I think baile nan Albannaich is simply a mistake, with the storyteller going slightly astray (or even the person who made the transcription; the recordings are not always very clear). Although I am not at all sure about albhagaich, since every version of the tale says that he climbed up a tree, I would say that air barraibh (< bàrr + -aibh) tells us that he is sitting on top of something, whatever an albhagaich is. Albhagaich brings ailbh(eag) “rock” to my mind, but why would he be on a rock when he has just climbed a tree?
But anyway, it is the first part that leaves an itch in my poor mind. What are these words? Are they nonsense words, vocables as it were? I am no piper, but it does not look like canntaireachd, not fuathagaich at any rate. There is not much that makes sense in the dictionaries either. There is one word, fìth-fàth, a cloak that makes you invisible. It is a very old word; it appears in Old Irish as fía fé (and other forms). But I do not think that is the kind of cloak we have here. Nothing in any of the tales mentions invisibility.
The one thing that struck me, and this is the half-suspicion I mentioned earlier, is the rhyme that appears in ‘Jack and the Beanstalk’:
Fee-fi-fo-fum,
I smell the blood of an Englishman,
Be he alive, or be he dead
I’ll grind his bones to make my bread
It is not just that fee-fi-fo-fum is rather like fith fath fuathagaich; the whole rhyme is very similar in nature to the Gaelic rhyme, isn’t it?
Apparently, it is in Shakespeare that this first appears in writing (as fie, foh, and fum). The Oxford English Dictionary (which has it as fee-faw-fum) tells us that it is doggerel, but I have not found much that explains what fee-fi-fo-fum means, or where it came from. That is, are they English nonsense words, or might English have taken this from another language? Although folktale scholars tell us that ‘Jack and the Beanstalk’ belongs to a group of tales known as “the dragon slayer”, that magic bean is found only in Britain. It would not surprise me if Jack had a Celtic root or rootlet, rather as the numerals of the Britons left a trace in yan tan tethera.
But although I am full of wonder, there is no knowing. Do the Welsh have a tale like this, I wonder? Or am I far off the mark, and is there a completely different explanation? What do you think?
As Copyright Administrator for the Decoding Hidden Heritages project, it’s my role to investigate the copyright status of the sound material and transcriptions in the Tale Archive.
Everyone involved with a sound recording has copyright to their material. As a result, it can be a lengthy process when checking which individuals are involved with a recording, and if The School of Scottish Studies Archives (SSSA) hold records of copyright assignation. Typically, the search must go outwith SSSA and that’s when I feel like donning my deerstalker! Today I will highlight some of that process.
We come across a number of contributors who are down as Unknown or Anonymous in our collections. There can be a few different reasons why this happened; not everyone’s names were captured by the fieldworker, or it was a cataloguing error. Sometimes people just wished to remain anonymous – either they were too shy to go on record, or the material may have been deemed too sensitive. These days, we have distinct copyright and Data Protection rules to safeguard sensitive material. We also have methods to close or mute someone’s material for a set period of time rather than anonymising completely or forever. So there is some flexibility in the approaches that we can take.
If persons are not explicitly named for a recording, it doesn’t mean we can assume that copyright can be cleared on that basis alone – we still have to do our due diligence. Given that this week marks International Women’s Day 2022 (March 8), let’s look at an example of two anonymous women in the Tale Archive. This card refers to part of the recording SA1959.027, but it doesn’t give us much information, other than that it was recorded in Smearisary, Glenuig and the story is of Murchag is Meanchag/Murchag A’s Mionachag.
My next step was to look at what is included on the whole recording. On checking the Summary book for 1959, it shows that this was a recording made by Calum Maclean, Basil Megaw and Ian Whittaker. The other contributors on the tape were Angus MacNeill, Sandy Gillies and “Anon Woman A”, “Anon Woman B / Mrs MacDonald (?)” and “Anon Woman C”. Not terribly enlightening, in the grand scheme of things! Even if Anon Woman B might be a Mrs MacDonald, it doesn’t give us anything to go on. From here I went down to the archive store room to look at the original tape box – sometimes there was a listing completed at the time of recording and included in the box.
A beautiful listing both outside and inside the box, but – as an archive colleague from the past has noted, in pencil – “Who are the informants?”
Listening to the recording itself can be helpful in some cases, because often the name is given at some point – but my Gaelic is not yet good enough for this tape.
So, what now? I will contact our colleagues at Tobar an Dualchais because parts of this tape are available to listen to online; these recordings are by the named contributors. It is possible that, when their researchers were seeking copyright clearance, they were able to find out who these anonymous female storytellers were. I will keep you updated.
It’s really important to find out who unknown people in our collections are and, if possible, put a name to the voice and acknowledge their important contribution in the archives. I include a very short extract of Anon Woman A and Anon Woman B / Mrs MacDonald (?), of Smearisary, and thank them for making my job so interesting!
This clip is placed here on a risk-balanced approach and that is another part of the process for another blog post!
Extract from SA1959.027 from collection of School of Scottish Studies Archives.
A Faroese stamp featuring the legend of Kópakonan (the Seal Woman).
Whilst working on data capture for the Decoding Hidden Heritages Project, I came across this tale of a seal-woman, or selkie (ScG: ròn ‘seal’), that struck a chord with me. Stanley Robertson from Aberdeen tells of the story he heard from his father, ‘The Selkie o the River Dee’, which Stanley was told was a true story and referred to an ancestor of his.
A man spies a seal-woman coming out of the River Dee and shedding her skin on the shore. The man takes the skin and hides it from her to force the woman to go home with him and be his wife. They have several children together and one day the children find the seal skin the man had hidden. The woman takes the skin and disappears back into the River Dee, never to return, with the man arriving just in time to see her go.
Upon reading the story, it immediately occurred to me that this seems to have been a tragic incident that was ‘spun’ into a whimsical tale, likely for the benefit of the children involved. Sure enough, Stanley goes on to say that he thinks the woman might have actually committed suicide, as the story referred to “real” people. I think this story is fascinating because it has the ability to truly resonate with the listener or reader on a very emotional level. You don’t need to be an expert on folktales to understand why or how it came to be. Similar seal-woman (or mermaid) stories are found across the North Atlantic, for example in Irish, Icelandic and Faroese folklore.
A much more cheerful selkie reference in popular culture can be seen in the beautifully-illustrated Irish feature film: Song of the Sea.
15 January 2022
Author: Dr Andrea Palandri, Postdoctoral Researcher
Andrea Palandri
In summer 2021, Gaois received funding under the AHRC-IRC scheme to begin a project on the Main Manuscript Collection from the archive of the Irish Folklore Commission (Cumann Béaloideasa Éireann, University College Dublin). This project is called Decoding Hidden Heritages. The subject of this blog post is the digitisation work under way as part of the project on the manuscripts of the Main Collection.
It is estimated that there are around 700,000 manuscript pages in the Main Manuscript Collection, making it one of the largest collections of folklore material in western Europe. This would have been a major challenge for digitisation had Transkribus not developed AI technology for handwriting recognition over the past few years. Decoding Hidden Heritages relies heavily on this technology, which allows the project to build its own handwriting recognition models based on particular scribes in the collection.
Since our researchers began working with the Transkribus software in early October, we have built three handwriting recognition models that work at an accuracy level above 95%: one for Seosamh Ó Dálaigh, one for Seán Ó hEochaidh and one for Liam Mac Coisdealbha, three of the most dedicated collectors who worked for the Commission.
Figure 1 (Left) Seosamh Ó Dálaigh collecting folklore from Tomás Mac Gearailt (Paraiste Márthain, Corca Dhuibhne) and (right) a manuscript he wrote from a recording he made of Tadhg Ó Guithín (Baile na hAbha, Dún Chaoin, Corca Dhuibhne) being transcribed in Transkribus.
Transkribus needs an accurate transcription that follows the manuscript page, faithful to the collector’s handwriting and dialect, in order to train the engine. After transcribing around fifty pages in this way, we trained a handwriting model that performs quite well (90%+). The project’s approach from that point is to transcribe a large number of pages automatically and have research assistants (Emma McGee, Kate Ní Ghallchóir and Róisín Byrne) correct them gradually. After that, we can retrain the models on a larger dataset to obtain better language models (~95%). The interim results of this work give us hope that the project will be able to reach a higher level of accuracy in the coming months, which would let us transcribe much of the Main Manuscript Collection automatically, almost overnight.
Figure 2 The learning curves of the language models built with Transkribus so far: Seán Ó Dálaigh (left), Seán Ó hEochaidh (centre) and Liam Mac Coisdealbha (right).
The manuscripts of the Main Collection are among the texts in which dialect features are most strongly evident in the corpus of modern Irish-language literature. It is Séamus Ó Duilearga’s own working method and editorial practice that appear in Leabhar Sheáin Í Chonaill. He inspired and founded the Folklore of Ireland Society in 1927, and there is no better explanation of this method than the words Séamus Ó Duilearga himself wrote in the book’s introduction:
Ní raibh ionnam ach úirlis sgríte don tseanachaí: níor atharuíos siolla dá nduairt sé, ach gach aon ní a sgrí chô maith agus d’fhéadfainn é.
I was only a writing instrument for the storyteller: I did not change a syllable of what he said, but wrote everything down as well as I could.
(S. Ó Duilearga, Leabhar Sheáin Í Chonaill, xxiv)
Few books published in Irish-language literature since then have stayed as faithful to the speaker’s dialect as Leabhar Sheáin Í Chonaill did: it has dialect forms such as bheadh saé instead of bheadh sé (‘he would be’), buaileav instead of buaileadh (‘was struck’) and fáilthiú instead of fáiltiú (‘welcoming’). Likewise, the language of the manuscripts in the Main Collection strongly reflects the dialect, or even the idiolect, of the informants. For example, there is a dialect tendency to say do raibh instead of go raibh (‘that ... was’); some people from Corca Dhuibhne in County Kerry had this, e.g. in the stories that Seosamh Ó Dálaigh wrote down from Tadhg Ó Guithín (Baile na hAbha, Dún Chaoin).
Figure 3 Diarmuid Ó Sé noted this dialect feature in Gaeilge Chorca Dhuibhne (§619)
The manuscripts of this collection are somewhat unusual because of the small dialect forms the collectors recorded as they transcribed. It is because of this linguistic diversity in the corpus that the project does not aim to build one large model to transcribe the whole Collection. Moreover, we are dealing not only with different dialects but also with different collectors, whose handwriting and spelling of dialect forms were not uniform. These difficulties make this Irish-language corpus quite heterogeneous. It must be handled with care and with the help of works on dialectology that describe the fine linguistic points found in it.
Anns an t-sreath seo, tha sinn a’ toirt sùil air laoich a rinn adhartas cudromach ann an teicneolas nan cànanan Gàidhealach. Airson a’ cheathramh agallaimh, cluinnidh sinn bho Roibeart MacThòmais. Coltach ri Lucy Evans, tha Rob air ùr thighinn gu saoghal na Gàidhlig. Chaidh fhastadh airson còig mìosan ann an 2021 mar phàirt de phròiseact a mhaoinich Data-Driven Innovations (DDI), far a robh an sgioba a’ cruthachadh teicneolas aithneachadh labhairt airson na Gàidhlig. Dh’obraich Rob air inneal coimpiutaireachd ùr-nòsach eile, An Gocair.
Nuair a bhios tu a’ feuchainn ri teicneòlas cànain a chruthachadh airson mhion-chànain, ’s e an trioblaid as bunasaiche ach dìth dàta. Chan eil an suidheachadh a thaobh na Gàidhlig buileach cho truagh ri cuid a mhion-chànanan eile, ach tha deagh chuid dhen dàta seann-fhasanta a thaobh dhòighean-sgrìobhaidh. Tha sin a’ fàgail nach gabh e cleachdadh gus modailean Artificial Intelligence a thrèanadh gun a bhith a’ cosg airgead mòr air ath-litreachadh.
Bidh An Gocair ag ath-litreachadh theacsaichean gu fèin-obrachail – tha e glè choltach ri dearbhadair-litrichidh. Chan eil ann ach ro-shamhla (prototype) an-dràsta agus tha sinn a’ sireadh taic a bharrachd airson a leasachadh. Aon uair ‘s gum bi e deiseil, b’ urrainnear a chur gu feum ann an iomadach suidheachadh, leithid foillseachadh, foghlam aig gach ìre, prògraman coimpiutaireachd eile agus rannsachadh sgoileireil. Cuiridh e gu mòr cuideachd ri pròiseact rannsachaidh ùr a tha a’ tòiseachadh an dràsta eadar còig oilthighean ann am Breatainn, Ameireaga agus Èirinn: ‘Decoding Hidden Heritages in Gaelic Traditional Narrative with Text-mining and Phylogenetics’.
In this interview series, we are looking at individuals who have significantly advanced the field of Gaelic, Irish and Manx language technology. For the fourth interview, we hear from Mr Rob Thomas. Like Lucy Evans, whom we interviewed a few months ago, Rob has come to the world of Gaelic language technology only recently. He was chosen from a strong field to work with us on a project funded by Data-Driven Innovations (DDI), in which we were developing the world’s first automatic speech recogniser for Scottish Gaelic. Rob worked on an important strand of this project – developing a brand-new piece of software called An Gocair.
When trying to develop language technology for minority languages, the most fundamental problem is data sparsity. The situation for Gaelic is not as dire as for some other minority languages, but much of the textual data available is outdated in terms of orthography. That makes it impossible to train machine learning models – at least without spending a lot of money on editing spelling.
An Gocair re-spells texts automatically – it’s basically an unsupervised spell-checker with some extra bells and whistles. It is currently only a prototype, however, and we are seeking additional support for its development. Once completed, it will be able to be used in a wide range of contexts, including publishing, education at all levels, as part of other computer programs and within academic research. It will also make a significant contribution to a new research project currently underway between five universities in Britain, America and Ireland: ‘Decoding Hidden Heritages in Gaelic Traditional Narrative with Text-mining and Phylogenetics’.
Interview with Rob Thomas
Agallamh le Roibeart MacThòmais
Tell us a little bit about your background. For instance, where are you from, and what got you into language technology work?
Hello! I’m from a small town in South Wales called Monmouth. I grew up mostly in the countryside, quite far from civilisation. My interest in linguistics probably stems from having a fantastic English teacher in my high school. (Shout out to Mr Jones.) I don’t know if it was the content or how he taught it, but I remember at the time really enjoying the subject and his lessons.
Rob Thomas
I went on to study English Language and Linguistics at the University of Portsmouth. After graduating, I worked for a while at Marks and Spencer as I was not yet sure what kind of career I was looking for. Still kind of directionless, I spent a year and a bit travelling and on return began working in tech support. I managed to find a course in Language Technology at the University of Gothenburg. I had recently found a new interest in programming, and this was a great way to merge my new interest and my academic foundation. After a few years living, studying and working in Sweden, I returned to the UK, began the job hunt and was lucky to find the position at the University of Edinburgh.
You mention studying language technology at the University of Gothenburg. What did you find most interesting about the course? Do you have any advice for someone who is thinking about studying language technology?
The course was fascinating and it attracted students from quite a broad background. The first meeting was like The Time Machine by H. G. Wells: we were all introduced as the linguist or the mathematician, cognitive scientist, computer scientist, philosopher etc. I think what stood out is that language technology, as a field, relies on input and experience from a multitude of academic backgrounds. This is due to the complex nature of language. I think I would advise anyone who is not from a technical or STEM background to think about how important your knowledge and perspective is for the future of language-based AIs, systems and services. But if, like me, you do come from a humanities background, be prepared to dive straight back into the maths that you thought you had managed to escape after you completed your GCSEs.
You are developing a tool for Scottish Gaelic that automatically corrects misspelled words and makes text conform to a Gaelic orthographical standard. That’s impressive for someone with Gaelic, and even more so for someone who doesn’t speak it. How did you manage to do this?
I am quite lucky to be supported by Gaelic linguists and other programmers. I found a way to integrate Am Faclair Beag, an online Gaelic dictionary developed by our resident Gaelic domain expert, Michael Bauer. Alongside the dictionary we translated complicated linguistic rules into something a computer could understand. We have managed to develop a program that takes a text and, line by line, attempts to identify spellings that don’t belong to the modern orthography and searches for the right word from our dictionary. If it has no luck, it then attempts to resolve the issue algorithmically. From the start I knew it was important that I was able to compare the program’s output to work done by Gaelic experts so that I could see whether I was improving the tool or just breaking it.
An Gocair
Since you were born, you’ve seen language technology change and permeate how we work and live. What’s been your own experience of the changes that it has brought?
It has been very interesting witnessing the exponential growth of language technology in the mainstream. It wasn’t until I studied it that I realised how much it was already embedded in websites and services that I’ve been using for years. The more visible applications such as smart assistants are becoming much more normalised in our society. Even my grandma uses her smart assistant to turn on Classic FM and put on timers, which I think is really cool. My grandma is pretty tech savvy to be fair!
With the dominance of world languages in mass media and on the internet, some would say that technology is an existential threat to minority languages like Gaelic and Welsh. What do you think about this? Are there ways for minority languages to survive or even thrive today?
I think one of the issues in language technology is that most of the work is dedicated to languages that already have huge amounts of resources, for example English. Most of the breakthroughs are being made by large companies that ultimately aim to increase the value of their services. There are a lot of companies that sell language technology as a service (e.g. machine translation) rather than serving communities per se. The latter may not have direct monetary value, but it’s essential to keep that focus in order to allow minority languages to gain access to state-of-the-art technology.
What are your predictions for language technology in the year 2050? If you had your own way, what would you like to see by that time?
I imagine smart assistants will be present in more spaces in society, perhaps even in a more official capacity. The county council in Monmouthshire already use a smart chatbot for questions about what days your bins are being collected. Imagine if they were given greater powers such as being able to make important decisions (scary thought). The more time goes on, the more I think we are going to end up with malevolent AIs like HAL from 2001: A Space Odyssey, rather than ones like C-3PO from Star Wars.
I’m not sure what I would like to see. It would be nice if there were more community-developed and open-source alternatives to what the main large tech companies provide, so a consumer would be able to be sure their data was being used in a safe and respectable way.
Decoding Hidden Heritages in Gaelic Traditional Narrative with Text-Mining and Phylogenetics
This exciting new three-year study is funded by the AHRC and IRC jointly under the UK–Ireland collaboration in digital humanities programme. It brings together five international universities, two folklore archives and two online folklore portals.
October 2021–September 2024
‘Morraha’ by John Batten. From Celtic Fairy Tales (Jacobs 1895)
Summary
This project will fuse deep, qualitative analysis with cutting-edge computational methodologies to decode, interpret and curate the hidden heritages of Gaelic traditional narrative. In doing so, it will provide the most detailed account to date of convergence and divergence in the narrative traditions of Scotland and Ireland and, by extension, a novel understanding of their joint cultural history. Leveraging recent advances in Natural Language Processing, the consortium will digitise, convert and help to disseminate a vast corpus of folklore manuscripts in Irish and Scottish Gaelic.
The project team will create, analyse and disseminate a large text corpus of folktales from the Tale Archive of the School of Scottish Studies Archives and from the Main Manuscript Collection of the Irish National Folklore Collection. The creation of this corpus will involve the scanning of c.80k manuscript pages (and will also include pages scanned by the Dúchas digitisation project), the recognition of handwritten text on these pages (as well as some audio material in Scotland), the normalisation of non-standard text, and the machine translation of Scottish Gaelic into Irish. The corpus will then be annotated with document-level and motif-level metadata.
Analysis of the corpus will be carried out using data mining and phylogenetic techniques. Both the data mining and phylogenetic workstreams will encompass the entire corpus; however, the phylogenetic workstream will also focus on three folktale types as case studies, namely Aarne–Thompson–Uther (ATU) 400 ‘The Search for the Lost Wife’, ATU 425 ‘The Search for the Lost Husband’, and ATU 503 ‘The Gifts of the Little People’. The results of these analyses will be published in a series of articles and in a book entitled Digital Folkloristics. The corpus will be disseminated via Dúchas and Tobar an Dualchais, and via a new aggregator website (under construction) that will include map and graph visualisations of corpus data and of the results of our analysis.
Project team
UK
Principal Investigator Dr William Lamb, The University of Edinburgh (School of Literatures, Languages and Cultures)
Co-Investigator Prof. Jamshid Tehrani, Durham University (Department of Anthropology)
Co-Investigator Dr Beatrice Alex, The University of Edinburgh (School of Literatures, Languages and Cultures)
University of Edinburgh
Language Technician, Michael Bauer
Louise Scollay, Copyright Administrator
Ireland
Co-Principal Investigator Dr Brian Ó Raghallaigh, Dublin City University (Fiontar & Scoil na Gaeilge)
Co-Investigator Dr Críostóir Mac Cárthaigh, University College Dublin (National Folklore Collection)
Co-Investigator Dr Barbara Hillers, Indiana University (Folklore and Ethnomusicology)
While some of our research group have been busy creating the world’s first Scottish Gaelic Speech Recognition system, others have been creating the world’s first Scottish Gaelic Text Normaliser. Although it might not turn the heads of AI enthusiasts and smart device lovers in the same way, the normaliser is an invaluable tool for unlocking historical Gaelic, enhancing its use for machine learning and giving people a way to correct Gaelic spelling with no hassle.
Rob Thomas
Why do we need a Gaelic text normaliser? Well, this program takes pre-standardised texts, which can vary in their orthography, and rewrites them in the modern Gaelic Orthographic Conventions (GOC). GOC is a document published by the SQA which details the modern standards for writing in Gaelic. Text normalisation is an important step in text pre-processing for machine learning applications. It’s also useful when reprinting older texts for modern readers, or if you just want to quickly spellcheck something in Gaelic.
I joined the project towards the end and have been hard at work trying to understand Gaelic orthography, how it has developed over the centuries, and what is possible with regard to automated normalisation. I have been working alongside Michael ‘Akerbeltz’ Bauer, a Gaelic linguist with extensive credentials. He has literally written the dictionary on Gaelic as well as a book on Gaelic phonology: it is safe to say I am in good hands. We have been working together to find a way of teaching a program exactly how to normalise Gaelic text. Whereas a human can explain why a word should be spelt a specific way, programming this takes quite a bit of figuring out.
An early ancestor to Scottish Gaelic (Archaic Irish) was written in Ogham, and interestingly enough was carved vertically into stone.
Luckily, historical text normalisation is a well-trodden path, and there are plenty of papers and theses online to help. In her thesis, Eva Pettersson describes four main methods for normalising text and, inspired by these, we got started. The first method relies on possessing an extensive lexicon of the target language, which we happen to have, thanks to Michael.
Lexicon Based Normalisation
This method relies upon having a large lexicon stored that can cover the majority of words in the target language. Using this, you can check to see if a word is spelt correctly, whether it is in a traditional spelling, or if the writer has made a mistake.
The advantage of this method is that you do not have to be an expert in the language yourself (lucky for me!). Our first step was finding a way to integrate the world’s most comprehensive digital Scottish Gaelic dictionary, Am Faclair Beag. The dictionary contains traditional and misspelt words mapped to their correct spellings. This means that we can have the program go through a text and swap words if it identifies one that needs correcting.
The table above shows some modern words with pre-GOC variants or misspellings. Michael has been collecting Gaelic words and their spelling variants for decades. If our program finds a word that is ‘out of dictionary’, we pass it on to the next stage of normalisation, which involves the hand crafting of linguistic rules.
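The lexicon stage described above can be sketched roughly as follows. This is only a minimal illustration, not the actual An Gocair code: the function name, the tiny `VARIANTS` and `MODERN` tables and the whitespace tokenisation are all assumptions made for the example, and the handful of entries stands in for the full Am Faclair Beag data.

```python
# Hypothetical mini-lexicon: pre-GOC or misspelt form -> modern spelling.
# Illustrative entries only; the real tool draws on Am Faclair Beag.
VARIANTS = {
    "so": "seo",
    "sud": "siud",
    "tigh": "taigh",
}

# Hypothetical sample of words that are already in a modern spelling.
MODERN = {"seo", "siud", "taigh", "tha", "an", "mòr"}


def normalise_with_lexicon(text):
    """Swap known variants for modern spellings.

    Returns the rewritten text plus the list of 'out of dictionary'
    tokens that a later, rule-based stage would have to deal with.
    """
    normalised = []
    out_of_dictionary = []
    for token in text.split():
        key = token.lower()
        if key in MODERN:
            normalised.append(token)          # already fine
        elif key in VARIANTS:
            normalised.append(VARIANTS[key])  # known variant: swap it
        else:
            normalised.append(token)          # unknown: leave untouched
            out_of_dictionary.append(token)   # and pass to the next stage
    return " ".join(normalised), out_of_dictionary


print(normalise_with_lexicon("tha an tigh so mòr"))
# ('tha an taigh seo mòr', [])
```

Keeping the unknown tokens in a separate list, rather than guessing at them, mirrors the idea that the lexicon only handles what it knows and leaves everything else to the hand-crafted rules.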
‘An Gocair’
Rule-based Text Normalisation
Once we have filtered out all of the words that can be handled by our lexicon alone, we try to make use of linguistic rules. It’s not always easy to program a rule so that a computer can understand it. For example, we all know the English rule ‘i before e except after c’ (which of course is an inconsistent rule in English). We can program this by getting the computer to catch all the i’s before e’s and make sure they don’t come after a c.
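To make the English example concrete, here is a small sketch of that check. The function name is hypothetical, and it covers only the half of the rule described above (an ‘ie’ that comes straight after a ‘c’).

```python
import re


def breaks_i_before_e(word):
    """True if the word contains 'ie' directly after a 'c'."""
    return re.search(r"cie", word.lower()) is not None
```

Running it on ‘science’ returns True, which shows how inconsistent the rule is in English.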
With guidance from Michael, we went about identifying rules in Gaelic that can be intuitively programmed. One common feature of traditional Gaelic is the replacement of vowels with apostrophes at the end of words if the following word begins with a vowel. This is called ellipsis and is due to the fact that, if one were to speak the phrase, one wouldn’t pronounce both vowels: the writer is simply writing how they would speak. For example, native Gaelic speakers wouldn’t say is e an cù a tha ann ‘it is the dog’: they would say ’s e ’n cù a th’ ann, dropping three vowels. But in writing, we want these vowels to appear – at least for most machine learning situations.
It is not always straightforward working out which vowel an apostrophe replaces, but we can use a rule to help us. Gaelic vowels come in two categories, broad (a, o, u) and slender (e, i). In writing, vowels conform to the ‘broad to broad and slender to slender rule’, so when reinstating a vowel at the end of a word we need to check the form of the first vowel to the left of our apostrophe and ensure that, if it is a broad vowel, we add in a matching vowel.
Pattern Matching with Regular Expression
For this method of normalisation we make use of regular expressions for catching common examples that require normalisation, but are not covered by the lexicon or our previous rules. For example, consider the following example, which is a case of hyper-phonetic spelling, when a person writes like they speak:
Tha sgian ann a sheo tha mis’ a’ toir dhu’-sa.
Here, the word mis’ is given an apostrophe as a final character, because the following word begins with a vowel. GOC suggests that we restore the final vowel. To restore this vowel, we’re helped by the regularity of the Gaelic orthography, a form of vowel harmony, whereby each consonant has to be surrounded either by slender letters (e, i) or broad letters (a, o, u). So in the example above we need to make sure the final vowel of mis’ is a slender vowel (mise), because the first vowel to the left is also slender. We have managed to program this and, using a nifty algorithm, we can then decipher what the correct word should be. When the word is resolved we check to see if the resolved form is in the lexicon and if it is, we save it and move on to the next word.
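The vowel-restoration step might look something like the sketch below. It is not the actual An Gocair code: the function name, the choice of candidate vowels, and the toy lexicon used in the usage note are all assumptions made for illustration.

```python
BROAD = set("aouàòùáóú")
SLENDER = set("eièìéí")
APOSTROPHES = ("'", "’")


def restore_final_vowel(word, lexicon):
    """Return the lexicon form of an apostrophe-final word, or None."""
    if not word.endswith(APOSTROPHES):
        return word  # nothing to restore
    stem = word[:-1].lower()
    # Find the first vowel to the left of the apostrophe.
    nearest = next((ch for ch in reversed(stem) if ch in BROAD | SLENDER), None)
    if nearest is None:
        return None  # no vowel to match against; leave for another rule
    # Broad to broad, slender to slender (assumed candidate vowels).
    candidates = "ei" if nearest in SLENDER else "aou"
    for vowel in candidates:
        if stem + vowel in lexicon:
            return stem + vowel  # resolved form confirmed by the lexicon
    return None
```

With a lexicon containing mise, `restore_final_vowel("mis’", lexicon)` resolves to mise. A form such as th’ has no vowel to match, so the sketch returns None and leaves it to another rule.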
Evaluation
Now you might be wondering how I managed to learn Scottish Gaelic so comprehensively in five months that I was able to write a program that corrects spelling and also confirm that it is working properly. Well, I didn’t. From the start of the task, I knew there was no way I would be able to gain enough knowledge about the language that I could confidently assess how well the tool was performing. Luckily I did have a large amount of text that was corrected by hand, thanks to Michael’s hard work.
To be able to verify that the tool is working, I had to write some code that automatically compares the output of the tool to the gold standard that Michael created, and then provides me with useful metrics. Eva Pettersson describes in her thesis on Historical Text Normalisation two such metrics: error reduction and accuracy. Error reduction provides you with the percentage of errors in a text that are successfully corrected using the following formula:
Accuracy simply evaluates the number of words in the gold standard text that have an identical spelling in the normalised version. Below you can see the results of normalisation on a test set of sentences. The green line shows the percentage of errors that are corrected, whilst the red and blue lines show the accuracy before and after normalisation, respectively. As you can see, the normaliser manages to successfully improve the accuracy, sometimes even to 100%.
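A minimal sketch of the two metrics follows. It assumes the original, normalised and gold texts are token lists of equal length, aligned word for word. The error-reduction calculation is one plausible reading of ‘the percentage of errors that are successfully corrected’ (errors before minus errors after, divided by errors before), and may differ from the formula in the thesis.

```python
def accuracy(tokens, gold):
    """Share of gold tokens that are spelt identically in `tokens`."""
    matches = sum(t == g for t, g in zip(tokens, gold))
    return matches / len(gold)


def error_reduction(original, normalised, gold):
    """Assumed definition: (errors before - errors after) / errors before."""
    errors_before = sum(o != g for o, g in zip(original, gold))
    errors_after = sum(n != g for n, g in zip(normalised, gold))
    if errors_before == 0:
        return 0.0  # nothing to correct
    return (errors_before - errors_after) / errors_before
```

Under this reading, any new errors introduced by the normaliser count against the score.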
From GOC to ‘An Gocair’
With a play of words on GOC, we have named the program An Gocair ‘The Un-hooker’. We have tried to make it as easy as possible to update it with new rules. We hope to have the opportunity to create more rules in the future ourselves. The program will also improve with the next iteration of Michael’s fabulous dictionary. We hope to release the first version of An Gocair to the world by the end of October 2021. Keep posted!
Acknowledgement
This program was funded by the Data-Driven Innovation initiative (DDI), delivered by the University of Edinburgh and Heriot-Watt University for the Edinburgh and South East Scotland City Region Deal. DDI is an innovation network helping organisations tackle challenges for industry and society by doing data right to support Edinburgh in its ambition to become the data capital of Europe. The project was delivered by the Edinburgh Futures Institute (EFI), one of five DDI innovation hubs which collaborates with industry, government and communities to build a challenge-led and data-rich portfolio of activity that has an enduring impact.
In our last blog post, we outlined some of the data preparation that is necessary to train the acoustic model for our Scottish Gaelic speech recognition system. This includes normalization and alignment. Normalization is where speech transcriptions are stripped of punctuation, casing, and any unspoken text. Alignment is where each word in a transcription is stamped with a start and end time to show where it occurs in an audio recording.
After these steps, speech data can be used to train an acoustic model. Once combined with our lexicon and language model (as described in our last blog post), this forms the full speech recognition system. In this blog post, we explain the function of the acoustic model and outline two common forms. We also report on our most recent Gaelic speech recognition results.
The Acoustic Model
The acoustic model is the component of a speech recogniser that recognises short speech sounds. Given an audio input where a speaker says, “She said hello”, for example, the acoustic model will try to predict which phonemes make up that utterance:
| Audio Input | Acoustic Model Output |
| --- | --- |
| Speaker says “She said hello” | sh iy s eh d hh ah l ow |
The acoustic model is able to recognise speech sounds by relying on its component phoneme models. Each phoneme model provides information about the expected range of acoustic features for one particular phoneme in the target language. For example, the ‘sh’ model will capture the typical pitch, energy, or formant structure of the ‘sh’ phoneme. The acoustic model uses the knowledge from these models to recognise the phonemes in an input stream of speech, based on its acoustic features. Combining this prediction with the lexicon, as well as the prediction of the language model, the system can transcribe the input sentence:
| ASR System Component(s) | Output, given a speaker saying: “She said hello” |
| --- | --- |
| Acoustic Model Prediction | sh iy s eh d hh ah l ow |
| + Lexicon | sh iy = she; s eh d = said; hh ah l ow = hello |
| + Language Model Prediction | She said hello |
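The lexicon step in the example above can be sketched as a small segmentation routine. This is only an illustration of the lookup: the function name is hypothetical, and a real recogniser would weigh many competing hypotheses using the acoustic and language model scores rather than take the first segmentation that fits.

```python
# Lexicon entries taken from the example: phoneme sequence -> word.
LEXICON = {
    ("sh", "iy"): "she",
    ("s", "eh", "d"): "said",
    ("hh", "ah", "l", "ow"): "hello",
}


def phonemes_to_words(phonemes):
    """Split a phoneme list into lexicon words, or return None if impossible."""
    n = len(phonemes)
    best = [None] * (n + 1)  # best[i] = words covering the first i phonemes
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            word = LEXICON.get(tuple(phonemes[start:end]))
            if best[start] is not None and word is not None:
                best[end] = best[start] + [word]
                break
    return best[n]
```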
Training the Acoustic Model
In order to train our acoustic model, we feed it a large quantity of recorded speech in the target language. The recordings are split up into sequences of 10ms ‘chunks’, or frames. Alongside the recordings, we also feed in their corresponding time-aligned transcriptions:
Aligned Gaelic speech
Using the lexicon, the system maps each word in the transcript to its component phonemes. Then, according to the start and end times of that word, it can estimate which phoneme is being pronounced during each 10ms frame where the word is being spoken. By gathering acoustic information from every frame in which each particular phoneme is pronounced, the set of phoneme models can be generated.
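The frame-labelling idea can be sketched as follows. The sketch assumes each word's frames are shared out equally among its phonemes, which is a simplification for illustration (training systems typically refine such boundaries as they go). The function name and input layout are hypothetical; the pronunciation of uisge follows the aligner transcription given elsewhere on this blog.

```python
FRAME_MS = 10  # frame length in milliseconds


def label_frames(aligned_words, lexicon):
    """aligned_words: list of (word, start_ms, end_ms) tuples.

    Returns a list of (frame_start_ms, phoneme) pairs.
    """
    labels = []
    for word, start_ms, end_ms in aligned_words:
        phonemes = lexicon[word]
        n_frames = (end_ms - start_ms) // FRAME_MS
        for i in range(n_frames):
            # Equal share of the word's frames for each phoneme (assumption).
            index = i * len(phonemes) // n_frames
            labels.append((start_ms + i * FRAME_MS, phonemes[index]))
    return labels
```

For a word spoken between 0 and 80ms, this yields eight frames, two per phoneme.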
Training procedure for the Acoustic Model
Types of Acoustic Model: Gaussian Mixture Models vs Deep Neural Networks
Early acoustic modelling approaches incorporated the Gaussian Mixture Model (GMM) for building phoneme models. This is a generative type of model, meaning that it recognises the phonemes in a spoken utterance by estimating, for every 10ms frame, how likely each phoneme model is to generate that frame. For each frame, the phoneme label of the model with the highest likelihood is output.
More recent, state-of-the-art approaches use the Deep Neural Network (DNN) model. This is a discriminative model. The model directly classifies each input frame of speech with a predicted phoneme label, based on the discriminatory properties of that frame (such as its pitch or formant structure). The outputs of the two models are therefore the same – a sequence of phoneme labels – but generated in different ways.
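To illustrate the generative approach, the sketch below scores a frame against each phoneme model and outputs the label with the highest likelihood. It is heavily simplified: each phoneme gets a single Gaussian over one made-up feature rather than a full mixture over many features, and the means and standard deviations are invented for the example.

```python
import math

# Hypothetical one-feature models: phoneme -> (mean, standard deviation).
PHONEME_MODELS = {"sh": (4.0, 1.0), "iy": (1.0, 0.5)}


def log_likelihood(x, mean, std):
    """Log of the Gaussian density of x under the given mean and std."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def classify_frame(x):
    """Return the phoneme whose model is most likely to have generated x."""
    return max(PHONEME_MODELS, key=lambda p: log_likelihood(x, *PHONEME_MODELS[p]))
```

A DNN would instead map the frame's features directly to a phoneme label, without modelling how each phoneme generates frames.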
The reason that the DNN has overtaken the GMM in speech recognition applications is largely due to its modelling power. DNNs are models with a number of different ‘layers’, and consequently a larger number of parameters. Parameters are variables contained within the model, whose values are estimated from the training data. Put simply, having more parameters enables DNNs to retain much more information about each phoneme than GMMs, and as such, they perform better on speech recognition tasks.
Another key difference between the two types of acoustic model is the training data they require. For GMMs, we can simply input recordings with their time-aligned transcriptions, as we already prepared using Quorate’s English aligner. On the other hand, training the DNN requires that every frame of each recording is classified with its corresponding Gaelic phoneme label. We obtain these labels by training a GMM acoustic model, which, once trained on the Gaelic recordings and time-aligned transcriptions, can be used for forced alignment. During forced alignment, each frame of the speech data is aligned to a ‘gold standard’ phoneme label. This output can then be used to train the DNN model directly.
Speech Recognition Results
Having carried out the training of our GMM and DNN acoustic models, we are now in a position to report our first speech recognition results. We initially trained our models using only the Clilstore data, which amounted to 21 hours of speech training data. Next, we added the Tobar an Dualchais data to our training set, which increased the size of the dataset to 39.9 hours of speech (NB: the texts in this data are transcriptions of traditional narrative from the School of Scottish Studies Archives, made by Tobar an Dualchais staff). Finally, we added data from the School of Scottish Studies Archives via the Automatic Handwriting Recognition Project to train our third, most recent model, on 63.5 hours of speech.
We evaluated our models on a subset of the Clilstore data, which was excluded from the training data. This evaluation set comprises 54 minutes of speech, from 21 different speakers. Each recording was passed through the speech recogniser to produce a predicted transcription. We then measured the system’s performance using Word Error Rate (WER). The WER value is the proportion of words that the speech recogniser transcribes incorrectly for each input recording. The measure can also be inverted to reflect accuracy.
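For readers who want to see the calculation, here is a minimal sketch of WER as it is usually computed: the word-level edit distance (substitutions, deletions and insertions) between the reference and predicted transcriptions, divided by the number of reference words. The function name is our own, and this is not necessarily the exact scoring tool used in the project.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcription is empty")
    # prev[j] = edit distance between the words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

Accuracy, as reported in the table below, is then simply 1 minus the WER.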
As can be seen from the table below, our results have been encouraging, especially considering that DNN models perform best when trained on much larger quantities (100s of hours) of data. We are particularly pleased to report that our latest model passed below 30% WER (i.e. > 70% accuracy), an initial goal of our Gaelic speech recognition project.
| Model | Training Corpus (hours of speech) | Word Error Rate (WER) | Accuracy | WER Reduction (from previous model) |
| --- | --- | --- | --- | --- |
| A | Clilstore (21) | 35.8% | 64.2% | – |
| B | Clilstore + Tobar an Dualchais (39.9) | 31.0% | 69.0% | 4.8% |
| C | Clilstore + Tobar an Dualchais + Handwriting (63.5) | 28.2% | 71.8% | 2.8% |
To showcase our speech recogniser’s current performance, we have put together some demo videos. These are subtitled with the speech recogniser’s predicted transcription for each video. Please note that the subtitles will have imperfections, given that we are using our speech recogniser (with 71.8% accuracy) to generate them. Take a look by clicking this link!
Demo video screenshot
Next Steps…
With just 2 months left of the project, the countdown is on! We plan to spend this time adding a final dataset to the model’s training data, with the hopes of further reducing the WER of our system. After this, we plan to experiment with speech recognition techniques, such as data augmentation, to maximise the performance of the system on the data we have collected thus far. Make sure to look out for further updates coming soon!
Acknowledgements
With thanks to the Data-Driven Innovation Initiative for funding this part of the project within their ‘Building Back Better’ open funding call.
The Celtic Linguistics Group at the University of Arizona invited Dr Will Lamb to speak to them about ‘Emerging NLP for Scottish Gaelic’ on 26 March 2021. This was part of their Formal Approaches to Celtic Linguistics lecture series. The talk went out on Zoom and was recorded and uploaded on YouTube (provided below). About 43 min into the video, there is a short demonstration of the prototype ASR system, as it stood at the time. Since then, we have improved the system further, incorporating enhanced acoustic and language models, and a post-processing stage that re-inserts much of the punctuation into the output.
Since September 2020, a collaborative team from the University of Edinburgh (UoE), the University of the Highlands and Islands (UHI), and Quorate Technology has been working towards building an Automatic Speech Recognition (ASR) system for Scottish Gaelic. This is a system that is able to automatically transcribe Gaelic speech into writing.
The applications for a Gaelic ASR system are vast, as demonstrated by those already in use for other languages, such as English. Examples of applications include voice assistants (Alexa, Siri), video subtitling, automatic transcription, and so on. Our goal for this project is to build a full working system for Gaelic in order to facilitate these types of use-cases. In the long term, for example, we hope to enable the automatic generation of transcripts and/or subtitles for pre-existing Gaelic recordings and videos. This would add value to these resources by rendering them searchable by word or topic. In this blog post, we describe our progress so far.
Data and Resources
There are 3 main components needed to construct a full ASR system. These comprise the lexicon, which maps words to their component phonemes (e.g. hello = hh ah l ow), the language model, which identifies likely sequences of words in the target language, and the acoustic model, which learns to recognise the component phonemes making up a segment of speech. The combination of these three components enables the ASR system to pick up on a sequence of phonemes in the input speech, map these phonemes to written words, and output a full predicted transcription of the recording.
| Component | Input | Output Prediction |
| --- | --- | --- |
| Language Model | The United States of <?> | America |
| Acoustic Model | Audio (Speaker says “Good Morning”) | g uh d m ao r n ih ng |
Of course, building these components requires resources. In terms of the lexicon, we are fortunate enough to have this resource already available to us. Am Faclair Beag is a digital Gaelic dictionary, developed by Michael Bauer, which includes phonetic transcriptions for over 30,000 Gaelic words. We simply pulled each word and pronunciation from this dictionary and combined them into a list to serve as our initial lexicon.
For training our language model (LM), we required a large corpus of Gaelic text. An LM counts occurrences of every 4-word sequence present in this text corpus, so as to learn which phrases are common in Gaelic. The following resources were drawn upon to build this:
The gd Corpus, which is a web-scraped text corpus assembled as part of the An Crúbadán project. This project aims to build corpora and other language technology resources for minority languages
Tobar an Dualchais/Kist o Riches, a collaborative project which aims to “preserve, digitise, catalogue and make available online several thousand hours of Gaelic and Scots recordings”. They supplied several hundred transcriptions of archive material from the School of Scottish Studies Archives
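The counting step at the heart of the language model can be sketched in a few lines. This is a bare-bones illustration with a hypothetical function name; a real language-model toolkit would typically add smoothing and back-off on top of the raw counts.

```python
from collections import Counter


def count_ngrams(sentences, n=4):
    """Count every n-word sequence in a list of normalised sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts
```

Sequences that occur often in the corpus end up with high counts, which is how the model learns which phrases are common.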
Finally, for training the acoustic model, we required a large number of speech recordings along with their corresponding transcriptions. This is so that the model can learn (with help from the lexicon) how the different speech sounds map to written words. We used recordings and transcriptions from the following sources to construct this dataset:
The School of Scottish Studies Archives (UoE) – see above
Clilstore, an educational website that provides Gaelic language videos at various different CEFR levels
A note on alignment
In order to train our ASR system to map speech sounds to written words, we must time-align each transcription to its corresponding recording. In other words, the transcriptions must be given time-stamps, specifying when each transcribed word occurs in the recording.
Time-aligning the transcriptions manually is lengthy and expensive, so we generally rely on automatic methods. In fact, we use a method very similar to speech recognition to generate these alignments. The issue here is that the automatic aligner also requires time-aligned speech data for training, which we don’t have for Gaelic.
We are fortunate in that we have been able to use a pre-built English speech aligner from Quorate Technology to carry out our Gaelic alignment task. As this was trained on English speech, it may be surprising that it is still effective for aligning our Gaelic data. However, despite noticeable high-level differences between the two languages (words, grammar etc.), the aligner is able to pick up on the lower-level features of speech (pitch, tone etc.), which are global across different languages. This means it can make a good guess at when specific words occur in each recording.
The alignment process – mapping text to audio.
Adapting the Lexicon
1. Mapping from IPA to the Aligner Phoneset
Because we are using a pre-built aligner on our speech data, we must ensure that the set of phones used to phonetically transcribe the words in our lexicon is the same as the set of phones recognised by the aligner’s acoustic model. Our lexicon, from Am Faclair Beag, uses a form of Gaelic-adapted IPA, whereas the Quorate aligner recognises a special, computer-readable set of English phones. For this reason, our first task was to map each phone in the lexicon’s phoneset to its equivalent (or closest) phone used in the aligner’s phoneset.
We first standardised the lexicon phoneset, mapping each specialised Gaelic IPA phone back to its standard IPA equivalent. We next mapped this standard IPA phoneset to ARPABET, an American-English phoneset that is widely used in language technology. This is the foundation of the aligner’s phoneset. We had to draw on our phonetic knowledge of Gaelic to create the mapping from IPA to ARPABET, because the set of phones used in English speech differs from that used in Gaelic: some Gaelic phones do not exist in English. For each additional Gaelic phone, we therefore selected the ARPABET phone that was deemed its ‘closest match’. Take the following Gaelic distinction between a non-aspirated, palatalised ( kʲ ) and a non-aspirated, non-palatalised ( k ) stop consonant, for example:
| Gaelic IPA (Gaelic phoneset) | Standard IPA (global phoneset) | ARPABET (English phoneset) |
| --- | --- | --- |
| g | k | K |
| gʲ | kʲ | K |
Our final mapping was from ARPABET to the aligner’s phoneset. Considering both of these phonesets are based on English, this was a fairly easy process; each ARPABET phone had an exact equivalent in the aligner phoneset. Once we had our final phoneset mapping, we converted all the phonetic transcriptions in the lexicon to their equivalent in the aligner’s phoneset, for example:
| Word | Original (Gaelic IPA) | Standard IPA | ARPABET | Aligner |
| --- | --- | --- | --- | --- |
| uisge | ɯ ʃ gʲ ə | ɯ ʃ kʲ ə | UX SH K AX | uh sh k ax |
| gorm | g ɔ r ɔ m | k ɔ ɾ ɔ m | K AO DX AO M | k ao r ao m |
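The chain of mappings can be sketched with three small lookup tables. The entries below cover only the phones that appear in the two example words, so this is an illustration of the approach rather than the full mapping we used, and the function name is hypothetical.

```python
# Each table covers only the phones in the two example words.
GAELIC_TO_IPA = {"gʲ": "kʲ", "g": "k", "r": "ɾ"}  # other phones are unchanged
IPA_TO_ARPABET = {"ɯ": "UX", "ʃ": "SH", "kʲ": "K", "k": "K",
                  "ə": "AX", "ɔ": "AO", "ɾ": "DX", "m": "M"}
ARPABET_TO_ALIGNER = {"UX": "uh", "SH": "sh", "K": "k", "AX": "ax",
                      "AO": "ao", "DX": "r", "M": "m"}


def to_aligner_phones(pronunciation):
    """Convert a space-separated Gaelic IPA transcription to aligner phones."""
    converted = []
    for phone in pronunciation.split():
        ipa = GAELIC_TO_IPA.get(phone, phone)
        arpabet = IPA_TO_ARPABET[ipa]
        converted.append(ARPABET_TO_ALIGNER[arpabet])
    return " ".join(converted)
```

Note that both kʲ and k end up as the same aligner phone, so the Gaelic distinction is lost at this stage.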
2. Adding new pronunciations
For our ASR system to learn to recognise the component phones of spoken words, we need to ensure that every word that appears in our training corpus is included in the lexicon.
Our initial phoneticised lexicon stood at an impressive 30,000 Gaelic words; however, the number of words in our training corpus exceeds 150,000. This leaves 120,000 missing pronunciations, many of which will simply be morphological variations on the dictionary entries. If our model were to come across any of these words in training, it would be unable to map the acoustics of that word to its component phoneme labels.
The ASR system maps the phones recognised by the acoustic model to words, using the pronunciations in the lexicon.
A solution to this is to train a Grapheme-to-Phoneme (G2P) model, which, given a written word as input, can predict a phonetic transcription for that word, based solely on the letters (graphemes) it contains. For example:
| Input | Output Prediction |
| --- | --- |
| h-uisgeanan | hh uh sh k ih n aa n |
| galachan | k aa el ax k aa n |
| fuaimeannan | f uw ax iy m aa en aa n |
We trained a G2P model using all the words and pronunciations already in our lexicon. The model learns typical patterns of Gaelic grapheme to phoneme mappings using these as examples. Our model achieved a symbol error rate of 3.82%, which equates to an impressive 96.18% accuracy. We subsequently used this model to predict the pronunciation for the 120,000 missing words, and added them to our lexicon.
Text Normalisation
1. Punctuation, Capitalisation, and other Junk
Our next tasks focused on normalising our text corpus. We want to ensure that any text we input to our language model is free from punctuation and capitalisation, so that the model does not distinguish between, for example, a capitalised and lowercase word (e.g. ‘Hello’ vs. ‘hello’), where the meaning of these tokens is actually the same. A simple Python programme was written for this purpose which, along with punctuation and capitalisation, also stripped out any junk, such as turn-taking indicators. Here is an example of the programme at work:
| Input | Output |
| --- | --- |
| A’ cur uèirichean ri pluga. | a cur uèirichean ri pluga |
| An ann ro theth a bha e? | an ann ro theth a bha e |
| EC―00:05: Dè bha ceàrr air, air obair a’ bhanca? | dè bha ceàrr air air obair a bhanca |
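A programme along these lines might look like the sketch below. It is not our actual script: the pattern for turn-taking indicators is inferred from the single example above, and the decision to keep hyphens inside words (as in h-uisgeanan) is an assumption.

```python
import re
import unicodedata

# Assumed shape of a turn-taking indicator: initials, a dash, mm:ss, a colon.
TURN_MARKER = re.compile(r"^\s*[A-Z]{1,4}\s*[\u2010-\u2015-]\s*\d{1,2}:\d{2}:\s*")


def normalise_line(line):
    """Strip turn markers, punctuation and capitalisation from one line."""
    text = unicodedata.normalize("NFC", line)     # keep accented letters whole
    text = TURN_MARKER.sub("", text)              # drop e.g. a speaker and time prefix
    text = re.sub(r"[^\w\s-]", "", text)          # drop punctuation and apostrophes
    text = re.sub(r"(?<!\w)-|-(?!\w)", "", text)  # drop hyphens not inside a word
    return " ".join(text.lower().split())         # lowercase, tidy the spacing
```

Applied to the three input lines above, this reproduces the outputs shown.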
2. Digit Verbalisation
Another useful type of text normalisation is the verbalisation of digits. Put simply, this involves converting any digits in our corpus into words, for example, ‘42’ -> ‘forty-two’. An easy way of doing this is by using a Python tool called num2words. The tool can verbalise digits in numerous languages, but unfortunately did not support Gaelic. For this reason, we coded our own Gaelic digit verbaliser, in order to verbalise the digits present in our text corpus. As the num2words project welcomes contributions, we also hope to be able to contribute our code, so as to make the tool accessible to others.
Our digit verbaliser is currently functional for the numbers 0-100, and for the years 1100-2099. Also, as Gaelic uses both the decimal (10s) and vigesimal (20s) numbering systems, we ensured that our tool is able to verbalise each digit using either system, as specified by the user. We hope to eventually extend this to a wider range of numbers. The following examples show our digit verbaliser at work:
a) Numbers
Original: Uill, tha, tha messages na seachdaine a chaidh agam ri phàigheadh agus bidh e timcheall air mu 80 pounds.
Vigesimal: Uill, tha, tha messages na seachdaine a chaidh agam ri phàigheadh agus bidh e timcheall air mu ceithir fichead pounds.
Decimal: Uill, tha, tha messages na seachdaine a chaidh agam ri phàigheadh agus bidh e timcheall air mu ochdad pounds.
b) Years
Original: Bha, bha e ann am Poll a’ Charra ann an 1860.
Vigesimal: Bha, bha e ann am Poll a’ Charra ann an ochd ceud deug, trì fichead.
Decimal: Bha, bha e ann am Poll a’ Charra ann an ochd ceud deug ’s a seasgad.
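The user-selectable numbering system can be sketched as below. This is a deliberately tiny illustration, not our verbaliser: the lookup table holds only the two values that appear in the examples above (60 and 80), any other number is left as digits, and years would need separate handling. The names are hypothetical.

```python
import re

# Only the values attested in the examples above; a real table is much larger.
NUMBER_WORDS = {
    "decimal": {60: "seasgad", 80: "ochdad"},
    "vigesimal": {60: "trì fichead", 80: "ceithir fichead"},
}


def verbalise_digits(text, system="vigesimal"):
    """Replace known numbers with Gaelic words in the chosen system."""
    table = NUMBER_WORDS[system]

    def replace(match):
        # Leave the digits untouched if the number is not in the table.
        return table.get(int(match.group()), match.group())

    return re.sub(r"\d+", replace, text)
```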
Current Work and Next Steps
After carrying out all the data and lexicon preparation, we were able to align our Gaelic speech data using Quorate’s English aligner. We have started using this to train our first acoustic models, and will soon be able to build our first full speech recognition system – keep an eye out for our next update!
Automatically subtitled video (using provided script)
However, aside from creating acoustic model training data, alignment can be useful for other purposes: it enables us to create video subtitles, for example. This use case lets us present our first observable results, which have been extremely encouraging. The videos in the link below exhibit our time-aligned subtitles, generated from what was originally a plain transcription separate from the video: click here to see examples of our work so far!