Gaelic Algorithmic Research Group

Digital research on Gaelic ~ Digital resources for Gaels

Decoding Hidden Heritages: Connections

One of the best things about having worked in The School of Scottish Studies Archives & Library for the past five years is seeing how people are connected with the archive recordings here.

I don’t only mean seeing how our readers are affected by connecting their own research to the myriad depths and layers of oral record testimony – though that process is rather like watching someone discovering treasure every single time. Once the Decoding Hidden Heritages project reaches its culmination, the transcript material from the Tale Archive will be another important layer to these recordings, and available for all to discover.

I also refer to the connections that extend outside the archives.

Black and white image of a man sitting outside a house. He is talking into a recorder held out to him by a fieldworker, who is cut off at the left side of the image.

Angus MacNeil of Smirasary, Glenuig, being interviewed by Calum Maclean (out of image) 1959 Image copyright: SSSA

Many of the people who were recorded in the first decades of the School of Scottish Studies are no longer alive, but their material lives on in those who have connections to the people, their native area, or work or traditions. For example, Shetland fiddle players come to the collections to learn the playing style of the isles; Gaelic singers have used the archive to learn a regional variation of a song for performance; local heritage groups use material recorded in their location for museum exhibitions; storytellers learn tales… and the Carrying Stream flows on.

For me, these connections are particularly palpable when tied to family, and we have great links with relations of some of our contributors and fieldworkers. Some can fill in details for us, such as other family members who were recorded, or give background information that adds more depth. Often they give permission for re-use of material, if the archive does not hold the rights. Sometimes people come to us looking for recordings their relatives made, and at other times we are connected with people who did not know their relation was recorded at all. In all of these instances the connections between contributor, recording, family and us, as the archive, are further strengthened and emboldened.

For those who read my previous blog on seeking the unknown person in the collections – I have an update! A former colleague (connections, again!) sent the post on to her friend from Glenuig who may have known my two unknown women of Smirasary – “You never know, he might have an idea who they are?”

Within a few hours I had a response – not only did he know who they were, but one of the women was his grandmother. The woman that was down in our records as “Anon Woman B / Mrs MacDonald?” was indeed a Mrs MacDonald. She was Johanna MacDonald (1880-1973) and there is more material in the Archives attributed to her from other fieldwork trips to Smirasary in the mid-late 1950s. You can hear some of those other recordings on Tobar an Dualchais. Another fantastic set of connections and one which will hopefully lead to these transcriptions becoming more accessible.

I never fail to be surprised at the connections people have to The School of Scottish Studies Archives, or the weight and strength of those connections!

“Magic Flight”: A Mi’kmaq Tale

There are 28 versions of Aarne–Thompson–Uther (ATU) Index tale type 313 at the School of Scottish Studies Archives, but this particular one stands out. It is a tale told by Isabel Morris Googoo from the Mi’kmaq (or Micmac) tribe in Whycocomagh, Nova Scotia, to folklorist Elsie Clews Parsons in 1923. It was originally published in the Journal of American Folklore in 1925, but the copy we have on file is from a 1986 edition of the Cape Breton Magazine, with added illustrations of Mi’kmaq petroglyphs from a publication by the Nova Scotia Museum. The article notes that “it is an example of elements of European stories and religion that have been worked into Micmac tradition.” In this part of Canada, this European influence would have come more specifically from Scottish and French settlers, although this tale type has variations that can be found across the globe. It even has ties to Ancient Greek mythology. In Scots, it is best known as Nicht, Nought, Nothing, collected by Andrew Lang from “an aged old lady in Morayshire” (in Lang’s words). Unfortunately, the lady is not named, as is too often the case with female narrators. What makes the Mi’kmaq Magic Flight story so interesting is that it was told and collected by women, and they are specifically named. In the Journal of American Folklore article, we are even given the name of the source of the story: Googoo’s grandmother, Mary Doucet Newell. The collector, Elsie Clews Parsons, was one of the earliest figures of the feminist movement and was outspoken on the negative effects of gender role expectations, publishing works on the topic in the early 20th century.

An Irish version of the Magic Flight tale, also collected by a woman, can be read on the Dúchas website here.

The Mi’kmaq Magic Flight tale from the Cape Breton Magazine is attached here in its entirety. Note the adverts, providing a wonderful glimpse into the social history of 1980s Nova Scotia!

A Micmac Tale – Magic Flight

 

Bibliography:

Parsons, Elsie Clews. “Micmac Folklore.” The Journal of American Folklore 38, no. 147 (1925): 55–133. https://doi.org/10.2307/534961. *Warning: this article contains some offensive language*

Peverill, L., Robertson, M. Rock Drawings of the Micmac Indians. Petroglyphs. N.p.: n.p., 1973.

Cape Breton’s Magazine. 1972. Wreck Cove, N.S.: R. Caplan (Edition no. 41, 1986).

Lang, Andrew. Custom and Myth. United States: Harper & Brothers, 1893.

Dè a’ Ghàidhlig air fee-fi-fo-fum? (What’s the Gaelic for fee-fi-fo-fum?)

(English Synopsis: Musings about what the words fith fath fuathagaich /fi fa fuəgɪç/, which are spoken by giants in certain tales such as Gille an Fheadain Duibh ‘The Lad of the Black Whistle’, could mean, and whether there might possibly be a link to the fee-fi-fo-fum from Jack and the Beanstalk.)

In old tales, it happens quite often that an unusual word, phrase or idiom turns up. That is no great surprise: Gaelic was the first language of those who told these stories. They pictured the world in Gaelic in the first place, and they had a ready, supple, fluent Gaelic the like of which is as scarce today as stars on a showery night. But because these stories came down to them from generation to generation, sometimes something appears in them that looks truly ancient and is not at all easy to understand.

There is a story that appears in various forms, but at its heart is a young lad who fights three giants and their mother. The lad has found work herding for an old woman in some township or other, and he goes off with the old woman’s goats. Although the old woman forbade him to go the way of the giants, that is exactly what he does. When he reaches the wall around the first giant’s house, he makes a hole in it and lets the goats in. This naughty lad (well, he has just damaged a poor giant’s wall, and the goats are now eating his crops!) then climbs up a tree and plays a whistle there. Then a giant comes along, wanting a polite word with this wretch up in the tree, and the giant has a particular rhyme as he approaches:

Air fith fath fuathagaich¹ air barraibh an albhagaich,²
’S fhada bha mo chorp air feadh ga meirgeadh ’s tolladh
a’ feitheamh air greim dhe d’ fheòil is
balgam dhe d’ fhuil, a mhic an Albannaich.³

¹ or fuagaich/fuamhaich
² or almhagaich/all(a)mharaich, and sometimes even air baile nan Albannaich
³ or rìgh

Now, the second, third and fourth lines are easy enough to understand, bloody though they are. But the first line has always puzzled me. What is an albhaga(i)ch? And what exactly does air fith fath fuathagaich mean? I must admit that I don’t know, though I have a suspicion of sorts. (If you would like to listen to it, here is one of the School of Scottish Studies recordings. The fith fath fuathagaich appears in the second recording, perhaps two minutes after the start.)

In the first place, I think baile nan Albannaich is simply a mistake, with the storyteller going slightly astray (or even the person who made the transcription; the recordings are not always that clear). Although I am not at all sure about albhagaich, every version of the story says that he climbed up a tree, so I would say that air barraibh (< bàrr + -aibh) tells us that he is sitting on top of something, whatever albhagaich may be. Albhagaich brings ailbh(eag) “rock” to my mind, but why would he be on a rock when he has just climbed a tree?

But anyway, it is the first part that leaves an itch in my poor mind. What are these words? Are they nonsense words, vocables as it were? I am no piper, but it does not look like canntaireachd, not fuathagaich at any rate. There is not much that makes sense in the dictionaries either. There is one word, fìth-fàth, which is a cloak that makes you invisible. It is quite an old word; it appears in Old Irish as fía fé (and other forms). But I don’t think a cloak like that is what we have here. Nothing in any of the stories mentions invisibility.

The one thing that struck me (and this is the half-suspicion I mentioned earlier) is the rhyme that appears in ‘Jack and the Beanstalk’:

Fee-fi-fo-fum,
I smell the blood of an Englishman,
Be he alive, or be he dead
I’ll grind his bones to make my bread

It is not just that fee-fi-fo-fum is rather like fith fath fuathagaich; the whole rhyme is quite similar in nature to the Gaelic rhyme, isn’t it?

Apparently, this first appears in writing in Shakespeare (as fie, foh, and fum). The Oxford English Dictionary (which has it as fee-faw-fum) tells us that it is doggerel, but I did not find much that explains what fee-fi-fo-fum means or where it came from. That is, are they English nonsense words, or did English perhaps steal this from another language? Although folktale scholars tell us that ‘Jack and the Beanstalk’ belongs to a group of tales called “the dragon slayer”, that magic bean is found only in Britain. It would not surprise me if Jack had a Celtic root or rootlet, rather as the numerals of the Britons left a trace in yan tan tethera.

But though I have plenty of curiosity, I have no certain knowledge. I wonder whether the Welsh have a story like this? Or am I far wrong, and is there a completely different explanation? What do you think?

Mìcheal Bauer, research assistant

Decoding Hidden Heritages: Seeking the “Unknown”

As Copyright Administrator for the Decoding Hidden Heritages project, it’s my role to investigate the copyright status of the sound material and transcriptions in the Tale Archive.

Everyone involved with a sound recording has copyright to their material. As a result, it can be a lengthy process when checking which individuals are involved with a recording, and if The School of Scottish Studies Archives (SSSA) hold records of copyright assignation. Typically, the search must go outwith SSSA and that’s when I feel like donning my deerstalker! Today I will highlight some of that process.

We come across a number of contributors who are down as Unknown or Anonymous in our collections. There can be a few different reasons why this happened; not everyone’s names were captured by the fieldworker, or it was a cataloguing error. Sometimes people just wished to remain anonymous – either they were too shy to go on record, or the material may have been deemed too sensitive. These days, we have distinct copyright and Data Protection rules to safeguard sensitive material. We also have methods to close or mute someone’s material for a set period of time rather than anonymising completely or forever. So there is some flexibility in the approaches that we can take.

If persons are not explicitly named for a recording, it doesn’t mean we can assume that copyright can be cleared on that basis alone – we still have to do our due diligence. Given that this week marks International Women’s Day 2022 (March 8), let’s look at an example of two anonymous women in the Tale Archive.

Index card featuring tales recorded by two anonymous women in 1959

This card refers to part of the recording SA1959.027 – but it doesn’t give us much information, other than that it was recorded in Smearisary, Glenuig, and that the story is Murchag is Meanchag/Murchag A’s Mionachag.

My next step was to look at what is included on the whole recording. On checking the Summary book for 1959, it shows that this was a recording made by Calum Maclean, Basil Megaw and Ian Whittaker. The other contributors on the tape were Angus MacNeill, Sandy Gillies and “Anon Woman A”, “Anon Woman B / Mrs MacDonald (?)” and “Anon Woman C”. Not terribly enlightening, in the grand scheme of things! Even if Anon Woman B might be a Mrs MacDonald, it doesn’t give us anything to go on. From here I went down to the archive store room to look at the original tape box – sometimes there was a listing completed at the time of recording and included in the box.

A beautiful listing both outside and inside the box, but – as an archive colleague from the past has noted, in pencil – “Who are the informants?”

Listening to the recording itself can be helpful in some cases, because often the name is given at some point – but my Gaelic is not yet good enough for this tape.

So, what now? I will contact our colleagues at Tobar an Dualchais because parts of this tape are available to listen to online; these recordings are by the named contributors. It is possible when their researchers were seeking copyright that they were able to find out who these anonymous female storytellers were. I will keep you updated.

It’s really important to find out who unknown people in our collections are and, if possible, put a name to the voice and acknowledge their important contribution in the archives. I include a very short extract of Anon Woman A and Anon Woman B / Mrs Macdonald (?), of Smearisary, and thank them for making my job so interesting!

This clip is shared here on a risk-balanced basis; that is another part of the process, and a topic for another blog post!

Extract from SA1959.027 from collection of School of Scottish Studies Archives.

 

Louise Scollay, Copyright Administrator

The Selkie o the River Dee

A Faroese stamp featuring the legend of Kópakonan (the Seal Woman).

Whilst working on data capture for the Decoding Hidden Heritages Project, I came across this tale of a seal-woman, or selkie (ScG: ròn ‘seal’), that struck a chord with me. Stanley Robertson from Aberdeen tells of the story he heard from his father, ‘The Selkie o the River Dee’, which Stanley was told was a true story and referred to an ancestor of his.

A man spies a seal-woman coming out of the River Dee and shedding her skin on the shore. The man takes the skin and hides it from her to force the woman to go home with him and be his wife. They have several children together and one day the children find the seal skin the man had hidden. The woman takes the skin and disappears back into the River Dee, never to return, with the man arriving just in time to see her go.

Upon reading the story, it immediately occurred to me that this seems to have been a tragic incident that was ‘spun’ into a whimsical tale, likely for the benefit of the children involved. Sure enough, Stanley goes on to say that he thinks the woman might have actually committed suicide, as the story referred to “real” people. I think this story is fascinating because it has the ability to truly resonate with the listener or reader on a very emotional level. You don’t need to be an expert on folktales to understand why or how it came to be. Similar seal-woman (or mermaid) stories are found across the North Atlantic, for example in Irish, Icelandic and Faroese folklore.

A much more cheerful selkie reference in popular culture can be seen in the beautifully-illustrated Irish feature film: Song of the Sea.

To hear a similar story in Gaelic, click here.

To hear more stories on selkies (male and female versions), visit the Tobar an Dualchais website.

This blog post was written by Cristina Horvath, the newly-appointed Digitisation & Data Entry Technician for the Decoding Hidden Heritages Project.

Decoding Hidden Heritages project update: 14.01.22

For an automatic translation into English, click here. For a version in Irish, click here.

15 January 2022

Author: Dr Andrea Palandri, Postdoctoral Researcher

Andrea Palandri

In the summer of 2021, Gaois received funding under the AHRC-IRC scheme to begin a project on the Main Manuscript Collection from the archive of the Irish Folklore Commission (Cumann Béaloideasa Éireann, University College Dublin). This project is called Decoding Hidden Heritages. The subject of this blog post is the digitisation work that is under way, as part of the project, on the manuscripts of the Main Collection.

It is estimated that there are around 700,000 manuscript pages in the Main Manuscript Collection, making it one of the largest collections of folklore material in western Europe. This would have been a major digitisation challenge had Transkribus not developed AI technology for handwriting recognition over the past few years. Decoding Hidden Heritages relies heavily on this technology, which allows the project to build its own handwriting recognition models based on particular scribes in the collection.

Since our researchers began working with the Transkribus software in early October, we have built three handwriting recognition models that work at an accuracy level above 95%: one for Seosamh Ó Dálaigh, one for Seán Ó hEochaidh and one for Liam Mac Coisdealbha, three of the most dedicated collectors who worked for the Commission.

Figure 1 (left) Seosamh Ó Dálaigh collecting folklore from Tomás Mac Gearailt (Paraiste Márthain, Corca Dhuibhne) and (right) a manuscript he wrote from a recording he made of Tadhg Ó Guithín (Baile na hAbha, Dún Chaoin, Corca Dhuibhne) being transcribed in Transkribus.

To train the engine, Transkribus needs an accurate transcription that follows the manuscript page exactly, reproducing the collector’s handwriting and dialect. After preparing around fifty pages in this way, we trained a handwriting model that performs quite well (90%+). The project’s approach from that point is to transcribe a large number of pages automatically and have research assistants (Emma McGee, Kate Ní Ghallchóir and Róisín Byrne) correct them step by step. After that, we can retrain the models on a larger dataset to obtain better language models (~95%). The interim results of this work give us hope that the project will be able to achieve a higher level of accuracy in the coming months, which will allow us to transcribe much of the Main Manuscript Collection automatically, almost overnight.
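Accuracy percentages like the ones quoted above are typically obtained by comparing a model's output with a corrected transcription, character by character. The Python sketch below shows one common way to do that, assuming accuracy is defined as 1 minus the character error rate (edit distance divided by the length of the corrected text). The post does not say exactly how Transkribus computes its figures, and the function names and sample strings are invented for illustration.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: fewest single-character edits turning one string into the other."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the hypothesis
                current[j - 1] + 1,      # extra character in the hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - CER, where CER = edit distance / length of the corrected text."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return 1.0 - edit_distance(reference, hypothesis) / len(reference)


# Invented example: the corrected line keeps the dialect form "do raibh",
# while the automatic transcription has normalised it to "go raibh".
corrected = "do raibh sé ann"
automatic = "go raibh sé ann"
print(f"{character_accuracy(corrected, automatic):.1%}")  # prints 93.3%
```

Under this definition, a single wrong letter in a fifteen-character line already costs almost seven percentage points, which is one reason the corrected pages need to follow the collector's own spelling so closely.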

Figure 2 The learning curves of the language models built with Transkribus so far: Seosamh Ó Dálaigh (left), Seán Ó hEochaidh (centre) and Liam Mac Coisdealbha (right).

The manuscripts of the Main Collection are among the most dialect-rich texts in the corpus of modern Irish literature. It is Séamus Ó Duilearga’s own method and editorial practices that appear in Leabhar Sheáin Í Chonaill. He inspired and founded the Folklore of Ireland Society in 1927, and there is no better explanation of this method than the words Séamus Ó Duilearga himself wrote in the book’s preface:

Ní raibh ionnam ach úirlis sgríte don tseanachaí: níor atharuíos siolla dá nduairt sé, ach gach aon ní a sgrí chô maith agus d’fhéadfainn é.

I was only a writing instrument for the storyteller: I did not change a syllable of what he said, but wrote everything down as well as I could.

(S. Ó Duilearga, Leabhar Sheáin Í Chonaill, xxiv)

Not many books published in Irish literature since then have stayed as faithful to the speaker’s dialect as Leabhar Sheáin Í Chonaill did: it has dialect forms such as bheadh saé instead of bheadh sé (‘he would be’), buaileav instead of buaileadh (‘was struck’), or fáilthiú instead of fáiltiú (‘welcoming’). In the same way, the language of the manuscripts in the Main Collection strongly reflects the dialect, or even idiolect, of the informants. For example, there is a dialect tendency to say do raibh instead of go raibh (‘that … was’); some people from Corca Dhuibhne in County Kerry had this feature, e.g. in the stories that Seosamh Ó Dálaigh wrote down from Tadhg Ó Guithín (Baile na hAbha, Dún Chaoin).

Figure 3 Diarmuid Ó Sé mentions this dialect feature in Gaeilge Chorca Dhuibhne (§619)

The manuscripts of this collection are rather unusual because of the small dialect forms that the collectors recorded as they transcribed. It is because of this linguistic diversity in the corpus that the project does not aim to create one large model to transcribe the whole Collection. Moreover, we are dealing not only with different dialects but also with different collectors, whose handwriting and spelling of dialect forms were not uniform. These difficulties make this Irish corpus quite heterogeneous. It must be handled with care, and with the support of works of dialectology that describe the fine linguistic points found in it.

Figure 4 The handwriting of Seosamh Ó Dálaigh

Figure 5 The handwriting of Seán Ó hEochaidh

Figure 6 The handwriting of Liam Mac Coisdealbha

Agallamh le Roibeart MacThòmais / An interview with Robert Thomas

Anns an t-sreath seo, tha sinn a’ toirt sùil air laoich a rinn adhartas cudromach ann an teicneolas nan cànanan Gàidhealach. Airson a’ cheathramh agallaimh, cluinnidh sinn bho Roibeart MacThòmais. Coltach ri Lucy Evans, tha Rob air ùr thighinn gu saoghal na Gàidhlig. Chaidh fhastadh airson còig mìosan ann an 2021 mar phàirt de phròiseact a mhaoinich Data-Driven Innovations (DDI), far a robh an sgioba a’ cruthachadh teicneolas aithneachadh labhairt airson na Gàidhlig. Dh’obraich Rob air inneal coimpiutaireachd ùr-nòsach eile, An Gocair.

Nuair a bhios tu a’ feuchainn ri teicneòlas cànain a chruthachadh airson mhion-chànain, ’s e an trioblaid as bunasaiche ach dìth dàta. Chan eil an suidheachadh a thaobh na Gàidhlig buileach cho truagh ri cuid a mhion-chànanan eile, ach tha deagh chuid dhen dàta seann-fhasanta a thaobh dhòighean-sgrìobhaidh. Tha sin a’ fàgail nach gabh e cleachdadh gus modailean Artificial Intelligence a thrèanadh gun a bhith a’ cosg airgead mòr air ath-litreachadh.

Bidh An Gocair ag ath-litreachadh theacsaichean gu fèin-obrachail – tha e glè choltach ri dearbhadair-litrichidh. Chan eil ann ach ro-shamhla (prototype) an-dràsta agus tha sinn a’ sireadh taic a bharrachd airson a leasachadh. Aon uair ‘s gum bi e deiseil, b’ urrainnear a chur gu feum ann an iomadach suidheachadh, leithid foillseachadh, foghlam aig gach ìre, prògraman coimpiutaireachd eile agus rannsachadh sgoileireil. Cuiridh e gu mòr cuideachd ri pròiseact rannsachaidh ùr a tha a’ tòiseachadh an dràsta eadar còig oilthighean ann am Breatainn, Ameireaga agus Èirinn: ‘Decoding Hidden Heritages in Gaelic Traditional Narrative with Text-mining and Phylogenetics’.

In this interview series, we are looking at individuals who have significantly advanced the field of Gaelic, Irish and Manx language technology. For the fourth interview, we hear from Mr Rob Thomas. Like Lucy Evans, whom we interviewed a few months ago, Rob has come to the world of Gaelic language technology only recently. He was chosen from a strong field to work with us on a project funded by Data-Driven Innovations (DDI), in which we were developing the world’s first automatic speech recogniser for Scottish Gaelic. Rob worked on an important strand of this project – developing a brand-new piece of software called An Gocair.

When trying to develop language technology for minority languages, the most fundamental problem is data sparsity. The situation for Gaelic is not as dire as for some other minority languages, but much of the textual data available is outdated in terms of orthography. That makes it impossible to train machine learning models – at least without spending a lot of money on editing spelling.

An Gocair re-spells texts automatically – it’s basically an unsupervised spell-checker with some extra bells and whistles. It is currently only a prototype, however, and we are seeking additional support for its development. Once completed, it will be able to be used in a wide range of contexts, including publishing, education at all levels, as part of other computer programs and within academic research. It will also make a significant contribution to a new research project currently underway between five universities in Britain, America and Ireland: ‘Decoding Hidden Heritages in Gaelic Traditional Narrative with Text-mining and Phylogenetics’.

Interview with Rob Thomas

Agallamh le Roibeart MacThòmais

Tell us a little bit about your background. For instance, where are you from, and what got you into language technology work?

Hello! I’m from a small town in South Wales called Monmouth. I grew up mostly in the countryside, quite far from civilisation. My interest in linguistics probably stems from having a fantastic English teacher in my high school. (Shout out to Mr Jones.) I don’t know if it was the content or how he taught it, but I remember at the time really enjoying the subject and his lessons.

Rob Thomas

I went on to study English Language and Linguistics at the University of Portsmouth. After graduating, I worked for a while at Marks and Spencer as I was not yet sure what kind of career I was looking for. Still kind of directionless, I spent a year and a bit traveling and on return began working in tech support. I managed to find a course in Language Technology at the University of Gothenburg. I had recently found a new interest in programming, and this was a great way to merge my new interest and my academic foundation. After a few years living, studying and working in Sweden, I returned to the UK and began the job hunt and was lucky to find the position at the University of Edinburgh.

You mention studying language technology at the University of Gothenburg. What did you find most interesting about the course? Do you have any advice for someone who is thinking about studying language technology?

The course was fascinating and it attracted students from quite a broad background. The first meeting was like The Time Machine by H. G. Wells: we were all introduced as the linguist or the mathematician, cognitive scientist, computer scientist, philosopher etc. I think what stood out is that language technology, as a field, relies on input and experience from a multitude of academic backgrounds. This is due to the complex nature of language. I think I would advise anyone who is not from a technical or STEM background to think about how important your knowledge and perspective is for the future of language-based AIs, systems and services. But if, like me, you do come from a humanities background, be prepared to dive straight back into the maths that you thought you managed to escape after you completed your GCSEs.

You are developing a tool for Scottish Gaelic that automatically corrects misspelled words and makes text conform to a Gaelic orthographical standard. That’s impressive for someone with Gaelic, and even more so for someone who doesn’t speak it. How did you manage to do this?

I am quite lucky to be supported by Gaelic linguists and other programmers. I found a way to integrate Am Faclair Beag, an online Gaelic dictionary developed by our resident Gaelic domain expert, Michael Bauer. Alongside the dictionary we translated complicated linguistic rules into something a computer could understand. We have managed to develop a program that takes a text and, line by line, attempts to identify spellings that don’t belong to the modern orthography and searches for the right word from our dictionary. If it has no luck, it then attempts to resolve the issue algorithmically. From the start I knew it was important that I was able to compare the program’s output to work done by Gaelic experts so that I could see whether I was improving the tool or just breaking it.
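The workflow Rob describes (go through the text line by line, look each word up in a dictionary, and fall back to rules when the lookup fails) can be sketched roughly as follows. This is an illustrative Python sketch only, not An Gocair's actual code: the tiny word lists stand in for Am Faclair Beag, which is not queried here, the function names are invented, and the single fallback rule (swapping the acute accents of older texts for the grave accents of the modern orthography) is just one example of the kind of rule such a tool might apply.

```python
import re

# Hypothetical stand-ins for real dictionary data (assumption: tiny hand-made lists).
MODERN_WORDS = {"bha", "an", "taigh", "mòr", "seo"}
OLD_TO_MODERN = {"tigh": "taigh", "so": "seo"}

# Illustrative rule: older texts use acute accents where modern spelling uses grave ones.
ACUTE_TO_GRAVE = str.maketrans("áéíóúÁÉÍÓÚ", "àèìòùÀÈÌÒÙ")

WORD = re.compile(r"\w+")


def respell_word(word: str) -> str:
    """Return the modern spelling of one word, or the word unchanged if unresolved."""
    lower = word.lower()
    if lower in MODERN_WORDS:
        return word  # already a modern spelling
    if lower in OLD_TO_MODERN:
        modern = OLD_TO_MODERN[lower]  # step 1: dictionary lookup
        return modern.capitalize() if word[0].isupper() else modern
    candidate = word.translate(ACUTE_TO_GRAVE)  # step 2: rule-based fallback
    if candidate.lower() in MODERN_WORDS:
        return candidate
    return word  # leave unknown words alone rather than guess


def respell_line(line: str) -> str:
    """Respell every word in a line, keeping spacing and punctuation as they are."""
    return WORD.sub(lambda match: respell_word(match.group(0)), line)


def respell_text(text: str) -> str:
    """Process a text line by line."""
    return "\n".join(respell_line(line) for line in text.splitlines())


print(respell_line("Bha an tigh mór so."))  # prints: Bha an taigh mòr seo.
```

Leaving unresolved words untouched keeps the sketch conservative, and it makes it easy to compare the output with an expert's respelling of the same text, which is the kind of check Rob mentions.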

An Gocair

Since you were born, you’ve seen language technology change and permeate how we work and live. What’s been your own experience of the changes that it has brought?

It has been very interesting witnessing the exponential growth of language technology in the mainstream. It wasn’t until I studied it that I realised how much it was already embedded in websites and services that I’ve been using for years. The more visible applications such as smart assistants are becoming much more normalised in our society. Even my grandma uses her smart assistant to turn on Classic FM and put on timers, which I think is really cool. My grandma is pretty tech savvy to be fair!

With the dominance of world languages in mass media and on the internet, some would say that technology is an existential threat to minority languages like Gaelic and Welsh. What do you think about this? Are there ways for minority languages to survive or even thrive today?

I think one of the issues in language technology is that most of the work is dedicated to languages that already have huge amounts of resources, for example English. Most of the breakthroughs are being made by large companies that ultimately aim to increase the value of their services. There are a lot of companies that sell language technology as a service (e.g. machine translation) rather than serving communities per se. The latter may not have direct monetary value, but it’s essential to keep that focus in order to allow minority languages to gain access to state-of-the-art technology.

What are your predictions for language technology in the year 2050? If you had your own way, what would you like to see by that time?

I imagine smart assistants will be present in more spaces in society, perhaps even in a more official capacity. The county council in Monmouthshire already use a smart chatbot for questions about what days your bins are being collected. Imagine if they were given greater powers such as being able to make important decisions (scary thought). The more time goes on, the more I think we are going to end up with malevolent AIs like HAL from 2001: A Space Odyssey, rather than ones like C-3PO from Star Wars.

I’m not sure what I would like to see. It would be nice if there were more community-developed and open-source alternatives to what the main large tech companies provide, so that consumers could be sure their data was being used in a safe and respectful way.

New AHRC-funded project on Gaelic & Irish folktales and the Digital Humanities

Decoding Hidden Heritages in Gaelic Traditional Narrative with Text-Mining and Phylogenetics

This exciting new three-year study is funded by the AHRC and IRC jointly under the UK–Ireland collaboration in digital humanities programme. It brings together five international universities, two folklore archives and two online folklore portals.

October 2021–Sept 2024

‘Morraha’ by John Batten. From Celtic Fairy Tales (Jacobs 1895)

Summary

This project will fuse deep, qualitative analysis with cutting-edge computational methodologies to decode, interpret and curate the hidden heritages of Gaelic traditional narrative. In doing so, it will provide the most detailed account to date of convergence and divergence in the narrative traditions of Scotland and Ireland and, by extension, a novel understanding of their joint cultural history. Leveraging recent advances in Natural Language Processing, the consortium will digitise, convert and help to disseminate a vast corpus of folklore manuscripts in Irish and Scottish Gaelic.

The project team will create, analyse and disseminate a large text corpus of folktales from the Tale Archive of the School of Scottish Studies Archives and from the Main Manuscript Collection of the Irish National Folklore Collection. The creation of this corpus will involve the scanning of c.80k manuscript pages (and will also include pages scanned by the Dúchas digitisation project), the recognition of handwritten text on these pages (as well as some audio material in Scotland), the normalisation of non-standard text, and the machine translation of Scottish Gaelic into Irish. The corpus will then be annotated with document-level and motif-level metadata.

Analysis of the corpus will be carried out using data mining and phylogenetic techniques. Both the data mining and phylogenetic workstreams will encompass the entire corpus; however, the phylogenetic workstream will also focus on three folktale types as case studies, namely Aarne–Thompson–Uther (ATU) 400 ‘The Search for the Lost Wife’, ATU 425 ‘The Search for the Lost Husband’, and ATU 503 ‘The Gifts of the Little People’. The results of these analyses will be published in a series of articles and in a book entitled Digital Folkloristics. The corpus will be disseminated via Dúchas and Tobar an Dualchais, and via a new aggregator website (under construction) that will include map and graph visualisations of corpus data and of the results of our analysis.

Project team

UK

  • Principal Investigator Dr William Lamb, The University of Edinburgh (School of Literatures, Languages and Cultures)
  • Co-Investigator Prof. Jamshid Tehrani, Durham University (Department of Anthropology)
  • Co-Investigator Dr Beatrice Alex, The University of Edinburgh (School of Literatures, Languages and Cultures)
University of Edinburgh
  • Language Technician, Michael Bauer
  • Louise Scollay, Copyright Administrator

Ireland

  • Co-Principal Investigator Dr Brian Ó Raghallaigh, Dublin City University (Fiontar & Scoil na Gaeilge)
  • Co-Investigator Dr Críostóir Mac Cárthaigh, University College Dublin (National Folklore Collection)
  • Co-Investigator Dr Barbara Hillers, Indiana University (Folklore and Ethnomusicology)
Dublin City University
  • Postdoctoral Researcher: Dr Andrea Palandri
  • Research Assistant: Kate Ní Ghallchóir


‘An Gocair’: Gaelic Normalisation at a Click

By Rob Thomas

While some of our research group have been busy creating the world’s first Scottish Gaelic Speech Recognition system, others have been creating the world’s first Scottish Gaelic Text Normaliser. Although it might not turn the heads of AI enthusiasts and smart device lovers in the same way, the normaliser is an invaluable tool for unlocking historical Gaelic, enhancing its use for machine learning and giving people a way to correct Gaelic spelling with no hassle.

Rob Thomas

Why do we need a Gaelic text normaliser? Well, this program takes pre-standardised texts, which can vary in their orthography, and rewrites them in the modern Gaelic Orthographic Conventions (GOC). GOC is a document published by the SQA which details the modern standards for writing in Gaelic. Text normalisation is an important step in text pre-processing for machine learning applications. It’s also useful when reprinting older texts for modern readers, or if you just want to quickly spellcheck something in Gaelic.

I joined the project towards the end and have been hard at work trying to understand Gaelic orthography, how it has developed over the centuries, and what is possible with regard to automated normalisation. I have been working alongside Michael ‘Akerbeltz’ Bauer, a Gaelic linguist with extensive credentials. He has literally written the dictionary on Gaelic as well as a book on Gaelic phonology: it is safe to say I am in good hands. We have been working together to find a way of teaching a program exactly how to normalise Gaelic text. Whereas a human can explain why a word should be spelt a specific way, programming this takes quite a bit of figuring out.

An early ancestor to Scottish Gaelic (Archaic Irish) was written in Ogham, and interestingly enough was carved vertically into stone.

Luckily, historical text normalisation is a well-trodden path, and there are plenty of papers and theses online to help. In her thesis, Eva Pettersson describes four main methods for normalising text and, inspired by these, we got started. The first method relies on possessing an extensive lexicon of the target language, which we happen to have, thanks to Michael.

Lexicon Based Normalisation

This method relies upon having a large lexicon stored that can cover the majority of words in the target language. Using this, you can check to see if a word is spelt correctly, whether it is in a traditional spelling, or if the writer has made a mistake.

The advantage of this method is that you do not have to be an expert in the language yourself (lucky for me!). Our first step was finding a way to integrate the world’s most comprehensive digital Scottish Gaelic dictionary, Am Faclair Beag. The dictionary contains traditional and misspelt words mapped to their correct spellings. This means that we can have the program go through a text and swap a word whenever it identifies one that needs correcting.

The table above shows some modern words with pre-GOC variants or misspellings. Michael has been collecting Gaelic words and their spelling variants for decades. If our program finds a word that is ‘out of dictionary’, we pass it on to the next stage of normalisation, which involves the hand crafting of linguistic rules.
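The lookup step can be sketched in a few lines of Python. This is only an illustration of the idea, not the code behind An Gocair: the tiny word lists below are hypothetical stand-ins for the Am Faclair Beag data, and the function name is our own invention.

```python
# Hypothetical mini-lexicon: variant spelling -> GOC spelling.
# The real mappings come from Am Faclair Beag; these entries are illustrative only.
VARIANT_TO_GOC = {
    "so": "seo",
    "mór": "mòr",
    "tigh": "taigh",
}

# Hypothetical set of words that are already in their GOC form.
GOC_WORDS = {"seo", "mòr", "taigh", "tha", "an"}


def lexicon_normalise(tokens):
    """Return (normalised tokens, out-of-dictionary tokens)."""
    normalised, unknown = [], []
    for token in tokens:
        key = token.lower()
        if key in GOC_WORDS:
            # Already a modern spelling: keep it as written.
            normalised.append(token)
        elif key in VARIANT_TO_GOC:
            # Known variant or misspelling: swap in the GOC form.
            normalised.append(VARIANT_TO_GOC[key])
        else:
            # Out of dictionary: keep it for now and flag it for the next stage.
            normalised.append(token)
            unknown.append(token)
    return normalised, unknown
```

Calling `lexicon_normalise(["Tha", "an", "tigh", "mór"])` swaps the two older spellings, and anything the lexicon has never seen comes back in the second list, ready for the rule-based stage.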

‘An Gocair’

Rule-based Text Normalisation

Once we have filtered out all of the words that can be handled by our lexicon alone, we try to make use of linguistic rules. It’s not always easy to program a rule so that a computer can understand it. For example, we all know the English rule ‘i before e except after c’ (which of course is an inconsistent rule in English). We can program this by getting the computer to catch all the i’s before e’s and make sure they don’t come after a c.
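As a rough illustration of what programming such a rule looks like, here is a small Python sketch using a regular expression. The pattern and function name are our own, and, as noted above, the rule itself is unreliable in English, so the check will also flag perfectly good words.

```python
import re

# A word breaks the rhyme if it has "ei" that is NOT preceded by "c",
# or "ie" that comes directly after "c".
RULE_BREAKERS = re.compile(r"(?<!c)ei|cie")


def breaks_rule(word):
    """Return True if the word violates 'i before e except after c'."""
    return RULE_BREAKERS.search(word.lower()) is not None
```

The check catches the misspelling `recieve` and accepts `receive` and `believe`, but it also flags `weird` and `science`, which is exactly the inconsistency that makes hand-written rules hard to get right.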

With guidance from Michael, we went about identifying rules in Gaelic that can be intuitively programmed. One common feature of traditional Gaelic is the replacement of vowels with apostrophes at the end of words if the following word begins with a vowel. This is called elision and is due to the fact that, if one were to speak the phrase, one wouldn’t pronounce both vowels: the writer is simply writing how they would speak. For example, native Gaelic speakers wouldn’t say is e an cù a tha ann ‘it is the dog’: they would say ’s e ’n cù a th’ ann, dropping three vowels. But in writing, we want these vowels to appear, at least for most machine learning situations.

It is not always straightforward working out which vowel an apostrophe replaces, but we can use a rule to help us. Gaelic vowels come in two categories, broad (a, o, u) and slender (e, i). In writing, vowels conform to the ‘broad to broad and slender to slender rule’, so when reinstating a vowel at the end of a word we need to check the form of the first vowel to the left of our apostrophe and ensure that, if it is a broad vowel, we add in a matching vowel.

Pattern Matching with Regular Expressions

For this method of normalisation we make use of regular expressions to catch common cases that require normalisation but are not covered by the lexicon or our previous rules. Consider the following example, which is a case of hyper-phonetic spelling, where a person writes as they speak:

Tha sgian ann a sheo tha mis’ a’ toir dhu’-sa.

Here, the word mis’ is given an apostrophe as a final character, because the following word begins with a vowel. GOC suggests that we restore the final vowel. To restore this vowel, we’re helped by the regularity of the Gaelic orthography, a form of vowel harmony, whereby each consonant has to be surrounded either by slender letters (e, i) or broad letters (a, o, u). So in the example above we need to make sure the final vowel of mis’ is a slender vowel (mise), because the first vowel to the left is also slender. We have managed to program this and, using a nifty algorithm, we can then decipher what the correct word should be. When the word is resolved we check to see if the resolved form is in the lexicon and if it is, we save it and move on to the next word.
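To make the idea concrete, here is a minimal Python sketch of restoring a final vowel. It is not the algorithm used in An Gocair, which is not described in detail here. It assumes that we simply try each vowel of the matching class (slender or broad) and keep the first candidate found in the lexicon; the function names and the toy lexicon are hypothetical.

```python
BROAD = "aouàòù"
SLENDER = "eièì"
APOSTROPHES = ("'", "’")


def candidates(word):
    """Candidate full forms for a word that ends in an apostrophe."""
    stem = word.rstrip("'’")
    # Look for the first vowel to the left of the apostrophe.
    for ch in reversed(stem.lower()):
        if ch in SLENDER:
            return [stem + v for v in "ei"]
        if ch in BROAD:
            return [stem + v for v in "aou"]
    # No vowel in the stem (e.g. th’): fall back to trying every vowel.
    return [stem + v for v in "aeiou"]


def restore(word, lexicon):
    """Restore an elided final vowel if a candidate is in the lexicon."""
    if not word.endswith(APOSTROPHES):
        return word
    for cand in candidates(word):
        if cand.lower() in lexicon:
            return cand
    # Nothing matched: leave the word untouched.
    return word
```

With a toy lexicon such as `{"mise", "tha"}`, `restore("mis’", lexicon)` tries the slender candidates and settles on `mise`, while a word with no matching entry is returned unchanged so that a later stage, or a human, can deal with it.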

Evaluation

Now you might be wondering how I managed to learn Scottish Gaelic so comprehensively in five months that I was able to write a program that corrects spelling and also confirm that it is working properly. Well, I didn’t. From the start of the task, I knew there was no way I would be able to gain enough knowledge about the language that I could confidently assess how well the tool was performing. Luckily I did have a large amount of text that was corrected by hand, thanks to Michael’s hard work.

To verify that the tool is working, I had to write some code that automatically compares the output of the tool to the gold standard that Michael created and then provides me with useful metrics. In her thesis on historical text normalisation, Eva Pettersson describes two such metrics: error reduction and accuracy. Error reduction gives the percentage of errors in a text that are successfully corrected, using the following formula:

\[ \text{error reduction} = \frac{\text{correct after normalisation} - \text{correct before normalisation}}{\text{incorrect before normalisation}} \]

Accuracy simply measures the proportion of words in the normalised version whose spelling is identical to that in the gold standard text. Below you can see the results of normalisation on a test set of sentences. The green line shows the percentage of errors that are corrected, whilst the red and blue lines show the accuracy before and after normalisation, respectively. As you can see, the normaliser manages to improve the accuracy, sometimes even to 100%.
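Both metrics are easy to compute once the texts are tokenised. The Python sketch below is an illustration rather than the evaluation script used for An Gocair. It assumes that the original, normalised and gold standard texts have the same number of tokens, and that error reduction is the net gain in correct words divided by the number of words that were wrong to begin with.

```python
def accuracy(candidate, gold):
    """Proportion of tokens spelt identically to the gold standard."""
    matches = sum(c == g for c, g in zip(candidate, gold))
    return matches / len(gold)


def error_reduction(original, normalised, gold):
    """Net share of the original errors that normalisation corrected."""
    correct_before = sum(o == g for o, g in zip(original, gold))
    correct_after = sum(n == g for n, g in zip(normalised, gold))
    incorrect_before = len(gold) - correct_before
    if incorrect_before == 0:
        # Nothing needed correcting, so there is nothing to reduce.
        return 0.0
    return (correct_after - correct_before) / incorrect_before


gold = ["is", "e", "an", "cù", "a", "tha", "ann"]
original = ["’s", "e", "’n", "cù", "a", "th’", "ann"]
normalised = ["is", "e", "an", "cù", "a", "th’", "ann"]

before = accuracy(original, gold)
after = accuracy(normalised, gold)
reduction = error_reduction(original, normalised, gold)
```

In this toy sentence, four of seven words are correct before normalisation and six afterwards, so two of the three original errors have been fixed.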

From GOC to ‘An Gocair’

As a play on words with GOC, we have named the program An Gocair ‘The Un-hooker’. We have tried to make it as easy as possible to update with new rules, and we hope to have the opportunity to create more rules ourselves in the future. The program will also improve with the next iteration of Michael’s fabulous dictionary. We hope to release the first version of An Gocair to the world by the end of October 2021. Stay tuned!

Acknowledgement

This program was funded by the Data-Driven Innovation initiative (DDI), delivered by the University of Edinburgh and Heriot-Watt University for the Edinburgh and South East Scotland City Region Deal. DDI is an innovation network helping organisations tackle challenges for industry and society by doing data right to support Edinburgh in its ambition to become the data capital of Europe. The project was delivered by the Edinburgh Futures Institute (EFI), one of five DDI innovation hubs which collaborates with industry, government and communities to build a challenge-led and data-rich portfolio of activity that has an enduring impact.

References

Pettersson, E. (2016). Spelling Normalisation and Linguistic Analysis of Historical Text for Information Extraction, University of Uppsala.

The Acoustic Model and Scottish Gaelic Speech Recognition Results

By Lucy Evans

In our last blog post, we outlined some of the data preparation that is necessary to train the acoustic model for our Scottish Gaelic speech recognition system. This includes normalization and alignment. Normalization is where speech transcriptions are stripped of punctuation, casing, and any unspoken text. Alignment is where each word in a transcription is stamped with a start and end time to show where it occurs in an audio recording.
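As a reminder of what the normalization step involves, here is a small Python sketch. It is an illustration only: it assumes that unspoken text is marked with square brackets, which may not match the conventions of our actual transcriptions, and the function name is hypothetical.

```python
import re
import string


def normalise_transcript(text):
    """Strip unspoken annotations, punctuation and casing from a transcript."""
    # Drop bracketed annotations such as "[laughs]" (assumed marking for unspoken text).
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Remove punctuation, but keep apostrophes and hyphens, which occur inside Gaelic words.
    drop = (string.punctuation + "“”").replace("'", "").replace("-", "")
    text = text.translate(str.maketrans("", "", drop))
    # Lower-case and split on whitespace.
    return text.lower().split()
```

For instance, `normalise_transcript("Tha mi [laughs] sgìth, a-nis!")` leaves only the four spoken words, in lower case and without punctuation.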

After these steps, speech data can be used to train an acoustic model. Once combined with our lexicon and language model (as described in our last blog post), this forms the full speech recognition system. In this blog post, we explain the function of the acoustic model and outline two common forms. We also report on our most recent Gaelic speech recognition results.

The Acoustic Model

The acoustic model is the component of a speech recogniser that recognises short speech sounds. Given an audio input where a speaker says, “She said hello”, for example, the acoustic model will try to predict which phonemes make up that utterance:

| Audio Input | Acoustic Model Output |
| --- | --- |
| Speaker says “She said hello” | sh iy s eh d hh ah l ow |

The acoustic model is able to recognise speech sounds by relying on its component phoneme models. Each phoneme model provides information about the expected range of acoustic features for one particular phoneme in the target language. For example, the ‘sh’ model will capture the typical pitch, energy, or formant structure of the ‘sh’ phoneme. The acoustic model uses the knowledge from these models to recognise the phonemes in an input stream of speech, based on its acoustic features. Combining this prediction with the lexicon, as well as the prediction of the language model, the system can transcribe the input sentence:

| ASR System Component(s) | Output, given a speaker saying: “She said hello” |
| --- | --- |
| Acoustic Model Prediction | sh iy s eh d hh ah l ow |
| + Lexicon | sh iy = she; s eh d = said; hh ah l ow = hello |
| + Language Model Prediction | She said hello |
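The lexicon step can be pictured with a short Python sketch. This is a deliberate simplification: it greedily matches the longest pronunciation it can find, whereas a real recogniser weighs many competing hypotheses with the help of the language model. The three-entry lexicon is taken from the example above, and the function name is our own.

```python
# Pronunciation lexicon from the example: phoneme sequence -> word.
LEXICON = {
    ("sh", "iy"): "she",
    ("s", "eh", "d"): "said",
    ("hh", "ah", "l", "ow"): "hello",
}


def phonemes_to_words(phonemes):
    """Greedily segment a phoneme sequence into lexicon words."""
    words, i = [], 0
    longest = max(len(key) for key in LEXICON)
    while i < len(phonemes):
        # Try the longest possible pronunciation first, then shorter ones.
        for n in range(min(longest, len(phonemes) - i), 0, -1):
            word = LEXICON.get(tuple(phonemes[i:i + n]))
            if word:
                words.append(word)
                i += n
                break
        else:
            raise ValueError(f"no lexicon entry at position {i}")
    return words
```

Feeding in `"sh iy s eh d hh ah l ow".split()` recovers the three words of the example sentence.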

Training the Acoustic Model

In order to train our acoustic model, we feed it a large quantity of recorded speech in the target language. The recordings are split up into sequences of 10ms ‘chunks’, or frames. Alongside the recordings, we also feed in their corresponding time-aligned transcriptions:

Aligned Gaelic speech

Using the lexicon, the system maps each word in the transcript to its component phonemes. Then, according to the start and end times of that word, it can estimate which phoneme is being pronounced during each 10ms frame where the word is being spoken. By gathering acoustic information from every frame in which each particular phoneme is pronounced, the set of phoneme models can be generated. 
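A simple way to picture this estimate is sketched below in Python. It assumes that the frames of a word are shared out evenly between its phonemes, which is only a rough starting point (training normally refines such a guess), and the function name and toy lexicon are hypothetical.

```python
FRAME_MS = 10  # frame length used in the description above


def label_frames(word, start_ms, end_ms, lexicon):
    """Assign a phoneme label to each 10ms frame of an aligned word."""
    phonemes = lexicon[word]
    n_frames = (end_ms - start_ms) // FRAME_MS
    labels = []
    for f in range(n_frames):
        # Share the frames out evenly across the word's phonemes.
        idx = min(f * len(phonemes) // n_frames, len(phonemes) - 1)
        labels.append(phonemes[idx])
    return labels
```

For a word aligned from 0 to 60ms with the pronunciation `["sh", "iy"]`, the first three frames are labelled `sh` and the last three `iy`.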

Training procedure for the Acoustic Model

Types of Acoustic Model: Gaussian Mixture Models vs Deep Neural Networks 

Early acoustic modelling approaches incorporated the Gaussian Mixture Model (GMM) for building phoneme models. This is a generative type of model, meaning that it recognises the phonemes in a spoken utterance by estimating, for every 10ms frame, how likely each phoneme model is to generate that frame. For each frame, the phoneme label of the model with the highest likelihood is output.
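The generative idea can be illustrated with a toy Python example. Everything here is made up for the sake of the sketch: each frame is reduced to a single number, each phoneme model is a small mixture of one-dimensional Gaussians with invented parameters, and the sequence modelling that a full recogniser adds on top is left out.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def gmm_likelihood(x, components):
    """Likelihood of x under a mixture of (weight, mean, variance) components."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in components)


# Hypothetical phoneme models over one made-up acoustic feature.
PHONEME_MODELS = {
    "sh": [(0.6, 4.0, 1.0), (0.4, 5.0, 0.5)],
    "iy": [(0.5, 1.0, 0.5), (0.5, 2.0, 1.0)],
}


def classify_frames(frames):
    """Label each frame with the phoneme whose model gives the highest likelihood."""
    return [
        max(PHONEME_MODELS, key=lambda p: gmm_likelihood(x, PHONEME_MODELS[p]))
        for x in frames
    ]
```

A frame with a feature value of 4.5 is far more likely under the toy `sh` model than under `iy`, so it is labelled `sh`; a value of 1.2 goes the other way.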

More recent, state-of-the-art approaches use the Deep Neural Network (DNN) model. This is a discriminative model. The model directly classifies each input frame of speech with a predicted phoneme label, based on the discriminatory properties of that frame (such as its pitch or formant structure). The outputs of the two models are therefore the same – a sequence of phoneme labels – but generated in different ways. 

The reason that the DNN has overtaken the GMM in speech recognition applications is largely due to its modelling power. DNNs are models with a number of different ‘layers’, and consequently a larger number of parameters. Parameters are variables contained within the model, whose values are estimated from the training data. Put simply, having more parameters enables DNNs to retain much more information about each phoneme than GMMs, and as such, they perform better on speech recognition tasks.

Another key difference between the two types of acoustic model is the training data they require. For GMMs, we can simply input recordings with their time-aligned transcriptions, as we already prepared using Quorate’s English aligner. On the other hand, training the DNN requires that every frame of each recording is classified with its corresponding Gaelic phoneme label. We obtain these labels by training a GMM acoustic model, which, once trained on the Gaelic recordings and time-aligned transcriptions, can be used for forced alignment. During forced alignment, each frame of the speech data is aligned to a ‘gold standard’ phoneme label. This output can then be used to train the DNN model directly.

Speech Recognition Results

Having carried out the training of our GMM and DNN acoustic models, we are now in a position to report our first speech recognition results. We initially trained our models using only the Clilstore data, which amounted to 21 hours of speech training data. Next, we added the Tobar an Dualchais data to our training set, which increased the size of the dataset to 39.9 hours of speech (NB: the texts in this data are transcriptions of traditional narrative from the School of Scottish Studies Archives, made by Tobar an Dualchais staff). Finally, we added data from the School of Scottish Studies Archives via the Automatic Handwriting Recognition Project to train our third, most recent model, on 63.5 hours of speech. 

We evaluated our models on a subset of the Clilstore data, which was excluded from the training data. This evaluation set comprises 54 minutes of speech, from 21 different speakers. Each recording was passed through the speech recogniser to produce a predicted transcription. We then measured the system’s performance using Word Error Rate (WER). The WER value is the proportion of words that the speech recogniser transcribes incorrectly for each input recording. The measure can also be inverted to reflect accuracy. 
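For readers who want to see how such a score is calculated, here is a minimal Python sketch. It assumes the usual definition of WER, in which substitutions, deletions and insertions are counted with a word-level edit distance and divided by the number of words in the reference transcription. It is not the scoring tool we used.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

If the reference is "she said hello" and the recogniser outputs "she sad hello there", there is one substitution and one insertion against three reference words, giving a WER of two thirds. Because insertions count as errors, WER can exceed 100% in extreme cases.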

As can be seen from the table below, our results have been encouraging, especially considering that DNN models perform best when trained on much larger quantities (100s of hours) of data. We are particularly pleased to report that our latest model passed below 30% WER (i.e. > 70% accuracy), an initial goal of our Gaelic speech recognition project. 

| Model | Training Corpus (hours of speech) | Word Error Rate (WER) | Accuracy | WER Reduction (from previous model) |
| --- | --- | --- | --- | --- |
| A | Clilstore (21) | 35.8% | 64.2% | n/a |
| B | Clilstore + Tobar an Dualchais (39.9) | 31.0% | 69.0% | 4.8% |
| C | Clilstore + Tobar an Dualchais + Handwriting (63.5) | 28.2% | 71.8% | 2.8% |

To showcase our speech recogniser’s current performance, we have put together some demo videos. These are subtitled with the speech recogniser’s predicted transcription for each video. Please note that the subtitles will have imperfections, given that we are using our speech recogniser (with 71.8% accuracy) to generate them. Take a look by clicking this link!

Demo video screenshot

Next Steps…

With just 2 months left of the project, the countdown is on! We plan to spend this time adding a final dataset to the model’s training data, with the hopes of further reducing the WER of our system. After this, we plan to experiment with speech recognition techniques, such as data augmentation, to maximise the performance of the system on the data we have collected thus far. Make sure to look out for further updates coming soon!

Acknowledgements

With thanks to the Data-Driven Innovation Initiative for funding this part of the project within their ‘Building Back Better’ open funding call.
