Rannsachadh digiteach air a' Ghàidhlig ~ Goireasan digiteach airson nan Gàidheal

Author: wlamb

Interview with Professor Kevin Scannell

In this series, we look at heroes of language technology who have made significant progress for the Gaelic languages. For the first interview, we couldn’t do better than Professor Kevin Scannell of St. Louis University (USA). Kevin has produced a vast number of resources for the three Gaelic languages (Gaelic, Irish and Manx), and has recently been awarded a Fulbright Award (2019) to develop tools for Irish Gaelic that utilise neural networks and deep learning techniques. Many thanks to Kevin for agreeing to do this interview with us. 

We’re presenting some of these interviews in a Gaelic or Irish only format. If required, they can be translated to English using Google Translate.  

Professor Kevin Scannell

Kevin Scannell is a Professor of Mathematics and Computer Science at Saint Louis University, Missouri. He works with groups around the world to develop computing resources that help them use their native language online. He has a particular interest in Irish and the other Celtic languages; he has developed an Irish grammar checker, spell checker, and thesaurus, as well as dictionaries and a Gàidhlig-Gaelg-Gaeilge (Scottish Gaelic, Manx, Irish) translation engine. He takes part in a project that provides Irish-language versions of several well-known software products: Mozilla Firefox, LibreOffice, Gmail, and Twitter, for example. In 2011, he founded the site Indigenous Tweets to promote minority and indigenous languages on social media.

“Where are you from, and where did you first learn Irish?”

I am originally from Boston in America. I started learning Irish in America in the 1990s, on my own, from books and dictionaries. I knew a good deal about Irish literature and Irish grammar, but I was not comfortable with the spoken language for many years. I started coming to Ireland around 2006, and my speaking ability gradually improved.

“What led you to work with language technology? How did you get started in this field?”

Basically, I started on this work because of my own needs as a learner. In the 1990s, I took part in the Gaelic-L and Gaeilge-A email lists and was disappointed that no spell checker was available. As it happens, I was compiling a lexical database as part of my learning process. It did not take much work to create a spell checker from that (that was GaelSpell); the first version was published 20 years ago. I had no expertise in this field at the time; I was a mathematician, but I had reasonably good computing skills. And it was clear to me at the time that it would be worth building a corpus to help me build the lexical database more quickly, and to be sure that I had the most common words. I collected perhaps a million words of Irish from the Internet in the 90s, and I continued with that work (in other languages too), and now there are more than 200 million words in the Irish corpus on my computer!

“Among the technology projects you have been involved in, which was the most important or the most enjoyable for you?”

Believe it or not, I would say that GaelSpell is the most important project (in terms of the number of people using it), even though it is not very interesting from a technological point of view. I also made a grammar checker called An Gramadóir, and many school pupils and university students use it to check essays. But the most important one, in my opinion, is “An Caighdeánaitheoir” (the Standardiser), a project that is not well known at all. It is an extremely simple thing: it standardises the spelling and grammar of Irish texts written before the Official Standard (An Caighdeán Oifigiúil). The Irish Government published many Irish-language books in the 1930s, but they use the old spelling (and the old typeface as well). As a result, it is much harder to make use of them in NLP, for example, and Irish lexicographers have trouble searching these texts.

Variants of the word “Gaeilge” in the corpus

The dictionary project focloir.ie (An Gúm) and Foclóir na Nua-Ghaeilge (Royal Irish Academy) use An Caighdeánaitheoir. It can also be used to prepare old texts for today's readers, people who are not used to the old spelling. I have done that with a number of old books.

“What are some of the problems in developing technology for a minority language like Irish?”

I would like to attract more young people to work in the field. There are research groups in various places in Ireland (DCU, Trinity, NUIG in particular) and they have master's/PhD students now and again, but that is not enough to put the work on a secure long-term footing. The Irish Government should invest heavily in those groups (and in others besides!): more students, lecturers, professors, etc. The same people are currently working on a technology plan for Irish, under the auspices of the Department of the Gaeltacht, and, God willing, more money will become available as a result of the plan.

The other thing that is needed is better cooperation with the big technology companies. We have technological expertise and data that Google, for example, does not have, and it would be very easy to greatly improve Google products such as Google Translate. And we need the platform they have! For example, I made a Gàidhlig > Gaeilge and Gaelg > Gaeilge translator a few years ago, but hardly anyone uses it; it is much easier to translate things directly in Chrome.

“On the other hand, are there any opportunities in working with a minority language? What are they?”

Yes! The Irish-language community is very enthusiastic about the language, and for the most part they are ready to fight for the language or for language rights. In the cases where we were able to cooperate with the technology companies, for example the translations we did of GMail or WhatsApp, it was voluntary work. Even so, it was very easy to recruit a large group of people to do the work; they understood immediately how important it is for these products to be available in Irish.

“In your opinion, what is the biggest challenge for Gaelic technology over the next five years?”

Not being left out of the race to build large neural models. I am working on this at the moment, and I see the power and the possibilities there. But the research going on at Google, Facebook, NVIDIA, etc., is focused one hundred per cent on English. And that means the research is “overfit” to languages without much morphology, for example, and (worse) to languages that have hundreds of billions of words available for training. We have to do our own research: what are the best techniques when you do not have much training data? How can we make use of the other resources we have: first-rate dictionaries, linguistic expertise, etc.?

“What is your vision for language technology in the year 2050? What would you like to see for Gaelic technology before then?”

Voice interfaces on every device/computer/electronic product in almost any language. That will be the new digital divide; voice technology will be available in some languages, and it will not be available in others. The danger is that people will not be willing to stick with the languages in the second group. In my opinion, Irish is right on the border at the moment. With a long-term vision, more investment from the Government, and cooperation with tech companies, we will be able to climb the mountain. But it is not hard to imagine the opposite outcome either.

Links

  • Cadhan: all of Kevin's Irish-language resources
  • Intergaelic: machine translation between Scottish Gaelic and Irish, and between Manx and Irish
  • A lecture Kevin gave on a “service-based architecture” for Irish-language technologies

New Gaelic language technology website launched

A Linguistic Toolkit for Scottish Gaelic

Dr Loïc Boizou (Vytautas Magnus University) and Dr William Lamb (University of Edinburgh) have collaborated on a new bilingual website that provides a linguistic toolkit for Scottish Gaelic. Called Mion-sgrùdaiche Cànanachais na Gàidhlig or the Gaelic Linguistic Analyser, the site provides users with tools for analysing the words and structures of Gaelic sentences. The information provided by these tools can be used for additional natural language processing (NLP) tasks, or just for exploring the language further. This new website presents the tools together for the first time and provides users with two ways of interacting with them: a graphical interface and a command line method.

‘Like black magic’

The website’s development goes back to the late 1990s, when Lamb was working on his PhD. In order to investigate grammatical variation in Gaelic, Lamb constructed the first linguistically annotated corpus of Scottish Gaelic, spending over a year annotating 80,000 words of Gaelic by hand. He says, ‘It was a slog. Typing in 100,000 tags by hand… just don’t do it. I developed a nasty case of repetitive strain injury and vowed never to do this sort of thing by hand again.’ On returning to the University of Edinburgh in 2010, following 10 years at Lews Castle College Benbecula, he revisited his corpus to develop an automatic part-of-speech tagger and make the corpus available to other researchers. Today, the corpus is known as the ‘Annotated Reference Corpus of Scottish Gaelic’ or ARCOSG and is freely available online.

The corpus forms the backbone of two of the tools on the new website: the part-of-speech tagger and the syntactic parser. They were created using machine learning techniques, modelling the kinds of patterns that you find in Gaelic speech and writing. Lamb said, ‘what you can do today even with a relatively small amount of text is tremendously exciting. When we looked at developing a POS tagger in the 90s, we would have had to program each type of pattern manually to enable the computer to recognise it properly. Now, you can just run the corpus through a set of algorithms and the computer works the patterns out itself. It’s like black magic’.

Dr Will Lamb

The lemmatiser was developed in a different way, using a form of the popular online dictionary, Am Faclair Beag. Lamb explains: ‘When we were working on the part-of-speech tagger in 2013 or 14, Sammy Danso and I got in touch with Michael Bauer and Will Robertson, who put together the fantastic Am Faclair Beag. We were going to try to leverage some of the information in the dictionary, and they generously offered their data for this purpose. While that plan didn’t materialise, I was able to create a root finder or lemmatiser with it years later, which we used to help create the first neural network for Gaelic. The lemmatiser sat in the virtual cupboard for a while, until I was contacted by Loïc in 2017. Loïc wanted to create a proper Gaelic lemmatiser, and I was onboard.’

Dr Loïc Boizou

Dr Loïc Boizou is a Swiss French NLP specialist working in Lithuania (Vytautas Magnus University) who is interested in computational tools for under-resourced languages. He received his PhD in Natural Language Processing at Inalco (Institute of Eastern Languages and Civilisations) in Paris. About the project, he said, ‘I am very supportive of cultural diversity and Gaelic is one of the few endangered languages that provides serious opportunities for distance learning, thanks to Sabhal Mòr Ostaig. I really enjoyed learning the language and I decided to use my NLP skills to give it a bit of a boost. I learned about Will’s corpus and found we could cooperate very nicely.’

Roots, Trees and Tags

The website provides different ways of exploring Gaelic text. Lemmatisation is the simplest of the tools and involves retrieving a word’s root form. If you were to input a sentence like tha na coin mhòra ann (‘the big dogs are here’), the website would return ‘bi’, ‘cù’ and ‘mòr’ as the lemmas (root forms) of tha, coin and mhòra. The website also offers part-of-speech tagging, which provides grammatical information about words in a sentence. Using the previous example, the website’s algorithms would assign ‘POS tags’ to each word, as in the third tab-separated value in each line below (glossed in inverted commas):

tha	bi	V-p       'Verb: present tense'
na	na	Tdpm      'Article: pl masc def'
coin	cù	Ncpmn     'Noun: common pl masc nom'
mhòra	mòr	Aq-pmn    'Attributive adjective: plur masc nom'
ann	e	Pr3sm     'Prep pronoun: 3rd person sing masc'

The grammatical information in this example is quite precise, but such precision comes at a cost: the default tagger is subject to error about 9% of the time. For users who want simpler POS tags and more accurate tagging, the website also offers a ‘simplified tagset’ option, which provides 95% accuracy. The same sentence, submitted with this option, would provide the following:

tha	bi	Vp    'Verb: present tense'
na	na	Td    'Article: definite'
coin	cù	Nc    'Noun: common'
mhòra	mòr	Aq    'Adjective: attributive'
ann	e	Pr    'Prepositional pronoun'

In addition to lemmatisation and POS-tagging, the site also offers syntactic parsing, using a syntactically annotated corpus developed by Dr Colin Batchelor (Royal Society of Chemistry). Again, using the same sentence, the website returns the following if parsing is selected:

1	tha	bi	V-p	0	root
2	na	na	Tdpm	3	det
3	coin	cù	Ncpmn	1	nsubj
4	mhòra	mòr	Aq-pmn	3	amod
5	ann	e	Pr3sm	1	xcomp:pred

The number in the fifth column indicates which element in the sentence the word is governed by. In the case of tha, the number is 0, because it is the syntactic root. Both na and mhòra, on the other hand, are parts of a noun phrase governed by element 3, coin. This is a numerical way of displaying the kind of information that is often conveyed in a syntactic tree, such as in the example below. The information in the sixth column indicates the function of the element in the sentence. For example, the function of coin is nsubj or ‘nominal subject’. More information on Dr Batchelor’s parser can be found here.

Syntactic tree for tha na coin mhòra ann
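
For readers who want to work with the parser output programmatically, here is a minimal sketch of how the tab-separated rows above could be read and each word linked to its governing element. The function names `read_parse` and `describe` are hypothetical, invented for this illustration, and the sketch assumes the output has exactly the six tab-separated fields shown in the example; it is not part of the website's own tooling.

```python
# Parser output for "tha na coin mhòra ann", copied from the example above.
PARSE = """\
1\ttha\tbi\tV-p\t0\troot
2\tna\tna\tTdpm\t3\tdet
3\tcoin\tcù\tNcpmn\t1\tnsubj
4\tmhòra\tmòr\tAq-pmn\t3\tamod
5\tann\te\tPr3sm\t1\txcomp:pred
"""


def read_parse(text):
    """Turn tab-separated parser rows into a list of dicts (hypothetical helper)."""
    rows = []
    for line in text.strip().splitlines():
        idx, form, lemma, tag, head, function = line.split("\t")
        rows.append({
            "id": int(idx),
            "form": form,
            "lemma": lemma,
            "tag": tag,
            "head": int(head),      # index of the governing element, 0 for the root
            "function": function,
        })
    return rows


def describe(rows):
    """Return (word, function, governing word) triples (hypothetical helper)."""
    forms = {row["id"]: row["form"] for row in rows}
    forms[0] = "ROOT"  # element 0 is the syntactic root, not a real word
    return [(row["form"], row["function"], forms[row["head"]]) for row in rows]


rows = read_parse(PARSE)
for form, function, head in describe(rows):
    print(f"{form} --{function}--> {head}")
```

Each printed line pairs a word with its function and the word that governs it, which is the same information a syntactic tree conveys.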

Next Steps

When asked what the next steps are for the language, Lamb explains that it’s an exciting time: ‘Well, this is really just an interim step and there is a lot to do. For a start, we hope to improve the accuracy of the tools gradually and perhaps augment them. Gaelic is, in some ways, in a very fortunate position when it comes to language technology. Advanced tools are starting to come online — like Google Translate, a handwriting recogniser and speech synthesiser — and we can exploit great resources like DASG, ARCOSG and recordings from the School of Scottish Studies Archives to push into territory that would have seemed like science fiction a few years ago.’

The dream is artificial general intelligence. ‘Elon Musk is famous for saying that one day, he’d like to die on Mars – just not on impact. Before I kick the proverbial bucket, I’d like to chat with a computer that has better Gaelic than I do’.

Predicting Grammatical Gender in Scottish Gaelic with Machine Learning

English speakers never have to worry about grammatical gender – nouns are just nouns. When I began learning Gaelic in my early twenties, getting to grips with grammatical gender was a challenge. Until I learnt some of the patterns intuitively, I had to look up every new noun in the dictionary to determine its gender, and add this information to my stack of flash cards. As it happens, computers also struggle with identifying gender. When we built the first part-of-speech tagger for Gaelic a few years ago, gender was one of the things that our statistically-based model often got wrong.

Some grammars supply a list of suffixes that are typically feminine or masculine, and these can be helpful to new students of Gaelic. For instance, once you know that just about all nouns ending in -chd are feminine, you can take a new noun with that suffix and be relatively confident about how to use it. In the grammar at the end of my 2008 book, Scottish Gaelic Speech and Writing, I list (pp 206-207)  the suffixes provided by Calder (1923: 76-77):

  • Masc: -adh, -an/-ean, -as, -ach, -aiche, and -air.
  • Fem: -ag, -achd/-eachd, -ad/-ead, -e, and -ir (for polysyllables only)

But how reliable are these endings for predicting gender? And what proportion of nouns ending in a particular suffix takes the expected gender? Furthermore, are there any Gaelic suffixes that are useful for predicting gender that Gaelic grammarians haven’t noticed already?

Over the last few months, I’ve been assembling some code and resources that will allow me to do some new research on Gaelic grammar. Thanks to Michael Bauer of Am Faclair Beag, I have a large list of Gaelic words accompanied by useful lexicographical info. Last night, I wondered how well a machine learning algorithm could model the relationships between Gaelic orthography and gender.

I began by extracting all the nouns from the lexicon (17207 nouns in total) along with their gender, and put them into a Python list of tuples, like this:

[('eigheantach', 'f'), ('dìobhairt', 'm'), ('faoinsgeulachd', 'f'), ('còmhnardachadh', 'm'), ('inneal-spreagaidh', 'm')...]

Then I randomised the noun list – which simply included the root form and its gender – and divided it into a training set (90%) and a testing set (10%). I defined three features: 1) the last letter; 2) the last two letters; and 3) the last three letters. I then built a model using a Naive Bayes classifier from the Python package NLTK.
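
The shuffle, split and feature-extraction steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not the exact code I ran: the helper `split_nouns`, its `train_fraction` and `seed` parameters, and the five-noun sample (taken from the list above) are there only to make the example self-contained and repeatable.

```python
import random


def gender_features(word):
    """The three features described above: the last 1, 2 and 3 letters."""
    return {"suffix1": word[-1:], "suffix2": word[-2:], "suffix3": word[-3:]}


def split_nouns(nouns, train_fraction=0.9, seed=0):
    """Shuffle (noun, gender) pairs and split them into training and test sets.

    `train_fraction` and `seed` are assumptions made for this sketch.
    """
    shuffled = list(nouns)                  # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # seeded shuffle for repeatability
    featuresets = [(gender_features(n.lower()), g) for n, g in shuffled]
    size = int(len(featuresets) * train_fraction)
    return featuresets[:size], featuresets[size:]


# Five nouns from the lexicon sample shown above; the real list has 17207.
sample = [('eigheantach', 'f'), ('dìobhairt', 'm'), ('faoinsgeulachd', 'f'),
          ('còmhnardachadh', 'm'), ('inneal-spreagaidh', 'm')]

train_set, test_set = split_nouns(sample)
print(len(train_set), len(test_set))
print(gender_features('faoinsgeulachd'))
```

The resulting lists of (feature dictionary, label) pairs are in the shape that NLTK's classifiers expect for training and evaluation.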

When applied to the test set, the model was about 83% accurate. So, knowing the ending of a Gaelic noun can definitely help if you are trying to determine its gender. Putting this in a POS tagging context, if your tagger can’t guess the gender of a word because it hasn’t seen the word before, you could use a model like this to hazard a guess and be accurate most of the time.
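
The fallback idea might look something like the sketch below: look the word up first, and only guess when it is unknown. Everything named here is hypothetical. `guess_gender` and the one-entry `lexicon` are invented for illustration, and `suffix_guess` is a deliberately crude stand-in for the trained classifier that only knows two endings Calder lists as feminine (-ag and -chd) and otherwise defaults to masculine.

```python
def suffix_guess(word):
    """Crude stand-in for a trained model (an assumption, not the real classifier)."""
    if word.lower().endswith(("ag", "chd")):
        return "f"
    return "m"  # arbitrary default for this sketch


def guess_gender(word, lexicon, fallback):
    """Return (gender, source): use the lexicon if possible, else the fallback."""
    known = lexicon.get(word.lower())
    if known is not None:
        return known, "lexicon"
    return fallback(word), "guess"


# Hypothetical one-entry lexicon, using a noun from the sample list above.
lexicon = {"dìobhairt": "m"}

print(guess_gender("dìobhairt", lexicon, suffix_guess))        # seen before
print(guess_gender("faoinsgeulachd", lexicon, suffix_guess))   # unseen, so guessed
```

In a real pipeline, the `fallback` argument would wrap the trained model, for example a function that calls `classifier.classify(gender_features(word))`.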

Calling up the most informative features from the model confirmed many expectations, but also revealed some patterns that I didn’t expect. These are the 30 most informative features of the model – all with ratios of 10:1 or more (i.e. these endings are at least 10 times more likely to be one gender than the other):

>>> classifier.show_most_informative_features(30)
suffix2 = 'ag' f : m = 84.9 : 1.0
suffix3 = 'eag' f : m = 72.9 : 1.0
suffix3 = 'adh' m : f = 72.6 : 1.0
suffix3 = 'has' m : f = 70.6 : 1.0
suffix3 = 'nag' f : m = 54.2 : 1.0
suffix3 = 'tag' f : m = 40.2 : 1.0
suffix3 = 'rag' f : m = 37.1 : 1.0
suffix3 = 'eid' f : m = 36.3 : 1.0
suffix3 = 'gan' m : f = 34.3 : 1.0
suffix2 = 'an' m : f = 24.7 : 1.0
suffix3 = 'ear' m : f = 24.7 : 1.0
suffix3 = 'lag' f : m = 21.9 : 1.0
suffix3 = 'chd' f : m = 21.6 : 1.0
suffix2 = 'hd' f : m = 21.0 : 1.0
suffix2 = 'on' m : f = 20.3 : 1.0
suffix3 = 'ilt' f : m = 19.9 : 1.0
suffix2 = 'ar' m : f = 17.6 : 1.0
suffix3 = 'ing' f : m = 16.1 : 1.0
suffix3 = 'tan' m : f = 15.2 : 1.0
suffix3 = 'ait' f : m = 14.7 : 1.0
suffix3 = 'ean' m : f = 14.3 : 1.0
suffix2 = 'as' m : f = 14.2 : 1.0
suffix3 = 'oil' f : m = 14.2 : 1.0
suffix3 = 'lan' m : f = 13.7 : 1.0
suffix3 = 'ith' f : m = 12.7 : 1.0
suffix2 = 'am' m : f = 12.2 : 1.0
suffix3 = 'tar' m : f = 11.3 : 1.0
suffix3 = 'ram' m : f = 11.0 : 1.0
suffix2 = 'al' m : f = 10.9 : 1.0
suffix3 = 'oin' f : m = 10.8 : 1.0

 

It is easier to view these in bar plots. Here are the 14 most typically feminine suffixes (the y axis shows the ratio ‘x:1’):

And here are the 15 most typically masculine ones:

There is clearly some crossover here (e.g. -eag, -nag, -tag, -rag and -lag are all forms of the feminine diminutive suffix -ag), so the model could be better specified. But these tell us that gender in Gaelic is well encoded in the suffix. For example, if you see a noun that ends with -adh, it is about 73 times more likely to be masculine than feminine. Indeed, there are very few feminine nouns in the lexicon that end with -adh:

>>> [noun for (noun,gender) in nounslow if noun.endswith('adh') and gender == 'f']

['cneadh', 'dearg-chriadh', 'leasradh', 'stuadh', 'speireag-ruadh', 'muirgheadh', 'buadh', 'riadh', 'criadh', 'ealadh', 'pìob-chriadh', 'ceòlradh', 'roinn-phàigheadh']

These results can be generalised as:

  • Masc nouns tend to end in: -adh, -as, -an, -ar, -am, -al and broad consonants or clusters (e.g. -al), except for a vowel + -g
  • Fem nouns tend to end in: -ag, -chd and slender consonants or clusters (e.g. -ilt, -ing, -in, -il)

As any intermediate Gaelic learner knows, a good rule to follow is that masc nouns end broad and feminine nouns end slender. But are there any exceptions? Well, we already saw that nouns ending in -ag, -achd are largely feminine. Are there any others? Digging a little deeper into the model, we find the following (ratios rounded to whole numbers):

  • Fem:  –ìob (7:1), -ng (7:1), –lb (6:1)
  • Masc: –che (6:1)

Calder had the last one already (e.g. fulangaiche), but he didn’t notice that combinations of a sonorant (l, n, r) and the non-aspirated stops (b and  g) tend to be feminine – words like fang and sgealb.
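
The broad/slender rule of thumb can be checked mechanically. The sketch below assumes the usual orthographic convention that a final consonant or cluster is broad when the nearest preceding vowel is a, o or u, and slender when it is e or i; the function name `final_quality` is hypothetical. It shows why fang and sgealb are exceptions: they end broad, yet tend to be feminine.

```python
BROAD = set("aouàòùáóú")
SLENDER = set("eièìéí")


def final_quality(word):
    """Classify a word's ending as 'broad', 'slender' or 'vowel' (hypothetical helper)."""
    word = word.lower()
    if word[-1:] in BROAD | SLENDER:
        return "vowel"          # e.g. nouns in -che end in a vowel, not a consonant
    # Walk back from the end to the nearest vowel before the final consonant(s).
    for letter in reversed(word):
        if letter in BROAD:
            return "broad"
        if letter in SLENDER:
            return "slender"
    return "unknown"            # no vowel found (or an empty string)


for noun in ["dìobhairt", "còmhnardachadh", "fang", "sgealb", "fulangaiche"]:
    print(noun, final_quality(noun))
```

A check like this could serve as an extra feature for the classifier, alongside the literal last letters, although I have not tested whether it improves accuracy.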

With the model, we can check to see what it would make of a made-up word — how it might classify a nonce word or unusual dialectal form, for instance — if we used it as part of a part-of-speech tagger:

>>> classifier.classify(gender_features('brùthang'))

'f'

All in all, this is useful and — if you are a language geek — pretty interesting stuff. What is exciting about doing NLP with Gaelic is that, while this type of work is old hat for many languages now, it is brand new for Scottish Gaelic.

So, by generating this model and testing it upon a hold-out of 10% of the nouns in the lexicon, we have shown that it confirms certain expectations, discovers some unexpected patterns in the language, allows us to quantify relationships and provides a pragmatic solution to the quandary of how to guess the gender of unknown words as part of an NLP pipeline.

 

Code

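(NB: nounslow is the list of (noun, gender) tuples, with the nouns in lowercase)

>>> import random
>>> import nltk
>>> def gender_features(word):
...     # Features: the last one, two and three letters of the noun
...     return {'suffix1': word[-1:],
...             'suffix2': word[-2:],
...             'suffix3': word[-3:]}
...
>>> random.shuffle(nounslow)  # randomise the order before splitting
>>> featuresets = [(gender_features(n.lower()), gender) for (n, gender) in nounslow]
>>> size = int(len(featuresets) * 0.9)  # 90% for training, 10% for testing
>>> train_set, test_set = featuresets[:size], featuresets[size:]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print(nltk.classify.accuracy(classifier, test_set))
0.8256827425915165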

New release of tagged Scottish Gaelic corpus (ARCOSG)

ARCOSG has been used for a range of projects including a voice synthesiser and syntactic parser. It has been newly revised and made compatible with the popular Natural Language Toolkit (NLTK): release available here.

A simplified version of the corpus has also been released, ARCOSG-S, which uses a less complex tag scheme (41 tags vs 246). It is available here.

 

Scottish Gaelic and its representation in language tech tools

This thoughtful article in the Guardian got me thinking about the fact that the choices we make about representation of Scottish Gaelic in new language tech tools are far from trivial. The balance of ages, genders and dialects, for instance, on a tool like Duolingo can impact the future of the language in ways that are hard to anticipate.

Developing something like an artificial voice is a resource-intensive endeavour, especially for a small language like Scottish Gaelic. How do we decide which dialects – which voices – survive digitally in the years to come? When a project is funded via public monies (unlike Duolingo), whose choice should it be?

 
