Any views expressed within media held on this service are those of the contributors, should not be taken as approved or endorsed by the University, and do not necessarily reflect the views of the University in respect of any particular issue.

Rannsachadh digiteach air a' Ghàidhlig ~ Goireasan digiteach airson nan Gàidheal

Predicting Grammatical Gender in Scottish Gaelic with Machine Learning

English speakers never have to worry about grammatical gender – nouns are just nouns. When I began learning Gaelic in my early twenties, getting to grip with grammatical gender was a challenge. Until I learnt some of the patterns intuitively, I had to look up every new noun in the dictionary to determine its gender, and add this information to my stack of flash cards. As it happens, computers also struggle with identifying gender. When we built the first part-of-speech tagger for Gaelic a few years ago, gender was one of the things that our statistically-based model often got wrong.

Some grammars supply a list of suffixes that are typically feminine or masculine, and these can be helpful to new students of Gaelic. For instance, once you know that just about all nouns ending in -chd are feminine, you can take a new noun with that suffix and be relatively confident about how to use it. In the grammar at the end of my 2008 book, Scottish Gaelic Speech and Writing, I list (pp 206-207)  the suffixes provided by Calder (1923: 76-77):

  • Masc: -adh, -an/-ean, -as, -ach, -aiche, and -air.
  • Fem: -ag, -achd/-eachd, -ad, /-ead, -e, and -ir (for polysyllables only)

But how reliable are these endings for predicting gender? And what proportion of nouns ending in a particular suffix takes the expected gender? Furthermore, are there any Gaelic suffixes that are useful for predicting gender that Gaelic grammarians haven’t noticed already?

Over the last few months, I’ve been assembling some code and resources that will allow me to do some new research on Gaelic grammar. Thanks to Michael Bauer of Am Faclair Beag, I have a large list of Gaelic words accompanied by useful lexicographical info. Last night, I wondered how well a machine learning algorithm could model the relationships between Gaelic orthography and gender.

I began by extracted all the nouns from the lexicon (17207 nouns total) along with their gender, and put them into a Python list of tuples, like this:

[('eigheantach', 'f'), ('dìobhairt', 'm'), ('faoinsgeulachd', 'f'), ('còmhnardachadh', 'm'), ('inneal-spreagaidh', 'm')...]

Then I  randomised the noun list – which simply included the root form and its gender – and divided it into a training (90%) and and testing set (10%). I defined three features: 1) the last letter; 2) the last two letters and 3) the last three letters. I then built a model using a Naive Bayes Classifier from the Python package, NLTK.

When applied to the test set, the model was about 83% accurate. So, knowing the ending of Gaelic noun can definitely help if you are trying to determine its gender.  Putting this in an POS tagging context, if your tagger can’t guess the gender of a word because it hasn’t seen the word before, you could use a model like this to hazard a guess and be accurate most of the time.

Calling up the most informative features from the model confirmed many expectations, but also some patterns that I didn’t expect. These are the 30 most informative features of the model – all with ratios of 10:1 or more (i.e. these endings are at least 10 times more likely to be one gender than the other):

>>> classifier.show_most_informative_features(30)
suffix2 = 'ag' f : m = 84.9 : 1.0
suffix3 = 'eag' f : m = 72.9 : 1.0
suffix3 = 'adh' m : f = 72.6 : 1.0
suffix3 = 'has' m : f = 70.6 : 1.0
suffix3 = 'nag' f : m = 54.2 : 1.0
suffix3 = 'tag' f : m = 40.2 : 1.0
suffix3 = 'rag' f : m = 37.1 : 1.0
suffix3 = 'eid' f : m = 36.3 : 1.0
suffix3 = 'gan' m : f = 34.3 : 1.0
suffix2 = 'an' m : f = 24.7 : 1.0
suffix3 = 'ear' m : f = 24.7 : 1.0
suffix3 = 'lag' f : m = 21.9 : 1.0
suffix3 = 'chd' f : m = 21.6 : 1.0
suffix2 = 'hd' f : m = 21.0 : 1.0
suffix2 = 'on' m : f = 20.3 : 1.0
suffix3 = 'ilt' f : m = 19.9 : 1.0
suffix2 = 'ar' m : f = 17.6 : 1.0
suffix3 = 'ing' f : m = 16.1 : 1.0
suffix3 = 'tan' m : f = 15.2 : 1.0
suffix3 = 'ait' f : m = 14.7 : 1.0
suffix3 = 'ean' m : f = 14.3 : 1.0
suffix2 = 'as' m : f = 14.2 : 1.0
suffix3 = 'oil' f : m = 14.2 : 1.0
suffix3 = 'lan' m : f = 13.7 : 1.0
suffix3 = 'ith' f : m = 12.7 : 1.0
suffix2 = 'am' m : f = 12.2 : 1.0
suffix3 = 'tar' m : f = 11.3 : 1.0
suffix3 = 'ram' m : f = 11.0 : 1.0
suffix2 = 'al' m : f = 10.9 : 1.0
suffix3 = 'oin' f : m = 10.8 : 1.0

 

It is easier to view these in bar plots. Here are 14 most typically feminine suffixes (the y axis shows the ratio ‘x:1’):

And here are the 15 most typically masculine ones:

There is some crossover here clearly (e.g. –eag, -nag, -tag, -rag and -lag are all forms of the diminutive female suffix –ag),  so the model could be better specified. But these tell us that gender in Gaelic is well encoded in the suffix. For example, if you see a noun that ends with -adh, it is 76 times more likely to be masculine than feminine. Indeed, there are very few feminine nouns in the lexicon that end with -adh:

>>> [noun for (noun,gender) in nounslow if noun.endswith('adh') and gender == 'f']

['cneadh', 'dearg-chriadh', 'leasradh', 'stuadh', 'speireag-ruadh', 'muirgheadh', 'buadh', 'riadh', 'criadh', 'ealadh', 'pìob-chriadh', 'ceòlradh', 'roinn-phàigheadh']

These results can be generalised as:

  • Masc nouns tend to end in: -adh, -as, -an, -ar, -am, -al and broad consonant or clusters (e.g. -al), except for a vowel + –g
  • Fem nouns tend to end in: –ag, -chd and slender consonants or clusters (e.g. –ilt, ing, -in, -il)

As any intermediate Gaelic learner knows, a good rule to follow is that masc nouns end broad and feminine nouns end slender. But are there any exceptions? Well, we already saw that nouns ending in -ag, -achd are largely feminine. Are there any others? Digging a little deeper into the model, we find the following (ratios rounded to whole numbers):

  • Fem:  –ìob (7:1), -ng (7:1), –lb (6:1)
  • Masc: –che (6:1)

Calder had the last one already (e.g. fulangaiche), but he didn’t notice that combinations of a sonorant (l, n, r) and the non-aspirated stops (b and  g) tend to be feminine – words like fang and sgealb.

With the model, we can check to see what it would make of a made-up word — how it might classify a nonce word or unusual dialectal form, for instance — if we used it as part of a part-of-speech tagger:

>>> classifier.classify(gender_features('brùthang'))

'f'

All in all, this is useful and — if you are a language geek — pretty interesting stuff. What is exciting about doing NLP with Gaelic is that, while this type of work is old hat for many languages now, it is brand new for Scottish Gaelic.

So, by generating this model and testing it upon a hold-out of 10% of the nouns in the lexicon, we have shown that it confirms certain expectations, discovers some unexpected patterns in the language, allows us to quantify relationships and provides a pragmatic solution to the quandary of how to guess the gender of unknown words as part of an NLP pipeline.

 

Code

(NB: nounslow is the list of nouns in lowercase)

def gender_features(word):

...     return{'suffix1': word[-1:],

...             'suffix2': word[-2:],

...             'suffix3': word[-3:]}

...

>>> featuresets = [(gender_features(n.lower()), gender) for (n, gender) in nounslow]

>>> train_set, test_set = featuresets[:size], featuresets[size:]

>>> classifier = nltk.NaiveBayesClassifier.train(train_set)

>>> print(nltk.classify.accuracy(classifier, test_set))

0.8256827425915165

Share

Previous

New release of tagged Scottish Gaelic corpus (ARCOSG)

Next

New Gaelic language technology website launched

6 Comments

  1. Valerie

    This is a very interesting and helpful post. I bookmarked it a few months ago, but today I’m going back over all my notes on just this – knowing when a noun is masculine or feminine. So now I’ve read this post completely (which I had not done before) – I really like the list of f:m and m:f. I made a copy and separated it into 2 lists so I can compare it to my other notes. I suppose someday we won’t have to go through a stack of books to find all the guidelines to memorize. Thank you for posting this and best wishes with future work.

    • wlamb

      Thanks, Valarie – I’m really glad that you found the post helpful. One of the things that I really like about building resources for the language, programming and machine learning work, is that they can benefit users and learners of the language so much. I’m working on a new grammar of Gaelic at the moment and some the insights that we’ve found with data driven techniques will certainly find their way in there – Will

  2. Alice

    I found this exceptionally interesting as well. I just started learning Gaelic last year and have always been interested in patterns in any of the languages I have studied or use.

    I noticed in your answer to Valerie that you were working on a new grammar of Gaelic. That was back in 2020. Has the new grammar been published yet?

    • wlamb

      Hi Alice — thanks very much for your comment. I’m about 85-90% done writing the grammar, but it’s not been published yet. I expect that it might be sometime around 2024.

      Will

      • Alice

        Thanks, Will. I’ll look out for it then. In the meantime I’m bookmarking this site.

        Bliadhna mhath ùr dhuibh.

        Alice

  3. Kathryn Cole

    Very interesting read! I am a new learner so memorizing feminine vs. masculine nouns has been an ongoing challenge. One thing I have noticed is that feminine nouns tend to have a ‘harder’ sounding ending than masculine nouns, or there is an i near the end of the word, which seems to jibe with your work. Of course there are many exceptions, so I have been focusing on the ‘exceptions’ from available noun lists. It’s enough just to memorize all the possible definite article rules that apply to each, so having a guide – even if it is only 80% accurate – is a huge help. Thank you!

Leave a Reply to wlamb Cancel reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Powered by WordPress & Theme by Anders Norén

css.php

Report this page

To report inappropriate content on this page, please use the form below. Upon receiving your report, we will be in touch as per the Take Down Policy of the service.

Please note that personal data collected through this form is used and stored for the purposes of processing this report and communication with you.

If you are unable to report a concern about content via this form please contact the Service Owner.

Please enter an email address you wish to be contacted on. Please describe the unacceptable content in sufficient detail to allow us to locate it, and why you consider it to be unacceptable.
By submitting this report, you accept that it is accurate and that fraudulent or nuisance complaints may result in action by the University.

  Cancel