ARCOSG has been used for a range of projects, including a voice synthesiser and a syntactic parser. It has recently been revised and made compatible with the popular Natural Language Toolkit (NLTK); the release is available here.
A simplified version of the corpus, ARCOSG-S, has also been released. It uses a less complex tag scheme (41 tags rather than 246) and is available here.
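To give a feel for what a smaller tag scheme means in practice, here is a minimal sketch in plain Python. It is an illustration only: the example tags, the `word/TAG` layout, and the "keep the first two characters" rule are all assumptions made for this sketch, not the actual ARCOSG tagset, file format, or the mapping used to produce ARCOSG-S.

```python
# Hypothetical fine-grained tags, invented for illustration;
# these are NOT taken from the real ARCOSG tagset.
TAGGED_LINE = "tha/Vp an/Tdsm cù/Ncsmn mòr/Aqsmn"


def parse_tagged(line, sep="/"):
    """Split a 'word/TAG word/TAG' line into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the last separator, so a word that
        # itself contains the separator keeps everything before it.
        word, _, tag = token.rpartition(sep)
        pairs.append((word, tag))
    return pairs


def simplify(tag, width=2):
    """Collapse a fine-grained tag to a coarser one.

    Assumed rule for this sketch: keep only the first `width` characters.
    """
    return tag[:width]


fine = parse_tagged(TAGGED_LINE)
coarse = [(word, simplify(tag)) for word, tag in fine]

print(len({tag for _, tag in fine}), "fine-grained tags in this line")
print(len({tag for _, tag in coarse}), "simplified tags in this line")
```

A coarser scheme like this trades grammatical detail (case, gender, number) for a tagset that is easier to learn from limited data, which is often the motivation for releasing a simplified corpus alongside the full one.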
This thoughtful article in the Guardian got me thinking about how the choices we make when representing Scottish Gaelic in new language tech tools are far from trivial. The balance of ages, genders and dialects in a tool like Duolingo, for instance, can affect the future of the language in ways that are hard to anticipate.
Developing something like an artificial voice is a resource-intensive endeavour, especially for a small language like Scottish Gaelic. How do we decide which dialects – which voices – survive digitally in the years to come? When a project is funded via public monies (unlike Duolingo), whose choice should it be?