Will Lamb will be talking to us about a project he is currently leading here at the university. This is our last lecture of the semester!
Language technology for Scottish Gaelic remains at an incipient stage compared with recent progress for other European minority languages. Certain key computational resources and tools must be provided if Gaelic is to participate fully in future data-rich research paradigms and in a variety of NLP-driven applications that would benefit a range of end users. The project ‘An on-line part-of-speech (POS) tagger and gold-standard corpus of Scottish Gaelic’, funded by the Carnegie Trust and Bòrd na Gàidhlig, was devised to help address this situation, with three main aims:
- Develop a hand-tagged ‘gold-standard’ corpus (GSC) of Scottish Gaelic
- Develop a POS tagger that reaches 97% accuracy when tested on the GSC
- Make these resources freely available on the internet
As this one-year project approaches its half-way mark, Will Lamb will report on work in progress. In particular, he will take stock of some of the challenges of building an NLP pipeline for an under-standardised and morphologically rich language. The Gaelic nominal system, for example, is notably complex and is subject to variation conditioned by dialect, register and age. Dr Lamb will also present results from the project's first statistically induced tagger, based on a finalised 12k-word subset of the 80k-word corpus.