New language resources for the four South African Nguni languages

Better technology tools are on the way for South Africa’s four Nguni languages. Research being done at the Centre for Text Technology (CTexT®) at the North-West University (NWU) is helping to fill the gaps in what is known about how these languages are used, which will in turn lead to the development of language tools based on the latest technologies, notably artificial intelligence (AI).

“The recent research was focused on obtaining and annotating resources for four of our South African languages: isiNdebele, isiXhosa, isiZulu and Siswati. Core technologies were also made available for these languages by looking at morphological analysers (see text box below), part-of-speech taggers and lemmatisers,” explain the three-person key research team, Dr Tanja Gaustad, Dr Martin Puttkammer and Jaco du Toit.

The benefit of this kind of research is that it offers the opportunity to improve existing language technologies. For instance, machine translation systems for South African languages can be enhanced by using these resources to further promote mutual understanding and better communication. Developing better core technologies paves the way for better tools such as spelling checkers, information-retrieval systems and text-mining tools.

The importance of data

According to Dr Gaustad, who is the senior computational linguist at CTexT®, “the current research in artificial intelligence, especially deep learning, is data-driven. This means that for better tools to be developed for South African languages, data resources are needed. As South African languages are under-resourced, this poses a problem for gaining better insight into how these languages are being used and allowing the development of these necessary tools.”

Linguistic resources enable and facilitate related research efforts. According to the researchers, knowledge of how a language works has in the past mostly been captured in rule-based representations of the inner workings of natural language.

“Such approaches require expert knowledge to both maintain and expand the rules and are limited in their comprehensiveness, since they do not address any scarce or unrecorded morphological processes that fall outside the scope of the defined rules,” explains Jaco, CTexT®’s computational linguist.

How the current research was conducted

Currently most grammars for South Africa’s Nguni languages are fairly dated (from the 1950s), so applying machine learning to understand how these languages work can help improve the dated linguistic descriptions and reflect modern language use.

Since the four languages share a similar linguistic structure, the textual data can be collected and analysed in parallel to allow researchers to do comparative computational linguistic studies. Using this data, core technologies were developed in the form of morphological analysers, part-of-speech taggers, and lemmatisers.

Analysing the text with the new morphological analyser improved overall accuracy to between 82% and 92%, outperforming previously developed rule-based analysers for the same languages.

SADiLaR, the South African Centre for Digital Language Resources, is a research infrastructure established by the Department of Science and Innovation (DSI) of the South African government as part of the South African Research Infrastructure Roadmap (SARIR).

The resources are available as open source on the SADiLaR repository website.

Definitions of core technologies

  • Morphological analyser – analyses a word in terms of the meaningful parts it contains, aiming to find the smallest units of meaning (morphemes) in a language.
     
  • Part-of-speech (POS) tagger – a software tool that labels each word with one of several categories, for example noun or verb, to identify the word's function in a given language.
     
  • Lemmatiser – groups together different inflectional forms of the same word.
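
To make the three definitions above concrete, here is a minimal Python sketch. It is not the CTexT® software and it does not use machine learning: it simply looks words up in a tiny hand-written table, which is an assumption made purely for illustration. The table holds two isiZulu nouns, umuntu (person) and abantu (people), and treats their shared stem as the lemma, which is a simplification. All function names are hypothetical.

```python
# Toy, hand-written lexicon: an illustrative assumption, not CTexT data.
# umuntu = umu- (noun class prefix) + -ntu (stem); abantu = aba- + -ntu.
TOY_LEXICON = {
    "umuntu": {"morphemes": ["umu", "ntu"], "pos": "noun", "lemma": "ntu"},
    "abantu": {"morphemes": ["aba", "ntu"], "pos": "noun", "lemma": "ntu"},
}


def analyse(word):
    """Morphological analyser: split a word into its meaningful parts."""
    entry = TOY_LEXICON.get(word.lower())
    return entry["morphemes"] if entry else None


def tag(word):
    """Part-of-speech tagger: label a word with its category."""
    entry = TOY_LEXICON.get(word.lower())
    return entry["pos"] if entry else None


def lemmatise(word):
    """Lemmatiser: map an inflected form to its base form (here, the stem)."""
    entry = TOY_LEXICON.get(word.lower())
    return entry["lemma"] if entry else None


def group_by_lemma(words):
    """Group different inflectional forms of the same word together."""
    groups = {}
    for word in words:
        groups.setdefault(lemmatise(word), []).append(word)
    return groups


if __name__ == "__main__":
    for word in ["umuntu", "abantu"]:
        print(word, analyse(word), tag(word), lemmatise(word))
    print(group_by_lemma(["umuntu", "abantu"]))
```

A lookup table like this returns nothing for any word it has not seen, which is the kind of limited coverage that data-driven approaches aim to overcome.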

 

Dr Martin Puttkammer   Dr Tanja Gaustad   Jaco du Toit

 

Submitted on Wed, 07/13/2022 - 16:07