AFRICAN LANGUAGES IN THE DEVELOPMENT OF CORPUS LINGUISTICS AND OPENAI/CHATGPT: A CASE STUDY OF THE YORUBA LANGUAGE

Authors

  • Anthonia Adunola Abe

Abstract

This study, African Languages in the Development of Corpus Linguistics and OpenAI/ChatGPT: A Case Study of the Yoruba Language, examined the shortcomings of ChatGPT in handling African languages. ChatGPT is trained largely on European language corpora and only to a limited extent on African language corpora. The research uses the Yoruba language as a case study: Yoruba corpus data were gathered and analyzed in order to understand the structure of the language adequately. The differences between African and European languages were also examined, as they account for one of ChatGPT's shortcomings in producing correct responses in African languages. Most African languages, such as Yoruba, Igbo, and Zulu, are tonal, while most European languages are not. It was evident that ChatGPT has difficulty differentiating words with the same segments but different tones, and it was also found that ChatGPT returned incorrect data for the Yoruba language. Many African languages are considered low-resource, meaning that limited digital content is available in them; this scarcity of training data may impair the ability of models such as ChatGPT to understand and generate content accurately. This study therefore lays a foundation for other researchers interested in working on corpora of African languages.
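The tonal ambiguity the abstract describes can be illustrated with a short sketch. The example below is not from the study; it uses the well-known Yoruba set ọkọ (husband), ọkọ̀ (vehicle), and ọkọ́ (hoe), which share the same segments and differ only in the tone diacritic. Stripping the combining tone marks (a common, lossy text-normalization step) collapses all three distinct words into one form, which is one way tonal information is lost in text pipelines and training corpora.

```python
import unicodedata

# Illustrative Yoruba minimal set: same segments, different tones.
# Glosses are standard dictionary senses, not data from the study.
words = {
    "\u1ecdk\u1ecd": "husband",         # ọkọ  (mid tone, unmarked)
    "\u1ecdk\u1ecd\u0300": "vehicle",   # ọkọ̀ (low tone: grave accent)
    "\u1ecdk\u1ecd\u0301": "hoe",       # ọkọ́ (high tone: acute accent)
}

# Yoruba tone marks are the combining grave (low) and acute (high).
TONE_MARKS = {"\u0300", "\u0301"}

def strip_tone_marks(word: str) -> str:
    """Remove combining tone marks but keep base characters
    (including the dot-below of ọ and ṣ)."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)

# All three distinct words collapse to a single surface form.
collapsed = {strip_tone_marks(w) for w in words}
print(len(words), "words collapse to", len(collapsed), "form")  # 3 words collapse to 1 form
```

A model trained on text where tone marks were dropped in this way cannot recover the distinction, which is consistent with the difficulty the study reports.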

Published

2025-08-09