How good are Large Language Models on African Languages?

2311.07978

Published 5/1/2024 by Jessica Ojo, Kelechi Ogueji, Pontus Stenetorp, David Ifeoluwa Adelani

Abstract

Recent advancements in natural language processing have led to the proliferation of large language models (LLMs). These models have been shown to yield good performance, using in-context learning, even on tasks and languages they are not trained on. However, their performance on African languages is largely understudied relative to high-resource languages. We present an analysis of four popular large language models (mT0, Aya, LLaMa 2, and GPT-4) on six tasks (topic classification, sentiment classification, machine translation, summarization, question answering, and named entity recognition) across 60 African languages, spanning different language families and geographical regions. Our results suggest that all LLMs produce lower performance for African languages, and there is a large gap in performance compared to high-resource languages (such as English) for most tasks. We find that GPT-4 has an average to good performance on classification tasks, yet its performance on generative tasks such as machine translation and summarization is significantly lacking. Surprisingly, we find that mT0 had the best overall performance for cross-lingual QA, better than the state-of-the-art supervised model (i.e. fine-tuned mT5) and GPT-4 on African languages. Similarly, we find the recent Aya model to have comparable results to mT0 in almost all tasks except for topic classification, where it outperforms mT0. Overall, LLaMa 2 showed the worst performance, which we believe is due to its English- and code-centric (around 98%) pre-training corpus. Our findings confirm that performance on African languages continues to remain a hurdle for the current LLMs, underscoring the need for additional efforts to close this gap.

Overview

  • The paper examines the performance of four large language models (mT0, Aya, LLaMa 2, and GPT-4) on six tasks across 60 African languages.
  • The results show that the language models generally perform worse on African languages compared to high-resource languages like English.
  • The paper highlights the need for more efforts to improve the performance of language models on African languages.

Plain English Explanation

Large language models (LLMs) like GPT-4 and LLaMa 2 have become very capable at understanding and generating human-like text. These models are trained on vast amounts of data from the internet and can perform well on a variety of tasks, even for languages they weren't specifically trained on.

However, the researchers found that the performance of these LLMs is significantly lower for African languages compared to high-resource languages like English. They tested four popular LLMs on tasks like topic classification, sentiment analysis, machine translation, summarization, question answering, and named entity recognition across 60 African languages.

The results suggest that while models like GPT-4 can do reasonably well on some classification tasks for African languages, they struggle with more complex generative tasks like translation and summarization. The researchers also found that the mT0 model performed the best overall on the cross-lingual question answering task, outperforming both GPT-4 and a supervised model (mT5) that had been fine-tuned for the task.

The paper highlights the need for more research and development to improve the performance of LLMs on African languages. This is important because these models are increasingly being used in a wide range of applications, and their poor performance on underrepresented languages could lead to biases and exclusion.

Technical Explanation

The researchers in this paper conducted an extensive evaluation of four popular large language models - mT0, Aya, LLaMa 2, and GPT-4 - on six different tasks across 60 African languages.

The tasks included topic classification, sentiment classification, machine translation, summarization, question answering, and named entity recognition. The researchers chose a diverse set of African languages spanning different language families and geographical regions to get a comprehensive understanding of the models' performance.
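To make the in-context evaluation setup more concrete, here is a minimal sketch of how a zero-shot topic classification prompt might be built and how a model's free-text answer might be mapped back to a label. The prompt wording, the label set, and the function names `build_prompt` and `parse_label` are all assumptions made for illustration; the paper's actual prompts and datasets may differ.

```python
from typing import List, Optional

# Hypothetical label set, for illustration only; real datasets define their own.
LABELS = ["business", "health", "politics", "sports", "technology"]


def build_prompt(text: str, labels: List[str]) -> str:
    """Build a zero-shot classification prompt (assumed wording, not the paper's)."""
    options = ", ".join(labels)
    return (
        f"Classify the topic of the following text as one of: {options}.\n"
        f"Text: {text}\n"
        "Topic:"
    )


def parse_label(completion: str, labels: List[str]) -> Optional[str]:
    """Return the first known label found in the model output, or None."""
    lowered = completion.strip().lower()
    for label in labels:
        if label in lowered:
            return label
    return None
```

Returning `None` for unparseable outputs keeps such cases visible, so they can be counted as errors rather than silently dropped when computing accuracy or F1.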

The results showed that all the LLMs performed significantly worse on the African languages compared to high-resource languages like English. There was a large gap in performance, particularly for the generative tasks like machine translation and summarization.
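One simple way to quantify such a gap is to subtract the average score over the African languages from the score on the reference language. The sketch below assumes a per-language score dictionary keyed by ISO 639-3 codes; the function name `performance_gap` and the numbers are placeholders chosen for illustration, not results from the paper.

```python
from statistics import mean
from typing import Dict


def performance_gap(scores: Dict[str, float], reference: str = "eng") -> float:
    """Reference-language score minus the mean score of all other languages."""
    others = [value for lang, value in scores.items() if lang != reference]
    return scores[reference] - mean(others)


# Placeholder scores for illustration only; these are NOT results from the paper.
toy_scores = {"eng": 90.0, "swa": 70.0, "yor": 50.0}
gap = performance_gap(toy_scores)  # 90.0 - mean(70.0, 50.0)
```

A positive value means the model scores higher on the reference language than on the other languages on average; computing it per task makes it possible to compare classification and generation gaps side by side.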

Interestingly, the researchers found that the mT0 model performed the best overall on the cross-lingual question answering task, outperforming both the state-of-the-art supervised model (fine-tuned mT5) and GPT-4. The Aya model also showed comparable results to mT0 across most tasks, except for topic classification, where it outperformed mT0.

On the other hand, the LLaMa 2 model exhibited the worst performance, which the researchers attribute to its heavily English and code-centric pre-training corpus (around 98%).

Critical Analysis

The paper provides valuable insights into the current limitations of large language models when it comes to processing African languages. The researchers have conducted a comprehensive evaluation across a wide range of tasks and languages, which gives us a clear picture of the performance gaps.

One potential concern is the lack of a detailed analysis of the factors contributing to the poor performance of the models on African languages. The paper mentions that the pre-training corpus composition may play a role, but a more in-depth investigation into the specific challenges, such as linguistic diversity, data availability, or model architecture limitations, could have provided a more complete understanding.

Additionally, the paper does not discuss the potential societal implications of these performance gaps. As LLMs become more widely deployed in various applications, their poor performance on underrepresented languages could lead to biases and exclusion, which should be further explored and addressed.

The researchers also acknowledge the need for additional efforts to close the performance gap, but they do not provide specific recommendations or a roadmap for how this can be achieved. Exploring potential solutions, such as enhancing existing models or developing specialized models for African languages, could have strengthened the paper's impact and practical relevance.

Conclusion

This paper highlights a significant challenge in the field of natural language processing: the gap between how well large language models perform on high-resource languages and how well they perform on African languages. The researchers have conducted a comprehensive evaluation across a wide range of tasks and models, revealing that current LLMs fall well short of their English performance when applied to African languages.

The findings underscore the need for more concerted efforts to improve the capabilities of language models on underrepresented languages. As these models become increasingly prominent in various applications, ensuring equitable and inclusive performance is crucial. The paper serves as a call to action for the research community to address this pressing issue and work towards closing the performance gap for African languages.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis

Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, Lei Li

Large language models (LLMs) have demonstrated remarkable potential in handling multilingual machine translation (MMT). In this paper, we systematically investigate the advantages and challenges of LLMs for MMT by answering two questions: 1) How well do LLMs perform in translating massive languages? 2) Which factors affect LLMs' performance in translation? We thoroughly evaluate eight popular LLMs, including ChatGPT and GPT-4. Our empirical results show that translation capabilities of LLMs are continually evolving. GPT-4 has beaten the strong supervised baseline NLLB in 40.91% of translation directions but still faces a large gap compared with commercial translation systems like Google Translate, especially on low-resource languages. Through further analysis, we discover that LLMs exhibit new working patterns when used for MMT. First, LLMs can acquire translation ability in a resource-efficient way and generate moderate translations even on zero-resource languages. Second, instruction semantics can surprisingly be ignored when given in-context exemplars. Third, cross-lingual exemplars can provide better task guidance for low-resource translation than exemplars in the same language pairs. Code will be released at: https://github.com/NJUNLP/MMT-LLM.

6/17/2024

Beyond Metrics: Evaluating LLMs' Effectiveness in Culturally Nuanced, Low-Resource Real-World Scenarios

Millicent Ochieng, Varun Gumma, Sunayana Sitaram, Jindong Wang, Vishrav Chaudhary, Keshet Ronen, Kalika Bali, Jacki O'Neill

The deployment of Large Language Models (LLMs) in real-world applications presents both opportunities and challenges, particularly in multilingual and code-mixed communication settings. This research evaluates the performance of seven leading LLMs in sentiment analysis on a dataset derived from multilingual and code-mixed WhatsApp chats, including Swahili, English and Sheng. Our evaluation includes both quantitative analysis using metrics like F1 score and qualitative assessment of LLMs' explanations for their predictions. We find that, while Mistral-7b and Mixtral-8x7b achieved high F1 scores, they and other LLMs such as GPT-3.5-Turbo, Llama-2-70b, and Gemma-7b struggled with understanding linguistic and contextual nuances, and showed a lack of transparency in their decision-making process, as observed from their explanations. In contrast, GPT-4 and GPT-4-Turbo excelled in grasping diverse linguistic inputs and managing various contextual information, demonstrating high consistency with human alignment and transparency in their decision-making process. The LLMs, however, encountered difficulties in incorporating cultural nuance, especially in non-English settings, with the GPT-4 models doing so inconsistently. The findings emphasize the necessity of continuous improvement of LLMs to effectively tackle the challenges of culturally nuanced, low-resource real-world settings and the need for developing evaluation benchmarks for capturing these issues.

6/14/2024

A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks

Xuanfan Ni, Piji Li

Recent efforts have evaluated large language models (LLMs) in areas such as commonsense reasoning, mathematical reasoning, and code generation. However, to the best of our knowledge, no work has specifically investigated the performance of LLMs in natural language generation (NLG) tasks, a pivotal criterion for determining model excellence. Thus, this paper conducts a comprehensive evaluation of well-known and high-performing LLMs, namely ChatGPT, ChatGLM, T5-based models, LLaMA-based models, and Pythia-based models, in the context of NLG tasks. We select English and Chinese datasets encompassing Dialogue Generation and Text Summarization. Moreover, we propose a common evaluation setting that incorporates input templates and post-processing strategies. Our study reports automatic evaluation results, accompanied by a detailed analysis.

5/17/2024

MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks

Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma, Ishaan Watts, Ashutosh Sathe, Millicent Ochieng, Rishav Hada, Prachi Jain, Maxamed Axmed, Kalika Bali, Sunayana Sitaram

There has been a surge in LLM evaluation research to understand LLM capabilities and limitations. However, much of this research has been confined to English, leaving LLM building and evaluation for non-English languages relatively unexplored. Several new LLMs have been introduced recently, necessitating their evaluation on non-English languages. This study aims to perform a thorough evaluation of the non-English capabilities of SoTA LLMs (GPT-3.5-Turbo, GPT-4, PaLM2, Gemini-Pro, Mistral, Llama2, and Gemma) by comparing them on the same set of multilingual datasets. Our benchmark comprises 22 datasets covering 83 languages, including low-resource African languages. We also include two multimodal datasets in the benchmark and compare the performance of LLaVA models, GPT-4-Vision and Gemini-Pro-Vision. Our experiments show that larger models such as GPT-4, Gemini-Pro and PaLM2 outperform smaller models on various tasks, notably on low-resource languages, with GPT-4 outperforming PaLM2 and Gemini-Pro on more datasets. We also perform a study on data contamination and find that several models are likely to be contaminated with multilingual evaluation benchmarks, necessitating approaches to detect and handle contamination while assessing the multilingual performance of LLMs.

4/4/2024