Africa has over 2,000 languages; however, these languages are not well represented in the existing natural language processing (NLP) ecosystem. One of the challenges is the lack of useful African language datasets that can be used to solve different social and economic problems.
In this article, I have compiled a list of African language datasets from across the web. These datasets can be used in numerous NLP tasks such as text classification, named entity recognition, machine translation, sentiment analysis, speech recognition, and topic modeling.
This collection of datasets has been made public to give you an opportunity to use your skills and help solve different challenges.
Below is the list of African language datasets for text classification.
The Swahili news dataset contains more than 31,000 news articles from different news categories such as local, international, business or financial, health, sports, and entertainment. Swahili is one of the most widely spoken languages in Africa; it is spoken by 100-150 million people across East Africa.
The data was collected from different news publication platforms inside and outside of Tanzania. The dataset can be used to develop a multi-class classification model that assigns each news article to its category.
The model can be used by Swahili online news platforms to automatically group news according to their categories and help readers find the specific news they want to read.
from datasets import load_dataset
dataset = load_dataset("swahili_news")
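To make the multi-class classification idea concrete, here is a minimal sketch of a multinomial Naive Bayes classifier written with the Python standard library only. It is an illustration, not the method used by the dataset authors: the toy sentences, the two category names, and the helper functions `train_naive_bayes` and `predict` are all assumptions made for this example, and a real project would train on the articles and labels loaded above.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(texts, labels):
    """Count words per category for a multinomial Naive Bayes model."""
    word_counts = defaultdict(Counter)  # category -> word frequencies
    doc_counts = Counter(labels)        # category -> number of articles
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def predict(text, model):
    """Return the category with the highest log-probability for `text`."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in doc_counts.items():
        score = math.log(n_docs / total_docs)  # log prior of the category
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            # Add-one smoothing so unseen words do not zero out a category
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy examples invented for illustration (not taken from the dataset)
toy_texts = [
    "timu ya mpira imeshinda goli tatu",
    "mchezaji amefunga goli uwanjani",
    "benki imetangaza faida ya fedha",
    "biashara ya fedha imeongezeka sokoni",
]
toy_labels = ["sports", "sports", "business", "business"]

model = train_naive_bayes(toy_texts, toy_labels)
print(predict("goli la timu", model))
print(predict("fedha za benki", model))
```

Whitespace tokenization keeps the sketch short; for real news articles you would likely want better text cleaning and a held-out test set to measure accuracy per category.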
Note: The Swahili news dataset has an imbalanced category distribution: some categories contain only a few news articles.
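One common way to inspect and compensate for this kind of imbalance is to count the articles per category and weight each category inversely to its frequency. The sketch below assumes you already have a plain list of category labels; the label list and the `class_weights` helper are invented for illustration and do not reflect the real counts in the dataset.

```python
from collections import Counter


def class_weights(labels):
    """Weight each category inversely to its frequency ("balanced" heuristic)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Invented label list for illustration; real counts come from the dataset itself
toy_labels = ["local"] * 6 + ["sports"] * 3 + ["health"] * 1

print(Counter(toy_labels).most_common())  # categories from most to least frequent
weights = class_weights(toy_labels)
print(weights)  # rarer categories receive larger weights
```

Weights like these can be passed to a classifier's loss function so that mistakes on rare categories count more, which is one option alongside oversampling or collecting more data.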
This dataset consists of news articles in Chichewa. Chichewa is a Bantu language spoken in much of Southern, Southeast, and East Africa, namely the countries of Malawi and Zambia, where it is an official language.
The dataset contains a collection of 3,482 articles with over 930,000 words and over 48,000 sentences. The Chichewa news articles have been categorized into 19 categories such as education, law/order, politics, culture, arts and crafts, farming, economy, and wildlife.
You can also download this dataset from the following link:
This is a parallel corpus dataset for machine translation from French to Ewe and French to Fongbe.
Fongbe and Ewe are Niger-Congo languages. Fongbe is spoken in Benin by approximately 4.1 million speakers, while Ewe is spoken in Togo and southeastern Ghana by approximately 4.5 million speakers.
This dataset contains roughly 23,000 French to Ewe and 53,000 French to Fongbe parallel sentences, collected from blogs, tales, newspapers, daily conversations, and webpages, and annotated for neural machine translation.
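Before training a translation model on a parallel corpus like this, it often helps to drop empty, duplicated, or badly mismatched sentence pairs. The sketch below assumes the pairs are stored one per line as tab-separated source and target text; that file layout, the `load_parallel_pairs` helper, the length-ratio threshold, and the placeholder target tokens are all assumptions for illustration, so check the actual format of the files you download.

```python
import csv
import io


def load_parallel_pairs(handle, max_length_ratio=3.0):
    """Read tab-separated (source, target) pairs and drop obviously bad ones."""
    pairs, seen = [], set()
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 2:
            continue  # malformed line
        source, target = (field.strip() for field in row)
        if not source or not target:
            continue  # one side is empty
        src_len, tgt_len = len(source.split()), len(target.split())
        if max(src_len, tgt_len) / min(src_len, tgt_len) > max_length_ratio:
            continue  # lengths differ too much to be a plausible translation
        if (source, target) in seen:
            continue  # exact duplicate
        seen.add((source, target))
        pairs.append((source, target))
    return pairs


# Placeholder data: French source sentences with made-up target tokens
sample = (
    "Bonjour tout le monde\ttgt1 tgt2 tgt3\n"
    "Merci\t\n"
    "Bonjour tout le monde\ttgt1 tgt2 tgt3\n"
    "Oui\ttgt1 tgt2 tgt3 tgt4 tgt5\n"
)
pairs = load_parallel_pairs(io.StringIO(sample))
print(len(pairs), "pair(s) kept out of 4 lines")
```

With a real file you would pass an open file handle instead of `io.StringIO`, and tune `max_length_ratio` to the language pair, since some languages naturally use more words than others.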
The dataset consists of 10,054 parallel Yorùbá-English sentences from different domains such as news, Yorùbá proverbs, movie transcripts, localization translation, and books.
The dataset consists of 15,022 parallel English-Luganda sentences. It was created by a team of researchers from the AI & Data Science Research Lab at Makerere University together with a team of Luganda teachers, students, and freelancers.
Sentiment analysis datasets are used for the interpretation and classification of emotions (positive, negative, and neutral) within text data using different text analysis methods.
Sentiment analysis has found applications in various fields such as social media monitoring, brand monitoring, customer service, and market research.
Below is the list of African language datasets for sentiment analysis.
The creators of this dataset gathered comments from social media platforms that express sentiment about popular topics. They extracted 100k comments using public streaming APIs.
The collected comments were manually annotated using an overall polarity.
from datasets import load_dataset
dataset = load_dataset("tunizi")
The ASR dataset has a total of 6,683 audio files and transcriptions. It was created by a team of researchers from Baamtu Datamation, a company in Senegal.
The dataset was created by 895 speakers of different genders and ages on the Common Voice platform. It has a total of 1,183 hours of validated speech, and its current size is 40 GB.
The dataset contains news headlines (i.e., short texts) in the Setswana and Sepedi languages. Setswana is a Bantu language spoken in Southern Africa by about 8.2 million people, while Sepedi is mainly spoken in the northern parts of South Africa by 4.7 million people.
Since the dataset is not annotated, you can use it to create a topic model that clusters news data into different news topics such as sports, politics, culture, and entertainment.
I hope you found this list of African language datasets useful and that you can use them in your next data science project. I will be happy to see what applications and solutions you create from these datasets. If you couldn't find the dataset you need, please check out the following links: