News in Swahili is an important part of the media sphere in Tanzania and other countries in East Africa. News contributes to education, technology, and the economic growth of a country, and news in local languages plays an important cultural role in many African countries.
In the modern age, African languages in news and other spheres are at risk of being lost as English becomes the dominant language in online spaces.
Open-source Swahili text datasets are rarely available in Tanzania, which leaves the country behind in the creation of NLP technologies that address African challenges.
The goal of this project was to build an open-source text dataset in the Swahili language focused on news articles. I mainly focused on collecting news in different categories such as local, international, business and finance, health, sports, and entertainment.
(a) Collect websites with Swahili news
The first phase of the project is to find and collect different websites that provide news in the Swahili language. I was able to find some websites that provide news in Swahili only and others in different languages including Swahili.
(b) Understand policy and copyright.
In this phase of the project, I focused on understanding each website's policies and copyright terms, that is, what I can and cannot do with its content. Data Protection Guidelines to consider for data collection and data mining helped me to understand this process.
(c) Understand the structure of the news website
Each news website is built with different web technologies such as PHP, Python, WordPress, Django, JavaScript, etc. The main task was to analyze each website's source code using the web browser's "view page source" tool. I looked at different HTML tags to find news titles, categories, and the links that lead to the content of each title.
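To illustrate the kind of tag inspection described above, here is a minimal sketch using only Python's standard library. The markup in SAMPLE_HTML, the "category" class name, and the HeadlineParser and extract_headlines names are all made up for illustration; every real news website uses its own tags, so this is not the structure of any site used in the project.

```python
from html.parser import HTMLParser

# Hypothetical listing-page markup; real sites will differ.
SAMPLE_HTML = """
<article><span class="category">Michezo</span>
  <h2><a href="/michezo/habari-1">Timu ya taifa yashinda</a></h2></article>
<article><span class="category">Afya</span>
  <h2><a href="/afya/habari-2">Kampeni ya chanjo yaanza</a></h2></article>
"""


class HeadlineParser(HTMLParser):
    """Collect category, title, and link for each <article> element."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record being built for the open <article>
        self._field = None    # text field currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "article":
            self._current = {"category": "", "title": "", "link": ""}
        elif self._current is None:
            return  # ignore tags outside an <article>
        elif tag == "span" and attrs.get("class") == "category":
            self._field = "category"
        elif tag == "a" and attrs.get("href"):
            self._current["link"] = attrs["href"]
            self._field = "title"

    def handle_data(self, data):
        # Only keep text that sits inside a field we are tracking.
        if self._current is not None and self._field:
            self._current[self._field] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("span", "a"):
            self._field = None
        elif tag == "article" and self._current is not None:
            self.records.append(self._current)
            self._current = None


def extract_headlines(html_text):
    """Return a list of dicts with 'category', 'title', and 'link' keys."""
    parser = HeadlineParser()
    parser.feed(html_text)
    parser.close()
    return parser.records


records = extract_headlines(SAMPLE_HTML)
for record in records:
    print(record)
```

Each record's link can then be followed to fetch the full article text, once the website's policy has been confirmed to allow it.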
(d) Data Collection
News articles were collected by using different tools and programming languages. These tools are as follows:
(e) Analyzing and Cleaning
The collected news articles were analyzed and cleaned to remove irrelevant information, such as HTML tags and symbols, that was collected during the scraping process.
The dataset is available in two different versions. The first version (v0.1) was released on December 1, 2020, and can be downloaded from the Zenodo platform.
Another way is to use the datasets Python library from Hugging Face:
from datasets import load_dataset
dataset = load_dataset("swahili_news")
My plan is to find more news sources in the Swahili language and collect more news articles on the topics mentioned above, in order to bring more balance among news topics in the dataset.
This will help AI practitioners to create useful machine learning models that perform well in test environments.