The Datasets library from Hugging Face provides a very efficient way to load and process NLP datasets from raw files or in-memory data. These NLP datasets have been shared by research and practitioner communities across the world.
You can also load various evaluation metrics used to check the performance of NLP models on numerous tasks.
If you are working in Natural Language Processing and want an NLP dataset for your next project, I recommend you use this library from Hugging Face.
You can use this library with other popular machine learning frameworks, such as NumPy, pandas, PyTorch, and TensorFlow. You will learn more in the examples below.
The NLP datasets cover many different tasks, such as text classification, question answering, machine translation, and summarization.
You can install the library with pip:

pip install datasets

Or with conda:

conda install -c huggingface -c conda-forge datasets
To view the list of available datasets, you can use the list_datasets() function from the library.

from datasets import list_datasets, load_dataset
from pprint import pprint
datasets_list = list_datasets()
pprint(datasets_list,compact=True)
You can also view a list of the datasets with details by setting the with_details argument to True in the list_datasets() function as follows.

datasets_list = list_datasets(with_details=True)
pprint(datasets_list)
To load a dataset from the library, you pass its name to the load_dataset() function. The function will download the dataset (if it is not already in the cache), process it, and save it in the cache directory.

dataset = load_dataset('ethos', 'binary')
"ETHOS: onlinE haTe speecH detectiOn dataSet. This repository contains a dataset for hate speech detection on social media platforms, called Ethos. There are two variations of the dataset:"-
Note: Each dataset can have several configurations that define the sub-part of the dataset you can select. For example, the ethos dataset has two configurations: binary and multilabel.
print(dataset)
You can load an individual split by passing the split parameter to load_dataset().

ethos_train = load_dataset('ethos', 'binary', split='train')
ethos_validation = load_dataset('ethos', 'binary', split='validation')
This will save the training set in the ethos_train variable and the validation set in the ethos_validation variable.

Note: Not all datasets have train, validation, and test sets; some contain only a train set, so you need to read more about the dataset you want to download on the Hugging Face Hub.
The library can also load local files. For example, to load a CSV file:

dataset = load_dataset('csv', data_files='my_file.csv')
You can also create a dataset from in-memory data, such as a pandas DataFrame.

from datasets import Dataset
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3]})
df_dataset = Dataset.from_pandas(df)
print(df_dataset)
When you load already downloaded data from the cache directory, you can control how the load_dataset() function handles it by setting its download_mode parameter. The parameter accepts the following options: "reuse_dataset_if_exists" (the default), "reuse_cache_if_exists", and "force_redownload".

dataset = load_dataset('ethos', 'binary', download_mode="force_redownload")
You can set the format of a dataset instance by using the set_format() function, which takes arguments including the following.

type: an optional string that defines the type of the objects that should be returned by datasets.Dataset.__getitem__()

columns: an optional list of the column names to include in the output

ethos_train.set_format(type='pandas', columns=['comment', 'label'])
In the above example, we set the format type as "pandas".