Top 15 Chatbot Datasets for NLP Projects

by Limarc Ambalina, December 2nd, 2020

An effective chatbot requires a massive amount of training data to resolve user inquiries quickly without human intervention. However, the primary bottleneck in chatbot development is obtaining realistic, task-oriented dialog data to train these machine learning-based systems.

We've put together a list of the best conversational datasets for training a chatbot, broken down into question-answer data, customer support data, dialogue data, and multilingual data.

Question-Answer Datasets for Chatbot Training

This corpus includes Wikipedia articles, manually generated factoid questions based on them, and manually generated answers to these questions, for use in academic research.

A publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering. To reflect the true information needs of general users, the authors used Bing query logs as the question source. Each question is linked to a Wikipedia page that potentially has the answer.

This page features manually curated QA datasets from Yahoo Answers.

TREC has had a question answering track since 1999. In each track, the task was defined such that systems had to retrieve small snippets of text containing an answer to open-domain, closed-class questions.
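Question and sentence pairs of the kind described above are often easier to work with once they are grouped by question. The following is a minimal sketch, assuming each row is a `(question, sentence, label)` tuple in which `label` is 1 when the sentence answers the question. That row layout, the function name `group_answers`, and the toy rows are illustrative assumptions, not the format of any dataset listed here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical row layout: (question, candidate sentence, label).
Row = Tuple[str, str, int]

def group_answers(rows: Iterable[Row]) -> Dict[str, List[str]]:
    """Map each question to the candidate sentences labeled as answers."""
    answers: Dict[str, List[str]] = defaultdict(list)
    for question, sentence, label in rows:
        bucket = answers[question]  # registers the question even if nothing matches
        if label == 1:
            bucket.append(sentence)
    return dict(answers)

# Made-up toy rows, for illustration only.
toy_rows = [
    ("What is the capital of France?", "Paris is the capital of France.", 1),
    ("What is the capital of France?", "France is a country in Europe.", 0),
    ("Who wrote Hamlet?", "Hamlet is a tragedy.", 0),
]

grouped = group_answers(toy_rows)
for question, sentences in grouped.items():
    print(question, "->", sentences)
```

Questions with no labeled answer are kept with an empty list, which makes it easy to count or filter unanswerable questions before training.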

Customer Support Datasets for Chatbot Training

Consists of almost one million two-person conversations extracted from the Ubuntu chat logs, in which users receive technical support for various Ubuntu-related problems. The full dataset contains 930,000 dialogues and over 100,000,000 words.

A collection of travel-related customer service data from four sources: the conversation logs of three commercial customer service IVAs and the airline forums on TripAdvisor.com during August 2016.

This dataset on Kaggle includes over 3 million tweets and replies from the biggest brands on Twitter.

Dialogue Datasets for Chatbot Training

This automatically generated IRC chat log is available in RDF, going back to 2004, on a daily basis, including timestamps and nicknames.

This corpus contains a large, metadata-rich collection of fictional conversations extracted from raw movie scripts: 220,579 conversational exchanges between 10,292 pairs of movie characters, involving 9,035 characters from 617 movies.

The dataset contains more than 2,000 dialogues for a competition in which human evaluators, recruited via the crowdsourcing platform Yandex.Toloka, chatted with bots submitted by teams.

This dataset includes approximately 249,000 words of transcription, audio, and timestamps at the level of individual intonation units.

This corpus consists of 10,567 posts out of approximately 500,000 posts gathered from various online chat services in accordance with their terms of service.

An open dialogue dataset in which the conversation aims at accomplishing a task or making a decision, specifically finding flights and a hotel. The dataset contains complex conversations and decision-making covering 250+ hotels, flights, and destinations.

A fully labeled collection of written conversations spanning multiple domains and topics. The dataset contains 10k dialogues and is at least one order of magnitude larger than all previous annotated task-oriented corpora.
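Multi-turn dialogue data like the corpora above is commonly converted into (context, response) pairs before training. The following is a minimal sketch of that step, assuming a dialogue is already available as an ordered list of utterance strings. The function name `dialogue_to_pairs`, the `max_context` parameter, and the `" [SEP] "` separator are illustrative choices, not part of any dataset listed here.

```python
from typing import List, Tuple

def dialogue_to_pairs(turns: List[str], max_context: int = 3) -> List[Tuple[str, str]]:
    """Build (context, response) pairs from one dialogue.

    Each utterance after the first becomes a response; its context is the
    preceding utterances (at most `max_context` of them), joined by a separator.
    """
    pairs = []
    for i in range(1, len(turns)):
        context = " [SEP] ".join(turns[max(0, i - max_context):i])
        pairs.append((context, turns[i]))
    return pairs

# Made-up toy dialogue, for illustration only.
example = [
    "Hi, I need a flight to Paris.",
    "Sure, when would you like to leave?",
    "Next Friday.",
]

pairs = dialogue_to_pairs(example, max_context=2)
for context, response in pairs:
    print(repr(context), "->", repr(response))
```

Limiting the context window keeps inputs short; a larger `max_context` preserves more history at the cost of longer training examples.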

Multilingual Chatbot Training Datasets

This corpus was created for social media text normalization and translation. It was built by randomly selecting 2,000 messages from the NUS English SMS corpus and translating them into formal Chinese.

These datasets, available in English and Italian, contain negative feedback from customers in which they state their reasons for dissatisfaction with a given company.

Still can't find the data you need? Lionbridge AI provides custom data for machine learning in 300 languages to help make your conversations more interactive and supportive for customers worldwide.
Lead image via Volodymyr on Unsplash. Originally published by Alex Nguyen and reposted with permission.