Our case study Question Answering System in Python using BERT NLP [1] and our BERT-based Question and Answering system demo [2], developed in Python + Flask, became hugely popular, garnering hundreds of visitors per day. We received many appreciative emails praising the QnA demo, along with a number of people asking how we created it. To this day, we keep getting requests on how to develop such a QnA system using the BERT pre-trained model open-sourced by Google.
To start with, the README file in the official BERT GitHub repository provides a good amount of information about how to fine-tune the model on SQuAD 2.0, but we could see that developers are still facing issues. So, we decided to publish a step-by-step tutorial on fine-tuning the BERT pre-trained model and running inference, that is, generating answers from a given paragraph and questions, on Colab using a TPU.
In this tutorial, we are not going to cover how to create a web-based interface using Python + Flask. We’ll just cover fine-tuning and inference on Colab using a TPU. You can create your own interface using Flask or Django.
Overview
In this tutorial we will see how to perform a fine-tuning task on SQuAD using Google Colab. For that we will use the BERT GitHub repository. The BERT repository includes:
1) Change Runtime to TPU
On the main menu, click Runtime and select Change runtime type. Set “TPU” as the hardware accelerator. The screenshot below shows how to change the runtime to TPU.
After clicking “Change runtime type”, select TPU from the dropdown, as shown in the figure below.
BERT, or Bidirectional Encoder Representations from Transformers, is a method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. The details are in the BERT academic paper.
BERT has two stages: Pre-training and fine-tuning.
Pre-training is fairly expensive (four days on 4 to 16 Cloud TPUs), but it is a one-time procedure. The BERT authors have released a number of pre-trained models, so most NLP researchers will never need to pre-train their own model from scratch.
Fine-tuning is inexpensive. One can replicate all the results given in the paper in at most 1 hour on a single Cloud TPU, or in a few hours on a GPU. For example, SQuAD can be trained in around 30 minutes on a single Cloud TPU to achieve a Dev F1 score of 91.0%.
So our first step is to clone the BERT GitHub repository, as shown below. Then move into the bert directory using the “cd” command.
!git clone https://github.com/google-research/bert.git
%cd bert
3) Download the BERT PRETRAINED MODEL
BERT Pretrained Model List:
BERT has been released as BERT-Base and BERT-Large models, each in uncased and cased versions. Uncased means that the text is converted to lowercase before WordPiece tokenization, e.g., John Smith becomes john smith, and accent markers are stripped. Cased means that the true case and accent markers are preserved.
When using a cased model, make sure to pass --do_lower_case=False at the time of training.
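To make the cased/uncased difference concrete, here is a minimal sketch of the kind of normalization an uncased model expects: lowercasing followed by accent stripping. The function name uncased_normalize is ours, and this only approximates the preprocessing in BERT's tokenizer, which also handles whitespace, punctuation, and WordPiece splitting.

```python
import unicodedata


def uncased_normalize(text: str) -> str:
    """Lowercase the text and drop accent marks (hypothetical helper)."""
    lowered = text.lower()
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Category "Mn" (nonspacing mark) covers the combining accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(uncased_normalize("John Smith"))
print(uncased_normalize("Café Müller"))
```

A cased model would receive the text unchanged, which is why the lowercasing flag has to match the model you downloaded.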
You can download any model of your choice. We have used the BERT-Large-Uncased Model.
!wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip
# Unzip the pretrained model
!unzip uncased_L-24_H-1024_A-16.zip
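The directory name of the unzipped model encodes its size: L is the number of Transformer layers, H the hidden size, and A the number of attention heads. As a small sketch, the hypothetical helper below (parse_bert_model_name is our own name, not part of the BERT repository) decodes names that follow the cased/uncased pattern used here; other released models, such as the multilingual ones, use different prefixes and would not match.

```python
import re

# Pattern for names such as "uncased_L-24_H-1024_A-16" (assumed naming scheme).
_MODEL_NAME = re.compile(
    r"(?P<casing>uncased|cased)_L-(?P<layers>\d+)_H-(?P<hidden>\d+)_A-(?P<heads>\d+)"
)


def parse_bert_model_name(name: str) -> dict:
    """Decode casing, layer count, hidden size and head count from a model name."""
    match = _MODEL_NAME.fullmatch(name)
    if match is None:
        raise ValueError("Unrecognized model name: {}".format(name))
    return {
        "cased": match.group("casing") == "cased",
        "layers": int(match.group("layers")),
        "hidden_size": int(match.group("hidden")),
        "attention_heads": int(match.group("heads")),
    }


print(parse_bert_model_name("uncased_L-24_H-1024_A-16"))
```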
4) Download the SQUAD2.0 Dataset
For the Question Answering task, we will be using the SQuAD2.0 dataset.
SQuAD (Stanford Question Answering Dataset) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
SQuAD2.0 combines the 100,000+ questions in SQuAD1.1 with over 50,000 new, unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. You can download the dataset from the SQuAD site.
# Download the SQuAD train and dev datasets
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
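Once the files are downloaded, it can help to check the mix of answerable and unanswerable questions yourself. The sketch below walks the SQuAD JSON structure (data, paragraphs, qas, is_impossible) and counts both kinds. The helper name count_questions and the tiny sample are ours, for illustration only; the real file is read only if it exists in the working directory.

```python
import json
import os


def count_questions(squad: dict) -> dict:
    """Count answerable and unanswerable questions in a SQuAD-format dict."""
    answerable = 0
    unanswerable = 0
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # SQuAD1.1 files have no is_impossible field, so default to False.
                if qa.get("is_impossible", False):
                    unanswerable += 1
                else:
                    answerable += 1
    return {"answerable": answerable, "unanswerable": unanswerable}


# Made-up two-question sample in the SQuAD2.0 layout.
tiny_sample = {
    "version": "v2.0",
    "data": [{
        "title": "example",
        "paragraphs": [{
            "context": "Google was founded in 1998.",
            "qas": [
                {"id": "q1", "question": "When was Google founded?",
                 "is_impossible": False,
                 "answers": [{"text": "1998", "answer_start": 22}]},
                {"id": "q2", "question": "Who is the CFO?",
                 "is_impossible": True, "answers": []},
            ],
        }],
    }],
}
print(count_questions(tiny_sample))

if os.path.exists("train-v2.0.json"):
    with open("train-v2.0.json", "r", encoding="utf-8") as f:
        print(count_questions(json.load(f)))
```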
5) Set up your TPU environment
import datetime
import json
import os
import pprint
import random
import string
import sys
import tensorflow as tf
assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is => ', TPU_ADDRESS)
from google.colab import auth
auth.authenticate_user()
with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())
  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
  # Now credentials are set for all future sessions on this TPU.
6) Create an output directory
Prerequisite: You will need a GCP (Google Cloud Platform) account and a GCS (Google Cloud Storage) bucket to run the Colab file. Please follow the Google Cloud documentation to create a GCP account and a GCS bucket. You get $300 of free credit to start with any GCP product.
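If the model files are read from the local file system instead of a GCS bucket, training on the TPU fails with an error like this:
ERROR:tensorflow:Error recorded from training_loop: From /job:worker/replica:0/task:0: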
Unsuccessful TensorSliceReader constructor: Failed to get matching files on uncased_L-24_H-1024_A-16/bert_model.ckpt: Unimplemented: File system scheme '[local]' not implemented (file: 'uncased_L-24_H-1024_A-16/bert_model.ckpt')
[[node checkpoint_initializer_14 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
BUCKET = 'bertnlpdemo' #@param {type:"string"}
assert BUCKET, '*** Must specify an existing GCS bucket name ***'
output_dir_name = 'bert_output' #@param {type:"string"}
BUCKET_NAME = 'gs://{}'.format(BUCKET)
OUTPUT_DIR = 'gs://{}/{}'.format(BUCKET,output_dir_name)
tf.gfile.MakeDirs(OUTPUT_DIR)
print('***** Model output directory: {} *****'.format(OUTPUT_DIR))
7) Move Pretrained Model to GCS Bucket
We need to move the pre-trained model to the GCS (Google Cloud Storage) bucket, because the local file system is not supported on TPU. If you don’t move your pre-trained model to the bucket, you may face the “File system scheme '[local]' not implemented” error shown above.
The gsutil mv command allows you to move data between your local file system and the cloud, move data within the cloud, and move data between cloud storage providers.
!gsutil mv /content/bert/uncased_L-24_H-1024_A-16 $BUCKET_NAME
8) Training
Below is the command to run the training. To run it on a TPU, make sure the following two flags are set: use_tpu must be True, and tpu_name must be the TPU address we found above. The address in the command below is from our session; replace it with your own.
--use_tpu=True
--tpu_name=YOUR_TPU_ADDRESS
!python run_squad.py \
--vocab_file=$BUCKET_NAME/uncased_L-24_H-1024_A-16/vocab.txt \
--bert_config_file=$BUCKET_NAME/uncased_L-24_H-1024_A-16/bert_config.json \
--init_checkpoint=$BUCKET_NAME/uncased_L-24_H-1024_A-16/bert_model.ckpt \
--do_train=True \
--train_file=train-v2.0.json \
--do_predict=True \
--predict_file=dev-v2.0.json \
--train_batch_size=24 \
--learning_rate=3e-5 \
--num_train_epochs=2.0 \
--use_tpu=True \
--tpu_name=grpc://10.1.118.82:8470 \
--max_seq_length=384 \
--doc_stride=128 \
--version_2_with_negative=True \
--output_dir=$OUTPUT_DIR
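Two pieces of arithmetic sit behind the flags above, and the sketch below works through both. The first shows how train_batch_size and num_train_epochs turn into a step count, which is where the checkpoint number used later for prediction (model.ckpt-10859) comes from. The second shows how max_seq_length and doc_stride split a long paragraph into overlapping windows. This follows the logic of run_squad.py as we understand it and is not a copy of it. The function names are ours, and the figure of 130,319 training questions in train-v2.0.json is an assumption you should check against your own download.

```python
def num_train_steps(num_examples: int, train_batch_size: int,
                    num_train_epochs: float) -> int:
    """Total optimizer steps: examples / batch size * epochs, truncated."""
    return int(num_examples / train_batch_size * num_train_epochs)


def doc_spans(num_doc_tokens: int, num_query_tokens: int,
              max_seq_length: int = 384, doc_stride: int = 128) -> list:
    """Return (start, length) windows over a tokenized paragraph."""
    # Three positions are reserved for [CLS] and the two [SEP] tokens.
    max_tokens_for_doc = max_seq_length - num_query_tokens - 3
    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reached the end of the paragraph
        start += min(length, doc_stride)
    return spans


# Assumed: 130,319 questions in train-v2.0.json, batch size 24, 2 epochs.
print(num_train_steps(130319, 24, 2.0))  # 10859

# A 500-token paragraph with a 10-token question needs three windows.
print(doc_spans(500, 10))  # [(0, 371), (128, 371), (256, 244)]
```

A smaller doc_stride gives more overlap between windows, so an answer near a window boundary is more likely to appear whole in at least one window, at the cost of more windows to process.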
Create Testing File
We create input_file.json as a blank JSON file and then add data to it in the SQuAD dataset format.
- touch is used to create a file
- %%writefile is used to write a file in Colab
!touch input_file.json
%%writefile input_file.json
{
  "version": "v2.0",
  "data": [
    {
      "title": "your_title",
      "paragraphs": [
        {
          "qas": [
            {
              "question": "Who is current CEO?",
              "id": "56ddde6b9a695914005b9628",
              "is_impossible": ""
            },
            {
              "question": "Who founded google?",
              "id": "56ddde6b9a695914005b9629",
              "is_impossible": ""
            },
            {
              "question": "when did IPO take place?",
              "id": "56ddde6b9a695914005b962a",
              "is_impossible": ""
            }
          ],
          "context": "Google was founded in 1998 by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University in California. Together they own about 14 percent of its shares and control 56 percent of the stockholder voting power through supervoting stock. They incorporated Google as a privately held company on September 4, 1998. An initial public offering (IPO) took place on August 19, 2004, and Google moved to its headquarters in Mountain View, California, nicknamed the Googleplex. In August 2015, Google announced plans to reorganize its various interests as a conglomerate called Alphabet Inc. Google is Alphabet's leading subsidiary and will continue to be the umbrella company for Alphabet's Internet interests. Sundar Pichai was appointed CEO of Google, replacing Larry Page who became the CEO of Alphabet."
        }
      ]
    }
  ]
}
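If you would rather generate this file from Python than type it by hand, the following sketch builds the same structure from a paragraph and a list of questions. The helper build_squad_input is hypothetical and not part of the BERT repository. It generates random ids with uuid and sets is_impossible to the boolean False, which is the type the SQuAD2.0 dataset itself uses; both are our choices for illustration.

```python
import json
import uuid


def build_squad_input(title: str, context: str, questions: list) -> dict:
    """Wrap one paragraph and its questions in the SQuAD v2.0 layout."""
    qas = [
        {"question": question, "id": uuid.uuid4().hex, "is_impossible": False}
        for question in questions
    ]
    return {
        "version": "v2.0",
        "data": [
            {"title": title, "paragraphs": [{"qas": qas, "context": context}]}
        ],
    }


if __name__ == "__main__":
    payload = build_squad_input(
        "your_title",
        "Google was founded in 1998 by Larry Page and Sergey Brin.",
        ["Who founded google?", "When was Google founded?"],
    )
    # Write the file that the prediction step reads.
    with open("input_file.json", "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2)
```

Each question needs a unique id, because the predictions are reported per id.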
Prediction
Below is the command to run your own custom prediction. You can change input_file.json by providing your own paragraph and questions, and then execute the command below. The init_checkpoint is the fine-tuned checkpoint written to OUTPUT_DIR during training.
!python run_squad.py \
--vocab_file=$BUCKET_NAME/uncased_L-24_H-1024_A-16/vocab.txt \
--bert_config_file=$BUCKET_NAME/uncased_L-24_H-1024_A-16/bert_config.json \
--init_checkpoint=$OUTPUT_DIR/model.ckpt-10859 \
--do_train=False \
--max_query_length=30 \
--do_predict=True \
--predict_file=input_file.json \
--predict_batch_size=8 \
--n_best_size=3 \
--max_seq_length=384 \
--doc_stride=128 \
--output_dir=output/
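After the command finishes, the answers still have to be matched back to the questions. As far as we know, run_squad.py writes a predictions.json file into the output directory that maps each question id to its predicted answer text. Treat that file name and layout as an assumption and check your own output/ folder. The sketch below pairs questions with answers; pair_answers is our own helper, and the sample prediction is made up.

```python
import json
import os


def pair_answers(input_data: dict, predictions: dict) -> list:
    """Return (question, answer) pairs; the answer is '' if the id is missing."""
    pairs = []
    for article in input_data["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                pairs.append((qa["question"], predictions.get(qa["id"], "")))
    return pairs


# Made-up sample, only to show the shapes involved.
sample_input = {"data": [{"paragraphs": [{
    "context": "Google was founded in 1998 by Larry Page and Sergey Brin.",
    "qas": [{"id": "q1", "question": "Who founded google?"}],
}]}]}
sample_predictions = {"q1": "Larry Page and Sergey Brin"}
print(pair_answers(sample_input, sample_predictions))

# Assumed file locations, matching the commands in this tutorial.
if os.path.exists("input_file.json") and os.path.exists("output/predictions.json"):
    with open("input_file.json", "r", encoding="utf-8") as f:
        questions = json.load(f)
    with open("output/predictions.json", "r", encoding="utf-8") as f:
        answers = json.load(f)
    for question, answer in pair_answers(questions, answers):
        print("{} -> {}".format(question, answer))
```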
Previously published at https://www.pragnakalp.com/nlp-tutorial-setup-question-answering-system-bert-squad-colab-tpu/
References