Introduction
On a recent project I was building a binary classification model to identify competitor products that were similar to our own, ultimately to ensure that the pricing of our products was optimised within the market. In the world of asset management it can be difficult to compare like-for-like investment solutions across different asset managers: no database exists that will tell me, for example, that our European market neutral strategy is really the same as a competitor's similarly named strategy, and so a machine learning approach was required.
I approached the binary classification model in a standard way, using a range of numerical and categorical variables as features, however I wasn't having a great deal of success. After discussing with colleagues it was suggested to me that one feature of the objects I was working
with might contain a useful encoding of the binary class that I was ultimately trying to predict.
The challenge was that this feature was a non-categorical, natural language collection of words and therefore not a common feature variable type used in modelling.
I was attracted, however, to the idea of distilling the problem to 'simply' using one natural language feature to predict a binary outcome variable, i.e. the data set could be reduced to
something like Table 1 below:
Table 1: Illustrative data
| Feature | Target |
| --- | --- |
| Word1 Word2 Word3 | 1 |
| Word4 Word5 Word6 Word7 | 0 |
| Word8 Word9 | 1 |
| Word11 | 0 |
| ... | ... |
I had been reading about Google's
BERT model, a (massive) pretrained neural network that was achieving state of the art results
on many NLP problems and wondered if this tool could help in my case.
BERT is a type of transfer learning model: it processes natural language and passes along some information it has extracted
from it as an output that can then be transferred as an input into an additional classification model.
In my case the approach had two steps: first (Step 1) run my data set through the BERT model, then (Step 2) build and compare
logistic regression, random forest, gradient boosted (xgboost) and neural network
models as the final classification layer.
This post describes Step 1, namely how to get the BERT model running on your machine and extracting information from natural language.
Part 2 in this series then describes the binary classifiers I used.
To preserve confidentiality, the code below illustrates this approach on a popular movie review sentiment analysis data set that is structurally the same as
Table 1
and still accurately illustrates the process above.
Implementing in Python
Link to my GitHub repo:
NLP-with-BERT
Step 1: Extract information with BERT
The key Python module used here is
transformers
Import modules and load data. I have included a copy of the data in my GitHub repo, which the code below refers to; this was originally located at the following repository:
https://github.com/AcademiaSinicaNLPLab/sentiment_dataset/blob/master/data/stsa.binary.train
#############################################################
# Title: Binary Classification with BERT and some Classifiers
# Author: Thomas Handscomb
#############################################################
# Import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import transformers
import tensorflow
import keras
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder
from keras.models import Sequential
from keras.layers import Dense
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## Helpful control of display options in the console
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Control the number of columns displayed in the console output
pd.set_option('display.max_columns', 20)
# Control the width of columns displayed in the console output
pd.set_option('display.width', 1000)
#########################
# STEP 1: Load BERT model
#########################
#~~~~~~~~~~~~~~~~~~~~~~~~~
## Bring in data and clean
#~~~~~~~~~~~~~~~~~~~~~~~~~
df_Sentiment_Train_Full = \
pd.read_csv("https://github.com/ThomasHandscomb/NLP-with-BERT/raw/master/train.csv"
, encoding = "ISO-8859-1")
# Rename column headings
colnamelist = ['Text', 'Label']
df_Sentiment_Train_Full.columns = colnamelist
# Take a random sample of the data frame to speed up processing in this example
frac = 0.10
df_Sentiment_Train = \
df_Sentiment_Train_Full.sample(frac=frac, replace=False, random_state=1)
df_Sentiment_Train.reset_index(drop=True, inplace = True)
Prepare the data set for loading into BERT. First we define which NLP model and pre-trained weights to use. In this example I use DistilBERT, the light-weight version of
the full BERT model, to speed up run time; the process is exactly the same with the full BERT model, simply replacing the DistilBERT classes and weights with their BERT equivalents.
NLP_model_class = transformers.DistilBertModel
NLP_tokenizer_class = transformers.DistilBertTokenizer
NLP_pretrained_weights = 'distilbert-base-uncased'
# Load pretrained tokenizer and model
NLP_tokenizer = NLP_tokenizer_class.from_pretrained(NLP_pretrained_weights)
NLP_model = NLP_model_class.from_pretrained(NLP_pretrained_weights)
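For completeness, if you wanted to run the full-size BERT model instead, the only change should be to swap in the corresponding BERT classes and the standard 'bert-base-uncased' weights. The alternative set-up is shown commented out below as an illustration; it is not used in the rest of this post:
# Illustrative alternative: the equivalent set-up for the full BERT model
#NLP_model_class = transformers.BertModel
#NLP_tokenizer_class = transformers.BertTokenizer
#NLP_pretrained_weights = 'bert-base-uncased'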
Tokenise the text strings. This converts each word in the text into an integer corresponding to that word in the
BERT dictionary. The tokeniser also adds the special tokens 101 (the [CLS] classification token) at the start of each sequence and 102 (the [SEP] separator token) at the end. An illustration of this process is below:
example_text = pd.Series(['A B C Hello'])
example_tokenized_text = \
example_text.apply((lambda x: NLP_tokenizer.encode(x, add_special_tokens=True)))
example_tokenized_text
Out[20]:
0 [101, 1037, 1038, 1039, 7592, 102]
dtype: object
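To check what these integers represent, the tokeniser can map the ids back to their tokens; this shows that 101 and 102 are the special [CLS] and [SEP] markers rather than words from the text:
# Map the token ids back to their tokens in the BERT vocabulary
NLP_tokenizer.convert_ids_to_tokens(example_tokenized_text[0])
# Expected output along the lines of: ['[CLS]', 'a', 'b', 'c', 'hello', '[SEP]']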
Tokenise the real data
tokenized_text = \
df_Sentiment_Train['Text'].apply((lambda x: NLP_tokenizer.encode(x, add_special_tokens=True)))
The input data for the BERT model needs to be uniform in width, i.e. all entries need
to have the same length. To achieve this we pad each tokenized_text value with 0's up to the maximum tokenised length in the series.
# Determine the maximum length of the tokenized_text values
max_length = max([len(i) for i in tokenized_text.values])
# Create an array with each tokenised entry padded by 0's to the max length
padded_tokenized_text_array = \
np.array([i + [0]*(max_length-len(i)) for i in tokenized_text.values])
padded_tokenized_text_array.shape
Out[27]: (692, 64)
Define an array identifying which positions are real tokens and which are padding - we use this later, as the attention mask, to distinguish the real data from
the padded [0] data
padding_array = np.where(padded_tokenized_text_array != 0, 1, 0)
#padding_array.shape
The BERT model expects PyTorch tensors as input, so convert the padded_tokenized_text and padding arrays to PyTorch tensors (note that an integer dtype needs to be specified)
padded_tokenized_text_tensor = torch.tensor(padded_tokenized_text_array, dtype = int)
padding_tensor = torch.tensor(padding_array)
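As an aside, more recent versions of the transformers library can tokenise, pad and build the attention mask in a single call; something like the sketch below should produce tensors equivalent to the two built manually above (it is not used in the rest of this post):
# Sketch: one-call tokenisation with padding and attention mask
#encoded = NLP_tokenizer(list(df_Sentiment_Train['Text']), padding=True, return_tensors='pt')
#encoded['input_ids']       # corresponds to padded_tokenized_text_tensor
#encoded['attention_mask']  # corresponds to padding_tensor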
We can view the evolution of a row of data from original text through to padded, tokenised tensor
df_Sentiment_Train.loc[[0]] # Original data
Out[29]:
Text Label
0 peppered with witty dialogue and inventive mom... 1
tokenized_text[0] # Initial tokenised series
Out[30]: [101, 11565, 2098, 2007, 25591, 7982, 1998, 1999, 15338, 3512, 5312, 102]
padded_tokenized_text_array[0] # Padded tokenised array
Out[31]:
array([ 101, 11565, 2098, 2007, 25591, 7982, 1998, 1999, 15338,
3512, 5312, 102, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0])
padded_tokenized_text_tensor[0] # Padded tokenised tensor
Out[32]:
tensor([ 101, 11565, 2098, 2007, 25591, 7982, 1998, 1999, 15338, 3512,
5312, 102, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0])
Pass the processed torch tensors through the BERT model. This can take some processing time
# Disable gradient tracking - we are only running inference here, not training, so there is no need to compute gradients
with torch.no_grad():
    DistilBERT_Output = NLP_model(padded_tokenized_text_tensor,
                                  attention_mask = padding_tensor)
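If your data set is large, or memory is limited, an alternative is to push the rows through the model in smaller batches and concatenate the outputs afterwards; a minimal sketch, with an assumed batch size of 64, is:
# Sketch: run the model in batches to limit memory usage
batch_size = 64
output_chunks = []
with torch.no_grad():
    for start in range(0, padded_tokenized_text_tensor.shape[0], batch_size):
        batch_output = NLP_model(padded_tokenized_text_tensor[start:start + batch_size],
                                 attention_mask = padding_tensor[start:start + batch_size])
        output_chunks.append(batch_output[0])
# Stack the per-batch hidden states back into a single tensor
DistilBERT_Output_batched = torch.cat(output_chunks, dim = 0)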
The full details take some unpacking and are largely beyond the scope of this blog post, however the
DistilBERT_Output object is a 1-tuple whose single
entry is a 3-dimensional tensor with:
- the original number of data set rows as rows
- the maximum number of tokens as columns
- the 768-dimensional hidden state as the depth
print(type(DistilBERT_Output[0]))
print(padded_tokenized_text_tensor.shape)
print(DistilBERT_Output[0].shape)
<class 'torch.Tensor'>
torch.Size([692, 64])
torch.Size([692, 64, 768])
The width corresponds to tokens and the depth of 768 corresponds to the hidden state for each token; this comes from the construction of the massive neural network that
comprises the BERT model - it is the number of nodes in the output layer of the network.
The authors of BERT specify the hidden state vector corresponding to the first token (the [CLS] token) as an aggregate representation of the whole sentence to be used for classification tasks.
That is, for each row of text, the vector of length 768 at the
[0]th token position of the BERT output is the output that should be used as input into a final
classifier model.
We can efficiently slice the
[0]th token position for all rows of the DistilBERT output as follows. These vectors form the feature variable dataframe for the final classification
features_df = pd.DataFrame(np.array(DistilBERT_Output[0][:,0,:]))
features_df.shape
Out[47]: (692, 768)
The target label dataframe is constructed from the original data set
labels_df = df_Sentiment_Train[['Label']]
labels_df.shape
Out[48]: (692, 1)
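If you want to persist these dataframes for re-use in Part 2, rather than re-running the BERT extraction each time, something as simple as the following works (the file names are just illustrative):
# Optional: save the extracted features and labels to csv for the classification step
features_df.to_csv('BERT_features.csv', index = False)
labels_df.to_csv('BERT_labels.csv', index = False)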
Having used the pre-trained BERT model to extract information from our natural language feature we are now in a position to build classifier models to predict the binary
target classes. This is covered in more detail in
Part 2 of this series.