Multiple Category Classification with Keras, TensorFlow, Pandas, NumPy, & Python
By Justin

In this one, we'll be creating a deep neural network, since what it will help us do is find patterns in a large amount of data so we can make the best possible prediction on new data. Before we get started, let's revisit an old adage:
Good data in, good data out. Garbage in, garbage out.
Building a neural network is easy even if you're new to writing code. The part that's hard is getting the data ready, setting the correct parameters, and understanding the math behind the neural network. The thing is, we don't have to understand the math to actually use a neural network.
In this post, we'll do 4 things:
- Pre-process pre-existing data with built-in tools
- Run a neural network with Keras and TensorFlow
- Plot neural network performance
- Save the trained model for continued training and predictions (inference)
Thanks to Peter Nagy for the HUGE inspiration for creating this notebook.
Jupyter Notebook is here | Google Colab
The Jupyter Notebook is an amazing way to run live Python code for data science and deep learning; it's also how this post is formatted. Learn how to create your own Jupyter Notebook server here.
Installation Requirements:
pip install Keras==2.2.4 "pandas>=0.25.0" "numpy<1.17.0" sklearn tensorflow==1.14.0
Simply run !pip install ... within a cell to install the packages inside a Jupyter Notebook.
Full notebook requirements are located at: ../requirements/multiple_category_classification_keras.txt
Recommended Hardware Setup & Guides:
- Nvidia GPU with CUDA/cuDNN: GPUs are critical for processing large matrix operations, which is essentially what's happening inside a neural network (see the short sketch after this list). We have gaming to thank for the huge advancements in GPUs.
- Jupyter Notebook: Using Jupyter makes your life much easier as you're working out your code, especially as it relates to data science and visualizing what you're working on.
- A Jupyter Notebook Server: Can be very useful so you don't have to invest in your own local hardware (like the guide above).
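Here's a minimal NumPy sketch (my own addition, not from the original post) of why that matters: a single dense layer is essentially one big matrix multiplication plus a bias, and those multiplications are exactly what a GPU accelerates. The sizes below are hypothetical, purely for illustration.
python
import numpy as np

# Hypothetical sizes: a batch of 256 samples with 130 features each,
# feeding a dense layer with 64 units.
batch = np.random.rand(256, 130)
weights = np.random.rand(130, 64)
bias = np.random.rand(64)

# The core operation a GPU accelerates: a large matrix multiply plus a bias.
layer_output = batch @ weights + bias
print(layer_output.shape)  # (256, 64)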
python
# !pip install Keras==2.2.4 "pandas>=0.25.0" "numpy<1.17.0" sklearn tensorflow==1.14.0 matplotlib==3.1.1
python
import os
# Disable TensorFlow's log spam; this must be set before tensorflow/keras are imported.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
%matplotlib inline
python
CURRENT_DIR = os.getcwd()
BASE_DIR = os.path.dirname(CURRENT_DIR)
data_dir = os.path.join(BASE_DIR, 'data')
models_dir = os.path.join(BASE_DIR, 'neural_networks', 'models')
checkpoints_dir = os.path.join(BASE_DIR, 'neural_networks', 'checkpoints')
pickles_dir = os.path.join(BASE_DIR, 'neural_networks', 'pickles')
os.makedirs(models_dir, exist_ok=True)
os.makedirs(checkpoints_dir, exist_ok=True)
os.makedirs(pickles_dir, exist_ok=True)
dataset_path = os.path.join(data_dir, 'uci-news-aggregator.csv')
DEFAULT_MODEL_NAME = 'multi'
python
def get_callbacks(name=DEFAULT_MODEL_NAME):
    # Stop training when validation loss hasn't improved for 7 epochs.
    early_stopping = EarlyStopping(monitor='val_loss',
                                   patience=7,
                                   min_delta=0.0001)
    checkpoint_path = os.path.join(checkpoints_dir, name)
    os.makedirs(checkpoint_path, exist_ok=True)
    filepath = os.path.join(
        checkpoint_path,
        'weights.{epoch:02d}-{val_loss:.2f}.hdf5'
    )
    # Save only the best weights (by validation loss) seen so far.
    checkpoint = ModelCheckpoint(filepath,
                                 monitor='val_loss',
                                 save_best_only=True,
                                 save_weights_only=True
                                 )
    callbacks = [early_stopping, checkpoint]
    return callbacks


def save_model(model, name=DEFAULT_MODEL_NAME):
    filename = os.path.join(models_dir, name + '.hdf5')
    return model.save(filename)
python
df = pd.read_csv(dataset_path, usecols=['TITLE', 'CATEGORY'])
df.head()
Handling Duplicates
As part of pre-processing the data, we want to remove as much duplicate data as possible. Let's take a look to see if we have any duplicate titles within our dataset. We use df.TITLE because TITLE is the actual name of the column; df.CATEGORY is the other column.
python
duplicate_title_distribution = df.TITLE.value_counts()[:10]
duplicate_title_distribution
python
most_common_title = duplicate_title_distribution.index[0]
most_common_title
python
df[df['TITLE'].str.contains(most_common_title)][:5]
Here's the most common article title:
The article requested cannot be found! Please refresh your browser or go back ...
That title occurred 145 times! If you're familiar with web scraping, you'll recognize this is probably a 404 error page, but it's clearly in our data and causing issues. Let's just remove all duplicate titles to ensure we don't end up with entries like "PR Newswire" as one of our data points.
Basically, my thought is: if an article title is duplicated, it's probably not a good article title.
python
df = df.drop_duplicates(subset='TITLE', keep=False)
df[df['TITLE'].str.contains(most_common_title)]
Calculate the distribution of article titles per category. We're looking for the category with the least number of titles associated with it. We want an even distribution of titles across categories for best results; otherwise we should anticipate our results being skewed.
python
category_dict = {
    'e': 'entertainment',
    'b': 'business',
    't': 'science/tech',
    'm': 'health'
}
df.CATEGORY.value_counts()
python
max_num_of_labels = df.CATEGORY.value_counts()[-1]  # the count of the least-common category from the value counts above
data_df = df.copy()  # work on a copy so the original DataFrame stays readily available
shuffled_df = data_df.reindex(np.random.permutation(data_df.index))  # always shuffle data when you can
e = shuffled_df[shuffled_df['CATEGORY'] == 'e'][:max_num_of_labels]
b = shuffled_df[shuffled_df['CATEGORY'] == 'b'][:max_num_of_labels]
t = shuffled_df[shuffled_df['CATEGORY'] == 't'][:max_num_of_labels]
m = shuffled_df[shuffled_df['CATEGORY'] == 'm'][:max_num_of_labels]
concated_df = pd.concat([e, b, t, m], ignore_index=True)
# Shuffle the balanced dataset
concated_df = concated_df.reindex(np.random.permutation(concated_df.index))
concated_df['LABEL'] = 0
concated_df.head()
python
# Map each category code to an integer index; to_categorical below turns these into one-hot vectors
concated_df.loc[concated_df['CATEGORY'] == 'e', 'LABEL'] = 0  # e = index 0
concated_df.loc[concated_df['CATEGORY'] == 'b', 'LABEL'] = 1  # b = index 1
concated_df.loc[concated_df['CATEGORY'] == 't', 'LABEL'] = 2  # t = index 2
concated_df.loc[concated_df['CATEGORY'] == 'm', 'LABEL'] = 3  # m = index 3
print(concated_df['LABEL'][:10])
labels = to_categorical(concated_df['LABEL'], num_classes=4)
print(labels[:10])
if 'CATEGORY' in concated_df.keys():
    concated_df = concated_df.drop(['CATEGORY'], axis=1)
'''
[1. 0. 0. 0.] e
[0. 1. 0. 0.] b
[0. 0. 1. 0.] t
[0. 0. 0. 1.] m
'''
python
concated_df.head()
python
n_most_common_words = 8000
max_len = 130
token_filter = '!"#$%&()*+,-./:;<=>?@[\]^_`{|}~'
tokenizer = Tokenizer(num_words=n_most_common_words, filters=token_filter, lower=True)
tokenizer.fit_on_texts(concated_df['TITLE'].values)
python
sequences = tokenizer.texts_to_sequences(concated_df['TITLE'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
python
X = pad_sequences(sequences, maxlen=max_len)
python
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=42)
python
epochs = 10
emb_dim = 128
batch_size = 256
labels[:2]
python
callbacks = get_callbacks(name='multi')
python
print((X_train.shape, y_train.shape, X_test.shape, y_test.shape))
Create your Neural Network
Keras makes it easy to create neural networks on top of the TensorFlow library. You don't have to know exactly how this works to run it.
The code below defines your neural network's architecture.
From the Keras docs: When using the categorical_crossentropy loss, your targets should be in categorical format (e.g. if you have 10 classes, the target for each sample should be a 10-dimensional vector that is all-zeros except for a 1 at the index corresponding to the class of the sample)...
We're doing categorical classification vs. binary classification.
Binary classification means predictions will be one of two options, i.e., cat vs dog or liked vs disliked.
Categorical classification means predicting which "category" your new data belongs to, i.e., cat vs dog vs bird vs chair, or what we did here.
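That categorical format is exactly what to_categorical produced for our labels earlier. As a quick, standalone illustration (my own sketch, not a cell from the original notebook), here's what one-hot targets look like for four classes:
python
from keras.utils.np_utils import to_categorical

# Integer class indices for four hypothetical samples, with 4 possible classes.
example_labels = [0, 2, 1, 3]
print(to_categorical(example_labels, num_classes=4))
# [[1. 0. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 0. 1.]]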
python
model = Sequential()
model.add(Embedding(n_most_common_words, emb_dim, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.7))
model.add(LSTM(64, dropout=0.7, recurrent_dropout=0.7))
model.add(Dense(4, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
print(model.summary())
Run training
This step will take a while. It will be much faster if you lower the number of epochs (above) or use a GPU (as mentioned at the top).
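If you're not sure whether TensorFlow can actually see your GPU, an optional check like the one below (my own addition, not part of the original notebook) can save you from an unexpectedly slow run; it uses the TensorFlow 1.x API to match the version pinned above.
python
import tensorflow as tf

# True only if TensorFlow 1.x finds a usable GPU (requires CUDA/cuDNN).
print(tf.test.is_gpu_available())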
python
training = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_split=0.2, callbacks=callbacks)
python
accr = model.evaluate(X_test, y_test)
print('Test set\n Loss: {:0.3f}\n Accuracy: {:0.3f}'.format(accr[0], accr[1]))
python
accuracy = training.history['acc']
val_accuracy = training.history['val_acc']
loss = training.history['loss']
val_loss = training.history['val_loss']
python
epochs = range(1, len(accuracy) + 1)
plt.plot(epochs, accuracy, 'bo', label='Training accuracy')
plt.plot(epochs, val_accuracy, 'b', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
python
txt = ["Regular fast food eating linked to fertility issues in women"]
seq = tokenizer.texts_to_sequences(txt)
padded = pad_sequences(seq, maxlen=max_len)
pred = model.predict(padded)
label_names = ['entertainment', 'business', 'science/tech', 'health']  # avoid clobbering the one-hot labels array
print(pred, label_names[np.argmax(pred)])
python
def extract_label(index):
    '''
    The labels correspond to exact label indices, in other words, the
    order is absolutely important.
    '''
    labels = ['entertainment', 'business', 'science/tech', 'health']
    return labels[index]
python
def predict(text, model_klass=model):
    seq = tokenizer.texts_to_sequences([text])
    padded = pad_sequences(seq, maxlen=max_len)
    pred = model_klass.predict(padded)
    top_prediction_index = np.argmax(pred)
    predicted_label = extract_label(top_prediction_index)
    predictions = pred.tolist()[0]
    extracted_predictions = [{extract_label(i): "%.2f%%" % (x * 100)} for i, x in enumerate(predictions)]
    top_percent = "%.2f%%" % (predictions[top_prediction_index] * 100)
    print(f"{text}\t\t{top_percent} {predicted_label}")
    return extracted_predictions
python
predict("The startup is doing very well")
python
predict("The startup company is booming")
python
predict("The startup company's growth has been amazing so far.")
python
predict("Sales are through the roof!")
python
predict("Stocks are booming.")
python
predict("That was an incredible performance by the actors")
python
predict("That was an incredible performance!")
python
predict("The health of the company is poor.")
python
predict("The health of the kid is poor.")
Save and Prepare for Reusable Prediction
First, we'll save the model. Then we'll save the tokenizer with pickle. After that, we'll adjust our predict method to be reusable as well.
python
save_model(model, name='multi_category_classification')
python
multi_category_tokenizer_pkl = os.path.join(pickles_dir, 'multi_category_tokenizer.pkl')
multi_category_tokenizer_pkl
python
write_mode = 'wb'
with open(multi_category_tokenizer_pkl, write_mode) as f:
    pickle.dump(tokenizer, f)
Fully Reusable Model
Below is all we need: the trained model and the pickled tokenizer. Now we can use our model at any time.
python
from keras.models import load_model
stored_model = os.path.join(models_dir, 'multi_category_classification')
model_obj = load_model(f'{stored_model}.hdf5')
python
multi_category_tokenizer_pkl = os.path.join(pickles_dir, 'multi_category_tokenizer.pkl')
read_mode = 'rb'
with open(multi_category_tokenizer_pkl, read_mode) as f:
    tokenizer_obj = pickle.load(f)
python
def predict(text, model_obj=None, tokenizer_obj=None):
    assert tokenizer_obj is not None
    assert model_obj is not None
    seq = tokenizer_obj.texts_to_sequences([text])
    padded = pad_sequences(seq, maxlen=max_len)
    pred = model_obj.predict(padded)
    top_prediction_index = np.argmax(pred)
    predicted_label = extract_label(top_prediction_index)
    predictions = pred.tolist()[0]
    extracted_predictions = [{extract_label(i): "%.2f%%" % (x * 100)} for i, x in enumerate(predictions)]
    top_percent = "%.2f%%" % (predictions[top_prediction_index] * 100)
    print(f"{text}\t\t{top_percent} {predicted_label}")
    return extracted_predictions
python
predict("This is working well.", model_obj=model_obj, tokenizer_obj=tokenizer_obj)
python
predict("The market viability is uncertain.", model_obj=model_obj, tokenizer_obj=tokenizer_obj)