Sentiment Analysis with an RNN

Reference: https://github.com/agungsantoso/deep-learning-v2-pytorch

In this notebook, you'll implement a recurrent neural network that performs sentiment analysis.

Using an RNN rather than a strictly feedforward network is more accurate because the network can take the sequence of words into account.

Here we'll use a dataset of movie reviews, accompanied by sentiment labels: positive or negative.

Network Architecture

The architecture for this network is shown below.

First, we'll pass in words to an embedding layer. We need an embedding layer because we have tens of thousands of words, so we'll need a more efficient representation for our input data than one-hot encoded vectors. You should have seen this before from the Word2Vec lesson. You can actually train an embedding with the Skip-gram Word2Vec model and use those embeddings as input, here. However, it's good enough to just have an embedding layer and let the network learn a different embedding table on its own. In this case, the embedding layer is for dimensionality reduction, rather than for learning semantic representations.

After input words are passed to an embedding layer, the new embeddings will be passed to LSTM cells. The LSTM cells will add recurrent connections to the network and give us the ability to include information about the sequence of words in the movie review data.

Finally, the LSTM outputs will go to a sigmoid output layer. We're using a sigmoid function because positive and negative sentiment correspond to 1 and 0, respectively, and a sigmoid will output predicted sentiment values between 0 and 1.

We only care about the sigmoid output from the very last time step; the rest can be ignored. We'll calculate the loss by comparing that final output to the training label (pos or neg).


Load in and visualize the data

Data pre-processing

The first step when building a neural network model is getting your data into the proper form to feed into the network. Since we're using embedding layers, we'll need to encode each word with an integer. We'll also want to clean it up a bit.

You can see an example of the reviews data above. Here are the processing steps we'll want to take:

  • We'll want to get rid of periods and extraneous punctuation.
  • Also, you might notice that the reviews are delimited with newline characters \n. To deal with those, I'm going to split the text into each review using \n as the delimiter.
  • Then I can combine all the reviews back together into one big string.

First, let's remove all punctuation. Then get all the text without the newlines and split it into individual words.
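
Here's a minimal sketch of those steps, assuming the raw review text has been read into a single string called reviews (the variable name is an assumption):

from string import punctuation

# lowercase the text and strip out punctuation
reviews = reviews.lower()
all_text = ''.join([c for c in reviews if c not in punctuation])

# split into individual reviews on the newline delimiter,
# then rejoin everything and split the whole corpus into words
reviews_split = all_text.split('\n')
all_text = ' '.join(reviews_split)
words = all_text.split()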

Encoding the words

The embedding lookup requires that we pass in integers to our network. The easiest way to do this is to create dictionaries that map the words in the vocabulary to integers. Then we can convert each of our reviews into integers so they can be passed into the network.
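
Here's a sketch of one way to build that mapping with a Counter, reusing the words and reviews_split names from the sketch above (starting the integers at 1 leaves 0 free for padding later):

from collections import Counter

# build a vocabulary ordered from most to least frequent word
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)

# map each word to an integer, starting at 1; 0 is reserved for padding
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}

# convert each review into a list of integers
reviews_ints = []
for review in reviews_split:
    reviews_ints.append([vocab_to_int[word] for word in review.split()])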

Test your code

As a test that you've implemented the dictionary correctly, print out the number of unique words in your vocabulary and the contents of the first tokenized review.
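
For example, assuming the vocab_to_int and reviews_ints names from the sketch above:

# sanity-check the vocabulary and the tokenization
print('Unique words:', len(vocab_to_int))
print('First tokenized review:\n', reviews_ints[:1])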

Encoding the labels

Our labels are "positive" or "negative". To use these labels in our network, we need to convert them to 0 and 1.
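
One way to do this, assuming the raw label text was read into a string called labels with one label per line:

import numpy as np

# convert "positive"/"negative" labels into 1/0
labels_split = labels.split('\n')
encoded_labels = np.array([1 if label == 'positive' else 0 for label in labels_split])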

Removing Outliers

As an additional pre-processing step, we want to make sure that our reviews are in good shape for standard processing. That is, our network will expect a standard input text size, and so, we'll want to shape our reviews into a specific length. We'll approach this task in two main steps:

  1. Getting rid of extremely long or short reviews; the outliers
  2. Padding/truncating the remaining data so that we have reviews of the same length.

Before we pad our review text, we should check for reviews of extremely short or long lengths; outliers that may mess with our training.

Okay, a couple of issues here. We seem to have one review with zero length, and the maximum review length is far too many time steps for our RNN. We'll have to remove the super short reviews and truncate the super long ones. This removes outliers and should let our model train more efficiently.
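
Here's a sketch of the removal step, reusing the reviews_ints and encoded_labels names from above (truncation of the long reviews is handled by the padding step below):

import numpy as np

# keep only reviews (and their matching labels) with non-zero length
non_zero_idx = [ii for ii, review in enumerate(reviews_ints) if len(review) != 0]
reviews_ints = [reviews_ints[ii] for ii in non_zero_idx]
encoded_labels = np.array([encoded_labels[ii] for ii in non_zero_idx])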


Padding sequences

To deal with both short and very long reviews, we'll pad or truncate all our reviews to a specific length. For reviews shorter than some seq_length, we'll pad with 0s. For reviews longer than seq_length, we can truncate them to the first seq_length words. A good seq_length, in this case, is 200.

As a small example, if the seq_length=10 and an input review is:

[117, 18, 128]

The resultant, padded sequence should be:

[0, 0, 0, 0, 0, 0, 0, 117, 18, 128]

Your final features array should be a 2D array, with as many rows as there are reviews, and as many columns as the specified seq_length.

This isn't trivial and there are a bunch of ways to do this. But, if you're going to be building your own deep learning networks, you're going to have to get used to preparing your data.
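
Here is one possible implementation as a pad_features function (the function name is just a choice; it assumes zero-length reviews were already removed):

import numpy as np

def pad_features(reviews_ints, seq_length):
    ''' Return a 2D array of the tokenized reviews, each one left-padded
        with zeros or truncated so that it is exactly seq_length long. '''
    features = np.zeros((len(reviews_ints), seq_length), dtype=int)
    for i, row in enumerate(reviews_ints):
        features[i, -len(row):] = np.array(row)[:seq_length]
    return features

seq_length = 200
features = pad_features(reviews_ints, seq_length=seq_length)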

Training, Validation, Test

With our data in nice shape, we'll split it into training, validation, and test sets.
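
A sketch of an 80/10/10 split (the split fraction is an assumption; features and encoded_labels come from the earlier steps):

split_frac = 0.8

# first 80% of the data for training
split_idx = int(len(features) * split_frac)
train_x, remaining_x = features[:split_idx], features[split_idx:]
train_y, remaining_y = encoded_labels[:split_idx], encoded_labels[split_idx:]

# split the remaining 20% evenly into validation and test sets
test_idx = int(len(remaining_x) * 0.5)
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]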


DataLoaders and Batching

After creating training, test, and validation data, we can create DataLoaders for this data by following two steps:

  1. Create a known format for accessing our data, using TensorDataset which takes in an input set of data and a target set of data with the same first dimension, and creates a dataset.
  2. Create DataLoaders and batch our training, validation, and test Tensor datasets.
import torch
from torch.utils.data import TensorDataset, DataLoader

train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)

This is an alternative to creating a generator function for batching our data into full batches.
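
The validation and test loaders are built the same way, and it's worth grabbing one batch to check the tensor shapes (batch_size = 50 is an illustrative choice; it should match the value used for train_loader above):

batch_size = 50

valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)

# grab one training batch to check the shapes
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)
print('Sample input size: ', sample_x.size())   # (batch_size, seq_length)
print('Sample label size: ', sample_y.size())   # (batch_size,)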


Sentiment Network with PyTorch

Below is where you'll define the network.

The layers are as follows:

  1. An embedding layer that converts our word tokens (integers) into embeddings of a specific size.
  2. An LSTM layer defined by a hidden_state size and number of layers
  3. A fully-connected output layer that maps the LSTM layer outputs to a desired output_size
  4. A sigmoid activation layer which turns all outputs into a value 0-1; return only the last sigmoid output as the output of this network.

The Embedding Layer

We need to add an embedding layer because there are 74000+ words in our vocabulary. It is massively inefficient to one-hot encode that many classes. So, instead of one-hot encoding, we can have an embedding layer and use that layer as a lookup table. You could train an embedding layer using Word2Vec, then load it here. But, it's fine to just make a new layer, using it for only dimensionality reduction, and let the network learn the weights.

The LSTM Layer(s)

We'll create an LSTM to use in our recurrent network; it takes in an input_size, a hidden_dim, a number of layers, a dropout probability (applied between stacked layers), and a batch_first parameter.

Most of the time, your network will perform better with more layers, typically 2-3. Adding more layers allows the network to learn really complex relationships.

Note: init_hidden should initialize the hidden and cell states of the LSTM to all zeros, and move those states to the GPU, if available.
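
Putting those pieces together, here is a sketch of one possible module that follows the layer list above (the class name SentimentRNN, the extra dropout of 0.3 before the fully-connected layer, and the x.long() cast are implementation choices):

import torch
import torch.nn as nn

class SentimentRNN(nn.Module):

    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim,
                 n_layers, drop_prob=0.5):
        super().__init__()
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim

        # embedding layer: integer tokens -> dense vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # LSTM layer(s); this dropout only applies between stacked layers
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers,
                            dropout=drop_prob, batch_first=True)
        self.dropout = nn.Dropout(0.3)
        # fully-connected output layer + sigmoid activation
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()

    def forward(self, x, hidden):
        batch_size = x.size(0)

        x = x.long()                                  # embedding expects integer indices
        embeds = self.embedding(x)                    # (batch, seq_len, embedding_dim)
        lstm_out, hidden = self.lstm(embeds, hidden)  # (batch, seq_len, hidden_dim)

        # stack up the LSTM outputs and run them through the classifier
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        out = self.dropout(lstm_out)
        out = self.fc(out)
        sig_out = self.sig(out)

        # reshape to (batch, seq_len) and keep only the last time step
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1]

        return sig_out, hidden

    def init_hidden(self, batch_size):
        ''' Initialize hidden and cell states to zeros, created on the same
            device as the model's weights (so they land on the GPU once the
            model has been moved there). '''
        weight = next(self.parameters()).data
        hidden = (weight.new_zeros(self.n_layers, batch_size, self.hidden_dim),
                  weight.new_zeros(self.n_layers, batch_size, self.hidden_dim))
        return hidden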

Instantiate the network

Here, we'll instantiate the network. First up, defining the hyperparameters.
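
For example (these hyperparameter values are illustrative assumptions, not the only reasonable choices):

# +1 to account for the 0 token used for padding
vocab_size = len(vocab_to_int) + 1
output_size = 1        # a single sentiment value per review
embedding_dim = 400
hidden_dim = 256
n_layers = 2

net = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)
print(net)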


Training

Below is the typical training code. If you'd rather write it yourself, feel free to delete all of this and implement your own training loop. You can also add code to save the model by name.

We'll also be using a new kind of cross entropy loss, which is designed to work with a single Sigmoid output. BCELoss, or Binary Cross Entropy Loss, applies cross entropy loss to a single value between 0 and 1.

We also have some data and training hyperparameters:
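
A sketch of such a training loop, using BCELoss, Adam, and gradient clipping (the learning rate, epoch count, and clip value are assumptions, and the loop assumes every batch is full, e.g. because the dataset size divides evenly by batch_size):

import torch
import torch.nn as nn

lr = 0.001
epochs = 4
clip = 5   # gradient clipping threshold

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)

train_on_gpu = torch.cuda.is_available()
if train_on_gpu:
    net.cuda()

net.train()
for e in range(epochs):
    # fresh hidden state at the start of each epoch
    h = net.init_hidden(batch_size)

    for inputs, labels in train_loader:
        # detach the hidden state so we don't backprop through the entire history
        h = tuple([each.data for each in h])

        if train_on_gpu:
            inputs, labels = inputs.cuda(), labels.cuda()

        net.zero_grad()
        output, h = net(inputs, h)

        # BCELoss compares the final sigmoid output with the 0/1 label
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        # clip gradients to help prevent exploding gradients in the LSTM
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()

    # (a validation pass and periodic loss printout would normally go here)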


Testing

There are a few ways to test your network.
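
One common way is to measure the loss and accuracy on the held-out test data. A sketch, reusing criterion, batch_size, train_on_gpu, and test_loader from above:

test_losses = []
num_correct = 0

net.eval()
h = net.init_hidden(batch_size)

for inputs, labels in test_loader:
    h = tuple([each.data for each in h])
    if train_on_gpu:
        inputs, labels = inputs.cuda(), labels.cuda()

    output, h = net(inputs, h)
    test_losses.append(criterion(output.squeeze(), labels.float()).item())

    # round the sigmoid output to 0/1 and compare against the labels
    pred = torch.round(output.squeeze())
    correct = pred.eq(labels.float().view_as(pred))
    num_correct += torch.sum(correct).item()

print('Test loss: {:.3f}'.format(sum(test_losses) / len(test_losses)))
print('Test accuracy: {:.3f}'.format(num_correct / len(test_loader.dataset)))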

Inference on a test review
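
Another way is to run the trained network on a review you write yourself. Here's a sketch of a predict helper (the tokenize_review helper and the example review text are assumptions; words that aren't in the vocabulary are simply skipped):

from string import punctuation

def tokenize_review(test_review):
    ''' Turn a raw review string into a batch containing one tokenized review. '''
    test_review = test_review.lower()
    text = ''.join([c for c in test_review if c not in punctuation])
    # skip words that are not in the vocabulary
    return [[vocab_to_int[word] for word in text.split() if word in vocab_to_int]]

def predict(net, test_review, seq_length=200):
    ''' Print the predicted sentiment for a single raw review string. '''
    net.eval()
    test_ints = tokenize_review(test_review)
    features = pad_features(test_ints, seq_length)
    feature_tensor = torch.from_numpy(features)

    h = net.init_hidden(feature_tensor.size(0))
    if train_on_gpu:
        feature_tensor = feature_tensor.cuda()

    output, h = net(feature_tensor, h)
    # round the sigmoid output: 1 = positive, 0 = negative
    pred = torch.round(output.squeeze())
    print('Prediction value: {:.6f}'.format(output.item()))
    print('Positive review detected!' if pred.item() == 1 else 'Negative review detected.')

predict(net, 'This movie had the best acting and the dialogue was so good. I loved it.', seq_length)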