Chapter 4 Text classification
This tutorial classifies movie reviews as positive or negative using the text of the review. We’ll use the IMDB dataset that contains the text of 50,000 movie reviews from the Internet Movie Database. These are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are balanced, meaning they contain an equal number of positive and negative reviews.
4.1 Load the data
We will keep only the top 10,000 most frequently occurring words in the training data.
library(keras)

imdb <- dataset_imdb(num_words = 10000)
train_data <- imdb$train$x
train_labels <- imdb$train$y
test_data <- imdb$test$x
test_labels <- imdb$test$y
The obtained train_data and test_data are lists of reviews, where each review is a sequence of word indices.
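For example, we can inspect the first training review with str():

str(train_data[[1]])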
## int [1:218] 1 14 22 16 43 530 973 1622 1385 65 ...
The obtained train_labels and test_labels are lists of 0s (negative review) and 1s (positive review).
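Inspecting the first label:

str(train_labels[[1]])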
## int 1
We can decode a review's word indices back into English words as follows:
# Named list mapping words to an integer index.
word_index <- dataset_imdb_word_index()
reverse_word_index <- names(word_index)
names(reverse_word_index) <- word_index
# Decodes the review. Note that the indices are offset by 3 because 0, 1, and
# 2 are reserved indices for "padding," "start of sequence," and "unknown."
decoded_review <- sapply(train_data[[1]], function(index) {
word <- if (index >= 3) reverse_word_index[[as.character(index - 3)]]
if (!is.null(word)) word else "?"
})
cat(decoded_review)
## ? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all
4.2 Prepare the data for the neural network
Since we can’t feed a list of integers into a neural network, we need to transform our lists into tensors. There are two options to turn lists of integers into tensors:
- Pad our lists so that they all have the same length, turning them into an integer tensor of shape (samples, word_indices). Then, use as the first layer of the network a layer that can handle such integer tensors, like layer_embedding (a minimal sketch of this option follows the list).
- One-hot encode our lists by transforming them into vectors of 0s and 1s, then feed the resulting vectors to a stack of dense layers. We will use this solution in the tutorial in order to learn how to vectorize the data manually.
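For completeness, here is a minimal sketch of the padding + embedding option, which we will not pursue further; the sequence length of 256 and the embedding size of 16 are illustrative choices, not values prescribed by this tutorial:

# Pad every review to the same length of 256 word indices
x_train_padded <- pad_sequences(train_data, maxlen = 256)

# A network whose first layer consumes the integer tensor directly
embedding_model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 10000, output_dim = 16, input_length = 256) %>%
  layer_flatten() %>%
  layer_dense(units = 1, activation = "sigmoid")

Turning to the second option, the function below one-hot encodes the sequences manually: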
vectorize_sequences <- function(sequences, dimension = 10000) {
# Creates an all-zero matrix of shape (length(sequences), dimension)
results <- matrix(0, nrow = length(sequences), ncol = dimension)
for (i in seq_along(sequences))
# Sets specific indices of results[i] to 1s
results[i, sequences[[i]]] <- 1
results
}
x_train <- vectorize_sequences(train_data)
x_test <- vectorize_sequences(test_data)
Here’s what the samples look like now:
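# The first sample is now a dense vector of 0s and 1s
str(x_train[1, ])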
## num [1:10000] 1 1 0 1 1 1 1 1 1 0 ...
We should also convert our labels from integer to numeric.
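These numeric vectors, y_train and y_test, are what the training and evaluation code below expects:

y_train <- as.numeric(train_labels)
y_test <- as.numeric(test_labels)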
4.3 Building the model
We will use a simple stack of fully connected (dense) layers with relu activation.
model <- keras_model_sequential() %>%
layer_dense(units = 16, activation = "relu", input_shape = c(10000)) %>%
layer_dense(units = 16, activation = "relu") %>%
layer_dense(units = 1, activation = "sigmoid")
We compile the model with the rmsprop optimizer and the binary_crossentropy loss, which is appropriate for a binary classification problem that outputs probabilities:
model %>% compile(
optimizer = optimizer_rmsprop(lr=0.001),
loss = "binary_crossentropy",
metrics = c("accuracy")
)
In order to monitor the accuracy of the model during training on data it has never seen before, we'll create a validation set by setting apart 10,000 samples from the original training data.
val_indices <- 1:10000
x_val <- x_train[val_indices,]
partial_x_train <- x_train[-val_indices,]
y_val <- y_train[val_indices]
partial_y_train <- y_train[-val_indices]
Now we train the model over 10 epochs, in mini-batches of 512 samples. In order to monitor loss and accuracy on the validation set, we pass the validation data as the validation_data argument.
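A sketch of this training step matching the description above (the returned history object records the per-epoch loss and accuracy, and plot(history) visualizes them):

history <- model %>% fit(
  partial_x_train,
  partial_y_train,
  epochs = 10,
  batch_size = 512,
  validation_data = list(x_val, y_val)
)
plot(history)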
4.4 Testing the model
We saw in the last section that the model's validation performance starts to degrade after 4 epochs as the model begins to overfit. So, we can decide to stop training after 4 epochs. We will train the model from scratch for 4 epochs and then evaluate it on the test data.
model <- keras_model_sequential() %>%
layer_dense(units = 16, activation = "relu", input_shape = c(10000)) %>%
layer_dense(units = 16, activation = "relu") %>%
layer_dense(units = 1, activation = "sigmoid")
model %>% compile(
optimizer = "rmsprop",
loss = "binary_crossentropy",
metrics = c("accuracy")
)
model %>% fit(x_train, y_train, epochs = 4, batch_size = 512)
results <- model %>% evaluate(x_test, y_test)
results
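Once trained, the model can also score individual reviews; a quick check on the first few test samples returns the predicted probability that each review is positive:

model %>% predict(x_test[1:10, ])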
4.5 Reference
Chollet & Allaire (2017, Dec. 7). RStudio AI Blog: Deep Learning for Text Classification with Keras. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2017-12-07-text-classification-with-keras/