Chapter 10 Document-term matrix

A document-term matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix:

Rows correspond to documents in the collection and
Columns correspond to terms
Values contain the number of appearances of terms in the specified documents

10.1 COnverting DTM into dataframe

We will see how to transform a document-term matrix into a dataframe. We can find examples of DTM data by loading topicmodels package.

library(tm)
library(topicmodels)
library(quanteda)

data("AssociatedPress", package = "topicmodels")
AssociatedPress

## <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
## Non-/sparse entries: 302031/23220327
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)

The loaded dataset contains 2246 documents and 10473 distinct terms. We notice that this DTM is 99% sparse (99% of document-word paris are zero). We can get the terms using Terms() function.

terms = Terms(AssociatedPress)
head(terms)

## [1] "aaron"      "abandon"    "abandoned"  "abandoning" "abbott"    
## [6] "abboud"

In order to analyze the data, we should transform it inot dataframe. We can use tidy() function to do that.

ap_td = tidy(AssociatedPress)
ap_td

## # A tibble: 302,031 x 3
##    document term       count
##       <int> <chr>      <dbl>
##  1        1 adding         1
##  2        1 adult          2
##  3        1 ago            1
##  4        1 alcohol        1
##  5        1 allegedly      1
##  6        1 allen          1
##  7        1 apparently     2
##  8        1 appeared       1
##  9        1 arrested       1
## 10        1 assault        1
## # ... with 302,021 more rows

Once we have the data in a dataframe format, we can perform some analysis. Here is an example of applying sentiment analysis to evaluate the negative and positive terms in the collection.

# using "bing" database to attribute negative/positive attribute to terms
ap_sentiments = ap_td %>% 
  inner_join(get_sentiments("bing"), by = c(term = "word"))
ap_sentiments

## # A tibble: 30,094 x 4
##    document term    count sentiment
##       <int> <chr>   <dbl> <chr>    
##  1        1 assault     1 negative 
##  2        1 complex     1 negative 
##  3        1 death       1 negative 
##  4        1 died        1 negative 
##  5        1 good        2 positive 
##  6        1 illness     1 negative 
##  7        1 killed      2 negative 
##  8        1 like        2 positive 
##  9        1 liked       1 positive 
## 10        1 miracle     1 positive 
## # ... with 30,084 more rows

# plot the results
library(ggplot2)

ap_sentiments %>%
  count(sentiment, term, wt = count) %>%
  ungroup() %>%
  filter(n >= 200) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(term = reorder(term, n)) %>%
  ggplot(aes(term, n, fill = sentiment)) +
  geom_bar(stat = "identity") +
  ylab("Contribution to sentiment") +
  coord_flip()

10.2 Generating Document-term matrix

Some algorithms may need document-term matrix as input. The cast_dtm function enable the generation of DTM structure from a dataframe.

ap_td %>% 
  cast_dtm(document, term, count)

## <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
## Non-/sparse entries: 302031/23220327
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)

We can also generate a Document-feature matrix by using the cast_dfm function

ap_td %>% 
  cast_dfm(document, term, count)

## Document-feature matrix of: 2,246 documents, 10,473 features (98.7% sparse).
##     features
## docs adding adult ago alcohol allegedly allen apparently appeared arrested
##    1      1     2   1       1         1     1          2        1        1
##    2      0     0   0       0         0     0          0        1        0
##    3      0     0   1       0         0     0          0        1        0
##    4      0     0   3       0         0     0          0        0        0
##    5      0     0   0       0         0     0          0        0        0
##    6      0     0   2       0         0     0          0        0        0
##     features
## docs assault
##    1       1
##    2       0
##    3       0
##    4       0
##    5       0
##    6       0
## [ reached max_ndoc ... 2,240 more documents, reached max_nfeat ... 10,463 more features ]