Day72: Counting emojis

Posted by csiu on May 7, 2017 | with: 100daysofcode

Now that we can recognize emojis, I can count them using this example:

The R Markdown document for this project is found here.

Counting emojis

The original code appears to be written in base R, which I find hard to read. From what I can see, sapply(emojis$rencoding, regexpr, tweets$text, ignore.case = T, useBytes = T) is the key line used to count emojis in tweets. This code tells R to match each pattern in “emojis$rencoding” against the text in “tweets$text”, with the options “ignore.case = T” and “useBytes = T”.
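To make the shape of that call concrete, here is a minimal sketch that uses plain ASCII words in place of emoji encodings. The `patterns` and `texts` vectors are made up for illustration and only stand in for emojis$rencoding and tweets$text.

```r
# Hypothetical stand-ins for emojis$rencoding and tweets$text
patterns <- c("cat", "dog")
texts <- c("cat cat dog", "no pets here", "dog")

# regexpr() is called once per pattern against every text.
# The result is a matrix with one row per text and one column per pattern.
# Each cell holds the position of the first match, or -1 if there is none.
hits <- sapply(patterns, regexpr, texts, ignore.case = TRUE, useBytes = TRUE)

# Number of texts that contain each pattern at least once
texts_with_match <- colSums(hits > 0)
```

Note that each cell is a position and not a count, which matters for the first attempt described next.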

Attempt 1: Using base R

Following the example, I tried to use regexpr to count the number of times an emoji appears in a tweet. The problem, however, is that regexpr reports only the first match in each string, so an emoji is counted at most once per tweet (regardless of duplicates). This undercounting is not what I want.
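The undercounting is easy to reproduce with a made-up string. The sketch below also shows one base R way to count every match, using gregexpr with regmatches. That is only an illustration of the difference and not the approach this post ends up using.

```r
# Made-up example: the word "taxi" stands in for an emoji encoding
text <- "taxi taxi bus"

# regexpr() stops at the first match, so a repeated pattern still gives one hit
first_hit <- regexpr("taxi", text)

# gregexpr() finds every match; regmatches() extracts them
# and lengths() counts how many were found in each string
all_hits <- regmatches(text, gregexpr("taxi", text))
n_matches <- lengths(all_hits)
```

Here `first_hit` holds a single position, while `n_matches` reflects both occurrences.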

Attempt 2: Word tokenization

Next I tried tokenizing the tweets with RWeka::WordTokenizer as done in Twimoji: Identifying Emoji in Tweets (Chris Tufts, 2015). The problem, however, is that I am unable to install RWeka and thus am unable to tokenize my tweets.

Attempt 3: stringr

Finally, the stringr::str_count function by Hadley Wickham allows users to “count the number of matches in a string”, which is what I want to do.
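As a quick sanity check, here is a small sketch of str_count on made-up strings. Wrapping the pattern in fixed() is my own choice for this example so that the pattern is matched literally. The code later in this post passes the pattern as is.

```r
library(stringr)

# Made-up example strings; ASCII words stand in for emoji encodings
texts <- c("taxi taxi bus", "bus", "no vehicles")

# str_count() is vectorised over the strings and returns one count per string.
# fixed() treats the pattern as a literal string rather than a regular expression.
taxi_counts <- str_count(texts, fixed("taxi"))
```

Unlike regexpr, the repeated "taxi" in the first string is counted twice.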

Counting emojis the tidy way

I first define a helper function to count the number of times an emoji appears in a tweet (using stringr::str_count) and then apply this function (using purrr::map) for all emojis.

library(dplyr)    # %>%, select(), mutate()
library(stringr)  # str_count()

# Helper function: count how many times pattern `e` occurs in each tweet
count_emojis <- function(e){
  counts <- str_count(df_tidy$text, e)
  data.frame(
    counts,
    tweet_id = seq_along(counts)
  )
}
# Do the counting of emojis for each tweet
emoji_counts <-
  emoticons %>%
  select(description, r_encoding) %>%
  mutate(
    counts = purrr::map(r_encoding, ~count_emojis(.x))
  ) %>%
  tidyr::unnest(counts)
## # A tibble: 2,526 × 4
##       description               r_encoding counts tweet_id
##             <chr>                    <chr>  <int>    <int>
## 1  aerial tramway <ed><a0><bd><ed><ba><a1>      0        1
## 2  aerial tramway <ed><a0><bd><ed><ba><a1>      0        2
## 3  aerial tramway <ed><a0><bd><ed><ba><a1>      0        3
## 4        airplane             <e2><9c><88>      0        1
## 5        airplane             <e2><9c><88>      0        2
## 6        airplane             <e2><9c><88>      0        3
## 7     alarm clock             <e2><8f><b0>      0        1
## 8     alarm clock             <e2><8f><b0>      0        2
## 9     alarm clock             <e2><8f><b0>      0        3
## 10  alien monster <ed><a0><bd><ed><b1><be>      0        1
## # ... with 2,516 more rows
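
Since df_tidy and emoticons are not defined in this post, here is a self-contained sketch of the same pattern on toy data. The `tweets` and `lookup` data frames, the `count_pattern` helper, and the `per_tweet` column name are all hypothetical stand-ins, and ASCII words replace the emoji encodings.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-ins for the post's `df_tidy` and `emoticons` objects
tweets <- data.frame(
  text = c("car car bus", "bus", "nothing here"),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  description = c("automobile", "bus"),
  pattern = c("car", "bus"),
  stringsAsFactors = FALSE
)

# Count one pattern in every tweet; returns one row per tweet
count_pattern <- function(p) {
  counts <- str_count(tweets$text, fixed(p))
  data.frame(counts, tweet_id = seq_along(counts))
}

# One nested data frame per pattern, then flattened to
# one row per (pattern, tweet) pair
toy_counts <- lookup %>%
  mutate(per_tweet = purrr::map(pattern, count_pattern)) %>%
  tidyr::unnest(per_tweet)
```

With 2 patterns and 3 tweets, `toy_counts` has 6 rows, which mirrors the 2,526 rows above being the number of emojis times the number of tweets.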

After the count of emojis per tweet is computed, we can next look at the total emoji counts.

emoji_counts %>%
  # Ignore emojis with no counts
  filter(counts != 0) %>%

  # Here I want to only consider the latest tweet
  filter(tweet_id == 1) %>%
  select(-tweet_id) %>%

  # Summarize the counts per emoji
  group_by(description) %>%
  summarise(count = sum(counts)) %>%
  arrange(desc(count))
## Source: local data frame [7 x 3]
## Groups: description [7]
##
##              description count
##                    <chr> <int>
## 1             automobile     4
## 2                   taxi     3
## 3         delivery truck     2
## 4              ambulance     1
## 5                    bus     1
## 6            fire engine     1
## 7 vertical traffic light     1

The results check out!