Now that we can recognize emojis, I can count them using this example:
Another test: Counting the number of emojis (w/ duplicates) from tweets
— Celia S. Siu (@celiassiu) May 8, 2017
--.--
🚦
[12 vehicle emojis: 4 automobiles, 3 taxis, 2 delivery trucks, 1 ambulance, 1 bus, 1 fire engine]
The R Markdown document for this project is found here.
Counting emojis
The original code appears to be in base R (which is hard to read). From what I can see, sapply(emojis$rencoding, regexpr, tweets$text, ignore.case = T, useBytes = T)
is the key line used to count emojis from tweets. This code essentially tells R to match each pattern in “emojis$rencoding” against the text in “tweets$text”, with the options “ignore.case = T” and “useBytes = T”.
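To make the shape of that call concrete, here is a minimal sketch on toy data. The vectors `patterns` and `texts` are hypothetical stand-ins for emojis$rencoding and tweets$text, and plain-ASCII tokens such as ":car:" are used instead of real emoji bytes to sidestep encoding issues.

```r
# Hypothetical stand-ins for emojis$rencoding and tweets$text
patterns <- c(car = ":car:", bus = ":bus:")
texts <- c("go :car: :car:", "take the :BUS:", "nothing here")

# sapply() passes each pattern as the first argument of regexpr();
# `texts` and the options are forwarded unchanged on every call
positions <- sapply(patterns, regexpr, texts,
                    ignore.case = TRUE, useBytes = TRUE)

# Rows are tweets, columns are patterns; each cell holds the position of
# the first match, or -1 when the pattern does not occur in that tweet
print(positions)
```

Note that the second ":car:" in the first toy tweet leaves no trace in the result: each cell records a single position, not a count.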
Attempt 1: Using base R
Following the example, I tried to use regexpr
to count the number of times an emoji appears in a tweet. The problem, however, is that regexpr only reports the first match in each string, so an emoji is counted at most once per tweet, regardless of duplicates. This undercounting is not what I want.
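For comparison, base R does offer gregexpr, which returns every match rather than just the first. This is not the route taken in this post; the sketch below only illustrates the difference, again using a hypothetical ASCII token (":car:") in place of real emoji bytes.

```r
# Toy tweets; ":car:" is an ASCII stand-in for an emoji encoding
tweets <- c("go :car: :car: :car:", "no emoji here", ":car: once")

# regexpr() gives the first match only, so it can at best say
# whether the pattern occurs (1) or not (0) in each tweet
first_match <- regexpr(":car:", tweets, fixed = TRUE)
found_once <- as.integer(first_match != -1)

# gregexpr() gives all match positions per tweet (or -1 if none),
# so counting the non-negative entries includes duplicates
all_matches <- gregexpr(":car:", tweets, fixed = TRUE)
true_counts <- vapply(all_matches, function(m) sum(m != -1), integer(1))

print(found_once)
print(true_counts)
```

Here the first tweet contributes 1 under the regexpr approach but 3 under gregexpr, which is the kind of undercount described above.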
Attempt 2: Word tokenization
Next, I tried tokenizing the tweets with RWeka::WordTokenizer
as done in Twimoji: Identifying Emoji in Tweets (Chris Tufts, 2015). The problem, however, is that I was unable to install RWeka and thus could not tokenize my tweets.
Attempt 3: stringr
Finally, the stringr::str_count
function by Hadley Wickham allows users to “count the number of matches in a string”, which is exactly what I want to do.
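As a quick illustration of what str_count returns, here is a small sketch on toy data. The `tweets` vector and the ":car:" token are made up for the example; `fixed()` is used so the pattern is matched literally rather than as a regular expression.

```r
library(stringr)

# Toy tweets; ":car:" is an ASCII stand-in for an emoji encoding
tweets <- c("go :car: :car: :car:", "no emoji here", ":car: once")

# str_count() is vectorised over the strings and returns one
# integer per tweet: the number of matches, duplicates included
car_counts <- str_count(tweets, fixed(":car:"))

print(car_counts)
```

One count per tweet, with duplicates included, is exactly the shape needed for the per-tweet table built in the next section.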
Counting emojis the tidy way
I first define a helper function to count the number of times an emoji appears in a tweet (using stringr::str_count
) and then apply this function (using purrr::map
) for all emojis.
library(dplyr)
library(stringr)

# Helper function to count the number of times a pattern occurs in each tweet
count_emojis <- function(e) {
  counts <- str_count(df_tidy$text, e)
  data.frame(
    counts,
    tweet_id = seq_along(counts)
  )
}

# Do the counting of emojis for each tweet
emoji_counts <-
  emoticons %>%
  select(description, r_encoding) %>%
  mutate(
    counts = purrr::map(r_encoding, ~count_emojis(.x))
  ) %>%
  tidyr::unnest(counts)
## # A tibble: 2,526 × 4
##    description    r_encoding               counts tweet_id
##    <chr>          <chr>                     <int>    <int>
##  1 aerial tramway <ed><a0><bd><ed><ba><a1>      0        1
##  2 aerial tramway <ed><a0><bd><ed><ba><a1>      0        2
##  3 aerial tramway <ed><a0><bd><ed><ba><a1>      0        3
##  4 airplane       <e2><9c><88>                  0        1
##  5 airplane       <e2><9c><88>                  0        2
##  6 airplane       <e2><9c><88>                  0        3
##  7 alarm clock    <e2><8f><b0>                  0        1
##  8 alarm clock    <e2><8f><b0>                  0        2
##  9 alarm clock    <e2><8f><b0>                  0        3
## 10 alien monster  <ed><a0><bd><ed><b1><be>      0        1
## # ... with 2,516 more rows
Once the count of each emoji per tweet is computed, we can look at the total emoji counts.
emoji_counts %>%
  # Ignore emojis with no counts
  filter(counts != 0) %>%
  # Here I want to only consider the latest tweet
  filter(tweet_id == 1) %>%
  select(-tweet_id) %>%
  # Summarize the counts per emoji
  group_by(description) %>%
  summarise(count = sum(counts)) %>%
  arrange(desc(count))
## Source: local data frame [7 x 3]
## Groups: description [7]
##
##   description            count
##   <chr>                  <int>
## 1 automobile                 4
## 2 taxi                       3
## 3 delivery truck             2
## 4 ambulance                  1
## 5 bus                        1
## 6 fire engine                1
## 7 vertical traffic light     1
The results check out!