Today I got the program to recognize emojis in tweets. Specifically, the blue heart emoji from this tweet:
Trying to count emojis off twitter 💙
— Celia S. Siu (@celiassiu) May 7, 2017
Recall: Getting the tweets
# Load libs
library(twitteR)
library(dplyr)
library(readr)
# Create API keys from https://apps.twitter.com
# Then enter the following information
api_key <- 'XXX'
api_secret <- 'XXX'
access_token <- 'XXX'
access_token_secret <- 'XXX'
# I use the following script to replace 'XXX' with my API key values
source("twitter_api_key.R")
# Connect to Twitter
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
# Pull my test tweet
search_string <- 'celiassiu'
tweets.raw <-
  searchTwitter(search_string, since = '2017-05-06')
# Tidy the tweets by:
# 1. removing retweets
# 2. converting to a data frame
# 3. selecting only the "text" field
# 4. converting encodings
(df_tidy <-
  strip_retweets(tweets.raw, strip_manual = TRUE, strip_mt = TRUE) %>%
  twListToDF() %>%
  select(text) %>%
  mutate(
    text_converted = iconv(text, from = 'latin1', to = 'ASCII', sub = 'byte')
  )
)
This produces the following tweets, with the original text on the left and the converted text on the right:
text | text_converted |
Trying to count emojis off twitter \xed\xa0\xbd\xed\xb2\x99 | Trying to count emojis off twitter <ed><a0><bd><ed><b2><99> |
Day70: Tidying twitter data using tidy code #100daysofcode #rstats https://t.co/6bvxqnPEHa | Day70: Tidying twitter data using tidy code #100daysofcode #rstats https://t.co/6bvxqnPEHa |
The blue heart is now encoded as: <ed><a0><bd><ed><b2><99>.
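To see the conversion step in isolation, here is a minimal sketch that does not need the Twitter API. It rebuilds the six bytes shown in the table with `rawToChar()` and runs the same `iconv()` call on them. The names `heart_bytes`, `heart` and `converted` are mine, for illustration only, and the exact behaviour of `iconv()` can vary between platforms, so treat this as a check to run on your own machine rather than a guarantee.

```r
# The six bytes that came back for the blue heart in the tweet above
heart_bytes <- as.raw(c(0xed, 0xa0, 0xbd, 0xed, 0xb2, 0x99))
heart <- rawToChar(heart_bytes)

# Read each byte as latin1 and ask for ASCII. With sub = "byte", every byte
# that has no ASCII equivalent is replaced by its hex code in angle brackets.
converted <- iconv(heart, from = "latin1", to = "ASCII", sub = "byte")

print(converted)
print(nchar(converted))
```

Each of the six bytes becomes a four-character token such as `<ed>`, which is why the converted emoji ends up much longer than the original.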
Loading Emoji dictionary
Originally I was using the emoji encoding dictionary from GitHub: laurenancona/twimoji, which encodes the blue heart as \xF0\x9F\x92\x99 (UTF-8). This mismatch between the encoding in the dictionary and the encoding in the tweet is what caused my program to fail yesterday.
Looking at the Emoticons decoder for social media sentiment analysis in R (Jessica Peterka-Bonetta, 2015), I stumbled upon a different emoji dictionary located at GitHub: today-is-a-good-day/emojis. This time the dictionary's representation of the blue heart emoji matches the one in the tweet.
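As far as I can tell, the two byte sequences are two different encodings of the same code point, U+1F499. The four-byte form is standard UTF-8. The six-byte form is what you get if the code point is first split into a UTF-16 surrogate pair and each half is then written out as a three-byte UTF-8 sequence. The sketch below works through that arithmetic; the helper `three_byte()` and the variable names are my own and purely illustrative.

```r
# Code point of the blue heart emoji, U+1F499
cp <- 0x1F499

# Split the code point into a UTF-16 surrogate pair
v <- cp - 0x10000
high <- 0xD800 + v %/% 0x400   # 0xD83D
low  <- 0xDC00 + v %% 0x400    # 0xDC99

# Hypothetical helper: write a value in 0x0800..0xFFFF using the
# three-byte UTF-8 bit pattern 1110xxxx 10xxxxxx 10xxxxxx
three_byte <- function(x) {
  c(0xE0 + x %/% 0x1000,
    0x80 + (x %/% 0x40) %% 0x40,
    0x80 + x %% 0x40)
}

# Six bytes: the form seen in the tweet
surrogate_bytes <- as.raw(c(three_byte(high), three_byte(low)))
print(surrogate_bytes)   # ed a0 bd ed b2 99

# Four bytes: the standard UTF-8 form used by the first dictionary
utf8_bytes <- charToRaw(intToUtf8(cp))
print(utf8_bytes)        # f0 9f 92 99
```

So a dictionary keyed on the four-byte form can never match text that carries the six-byte form, and vice versa.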
# Load emoji dictionary and consider only the entry for the blue heart
(emoticons <-
  readr::read_delim(
    "emDict.csv",
    delim = ";",
    col_names = c("description", "native", "bytes", "r_encoding"),
    skip = 1
  ) %>%
  mutate(description = tolower(description)) %>%
  # descriptions are lower case after tolower(), so filter on "blue heart"
  filter(description == "blue heart")
)
Running this code produces the following emoji dictionary for the blue heart:
description | native | bytes | r_encoding |
blue heart | \U0001f499 | \xF0\x9F\x92\x99 | <ed><a0><bd><ed><b2><99> |
Now to see if it works.
Emojis in tweets
regexpr(pattern = emoticons$r_encoding[1], # the r_encoding for the blue heart
        text = df_tidy$text_converted,     # the tweets
        useBytes = TRUE)
Running the above code produces:
## [1] 36 -1
## attr(,"match.length")
## [1] 24 -1
## attr(,"useBytes")
## [1] TRUE
Eureka! The program recognizes the blue heart emoji in the first tweet: the match starts at position 36 and is 24 bytes long. The -1 values mean that the second tweet contains no blue heart.
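Since the goal is to count emojis, a possible next step is to swap `regexpr()`, which reports only the first match per string, for `gregexpr()`, which reports all of them. The sketch below is one way to do that, not the only one. To keep it self-contained it uses hard-coded stand-ins (`text_converted` and `blue_heart`) for `df_tidy$text_converted` and `emoticons$r_encoding[1]`, and it adds `fixed = TRUE` so the pattern is treated as a literal string, which is my own choice rather than part of the code above.

```r
# Stand-ins for df_tidy$text_converted and emoticons$r_encoding[1]
text_converted <- c(
  "Trying to count emojis off twitter <ed><a0><bd><ed><b2><99>",
  "Day70: Tidying twitter data using tidy code #100daysofcode #rstats"
)
blue_heart <- "<ed><a0><bd><ed><b2><99>"

# gregexpr() returns the start of every match in each string,
# or -1 when a string contains no match
matches <- gregexpr(pattern = blue_heart, text = text_converted,
                    fixed = TRUE, useBytes = TRUE)

# Count the real matches (positions greater than 0) per tweet
emoji_counts <- vapply(matches, function(m) sum(m > 0), integer(1))
print(emoji_counts)   # 1 0
```

Summing `emoji_counts` would then give the total number of blue hearts across all tweets.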