Day71: Recognizing emojis

Posted by csiu on May 6, 2017 | with: 100daysofcode

Today I got the program to recognize emojis in tweets. Specifically, the blue heart emoji from this tweet:


Recall: Getting the tweets

# Load libs
library(twitteR)
library(dplyr)
library(readr)

# Create API keys from https://apps.twitter.com
# Then enter the following information
api_key <- 'XXX'
api_secret <- 'XXX'
access_token <- 'XXX'
access_token_secret <- 'XXX'
# I use the following script to replace 'XXX' with my API key values
source("twitter_api_key.R")

# Connect to Twitter
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
# Pull my test tweet
search_string <- 'celiassiu'
tweets.raw <-
  searchTwitter(search_string, since = '2017-05-06')
# Tidying the tweets by:
#   1. Removing retweets
#   2. converting to data frame
#   3. selecting only the "text" field
#   4. converting encodings
(df_tidy <-
  strip_retweets(tweets.raw, strip_manual = TRUE, strip_mt = TRUE) %>%
  twListToDF() %>%
  select(text) %>%
  mutate(
    text_converted = iconv(text, from='latin1', to='ASCII', sub='byte')
  )
)

As a result, we get the following tweets, with the original text on the left and the converted text on the right:

text | text_converted
Trying to count emojis off twitter \xed\xa0\xbd\xed\xb2\x99 | Trying to count emojis off twitter <ed><a0><bd><ed><b2><99>
Day70: Tidying twitter data using tidy code #100daysofcode #rstats https://t.co/6bvxqnPEHa | Day70: Tidying twitter data using tidy code #100daysofcode #rstats https://t.co/6bvxqnPEHa

The blue heart is now encoded by: <ed><a0><bd><ed><b2><99>.
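To make the conversion step more concrete, here is a minimal sketch of what iconv() does with sub = 'byte': every byte that cannot be represented in ASCII is replaced by its hex value in angle brackets. The input strings below are made up for illustration (they are not pulled from Twitter), and the variable names are hypothetical.

```r
# A Latin-1 string with a single non-ASCII byte (0xE7)
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
x_converted <- iconv(x, from = "latin1", to = "ASCII", sub = "byte")
x_converted
# "fa<e7>ile"

# The six raw bytes that the blue heart arrived as in the tweet above
heart_raw <- "\xed\xa0\xbd\xed\xb2\x99"
heart_converted <- iconv(heart_raw, from = "latin1", to = "ASCII", sub = "byte")
heart_converted
# "<ed><a0><bd><ed><b2><99>"

# Each byte becomes a 4-character token, so 6 bytes give 24 characters
nchar(heart_converted)
# 24
```

The result is plain ASCII, which makes it straightforward to search for with ordinary string functions.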

Loading Emoji dictionary

Originally I was using the emoji encoding dictionary from GitHub: laurenancona/twimoji, which encodes the blue heart as \xF0\x9F\x92\x99 (UTF-8). This mismatch between the dictionary's encoding and the tweet's encoding is what caused my program to fail yesterday.
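A possible explanation for the mismatch, offered as a sketch rather than a confirmed diagnosis: the six bytes in the tweet are exactly what you get when the blue heart (U+1F499) is first split into a UTF-16 surrogate pair and each half is then written as three UTF-8-style bytes. Why the data arrives in that form is an assumption here; the arithmetic itself can be checked. The helper surrogate_byte_string() below is hypothetical and only meant for illustration.

```r
# Hypothetical helper: turn a code point above U+FFFF into the "<xx>" byte
# string obtained by encoding its UTF-16 surrogate pair as 3 bytes each.
surrogate_byte_string <- function(code_point) {
  offset <- code_point - 0x10000
  high <- 0xD800 + offset %/% 0x400  # high (lead) surrogate
  low  <- 0xDC00 + offset %% 0x400   # low (trail) surrogate

  # Encode one 16-bit value with the 3-byte UTF-8 bit layout
  three_bytes <- function(x) {
    c(0xE0 + x %/% 0x1000,
      0x80 + (x %/% 0x40) %% 0x40,
      0x80 + x %% 0x40)
  }

  bytes <- as.integer(c(three_bytes(high), three_bytes(low)))
  paste0("<", sprintf("%02x", bytes), ">", collapse = "")
}

blue_heart <- surrogate_byte_string(0x1F499)
blue_heart
# "<ed><a0><bd><ed><b2><99>"
```

The surrogate pair for U+1F499 is D83D DC99, which gives ED A0 BD and ED B2 99, whereas standard UTF-8 encodes the same code point directly as the four bytes F0 9F 92 99.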

Looking at the Emoticons decoder for social media sentiment analysis in R (Jessica Peterka-Bonetta, 2015), I stumbled upon a different emoji dictionary located at GitHub: today-is-a-good-day/emojis. This time the dictionary's representation of the blue heart emoji matches the one in the tweet.

# Load emoji dictionary and consider only the entry for the blue heart
(emoticons <-
  readr::read_delim(
    "emDict.csv",
    delim = ";",
    col_names = c("description", "native", "bytes", "r_encoding"),
    skip = 1
  ) %>%

  mutate(description = tolower(description)) %>%
  # description is lower case after the mutate() above, so match "blue heart"
  filter(description == "blue heart")
)

Running this code produces the following emoji dictionary for the blue heart:

description | native | bytes | r_encoding
blue heart | \U0001f499 | \xF0\x9F\x92\x99 | <ed><a0><bd><ed><b2><99>

Now to see if it works.

Emojis in tweets

regexpr(pattern = emoticons$r_encoding[1],  # the r_encoding for the blue heart
        text = df_tidy$text_converted,      # the tweet
        useBytes = TRUE)

Running the above code produces:

## [1] 36 -1
## attr(,"match.length")
## [1] 24 -1
## attr(,"useBytes")
## [1] TRUE

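Eureka! The program recognizes the blue heart emoji in the tweet! In the first tweet the match starts at position 36 and has a match length of 24 (six byte tokens of four characters each), while the -1 values for the second tweet mean no match was found.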
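As a possible next step toward counting emojis, here is a minimal base R sketch that counts how often the blue heart appears in each tweet. The two tweets and the pattern are typed in by hand so the example runs without Twitter access or the dictionary file, and the names tweets, blue_heart and emoji_counts are hypothetical. It uses fixed = TRUE so the pattern is treated as a literal string rather than a regular expression, which is a design choice of this sketch and not part of the code above.

```r
# Converted tweet text, copied from the table earlier in the post
tweets <- c(
  "Trying to count emojis off twitter <ed><a0><bd><ed><b2><99>",
  "Day70: Tidying twitter data using tidy code #100daysofcode #rstats"
)
blue_heart <- "<ed><a0><bd><ed><b2><99>"

# gregexpr() returns every match per tweet; -1 marks "no match"
matches <- gregexpr(blue_heart, tweets, fixed = TRUE, useBytes = TRUE)

# Count the matches in each tweet (positions greater than 0 are real matches)
emoji_counts <- vapply(matches, function(m) sum(m > 0), integer(1))
emoji_counts
# 1 0

# grepl() gives a simple yes/no per tweet
has_heart <- grepl(blue_heart, tweets, fixed = TRUE, useBytes = TRUE)
has_heart
# TRUE FALSE
```

Looping the same idea over every r_encoding value in the dictionary would give a count per emoji per tweet.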