Day70: Tidying Twitter

Posted by csiu on May 5, 2017 | with: 100daysofcode

Today, I clean the Twitter data.

(Originally I wanted to count the emoticons from the tweets, but I’m running into issues with matching the text encodings.)
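As a rough illustration of the mismatch, the iconv() conversion used in the cleaning code below replaces every non-ASCII character with its hexadecimal byte codes, so an emoji in the converted text no longer matches the emoji character itself. A minimal sketch with a made-up tweet:

# Hypothetical tweet containing an emoji (not from the real data)
tweet <- "Great talk! \U0001F600"

# Same conversion as in the cleaning step below
converted <- iconv(tweet, from='latin1', to='ASCII', sub='byte')
converted
# Something like: "Great talk! <f0><9f><98><80>" -- the emoji is now a
# run of byte codes, not the character I was trying to match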

Figuring out what the code means

In his original code, Hamdan Azhar used the following to clean the Twitter data.

df$hashtag <- ht; df$created <- as.POSIXlt(df$created); df$text <- iconv(df$text, 'latin1', 'ASCII', 'byte'); df$url <- paste0('https://twitter.com/', df$screenName, '/status/', df$id); df <- rename(df, c(retweetCount = 'retweets'));
df.a <- subset(df, select = c(text, created, url, latitude, longitude, retweets, hashtag));

The code in its original form is somewhat challenging to read.

Doing it the tidy way

Here I use the dplyr package (plus a lubridate helper for parsing dates) to add new columns, convert the text encoding, parse the date column, rename a column, and select the columns needed for further analysis.

library(dplyr)

df_tidy <-
  df %>%
  mutate(
    # Add new columns containing the hashtag & tweet url
    hashtag = search_string,
    url = paste0('https://twitter.com/', screenName, '/status/', id),

    # Convert character vector between encodings
    text = iconv(text, from='latin1', to='ASCII', sub='byte'),

    # Convert to a date type
    created = lubridate::ymd_hms(created, tz = "UTC")
  ) %>%

  # Rename retweetCount to the shorter name used in the analysis
  rename(
    retweets = retweetCount
  ) %>%

  # Keep only the columns needed for further analysis
  select(
    text, created, url, latitude, longitude, retweets, hashtag
  )
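
For completeness, here is a sketch of how df and search_string might be set up before running the pipeline, assuming the tweets are pulled with the twitteR package (which is where columns like screenName, retweetCount, latitude, and longitude come from); the hashtag and tweet count below are placeholders.

library(twitteR)

# OAuth credentials must already be registered with setup_twitter_oauth()
search_string <- "#rstats"
tweets <- searchTwitter(search_string, n = 100)

# Convert the list of status objects into a data frame with columns such as
# text, created, id, screenName, retweetCount, latitude, and longitude
df <- twListToDF(tweets)

After the pipeline runs, dplyr::glimpse(df_tidy) is a quick way to confirm that only the seven selected columns remain.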