Today, I clean the twitter data.
(Originally I wanted to count the emoticons from the tweets, but I’m running into issues with matching the text encodings.)
Figuring out what the code means
In Hamdan Azhar’s original code, he used the following to clean the twitter data.
df$hashtag <- ht; df$created <- as.POSIXlt(df$created); df$text <- iconv(df$text, 'latin1', 'ASCII', 'byte'); df$url <- paste0('https://twitter.com/', df$screenName, '/status/', df$id); df <- rename(df, c(retweetCount = 'retweets'));
df.a <- subset(df, select = c(text, created, url, latitude, longitude, retweets, hashtag));
The code in its original form is somewhat challenging to read.
Doing it the tidy way
Here I make use of the library(dplyr)
package to add new columns, convert encodings, update date types, rename columns, and select columns for further analysis.
df_tidy <-
df %>%
mutate(
# Add new columns containing the hashtag & tweet url
hashtag = search_string,
url = paste0('https://twitter.com/', screenName, '/status/', id),
# Convert character vector between encodings
text = iconv(text, from='latin1', to='ASCII', sub='byte'),
# Convert to a date type
created = lubridate::ymd_hms(created, tz = "UTC")
) %>%
rename(
retweets = retweetCount
) %>%
select(
text, created, url, latitude, longitude, retweets, hashtag
)