As a continuation of Day81: A bit of bioinformatics (and revisions for a paper), I need to parse the results of the HOMER motif discovery analysis.
Task: Extract from the HOMER motif name
The names of the motifs are jammed with information which needs to be split into separate columns.
To do this task, I defined an R function to load the data (part 1), parse out count data information (part 2), and separate the motif name into different columns (part 3).
load_homer_knownresults <- function(knownresults_file, returncounts=FALSE){
## Part 1: Read known results file
motifs <- readr::read_tsv(knownresults_file, progress=FALSE)
## Part 2: Return number of sequences in target and background
if (returncounts) {
original_columns <- colnames(motifs)
seqcounts <- c(
"targets" = as.integer(sub(".*?(\\d+).*", "\\1", original_columns[6])),
"background" = as.integer(sub(".*?(\\d+).*", "\\1", original_columns[8]))
)
return(seqcounts)
}
## Part 3: Tidy column names & extract from motif name
colnames(motifs) <-
c("name", "consensus", "pval", "pvalLOG", "qvalBEN",
"tagetN", "targetP", "bgN", "bgP")
motifs %>%
tidyr::separate("name", into=c("tf", "origin", "source"), sep="/") %>%
tidyr::separate("tf", into=c("tf", "dnabindingdomain"), sep="\\(") %>%
dplyr::mutate(dnabindingdomain = sub("\\)$", "", dnabindingdomain))
}
Now that the information is tidied, we are able to search the transcription factor name in e.g. PubMed for further research.