Day86: A bit of bioinformatics pt2

Posted by csiu on May 21, 2017 | with: 100daysofcode

As a continuation of Day81: A bit of bioinformatics (and revisions for a paper), I need to parse the results of the HOMER motif discovery analysis.

Task: Extract from the HOMER motif name

The names of the motifs are jammed with information which needs to be split into separate columns.

To do this task, I defined an R function to load the data (part 1), parse out count data information (part 2), and separate the motif name into different columns (part 3).

load_homer_knownresults <- function(knownresults_file, returncounts=FALSE){
  ## Part 1: Read known results file
  motifs <- readr::read_tsv(knownresults_file, progress=FALSE)

  ## Part 2: Return number of sequences in target and background
  if (returncounts) {
    original_columns <- colnames(motifs)
    seqcounts <- c(
      "targets" = as.integer(sub(".*?(\\d+).*", "\\1", original_columns[6])),
      "background" = as.integer(sub(".*?(\\d+).*", "\\1", original_columns[8]))
    )
    return(seqcounts)
  }

  ## Part 3: Tidy column names & extract from motif name
  colnames(motifs) <-
    c("name", "consensus", "pval", "pvalLOG", "qvalBEN",
      "tagetN", "targetP", "bgN", "bgP")
  motifs %>%
    tidyr::separate("name", into=c("tf", "origin", "source"), sep="/") %>%
    tidyr::separate("tf", into=c("tf", "dnabindingdomain"), sep="\\(") %>%
    dplyr::mutate(dnabindingdomain = sub("\\)$", "", dnabindingdomain))
}

Now that the information is tidied, we are able to search the transcription factor name in e.g. PubMed for further research.