Day87: A bit of bioinformatics pt3

Posted by csiu on May 22, 2017 | with: 100daysofcode

“I’ll get it done by May 23”, was what I said to my MSc thesis supervisor.

Needless to say, meeting this artificial deadline that I set out for myself, was what I was working on all day. Hopefully, this is the final part of “A bit of bioinformatics” following yesterday’s part2.

Here is a high level description of some of the tasks done today.

Task: Filter data frame

I was interested in obtaining regions of interest unique to an individual (here represented by V1, V2, V3).

# Create data frame
set.seed(1)
dat <-
  matrix(data = rbinom(3*10, 1, 0.4), ncol = 3) %>%
  as.data.frame()
#>     V1 V2 V3
#>  1   0  0  1
#>  2   0  0  0
#>  3   0  1  1
#>  4   1  0  0
#>  5   0  1  0
#>  6   1  0  0
#>  7   1  1  0
#>  8   1  1  0
#>  9   1  0  1
#>  10  0  1  0

# Filter for unique positions
dat %>%
  filter( (V1>0 & V2==0 & V3==0) |
            (V1==0 & V2>0 & V3==0) |
            (V1==0 & V2==0 & V3>0) )

It’s funny that & and && (or | and ||) means something different in R.

& and && indicate logical AND and | and || indicate logical OR. The shorter form performs elementwise comparisons in much the same way as arithmetic operators. The longer form evaluates left to right examining only the first element of each vector. (StackOverflow, 2013)

Honestly, I use these operations very infrequently such that I never really remember…

Task: Adding additional labels

Once I have my regions of interest, I then want to map individual annotations onto these regions.

Since I was working with genomic regions, I used bedtools intersect ... to find overlapping genomic features.

<Insert further data wrangling and hair pulling here>

Task: Visualization

Once the data is together, ggplot2 was then used to visualize the information.

I wanted to know, what the fractional DNA methylation levels look like for each individual’s DNA methylation profile for each regions of interest unique to an individual.

dat_tidy %>%

  # Visualize as boxplot
  ggplot(aes(
    x = sample_id_of_dna_methylation,
    y = fractional_methylation_level)) +
  geom_boxplot() +

  # Lay out panels in a grid based on
  # regions of interest unique to an individual
  facet_grid(~sample_id_of_region_of_interest) +

  # Zoom in to
  coord_cartesian(ylim = c(50, 100)) +

  # Relabel axes
  ylab("Fractional methyl call per bin") +
  xlab("DNA methylation profile")

# Save file
ggsave(my_cool_figure, width=5.5, heigh=4, scale=1.25)

Now that the fun part is done, the next step is to complete revisions on the manuscript 😅