Day38: Refinement

Posted by csiu on April 3, 2017 | with: 100daysofcode

Day38: Refinement

Today we will continue with the exploration of the pesticides. The data I use comes from Kaggle’s Pesticide Data Program (2015) data set uploaded by the United States Department of Agriculture.

I continued with R and the markdown document for this little project is found here. (You can also find the R Markdown document in the sample directory, but it’s less pretty.)

Factor levels

One heuristics that I generally look at is the number of distinct values a variable take. A low number suggests the variable is grossly divided, possibly a categorical variable, and a high number suggests there are many divisions or the variable is a continuous variable.

sapply(colnames(results), function(var){
  results[[var]] %>% unique() %>% length()
}) %>%
  {data.frame(var = names(.), n = .)} %>%
  arrange(desc(n))
##            var     n
## 1    sample_pk 10187
## 2       concen  2582
## 3     pestcode   489
## 4          lod   126
## 5       commod    20
## 6    testclass    20
## 7          lab     8
## 8     annotate     5
## 9     determin     5
## 10  confmethod     4
## 11        mean     4
## 12    commtype     3
## 13  quantitate     3
## 14     extract     2
## 15     conunit     1
## 16 confmethod2     1

We find n = 1 for “conunit” and “confmethod2”, which tells you nothing interesting about the data. Such variables can be discarded from further analysis. From the results, I also see that the extraction method (“extract”) takes on 2 values.

Exploring the extraction methods

  • 805 = MDA Modified QuEChERS Method
  • 818 = NSL Animal Tissue Extraction Method
# Generate separate figures
p1 <-
  results %>%
  ggplot(aes(x=extract, fill=factor(extract))) +
  geom_bar(color="black") +
  xlab("") +
  theme(legend.position = "None")
p2 <-
  results %>%
  ggplot(aes(x=extract, y=lod, fill=factor(extract))) +
  geom_boxplot() +
  xlab("") +
  theme(legend.position = "None")
p3 <-
  results %>%
  ggplot(aes(x=commod, y=lod, fill=factor(extract))) +
  geom_boxplot() +
  labs(fill="Extraction\nmethod") +
  xlab("Commodity")

# Combine figures with library(cowplot)
cowplot::ggdraw() +
  cowplot::draw_plot(p1, 0, .5, .5, .5) +
  cowplot::draw_plot(p2, .5, .5, .5, .5) +
  cowplot::draw_plot(p3, 0, 0, 1, .5) +
  cowplot::draw_plot_label(c("A", "B", "C"), c(0, 0.5, 0), c(1, 1, 0.5), size = 15)

From the figure, the following insights could be drawn:

  • [A] More samples are extracted by MDA Modified QuEChERS Method (805) than NSL Animal Tissue Extraction Method (818)
  • [B] The limit of detection by MDA Modified QuEChERS Method (805) is lower, implying greater resolutions
  • [C] “PB” is the only commodity extracted by NSL Animal Tissue Extraction Method (818)

Following the trail

PB had an unusually high Limit of Detection (LOD), I want to look into this further. When we compare the number of distinct levels per variable, we find only 1 lab does the NSL Animal Tissue Extraction Method (818).

##   sample_pk      commod    commtype         lab    pestcode   testclass
##         315           1           1           1         107          10
##      concen         lod     conunit  confmethod confmethod2    annotate
##           4          13           1           2           1           1
##  quantitate        mean     extract    determin
##           1           3           1           3

We next took a look at the LOD by Pesticide for each pesticide class type:

From this figure, we see (1) there are relatively a lot of Halogenated compounds and only a handful Benzimidazole, Avermectin, and 2,4-D / Acid Herbicides compounds; and (2) the pestcode, testclass, and LOD information are all linked.

(Initially I made a boxplot, but no boxes were drawn implying no variability.)

Reaching the right information

Since understanding the data more, I come to realize using only the complete cases is not the ideal way to go; instead, the columns with missing information (containing the pesticides concentrations) must be kept.

  • concen - Concentration/LOD Unit-of-Measure Code
  • quantitate - QUANTITATION METHOD in 2015 PDP Analytical Results
  • annotate -ANNOTATED INFORMATION in 2015 PDP Analytical Results
  • confmethod - CONFIRMATION METHOD in 2015 PDP Analytical Results
  • confmethod2

And using the the correct variable (ie. “concen”), we refine/corrected the figure from yesterday.