Day32: Variable importance in ANNs

Posted by csiu on March 28, 2017 | with: 100daysofcode, Machine Learning

Today I look at how variable importance can be obtained from neural networks.

Seven methods for assessing relative variable importance are listed in (Ibrahim, 2013): 1. the Connection Weights algorithm (CW), 2. Modified Connection Weights (MCW), 3. Most Squares (MS), 4. Multiple Linear Regression (MLR), 5. Dominance Analysis (DA), 6. Garson’s algorithm, & 7. Partial derivative. In this post, I will try to reproduce the calculations & results of (Ibrahim, 2013) using R.

Ibrahim, OM. 2013. A comparison of methods for assessing the relative importance of input variables in artificial neural networks. Journal of Applied Sciences Research, 9(11): 5692-5700.

The R markdown for this post is found here.


Data: Trained network weights

The data is pulled from Table 4: Final connection weights.

> w$ih
        h1     h2      h3     h4
x1 -0.8604 1.0651 -0.1392 1.0666
x2 -6.6167 1.5142 -5.3051 1.3358
x3 -1.7338 1.4107 -1.4207 0.7845

> w$oh
        h1     h2      h3     h4
gy -5.2075 2.0158 -4.4617 1.5858
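
For reference, here is roughly how w could be set up as a list of two weight matrices (my own reconstruction from the values above; the actual R markdown may define it differently):

# Final connection weights (Ibrahim 2013, Table 4):
# input-hidden weights (ih) and hidden-output weights (oh)
w <- list(
  ih = matrix(c(-0.8604, 1.0651, -0.1392, 1.0666,
                -6.6167, 1.5142, -5.3051, 1.3358,
                -1.7338, 1.4107, -1.4207, 0.7845),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("x1", "x2", "x3"), c("h1", "h2", "h3", "h4"))),
  oh = matrix(c(-5.2075, 2.0158, -4.4617, 1.5858),
              nrow = 1,
              dimnames = list("gy", c("h1", "h2", "h3", "h4")))
)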


Connection weights algorithm (CW)


Here we show explicitly the calculations for the importance of input x1.

# We first calculate the connection weight product of eg. input x1
# w_x1,h1 * w_h1,gy
# w_x1,h2 * w_h2,gy
# w_x1,h3 * w_h3,gy
# w_x1,h4 * w_h4,gy
(cw_prd_x1 <- w$ih[1,] * w$oh)
         h1       h2        h3       h4
gy 4.480533 2.147029 0.6210686 1.691414
# & then take the sum
sum(cw_prd_x1)
[1] 8.940045

We do the calculation for all inputs and get the following connection weight product sums:

> cw_prd_sums
       x1        x2        x3
 8.940045 63.296866 19.455250
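
One way to compute all three sums at once (assuming the w list sketched above; the original post does not show this step):

# Connection weight product sum for every input (one sum per row of w$ih)
(cw_prd_sums <- apply(w$ih, 1, function(wi) sum(wi * as.numeric(w$oh))))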

To calculate relative importance, we normalize by all connection weight product sums. Defining a helper function makes redoing this calculation with different inputs trivial.

# Helper function for calculating normalized scores (out of 100%)
library(magrittr)  # for the %>% pipe used throughout this post

normalize_to_100 <- function(x, sort=TRUE){
  (x / sum(x) * 100) %>%
    sort(decreasing = sort)
}

normalize_to_100(cw_prd_sums)
       x2        x3        x1
69.031928 21.218008  9.750064


  • The equation in the paper is the raw connection weight product sum, CW_i = Σ_h (w_i,h × w_h,gy), but to calculate the relative importance (out of 100%) I think they normalized by the total, and thus the equation should be RI_i = CW_i / Σ_j CW_j × 100 (which is what normalize_to_100 does).


The first proposed algorithm (MCW) - Modified Connection Weights

A correction term (the partial correlation) is multiplied by the connection weight product sum from CW, and the absolute value is taken; this is called the corrected sum. To calculate the relative importance of each input, its corrected sum is then divided by the total corrected sum. (The calculation is shown below, after a short digression on partial correlation.)


Partial correlation

Partial correlation measures the degree of association between two random variables, with the effect of a set of controlling random variables removed. – wikipedia


The following is from SlideShare: What is a partial correlation? (Ken Plummer, 2014)

> dat
  individual height weight gender
1          a     73    240      1
2          b     70    210      1
3          c     69    180      1
4          d     68    160      1
5          e     70    150      2
6          f     68    140      2
7          g     67    135      2
8          h     62    120      2

The correlation between height and weight is 0.825

cor(dat$height, dat$weight, method="pearson") #default is pearson
[1] 0.8251372

However, when controlling for gender (male=1; female=2), the partial correlation between height and weight is 0.770

# Pairwise Pearson correlations
(r_ij <- cor(dat$height, dat$weight))
[1] 0.8251372
(r_ki <- cor(dat$gender, dat$height))
[1] -0.5498414
(r_kj <- cor(dat$gender, dat$weight))
[1] -0.8026318

# Partial correlation of height (i) & weight (j), controlling for gender (k)
numer <- r_ij - (r_ki*r_kj)
denom <- sqrt((1-r_ki^2) * (1-r_kj^2))

sprintf("Partial correlation: %0.3f", numer/denom)
[1] "Partial correlation: 0.770"

“A partial correlation makes it possible to eliminate the effect of a third variable”
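
As an aside, the same value should also be obtainable with the ppcor package (assuming it is installed; this is not part of the original calculation above):

# Partial correlation of height & weight, controlling for gender
ppcor::pcor.test(dat$height, dat$weight, dat$gender)$estimate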



The partial correlation values were taken from Table 6.

(r_ijk <- c(x1=0.5266, x2=0.9584, x3=0.7275))

We then proceed to calculate the relative importance of the variables:

# Calculate the corrected sum: |partial correlation * connection weight product sum|
(abs_cor_sum <- abs(cw_prd_sums * r_ijk))
       x1        x2        x3
 4.707827 60.663716 14.153694
# Normalize & sort by relative importance
normalize_to_100(abs_cor_sum)
       x2        x3        x1
76.282345 17.797739  5.919916
  • One possible issue is that we don’t know the identities of the controlling random variables behind the partial correlations in Table 6.


The second proposed algorithm (MS) - Most Squares

This metric is calculated by comparing the initial (un-trained) and trained connection weights between the input and hidden layers. Note: the connection weights between the hidden and output layer are not used.

Initial weights were pulled from Table 7: Input-hidden initial connection weights.

        h1      h2      h3      h4
x1  0.0236  0.1943  0.1478  0.2054
x2  0.0562 -0.3437 -0.3215 -0.0894
x3 -0.1488  0.1844 -0.0608  0.3267
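
The code below refers to these initial weights as w$ih_i; one way to add them to the w list (again copied from the table above):

# Initial input-hidden connection weights (Ibrahim 2013, Table 7)
w$ih_i <- matrix(c( 0.0236,  0.1943,  0.1478,  0.2054,
                    0.0562, -0.3437, -0.3215, -0.0894,
                   -0.1488,  0.1844, -0.0608,  0.3267),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("x1", "x2", "x3"), c("h1", "h2", "h3", "h4")))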

Calculations are made as follows:

# Sum of squared differences between initial & trained input-hidden weights
(sum_sq_diff <-
  (w$ih_i - w$ih)^2 %>%
  rowSums()
)
       x1        x2        x3
 2.363783 74.846851  6.074946
# Normalize & sort
normalize_to_100(sum_sq_diff)
       x2        x3        x1
89.867719  7.294115  2.838166
  • Possible issues include (1) the results depend on the (random) weight initialization & (2) the hidden-output weights are dropped (what of additional hidden layers?)


Multiple Linear Regression (MLR)

(This is a fancy name for a linear regression with more than one explanatory variable; MLR is different from multivariate linear regression, which refers to the case of multiple response variables.)
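
For concreteness, fitting an MLR in R would look something like this (training_data is hypothetical; the raw observations are not available in the paper or in this post):

# Linear regression with more than one explanatory variable
fit <- lm(gy ~ x1 + x2 + x3, data = training_data)
summary(fit)$coefficients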

  • Issue: I want to infer variable importance from ANNs, not remodel the data with a linear model.

Dominance Analysis (DA)

DA determines the dominance of one input over another by comparing their additional contributions across all subset models… approaches the problem of relative importance by examining the change in R^2 resulting from adding an input to all possible subset regression models… averaging all the possible models

  • Issue: I want to infer variable importance from ANNs, not remodel the data with many linear models (though a rough sketch of the DA idea is shown below).
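
To make the description concrete, here is a rough sketch of the dominance analysis averaging over a hypothetical training_data (columns gy, x1, x2, x3). This illustrates the general idea only; it is not the paper's exact procedure.

inputs <- c("x1", "x2", "x3")

# R^2 of gy regressed on a subset of inputs (0 for the empty model)
r2 <- function(vars) {
  if (length(vars) == 0) return(0)
  summary(lm(reformulate(vars, response = "gy"), data = training_data))$r.squared
}

# General dominance: for each input, average (over subset sizes) the mean
# increase in R^2 from adding that input to subsets of the remaining inputs
da <- sapply(inputs, function(v) {
  others <- setdiff(inputs, v)
  mean(sapply(0:length(others), function(k) {
    subsets <- combn(others, k, simplify = FALSE)
    mean(sapply(subsets, function(s) r2(c(s, v)) - r2(s)))
  }))
})

# Relative importance (out of 100%)
da / sum(da) * 100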



Garson’s algorithm

# If i = input i; j = hidden j; and o = the single output,
# Compute: w_ij * w_jo
# for each value i and j
#
# Matrix of weights (4,3) multiplied by vector of hidden-output weights
# Save the transposed result (3,4)
(numer <-
  (t(w$ih) * as.numeric(w$oh)) %>%
   t()
)
          h1       h2         h3       h4
x1  4.480533 2.147029  0.6210686 1.691414
x2 34.456465 3.052324 23.6697647 2.118312
x3  9.028764 2.843689  6.3387372 1.244060
# Sum of products of all inputs
# Think: contribution from each hidden node
# Eg. contribution from h1 = w_x1,h1*w_h1,o +
#                            w_x2,h1*w_h1,o +
#                            w_x3,h1*w_h1,o
# Will get contribution score for h1, h2, h3, and h4
#
# input-hidden weights (3,4) %*% (4,4) diagonal matrix of hidden-output weights
(denom <-
  w$ih %*% diag(as.numeric(w$oh)) %>%
  colSums()
)
[1] 47.965762  8.043042 30.629571  5.053786
# Divide each column of numer by the corresponding hidden-node total to produce a (3,4) matrix (see Table 11)
# Then take row sums to score the importance of each input
(ga <-
  (numer %*% diag(denom^-1)) %>%
  rowSums()
)
       x1        x2        x3
0.7153128 2.2897825 0.9949047
# Normalize & sort
normalize_to_100(ga)
      x2       x3       x1
57.24456 24.87262 17.88282


Partial derivative

(Not enough information is specified in the paper for replication.)

Final thoughts

Out of the 7 algorithms, Partial derivative, Multiple Linear Regression, and Dominance Analysis were not reproducible. MLR and DA are also not directly applicable to neural networks. Modified Connection Weights required knowing the controlling random variables, whose identities are unknown to me. Most Squares does not make sense, because (1) the weight initialization will affect the results, and (2) I’m not sure why they removed the weights in the last layer. The methods that do make sense are the Connection Weights algorithm and Garson’s algorithm. The concept behind the Connection Weights algorithm is simple, and I will need to stare at Garson’s algorithm some more.