1 Motivation

Linear models may not meet the assumptions of Ordinary Least Squares (OLS), which are:

  1. Independent residuals
  2. No relation between residuals and fitted values
  3. Normally-distributed residuals
  4. Homoscedastic (equal-variance) residuals, at each fitted value
  5. No high-leverage points with large residuals
  6. Linearity in all predictors, independently (perhaps also with interaction)
  1. can be dealt with by Generalized Linear Squares (GLS), if the nature of the dependence is known (e.g., time or space dependence, repeated measurements on the same object).

  2. usually indicates a non-linear relation; (3) and (4) may be due to that, but also to unusual values, and especially skewed predictors or response variables.

For example, in the Meuse soil pollution dataset available in the legacy sp R package, we can model one of the metal concentrations in soil as a linear combination of the elevation above sea level ‘elev’ and the distance (in meters) from the Maas river dist.m:

data(meuse, package = "sp")  # an example dataset supplied with the sp package
names(meuse)
##  [1] "x"       "y"       "cadmium" "copper"  "lead"    "zinc"    "elev"   
##  [8] "dist"    "om"      "ffreq"   "soil"    "lime"    "landuse" "dist.m"
summary(meuse$zinc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   113.0   198.0   326.0   469.7   674.5  1839.0
summary(m.ols <- lm(zinc ~ elev + dist.m, data=meuse))
## 
## Call:
## lm(formula = zinc ~ elev + dist.m, data = meuse)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -414.49 -152.10  -41.03  105.76 1120.31 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1696.5393   171.6115   9.886  < 2e-16 ***
## elev        -122.8008    22.5172  -5.454 1.96e-07 ***
## dist.m        -0.7719     0.1051  -7.344 1.17e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 254.7 on 152 degrees of freedom
## Multiple R-squared:  0.5246, Adjusted R-squared:  0.5184 
## F-statistic: 83.88 on 2 and 152 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(m.ols)

par(mfrow=c(1,1))

We see all of the above-mentioned problems.

2 The Box-Cox transform

Box and Cox (1964) developed a family of transformations designed to reduce non-normality of the errors in a linear model, i.e. our problem (3) above. Applying this transform often reduces non-linearity as well, i.e., our problem (1), and heteroscedascity, i.e., our problem (4).

The linear model is: \[ Y_i=\beta_0+\mathbf{\beta X}+\varepsilon_i, \;\;\varepsilon_i\sim\mathcal{N}(0,\sigma^2) \] Where \(\mathbf{X}\) is the design matrix, the coefficients to be fitted are the vector \(\mathbf{\beta}\), and (crucially) the residuals \(\varepsilon\) are assumed to be normally-distributed with a single variance \(\sigma^2\). In the example above we see this condition is not met (as well as other problems).

The idea is to transform the response variable \(Y\) to a replacement response variable \(Y_i^{(\lambda)}\), leaving the right-hand side of the regression model unchanged, so that the regression residuals become normally-distributed. Note that the regression coefficients will also change, because the response variable has changed; therefore the regression coefficients must be interpreted with respect to the transformed variable. Also, any predictions made with the model have to be back-transformed, to be interpreted in the original units.

The standard (simple) Box-Cox transform is:

\[ Y_i^{(\lambda)}=\begin{cases} \dfrac{Y_i^\lambda-1}{\lambda}&(\lambda\neq0)\\ \log{(Y_i)} & (\lambda=0) \end{cases} \]

The inverse transform is:

\[ Y_i=\begin{cases} \exp \left( \dfrac{\log(1 + \lambda Y_i^{(\lambda)})}{\lambda} \right) & (\lambda\neq0)\\ \exp( {Y_i^\lambda)} & (\lambda=0) \end{cases} \]

Obviously, this depends on a parameter \(\lambda\). Some familiar transformations are associated with certain values of \(\lambda\):

  • \(\lambda\) = 1.00: no transformation needed; produces coefficients identical to original data, except for a \(\beta_0\) (bias) one unit lower.
  • \(\lambda\) = 0.50: square root transformation
  • \(\lambda\) = 0.33: cube root transformation
  • \(\lambda\) = 0.25: fourth root transformation
  • \(\lambda\) = 0.00: natural log transformation
  • \(\lambda\) = -0.50: reciprocal square root transformation
  • \(\lambda\) = -1.00: reciprocal (inverse) transformation=

This is solved by maximum likelihood, i.e., which value of \(\lambda\) is most likely to have given rise to the observed distribution of the model residuals. 1

The log-likelihood of the linear model, assuming independent observations (i.e., solution by OLS), is:

\[ \begin{aligned} \log\mathcal{L} = &-\frac{n}{2}\log(2\pi)-n\log\sigma\\&-\frac{1}{2\sigma^2}\sum_{i=1}^n\left[Y_i^{(\lambda)}-(\beta_0+\mathbf{\beta X}_i)\right]^2\\&+(\lambda-1)\sum_{i=1}^n\log Y_i \end{aligned} \]

This can also be written:

\[ \mathcal{L} \propto -\frac{n}{2}\log\left[\sum_{i=1}^n\left(\frac{Y_i^{(\lambda)}-(\beta_0+\mathbf{\beta_1 X}_i)}{\dot{Y}^\lambda}\right)^2\right]" \]

(the \(\propto\) is because some constants have been removed) where \(\dot{Y}=\exp\left[\frac{1}{n}\sum_{i=1}^n \log Y_i\right]\), i.e., the exponential of the mean of the logarithms of the variable.

Fortunately, the function boxcox has been implemented in the MASS R package. This computes the likelihood at user-defined or default (\(\lambda = [-2 \ldots 2]\) by \(0.1\)); we can then find the optimum, i.e., maximum log-likelihood.

For the Meuse linear model example:

library(MASS)
bc <- boxcox(m.ols, plotit = TRUE)

str(bc) # numbers that make uTRUEstr(bc) # numbers that make up the graph
## List of 2
##  $ x: num [1:100] -2 -1.96 -1.92 -1.88 -1.84 ...
##  $ y: num [1:100] -348 -344 -340 -336 -332 ...
(bc.power <- bc$x[which.max(bc$y)])
## [1] -0.3434343

The optimal transformation is -0.343, i.e., towards an inverse power. Note that the confidence interval does not include 1 (no transformation), so a transformation is indicated. A logarithmic transformatic (\(\lambda = 0\)) would not be optimal.

3 Transformation and back-transformation

Now we need to transform the response variable accordingly, and re-compute the linear model, with this transformed variable.

We can write a function corresponding to the formula:

BCTransform <- function(y, lambda=0) {
    if (lambda == 0L) { log(y) }
    else { (y^lambda - 1) / lambda }
    }

And of course we need a reverse transformation:

BCTransformInverse <- function(yt, lambda=0) {
    if (lambda == 0L) { exp(yt) }
    else { exp(log(1 + lambda * yt)/lambda) }
    }

Test that the transformation followed by its inverse recovers the original values:

# testing
yt <- BCTransform(meuse$zinc, 0)
yo <- BCTransformInverse(yt, 0)
unique(round(yo-meuse$zinc),8)
## [1] 0
yt <- BCTransform(meuse$zinc, .5)
yo <- BCTransformInverse(yt, .5)
unique(round(yo-meuse$zinc),8)
## [1] 0

Yes.

4 Applying the transformation

Use this function with the optimal value of \(\lambda\) to transform the response variable:

hist(meuse$zinc, breaks=18, main = "Untransformed"); rug(meuse$zinc)

meuse$zinc.bc <- BCTransform(meuse$zinc, bc.power)
hist(meuse$zinc.bc, breaks=18, main = "Box-Cox transformed"); rug(meuse$zinc.bc)

The skew has been removed. The target is obviously bi-modal.

Compare to the log-transform:

hist(log(meuse$zinc), breaks=18, main = "Natural-log transformed"); rug(log(meuse$zinc))

This has transformed the target variable, but the test of the transformation is the distrbution of the linear model residuals, not the target variable.

Re-run the linear model with transformation:

summary(m.ols.bc <- lm(zinc.bc ~ elev + dist.m, data=meuse))
## 
## Call:
## lm(formula = zinc.bc ~ elev + dist.m, data = meuse)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.137238 -0.029456  0.003679  0.035189  0.146718 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.860e+00  3.589e-02  79.688  < 2e-16 ***
## elev        -3.393e-02  4.709e-03  -7.206 2.52e-11 ***
## dist.m      -2.350e-04  2.198e-05 -10.692  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05327 on 152 degrees of freedom
## Multiple R-squared:  0.6845, Adjusted R-squared:  0.6804 
## F-statistic: 164.9 on 2 and 152 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(m.ols.bc)

par(mfrow=c(1,1))

The residuals are close to normally-distributed; the residuals are less related to the fits, and the apparent heteroscedascity is mostly due to the uneven numbers of observations at different fitted values in the bimodal distribution.

Compare this to the linear model with a log-transformation:

summary(m.ols.log <- lm(log(zinc) ~ elev + dist, data=meuse))
## 
## Call:
## lm(formula = log(zinc) ~ elev + dist, data = meuse)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.03595 -0.27154 -0.00249  0.25329  1.10223 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.48453    0.29312  28.945  < 2e-16 ***
## elev        -0.26065    0.03849  -6.772  2.6e-10 ***
## dist        -1.96003    0.20610  -9.510  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4288 on 152 degrees of freedom
## Multiple R-squared:  0.6518, Adjusted R-squared:  0.6472 
## F-statistic: 142.3 on 2 and 152 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(m.ols.log)

par(mfrow=c(1,1))

Not too different here, because the selected power -0.343 is not too far from 0, i.e., the natural log transformation.

5 Another transformation function

The powerTransform of the car package also uses the maximum likelihood-like approach outlined by Box and Cox (1964) to select a transformation of a univariate or multivariate response for normality, linearity and/or constant variance. This is more comprehensive than just the normality of the linear model residuals The summary method automatically computes two or three likelihood ratio tests concerning the transformation powers.

We use this on the Meuse Zn linear model:

require(car)
## Loading required package: car
## Loading required package: carData
(bc.car <- powerTransform(m.ols))
## Estimated transformation parameter 
##         Y1 
## -0.3450989
str(bc.car, max.level=1)
## List of 15
##  $ value      : num 770
##  $ counts     : Named int [1:2] 3 3
##   ..- attr(*, "names")= chr [1:2] "function" "gradient"
##  $ convergence: int 0
##  $ message    : chr "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
##  $ hessian    : num [1, 1] 97.6
##  $ start      : num -0.345
##  $ lambda     : Named num -0.345
##   ..- attr(*, "names")= chr "Y1"
##  $ roundlam   : Named num -0.5
##   ..- attr(*, "names")= chr "Y1"
##  $ invHess    : num [1, 1] 0.0102
##  $ llik       : num 770
##  $ family     : chr "bcPower"
##  $ xqr        :List of 4
##   ..- attr(*, "class")= chr "qr"
##  $ y          : num [1:155, 1] 1022 1141 640 257 269 ...
##   ..- attr(*, "dimnames")=List of 2
##  $ x          : num [1:155, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "dimnames")=List of 2
##   ..- attr(*, "assign")= int [1:3] 0 1 2
##  $ weights    : num [1:155] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "class")= chr "powerTransform"
print(bc.car$lambda)
##         Y1 
## -0.3450989
summary(bc.car)
## bcPower Transformation to Normality 
##    Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## Y1   -0.3451        -0.5      -0.5434      -0.1467
## 
## Likelihood ratio test that transformation parameter is equal to 0
##  (log transformation)
##                            LRT df       pval
## LR test, lambda = (0) 11.94513  1 0.00054791
## 
## Likelihood ratio test that no transformation is needed
##                            LRT df       pval
## LR test, lambda = (1) 175.2963  1 < 2.22e-16

We see that this estimate from car -0.345 is not quite the same estimate as from the function in MASS, i.e., -0.343.

The LR-tests here show that the untransformed variable will certainly not be acceptable, so that a transformation is surely needed. It also shows that a log-transform would not be optimal.

The car package also has a function bcPower to carry out the transformation:

meuse$zinc.bc.car <- bcPower(meuse$zinc, lambda=bc.car$lambda)
hist(meuse$zinc.bc.car, breaks=18); rug(meuse$zinc.bc.car)

Re-run the linear model with this transformation:

summary(m.ols.bc.car <- lm(zinc.bc.car ~ elev + dist.m, data=meuse))
## 
## Call:
## lm(formula = zinc.bc.car ~ elev + dist.m, data = meuse)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.135924 -0.029163  0.003639  0.034880  0.145310 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.848e+00  3.554e-02  80.142  < 2e-16 ***
## elev        -3.360e-02  4.663e-03  -7.205 2.52e-11 ***
## dist.m      -2.327e-04  2.176e-05 -10.694  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05275 on 152 degrees of freedom
## Multiple R-squared:  0.6846, Adjusted R-squared:  0.6804 
## F-statistic: 164.9 on 2 and 152 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(m.ols.bc.car)

par(mfrow=c(1,1))

This is also quite good.

6 Analyzing the linear model of the transformed variable

The linear model of the transformed variable is conditioned on the data in two ways: (1) the usual estimate from noisy data points, and (2) the value of \(\lambda\) selected for the transformation. Condition (2) is not the case for user-selected, a priori, transforms.

7 Transforming a predictor

In this case we want the predictor (to be transformed) to be represented as the response \(Y\); this is accomplished with the null model, i.e., a variable predicted only by its mean: \(Y_i = \beta_0 + \varepsilon_i\). The distribution of the residuals is the same as that of the variable, just centred at its mean.

For example, supposing that Zn would be used as a predictor in a model, and we would like to transform it to near-normality:

m.null <- lm(zinc ~ 1, data=meuse)
summary(m.null)
## 
## Call:
## lm(formula = zinc ~ 1, data = meuse)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -356.7 -271.7 -143.7  204.8 1369.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   469.72      29.48   15.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 367.1 on 154 degrees of freedom
plot(m.null, which=2)

qqnorm(meuse$zinc); qqline(meuse$zinc)

hist(meuse$zinc, breaks=18); rug(meuse$zinc)

We see that the model residuals follow the same distribution as the original data values. There is serious skew: far too many small values, far too few large, to be normally-distributed.

Now compute the transformation parameter:

bc.null <- boxcox(m.null)

(bc.power.null <- bc.null$x[which.max(bc.null$y)])
## [1] -0.2626263
summary(bc.power.null.car <- powerTransform(m.null))
## bcPower Transformation to Normality 
##    Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## Y1   -0.2728        -0.5      -0.5168      -0.0287
## 
## Likelihood ratio test that transformation parameter is equal to 0
##  (log transformation)
##                            LRT df     pval
## LR test, lambda = (0) 4.906319  1 0.026759
## 
## Likelihood ratio test that no transformation is needed
##                            LRT df       pval
## LR test, lambda = (1) 112.0677  1 < 2.22e-16

The optimal transformation from MASS is -0.263 and from car is -0.273. As in the linear model residuals for the predictive model of Zn, these are negative, but closer to a log transformation.

Transform the variable and re-compute the distribution:

meuse$zinc.bc.null <- BCTransform(meuse$zinc, bc.power.null)
qqnorm(meuse$zinc.bc.null); qqline(meuse$zinc.bc.null)

hist(meuse$zinc.bc.null, breaks=18); rug(meuse$zinc.bc.null)

This is somewhat better – at least it is symmetric, although the tails have become too “light”. This is because this variable has a bimodal distribution, which is difficult to see in the original histogram.

8 Caution

The Box-Cox transform should not be applied if other transformations are indicated by theory. An example from Box and Cox (1964) is a chemical reaction, where physical theory suggests that the reaction rate \(r\) should be modelled as \(r \propto c^\alpha \cdot \beta \cdot exp(t)\), where \(c\) is a concentration of the reactants and \(t\) is the temperature. This could be linearized as \(r = \alpha \cdot \log c + \beta \cdot t\) and then solved for \(\alpha\) and \(\beta\). This transformation is indicated by theory, not empirically as with Box-Cox.

Also, if the suggested Box-Cox power is close to 0, many analysts use the log-transform even though it is not optimal, for ease of interpretation.

9 References

Box, G. E. P., & Cox, D. R. (1964). An Analysis of Transformations (with discussion). Journal of the Royal Statistical Society. Series B (Methodological), 26(2), 211–252.