Linear models may not meet the assumptions of Ordinary Least Squares (OLS), which are: (1) the relation between the predictors and the response is linear; (2) the residuals are independent; (3) the residuals are normally distributed; (4) the residuals have constant variance (homoscedasticity).
Problem (2) can be dealt with by Generalized Least Squares (GLS), if the nature of the dependence is known (e.g., time or space dependence, repeated measurements on the same object).
A curved pattern in the residuals-vs.-fitted-values plot usually indicates a non-linear relation, i.e., problem (1); problems (3) and (4) may be due to that, but also to unusual values, and especially to skewed predictor or response variables.
For example, in the Meuse soil pollution dataset available in the
legacy sp R package, we can model one of the metal
concentrations in soil as a linear combination of the elevation above
sea level ‘elev’ and the distance (in meters) from the Maas river
‘dist.m’:
data(meuse, package = "sp") # an example dataset supplied with the sp package
names(meuse)
## [1] "x" "y" "cadmium" "copper" "lead" "zinc" "elev"
## [8] "dist" "om" "ffreq" "soil" "lime" "landuse" "dist.m"
summary(meuse$zinc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 113.0 198.0 326.0 469.7 674.5 1839.0
summary(m.ols <- lm(zinc ~ elev + dist.m, data=meuse))
##
## Call:
## lm(formula = zinc ~ elev + dist.m, data = meuse)
##
## Residuals:
## Min 1Q Median 3Q Max
## -414.49 -152.10 -41.03 105.76 1120.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1696.5393 171.6115 9.886 < 2e-16 ***
## elev -122.8008 22.5172 -5.454 1.96e-07 ***
## dist.m -0.7719 0.1051 -7.344 1.17e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 254.7 on 152 degrees of freedom
## Multiple R-squared: 0.5246, Adjusted R-squared: 0.5184
## F-statistic: 83.88 on 2 and 152 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(m.ols)
par(mfrow=c(1,1))
We see all of the above-mentioned problems.
Box and Cox (1964) developed a family of transformations designed to reduce non-normality of the errors in a linear model, i.e. our problem (3) above. Applying this transform often reduces non-linearity as well, i.e., our problem (1), and heteroscedasticity, i.e., our problem (4).
The linear model is: \[ Y_i=\beta_0+\mathbf{\beta X}_i+\varepsilon_i, \;\;\varepsilon_i\sim\mathcal{N}(0,\sigma^2) \] where \(\mathbf{X}\) is the design matrix, the coefficients to be fitted are the vector \(\mathbf{\beta}\), and (crucially) the residuals \(\varepsilon\) are assumed to be normally-distributed with a single variance \(\sigma^2\). In the example above we see this condition is not met (as well as other problems).
The idea is to transform the response variable \(Y\) to a replacement response variable \(Y_i^{(\lambda)}\), leaving the right-hand side of the regression model unchanged, so that the regression residuals become normally-distributed. Note that the regression coefficients will also change, because the response variable has changed; therefore the regression coefficients must be interpreted with respect to the transformed variable. Also, any predictions made with the model have to be back-transformed, to be interpreted in the original units.
The standard (simple) Box-Cox transform is:
\[ Y_i^{(\lambda)}=\begin{cases} \dfrac{Y_i^\lambda-1}{\lambda}&(\lambda\neq0)\\ \log{(Y_i)} & (\lambda=0) \end{cases} \]
The inverse transform is:
\[ Y_i=\begin{cases} \exp \left( \dfrac{\log(1 + \lambda Y_i^{(\lambda)})}{\lambda} \right) & (\lambda\neq0)\\ \exp\left(Y_i^{(\lambda)}\right) & (\lambda=0) \end{cases} \]
Obviously, this depends on a parameter \(\lambda\). Some familiar transformations are associated with certain values of \(\lambda\): \(\lambda = 1\) (no transformation, apart from a shift), \(\lambda = 1/2\) (square root), \(\lambda = 0\) (natural logarithm), and \(\lambda = -1\) (reciprocal). So how should \(\lambda\) be chosen?
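As a quick numerical check (our own illustration, not part of the original analysis), the \(\lambda \neq 0\) branch of the transform does converge, slowly, to the natural logarithm as \(\lambda \to 0\):

```r
y <- 326   # the median Meuse zinc concentration, ppm
# the power branch of the Box-Cox transform for ever-smaller lambda ...
sapply(c(0.1, 0.01, 0.001, 1e-6), function(lambda) (y^lambda - 1) / lambda)
# ... approaches the limiting case
log(y)     # approx. 5.787
```

This is why the \(\lambda = 0\) case must be defined separately: the power formula itself is \(0/0\) at \(\lambda = 0\).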
This is solved by maximum likelihood, i.e., by finding the value of \(\lambda\) most likely to have given rise to the observed distribution of the model residuals.
The log-likelihood of the linear model, assuming independent observations (i.e., solution by OLS), is:
\[ \begin{aligned} \log\mathcal{L} = &-\frac{n}{2}\log(2\pi)-n\log\sigma\\&-\frac{1}{2\sigma^2}\sum_{i=1}^n\left[Y_i^{(\lambda)}-(\beta_0+\mathbf{\beta X}_i)\right]^2\\&+(\lambda-1)\sum_{i=1}^n\log Y_i \end{aligned} \]
This can also be written:
\[ \log\mathcal{L} \propto -\frac{n}{2}\log\left[\sum_{i=1}^n\left(\frac{Y_i^{(\lambda)}-(\beta_0+\mathbf{\beta X}_i)}{\dot{Y}^{\lambda-1}}\right)^2\right] \]
(the \(\propto\) is because some constants have been removed) where \(\dot{Y}=\exp\left[\frac{1}{n}\sum_{i=1}^n \log Y_i\right]\), i.e., the exponential of the mean of the logarithms of the variable.
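This log-likelihood can be profiled over \(\lambda\) and maximized directly. The following sketch (our own code, assuming the sp package is installed to supply the meuse data) implements the first form above with the constant terms dropped, and optimizes it numerically:

```r
data(meuse, package = "sp")   # assumes the sp package is installed
y <- meuse$zinc
n <- length(y)
# profile log-likelihood of lambda: transform the response, fit the
# linear model by OLS, plug in the residual variance estimate, and
# drop the terms that do not depend on lambda
bc.loglik <- function(lambda) {
  yt  <- if (lambda == 0) log(y) else (y^lambda - 1) / lambda
  rss <- sum(residuals(lm(yt ~ elev + dist.m, data = meuse))^2)
  -(n / 2) * log(rss / n) + (lambda - 1) * sum(log(y))
}
opt <- optimize(bc.loglik, interval = c(-2, 2), maximum = TRUE)
opt$maximum   # should agree with the estimates below, about -0.345
```

This is essentially what the packaged implementations do; we show it only to make the likelihood computation concrete.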
Fortunately, the function boxcox has been implemented in
the MASS R package. This computes the profile log-likelihood at
user-specified or default values of \(\lambda\) (\(\lambda = [-2
\ldots 2]\) by \(0.1\)); we can
then find the optimum, i.e., the maximum log-likelihood.
For the Meuse linear model example:
library(MASS)
bc <- boxcox(m.ols, plotit = TRUE)
str(bc) # numbers that make up the graph
## List of 2
## $ x: num [1:100] -2 -1.96 -1.92 -1.88 -1.84 ...
## $ y: num [1:100] -348 -344 -340 -336 -332 ...
(bc.power <- bc$x[which.max(bc$y)])
## [1] -0.3434343
The optimal transformation is -0.343, i.e., towards an inverse (negative) power. Note that the confidence interval does not include 1 (no transformation), so a transformation is indicated. A logarithmic transformation (\(\lambda = 0\)) would not be optimal either.
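Since boxcox only evaluates the likelihood at grid points, the reported optimum is only as precise as the grid. A sketch (our own refinement, not in the original) re-runs it on a narrower, much finer grid around the first estimate:

```r
library(MASS)
data(meuse, package = "sp")   # assumes the sp package is installed
m.ols <- lm(zinc ~ elev + dist.m, data = meuse)
# zoom in around the first estimate of about -0.34 with a 0.001 step
bc.fine <- boxcox(m.ols, lambda = seq(-0.6, -0.1, by = 0.001),
                  plotit = FALSE)
(lambda.fine <- bc.fine$x[which.max(bc.fine$y)])   # close to -0.345
```

In practice this extra precision rarely matters; the log-likelihood surface is quite flat near its maximum.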
Now we need to transform the response variable accordingly, and re-compute the linear model, with this transformed variable.
We can write a function corresponding to the formula:
BCTransform <- function(y, lambda = 0) {
if (lambda == 0) { log(y) }           # limiting case as lambda -> 0
else { (y^lambda - 1) / lambda }      # standard Box-Cox power transform
}
And of course we need a reverse transformation:
BCTransformInverse <- function(yt, lambda = 0) {
if (lambda == 0) { exp(yt) }                      # inverse of the log
else { exp(log(1 + lambda * yt) / lambda) }       # i.e., (1 + lambda*yt)^(1/lambda)
}
Test that the transformation followed by its inverse recovers the original values:
# testing
yt <- BCTransform(meuse$zinc, 0)
yo <- BCTransformInverse(yt, 0)
unique(round(yo - meuse$zinc, 8))
## [1] 0
yt <- BCTransform(meuse$zinc, .5)
yo <- BCTransformInverse(yt, .5)
unique(round(yo - meuse$zinc, 8))
## [1] 0
Yes.
Use this function with the optimal value of \(\lambda\) to transform the response variable:
hist(meuse$zinc, breaks=18, main = "Untransformed"); rug(meuse$zinc)
meuse$zinc.bc <- BCTransform(meuse$zinc, bc.power)
hist(meuse$zinc.bc, breaks=18, main = "Box-Cox transformed"); rug(meuse$zinc.bc)
The skew has been removed. The target is clearly bimodal.
Compare to the log-transform:
hist(log(meuse$zinc), breaks=18, main = "Natural-log transformed"); rug(log(meuse$zinc))
This has transformed the target variable, but the test of the transformation is the distribution of the linear-model residuals, not of the target variable itself.
Re-run the linear model with transformation:
summary(m.ols.bc <- lm(zinc.bc ~ elev + dist.m, data=meuse))
##
## Call:
## lm(formula = zinc.bc ~ elev + dist.m, data = meuse)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.137238 -0.029456 0.003679 0.035189 0.146718
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.860e+00 3.589e-02 79.688 < 2e-16 ***
## elev -3.393e-02 4.709e-03 -7.206 2.52e-11 ***
## dist.m -2.350e-04 2.198e-05 -10.692 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05327 on 152 degrees of freedom
## Multiple R-squared: 0.6845, Adjusted R-squared: 0.6804
## F-statistic: 164.9 on 2 and 152 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(m.ols.bc)
par(mfrow=c(1,1))
The residuals are now close to normally-distributed and less related to the fitted values; the apparent heteroscedasticity is mostly due to the uneven numbers of observations at different fitted values in the bimodal distribution.
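One way to quantify "close to normally-distributed" (a check we add here, not part of the original analysis) is the Shapiro-Wilk test applied to the residuals before and after transformation:

```r
library(MASS)
data(meuse, package = "sp")   # assumes the sp package is installed
m.ols  <- lm(zinc ~ elev + dist.m, data = meuse)
bc     <- boxcox(m.ols, plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]                 # grid optimum near -0.34
meuse$zinc.bc <- (meuse$zinc^lambda - 1) / lambda
m.ols.bc <- lm(zinc.bc ~ elev + dist.m, data = meuse)
shapiro.test(residuals(m.ols))     # very small p-value: clearly non-normal
shapiro.test(residuals(m.ols.bc))  # much larger p-value after transformation
```

The absolute p-values matter less than the contrast: the transformation moves the residual distribution much closer to normal.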
Compare this to the linear model with a log-transformation:
summary(m.ols.log <- lm(log(zinc) ~ elev + dist, data=meuse))
##
## Call:
## lm(formula = log(zinc) ~ elev + dist, data = meuse)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.03595 -0.27154 -0.00249 0.25329 1.10223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.48453 0.29312 28.945 < 2e-16 ***
## elev -0.26065 0.03849 -6.772 2.6e-10 ***
## dist -1.96003 0.20610 -9.510 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4288 on 152 degrees of freedom
## Multiple R-squared: 0.6518, Adjusted R-squared: 0.6472
## F-statistic: 142.3 on 2 and 152 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(m.ols.log)
par(mfrow=c(1,1))
Not too different here, because the selected power (-0.343) is not far from 0, i.e., the natural-log transformation.
The powerTransform function of the car package also
uses the maximum-likelihood-like approach outlined by Box and Cox (1964)
to select a transformation of a univariate or multivariate response for
normality, linearity and/or constant variance. This is more
comprehensive than just the normality of the linear model residuals. The
summary method automatically computes two or three likelihood-ratio
tests concerning the transformation powers.
We use this on the Meuse Zn linear model:
require(car)
## Loading required package: car
## Loading required package: carData
(bc.car <- powerTransform(m.ols))
## Estimated transformation parameter
## Y1
## -0.3450989
str(bc.car, max.level=1)
## List of 15
## $ value : num 770
## $ counts : Named int [1:2] 3 3
## ..- attr(*, "names")= chr [1:2] "function" "gradient"
## $ convergence: int 0
## $ message : chr "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
## $ hessian : num [1, 1] 97.6
## $ start : num -0.345
## $ lambda : Named num -0.345
## ..- attr(*, "names")= chr "Y1"
## $ roundlam : Named num -0.5
## ..- attr(*, "names")= chr "Y1"
## $ invHess : num [1, 1] 0.0102
## $ llik : num 770
## $ family : chr "bcPower"
## $ xqr :List of 4
## ..- attr(*, "class")= chr "qr"
## $ y : num [1:155, 1] 1022 1141 640 257 269 ...
## ..- attr(*, "dimnames")=List of 2
## $ x : num [1:155, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "dimnames")=List of 2
## ..- attr(*, "assign")= int [1:3] 0 1 2
## $ weights : num [1:155] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "class")= chr "powerTransform"
print(bc.car$lambda)
## Y1
## -0.3450989
summary(bc.car)
## bcPower Transformation to Normality
## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## Y1 -0.3451 -0.5 -0.5434 -0.1467
##
## Likelihood ratio test that transformation parameter is equal to 0
## (log transformation)
## LRT df pval
## LR test, lambda = (0) 11.94513 1 0.00054791
##
## Likelihood ratio test that no transformation is needed
## LRT df pval
## LR test, lambda = (1) 175.2963 1 < 2.22e-16
We see that this estimate from car (-0.345) is not quite
the same as the estimate from MASS (-0.343); boxcox
evaluates the likelihood only at grid points, while
powerTransform optimizes numerically.
The LR-tests here show that the untransformed variable will certainly not be acceptable, so that a transformation is surely needed. They also show that a log-transform would not be optimal.
The car package also has a function bcPower
to carry out the transformation:
meuse$zinc.bc.car <- bcPower(meuse$zinc, lambda=bc.car$lambda)
hist(meuse$zinc.bc.car, breaks=18); rug(meuse$zinc.bc.car)
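For positive responses, bcPower uses the same simple Box-Cox formula, so for a fixed \(\lambda\) it should agree with the hand-rolled BCTransform defined earlier; a quick self-contained check:

```r
library(car)
data(meuse, package = "sp")   # assumes the sp package is installed
lambda  <- -0.3450989                          # car's estimate from above
by.hand <- (meuse$zinc^lambda - 1) / lambda    # the simple Box-Cox formula
all.equal(as.vector(bcPower(meuse$zinc, lambda)), by.hand)   # TRUE
```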
Re-run the linear model with this transformation:
summary(m.ols.bc.car <- lm(zinc.bc.car ~ elev + dist.m, data=meuse))
##
## Call:
## lm(formula = zinc.bc.car ~ elev + dist.m, data = meuse)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.135924 -0.029163 0.003639 0.034880 0.145310
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.848e+00 3.554e-02 80.142 < 2e-16 ***
## elev -3.360e-02 4.663e-03 -7.205 2.52e-11 ***
## dist.m -2.327e-04 2.176e-05 -10.694 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05275 on 152 degrees of freedom
## Multiple R-squared: 0.6846, Adjusted R-squared: 0.6804
## F-statistic: 164.9 on 2 and 152 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(m.ols.bc.car)
par(mfrow=c(1,1))
This is also quite good.
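As noted earlier, predictions from a model of the transformed response must be back-transformed into the original units. A sketch, using the inverse Box-Cox formula directly (the location with elev = 8 m and dist.m = 200 m is hypothetical, chosen only for illustration):

```r
library(car)
data(meuse, package = "sp")   # assumes the sp package is installed
bc.car <- powerTransform(lm(zinc ~ elev + dist.m, data = meuse))
lambda <- unname(coef(bc.car))                 # about -0.345
meuse$zinc.bc.car <- bcPower(meuse$zinc, lambda)
m.bc <- lm(zinc.bc.car ~ elev + dist.m, data = meuse)
# predict on the transformed scale for a hypothetical location ...
pred.t <- predict(m.bc, newdata = data.frame(elev = 8, dist.m = 200))
# ... then invert the Box-Cox transform to get back to ppm Zn
(pred <- (1 + lambda * pred.t)^(1 / lambda))
```

Note that such back-transformed predictions are medians, not means, on the original scale; a bias correction is needed if unbiased means are required.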
The linear model of the transformed variable is thus conditioned on the data in two ways: (1) the usual estimation of coefficients from noisy data points, and (2) the value of \(\lambda\) selected for the transformation. Condition (2) does not apply to user-selected, a priori, transforms.
The Box-Cox procedure can also be used to transform a single variable towards normality, e.g., one that will be used as a predictor. In this case we want the variable to be transformed to be represented as the response \(Y\); this is accomplished with the null model, i.e., a variable predicted only by its mean: \(Y_i = \beta_0 + \varepsilon_i\). The distribution of the residuals is then the same as that of the variable, just centred on its mean.
For example, supposing that Zn would be used as a predictor in a model, and we would like to transform it to near-normality:
m.null <- lm(zinc ~ 1, data=meuse)
summary(m.null)
##
## Call:
## lm(formula = zinc ~ 1, data = meuse)
##
## Residuals:
## Min 1Q Median 3Q Max
## -356.7 -271.7 -143.7 204.8 1369.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 469.72 29.48 15.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 367.1 on 154 degrees of freedom
plot(m.null, which=2)
qqnorm(meuse$zinc); qqline(meuse$zinc)
hist(meuse$zinc, breaks=18); rug(meuse$zinc)
We see that the model residuals follow the same distribution as the original data values. There is serious skew: far too many small values and far too few large ones for a normal distribution.
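The claim that the null-model residuals are just the variable centred on its mean can be verified directly:

```r
data(meuse, package = "sp")   # assumes the sp package is installed
m.null <- lm(zinc ~ 1, data = meuse)
# residuals of the intercept-only model are the mean-centred values
all.equal(unname(residuals(m.null)),
          meuse$zinc - mean(meuse$zinc))   # TRUE
```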
Now compute the transformation parameter:
bc.null <- boxcox(m.null)
(bc.power.null <- bc.null$x[which.max(bc.null$y)])
## [1] -0.2626263
summary(bc.power.null.car <- powerTransform(m.null))
## bcPower Transformation to Normality
## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## Y1 -0.2728 -0.5 -0.5168 -0.0287
##
## Likelihood ratio test that transformation parameter is equal to 0
## (log transformation)
## LRT df pval
## LR test, lambda = (0) 4.906319 1 0.026759
##
## Likelihood ratio test that no transformation is needed
## LRT df pval
## LR test, lambda = (1) 112.0677 1 < 2.22e-16
The optimal transformation from MASS is -0.263 and from
car is -0.273. As with the residuals of the
predictive model for Zn, these powers are negative, but closer to a log
transformation (\(\lambda = 0\)).
Transform the variable and re-compute the distribution:
meuse$zinc.bc.null <- BCTransform(meuse$zinc, bc.power.null)
qqnorm(meuse$zinc.bc.null); qqline(meuse$zinc.bc.null)
hist(meuse$zinc.bc.null, breaks=18); rug(meuse$zinc.bc.null)
This is somewhat better – at least it is symmetric, although the tails have become too “light”. This is because this variable has a bimodal distribution, which is difficult to see in the original histogram.
The Box-Cox transform should not be applied if another transformation is indicated by theory. An example from Box and Cox (1964) is a chemical reaction, where physical theory suggests that the reaction rate \(r\) should be modelled as \(r \propto c^\alpha \cdot \exp(\beta t)\), where \(c\) is the concentration of the reactants and \(t\) is the temperature. This can be linearized as \(\log r = \alpha \cdot \log c + \beta \cdot t\) (plus a constant) and then solved for \(\alpha\) and \(\beta\). This transformation is indicated by theory, not chosen empirically as with Box-Cox.
Also, if the suggested Box-Cox power is close to 0, many analysts use the log-transform even though it is not optimal, for ease of interpretation.
Box, G. E. P., & Cox, D. R. (1964). An Analysis of Transformations (with discussion). Journal of the Royal Statistical Society. Series B (Methodological), 26(2), 211–252.