Robust regression is an alternative to least squares regression when data are contaminated with *outliers* or *influential observations*, and it can also be used for the purpose of detecting influential observations.

library(MASS) library(foreign)

Let's begin our discussion on robust regression with some terms in linear regression.

**Residual**: The difference between the predicted value (based on the regression equation) and the actual, observed value, i.e., \(u=Y-\hat{Y}\).

**Outlier**: In linear regression, an outlier is an observation with large residual. In other words, it is an observation whose dependent-variable value is unusual given its value on the predictor variables. An outlier may indicate a sample peculiarity or may indicate a data entry error or other problem.

**Leverage**: An observation with an extreme value on a predictor variable is a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean. High leverage points can have a great amount of effect on the estimate of regression coefficients.

**Influence**: An observation is said to be influential if removing the observation substantially changes the estimate of the regression coefficients. Influence can be thought of as the product of leverage and outlierness.

**Cook's distance** (or Cook's D): A measure that combines the information of leverage and residual of the observation.

Robust regression can be used in any situation in which you would use least squares regression. When fitting a *least squares regression*
\[
\sum_{i=1}^n(Y_i-{\bf X}_{i,\bullet}{\boldsymbol\beta})^2 \looparrowright \min_{\boldsymbol\beta},
\]
we might find some outliers or high leverage data points. We have decided that these data points are not data entry errors, neither they are from a different population than most of our data. So we have no compelling reason to exclude them from the analysis. Robust regression might be a good strategy since it is a compromise between excluding these points entirely from the analysis and including all the data points and treating all them equally in OLS regression. The idea of robust regression is to weigh the observations differently based on how well behaved these observations are. Roughly speaking, it is a form of weighted and reweighted least squares regression.

The `rlm`

command in the `MASS`

package command implements several versions of robust regression. In this page, we will show \(M\)-estimation with *Huber* and *Tukey* (aka *bisquare weighting*). These two are very standard. \(M\)-estimation is optimizing
\[
\sum_{i=1}^n\rho(Y_i-{\bf X}_{i,\bullet}{\boldsymbol\beta}) \looparrowright \min_{\boldsymbol\beta}
\]
with respect to unknown \({\boldsymbol\beta}\), where *loss function* \(\rho:\mathbb{R}\to\mathbb{R}\) is absolutely continuous, usually convex with derivative \(\psi(x)=d\rho(x)/d x\) (so-called *influence function*). The \(M\)-estimator of \({\boldsymbol\beta}\) based on the function \(\rho\) is the solution of the following equations
\[
\sum_{i=1}^n {\bf X}_{i,\bullet}^{\top}\psi(Y_i-{\bf X}_{i,\bullet}{\boldsymbol\beta})={\boldsymbol 0}.
\]

If now we define a *weight function* \(w(x)=\psi(x)/x\), the previous non-linear system of equations becomes so-called *estimating equation*
\[
\sum_{i=1}^n {\bf X}_{i,\bullet}^{\top}(Y_i-{\bf X}_{i,\bullet}{\boldsymbol\beta})w(Y_i-{\bf X}_{i,\bullet}{\boldsymbol\beta})={\boldsymbol 0}.
\]
But the weights depend on the residuals and the residuals on the weights. The equation is solved using *Iteratively Reweighted Least Squares* (IRLS). If \(r_i=Y_i-{\bf X}_{i,\bullet}{\boldsymbol\beta}\), then the IRLS problem is
\[
\sum_{i=1}^nw(r_I^{(k-1)})r_i^2 \looparrowright \min_{\boldsymbol\beta},
\]
where the superscript \((k)\) indicates the iteration number. The weight \(w(r_I^{(k-1)})\) should be recomputed after each iteration in order to be used in the next iteration. For example, the coefficient vector at iteration \(k\) is
\[
\widehat{\boldsymbol\beta}^{(k)}=({\bf X}^{\top}{\bf W}^{(k-1)}{\bf X})^{-1}{\bf X}^{\top}{\bf W}^{(k-1)}{\bf Y}.
\]
The process continues until it converges.

In Huber weighting, observations with small residuals get a weight of 1 and the larger the residual, the smaller the weight. This is defined by the weight function \[ w(x)=\left\{ \begin{array}{cc} 1, & |x|\leq k,\\ k/|x|, & |x|>k. \end{array} \right. \] With Tukey's bisquare weighting, all cases with a non-zero residual get down-weighted at least a little \[ w(x)=\left\{ \begin{array}{cc} [1-(x/c)^2]^2, & |x|\leq c,\\ 0, & |x|>c. \end{array} \right. \]

For our data analysis below, we will use the crime dataset that appears in Statistical Methods for Social Sciences, Third Edition by Alan Agresti and Barbara Finlay (Prentice Hall, 1997). The variables are state id (`sid`

), state name (`state`

), violent crimes per 100,000 people (`crime`

), murders per 1,000,000 (`murder`

), the percent of the population living in metropolitan areas (`pctmetro`

), the percent of the population that is white (`pctwhite`

), percent of population with a high school education or above (`pcths`

), percent of population living under poverty line (`poverty`

), and percent of population that are single parents (`single`

). It has 51 observations. We are going to use `poverty`

and `single`

to predict `crime`

.

cdata <- read.csv("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/crime.csv") summary(cdata)

## sid state crime murder ## Min. : 1.0 ak : 1 Min. : 82.0 Min. : 1.600 ## 1st Qu.:13.5 al : 1 1st Qu.: 326.5 1st Qu.: 3.900 ## Median :26.0 ar : 1 Median : 515.0 Median : 6.800 ## Mean :26.0 az : 1 Mean : 612.8 Mean : 8.727 ## 3rd Qu.:38.5 ca : 1 3rd Qu.: 773.0 3rd Qu.:10.350 ## Max. :51.0 co : 1 Max. :2922.0 Max. :78.500 ## (Other):45 ## pctmetro pctwhite pcths poverty ## Min. : 24.00 Min. :31.80 Min. :64.30 Min. : 8.00 ## 1st Qu.: 49.55 1st Qu.:79.35 1st Qu.:73.50 1st Qu.:10.70 ## Median : 69.80 Median :87.60 Median :76.70 Median :13.10 ## Mean : 67.39 Mean :84.12 Mean :76.22 Mean :14.26 ## 3rd Qu.: 83.95 3rd Qu.:92.60 3rd Qu.:80.10 3rd Qu.:17.40 ## Max. :100.00 Max. :98.50 Max. :86.60 Max. :26.40 ## ## single ## Min. : 8.40 ## 1st Qu.:10.05 ## Median :10.90 ## Mean :11.33 ## 3rd Qu.:12.05 ## Max. :22.10 ##

In most cases, we begin by running an OLS regression and doing some diagnostics. We will begin by running an OLS regression and looking at diagnostic plots examining residuals, fitted values, Cook's distance, and leverage.

summary(ols <- lm(crime ~ poverty + single, data = cdata))

## ## Call: ## lm(formula = crime ~ poverty + single, data = cdata) ## ## Residuals: ## Min 1Q Median 3Q Max ## -811.14 -114.27 -22.44 121.86 689.82 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -1368.189 187.205 -7.308 2.48e-09 *** ## poverty 6.787 8.989 0.755 0.454 ## single 166.373 19.423 8.566 3.12e-11 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 243.6 on 48 degrees of freedom ## Multiple R-squared: 0.7072, Adjusted R-squared: 0.695 ## F-statistic: 57.96 on 2 and 48 DF, p-value: 1.578e-13

opar <- par(mfrow = c(3, 2), oma = c(0, 0, 1.1, 0)) plot(ols, las = 1) plot(ols, which=c(4,6))

cdata[c(9, 25, 51), 1:2]

## sid state ## 9 9 fl ## 25 25 ms ## 51 51 dc

d1 <- cooks.distance(ols) r <- stdres(ols) a <- cbind(cdata, d1, r) a[d1 > 4/51, ]

## sid state crime murder pctmetro pctwhite pcths poverty single d1 ## 1 1 ak 761 9.0 41.8 75.2 86.6 9.1 14.3 0.1254750 ## 9 9 fl 1206 8.9 93.0 83.5 74.4 17.8 10.6 0.1425891 ## 25 25 ms 434 13.5 30.7 63.3 64.3 24.7 14.7 0.6138721 ## 51 51 dc 2922 78.5 100.0 31.8 73.1 26.4 22.1 2.6362519 ## r ## 1 -1.397418 ## 9 2.902663 ## 25 -3.562990 ## 51 2.616447

We probably should drop DC to begin with since it is not even a state. We include it in the analysis just to show that it has large Cook's D and demonstrate how it will be handled by `rlm`

. Now we will look at the residuals. We will generate a new variable called `absr1`

, which is the absolute value of the residuals (because the sign of the residual doesn't matter). We then print the ten observations with the highest absolute residual values.

rabs <- abs(r) a <- cbind(cdata, d1, r, rabs) asorted <- a[order(-rabs), ] asorted[1:10, ]

## sid state crime murder pctmetro pctwhite pcths poverty single ## 25 25 ms 434 13.5 30.7 63.3 64.3 24.7 14.7 ## 9 9 fl 1206 8.9 93.0 83.5 74.4 17.8 10.6 ## 51 51 dc 2922 78.5 100.0 31.8 73.1 26.4 22.1 ## 46 46 vt 114 3.6 27.0 98.4 80.8 10.0 11.0 ## 26 26 mt 178 3.0 24.0 92.6 81.0 14.9 10.8 ## 21 21 me 126 1.6 35.7 98.5 78.8 10.7 10.6 ## 1 1 ak 761 9.0 41.8 75.2 86.6 9.1 14.3 ## 31 31 nj 627 5.3 100.0 80.8 76.7 10.9 9.6 ## 14 14 il 960 11.4 84.0 81.0 76.2 13.6 11.5 ## 20 20 md 998 12.7 92.8 68.9 78.4 9.7 12.0 ## d1 r rabs ## 25 0.61387212 -3.562990 3.562990 ## 9 0.14258909 2.902663 2.902663 ## 51 2.63625193 2.616447 2.616447 ## 46 0.04271548 -1.742409 1.742409 ## 26 0.01675501 -1.460885 1.460885 ## 21 0.02233128 -1.426741 1.426741 ## 1 0.12547500 -1.397418 1.397418 ## 31 0.02229184 1.354149 1.354149 ## 14 0.01265689 1.338192 1.338192 ## 20 0.03569623 1.287087 1.287087

Now let's run our first robust regression. Robust regression is done by iterated re-weighted least squares (IRLS). The command for running robust regression is `rlm`

in the `MASS`

package. There are several weighting functions that can be used for IRLS. We are going to first use the Huber weights in this example. We will then look at the final weights created by the IRLS process. This can be very useful.

summary(rr.huber <- rlm(crime ~ poverty + single, data = cdata))

## ## Call: rlm(formula = crime ~ poverty + single, data = cdata) ## Residuals: ## Min 1Q Median 3Q Max ## -846.09 -125.80 -16.49 119.15 679.94 ## ## Coefficients: ## Value Std. Error t value ## (Intercept) -1423.0373 167.5899 -8.4912 ## poverty 8.8677 8.0467 1.1020 ## single 168.9858 17.3878 9.7186 ## ## Residual standard error: 181.8 on 48 degrees of freedom

hweights <- data.frame(state = cdata$state, resid = rr.huber$resid, weight = rr.huber$w) hweights2 <- hweights[order(rr.huber$w), ] hweights2[1:15, ]

## state resid weight ## 25 ms -846.08536 0.2889618 ## 9 fl 679.94327 0.3595480 ## 46 vt -410.48310 0.5955740 ## 51 dc 376.34468 0.6494131 ## 26 mt -356.13760 0.6864625 ## 21 me -337.09622 0.7252263 ## 31 nj 331.11603 0.7383578 ## 14 il 319.10036 0.7661169 ## 1 ak -313.15532 0.7807432 ## 20 md 307.19142 0.7958154 ## 19 ma 291.20817 0.8395172 ## 18 la -266.95752 0.9159411 ## 2 al 105.40319 1.0000000 ## 3 ar 30.53589 1.0000000 ## 4 az -43.25299 1.0000000

We can see that roughly, as the absolute residual goes down, the weight goes up. In other words, cases with a large residuals tend to be down-weighted. This output shows us that the observation for Mississippi will be down-weighted the most. Florida will also be substantially down-weighted. All observations not shown above have a weight of 1. In OLS regression, all cases have a weight of 1. Hence, the more cases in the robust regression that have a weight close to one, the closer the results of the OLS and robust regressions.

Next, let's run the same model, but using the bisquare weighting function. Again, we can look at the weights.

rr.bisquare <- rlm(crime ~ poverty + single, data = cdata, psi = psi.bisquare) summary(rr.bisquare)

## ## Call: rlm(formula = crime ~ poverty + single, data = cdata, psi = psi.bisquare) ## Residuals: ## Min 1Q Median 3Q Max ## -905.59 -140.97 -14.98 114.65 668.38 ## ## Coefficients: ## Value Std. Error t value ## (Intercept) -1535.3338 164.5062 -9.3330 ## poverty 11.6903 7.8987 1.4800 ## single 175.9303 17.0678 10.3077 ## ## Residual standard error: 202.3 on 48 degrees of freedom

biweights <- data.frame(state = cdata$state, resid = rr.bisquare$resid, weight = rr.bisquare$w) biweights2 <- biweights[order(rr.bisquare$w), ] biweights2[1:15, ]

## state resid weight ## 25 ms -905.5931 0.007652565 ## 9 fl 668.3844 0.252870542 ## 46 vt -402.8031 0.671495418 ## 26 mt -360.8997 0.731136908 ## 31 nj 345.9780 0.751347695 ## 18 la -332.6527 0.768938330 ## 21 me -328.6143 0.774103322 ## 1 ak -325.8519 0.777662383 ## 14 il 313.1466 0.793658594 ## 20 md 308.7737 0.799065530 ## 19 ma 297.6068 0.812596833 ## 51 dc 260.6489 0.854441716 ## 50 wy -234.1952 0.881660897 ## 5 ca 201.4407 0.911713981 ## 10 ga -186.5799 0.924033113

We can see that the weight given to Mississippi is dramatically lower using the bisquare weighting function than the Huber weighting function and the parameter estimates from these two different weighting methods differ. When comparing the results of a regular OLS regression and a robust regression, if the results are very different, you will most likely want to use the results from the robust regression. Large differences suggest that the model parameters are being highly influenced by outliers. Different functions have advantages and drawbacks. Huber weights can have difficulties with severe outliers, and bisquare weights can have difficulties converging or may yield multiple solutions.

As you can see, the results from the two analyses are fairly different, especially with respect to the coefficients of `single`

and the constant (`intercept`

). While normally we are not interested in the constant, if you had centered one or both of the predictor variables, the constant would be useful. On the other hand, you will notice that `poverty`

is not statistically significant in either analysis, whereas `single`

is significant in both analyses.

- Robust regression does not address issues of heterogeneity of variance. This problem can be addressed by using functions in the
`sandwich`

package after the`lm`

function. - The examples shown here have presented R code for \(M\)-estimation. There are other estimation options available in
`rlm`

and other R commands and packages:*Least trimmed squares*using`ltsReg`

in the`robustbase`

package and*MM*using`rlm`

.

- UCLA: IDRE (Institute for Digital Research and Education). Data Analysis Examples. from http://www.ats.ucla.edu/stat/dae/ (accessed January 31, 2014)
- Li, G. (1985). Robust regression. In Exploring Data Tables, Trends, and Shapes, ed. D. C. Hoaglin, F. Mosteller, and J. W. Tukey, Wiley.
- Fox, J. (1997). Applied regression analysis, linear models, and related models. Thousand Oaks, CA: Sage publications.