$\begingroup$

I was toying with R to see how the number of variables might affect spurious regression. Suppose that we have an $I(1)$ vector $y$ and a matrix $X$ with $I(1)$ columns. If the two are not related, then OLS regression will be disastrous, with up to 50% of $X$'s columns showing significance. On the other hand, suppose I set

$$y =X_1\beta + \epsilon $$

where $X_1$ is the first column of the $X$ matrix and $\epsilon$ is white noise. Then the regression works beautifully: $y$ and $X_1$ form a cointegrating pair, and the regression correctly determines that the other columns are unrelated to the outcome, despite being nonstationary.

This raises the question: in situations where you have thousands of variables or more and you would use regularized regression techniques, is spurious regression a problem? It seems that as long as at least one variable is related to the outcome, your regression will be fine.

The code for my experiment:

nruns <- 1000
nobs <- 1000
nvars <- 100
significant_coefs <- numeric(nruns)

for(i in 1:nruns) {
  X <- replicate(nvars, cumsum(rnorm(nobs)))
  y <- X[, 1] + rnorm(nobs, sd = 1000)

  model <- lm(y ~ X)
  significant_coefs[i] <- sum(summary(model)$coefficients[, 4] <= 0.05)
}

hist(significant_coefs)

To see the impact of spurious regression, just change the $y$ variable to an independent random walk:

nruns <- 1000
nobs <- 1000
nvars <- 100
significant_coefs <- numeric(nruns)

for(i in 1:nruns) {
  X <- replicate(nvars, cumsum(rnorm(nobs)))
  y <- cumsum(rnorm(nobs))

  model <- lm(y ~ X)
  significant_coefs[i] <- sum(summary(model)$coefficients[, 4] <= 0.05)
}

hist(significant_coefs)

In the first case I get an average of 6 coefficients with p-values of at most 0.05; in the second I get 51.

$\endgroup$
  • $\begingroup$ Can you define "spurious regression" in your question? I don't know if it is a common term, but I don't know it at least. $\endgroup$ – Matthew Drury Jun 8 '18 at 15:30
  • $\begingroup$ It's when things that aren't meaningful appear to be so. I'll add code so you can see what I mean. $\endgroup$ – badmax Jun 8 '18 at 15:31
  • $\begingroup$ @badmax "spurious" refers to the conclusions drawn from a regression model. $\endgroup$ – AdamO Jun 8 '18 at 15:46
  • $\begingroup$ Are you trying to build a predictive model, or are you trying to identify which input variables the output variable is dependent on? Also, did I misunderstand your code, or are you running regressions on non-stationary data? In that case the significance test for the coefficients does not apply. $\endgroup$ – rinspy Jun 8 '18 at 15:48
  • $\begingroup$ I am trying to build a predictive model while identifying the dependence structure. I am running regressions on non-stationary data to see what would happen. $\endgroup$ – badmax Jun 8 '18 at 15:54
$\begingroup$

In the scenario you described, you would usually use cross-validation to tune the regularization parameter of the regression. Cross-validation will tell you if the relationships you identified were spurious, since your model would have poor performance on the validation sets (but with potentially high variance). In that sense, spurious regression will not be a problem since you will know that your model has no predictive power.
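
As a rough illustration of that validation check, here is a minimal base-R sketch. It is not the asker's setup: it uses plain `lm` rather than a regularized fit, and it swaps in a single time-ordered holdout (train on the first 700 observations, validate on the last 300) instead of K-fold cross-validation, on the assumption that shuffled folds would leak information across serially dependent observations. The split sizes, `nvars = 10`, and the naive carry-forward baseline are all arbitrary choices made for illustration.

```r
set.seed(1)
nobs  <- 1000
nvars <- 10          # kept small for speed; an arbitrary choice
train <- 1:700       # time-ordered split (assumption, not K-fold)
test  <- 701:nobs

# Independent random walks: any in-sample relationship is spurious by construction
X <- replicate(nvars, cumsum(rnorm(nobs)))
y <- cumsum(rnorm(nobs))
dat <- data.frame(y = y, X)   # matrix columns become X1, ..., X10

fit <- lm(y ~ ., data = dat[train, ])
in_sample_r2 <- summary(fit)$r.squared

# Out-of-sample error of the fitted model on the held-out block
pred <- predict(fit, newdata = dat[test, ])
model_rmse <- sqrt(mean((dat$y[test] - pred)^2))

# Naive baseline: carry the last training value of y forward
naive_rmse <- sqrt(mean((dat$y[test] - dat$y[max(train)])^2))

print(c(in_sample_r2 = in_sample_r2,
        model_rmse   = model_rmse,
        naive_rmse   = naive_rmse))
```

Comparing `in_sample_r2` with the two holdout errors is one way to see whether an apparently good fit carries over to unseen data; the result of a single split can vary a lot from seed to seed, which is the high-variance caveat mentioned above.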

However, performing linear regression on non-iid data is a bad idea. Your model will likely pick up spurious correlations, even though cross-validation will reveal them as spurious. You should transform the data to be stationary (for $I(1)$ series, typically by differencing) before performing the regression. This lets the model ignore irrelevant variables that merely share a similar trend with your output variable.
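
A hedged sketch of that transformation, reusing the structure of the experiment in the question: the same unrelated random walks are regressed once in levels and once after first-differencing with `diff`. The helper `count_significant` is a hypothetical name introduced here, and the run count and dimensions are reduced from the question's values purely to keep the example quick. First-differencing is assumed to be the appropriate transformation because the simulated series are $I(1)$ by construction.

```r
set.seed(1)
nruns <- 200   # fewer runs than in the question, for speed
nobs  <- 500
nvars <- 20

# Hypothetical helper: number of slope coefficients with p-value <= 0.05
count_significant <- function(y, X) {
  p <- summary(lm(y ~ X))$coefficients[-1, 4]   # drop the intercept row
  sum(p <= 0.05)
}

levels_counts <- numeric(nruns)
diff_counts   <- numeric(nruns)

for (i in seq_len(nruns)) {
  X <- replicate(nvars, cumsum(rnorm(nobs)))    # unrelated random walks
  y <- cumsum(rnorm(nobs))

  levels_counts[i] <- count_significant(y, X)               # regression in levels
  diff_counts[i]   <- count_significant(diff(y), diff(X))   # regression in first differences
}

# Share of slope coefficients flagged as significant, per regressor
rates <- c(levels      = mean(levels_counts) / nvars,
           differenced = mean(diff_counts) / nvars)
print(rates)
```

Since the differenced series here are independent white noise, one would expect the `differenced` rate to sit near the nominal 5% level, while the `levels` rate should be far higher. Note that `diff` applied to a matrix differences each column separately, so `diff(X)` has `nobs - 1` rows, matching `diff(y)`.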

$\endgroup$
  • $\begingroup$ Couldn't we run a regression on non-stationary data, hoping that a complex model (e.g. a machine learning model) can catch some of the non-stationary trends, then check validation metrics on the (transformed) stationary data? $\endgroup$ – Tanguy Sep 19 '18 at 13:37
