# Is spurious regression a problem for lasso and similar techniques?

I was toying with R to see how the number of variables might affect spurious regression. Suppose that we have an $I(1)$ vector $y$ and a matrix $X$ with $I(1)$ columns. If the two are unrelated, then OLS regression will be disastrous, with up to 50% of $X$'s columns showing up as statistically significant. On the other hand, suppose I set

$$y =X_1\beta + \epsilon$$

where $X_1$ is the first column of the $X$ matrix and $\epsilon$ is white noise. Then the regression works beautifully: $y$ and $X_1$ form a cointegrating pair, and the regression rightly determines that the other columns are unrelated to the outcome, despite being nonstationary.
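One way to write the two setups explicitly, assuming (as in the simulation) that every column of $X$ is a pure random walk with independent standard normal increments. The symbols $u$, $v$, $b$ and the time index $t$ are introduced here only for illustration:

$$X_{j,t} = X_{j,t-1} + u_{j,t}, \qquad u_{j,t} \sim \text{i.i.d. } N(0,1)$$

$$\text{related case:}\quad y_t = \beta X_{1,t} + \epsilon_t \;\Longrightarrow\; y_t - \beta X_{1,t} = \epsilon_t \sim I(0)$$

$$\text{unrelated case:}\quad y_t = y_{t-1} + v_t \;\Longrightarrow\; y_t - \sum_j b_j X_{j,t} \sim I(1) \text{ for every choice of } b$$

In the related case a linear combination with a stationary residual exists; in the unrelated case, with $v_t$ independent of all $u_{j,t}$, no such combination exists.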

This raises the question: in situations where you have thousands of variables or more and would use regularized regression techniques, is spurious regression a problem? It seems that as long as at least one variable is related to the outcome, the regression will be fine.

The code for my experiment:

    nruns <- 1000
    nobs <- 1000
    nvars <- 100
    significant_coefs <- numeric(nruns)

    for (i in 1:nruns) {
      X <- replicate(nvars, cumsum(rnorm(nobs)))
      y <- X[, 1] + rnorm(nobs, sd = 1000)

      model <- lm(y ~ X)
      significant_coefs[i] <- sum(summary(model)$coefficients[, 4] <= 0.05)
    }

    hist(significant_coefs)

To see the impact of spurious regression, just change the $y$ variable to a random walk.

    nruns <- 1000
    nobs <- 1000
    nvars <- 100
    significant_coefs <- numeric(nruns)

    for (i in 1:nruns) {
      X <- replicate(nvars, cumsum(rnorm(nobs)))
      y <- cumsum(rnorm(nobs))

      model <- lm(y ~ X)
      significant_coefs[i] <- sum(summary(model)$coefficients[, 4] <= 0.05)
    }

    hist(significant_coefs)


In the first case I get an average of 6 coefficients with p-values of at most 0.05, out of 101 (100 columns plus the intercept); in the second I get 51.
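For the regularized version of the question, here is a minimal sketch of how the same comparison might be run with a lasso fit in place of `lm`. It is an illustration under stated assumptions, not a definitive implementation: `soft_threshold`, `lasso_cd` and the `lambda_frac` parameter are hypothetical names written for this example, the penalty level is picked arbitrarily as a fraction of the smallest penalty that zeroes every coefficient, the sizes are reduced, and the noise in the related case uses `sd = 1` rather than the `sd = 1000` above. In practice one would more likely use a package such as glmnet with a cross-validated penalty.

```r
# Soft-thresholding operator used in the lasso coordinate update.
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

# Hypothetical helper: lasso by cyclic coordinate descent on standardized columns.
# Minimizes (1 / (2 * n)) * sum((y - X %*% b)^2) + lambda * sum(abs(b)),
# with lambda = lambda_frac * (smallest lambda that zeroes every coefficient).
lasso_cd <- function(X, y, lambda_frac = 0.1, n_iter = 200) {
  n <- nrow(X)
  Xs <- scale(X) * sqrt(n / (n - 1))   # columns: mean 0, mean square 1
  yc <- y - mean(y)
  lambda <- lambda_frac * max(abs(crossprod(Xs, yc))) / n
  beta <- numeric(ncol(Xs))
  resid <- yc                          # residual for the starting beta (all zeros)
  for (iter in seq_len(n_iter)) {      # fixed number of sweeps, no convergence check
    for (j in seq_len(ncol(Xs))) {
      rho <- sum(Xs[, j] * resid) / n + beta[j]
      new_beta <- soft_threshold(rho, lambda)
      resid <- resid - Xs[, j] * (new_beta - beta[j])
      beta[j] <- new_beta
    }
  }
  beta                                 # coefficients on the standardized scale
}

set.seed(1)
nobs <- 200                            # reduced sizes (assumption) to keep this quick
nvars <- 20
X <- replicate(nvars, cumsum(rnorm(nobs)))

y_related <- X[, 1] + rnorm(nobs)      # y tied to the first column (sd = 1 is an assumption)
y_unrelated <- cumsum(rnorm(nobs))     # y is an independent random walk

# Number of columns the lasso keeps (nonzero coefficients) in each case.
selected_related <- sum(lasso_cd(X, y_related) != 0)
selected_unrelated <- sum(lasso_cd(X, y_unrelated) != 0)
print(c(related = selected_related, unrelated = selected_unrelated))
```

The counts it prints depend on the seed and on `lambda_frac`, so a single run says little; wrapping the last block in a loop over many simulated datasets, as in the OLS experiment, would be the fairer comparison.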

• Can you define "spurious regression" in your question? I don't know if it is a common term, but I don't know it at least. – Matthew Drury Jun 8 '18 at 15:30
• It's when things that aren't meaningful appear to be so. I'll add code so you can see what I mean. – badmax Jun 8 '18 at 15:31
• @badmax "spurious" refers to the conclusions drawn from a regression model. – AdamO Jun 8 '18 at 15:46
• Are you trying to build a predictive model, or are you trying to identify which input variables the output variable is dependent on? Also, did I misunderstand your code, or are you running regressions on non-stationary data? In that case the significance test for the coefficients does not apply. – rinspy Jun 8 '18 at 15:48
• I am trying to build a predictive model while identifying the dependence structure. I am running regressions on non-stationary data to see what would happen. – badmax Jun 8 '18 at 15:54