I was toying with R to see how the number of variables might affect spurious regression. Suppose that we have an $I(1)$ vector $y$ and a matrix $X$ with $I(1)$ columns. If the two are not related, then OLS regression will be disastrous, with up to 50% of $X$'s columns showing significance. On the other hand, suppose I set

$$y =X_1\beta + \epsilon $$

where $X_1$ is the first column of the $X$ matrix and $\epsilon$ is white noise. Then the regression works beautifully: $y$ and $X_1$ form a cointegrating pair, and the regression correctly determines that the other columns are unrelated to the outcome, despite being nonstationary.
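
As an informal illustration of what "cointegrating pair" means here, the sketch below compares the lag-1 autocorrelation of the OLS residuals in the two situations. It uses a single regressor for brevity, with $\beta = 1$ and a noise standard deviation of 1000 as in my experiment. The helper `resid_rho` and the seed are just illustrative choices. This is only a rough check of residual persistence, not a formal cointegration test, which would need something like an Engle-Granger procedure with the appropriate critical values.

```r
# Sketch: residual persistence in the cointegrated vs. the spurious case.
set.seed(1)  # arbitrary seed, only for reproducibility
nobs <- 1000

x <- cumsum(rnorm(nobs))              # I(1) regressor (random walk)
y_coint <- x + rnorm(nobs, sd = 1000) # y = x * 1 + white noise
y_spur <- cumsum(rnorm(nobs))         # random walk independent of x

# Hypothetical helper: lag-1 autocorrelation of the OLS residuals
resid_rho <- function(y, x) {
  e <- residuals(lm(y ~ x))
  cor(e[-1], e[-length(e)])
}

rho_coint <- resid_rho(y_coint, x)    # should be near 0 (stationary residuals)
rho_spur <- resid_rho(y_spur, x)      # should be near 1 (residuals still I(1))
print(c(cointegrated = rho_coint, spurious = rho_spur))
```

Residuals that still behave like a random walk are the usual warning sign that the $t$-statistics from the regression cannot be trusted.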

This raises the question: in situations where you have thousands or more variables and would use regularized regression techniques, is spurious regression a problem? It seems that as long as at least one variable is related to the outcome, the regression will be fine.

The code for my experiment:

```
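nruns <- 1000   # number of simulated data sets
nobs <- 1000    # observations per data set
nvars <- 100    # number of I(1) regressors
significant_coefs <- numeric(nruns)
for (i in seq_len(nruns)) {
  # Each column of X is an independent random walk
  X <- replicate(nvars, cumsum(rnorm(nobs)))
  # y depends only on the first column, plus white noise
  y <- X[, 1] + rnorm(nobs, sd = 1000)
  model <- lm(y ~ X)
  # Count p-values <= 0.05 (column 4), intercept included
  significant_coefs[i] <- sum(summary(model)$coefficients[, 4] <= 0.05)
}
hist(significant_coefs)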
```

To see the impact of spurious regression, just change $y$ to a random walk that is independent of $X$.

```
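nruns <- 1000   # number of simulated data sets
nobs <- 1000    # observations per data set
nvars <- 100    # number of I(1) regressors
significant_coefs <- numeric(nruns)
for (i in seq_len(nruns)) {
  # Each column of X is an independent random walk
  X <- replicate(nvars, cumsum(rnorm(nobs)))
  # y is now a random walk unrelated to every column of X
  y <- cumsum(rnorm(nobs))
  model <- lm(y ~ X)
  # Count p-values <= 0.05 (column 4), intercept included
  significant_coefs[i] <- sum(summary(model)$coefficients[, 4] <= 0.05)
}
hist(significant_coefs)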
```

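In the first case, an average of 6 coefficients have p-values of at most 0.05, which is close to the roughly 5 expected by chance at the 5% level across 101 coefficients (100 columns plus the intercept). In the second case, the average is 51.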
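
Since the two scripts differ only in the line that builds $y$, the comparison can be wrapped in one function and repeated for several values of `nvars`. This is a minimal sketch: the names `mean_significant`, `related` and `unrelated` are made up for illustration, and the defaults (`nruns = 20`, `nobs = 200`) and the small grid are deliberately much smaller than in my runs so that it finishes quickly. The averages it prints will therefore not match the 6 and 51 reported above.

```r
# Hypothetical helper: average number of coefficients with p <= 0.05
# (intercept included), for a given way of generating y from X.
mean_significant <- function(make_y, nvars, nruns = 20, nobs = 200) {
  counts <- numeric(nruns)
  for (i in seq_len(nruns)) {
    X <- replicate(nvars, cumsum(rnorm(nobs)))  # independent random walks
    y <- make_y(X)
    p_values <- summary(lm(y ~ X))$coefficients[, 4]
    counts[i] <- sum(p_values <= 0.05)
  }
  mean(counts)
}

# y related to the first column only (first experiment)
related <- function(X) X[, 1] + rnorm(nrow(X), sd = 1000)
# y an independent random walk (second experiment)
unrelated <- function(X) cumsum(rnorm(nrow(X)))

set.seed(1)  # arbitrary seed, only for reproducibility
nvars_grid <- c(5, 10, 20)
results <- data.frame(
  nvars = nvars_grid,
  related = sapply(nvars_grid, function(k) mean_significant(related, k)),
  unrelated = sapply(nvars_grid, function(k) mean_significant(unrelated, k))
)
print(results)
```

Calling `mean_significant(related, 100, nruns = 1000, nobs = 1000)` should correspond to the first experiment, and swapping in `unrelated` to the second.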