# Simulate data that arise from mediation with specified conditions

I am looking to create a function that simulates data arising from a mediation process, where a predictor (X) has an indirect effect on the outcome (Y) through the mediator (M).

I consulted the answers to the following questions:

I would like the function to simulate:

• the mediator and outcome if the user inputs the predictor,

• the predictor and outcome if the user inputs the mediator, or

• the predictor and mediator if the user inputs the outcome

I would like the user to be able to specify various conditions for simulating the data arising from mediation, including the correlation between X and Y, the correlation between X and M, the correlation between M and Y, and the proportion of the effect mediated. The proportion of the effect mediated (Pm) is the ratio of the indirect effect (ab) to the total effect (Wen & Fan, 2015). I would like the function to simulate the data that would yield a mediation model with the conditions specified by the user.

For instance, I would like the function to estimate:

• the total effect if the user inputs the correlation between X and M, the correlation between M and Y, and proportionMediated (Pm)

• proportionMediated if the user inputs the correlation between X and M, the correlation between M and Y, and the correlation between X and Y

• the correlation between X and M and the correlation between M and Y (assuming they are equal) if the user inputs the correlation between X and Y and proportionMediated

• the correlation between X and M if the user inputs the correlation between M and Y, the correlation between X and Y, and proportionMediated

• the correlation between M and Y if the user inputs the correlation between X and M, the correlation between X and Y, and proportionMediated

I used the answer to the first link (above) in writing the beginnings of a function:

simulateIndirectEffect <- function(x, m, y, a, b, cTotal, proportionMediated, seed){
if(missing(seed)){
seed <- round(runif(1, 0, 1000)*100)
}

if(missing(cTotal) == TRUE){
cTotal <- (a * b) / proportionMediated
} else if(missing(proportionMediated) == TRUE){
proportionMediated <- (a * b) / cTotal
} else if(missing(a) == TRUE & missing(b) == TRUE){
a <- sqrt(proportionMediated * cTotal)
b <- sqrt(proportionMediated * cTotal)
} else if(missing(a) == TRUE){
a <- (proportionMediated * cTotal) / b
} else if(missing(b) == TRUE){
b <- (proportionMediated * cTotal) / a
}

ab <- a * b
cPrime <- cTotal - ab

if(missing(x) == FALSE){
sampleSize <- length(x)

set.seed(seed + 1)
m <- a*x + sqrt(1-a^2) * rnorm(sampleSize) #what should I change error term to?
error <- 1 - (cPrime^2 + b^2 + 2*a*cPrime*b)

set.seed(seed + 2)
y <- cPrime*x + b*m + error*rnorm(sampleSize) #what should I change error term to?
} else if(missing(m) == FALSE){
sampleSize <- length(m)

set.seed(seed + 1)
#x <- #Not sure what to put here

set.seed(seed + 2)
#y <- #Not sure what to put here
} else if(missing(y) == FALSE){
sampleSize <- length(y)

set.seed(seed + 1)
#x <- #Not sure what to put here

set.seed(seed + 2)
#m <- #Not sure what to put here
}

simulatedData <- as.data.frame(cbind(x, m, y))

return(simulatedData)
}


I have three questions:

1. How can we simulate m and y given x (and the conditions specified) in the above function?
2. How can we simulate x and y given m (and the conditions specified) in the above function?
3. How can we simulate x and m given y (and the conditions specified) in the above function?

Note that the function above does not appear to simulate the mediation data per the conditions specified. For instance, when I simulate data based on a total effect of .6 and a proportion of the effect mediated of .4, my correlations are way too high. I want my correlation between x and y to be .6 (i.e., the total effect), but it is .99 in the simulated data (see below). I suspect that using rnorm() to generate a random variable with a mean of 0 and SD of 1 is too small to add to the error term, but am not sure what to use instead.

> predictor <- rnorm(1000, mean = 50, sd = 10)
> myData <- simulateIndirectEffect(x = predictor, cTotal = .6, proportionMediated = .4, seed = 12345)
> round(cor(myData), 2)
x    m    y
x 1.00 0.98 0.99
m 0.98 1.00 0.99
y 0.99 0.99 1.00


References:

Wen, Z., & Fan, X. (2015). Monotonicity of effect sizes: Questioning kappa-squared as mediation effect size measure. Psychological Methods, 20, 193-203. doi: 10.1037/met0000029

My answer is partly based on Caron & Valois (2018), which offers R code to simulate what you would like. I will use the usual convention for the explanation.

Let start by indicating $$a$$ as the correlation between $$X$$ to $$M$$, $$b$$ as the partial correlation between $$M$$ to $$Y$$ (while controlling for $$X$$) and $$c'$$ as the direct correlation between $$X$$ to $$Y$$.

We know that the total effect $$c$$ (the path between $$X$$ and $$Y$$ when not controlling for $$M$$ is $$c'+ab=c$$, so as a proportion : $$c' = \frac{ab}{\rho}-ab$$, where $$\rho$$ is the proportion or percentage you want. The total effect would be $$c'+ab=c$$ or $$\frac{c'}{\rho}+\frac{ab}{\rho}=100\%$$. If $$\rho=100\%$$, then it is a full mediation.

The correlation between $$X$$ and $$M$$ is $$a$$, the correlation between $$M$$ and $$Y$$ is $$=b(1-a^2)+ac$$ (remember $$b$$ is a partial correlation so we have to "unpartial" the value), and the correlation between $$X$$ and $$Y$$ is the total effect, $$c'+ab=c$$

> n = 100000
> a = .4
> b = .4
> ab = a*b
> rho = .75
> cp = ab/rho-ab #cp = c'
> x = rnorm(n)
> m = a*x + sqrt(1-a^2)*rnorm(n)
> ey = 1-(cp^2 +b^2 + 2*a*cp*b)
> y = cp*x + b*m + ey*rnorm(n)
> res1 = summary(lm(m~x))
> res1

Call:
lm(formula = m ~ x)

Residuals:
Min      1Q  Median      3Q     Max
-3.4982 -0.6181 -0.0007  0.6164  4.0573

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.001670   0.002893   0.577    0.564
x           0.403221   0.002902 138.964   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9148 on 99998 degrees of freedom
Multiple R-squared:  0.1619,    Adjusted R-squared:  0.1618
F-statistic: 1.931e+04 on 1 and 99998 DF,  p-value: < 2.2e-16

> res2 = summary(lm(y~x+m))
> res2

Call:
lm(formula = y ~ x + m)

Residuals:
Min      1Q  Median      3Q     Max
-3.4128 -0.5526 -0.0013  0.5546  4.0239

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0005503  0.0025881   0.213    0.832
x           0.0541750  0.0028354  19.107   <2e-16 ***
m           0.4011823  0.0028290 141.810   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8184 on 99997 degrees of freedom
Multiple R-squared:  0.2128,    Adjusted R-squared:  0.2128
F-statistic: 1.352e+04 on 2 and 99997 DF,  p-value: < 2.2e-16

> A = res1$$coefficients[2,1] > B = res2$$coefficients[3,1]
> CP = res2\$coefficients[2,1]
> ab
[1] 0.16
> A*B
[1] 0.1617652
> cp
[1] 0.05333333
> CP
[1] 0.05417502


The actual example used standardized coefficients, you can then unstandardize the variables by adding means and multiplying by their standard deviation.

References

Caron, P.-O., & Valois, P. (2018). A computational description of simple mediation analysis. The Quantitative Methods for Psychology, 14, 147-158. doi:10.20982/tqmp.14.2.p147