Recent Questions - Cross Validated | most recent 30 from stats.stackexchange.com | 2020-04-05T23:17:46Z | feed: https://stats.stackexchange.com/feeds | license: https://creativecommons.org/licenses/by-sa/4.0/rdf

https://stats.stackexchange.com/q/458678 | score 0 | How to compute feature-space distance? | WXJ96163 (https://stats.stackexchange.com/users/277710) | 2020-04-05T22:56:05Z
<p>Some CV research papers use nearest-neighbor techniques to compare images, for example <a href="https://arxiv.org/pdf/1710.10196.pdf" rel="nofollow noreferrer">Progressive Growing of GANs for Improved Quality, Stability, and Variation</a>:</p>
<blockquote>
<p>Next five rows: Nearest neighbors found from the training data, based on <strong>feature-space distance</strong>.</p>
</blockquote>
<p>What is "feature-space distance" in this context, and how is it computed?</p>
<p>Could someone please give a hint? Thanks in advance.</p>
https://stats.stackexchange.com/q/458676 | score 1 | MCMC results fit model poorly | user97154 (https://stats.stackexchange.com/users/97154) | 2020-04-05T22:44:17Z
<p>I've got some data that I want to fit a model to. This model depends on 9 different parameters. I started out by finding the best fit using some simple chi-square minimisation. I then wanted to give MCMC a go to try and estimate the error on the parameters. I followed the procedure suggested in the emcee documentation, initialised 900 walkers (probably a bit overkill) around this optimum value and just let it run for 5000 steps. The chains seemed to have burnt in after about 2000 steps or so, so I discarded those values and plotted the final posterior distributions for my parameters. (My priors were just flat)</p>
<p>However, I've now run into two problems:</p>
<p>Firstly, the peaks in my distributions don't coincide with my best fit values at all. I've found my best fit for v to be given by <span class="math-container">$v \simeq 0.65$</span>, however the posterior distribution of v seems to peak around a completely different value, with my best fit value barely inside the error margins.</p>
<p><a href="https://i.stack.imgur.com/qgkcY.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/qgkcY.png" alt="enter image description here"></a></p>
<p>Secondly, one of my parameters ends up with a multi-modal posterior distribution. How exactly would I interpret best fit parameters/errors from such a distribution? The 50th percentile indicated on the graph obviously misses both maxima and sits right in the trough so it seems like quite a bad estimate.</p>
<p><a href="https://i.stack.imgur.com/jWRnK.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/jWRnK.png" alt="enter image description here"></a> </p>
<p>I've tried running the chain up to 10000 steps, but it doesn't seem to improve any of the above problems.</p>
<p>Additionally it would be great if someone was able to point me to some resources which explain how/what can actually be inferred from these final distributions. I don't want to end up drawing conclusion from the plots which are not supported by them at all. </p>
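<p>A toy illustration of the 50th-percentile problem with a bimodal posterior (a sketch with synthetic draws, assuming numpy; these are not the actual posterior samples from the question):</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic bimodal "posterior": two well-separated modes at -2 and +2
samples = np.concatenate([rng.normal(-2, 0.3, 5000),
                          rng.normal(2, 0.3, 5000)])

# the 50th percentile lands in the trough between the modes
median = np.percentile(samples, 50)

# a histogram-based mode estimate picks out one of the actual peaks instead
hist, edges = np.histogram(samples, bins=100)
mode = 0.5 * (edges[np.argmax(hist)] + edges[np.argmax(hist) + 1])
```

<p>Here the median sits near 0, where the density is essentially zero, which is why per-mode summaries (or reporting the full distribution) are often preferred for multimodal posteriors.</p>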
https://stats.stackexchange.com/q/458674 | score 1 | How to calculate the coefficients of Auto regression model of multiple data? | Syed Haider (https://stats.stackexchange.com/users/279997) | 2020-04-05T22:28:29Z
<p>I have the ACF of a time series and want to interpolate the ACF using an AR model, for which I need to calculate the AR coefficients. The problem is that I have a (6000 x 6000) matrix of ACFs, and I can't use each individual sequence to calculate the coefficients. Is there a method or algorithm through which a single model can combine the information from all the ACFs to generate the AR coefficients? I have read a paper about the Burg algorithm for segments of data, which effectively combines the data, but can I use that algorithm here for the ACF? Please guide. Thank you.</p>
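<p>For a single sequence, the step described above (obtaining AR coefficients from an ACF) is the Yule-Walker solve; a minimal sketch, assuming Python with scipy (this is the per-ACF building block, not the combined multi-ACF model being asked about):</p>

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def ar_from_acf(acf, order):
    """Yule-Walker: solve the Toeplitz system R a = r for the AR coefficients.

    acf[0] must be the lag-0 value (the variance, or 1 if normalised)."""
    r = np.asarray(acf[1:order + 1])
    # (c, r) are the first column and row of the Toeplitz autocovariance matrix
    return solve_toeplitz((acf[:order], acf[:order]), r)

# example: an AR(1) process with phi = 0.6 has acf[h] = 0.6**h
acf = 0.6 ** np.arange(10)
coeffs = ar_from_acf(acf, 2)   # approximately [0.6, 0.0]
```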
https://stats.stackexchange.com/q/458672 | score 0 | How do they determine the "feels like" temperatures for weather data? [closed] | Jahier Volm (https://stats.stackexchange.com/users/279996) | 2020-04-05T22:20:00Z
<p>I'm pulling air temperature and other weather stats from various APIs. There's this concept called "feels like temperature", separate from "actual temperature". Apparently, this is the temperature that it <em>feels</em> like to a human, depending on the wind speed and other factors.</p>
<p>Is this determined, as the name suggests, by a human who "feels" it? Or is it mathematically calculated based on the other data such as the wind speed, so that it isn't actually "felt" by a human at all, but rather is a scientific estimation of how a human probably "will feel" the temperature?</p>
<p>Personally, I have no idea how anyone can "feel" the temperature in specific terms; to me, it's either "freezing", "cold", "chilly", "OK", "warm", "hot", or "unbearable" -- any guess I would make about what temperature it "feels like" would vary each time you asked me and not be accurate whatsoever.</p>
https://stats.stackexchange.com/q/458670 | score -1 | Crawling and Scraping [closed] | Soumya Ranjan Sahoo (https://stats.stackexchange.com/users/208659) | 2020-04-05T22:18:07Z
<p>I am trying to work on a pet project that needs me to crawl through a list of Wikipedia: Picture of the day pages by month. As an example: <a href="https://en.wikipedia.org/wiki/Wikipedia:Picture_of_the_day/May_2004" rel="nofollow noreferrer">https://en.wikipedia.org/wiki/Wikipedia:Picture_of_the_day/May_2004</a> has a list of images followed by a brief caption for each image. I want to do the following 2 things here:</p>
<ol>
<li>Scrape all the images from the page and their respective captions (preferably stored as Image: Caption pairs in a dictionary).</li>
<li>Crawl through the other months and repeat step 1.</li>
</ol>
<p>Any help on how to accomplish this would be highly appreciated.</p>
<p>Thank you very much.</p>
https://stats.stackexchange.com/q/458669 | score 1 | input_shape keras NN | durnovv (https://stats.stackexchange.com/users/277408) | 2020-04-05T22:17:29Z
<p>I'm working on a problem where I need to predict one value out of many features, and I have multiple measurements for each feature. </p>
<p>Below is a simplified version of my problem. <a href="https://i.stack.imgur.com/4ghHJ.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/4ghHJ.png" alt="enter image description here"></a></p>
<p>So, what I do is just taking mean value of all the measurements for each person and fitting the keras NN model. I've seen in keras documentation <code>input_shape</code> parameter. Probably I can somehow specify it. </p>
<p>My concern is that I lose too much information by just taking the mean.<br>
Question 1: How can I tell my model: "look, each id has 3 features, for each of which there are 3 measurements"? Can I do it through <code>input_shape</code>? </p>
<p>Question 2 (bonus): what is the <code>noise_shape</code> parameter of the <code>Dropout</code> layer? </p>
<p>PS: the number of measurements is the same for each id. </p>
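<p>One way to avoid averaging is to keep every measurement as part of the input; a minimal numpy sketch of the reshaping (hypothetical data; a keras model could then take <code>input_shape=(3, 3)</code>, or <code>(9,)</code> after flattening):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical data: 100 ids, 3 features, 3 measurements per feature
X = rng.normal(size=(100, 3, 3))

X_mean = X.mean(axis=2)          # averaging approach: (100, 3), loses the spread
X_flat = X.reshape(len(X), -1)   # keeps all 9 values per id: (100, 9)
```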
https://stats.stackexchange.com/q/458668 | score 1 | Do I need to match samples in CFA and EFA? | T August (https://stats.stackexchange.com/users/276409) | 2020-04-05T22:16:58Z
<p>I have a sample of 1567 autistic participants (m = 925, f = 642) who all completed a 75 question questionnaire measuring 'Systemising' or the drive to understand systems/patterns. </p>
<p>For my undergraduate dissertation, I'm carrying out an exploratory factor analysis (specifically, principal component analysis) followed by a simple CFA to analyse whether this questionnaire is unidimensional or has multiple factors. </p>
<p>Is it necessary to match equal numbers of men and women (resulting in a total sample of 642 women and 642 men)? Men in this sample tend to score statistically higher than women, and a previous DIF analysis identified many individual items (i.e. questions) with significant sex differences. For these reasons, I thought I should match them in the total sample, although I'm not sure.</p>
<p>Thanks</p>
https://stats.stackexchange.com/q/458665 | score 0 | chisq.test in R isn't giving correct X-squared [closed] | Steven (https://stats.stackexchange.com/users/279994) | 2020-04-05T22:03:46Z, updated 2020-04-05T22:18:17Z
<pre><code> probs <- dpois(0:3, lambda = 1)
comp <- 1-sum(probs)
O <- c(4,3,2,1,0)
E <- 10*c(probs,comp)
chisq.test(O, E)
</code></pre>
<p>Running this code tells me that X-squared is 15, but when I manually calculate it I get 0.6013, which is what it should be. Why is chisq.test telling me it's 15? </p>
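<p>For context: <code>chisq.test(O, E)</code> with two numeric vectors builds the contingency table <code>table(O, E)</code> and runs a test of independence, and for these particular values that table yields X-squared = 15; the goodness-of-fit form in R is <code>chisq.test(O, p = E/sum(E))</code>. The intended statistic can be cross-checked with a short sketch (shown here in Python with scipy):</p>

```python
import numpy as np
from scipy import stats

probs = stats.poisson.pmf(np.arange(4), mu=1)   # same as dpois(0:3, lambda = 1)
comp = 1 - probs.sum()
O = np.array([4, 3, 2, 1, 0])
E = 10 * np.append(probs, comp)

# goodness-of-fit statistic: sum((O - E)^2 / E)
res = stats.chisquare(f_obs=O, f_exp=E)
print(res.statistic)   # ~0.6013, matching the manual calculation
```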
https://stats.stackexchange.com/q/458664 | score 1 | Gaussian mixture models for image matrix not determining E step | MarkowModel (https://stats.stackexchange.com/users/279993) | 2020-04-05T22:03:17Z
<p>I want to calculate responsibility for each of the data points, for the given MU, SIGMA and PI.</p>
<pre><code>params:
X = numpy.ndarray[numpy.ndarray[float]] - m x n
MU = numpy.ndarray[numpy.ndarray[float]] - k x n
SIGMA = numpy.ndarray[numpy.ndarray[numpy.ndarray[float]]] - k x n x n
PI = numpy.ndarray[float] - k x 1
k = int
returns:
responsibility = numpy.ndarray[numpy.ndarray[float]] - k x m
</code></pre>
<p>I tried doing </p>
<pre><code>for i in range(k):
    part1 = 1 / ( ((2 * np.pi) ** (len(MU[i]) / 2)) * (np.linalg.det(SIGMA[i]) ** (1 / 2)) )
    print(X - MU[i], np.linalg.inv(SIGMA[i]))
    part2 = np.exp((-1 / 2) * ((X - MU[i]).T.dot(np.linalg.inv(SIGMA[i]))).dot((X - MU[i])))
    weighted_normal = PI[i] * part1 * part2
</code></pre>
<p>The above code seems to work for each entry of X, which is an m x n matrix; that is, it calculates the probability correctly for one entry of X. How do I get it for all the values of X simultaneously? Iterating over each value is slower.</p>
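<p>One way to vectorise over all m points at once, a sketch assuming scipy is available (<code>multivariate_normal.pdf</code> evaluates every row of X in a single call, avoiding the per-point loop):</p>

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, MU, SIGMA, PI):
    """E step: return the k x m matrix of responsibilities."""
    k, m = MU.shape[0], X.shape[0]
    weighted = np.empty((k, m))
    for i in range(k):
        # pdf() handles all m rows of X at once -- no loop over data points
        weighted[i] = PI[i] * multivariate_normal.pdf(X, mean=MU[i], cov=SIGMA[i])
    # normalise each column so the responsibilities for a point sum to 1
    return weighted / weighted.sum(axis=0, keepdims=True)
```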
https://stats.stackexchange.com/q/458663 | score 1 | Best approach for maximising power? | MF2020 (https://stats.stackexchange.com/users/277523) | 2020-04-05T21:58:41Z
<p>This seems like a relatively simple question compared to the apparent complexity on this forum.</p>
<p>I'm putting my PhD method together: a within-subjects experimental design (a counter-balanced related-subjects experiment). My G*Power skills are limited; I want to know whether there is more power in using a matched-pairs design or blocks of participants. I initially considered comparing the group to a non-clinical group, but using G*Power the required sample size shot up beyond what I have access to.</p>
<p>It is a 2x2x3 design, 2 main factors, one with 2 levels, one with 3 levels.
For understanding: I want to compare people wearing a pink t-shirt versus wearing a blue t-shirt on three different memory tests. The method of measurement is a MCQ test.</p>
<p>I can do this individually, as in one individual undergoes all conditions; or I can do it in groups, where blocks of participants undergo all conditions.</p>
<p>The available sample is 20-30 people. By my calculations I can get an effect size of 0.35 on Repeated Measures ANOVA tests using each person as their own control, in a sample size of 20.</p>
<p>I don't know how to calculate if using groups - possibly the same however interpreted differently in the context of being in a group?</p>
https://stats.stackexchange.com/q/458660 | score 1 | Ex post volatility | Anders (https://stats.stackexchange.com/users/279983) | 2020-04-05T21:29:04Z
<p>I am looking into how to measure volatility, and I am not sure if I have confused myself too much in my research, so now I really need your help: please either confirm my understanding of volatility, or else correct me. </p>
<p>The thing I am struggling with is conceptualizing that volatility isn't observable. </p>
<p>For example, to evaluate a GARCH model's performance in predicting volatility, one way would be to measure the difference between the GARCH forecast and the actual volatility using some evaluation function like MSE (mean squared error).
However, the actual volatility, even though it is in the past, i.e. ex post, isn't observable. </p>
<p>The volatility (even ex post) isn't observable because, well, it can't be: it is a quantity computed from observations at at least two separate times. Which times would those be that would describe the actual volatility? </p>
<p>Let's say we are looking into the volatility of Apple's stock AAPL. We have forecasted the volatility of a specific day t to be a value x. We now want to know the true volatility. Would the true volatility of day t be given by taking all the transactions throughout the day and take the square root of the variance? It is just a proxy for the volatility. Including all the trades of AAPL in a single day would mean a higher volatility than the actual volatility because of the bid/ask spread. </p>
<p>I am not sure though, if there wasn't a bid/ask spread, would taking all the observations into account (realized volatility) generate the actual ex post volatility?</p>
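<p>The aggregate-all-returns idea discussed above is usually formalised as <em>realized volatility</em>: the square root of the sum of squared intraday returns. A toy sketch with synthetic returns (numpy assumed; real tick data would add exactly the microstructure noise, such as bid/ask bounce, that the question worries about):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical day: 390 one-minute log returns with true per-minute vol 0.01,
# so the true daily vol is sqrt(390) * 0.01, about 0.198
r = rng.normal(0.0, 0.01, size=390)

realized_var = np.sum(r ** 2)          # realized variance
realized_vol = np.sqrt(realized_var)   # proxy for the (unobservable) daily vol
```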
https://stats.stackexchange.com/q/458659 | score 0 | Why does CatBoost require a lot of GPU memory? [closed] | RABI Hamza (https://stats.stackexchange.com/users/225777) | 2020-04-05T21:25:03Z
<p>I'm running a CatBoost hyperparameter optimization using Bayes_opt on a relatively small dataset (the Kaggle House Prices challenge). I noticed that it consumed almost all of the GPU memory (see the screenshot) and is still slower than xgboost and lightgbm (which use only the CPU).
<a href="https://i.stack.imgur.com/pieto.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/pieto.png" alt="screenshot of catboost Bayes_opt running"></a></p>
https://stats.stackexchange.com/q/458658 | score 0 | What are the consequences of using year of birth, rather than age, as a covariate in logistic regression? | mintypasta (https://stats.stackexchange.com/users/279987) | 2020-04-05T21:24:23Z
<p>Would this affect the estimates of other effects in the model? Would it affect the estimated impact of age/YOB? I have a cohort, but since participants were all recruited at different times, I don't know which would be an appropriate endpoint at which to calculate age (especially as many will have passed away). </p>
https://stats.stackexchange.com/q/458656 | score 1 | How do we obtain the probability density of a truncated regression with an upper and lower bound | Alex (https://stats.stackexchange.com/users/279986) | 2020-04-05T21:18:18Z, updated 2020-04-05T21:58:36Z
<p>I know my density for <span class="math-container">$y$</span> is supposed to be something of this form <span class="math-container">$$g(y|x_{i},t)=\frac{f(y|x'\beta, \sigma^{2})}{F(t|x'\beta, \sigma^{2})}$$</span> where the numerator is the density of the normal distribution and the denominator is the CDF of the normal evaluated at <span class="math-container">$t$</span> when <span class="math-container">$t$</span> acts as an upper bound. I do not know how to generalize this process when <span class="math-container">$y$</span> is bounded below by A and above by B. I started with <span class="math-container">$$Prob(y<B|y>A, x')$$</span> <span class="math-container">$$=\frac{Prob(A<y<B|x')}{Prob(y>A|x')}$$</span> <span class="math-container">$$=\frac{\Phi(\frac{B-x'\beta}{\sigma})-\Phi(\frac{A-x'\beta}{\sigma})}{1-\Phi(\frac{A-x'\beta}{\sigma})}=F_{y>A} (B)$$</span> and then I differentiated with respect to B and ended with <span class="math-container">$$f_{y|y>A, x'} (B) = \frac{\phi(\frac{B-x'\beta}{\sigma})}{\sigma \Phi(\frac{x'\beta - A}{\sigma})}$$</span> I just do not know if this is correct. Any help would be great.</p>
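<p>For reference, the standard density of a normal linear model truncated to the interval <span class="math-container">$(A, B)$</span>, against which the derivation above can be checked, is</p>
<p><span class="math-container">$$f(y \mid A<y<B, x') = \frac{\frac{1}{\sigma}\,\phi\left(\frac{y-x'\beta}{\sigma}\right)}{\Phi\left(\frac{B-x'\beta}{\sigma}\right)-\Phi\left(\frac{A-x'\beta}{\sigma}\right)}, \qquad A < y < B.$$</span></p>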
https://stats.stackexchange.com/q/458654 | score 2 | Hypothesis testing for difference in medians vs. median difference | jollycat (https://stats.stackexchange.com/users/272140) | 2020-04-05T21:15:10Z, updated 2020-04-05T23:06:40Z
<p>I found this post saying that one should test for the <em>median difference</em> instead of the <em>difference in medians</em>, in particular if the data is skewed: <a href="http://onbiostatistics.blogspot.com/2015/12/median-of-differences-versus-difference.html" rel="nofollow noreferrer">http://onbiostatistics.blogspot.com/2015/12/median-of-differences-versus-difference.html</a>. The author says "median of differences is the correct number to be used and is the number that corresponding to the signed rank test". </p>
<p>I did not find good explanations for this. <em>My question: are there any reasons from a statistical perspective why the median difference should be preferred over the difference in medians?</em></p>
<p>To give some more background: The differences are <strong>paired</strong>. Moreover, the paired <strong>differences are highly skewed</strong> to the right (in my real data set), which is why I want to use a <strong>bootstrap hypothesis test</strong>.</p>
<hr>
<p><strong>Example</strong></p>
<p>Suppose I have a two samples x1 and x2 as below. The samples are paired, for instance the <code>id</code> could specify the person and <code>x1</code> could be a measurement before intervention and <code>x2</code> after the intervention (for the same person).</p>
<pre><code>id    x1    x2  difference
 1  1.37  1.68       -0.31
 2  2.18  2.99       -0.81
 3  1.16  3.24       -2.08
 4  3.60  3.08        0.52
 5  2.33  2.19        0.14
</code></pre>
<p>The median difference would be: median(x1 - x2) = median(difference) = -0.31.</p>
<p>The difference in medians would be: median(x1) - median(x2) = 2.18 - 2.99 = -0.81.</p>
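<p>The two quantities, plus the paired bootstrap mentioned above, can be sketched on this example (in Python with numpy, as an illustration; the data are the five pairs from the table):</p>

```python
import numpy as np

x1 = np.array([1.37, 2.18, 1.16, 3.60, 2.33])
x2 = np.array([1.68, 2.99, 3.24, 3.08, 2.19])

med_of_diff = np.median(x1 - x2)              # median difference: -0.31
diff_of_med = np.median(x1) - np.median(x2)   # difference in medians: -0.81

# paired bootstrap of the median difference: resample the pairs with replacement
rng = np.random.default_rng(0)
boot = np.array([np.median((x1 - x2)[rng.integers(0, 5, size=5)])
                 for _ in range(2000)])
ci = np.percentile(boot, [2.5, 97.5])         # percentile bootstrap interval
```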
https://stats.stackexchange.com/q/458652 | score 3 | scipy.stats failing to fit Weibull distribution unless location parameter is constrained | Nayef (https://stats.stackexchange.com/users/56828) | 2020-04-05T20:49:11Z, updated 2020-04-05T23:11:18Z
<p>Here is a demo set of data points that are drawn from a larger sample. I fit a Weibull distribution in R using the <code>{fitdistrplus}</code> package, and get back reasonable results for shape and scale parameters. </p>
<pre><code># in R:
library(fitdistrplus)
x <- c(4836.6, 823.6, 3131.7, 1343.4, 709.7, 610.6,
3034.2, 1973, 7358.5, 265, 4590.5, 5440.4, 4613.7, 4763.1,
115.3, 5385.1, 6398.1, 8444.6, 2397.1, 3259.7, 307.5, 4607.4,
6523.7, 600.3, 2813.5, 6119.8, 6438.8, 2799.1, 2849.8, 5309.6,
3182.4, 705.5, 5673.3, 2939.9, 2631.8, 5002.1, 1967.3, 2810.4,
2948, 6904.8)
fitdist(x, "weibull")
</code></pre>
<p>Result: </p>
<pre><code>Fitting of the distribution ' weibull ' by maximum likelihood
Parameters:
estimate Std. Error
shape 1.501077 0.2003799
scale 3912.816005 430.4170971
</code></pre>
<p>Then I try to do the same thing using scipy.stats. I use the <code>weibull_min</code> function. (I've seen recommendations to use <code>exponweib</code> with constraint <code>a=1</code> and can confirm results are the same.)</p>
<pre><code># in python
import numpy as np
import pandas as pd
from scipy import stats
x = [4836.6, 823.6, 3131.7, 1343.4, 709.7, 610.6,
3034.2, 1973, 7358.5, 265, 4590.5, 5440.4, 4613.7, 4763.1,
115.3, 5385.1, 6398.1, 8444.6, 2397.1, 3259.7, 307.5, 4607.4,
6523.7, 600.3, 2813.5, 6119.8, 6438.8, 2799.1, 2849.8, 5309.6,
3182.4, 705.5, 5673.3, 2939.9, 2631.8, 5002.1, 1967.3, 2810.4,
2948, 6904.8]
stats.weibull_min.fit(x)
</code></pre>
<p>Here are the results: </p>
<pre><code>shape, loc, scale = (0.1102610560437356, 115.29999999999998, 3.428664764594809)
</code></pre>
<p>This is clearly a terrible fit to the data, as I can see if I just sample from this fitted distribution: </p>
<pre><code>import matplotlib.pyplot as plt
import seaborn as sns
c, loc, scale = stats.weibull_min.fit(x)
x = stats.weibull_min.rvs(c, loc, scale, size=1000)
sns.distplot(x)
</code></pre>
<p><strong>Why is the fit so bad here</strong>? </p>
<p>I am aware that by constraining the loc parameter I can recreate the results from <code>{fitdistrplus}</code>, but why should this be necessary? Shouldn't the unconstrained fit be more likely to overfit the data than to dramatically and ridiculously underfit it? </p>
<pre><code># recreate results from R's {fitdistrplus}
stats.weibull_min.fit(x, floc=0)
</code></pre>
https://stats.stackexchange.com/q/458648 | score 1 | Why is MSE used in cross validation when selecting optimum number of variables in model? | Sean (https://stats.stackexchange.com/users/279982) | 2020-04-05T20:35:02Z, updated 2020-04-05T22:09:59Z
<p>I'm currently looking through An Introduction to Statistical Learning by Gareth James, more specifically Chapter 6. It discusses ways to select the optimal number of variables in a model using methods such as forward stepwise selection. In the lab, they use MSE in cross-validation calculations to compare different-sized models (e.g. 3 predictors vs. 4). In previous sections MSE has also been used in cross-validation settings. But in the case of selecting an optimal number of coefficients, surely cross-validation is not appropriate here, as it uses MSE, which decreases as the number of predictors increases?</p>
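<p>A toy illustration of the distinction at issue (numpy only, synthetic data): training MSE can only decrease as nested predictors are added, but the MSE that cross-validation reports is computed on held-out data the fit never saw, so it need not decrease.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_max = 200, 20
X = rng.normal(size=(n, p_max))
y = 2.0 * X[:, 0] + rng.normal(size=n)   # only the first predictor matters

train, val = slice(0, 100), slice(100, 200)

def mses(p):
    """Fit OLS on the first p predictors; return (train MSE, validation MSE)."""
    A = np.c_[np.ones(100), X[train, :p]]
    beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
    Av = np.c_[np.ones(100), X[val, :p]]
    return (np.mean((y[train] - A @ beta) ** 2),
            np.mean((y[val] - Av @ beta) ** 2))
```

<p>Training MSE at p = 20 is guaranteed to be no larger than at p = 1, but the validation MSE typically moves the other way once the extra predictors are pure noise; this is exactly the gap cross-validation exploits.</p>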
https://stats.stackexchange.com/q/458637 | score 0 | Does anyone know which book this pdf is from? | Est (https://stats.stackexchange.com/users/173341) | 2020-04-05T19:08:26Z, updated 2020-04-05T22:34:40Z
<p><a href="https://i.stack.imgur.com/WoeEe.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/WoeEe.png" alt="enter image description here"></a></p>
<p>Does anyone know which book this pdf is from?</p>
https://stats.stackexchange.com/q/458635 | score 0 | What does it mean if a fixed effects regression gives the same coefficients as one that has no fixed effects? | Mathew Chandy (https://stats.stackexchange.com/users/277809) | 2020-04-05T18:51:33Z, updated 2020-04-05T22:59:46Z
<p>I am doing a difference-in-differences analysis, looking at whether environmental regulation has an effect on exports. I added fixed effects for country and time, but Stata gave me the same coefficients as my original difference-in-differences regression (with controls), and the p-values differ only very slightly. What does this mean exactly? How would I be able to interpret this result? Many thanks!</p>
https://stats.stackexchange.com/q/458634 | score 2 | DCC GARCH is wrong: Conditional Correlation or Covariances? | fernandez2001 (https://stats.stackexchange.com/users/279962) | 2020-04-05T18:47:53Z, updated 2020-04-05T22:39:20Z
<p>Does DCC GARCH model conditional correlations or conditional covariances, and what is the difference? Even if it uses conditional covariances, one can derive the conditional correlations, since one has the conditional covariances and variances; or is this thinking wrong?</p>
<p><a href="https://repub.eur.nl/pub/101761/" rel="nofollow noreferrer">This paper</a> argues that Qt is the conditional covariance matrix.
Why does deriving DCC from a random-coefficient moving-average process prove that the DCC model is in fact a dynamic conditional covariance model? Because you take the variance of the model?</p>
<p>Also, it is not clear to me where DCC is wrong. Is it because only a special case of DCC(p,q), DCC(p,0) is derived and DCC(p,q) cannot be derived?</p>
<p>Also, why does it say that there is no connection between <span class="math-container">$Q_t$</span> and <span class="math-container">$\epsilon_t$</span>, since <span class="math-container">$Q_t$</span> is clearly related to the standardized residuals,<span class="math-container">$\eta_t$</span> as they state in the paper, and these in turn will be related to <span class="math-container">$\epsilon_t$</span>?</p>
<p>I would be grateful, if somebody could help me this. </p>
https://stats.stackexchange.com/q/458587 | score 2 | Approximate the mean area of 2D Voronoi cell | Gabriel (https://stats.stackexchange.com/users/10416) | 2020-04-05T13:38:05Z, updated 2020-04-05T21:51:31Z
<p>Consider a random uniform distribution of <span class="math-container">$N$</span> points in <span class="math-container">$2D$</span> space bounded by <span class="math-container">$[0, 1]$</span> in both dimensions. Example:</p>
<p><a href="https://i.stack.imgur.com/vXU3E.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/vXU3E.png" alt="enter image description here"></a></p>
<p>If I want to estimate the mean area of their Voronoi cells, I have to obtain the Voronoi diagram for this distribution, calculate the areas associated to each point, and finally obtain their mean. This is a time-consuming process.</p>
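<p>For concreteness, that process can be sketched as follows (Python with scipy assumed; unbounded edge cells are handled crudely here by assigning them the <span class="math-container">$1/N$</span> average):</p>

```python
import numpy as np
from scipy.spatial import Voronoi

def mean_cell_area(points):
    """Mean Voronoi cell area; unbounded cells are crudely assigned 1/N."""
    vor = Voronoi(points)
    n = len(points)
    areas = []
    for region_idx in vor.point_region:
        region = vor.regions[region_idx]
        if -1 in region or len(region) == 0:
            areas.append(1.0 / n)   # unbounded cell: fall back to the 1/N average
            continue
        poly = vor.vertices[region]
        x, y = poly[:, 0], poly[:, 1]
        # shoelace formula for the polygon area
        areas.append(0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1))))
    return np.mean(areas)
```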
<p>Given that this is a random uniform distribution of points, I thought I could just approximate this area with:</p>
<p><span class="math-container">$$\hat{A}\approx\frac{1}{N}$$</span></p>
<p>i.e., the total area of the <span class="math-container">$2D$</span> space (<span class="math-container">$1*1=1$</span>) divided by the number of points (<span class="math-container">$N$</span>). I wrote some code to test this approximation, the results are as shown:</p>
<p><a href="https://i.stack.imgur.com/jtTFa.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/jtTFa.png" alt="enter image description here"></a></p>
<p>where the <span class="math-container">$y$</span> axis is the relative difference <span class="math-container">$100 * (\hat{A_{V}}-\hat{A})/\hat{A_{V}}$</span> (<span class="math-container">$\hat{A_{V}}$</span> is the real area), and <span class="math-container">$N$</span> is the number of points. This shows that <span class="math-container">$\hat{A}$</span> underestimates the real value <span class="math-container">$\hat{A_{V}}$</span> by ~30% for small <span class="math-container">$N$</span>, and tends to zero as it grows.</p>
<p>Is there a better approximation (particularly for small <span class="math-container">$N$</span> values)? </p>
<hr>
<p><strong>Add</strong></p>
<p>After whuber's comment I re-checked my code. I was using a somewhat convoluted method involving nearest neighbors to assign areas to unbounded points (close to the edges of the frame). I changed it to instead assign to these points the <span class="math-container">$1/N$</span> average area, and the results improved considerably (as expected). Here are the results:</p>
<p><a href="https://i.stack.imgur.com/1Cy5x.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/1Cy5x.png" alt="enter image description here"></a></p>
<p>Now it shows that the <span class="math-container">$1/N$</span> approximated area underestimates the "true" mean area (which actually depends on the technique used to handle unbounded points) for small <span class="math-container">$N$</span> but quickly goes to zero as <span class="math-container">$N$</span> grows. </p>
https://stats.stackexchange.com/q/458531 | score 0 | Analysing seasonal variations in time series? | gis.rajan (https://stats.stackexchange.com/users/265301) | 2020-04-05T03:30:47Z, updated 2020-04-05T21:55:33Z
<p>I have to analyze an NDVI time series spanning 17 years, and I am especially interested in its seasonal variations. I want to see the variations or changes in the periodic components, i.e. frequency, amplitude and phase shift. How can I extract these from the time series? Please suggest where to start.</p>
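<p>One standard starting point is the discrete Fourier transform: for monthly data the annual component appears as a peak at frequency 1/12 cycles per month, whose magnitude and angle give amplitude and phase. A synthetic sketch (numpy assumed; the NDVI series here is simulated, not real data):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(17 * 12)                       # 17 years of monthly samples

# synthetic NDVI: mean 0.3, annual cycle of amplitude 0.2 and phase 0.5 rad
x = 0.3 + 0.2 * np.sin(2 * np.pi * t / 12 + 0.5) + 0.01 * rng.normal(size=t.size)

F = np.fft.rfft(x - x.mean())                # drop the mean (zero frequency)
freqs = np.fft.rfftfreq(t.size, d=1.0)       # cycles per month

k = np.argmax(np.abs(F))                     # dominant periodic component
amplitude = 2 * np.abs(F[k]) / t.size        # ~0.2
period = 1 / freqs[k]                        # ~12 months
phase = np.angle(F[k])                       # phase of the cosine at that bin
```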
https://stats.stackexchange.com/q/458500 | score 0 | Time series object with multiple station data with different time periods? | Luc Ortac Onmus (https://stats.stackexchange.com/users/279763) | 2020-04-04T21:21:21Z, updated 2020-04-05T22:00:27Z
<p>How can I make a time series object from a set of data? I have data from a number of regional meteorological stations, and I don't know how to create a time series object that includes time series data from different stations. My data has the following structure:</p>
<ul>
<li>"," indicates a CSV-file-like structure (for readability);</li>
<li>"." indicates several further data rows;</li>
<li>x1, x2, x3, ..., xn and y1, y2, y3, ..., yn indicate different values;</li>
<li>StationID: each station has a unique ID number (which I want to use as a categorical value).</li>
</ul>
<pre><code>Nu, StationID, Year, MeanTemp, MaxTemp
1, 1001 , 1950, x1 , y1
2, 1001 , 1951, x2 , y2
3, 1001 , 1952, x3 , y3
., 1001 , . , xi , yi
., 1001 , . , xi+1 , yi+1
50, 1001 , 2000, x50 , y50
51, 1002 , 1970, . , .
52, 1002 , 1971
., . , 1972
., . , 1973
80, 1002 , 2000
81, 1003 , 1960
84, 1003 , 1961
., . ,
</code></pre>
<p>etc.</p>
<p>My question is: is it possible to create a time series object</p>
<ul>
<li>that has more than one meteorological station's data,</li>
<li>where stations may have different numbers of observations (one is, say, 50 years long, another 40, a third 60)?</li>
</ul>
<p>The problem is that I have more than 250 stations, spanning different years, with 11 different kinds of climate data: rainfall, temperature, snowfall, freezing days, etc.
I want to write a loop to repeat the same analyses for each climate station as the StationID changes, but first I need to solve this time series object problem. </p>
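<p>The question is posed in an R time-series-object setting, but the shape of the problem (ragged per-station series in long format) can be sketched with a hypothetical two-station fragment; this illustration uses Python with pandas, an assumption rather than the asker's toolchain:</p>

```python
import pandas as pd

# hypothetical long-format fragment: stations with different year coverage
df = pd.DataFrame({
    "StationID": [1001, 1001, 1001, 1002, 1002],
    "Year":      [1950, 1951, 1952, 1970, 1971],
    "MeanTemp":  [10.2, 10.5, 10.1,  8.1,  8.3],
})

# one wide table: one column per station, NaN where a station has no data
wide = df.pivot(index="Year", columns="StationID", values="MeanTemp")

# or loop station by station, each as its own series
for sid, g in df.groupby("StationID"):
    series = g.set_index("Year")["MeanTemp"]   # this station's time series
```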
https://stats.stackexchange.com/q/458479 | score 1 | Find distribution that minimises a function of its moments | user655870 (https://stats.stackexchange.com/users/279867) | 2020-04-04T18:53:47Z, updated 2020-04-05T22:03:48Z
<p>Imagine a probability density function <span class="math-container">$f(x)$</span>, defined for positive <span class="math-container">$x$</span>, and let's note its <span class="math-container">$n$</span>th non-centred moment <span class="math-container">$x_{n}$</span>. The mean <span class="math-container">$x_{1}$</span> is fixed (and positive). </p>
<p>How can I find <span class="math-container">$f(x)$</span> that minimises some given function of its moments? In my case <span class="math-container">$\frac{ x_{3}+x_{1}^{3}-2x_{1}x_{2} }{ (x_{2}-x_{1}^{2})^{2} }$</span>.</p>
<p>I tried to take the Gâteaux derivative of that expression in the direction of a test function <span class="math-container">$h(x)$</span>, setting the result to zero for every <span class="math-container">$h(x)$</span>. In the end I find a relation involving a few moments of <span class="math-container">$f(x)$</span> and the variable <span class="math-container">$x$</span>, which makes no sense. Would you have any idea of the correct approach here? </p>
https://stats.stackexchange.com/q/458450 | score 0 | Shannon Information | Understanding from a Microstate Perspective | GENIVI-LEARNER (https://stats.stackexchange.com/users/249944) | 2020-04-04T15:14:19Z, updated 2020-04-05T22:37:34Z
<p>So Shannon information is a way to quantify "new knowledge" by means of combinations of microstates. Say 1 bit of information in a binary system conveys 2 possible messages, due to two possible microstates:</p>
<p><span class="math-container">\begin{pmatrix}
0 \\
1
\end{pmatrix}</span></p>
<p>2 bits of information in binary convey <span class="math-container">$2^2=4$</span> possible messages, due to 4 possible microstates. </p>
<p><span class="math-container">\begin{pmatrix}
0\ 0 \\
0\ 1 \\
1\ 0 \\
1\ 1
\end{pmatrix}</span></p>
<p>Now how do we understand this concept using probabilities? If a particular outcome of a random variable, say a biased coin flip, has a probability of 0.3 for heads, then what does it really mean when we say that it conveys <span class="math-container">$-\log_2(0.3)\approx 1.74$</span> bits of information? How does the outcome of the coin have 1.74 microstates?</p>
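<p>A quick numeric sketch of the two quantities involved (Python, standard library only): the surprisal of a single outcome, and the entropy of the whole coin, whose exponential <span class="math-container">$2^{H}$</span> can be read as an effective number of equally likely microstates.</p>

```python
import math

def surprisal_bits(p):
    """Information conveyed by observing an outcome of probability p."""
    return -math.log2(p)

def entropy_bits(ps):
    """Shannon entropy: the expected surprisal over all outcomes."""
    return sum(p * surprisal_bits(p) for p in ps if p > 0)

h_heads = surprisal_bits(0.3)    # ~1.74 bits for the p = 0.3 outcome
H = entropy_bits([0.3, 0.7])     # ~0.88 bits per flip on average
effective_states = 2 ** H        # ~1.84 "equally likely" microstates
```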
https://stats.stackexchange.com/q/458195 | score 1 | Compare non-linear model parameter estimates between conditions | bs93 (https://stats.stackexchange.com/users/279634) | 2020-04-03T03:34:07Z, updated 2020-04-05T23:05:52Z
<p>What is the appropriate way to test for significant differences between the same parameter estimate from 2 nonlinear models? An example using R - here are 2 datasets: </p>
<pre><code>library(tidyverse)
library(broom)  # for tidy(); installed with the tidyverse but not attached by it
# example from ?nls
DNase1 <- subset(DNase, Run == 1)
DNase2 <- subset(DNase, Run == 2)
</code></pre>
<p>Both datasets can be fit with a nonlinear function using the nls() function and coefficients extracted: </p>
<pre><code> ## fit models and extract coefficients
m1 <- nls(density ~ SSlogis(log(conc), Asym, xmid, scal), DNase1)
m1_coef <- tidy(m1) %>%
mutate(Run = 1)
m2 <- nls(density ~ SSlogis(log(conc), Asym, xmid, scal), DNase2)
m2_coef <- tidy(m2) %>%
mutate(Run = 2)
pars <- rbind(m1_coef, m2_coef) %>%
dplyr::filter(term == "Asym")
print(pars)
</code></pre>
<p>For simplicity, some of the results include 2 estimates of the 'Asym' parameter, one estimate for each condition (Run 1 & 2) made by each of the 2 models:</p>
<pre><code> term Estimate Std. Error t value Pr(>|t|) Run
1 Asym 2.345182 0.0781541 30.00715 2.165539e-13 1
2 Asym 2.595948 0.0646589 40.14835 5.109901e-15 2
</code></pre>
<p><strong>Is there a way to test whether the estimate for 'Asym' from Run 2 (2.596) is significantly different from the estimate from Run 1 (2.345)?</strong></p>
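<p>One commonly suggested check, shown here purely as an illustration (a Wald-type z statistic treating the two estimates as independent normals; whether that approximation is adequate for these nls fits is part of what is being asked):</p>

```python
import math

# estimates and standard errors from the model output above
asym1, se1 = 2.345182, 0.0781541   # Run 1
asym2, se2 = 2.595948, 0.0646589   # Run 2

z = (asym2 - asym1) / math.sqrt(se1 ** 2 + se2 ** 2)   # ~2.47

# two-sided normal p-value via the error function (stdlib only)
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
```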
https://stats.stackexchange.com/q/4528361Regression on residuals within joint modelalan ocallaghanhttps://stats.stackexchange.com/users/1362822020-03-05T18:24:40Z2020-04-05T21:55:52Z
<p>I have a regression model (part of a larger hierarchical model) where I wish to construct a regression using the residuals of another regression.</p>
<p>To simplify, say we have <span class="math-container">$n$</span> observations, an <span class="math-container">$n \times r$</span> design matrix <span class="math-container">$X$</span>, and a corresponding coefficient vector <span class="math-container">$\beta$</span>, such that
<span class="math-container">\begin{equation*}
y = X\beta + \epsilon \\
\epsilon \sim N(0, \sigma^2)
\end{equation*}</span></p>
<p>Assuming we have equivalent observations and coefficients for each <span class="math-container">$k$</span> of <span class="math-container">$m$</span> subjects, we can write
<span class="math-container">\begin{equation}
y_k = X_k\beta_k + \epsilon_k
\end{equation}</span>
The residuals, <span class="math-container">$\epsilon_k$</span>, are of interest to me. I would thus like to model differences in the residuals between subjects, using a vector of <span class="math-container">$p$</span> subject-specific covariates, <span class="math-container">$z_k$</span>. This is where things get confusing for me. Now we would have
<span class="math-container">\begin{equation}
\epsilon_k = z_k\psi_k \sim N(0,\sigma^2_k)
\end{equation}</span>
I cannot place a prior on <span class="math-container">$z_k\psi_k$</span> though, since <span class="math-container">$z_k$</span> can vary arbitrarily. I then thought that perhaps the solution is to have a nested regression for the residuals, such that
<span class="math-container">\begin{equation}
\epsilon_k \sim N(z_k\psi_k,\sigma^2_k) \ \text{or} \\
\epsilon_k = z_k\psi_k + \tau_k, \ \ \tau_k\sim N(0, \sigma^2_k)
\end{equation}</span>
However, this seems non-identifiable to me. If we don't restrict <span class="math-container">$\epsilon_k$</span>, then the values of <span class="math-container">$X_k\beta_k$</span> and <span class="math-container">$\epsilon_k$</span> can exchange, and the likelihood would be equivalent. There seems to be no way to disentangle the effect of the two sets of covariates.</p>
<p>It seems to me that if I restrict <span class="math-container">$\beta_k$</span> to be the same across all subjects I have some chance, ie:
<span class="math-container">\begin{equation}
y_k = X_k\beta + \epsilon_k
\end{equation}</span>
since this induces some shrinkage towards a global value, but I am unsure if this would be enough to allow me to infer these effects. Is there an alteration I can make, or an alternative approach that would make this tractable?</p>
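As a quick numerical sanity check of the shared-beta idea (a sketch on simulated data; all names and dimensions are made up, and this two-stage least-squares fit is only a crude frequentist approximation to the joint Bayesian model being asked about):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, p = 50, 40, 3, 2   # subjects, obs per subject, X-covariates, z-covariates

beta = rng.normal(size=r)               # shared across subjects
psi = np.array([0.8, -0.5])             # subject-level effect on the residual mean
X = rng.normal(size=(m, n, r))
z = rng.normal(size=(m, p))

# Each subject's residual mean is shifted by z_k @ psi
y = X @ beta + (z @ psi)[:, None] + rng.normal(scale=0.3, size=(m, n))

# Stage 1: pooled least squares for the shared beta
Xs, ys = X.reshape(-1, r), y.reshape(-1)
beta_hat = np.linalg.lstsq(Xs, ys, rcond=None)[0]

# Stage 2: regress subject-mean residuals on the subject-level covariates z
resid_mean = (y - X @ beta_hat).mean(axis=1)
psi_hat = np.linalg.lstsq(z, resid_mean, rcond=None)[0]

print(beta_hat, psi_hat)
```

With a shared beta, the subject-level shifts are no longer exchangeable with the within-subject fit, so psi is recoverable here; with subject-specific beta_k, stage 1 would absorb the shifts entirely, which is the non-identifiability described above.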
https://stats.stackexchange.com/q/3255411How do I interpret a betareg coefficient of 6.6970 for a categorical variable with only two categories, given that the response is a proportion?user193206https://stats.stackexchange.com/users/1932062018-01-29T00:24:01Z2020-04-05T23:01:15Z
<p>I cannot seem to find an exact answer to my question online. I used the betareg package in R to run a GLM with a response variable that is a proportion, so it is between 0 and 1. One of my predictor variables is categorical, with two categories. The other two predictors are continuous, with an interaction included between the categorical and one of the continuous predictors. (I also included an offset to account for variable effort and area.) Attached at the bottom is an image of the R output for my model.</p>
<p>The coefficient for the categorical predictor "SIDE" is 6.6970. I think this should mean that switching from the reference category (disturbed) to the other category (undisturbed) with log.prey and log.rainfall held constant causes an increase of e^6.6970 for each unit increase in the proportion response variable. However, the proportion is never going to increase by 1 because it is a proportion. What is the unit increase in this case? Also, e^6.6970 is 809.97. I have no idea how to interpret that in terms of percent change. Can anyone help me out?</p>
<p>THANKS!</p>
<p><a href="https://i.stack.imgur.com/mEPP2.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/mEPP2.png" alt="betareg output"></a></p>
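To make the link-scale arithmetic concrete (a sketch; betareg's default link is the logit, and only the 6.6970 coefficient is taken from the output above; the baseline value eta0 below is hypothetical, and this ignores the interaction term, so it illustrates the mechanics rather than fully interpreting this model):

```python
import math

def inv_logit(eta):
    """Map a logit-scale linear predictor back to a mean proportion in (0, 1)."""
    return 1 / (1 + math.exp(-eta))

coef_side = 6.6970

# exp(coef) ~ 810 is an odds ratio for the mean proportion,
# not an additive change in the proportion itself.
odds_ratio = math.exp(coef_side)

# Example: how the fitted mean proportion shifts at some hypothetical
# baseline linear predictor eta0
eta0 = -4.0
mu_disturbed = inv_logit(eta0)
mu_undisturbed = inv_logit(eta0 + coef_side)
print(round(odds_ratio, 2), round(mu_disturbed, 4), round(mu_undisturbed, 4))
```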
https://stats.stackexchange.com/q/2829210fixed effects model with upper level predictors and cross-sectional dataeborbathhttps://stats.stackexchange.com/users/1525442017-06-01T08:06:45Z2020-04-05T22:01:05Z
<p>I am using cross-sectional data with the following OLS model:</p>
<p>$$
Y_{(i,j)} = \beta_{(0)} + \beta X_{(i)} + \beta X_{(i,j)} + \beta fixed\; effects_{(j-1)} + \varepsilon_{i,j}
$$</p>
<p>where $i$ stands for individuals and $j$ stands for groups.</p>
<p>In my application, I am trying to predict survey respondent's satisfaction with democracy. I have the hypothesis that satisfaction with democracy is a function of a number of individual level predictors like age, education, etc ($\beta X_{(i)}$) but also of country level characteristics, like the quality of democracy, economic inequality etc. ($\beta X_{(i,j)}$). </p>
<p>Initially I wanted to run hierarchical models. But unfortunately the survey was only conducted in 7 countries. Therefore I settled for country fixed effects, with standard errors clustered by country.</p>
<p>My question is: what happens with the upper level coefficients in my model? I thought they reflect a precise estimate and the country fixed effects pick up any remaining upper level variance, not explained by $\beta X_{(i,j)}$. However, I was not sure, and I would appreciate if someone can help interpret them.</p>
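One mechanical point worth checking numerically (a sketch with fabricated data; whether and how this surfaces in practice depends on the estimation software): a country-level covariate is an exact linear combination of the country dummies plus the intercept, so the combined design matrix is rank deficient.

```python
import numpy as np

rng = np.random.default_rng(1)
countries = np.repeat(np.arange(7), 100)   # 7 countries, 100 respondents each
country_x = rng.normal(size=7)             # a country-level covariate
x_j = country_x[countries]                 # expanded to the individual level

# Design: intercept, 6 country dummies (reference coding), country-level covariate
dummies = (countries[:, None] == np.arange(1, 7)).astype(float)
design = np.column_stack([np.ones(700), dummies, x_j])

# The country-level column lies in the span of intercept + dummies,
# so the matrix has rank 7 despite having 8 columns.
print(np.linalg.matrix_rank(design), design.shape[1])
```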
https://stats.stackexchange.com/q/216596Train a SVM-based classifier while taking into account the weight informationuser3125https://stats.stackexchange.com/users/31252012-01-25T03:32:52Z2020-04-05T21:20:59Z
<p>Currently I have a data set whose points are known to belong to two classes, and I would like to build a classifier using an SVM. However, the labels come with different confidence levels. For example, for some data points we are very certain that they belong to class 1, while for others we think they should belong to class 1 but are not so certain. We can quantify this kind of confidence information as a weight. But how can we incorporate this weight information into the training process? Does LibSVM support this type of training?</p>
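For reference, per-instance weights of this kind are supported in some SVM implementations; as one illustration (a sketch on fabricated data using scikit-learn, whose SVC wraps LibSVM and accepts per-sample weights; whether the stock LibSVM command-line tools do is the actual question here):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two noisy classes in 2D
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Confidence in each label, e.g. 1.0 = certain, 0.2 = doubtful
confidence = rng.uniform(0.2, 1.0, size=100)

clf = SVC(kernel="rbf", C=1.0)
# Per-sample weights scale each point's slack penalty (C_i = C * weight_i),
# so low-confidence points influence the decision boundary less.
clf.fit(X, y, sample_weight=confidence)

print(clf.score(X, y))
```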