Questions tagged [data-transformation]

Mathematical re-expression, often nonlinear, of data values. Data are often transformed either to meet the assumptions of a statistical model or to make the results of an analysis more interpretable.

0
votes
0answers
6 views

Is there a guide to decide which transformation to choose for different scenarios/types of data and distribution?

1) How do I decide which transformation or scaling to use before passing our data into a machine learning model? Can someone please guide me on which transformation to use in different situations? There ...
0
votes
0answers
5 views

What is the best way to convert a graded scale (A to G) to a numeric scale to be used in a composite index?

I'm creating a composite index and one of my indicators ranks countries in terms of grades (A, B, C, D, E, F, G). The grades come from a purely qualitative (but thorough) analysis which does not ...
0
votes
0answers
12 views

Should I impute the missing values of timeseries data?

I have the following task - predicting the next 12 hours of PM10 particles based on historical data of previous 24 hours of PM10, O3 (ozone), CO (carbon monoxide), and others (not included) using RNN'...
0
votes
0answers
7 views

Transforming mean-absolutes to mean differences for meta-analysis

For a meta-analysis project, I have been tasked with entering data in the Review Manager software. However, I often come across papers that report on average pain or ...
2
votes
1answer
12 views

Will change in standard deviation impact covariance?

If we increase the standard deviation of one variable, does it affect the covariance of two variables? For example, there are two variables, A & B; the covariance of A & B is 100, and the ...
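A small numeric sketch of the covariance question, assuming the standard deviation of A is increased by rescaling A by a constant factor (the data and the factor `c` are made up for illustration): the covariance then scales by the same factor, because Cov(cA, B) = c Cov(A, B).

```python
from statistics import mean

def covariance(xs, ys):
    """Sample covariance with an (n - 1) denominator."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical data, not taken from the question.
a = [1.0, 2.0, 4.0, 7.0]
b = [3.0, 5.0, 4.0, 10.0]

c = 3.0  # assumed scale factor: multiplies the standard deviation of a by 3
a_scaled = [c * x for x in a]

cov_ab = covariance(a, b)
cov_scaled = covariance(a_scaled, b)

# The covariance grows by the same factor c; the correlation is unchanged.
print(cov_ab, cov_scaled)
```

If the standard deviation changes in some other way (for instance by adding independent noise to A), the covariance need not change at all, so the answer depends on how the spread is increased.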
0
votes
0answers
20 views

Multiple Regression Analysis Beginner

Background: I am using an instrument that measures two physical properties, X1~Temperature and X2~Velocity. When gathering the data to make the curve, a set of predetermined concentrations are chosen ...
1
vote
0answers
13 views

Multiple Regression Calibration Curve

Background: I am using an instrument that measures two physical properties, X1~Temperature and X2~Velocity. When gathering the data to make the curve, a set of predetermined concentrations are chosen ...
2
votes
0answers
28 views

Where does the Box-Cox Transformation actually come from?

I'm trying to figure out where the actual Box-Cox transformation comes from. I've looked at the original paper, and some of its references, but for the most part, it seems that they just drop the ...
0
votes
1answer
18 views

About scaling of data in political science

Sometimes we will see a survey about social and political opinions where the author is trying to combine the polling results, fit them to a curve, and draw some conclusions. Let's say ...
2
votes
2answers
41 views

How to prove a multivariate r.v. does not follow the nonparanormal distribution?

Background: You may find the definition of the nonparanormal distribution in the 2nd paragraph on p. 2296 of this paper. In short, $(X_1, \ldots, X_p)$ is nonparanormal if there exists a set of ...
0
votes
0answers
5 views

Reason for transformation of b variable in Boston Housing dataset

In the Boston Housing dataset (see http://www.rdocumentation.org/packages/mlbench/versions/2.1-1/topics/BostonHousing for details), one of the variables is $b = 1000(B - 0.63)^2$ where $B$ is the ...
0
votes
0answers
4 views

Pivot table where I have two time-series mixed [closed]

I have a data frame where I have two codes a,b that are represented in time-series like this ...
0
votes
0answers
16 views

Regression: Is it bad practice to use log difference as approximation for % difference when changes are large?

I'm running a vector autoregression model with quarterly IPOs as one of the variables. Since the number of IPOs isn't stationary, I took the log first difference to make it stationary. However, I ...
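A quick numeric illustration of the approximation in question, using made-up values: the log difference equals log(1 + r), which is close to the percentage change r only when r is small.

```python
import math

def pct_change(old, new):
    """Ordinary relative change, e.g. 0.05 for a 5% increase."""
    return (new - old) / old

def log_diff(old, new):
    """Log first difference: log(new) - log(old) = log(1 + pct_change)."""
    return math.log(new) - math.log(old)

# Hypothetical counts, chosen only to contrast a small and a large change.
small = (pct_change(100, 105), log_diff(100, 105))
large = (pct_change(100, 200), log_diff(100, 200))

for r, d in (small, large):
    print(f"pct change = {r:.4f}, log difference = {d:.4f}, gap = {r - d:.4f}")
```

For a 5% rise the two measures nearly coincide, while for a doubling the log difference is about 0.69 rather than 1.0, so the log difference is better read as a log growth rate than as a percentage when changes are large.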
0
votes
0answers
29 views

How can I include extremely large outliers in analytics?

Like most of us stuck at home, I'm taking time to get back up to speed with machine learning with some pet projects and one of my projects includes trying to use machine learning to predict missing ...
0
votes
0answers
9 views

Regression interpretation after transformation of independent and dependent variable [duplicate]

How do I interpret the regression output (coefficients) when I have transformed one of the independent variables (log10) and have transformed the dependent variable (sqrt) as well?
0
votes
0answers
21 views

Transformation and linear regression

I'm running a multivariate regression to analyze the relationship between two variables, adjusted by other remarkable variables (based on previous data). My hypothesis is that their relationship is ...
0
votes
2answers
32 views

Do I use the mean vector from my training set to center my testing set when dimension reducing for classification?

Please let me know if this is the right place to ask this (or if any of my tags are wrong) or if I need to write this any differently. Do I use the mean vector from my training set to center my ...
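A minimal sketch of the usual convention for this situation, with made-up data and hypothetical helper names: the mean vector is estimated on the training set only and then reused, unchanged, to center the test set.

```python
def column_means(rows):
    """Mean of each column of a list-of-lists matrix."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def center(rows, means):
    """Subtract the given mean vector from every row."""
    return [[x - m for x, m in zip(row, means)] for row in rows]

# Hypothetical data: 3 training rows and 2 test rows with 2 features each.
train = [[1.0, 10.0], [3.0, 14.0], [5.0, 18.0]]
test = [[2.0, 12.0], [6.0, 20.0]]

train_mean = column_means(train)            # estimated on training data only
train_centered = center(train, train_mean)
test_centered = center(test, train_mean)    # reuse the training mean

print(train_mean, test_centered)
```

Reusing the training statistics keeps the test set out of the fitting step, so the centered test data need not have mean zero.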
1
vote
1answer
19 views

What to do when a value in the testing set is bigger than the max value used to min-max normalize the training set when building a histogram classifier

Please let me know what to do when a value in the testing set is bigger than the max value used to min-max normalize the training set when building a histogram classifier. Do I go back and change ...
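One way to picture the options, sketched with made-up numbers and a hypothetical `min_max_scale` helper: keep the training minimum and maximum fixed, and either let the out-of-range test value scale past 1 or clip it into [0, 1] so that it falls in the last histogram bin. Which choice is appropriate depends on the classifier; this is an illustration, not a recommendation.

```python
def min_max_scale(value, lo, hi, clip=False):
    """Scale value with the training range [lo, hi]; optionally clip to [0, 1]."""
    scaled = (value - lo) / (hi - lo)
    if clip:
        scaled = min(1.0, max(0.0, scaled))
    return scaled

# Hypothetical training values and one test value above the training max.
train = [2.0, 4.0, 6.0, 10.0]
lo, hi = min(train), max(train)
test_value = 12.0

unclipped = min_max_scale(test_value, lo, hi)           # exceeds 1
clipped = min_max_scale(test_value, lo, hi, clip=True)  # capped at 1

print(unclipped, clipped)
```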
0
votes
0answers
20 views

Multivariate regression - multiple regressions

Objective: To formulate (regress) x in terms of y and z, where Data set 1 $x = a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4 + c_1$ (linearly regressed; $R^2 = 0.70$) Data set 2 $y = b_1 y_1 + b_2 y_2 + b_3 ...
2
votes
1answer
29 views

Interpreting logistic regression coefficient of a ratio predictor

I'm fitting a logistic regression model in which my predictor of interest is a ratio of measurements in millimeters. Possible values for this ratio range from 0 to ~2.0, with typical values around 0.9-...
2
votes
1answer
22 views

Should the samples in the input data into an RNN always be temporally ordered?

From what I know, if the training set shape is [100, 500, 20], it represents 100 samples, each sample being 500 timeseries and each timeseries having 20 features. I'm wondering if I'm passing for ...
1
vote
1answer
19 views

Gini and Lift With Transformed Variable

With regard to Gini Index/Net Lift, if I build a model where the target value is transformed by something - say natural log for example - will the Gini and Lift calculated on the transformed variable ...
0
votes
1answer
54 views

How to fix heteroscedasticity (funnel shape)?

I am running an MLR in Python on a dataset with 2D feature vectors, X1 and X2, on a single response, Y. The data ends up being funnel-shaped, as below: X1 v Y, with the colors being X2. It was ...
0
votes
0answers
11 views

How to estimate the fluctuations in data?

The question is more about a method of extracting relevant "universal" information from multiple experimental data. Let's say, for every $\alpha$, we have a function of the form $$f_\alpha(t) = g_\...
0
votes
0answers
13 views

Interpreting inverse hyperbolic sine transformations with indicator independent variable in polynomial regression

I have a regression discontinuity design with the following specification: The specification is a polynomial regression with an indicator variable to capture the average treatment effect. What is the ...
0
votes
2answers
45 views

Feature extraction definition

I have difficulty understanding the concept of feature extraction since there are two main ways to describe it. The first one refers to mapping the raw data into a vector in R^d or the translation of ...
1
vote
0answers
10 views

rms/R: How to apply survSplit on two time-dependent covariates, one being a discrete covariate transformed with restricted cubic splines?

I am doing a survival analysis of time p$os.neck to death p$mors using a Cox Regression. Please, find my data sample ...
0
votes
0answers
7 views

Name for the opposite of Winsorizing?

For some regressions we find it useful to focus on extreme values, and so we discard middling dependent values (which we might call "noise") from data in order to find relationships that hold at data ...
1
vote
1answer
39 views

Transformation of residual plot of linear regression model

I have a linear model which is represented by the following plot, with a fitted line: And the residual plot is as follows: The distribution of the residuals is shown in the following graph: I see ...
2
votes
1answer
150 views

Help with log2 transformation of normalized data

I have a dataset that I normalize so that the average equals 1. If I then log2 transform the dataset, should the average of the log2 data equal 0? For example: 1, 1, 1. The average of the dataset is ...
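A quick numeric check of this question, with made-up values: by Jensen's inequality the mean of the log2 values is at most log2 of the mean, so for data normalized to an average of 1 it equals 0 only when every value is 1 and is negative otherwise.

```python
import math
from statistics import mean

def mean_log2(values):
    """Average of the log2-transformed values."""
    return mean([math.log2(v) for v in values])

# The constant example from the question: the mean is 1 and the mean log2 is 0.
constant = [1.0, 1.0, 1.0]

# Hypothetical data that also averages to 1 but is not constant.
spread = [0.5, 1.5]

print(mean(constant), mean_log2(constant))
print(mean(spread), mean_log2(spread))
```

The second dataset still averages to 1, yet its log2 values average to a negative number, so a nonzero mean after the log2 step does not by itself indicate a normalization error.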
0
votes
0answers
18 views

Understanding research paper (online PCA)

I was reading this paper - http://pdfs.semanticscholar.org/efc7/ba57ece148f9f311a7e49639b69f70878489.pdf and got really confused by algorithm 2. Basically the paper suggests that we can input some ...
0
votes
1answer
35 views

Which criterion should be used for transformations of dependent variables? [duplicate]

When transforming the dependent variable, I know R^2 and related criteria are not suitable for model selection. Which criterion should I use instead?
1
vote
2answers
49 views

Transformation of variables in non linear regression model

I'm trying to build a regression model based on these variables below: ...
0
votes
0answers
11 views

Spatial Sign Transformation not plotting circular shape as expected

I am attempting to execute the Spatial Sign transformation using Python. However, I found that there are not many libraries that implement this concept, thus I had to create a function from scratch based on ...
1
vote
1answer
51 views

log-linear modelling: transforming y variable

I am conducting a study on graphical log-linear modelling and my aim is to fit a log-linear model to data. I am using RStudio to carry out the analysis and I am using the glm function. When first ...
1
vote
1answer
23 views

Coding data for regression on unordered pairs

I want to fit a regression on "unordered paired data", but I'm uncertain on how to code it. What I mean by paired data is the following: The model looks like this: $$o_i \sim \text{Binom}(1,p)\\ f(...
2
votes
0answers
22 views

(G)LM prediction interval with heteroscedasticity

I am trying to get prediction intervals from some non-linear data which also exhibits heteroscedasticity. ...
0
votes
1answer
28 views

Modeling with seasonally adjusted series and BoxCox

I have a time series with daily data. This time series has frequency 7. Before I started modeling, I first made a seasonally adjusted series with STL decomposition (from the forecast package). So next step ...
3
votes
1answer
92 views

GLM on non-integer data

I'm looking for a recommendation on what GLM I could do with non-integer data. Brief background of what I am doing: I'm wanting to combine calculated herbivory rates with abundance data, to compare ...
2
votes
2answers
124 views

Sufficient Statistics - Relating the Intuition with the Mathematical Definition

I believe the heuristic definition of a Sufficient Statistic makes sense to me - when you take a sample in order to make an inference about the parameter related to the probability distribution, and ...
3
votes
2answers
86 views

Independence under linear transformations

What is the (largest) set of matrices $\mathcal C\subset \mathbb R^{m\times n}$ ($m\le n$) for which the following statement is true? Let $x_1,\ldots,x_n\in\mathbb R$ be independent random ...
0
votes
0answers
10 views

How to use target encoding : expanding mean on the test set

The expanding mean is a way to prevent overfitting when performing target encoding. But what I do not understand is how to use ...
0
votes
0answers
25 views

Right skewed distribution of a continuous variable with outliers: replace outliers with mode or median of that column?

When I replace my outliers with the median value of that column/feature, my mode for that column/feature also changes. Is that correct?
0
votes
0answers
23 views

Mean and variance preserving skewness 'spread'

This is essentially a request for references in case what I am describing is studied somewhere, to avoid trying to come up with the machinery myself. Heuristically, what I want to do is take some ...
0
votes
1answer
29 views

R - transpose dataframe from existing data frame and convert it to time-series [closed]

I'm beginning with R and I would like to transpose the following data frame into another dataframe with the column names being the company names and the vector values for each column (company names) ...
0
votes
1answer
28 views

How to adjust/normalize/standardize mean? [closed]

I am making a reviews/ratings section for a website, with ratings that range from 0-5 stars. I am not confident that the users of this system will all have the same idea of what these stars mean, so I'...
1
vote
1answer
25 views

Should I use log transformed pharmacokinetic data or use GLM gamma regression with log link?

I was taught that when we deal with data of a multiplicative nature, following the log-normal distribution, as in pharmacokinetic analyses, we should log the data first to enable classic parametric ...
0
votes
1answer
46 views

How to reduce kurtosis of data

I'm trying to reduce the kurtosis of my dataset and make it approximately Gaussian, with a common-sense uni-modal shape. The raw data looks like this: I first tried ...
0
votes
0answers
19 views

I need to normalize this distribution, but cannot identify it

I have this distribution that I need to normalize for comparison between sub-populations. I thought it might be lognormal, but the kurtosis of the log product is still very high. How do I go about ...
1
vote
0answers
16 views

Predictions on transformed series post intervention analysis

I have taken this logged data and performed an intervention analysis: ...
