# Questions tagged [data-transformation]

Mathematical re-expression, often nonlinear, of data values. Data are often transformed either to meet the assumptions of a statistical model or to make the results of an analysis more interpretable.

1,842 questions

**0**

votes

**0**answers

6 views

### Is there a guide to decide which transformation to choose for different scenarios/types of data and distribution?

1) How do I decide which transformation or scaling to use before passing our data into a machine learning model? Can someone please guide me on which transformation to use in different situations? There ...

**0**

votes

**0**answers

5 views

### What is the best way to convert a graded scale (A to G) to a numeric scale to be used in a composite index?

I'm creating a composite index and one of my indicators ranks countries in terms of grades (A, B, C, D, E, F, G). The grades come from a purely qualitative (but thorough) analysis which does not ...

**0**

votes

**0**answers

12 views

### Should I impute the missing values of timeseries data?

I have the following task - predicting the next 12 hours of PM10 particles based on historical data of previous 24 hours of PM10, O3 (ozone), CO (carbon monoxide), and others (not included) using RNN'...

**0**

votes

**0**answers

7 views

### Transforming mean-absolutes to mean differences for meta-analysis

For a meta-analysis project, I have been tasked with entering data into the Review Manager software.
However, often I come across papers that report on average pain or ...

**2**

votes

**1**answer

12 views

### Will change in standard deviation impact covariance?

If we increase the standard deviation of one variable, does it affect the covariance of the two variables?
For example, suppose there are two variables, A & B, the covariance of A & B is 100, and the ...
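
A quick numeric check can help with this kind of question. The sketch below assumes the standard deviation is increased by rescaling the variable by a constant factor `c` (other ways of increasing spread, such as adding independent noise, behave differently). In that case the covariance is multiplied by `c` while the correlation is unchanged. The data values and helper names are made up for illustration and are not from the question.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def std(xs):
    """Sample standard deviation, i.e. sqrt of the covariance of xs with itself."""
    return math.sqrt(covariance(xs, xs))

# Hypothetical data, not taken from the question.
a = [1.0, 2.0, 4.0, 7.0]
b = [2.0, 3.0, 5.0, 10.0]

c = 3.0                          # rescaling factor (assumption)
a_scaled = [c * x for x in a]    # std(a_scaled) == c * std(a)

cov_before = covariance(a, b)
cov_after = covariance(a_scaled, b)          # equals c * cov_before
corr_before = cov_before / (std(a) * std(b))
corr_after = cov_after / (std(a_scaled) * std(b))  # unchanged by rescaling

print(cov_before, cov_after, corr_before, corr_after)
```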

**0**

votes

**0**answers

20 views

### Multiple Regression Analysis Beginner

Background: I am using an instrument that measures two physical properties, X1~Temperature and X2~Velocity. When gathering the data to make the curve, a set of predetermined concentrations are chosen ...

**1**

vote

**0**answers

13 views

### Multiple Regression Calibration Curve

Background:
I am using an instrument that measures two physical properties, X1~Temperature and X2~Velocity. When gathering the data to make the curve, a set of predetermined concentrations are chosen ...

**2**

votes

**0**answers

28 views

### Where does the Box-Cox Transformation actually come from?

I'm trying to figure out where the actual Box-Cox transformation comes from. I've looked at the original paper, and some of its references, but for the most part, it seems that they just drop the ...

**0**

votes

**1**answer

18 views

### About scaling of data in political science

Sometimes we see a survey about social and political opinions in which the author tries to combine the polling results, fit them to a curve, and draw some conclusions. Let's say ...

**2**

votes

**2**answers

41 views

### How to prove a multivariate r.v. does not follow the nonparanormal distribution?

Background
You may find the definition of the nonparanormal distribution in the 2nd paragraph on p. 2296 of this paper.
In short, $(X_1, \ldots, X_p)$ is nonparanormal if there exists a set of ...

**0**

votes

**0**answers

5 views

### Reason for transformation of b variable in Boston Housing dataset

In the Boston Housing dataset (see http://www.rdocumentation.org/packages/mlbench/versions/2.1-1/topics/BostonHousing for details), one of the variables is
$b = 1000(B - 0.63)^2$ where $B$ is the ...

**0**

votes

**0**answers

4 views

### Pivot table where I have two time-series mixed [closed]

I have a data frame where I have two codes a,b that are represented in time-series like this
...

**0**

votes

**0**answers

16 views

### Regression: Is it bad practice to use log difference as approximation for % difference when changes are large?

I'm running a vector autoregression model with quarterly IPOs as one of the variables. Since the number of IPOs isn't stationary, I took the log first difference to make it stationary. However, I ...
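
As a rough illustration of the approximation the title asks about, the sketch below compares the log difference with the exact proportional change for a few made-up pairs of values (not IPO data from the question). The two agree closely for small changes and diverge as changes grow: a doubling is a 100% change but a log difference of about 0.69. The helper names are illustrative only.

```python
import math

def pct_change(old, new):
    """Exact proportional change, e.g. 0.10 for a 10% increase."""
    return (new - old) / old

def log_diff(old, new):
    """Log first difference: log(new) - log(old)."""
    return math.log(new) - math.log(old)

def log_diff_to_pct(d):
    """Convert a log difference back to the exact proportional change."""
    return math.exp(d) - 1.0

# Hypothetical value pairs with increasingly large changes.
pairs = [(100, 101), (100, 110), (100, 150), (100, 200)]
for old, new in pairs:
    p = pct_change(old, new)
    d = log_diff(old, new)
    print(f"{old} -> {new}: pct={p:.4f}  log_diff={d:.4f}  gap={p - d:.4f}")
```

The conversion `exp(d) - 1` recovers the exact proportional change from a log difference, so the log scale can be kept for modeling and translated back only for reporting.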

**0**

votes

**0**answers

29 views

### How can I include extremely large outliers in analytics?

Like most of us stuck at home, I'm taking time to get back up to speed with machine learning with some pet projects and one of my projects includes trying to use machine learning to predict missing ...

**0**

votes

**0**answers

9 views

### Regression interpretation after transformation of independent and dependent variable [duplicate]

How do I interpret the regression output (coefficients), when I have transformed one of the independent variables (lg10) and have transformed the dependent variable (sqrt) as well?

**0**

votes

**0**answers

21 views

### Transformation and linear regression

I'm running a multivariate regression to analyze the relationship between two variables, adjusted by other remarkable variables (based on previous data). My hypothesis is that their relationship is ...

**0**

votes

**2**answers

32 views

### Do I use the mean vector from my training set to center my testing set when dimension reducing for classification?

Please let me know if this is the right place to ask this (or if any of my tags are wrong) or if I need to write this any differently.
Do I use the mean vector from my training set to center my ...
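
For readers with the same question, here is a minimal sketch of the commonly recommended practice: estimate the centering vector on the training set only and reuse it unchanged for the test set, so that no test information leaks into preprocessing. The data and helper names are hypothetical, and the sketch stops at centering (no dimension reduction is performed).

```python
def column_means(rows):
    """Per-column means of a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def center(rows, means):
    """Subtract the given mean vector from every row."""
    return [[x - m for x, m in zip(row, means)] for row in rows]

# Hypothetical data: two features per observation.
train = [[1.0, 10.0], [3.0, 14.0], [5.0, 18.0]]
test = [[2.0, 12.0], [6.0, 20.0]]

train_means = column_means(train)           # estimated on training data only
train_centered = center(train, train_means)
test_centered = center(test, train_means)   # reuse the training means here

print(train_means)
print(test_centered)
```

The centered test set will generally not have mean zero, which is expected under this approach.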

**1**

vote

**1**answer

19 views

### What to do when a value in the testing set is bigger than the max value used to min-max normalize the training set when building a histogram classifier

Please let me know what to do when a value in the testing set is bigger than the max value used to min-max normalize the training set when building a histogram classifier.
Do I go back and change ...
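
One possible way to handle this situation, sketched under assumptions: keep the minimum and maximum learned from the training set, and either let out-of-range test values scale beyond 1 or clip them into [0, 1] so they fall into the outermost histogram bin. Whether clipping is appropriate depends on the classifier; the function names and data below are hypothetical.

```python
def fit_min_max(values):
    """Learn the scaling range from training values only."""
    return min(values), max(values)

def min_max_scale(x, lo, hi, clip=False):
    """Scale x using the training range; optionally clip into [0, 1]."""
    z = (x - lo) / (hi - lo)
    if clip:
        z = min(1.0, max(0.0, z))
    return z

# Hypothetical training values and one out-of-range test value.
train = [2.0, 4.0, 10.0]
lo, hi = fit_min_max(train)

x_new = 12.0
unclipped = min_max_scale(x_new, lo, hi)           # exceeds 1
clipped = min_max_scale(x_new, lo, hi, clip=True)  # forced to the top of the range

print(unclipped, clipped)
```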

**0**

votes

**0**answers

20 views

### Multivariate regression - multiple regressions

Objective: To formulate (regress) x in terms of y and z, where
Data set 1
$x = a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4 + c_1$ (linearly regressed; $R^2 = 0.70$)
Data set 2
$y = b_1 y_1 + b_2 y_2 + b_3 ...

**2**

votes

**1**answer

29 views

### Interpreting logistic regression coefficient of a ratio predictor

I'm fitting a logistic regression model in which my predictor of interest is a ratio of measurements in millimeters. Possible values for this ratio range from 0 to ~2.0, with typical values around 0.9-...

**2**

votes

**1**answer

22 views

### Should the samples in the input data into an RNN always be temporally ordered?

From what I know, if the training set shape is [100, 500, 20], it represents 100 samples, each sample being 500 timeseries and each timeseries having 20 features. I'm wondering if I'm passing for ...

**1**

vote

**1**answer

19 views

### Gini and Lift With Transformed Variable

With regard to the Gini index/net lift, if I build a model where the target value is transformed by something, say the natural log, will the Gini and Lift calculated on the transformed variable ...

**0**

votes

**1**answer

54 views

### How to fix heteroscedasticity (funnel shape)?

I am running an MLR in Python on a dataset with 2D feature vectors, X1 and X2, on a single response, Y. The data ends up being funnel-shaped, as below:
X1 v Y, with the colors being X2.
It was ...

**0**

votes

**0**answers

11 views

### How to estimate the fluctuations in data?

The question is more about a method of extracting relevant "universal" information from multiple experimental data.
Let's say, for every $\alpha$, we have a function of the form
$$f_\alpha(t) = g_\...

**0**

votes

**0**answers

13 views

### Interpreting inverse hyperbolic sine transformations with indicator independent variable in polynomial regression

I have a regression discontinuity design with the following specification:
The specification is a polynomial regression with an indicator variable to capture the average treatment effect. What is the ...

**0**

votes

**2**answers

45 views

### Feature extraction definition

I have difficulty understanding the concept of feature extraction since there are two main ways to describe it.
The first one refers to mapping the raw data into a vector in R^d or the translation of ...

**1**

vote

**0**answers

10 views

### rms/R: How to apply survSplit on two time-dependent covariates, one being a discrete covariate transformed with restricted cubic splines?

I am doing a survival analysis of time p$os.neck to death p$mors using a Cox Regression.
Please, find my data sample ...

**0**

votes

**0**answers

7 views

### Name for the opposite of Winsorizing?

For some regressions we find it useful to focus on extreme values, and so we discard middling dependent values (which we might call "noise") from data in order to find relationships that hold at data ...

**1**

vote

**1**answer

39 views

### Transformation of residual plot of linear regression model

I have a linear model which is represented by the following plot, with a fitted line:
And the residual plot is as follows:
The distribution of the residuals is shown in the following graph:
I see ...

**2**

votes

**1**answer

150 views

### Help with log2 transformation of normalized data

I have a dataset that I normalize so that the average equals 1. If I then log2 transform the dataset, should the average of the log2 data equal 0?
For example:
1, 1, 1. The average of the dataset is ...
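
A small numeric check, using made-up values rather than the asker's data: normalizing by the arithmetic mean gives an average of 1, but the mean of the log2 values is generally below 0 unless all values are equal (a consequence of Jensen's inequality). Normalizing by the geometric mean instead gives log2 values that average to 0. Variable names are illustrative.

```python
import math

# Hypothetical raw values.
raw = [2.0, 4.0, 6.0]

# Normalize so the arithmetic mean equals 1.
arith_mean = sum(raw) / len(raw)
normalized = [x / arith_mean for x in raw]
mean_normalized = sum(normalized) / len(normalized)

# The mean of the log2 values is not log2 of the mean.
log2_values = [math.log2(x) for x in normalized]
mean_log2 = sum(log2_values) / len(log2_values)   # negative for unequal values

# Alternative: normalize by the geometric mean, so log2 values average to 0.
geo_mean = 2 ** (sum(math.log2(x) for x in raw) / len(raw))
geo_normalized = [x / geo_mean for x in raw]
mean_log2_geo = sum(math.log2(x) for x in geo_normalized) / len(raw)

print(mean_normalized, mean_log2, mean_log2_geo)
```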

**0**

votes

**0**answers

18 views

### Understanding research paper (online PCA)

I was reading this paper - http://pdfs.semanticscholar.org/efc7/ba57ece148f9f311a7e49639b69f70878489.pdf and got really confused by algorithm 2.
Basically the paper suggests that we can input some ...

**0**

votes

**1**answer

35 views

### Which criterion should be used for transformations of dependent variables? [duplicate]

When transforming the dependent variables, I know $R^2$ and related criteria are not suitable for model selection. Which one should I use instead?

**1**

vote

**2**answers

49 views

### Transformation of variables in non linear regression model

I'm trying to build a regression model based on these variables below:
...

**0**

votes

**0**answers

11 views

### Spatial Sign Transformation not plotting circular shape as expected

I am attempting to execute the Spatial Sign transformation using Python. However, I also found that there are not many libraries that implement this concept, thus I had to create a function from scratch based on ...
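
For reference, a minimal from-scratch sketch of the spatial sign transformation as it is usually defined: each observation is divided by its Euclidean norm, so every nonzero point lands on the unit circle (or sphere) around the origin. This is not the asker's function; the name `spatial_sign` and the sample points are hypothetical. Because the projection is around the origin, the predictors are typically centered and scaled beforehand.

```python
import math

def spatial_sign(rows):
    """Project each row onto the unit sphere; all-zero rows are left unchanged."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        if norm > 0:
            out.append([v / norm for v in row])
        else:
            out.append(list(row))  # the direction of a zero vector is undefined
    return out

# Hypothetical 2D points, assumed already centered and scaled.
points = [[3.0, 4.0], [-1.0, 0.0], [0.0, 0.0], [5.0, -12.0]]
projected = spatial_sign(points)

for p in projected:
    print(p, math.hypot(*p))
```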

**1**

vote

**1**answer

51 views

### log-linear modelling: transforming y variable

I am conducting a study on graphical log-linear modelling and my aim is to fit a log-linear model to data.
I am using RStudio to carry out the analysis and I am using the glm function.
When first ...

**1**

vote

**1**answer

23 views

### Coding data for regression on unordered pairs

I want to fit a regression on "unordered paired data", but I'm uncertain on how to code it. What I mean by paired data is the following:
The model looks like this:
$$o_i \sim \text{Binom}(1,p)\\
f(...

**2**

votes

**0**answers

22 views

### (G)LM prediction interval with heteroscedasticity

I am trying to get prediction intervals from some non-linear data which also exhibits heteroscedasticity.
...

**0**

votes

**1**answer

28 views

### Modeling with seasonally adjusted series and BoxCox

I have a time series with daily data. This time series has frequency 7. Before modeling, I first made a seasonally adjusted series with STL decomposition (from the forecast package).
So next step ...

**3**

votes

**1**answer

92 views

### GLM on non-integer data

I'm looking for a recommendation on what GLM I could do with non-integer data.
Brief background of what I am doing:
I'm wanting to combine calculated herbivory rates with abundance data, to compare ...

**2**

votes

**2**answers

124 views

### Sufficient Statistics - Relating the Intuition with the Mathematical Definition

I believe the heuristic definition of a Sufficient Statistic makes sense to me - when you take a sample in order to make an inference about the parameter related to the probability distribution, and ...

**3**

votes

**2**answers

86 views

### Independence under linear transformations

What is the (largest) set of matrices $\mathcal C\subset \mathbb R^{m\times n}$ ($m\le n$) for which the following statement is true?
Let $x_1,\ldots,x_n\in\mathbb R$ be independent random ...

**0**

votes

**0**answers

10 views

### How to use target encoding : expanding mean on the test set

The expanding mean is a way to prevent overfitting when performing target encoding. But what I do not understand is how to use ...
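
A sketch of one common convention, offered as an assumption rather than a definitive answer: on the training set, each row is encoded with the target mean of earlier rows in the same category, and on the test set, each row is encoded with the per-category mean over the full training set, falling back to a global prior for unseen categories. The function name, data, and choice of prior are hypothetical.

```python
from collections import defaultdict

def expanding_mean_encode(categories, targets, prior):
    """Encode each row with the mean target of earlier rows in its category.

    Returns the encoded training column and the full per-category means,
    which can then be applied to the test set.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    encoded = []
    for cat, y in zip(categories, targets):
        # Use only rows seen so far; fall back to the prior for a first occurrence.
        encoded.append(sums[cat] / counts[cat] if counts[cat] else prior)
        sums[cat] += y
        counts[cat] += 1
    full_means = {cat: sums[cat] / counts[cat] for cat in counts}
    return encoded, full_means

# Hypothetical training data, in the order used for the expanding mean.
train_cats = ["a", "b", "a", "a", "b"]
train_y = [1, 0, 0, 1, 1]
prior = sum(train_y) / len(train_y)  # global training mean as the fallback

train_encoded, full_means = expanding_mean_encode(train_cats, train_y, prior)

# Test rows use statistics from the whole training set; "c" is unseen.
test_cats = ["a", "c"]
test_encoded = [full_means.get(cat, prior) for cat in test_cats]

print(train_encoded, test_encoded)
```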

**0**

votes

**0**answers

25 views

### Right skewed distribution of a continuous variable with outliers: replace outliers with mode or median of that column?

When I replace my outliers with the median value of that column/feature, my mode for that column/feature also changes. Is that correct?

**0**

votes

**0**answers

23 views

### Mean and variance preserving skewness 'spread'

This is essentially a request for references in case what I am describing is studied somewhere, to avoid trying to come up with the machinery myself.
Heuristically, what I want to do is take some ...

**0**

votes

**1**answer

29 views

### R - transpose dataframe from existing data frame and convert it to time-series [closed]

I'm beginning with R and I would like to transpose the following data frame
into another dataframe with the column names being the company names and the vector values for each column (company names) ...

**0**

votes

**1**answer

28 views

### How to adjust/normalize/standardize mean? [closed]

I am making a reviews/ratings section for a website, with ratings that range from 0-5 stars. I am not confident that the users of this system will all have the same idea of what these stars mean, so I'...

**1**

vote

**1**answer

25 views

### Should I use log transformed pharmacokinetic data or use GLM gamma regression with log link?

I was taught that when we deal with data of multiplicative nature, following the log-normal distribution, like in pharmacokinetic analyses, we should log the data first to enable classic parametric ...

**0**

votes

**1**answer

46 views

### How to reduce kurtosis of data

I'm trying to reduce the kurtosis of my dataset and make it approximately Gaussian, with a common-sense uni-modal shape. The raw data looks like this:
I first tried ...

**0**

votes

**0**answers

19 views

### I need to normalize this distribution, but cannot identify it

I have this distribution that I need to normalize for comparison between sub-populations. I thought it might be lognormal, but the kurtosis of the log product is still very high.
How do I go about ...

**1**

vote

**0**answers

16 views

### Predictions on transformed series post intervention analysis

I have taken this logged data and performed an intervention analysis:
...