Questions tagged [feature-engineering]

Feature engineering is the process of using domain knowledge of the data to create features for machine learning models. This tag is meant for both theoretical and practical questions regarding feature engineering, excluding questions that ask for code, which are off-topic on Cross Validated.

1
vote
1answer
34 views

Expected Counts in Chi-Squared Goodness-of-Fit Tests of Normality

I have a variable with 200 values that I would like to test for normality using the Chi-square Goodness of Fit test. To do this, I have to calculate, for each value, the expected value in a normal ...
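As a rough sketch of the calculation this question is about (not the asker's code): in a chi-squared goodness-of-fit test, expected counts are usually computed per bin rather than per value, as the sample size times the normal probability of each bin. The function name, the bin edges, and the mean and standard deviation below are illustrative assumptions.

```python
from statistics import NormalDist

def expected_normal_counts(n, mean, sd, edges):
    """Expected counts under a normal distribution for bins cut at `edges`.

    The first and last bins are open-ended, so the counts sum to n.
    """
    dist = NormalDist(mu=mean, sigma=sd)
    cdf = [dist.cdf(e) for e in edges]
    # Probability mass between consecutive interior edges, scaled by n.
    inner = [n * (hi - lo) for lo, hi in zip(cdf, cdf[1:])]
    return [n * cdf[0]] + inner + [n * (1.0 - cdf[-1])]

# Hypothetical example: 200 values, bins cut at -1, 0 and 1 on a standard normal.
counts = expected_normal_counts(200, 0.0, 1.0, [-1.0, 0.0, 1.0])
```

In practice the mean and standard deviation would be estimated from the sample, which reduces the degrees of freedom of the test.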
0
votes
0answers
7 views

Priority between feature engineering and normalisation

My question is related to the priority between feature engineering (for example a simple transformation) and normalisation. It is a general question and I am not sure I understand all the ...
0
votes
0answers
6 views

Feature extraction for exponentially damped signals

I am looking into exponentially damped signals that are stationary (according to the Adfuller statistical test), and I would like to look into how I can extract meaningful features ...
0
votes
0answers
14 views

How to approach the Feature Extraction and Feature Selection part in machine learning in Python?

I am a bit new to machine learning and I have the following questions: Question 1: When dealing with feature extraction with signals from sensors, what is the typical approach to extract features ...
0
votes
0answers
6 views

TSFRESH - features extracted by a symmetric sliding window [closed]

As raw data we have measurements m_{i,j}, measured every 30 seconds (i=0, 30, 60, 90,...720,..) for every subject ...
0
votes
0answers
6 views

Is there a guide to decide which transformation to choose for different scenarios/types of data and distribution?

1) How do I decide which transformation or scaling to use before passing our data into a machine learning model? Can someone please guide me on which transformation to use in different situations? There ...
0
votes
0answers
10 views

Handling count features in a classification model

I'm working on training a binary classification model. In my data I have 29 numerical features, continuous and discrete, apart from the target. Discrete features are all count features. I know that ...
0
votes
0answers
13 views

Should I impute the missing values of timeseries data?

I have the following task - predicting the next 12 hours of PM10 particles based on historical data of previous 24 hours of PM10, O3 (ozone), CO (carbon monoxide), and others (not included) using RNN'...
0
votes
1answer
13 views

How do you interpret your features when you standardize your data?

Let's say I have built a boosting tree or neural network and I standardized my features beforehand. When I built my model, I split my data into training, validation, and test sets - each with their ...
0
votes
0answers
31 views

How do you code missing values if 0 is meaningful?

Per this deep learning book I am reading: In general, with neural networks, it’s safe to input missing values as 0, with the condition that 0 isn’t already a meaningful value. The network will ...
2
votes
1answer
29 views

Handling zeros in features of a binary classification problem

I'm working on training a binary classification model. In my data I have 29 numerical features, continuous and discrete, apart from the target which is categorical. I have 29 features, 8 of them have ...
1
vote
0answers
22 views

How to deal with features that outweigh others in a regression?

I have been facing a problem that has been taking quite a while to overcome. In my problem I have basically 3 input features in my model and one single output. I have been using GP to fit my model to data,...
0
votes
0answers
8 views

Implementing Scikit Learn's FeatureHasher for High Cardinality Categorical Data

Background: I am working on a binary classification of health insurance claims. The data I am working with has approximately 1 million rows and a mix of numeric features and categorical features (all ...
0
votes
0answers
9 views

Is there a good score function for finding stationary-covariance features from time series via variational inference?

There are various ad-hoc methods for picking differencing orders or fractional difference orders of time series. I am looking for sound scoring functions and discussions that target automatic stationary ...
0
votes
0answers
8 views

How to select the best features for Support Vector Classifier in sklearn

I have a range of different technical analysis indicators as a feature set for my SVM. I would like to think some indicators are better than others at predicting and that there must be some sort of ...
2
votes
2answers
27 views

Is constructing the target variable manually a form of data leakage?

Let's say I have a data table with numerical features A, B, C. I do not have the target variable, but I extract the target variable Y from the features A, B, and C, like so: ...
0
votes
0answers
19 views

How to find features for my model?

I am a newbie in all Machine Learning stuff. I am trying to build a model for detecting fake news articles; as a starting point, I am just trying to build a simple model using known classifiers (...
1
vote
0answers
39 views

Standardization vs dividing uncentered data by Standard Deviation

I am working with a dataset that involves a collection of one-hot encoded, ordinal, and numerical features. I am using a LASSO model. As the difference in scales can influence the estimates, I am ...
1
vote
1answer
15 views

Feature generation for anomaly detection

I have Room Temperature data (T1) and Outside Temperature data (T2) for various houses that have an HVAC system installed. I am building a system which detects faulty HVAC (heating, ...
0
votes
2answers
45 views

Feature extraction definition

I have difficulty understanding the concept of feature extraction since there are two main ways to describe it. The first one refers to mapping the raw data into a vector in R^d or the translation of ...
0
votes
0answers
15 views

Why is it said that neural networks can extract implicit feature combinations?

I just couldn't understand why it is said that neural network layers can extract implicit feature interactions in the DeepFM model. What does the keyword, implicit feature interactions, exactly mean here? And ...
0
votes
0answers
10 views

new features selection

I am in a project in which I have a specific description of a certain binary profile for which I have about 200 positive examples and another 200 negative. This description is given from about 60 ...
0
votes
0answers
11 views

Should I reduce data points for a feature if there are many inputs?

I have an assignment for college based on the MediaEval Competition, to predict video memorability. We have 8000 videos, each with a score in terms of its memorability. As well as this, we have ...
6
votes
1answer
76 views

Feature Engineering: combine a categorical feature and a continuous feature

When we analyze data, we can observe several variables that may contain mutual information. For example, there can be a binary variable such as Y = "Have you ever smoked?" And then there will be a ...
0
votes
1answer
41 views

Dealing with over 1000 categorical values (which are also unique identifiers)

I am preparing my dataset for a logistic regression and need to check how best to handle a column with categorical values. As the dataset is for sales transactions, the column in question is the ...
0
votes
0answers
9 views

Feature Selection by individual AUC

I am creating a model for classification and I have several ways to get a subset of features, but I was wondering if the following is reasonable: Use the train set to calculate LOOCV or LPOCV AUC values ...
0
votes
1answer
24 views

How to set feature engineering for day of a week?

Apologies if this is a very basic question. I'm currently learning data science and was hoping for help validating what I'm trying to do. So I have a model set up to predict event duration by ...
0
votes
1answer
12 views

Best way to encode information for input to neural network?

If I have a 20x20 grid of cells, where each cell can take on one of four values (Red, Green, Black, Blue), what is the best way to encode this information? My first guess would be that one-hot encoding is ...
0
votes
0answers
21 views

Increased computation time for training and prediction with reduced feature space?

I implemented a PCA algorithm to reduce the input feature space of my neural network from 230 to 110 features. My naive expectation was that if I train a neural network using the same hyper ...
2
votes
1answer
30 views

Performance drops when adding a feature using XGBoost

I did some feature engineering with my data set. When I added one of the new features, the performance significantly dropped. How is this possible? I thought XGBoost is robust to irrelevant variables.
0
votes
0answers
15 views

Feature expansion (multiplication) - What to do with higher correlations?

If I have a set of features {x1,x2,x3} and I expand the feature set by multiplying all the features to have the following: {x1,x2,x3,x1*x2,x2*x3,x1*x3}. Now, I find that two of my features {x1*x2 and ...
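The expansion described in this excerpt can be sketched as follows. This is a minimal illustration, not the asker's code; the function name is hypothetical, and the products come out in the order x1*x2, x1*x3, x2*x3 rather than the order written in the question.

```python
from itertools import combinations

def expand_with_products(row):
    """Return the original features followed by all pairwise products."""
    products = [a * b for a, b in combinations(row, 2)]
    return list(row) + products

# Hypothetical single observation with x1=2, x2=3, x3=5.
expanded = expand_with_products([2.0, 3.0, 5.0])
```

Applying this to every row gives the six-column feature set, after which correlations between the product columns can be inspected.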
1
vote
1answer
24 views

Encoding cyclical feature minutes and hours

I'm working with time-series data to train a binary classification model that predicts if an event is going to happen or not in the future. The likelihood of the event depends on the specific time ...
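One common way to encode a cyclical time feature like the one in this question is to map it onto a circle with a sine and a cosine component, so that 23:59 and 00:00 end up close together. A minimal sketch, assuming the hour and minute are available as integers; the function names are hypothetical.

```python
import math

def encode_cyclical(value, period):
    """Map a value in [0, period) to a (sin, cos) point on the unit circle."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def encode_time_of_day(hour, minute):
    """Encode a clock time as one position on a 24-hour circle."""
    minutes_since_midnight = hour * 60 + minute
    return encode_cyclical(minutes_since_midnight, 24 * 60)

midnight = encode_time_of_day(0, 0)
late = encode_time_of_day(23, 59)
```

Hours and minutes could also be encoded as two separate circles; a single time-of-day circle is used here only to keep the sketch short.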
0
votes
0answers
10 views

Time-dependent feature analysis

I have a linear relation between variable A and variable B. Variable A is an area under the curve, where the curve is a gaussian fitted to a time series evolution. Now, apart from the time series data ...
0
votes
0answers
10 views

How to use target encoding : expanding mean on the test set

The expanding mean is a way to prevent overfitting when performing target encoding. But what I do not understand is how to use ...
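For readers unfamiliar with the term, here is a hedged sketch of expanding-mean target encoding. Each training row is encoded with the mean target of the earlier rows in the same category, so a row never sees its own label. Encoding the test set with the per-category means over the whole training set is one common convention, assumed here rather than taken from the question. All names and the toy data are hypothetical.

```python
from collections import defaultdict

def expanding_mean_encode(categories, targets, prior):
    """Encode each row using only the targets of earlier rows in its category.

    Returns the encoded training values and the full per-category means.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    encoded = []
    for cat, y in zip(categories, targets):
        # First occurrence of a category has no history, so fall back to the prior.
        encoded.append(sums[cat] / counts[cat] if counts[cat] else prior)
        sums[cat] += y
        counts[cat] += 1
    full_means = {cat: sums[cat] / counts[cat] for cat in counts}
    return encoded, full_means

# Toy training data (made up for illustration).
train_cats = ["a", "b", "a", "a", "b"]
train_y = [1, 0, 0, 1, 1]
# Simplification: the prior is the global training mean.
prior = sum(train_y) / len(train_y)

train_encoded, full_means = expanding_mean_encode(train_cats, train_y, prior)
# Test rows use the full training means; unseen categories get the prior.
test_encoded = [full_means.get(cat, prior) for cat in ["a", "b", "c"]]
```

The result depends on row order, so the training rows are often shuffled (or ordered by time) before the expanding mean is computed.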
0
votes
0answers
22 views

Modeling an electrical pulse which is technically time dependent

I have an electrical pulse that I need to fit a curve to a certain area of, but not the entire thing. The whole pulse looks like this. However, the only part that I need to model is this. My boss ...
1
vote
0answers
20 views

Feature engineering: including counter-parties of a transaction in a dataset

Background Say I have a dataset of transfers between bank accounts structured like so: ...
2
votes
2answers
243 views

Difference between Feature engineering and hyperparameter optimizations?

Hyperparameter optimization and feature engineering can (in my understanding) both be used to create a machine learning model. But what is the difference? And what is done to the y = wx + b formula in ...
0
votes
0answers
11 views

Extracting Features to Determine Periodicity

I have accelerometer time series data sampled at 30Hz from participants and am extracting features from each separate movement in the collection period per person to use for machine learning. I have ~...
0
votes
1answer
21 views

How to handle potential ambiguity when one-hot encoding?

Let's say I have two categorical features: Movie, Director. I one-hot encode both the Movie and Director features for use in a linear regression model. The problem is that two or more movies may be ...
1
vote
1answer
19 views

Does mislabeling due to adversarial noise in features count as adversarial machine learning?

According to the traditional definition, Adversarial machine learning is a technique employed in the field of machine learning which attempts to fool models through malicious input. However, I have ...
1
vote
1answer
24 views

Different scales of input features for stacking ensembles?

I have two models to predict future stock market behavior based on historical data: an ARIMA time series model and an LSTM model (including data from various other sources). ARIMA tries to model the daily ...
3
votes
1answer
59 views

How to construct a function with given local minima?

I need to construct a function $f(x,y)$ in which there are 3 minima: 2 local and 1 global as given below. Locals are: z = f(0.2,0.3) = 0.7 | z = f(0.6,0.8) = 0.8 Global is: z = f(0.85,0.5) = ...
1
vote
0answers
20 views

Can RFs find a product interaction between two independent variables?

I'm doing the FastAI course on ML, and the main topic that is currently being discussed is random forests. Jeremy Howard explains how random forests, unlike something such as logistic regression, can ...
0
votes
0answers
24 views

Combine TFIDF with non-textual features

I am dealing with an email classification problem in which I have email requests coming from different groups of people. I am building a classifier to classify these emails based on historical email ...
1
vote
0answers
32 views

Machine learning: use benchmark as a feature

The project I am doing is to predict surgery lengths. The benchmark I am trying to compare against is to take the average of the most recent 20 cases for the cases with the same ID. What I tested is to use this ...
0
votes
0answers
13 views

The most basic question about Feature Importance or Permutation_Importance

Consider the XOR gate with three inputs. The truth table will be: Now all the variables on their own are near random as far as the model is concerned. Each input 1 or 0 has a 50% chance of being ...
0
votes
0answers
17 views

Reduction in the number of observations by extracting piecemeal signal features, while keeping the number of features the same. Can it be called feature extraction?

I have a dataset generated from 9 sensors in an E-nose system for a binary class classification problem. The system provides a response for 240 seconds for each sample. i.e. I have a data set of 240 * ...
1
vote
1answer
73 views

“Deep learning removes the need for feature engineering”?

I have seen it written in several papers, and currently see it written in Deep Learning with Python by Francois Chollet, that "Deep learning removes the need for feature engineering". What does this ...
0
votes
0answers
20 views

Should I engineer features from coordinates (positional data)?

I am trying to do a regression on housing price (price/m^2). Apart from the lat and lng of the property, I also have city_code, district_id, street_id. I am thinking whether I should remove city_code,...
1
vote
0answers
11 views

Imputing null values for metrics used as features in ML model

I have a data set from a live application. Each row is a user interacting with the app. We are predicting a feature for which we currently have a deterministic solution. There is ample training ...
