
Feature Engineering

Estimated time to read: 5 minutes

Feature engineering transforms raw data into a structured format that machine learning algorithms can better understand and process. In other words, it involves creating and selecting the most relevant features (also called variables or attributes) from the raw data to improve the performance of machine learning models.

Feature engineering aims to extract valuable information from the data, reduce noise, and represent it in a way that makes it easier for algorithms to learn from. It plays a crucial role in building an effective machine learning model, as the quality of the features directly impacts the model's performance.

Feature engineering can involve several techniques, such as:

Feature extraction:

Extracting new, meaningful features from the raw data. For example, creating interaction terms, polynomial features, or decomposing timestamps into separate components like day of the week, month, or year; a short sketch follows the list below.

  • Interaction terms: Creating new features by multiplying or dividing two or more existing features. This can help capture relationships between features that are not apparent individually.
  • Polynomial features: Generating new features by raising existing features to a power. This can help capture non-linear relationships in the data.
  • Time-based features: Breaking down time-based features like timestamps into separate components such as day of the week, month, or year can help reveal patterns or seasonality in the data.
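
A minimal sketch of these three extraction techniques using pandas; the column names (`price`, `quantity`, `timestamp`) and values are made up for illustration.

```python
import pandas as pd

# Hypothetical raw data: two numeric columns and a timestamp.
df = pd.DataFrame({
    "price": [10.0, 12.5, 9.0],
    "quantity": [3, 1, 4],
    "timestamp": pd.to_datetime(["2023-01-02", "2023-02-14", "2023-03-31"]),
})

# Interaction term: multiply two existing features.
df["price_x_quantity"] = df["price"] * df["quantity"]

# Polynomial feature: raise an existing feature to a power.
df["price_squared"] = df["price"] ** 2

# Time-based features: decompose the timestamp into separate components.
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["month"] = df["timestamp"].dt.month
df["year"] = df["timestamp"].dt.year
```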

Feature scaling:

Scaling features to a common scale so that their values can be compared fairly. Common scaling techniques include min-max scaling, standardisation (z-score), and normalisation; a brief example follows the list.

  • Min-max scaling: Scaling features to a specific range, typically [0, 1], by subtracting the minimum value and dividing by the range (max - min). This technique is sensitive to outliers.
  • Standardisation (z-score): Scaling features by subtracting the mean and dividing by the standard deviation. This results in features with a mean of 0 and a standard deviation of 1, making it less sensitive to outliers than min-max scaling.
  • Normalisation: Scaling features by dividing each value by the feature's L2-norm (Euclidean length). This is useful when dealing with features that have different units or magnitudes.
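
A minimal sketch of the three scaling approaches, using scikit-learn for min-max scaling and standardisation and plain NumPy for the per-feature L2 normalisation described above; the small matrix `X` is invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Min-max scaling to [0, 1]: (x - min) / (max - min), column-wise.
X_minmax = MinMaxScaler().fit_transform(X)

# Standardisation (z-score): (x - mean) / std, column-wise.
X_standard = StandardScaler().fit_transform(X)

# Normalisation: divide each feature column by its L2-norm (Euclidean length).
X_l2 = X / np.linalg.norm(X, axis=0)
```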

Feature selection:

Identifying the most important features for the model by either eliminating irrelevant or redundant features or selecting a subset of features that contribute the most to the model's performance. Feature selection techniques include filter, wrapper, and embedded methods; a short sketch follows the list.

  • Filter methods: Selecting features based on their relationship with the target variable, using metrics like correlation, mutual information, or the chi-squared test. These methods are computationally efficient but may not capture interactions between features.
  • Wrapper methods: Selecting features by evaluating the performance of different feature subsets using a specific machine learning algorithm. Examples include forward selection, backward elimination, and recursive feature elimination. These methods can be computationally expensive but often yield better performance.
  • Embedded methods: Selecting features as part of the model training process. Examples include LASSO regularisation for linear regression or feature importance from tree-based models like Random Forest or XGBoost.
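
A sketch of one technique from each family using scikit-learn on synthetic regression data; the choice of scorer, estimators, and k=3 is illustrative only, not a recommendation.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import Lasso, LinearRegression

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

# Filter method: score each feature against the target, keep the top k.
filter_mask = SelectKBest(score_func=f_regression, k=3).fit(X, y).get_support()

# Wrapper method: recursive feature elimination around a chosen estimator.
wrapper_mask = RFE(LinearRegression(), n_features_to_select=3).fit(X, y).support_

# Embedded method: LASSO shrinks irrelevant coefficients to exactly zero.
embedded_mask = Lasso(alpha=1.0).fit(X, y).coef_ != 0
```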

Feature transformation:

Applying mathematical transformations to the features to achieve a more desirable distribution or relationship with the target variable. Examples include log transformation, square root transformation, and power transformation; see the sketch after this list.

  • Log transformation: Applying the natural logarithm to the features can help reduce the impact of outliers or skewed distributions.
  • Square root transformation: Taking the square root of the features can help reduce the impact of outliers and better capture non-linear relationships.
  • Power transformation: Applying a power (exponent) to the features, such as the Box-Cox or Yeo-Johnson transformations, can help stabilise variance or achieve normality.
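
A brief sketch of the three transformations with NumPy and scikit-learn; the skewed column is fabricated, and log1p is used so zero values would not break the logarithm.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

x = np.array([[1.0], [10.0], [100.0], [1000.0]])  # a right-skewed feature

# Log transformation (log1p handles zeros gracefully).
x_log = np.log1p(x)

# Square root transformation.
x_sqrt = np.sqrt(x)

# Power transformations: Box-Cox needs strictly positive values,
# while Yeo-Johnson also accepts zeros and negatives.
x_boxcox = PowerTransformer(method="box-cox").fit_transform(x)
x_yeojohnson = PowerTransformer(method="yeo-johnson").fit_transform(x)
```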

Handling missing values:

Filling in missing values in the data using various strategies like imputation, deletion, or interpolation; a short example follows the list.

  • Imputation: Filling in missing values with a specific value, such as the mean, median, or mode of the feature. This can be done globally or on a per-group basis.
  • Deletion: Removing instances (rows) with missing values can be useful when the amount of missing data is small and the values are missing at random.
  • Interpolation: Estimating missing values based on the values of neighbouring instances, such as linear interpolation or spline interpolation. This can be particularly useful for time series data.
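
A short sketch of the three strategies with pandas; the `temperature` and `city` columns and their gaps are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "temperature": [21.0, np.nan, 23.5, np.nan, 25.0],
    "city": ["A", "A", "B", "B", "B"],
})

# Imputation: fill with a global statistic, or on a per-group basis.
df["temp_imputed"] = df["temperature"].fillna(df["temperature"].mean())
df["temp_imputed_by_city"] = df.groupby("city")["temperature"].transform(
    lambda s: s.fillna(s.mean())
)

# Deletion: drop rows where the value is missing.
df_complete = df.dropna(subset=["temperature"])

# Interpolation: estimate from neighbouring rows (handy for ordered or time series data).
df["temp_interpolated"] = df["temperature"].interpolate(method="linear")
```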

Handling categorical variables:

Converting categorical variables into numerical values using one-hot, label, or target encoding techniques, as sketched after the list.

  • One-hot encoding: Creating binary features for each category of a categorical variable can be useful for linear models or when the number of categories is small.
  • Label encoding: Assigning an integer to each category of a categorical variable can be useful for tree-based models but may introduce an arbitrary ordering.
  • Target encoding: Replacing the categorical variable with the mean of the target variable for each category can be useful when the number of categories is large but may introduce leakage if not done carefully.
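
A compact sketch of the three encodings with pandas; the `colour` and `target` columns are made up, and the naive target encoding shown here should in practice be computed on training folds only to avoid leakage.

```python
import pandas as pd

df = pd.DataFrame({
    "colour": ["red", "green", "blue", "green"],
    "target": [1.0, 0.0, 1.0, 1.0],
})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["colour"], prefix="colour")

# Label encoding: one (arbitrary) integer per category.
df["colour_label"] = df["colour"].astype("category").cat.codes

# Target encoding: mean of the target per category.
df["colour_target_enc"] = df["colour"].map(df.groupby("colour")["target"].mean())
```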

Feature engineering for specific algorithms:

Adapting features to the requirements of specific machine learning algorithms, such as creating polynomial features for linear regression or distance-based features for clustering algorithms; a brief sketch follows the list.

  • Linear regression: Creating polynomial features or interaction terms can help capture non-linear relationships in the data.
  • Clustering algorithms: Creating distance-based features, such as Euclidean distance or cosine similarity between instances, can improve the performance of clustering algorithms like K-means or hierarchical clustering.
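
A small sketch of both ideas with scikit-learn; the toy matrix `X` and the choice of two clusters are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])

# For linear regression: add squared terms and pairwise interaction terms.
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

# For clustering: distances to fitted cluster centres and pairwise cosine similarity.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist_to_centres = euclidean_distances(X, kmeans.cluster_centers_)
pairwise_cosine = cosine_similarity(X)
```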

Domain-specific feature engineering:

In some cases, domain knowledge can be used to create more relevant features that better capture the underlying patterns in the data (see the sketch after this list). For example:

  • Text data: Techniques like text tokenisation, stopword removal, stemming, lemmatisation, n-grams, term frequency-inverse document frequency (TF-IDF), or word embeddings (e.g., Word2Vec, GloVe, or BERT) can be used to transform text data into a structured format.
  • Image data: Image processing techniques like resizing, cropping, histogram equalisation, or feature extraction using convolutional neural networks (CNNs) can be employed to extract useful information from images.
  • Time series data: Time-based features like lagged variables, rolling windows, or Fourier transformations can help reveal patterns, trends, or seasonality in time series data.
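
Two quick sketches of domain-specific features: one for text using TF-IDF and one for a daily time series using lag and rolling-window features; the documents and sales figures are invented for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Text data: TF-IDF turns free text into a sparse numeric matrix.
docs = ["the cat sat on the mat", "the dog chased the cat"]
X_text = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Time series data: lagged and rolling-window features on a daily series.
ts = pd.DataFrame(
    {"sales": [10, 12, 13, 15, 14, 18, 20]},
    index=pd.date_range("2023-01-01", periods=7, freq="D"),
)
ts["sales_lag_1"] = ts["sales"].shift(1)
ts["sales_rolling_mean_3"] = ts["sales"].rolling(window=3).mean()
```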

By carefully engineering features, data scientists can improve the performance and interpretability of machine learning models and ultimately derive more value from the data.