Preprocessing

Preprocessing#

Preprocessing in scikit-learn (sklearn) is the process of transforming raw data into a clean, structured format suitable for machine learning algorithms. Raw datasets often contain missing values, inconsistent scales, categorical text, or outliers that can degrade model performance. To address this, sklearn provides built-in utilities for handling features: StandardScaler standardizes numerical ranges, OneHotEncoder encodes text categories as numbers, and SimpleImputer (from sklearn.impute) fills in missing data. By converting messy real-world data into uniform numerical matrices, preprocessing helps models learn patterns more reliably and generalize better to new data.
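As a rough illustration of the three utilities named above, the following sketch runs each of them on a tiny made-up dataset. The `ages` and `cities` arrays are hypothetical values chosen for this example and are not part of the tutorial's dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric column with a gap, one text column
ages = np.array([[20.0], [np.nan], [40.0]])
cities = np.array([["Paris"], ["Rome"], ["Paris"]])

# Fill the missing age with the mean of the observed ages
ages_filled = SimpleImputer(strategy="mean").fit_transform(ages)

# Rescale the filled column to zero mean and unit standard deviation
ages_scaled = StandardScaler().fit_transform(ages_filled)

# Turn each distinct city into its own 0/1 column
# (.toarray() converts the encoder's sparse result to a dense array)
city_onehot = OneHotEncoder().fit_transform(cities).toarray()

print(ages_filled.ravel())
print(ages_scaled.ravel())
print(city_onehot)
```

Each transformer follows the same fit/transform pattern, which is what lets them be combined later in the workflow.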

Note

StandardScaler and OneHotEncoder are both imported from sklearn.preprocessing:

from sklearn.preprocessing import StandardScaler, OneHotEncoder

StandardScaler#

StandardScaler standardizes data by shifting and scaling each feature: it subtracts the feature's mean and divides by its standard deviation, so the transformed feature has a mean of zero and a standard deviation of one. This puts all features on a comparable scale and prevents those with larger numerical ranges from dominating algorithms that are sensitive to scale.
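To make the shift-and-scale step concrete, here is a minimal sketch that computes z = (x - mean) / std by hand and compares it with StandardScaler. The toy array is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up): two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Manual standardization, column by column: z = (x - mean) / std.
# np.std defaults to the population standard deviation (ddof=0),
# which is also what StandardScaler uses.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))  # True: both approaches agree
print(scaled.mean(axis=0))          # approximately 0 for each column
print(scaled.std(axis=0))           # approximately 1 for each column
```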

To implement StandardScaler in Python using scikit-learn, use the following code:

from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler().set_output(transform="pandas")

Calling set_output(transform="pandas") makes the scaler return a pandas DataFrame instead of a NumPy array, so column names are preserved. This option is available in scikit-learn 1.2 and later.

Because this scaler only works on numeric data, it helps to inspect the numeric columns first; select_dtypes(include='number') returns them. Replacing 'number' with 'object' selects the non-numeric columns instead, which is useful when working with encoders.
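A small sketch of splitting columns by dtype is shown below. The `toy` frame and its values are made up. It uses exclude="number" for the non-numeric side, which is an assumption made here so the example does not depend on how a given pandas version stores text columns.

```python
import pandas as pd

# Made-up rows, only to illustrate column selection by dtype
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "lunch": ["standard", "free/reduced", "standard"],
    "math score": [72, 69, 90],
})

# Columns a scaler can handle
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else, typically passed to an encoder
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```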

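import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset and preview the first rows
df = pd.read_csv("../data/StudentsPerformance.csv")
df.head(3)

# Target: the mean of the three exam scores (named total_score here)
df["total_score"] = (df["math score"] + df["reading score"] + df["writing score"]) / 3

# Features: every column except the target and the scores it is derived from
X = df.drop(columns=["total_score", "math score", "reading score", "writing score"])
y = df["total_score"]

# Hold out 10% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.1)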

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().set_output(transform="pandas")

df.select_dtypes(include="number")

	math score	reading score	writing score	total_score
0	72	72	74	72.666667
1	69	90	88	82.333333
2	90	95	93	92.666667
3	47	57	44	49.333333
4	76	78	75	76.333333
...	...	...	...	...
995	88	99	95	94.000000
996	62	55	55	57.333333
997	59	71	65	65.000000
998	68	78	77	74.333333
999	77	86	86	83.000000

1000 rows × 4 columns

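# StandardScaler needs 2-D numeric input. After the drop above, X_train holds
# only categorical columns, so it is left for the encoders; the numeric target
# is scaled here as a one-column DataFrame.
# Fit on the training split only...
y_train_scaled = scaler.fit_transform(y_train.to_frame())

# ...then reuse the training mean and standard deviation on the test split
y_test_scaled = scaler.transform(y_test.to_frame())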

Warning

Data leakage: call fit_transform only on the training data, and use transform on the test data. Fitting the scaler on the test data leaks information about the test set into the preprocessing step.
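The rule in this warning can be sketched on synthetic data as follows. The random array stands in for real features and is an assumption made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features (made up for this example)
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))

X_train, X_test = train_test_split(X, test_size=0.1, random_state=1)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

# The stored mean comes from the training split, not from the full dataset
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))  # True

# The test split is scaled with training statistics, so its own mean is
# generally close to, but not exactly, zero
print(X_test_scaled.mean(axis=0))
```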

