Preprocessing#
Preprocessing in scikit-learn (sklearn) is the process of transforming raw data into a clean, structured format suitable for machine learning algorithms. Raw datasets often contain missing values, features on very different scales, categorical text, or outliers, all of which can degrade model performance. To address this, sklearn provides built-in transformers: StandardScaler standardizes numerical features, OneHotEncoder encodes text categories as numbers, and SimpleImputer fills in missing values. Converting messy real-world data into uniform numerical matrices helps models learn patterns accurately and generalize to new data.
Note
StandardScaler and OneHotEncoder are imported from sklearn.preprocessing:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
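As a rough illustration of the three utilities named above, the following sketch applies each one to a tiny made-up DataFrame. The column names and values are hypothetical and are not taken from the dataset used later on this page. Note that SimpleImputer lives in sklearn.impute rather than sklearn.preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric column with a gap, one categorical column
toy = pd.DataFrame({
    "hours": [1.0, 2.0, np.nan, 3.0],
    "lunch": ["standard", "free", "standard", "free"],
})

# Fill the missing value with the column mean: (1 + 2 + 3) / 3 = 2
hours_imputed = SimpleImputer(strategy="mean").fit_transform(toy[["hours"]])

# Rescale the imputed column to zero mean and unit standard deviation
hours_scaled = StandardScaler().fit_transform(hours_imputed)

# Turn each category into its own 0/1 column; the result is sparse by default,
# so it is converted to a dense array for printing
lunch_encoded = OneHotEncoder().fit_transform(toy[["lunch"]]).toarray()

print(hours_imputed.ravel())
print(hours_scaled.ravel())
print(lunch_encoded)
```

Each transformer is applied to its own column here only to keep the sketch short; the point is simply that every step takes a 2D input and returns a numeric matrix.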
StandardScaler#
StandardScaler standardizes data by shifting and scaling its values. Specifically, it transforms each feature (column) so that it has a mean of zero and a standard deviation of one: every value x becomes (x - mean) / std, where the mean and standard deviation are computed per column. This puts all features on a comparable scale and prevents those with larger numerical ranges from dominating algorithms that are sensitive to feature scale.
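To make the arithmetic concrete, here is a minimal sketch that standardizes one made-up feature by hand and compares the result with StandardScaler. The sample values are hypothetical. The manual version uses the population standard deviation (NumPy's default, ddof=0), which is what StandardScaler uses as well.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature with five samples (2D: rows are samples)
x = np.array([[47.0], [69.0], [72.0], [76.0], [90.0]])

# Manual standardization: subtract the mean, divide by the population std (ddof=0)
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)

# The same transformation through scikit-learn
z_sklearn = StandardScaler().fit_transform(x)

# Both approaches should agree, and the scaled column has mean ~0 and std ~1
print(np.allclose(z_manual, z_sklearn))
print(z_sklearn.mean(axis=0), z_sklearn.std(axis=0))
```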
To implement StandardScaler in Python using scikit-learn, use the following code:
from sklearn.preprocessing import StandardScaler
# Initialize the scaler
scaler = StandardScaler().set_output(transform="pandas")
Adding set_output(transform="pandas") makes the scaler return a pandas DataFrame instead of a NumPy array, so column names and the index are preserved.
Because this scaler only works with numeric data, it helps to check which columns are numeric first; select_dtypes(include='number') returns exactly those columns. Passing include='object' instead returns the non-numeric columns, which are the ones an encoder would handle.
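A minimal sketch of both selections on a made-up frame follows. The column names are borrowed from the dataset for readability, but the values are hypothetical, and the text columns are given the object dtype explicitly as an assumption, since newer pandas versions may infer a dedicated string dtype.

```python
import pandas as pd

# Hypothetical frame; text columns are created with an explicit object dtype
people = pd.DataFrame({
    "gender": pd.Series(["female", "male", "female"], dtype=object),
    "lunch": pd.Series(["standard", "free", "standard"], dtype=object),
    "math score": [72, 69, 90],
})

# Numeric columns: candidates for a scaler
numeric_cols = people.select_dtypes(include="number")
# Object columns: candidates for an encoder
object_cols = people.select_dtypes(include="object")

print(list(numeric_cols.columns))
print(list(object_cols.columns))
```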
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv("../data/StudentsPerformance.csv")
df.head(3)
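# Target: the mean of the three exam scores (stored under the name total_score)
df['total_score'] = (df['math score'] + df['reading score'] + df['writing score']) / 3
# Features: everything except the target and the scores it is computed from
X = df.drop(columns=["total_score", "math score", "reading score", "writing score"])
y = df["total_score"]
# Hold out 10% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.1)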
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().set_output(transform="pandas")
df.select_dtypes(include="number")
| | math score | reading score | writing score | total_score |
|---|---|---|---|---|
| 0 | 72 | 72 | 74 | 72.666667 |
| 1 | 69 | 90 | 88 | 82.333333 |
| 2 | 90 | 95 | 93 | 92.666667 |
| 3 | 47 | 57 | 44 | 49.333333 |
| 4 | 76 | 78 | 75 | 76.333333 |
| ... | ... | ... | ... | ... |
| 995 | 88 | 99 | 95 | 94.000000 |
| 996 | 62 | 55 | 55 | 57.333333 |
| 997 | 59 | 71 | 65 | 65.000000 |
| 998 | 68 | 78 | 77 | 74.333333 |
| 999 | 77 | 86 | 86 | 83.000000 |
1000 rows × 4 columns
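# X_train holds only categorical columns (all numeric columns listed above were
# dropped or used as the target), so StandardScaler cannot be fitted on it;
# those columns need an encoder instead. The numeric target can be scaled.
# StandardScaler expects 2D input, so the Series becomes a one-column DataFrame.
y_train_scaled = scaler.fit_transform(y_train.to_frame())
# Reuse the statistics learned from the training data; do not refit on the test set
y_test_scaled = scaler.transform(y_test.to_frame())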
Warning
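Data leakage: call fit_transform only on the training data and use transform on the test data. Fitting on the test data lets its mean and standard deviation influence the preprocessing, which leaks information about the test set and makes evaluation results look better than they are.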
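The pattern described in this warning can be sketched on synthetic numeric data as follows. The feature matrix is randomly generated and purely hypothetical; it stands in for any numeric training and test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric feature matrix: 100 samples, 2 features
rng = np.random.default_rng(0)
features = rng.normal(loc=50.0, scale=10.0, size=(100, 2))

train, test = train_test_split(features, test_size=0.1, random_state=1)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

# The stored statistics come from the training rows, not from the full data
print(np.allclose(scaler.mean_, train.mean(axis=0)))
# Training columns are centred on zero; test columns are only roughly centred
print(train_scaled.mean(axis=0), test_scaled.mean(axis=0))
```

The test columns will usually not have a mean of exactly zero after scaling. That is expected: the test set is being measured with the yardstick learned from the training set.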