Train Test Split
A train-test split is a technique used to evaluate the performance of a machine learning model. It involves dividing a single dataset into two separate subsets: one to train the model and one to test it.
To apply it, first define your target and feature variables.
# Features (X) are the variables used to predict the target (y)
X = df.drop(columns=["target variable"])  # everything other than the target
y = df["target variable"]  # the target variable itself
This assumes you already have a DataFrame, which we refer to as df. You can also exclude more than one column from X, so the dropped columns do not have to be only the target variable. In particular, features that are highly correlated with the target because they are derived from it, or otherwise act as stand-ins for it, should not be included.
Data Leakage Prevention: Excluding features that exhibit an artificially high correlation with the target variable prevents the model from relying on proxy targets. This helps the model generalize to genuine, unseen production data.
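As a rough illustration of the leakage check described above, the sketch below flags numeric features whose absolute correlation with the target exceeds a cutoff. The toy DataFrame, its column names, and the 0.95 threshold are all assumptions made for this example; a high correlation is only a hint that a column may be a proxy for the target, not proof of leakage.

```python
import pandas as pd

# Hypothetical toy data: "leaky" is nearly a copy of the target.
toy = pd.DataFrame({
    "hours": [2, 1, 4, 3, 6, 5, 8, 7],
    "noise": [5, 3, 8, 1, 9, 2, 7, 4],
    "leaky": [10.1, 19.8, 30.2, 39.9, 50.1, 60.0, 69.7, 80.2],
    "target": [10, 20, 30, 40, 50, 60, 70, 80],
})

THRESHOLD = 0.95  # assumed cutoff, tune it for your own data

# Absolute Pearson correlation of every feature with the target
corr_with_target = toy.corr()["target"].drop("target").abs()

# Features above the cutoff are candidates for exclusion
suspects = corr_with_target[corr_with_target > THRESHOLD].index.tolist()

# Drop the target and any suspected proxy columns from the features
X_toy = toy.drop(columns=["target"] + suspects)
y_toy = toy["target"]

print(suspects)
print(list(X_toy.columns))
```

Flagged columns still deserve a manual look: a legitimately strong predictor can also correlate highly with the target.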
Note
train_test_split is imported from sklearn.model_selection.
Let's load our dataset and then implement the train-test split. The data is loaded with the pandas read_csv function.
import pandas as pd
df = pd.read_csv("../data/StudentsPerformance.csv")
df.head(3)
| | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
df.columns
Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch',
'test preparation course', 'math score', 'reading score',
'writing score'],
dtype='object')
For this dataset, we can combine the math, reading, and writing scores into a single column, total_score, defined as their average.
df['total_score'] = (df['math score'] + df['reading score'] + df['writing score'])/3
df["total_score"].head()
0 72.666667
1 82.333333
2 92.666667
3 49.333333
4 76.333333
Name: total_score, dtype: float64
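The same average can also be written with a row-wise mean, which scales more easily if further score columns are added later. The sketch below rebuilds only the first three rows shown in the table above as a small stand-in DataFrame; the names scores and score_cols are introduced here for illustration.

```python
import pandas as pd

# First three rows of the score columns, copied from the table above
scores = pd.DataFrame({
    "math score": [72, 69, 90],
    "reading score": [72, 90, 95],
    "writing score": [74, 88, 93],
})

score_cols = ["math score", "reading score", "writing score"]

# Row-wise mean across the three score columns
scores["total_score"] = scores[score_cols].mean(axis=1)

print(scores["total_score"])
```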
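Now we can define our X and y variables. The three individual scores are dropped together with total_score because total_score is computed directly from them, so keeping them in X would leak the target.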
X = df.drop(columns=["total_score", "math score", "reading score", "writing score"])
y = df["total_score"]
X.head()
| | gender | race/ethnicity | parental level of education | lunch | test preparation course |
|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none |
| 1 | female | group C | some college | standard | completed |
| 2 | female | group B | master's degree | standard | none |
| 3 | male | group A | associate's degree | free/reduced | none |
| 4 | male | group C | some college | standard | none |
y.head()
0 72.666667
1 82.333333
2 92.666667
3 49.333333
4 76.333333
Name: total_score, dtype: float64
X.shape
(1000, 5)
y.shape
(1000,)
Now that \(X\) and \(y\) are ready, we can split the dataset into training and test sets, with a test size of 10 percent.
from sklearn.model_selection import train_test_split
# X = df.drop(columns=["total_score", "math score", "reading score", "writing score"])
# y = df["total_score"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
Interpretation
This code splits the dataset by allocating \(90\%\) of the observations to the training subsets \(X_{\text{train}}\), \(y_{\text{train}}\) for fitting the model, while reserving the remaining \(10\%\) for the testing subsets \(X_{\text{test}}\), \(y_{\text{test}}\) to evaluate generalization performance. Setting random_state to 1 seeds the shuffling, so the same rows are assigned to each subset every time the notebook is executed.
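To check both claims (the 90/10 proportions and the reproducibility from a fixed random_state), here is a minimal sketch on synthetic data with the same number of rows as the dataset above. The names X_demo and y_demo and their contents are placeholders, not the student data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 1000 rows (placeholder values)
X_demo = pd.DataFrame({"feature": np.arange(1000)})
y_demo = pd.Series(np.arange(1000), name="target")

# Two independent calls with the same seed
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=1
)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=1
)

# Sizes: 900 training rows and 100 test rows
print(X_tr.shape, X_te.shape)

# Same seed, same partition
same_test_rows = X_te.index.equals(X_te2.index)
print(same_test_rows)

# No row appears in both subsets
overlap = set(X_tr.index) & set(X_te.index)
print(len(overlap))
```

Changing random_state to another integer gives a different but equally reproducible partition.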