Train Test Split

A train-test split is a technique used to evaluate the performance of a machine learning model. It involves dividing a single dataset into two separate subsets: one to train the model and one to test it.

To apply it, you first need to define your target and feature variables.

# Features (X) are the variables used to predict the target (y)
X = df.drop(columns=["target variable"])  # everything other than the target
y = df["target variable"]  # the target variable itself

Before that, you need a DataFrame, which we refer to here as df. You can also exclude more than one column from X, so the dropped set does not have to be only the target variable. In particular, features that are highly correlated with the target because they are derived from it, or otherwise encode it, should be left out. This is data leakage prevention: excluding features that show an artificially high correlation with the target stops the model from relying on proxies for the target and helps it generalize to genuinely unseen production data.
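One rough way to spot leakage candidates is to look at how strongly each numeric feature correlates with the target. The sketch below is only an illustration: the toy DataFrame, its column names, and the 0.95 threshold are all hypothetical, and a high correlation is a prompt to inspect a feature, not proof that it leaks.

```python
import pandas as pd

# Hypothetical toy data: "final_grade" is the target, and
# "final_grade_rounded" is a proxy derived directly from it.
toy = pd.DataFrame({
    "hours_studied": [2, 9, 4, 7, 1, 8],
    "final_grade": [70.2, 75.7, 88.1, 62.4, 80.3, 67.9],
    "final_grade_rounded": [70, 76, 88, 62, 80, 68],
})

target = "final_grade"

# Absolute Pearson correlation of every other column with the target
corr_with_target = toy.corr()[target].drop(target).abs()

# Arbitrary cut-off chosen for this illustration only
threshold = 0.95
suspects = corr_with_target[corr_with_target > threshold].index.tolist()
print(suspects)

# Drop the target and the suspected proxies when building X
X_toy = toy.drop(columns=[target] + suspects)
y_toy = toy[target]
```

Whether a flagged column really leaks the target still depends on how it was produced, so the final decision should rest on domain knowledge, not on the threshold alone.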

Note

train_test_split is imported from sklearn.model_selection.

Let’s load our dataset and then implement the train-test split. The data is loaded with the pandas read_csv function.

import pandas as pd
df = pd.read_csv("../data/StudentsPerformance.csv")
df.head(3)
   gender  race/ethnicity  parental level of education     lunch  test preparation course  math score  reading score  writing score
0  female         group B            bachelor's degree  standard                     none          72             72             74
1  female         group C                 some college  standard                completed          69             90             88
2  female         group B              master's degree  standard                     none          90             95             93
df.columns
Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch',
       'test preparation course', 'math score', 'reading score',
       'writing score'],
      dtype='object')

For this dataset, we can combine the math, reading, and writing scores into a single column, total_score, which holds the average of the three.

df['total_score'] = (df['math score'] + df['reading score'] + df['writing score'])/3
df["total_score"].head()
0    72.666667
1    82.333333
2    92.666667
3    49.333333
4    76.333333
Name: total_score, dtype: float64
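The same average can also be written as a row-wise mean, which scales more easily if further score columns are added later. The snippet below is a small sketch on a hypothetical three-row sample built from the first rows shown earlier; `score_cols` and `sample` are names introduced only for this illustration.

```python
import pandas as pd

# Hypothetical three-row sample mirroring the first rows of the dataset
sample = pd.DataFrame({
    "math score": [72, 69, 90],
    "reading score": [72, 90, 95],
    "writing score": [74, 88, 93],
})

score_cols = ["math score", "reading score", "writing score"]

# Explicit sum divided by the number of score columns
manual = (sample["math score"] + sample["reading score"] + sample["writing score"]) / 3

# Equivalent row-wise mean across the score columns
row_mean = sample[score_cols].mean(axis=1)

print(row_mean)
```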

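Now we can define our X and y variables. The three individual scores are dropped together with total_score, because total_score is computed directly from them.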

X = df.drop(columns=["total_score", "math score", "reading score", "writing score"])
y = df["total_score"]
X.head()
   gender  race/ethnicity  parental level of education         lunch  test preparation course
0  female         group B            bachelor's degree      standard                     none
1  female         group C                 some college      standard                completed
2  female         group B              master's degree      standard                     none
3    male         group A           associate's degree  free/reduced                     none
4    male         group C                 some college      standard                     none
y.head()
0    72.666667
1    82.333333
2    92.666667
3    49.333333
4    76.333333
Name: total_score, dtype: float64
X.shape
(1000, 5)
y.shape
(1000,)

Now that \(X\) and \(y\) are ready, we can split the dataset into training and test sets, with a test size of 10 percent.

from sklearn.model_selection import train_test_split

# X and y were defined earlier as:
# X = df.drop(columns=["total_score", "math score", "reading score", "writing score"])
# y = df["total_score"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)

Interpretation

This code splits the dataset by allocating \(90\%\) of the observations to the training subsets \(X_{\text{train}}\), \(y_{\text{train}}\), which are used to fit the model, while reserving the remaining \(10\%\) for the testing subsets \(X_{\text{test}}\), \(y_{\text{test}}\), which are used to evaluate generalization performance. With 1000 rows, that means 900 training rows and 100 test rows. Setting random_state to 1 seeds the random shuffling, so the same rows end up in the same subsets every time the notebook is executed.
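The sizes and the reproducibility described here can be checked directly. The sketch below uses a hypothetical stand-in dataset with 1000 rows (random numbers, not the student data), so only the shapes and the repeatability carry over, not the values.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with the same number of rows (1000) as the dataset
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"feature": rng.normal(size=1000)})
y_demo = pd.Series(rng.normal(size=1000), name="target")

# Same settings as in the text: 10% test size, random_state=1
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=1
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state selects the same rows
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=1
)
same_rows = X_te.index.equals(X_te2.index)
print(same_rows)
```

Changing or omitting random_state would generally produce a different partition on each run, which is fine for experimentation but makes results harder to compare across runs.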