Train Test Split

A train-test split is a technique used to evaluate the performance of a machine learning model. It involves dividing a single dataset into two separate subsets: one to train the model and one to test it.

To apply it, first define your feature and target variables.

# features (X) are the variables used to predict the target (y)
X = df.drop(columns=["target variable"])  # every column other than the target
y = df["target variable"]  # the target variable itself

This assumes you already have a DataFrame, which we call df. You can drop more than one column when building X; it does not have to be only the target variable. In particular, features that are highly correlated with the target only because they are derived from it should be left out.

Data leakage prevention: excluding features whose correlation with the target is artificially high prevents the model from relying on proxies for the target. This helps the model generalize to genuinely unseen production data.
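One way to screen for such proxy features is to look at how strongly each numeric feature correlates with the target. The sketch below is only an illustration: the toy DataFrame, its column names, and the 0.95 cutoff are all hypothetical, and a high correlation is a reason to inspect a feature, not proof of leakage.

```python
import pandas as pd

# Hypothetical toy data: "final_grade" is the target, and "grade_in_points"
# is a rescaled copy of it, i.e. a proxy that would leak the target.
toy = pd.DataFrame({
    "hours_studied": [2, 9, 4, 7, 1, 8],
    "grade_in_points": [55, 60, 72, 78, 85, 93],
    "final_grade": [5.5, 6.0, 7.2, 7.8, 8.5, 9.3],
})

target = "final_grade"
threshold = 0.95  # illustrative cutoff, not a standard value

# Absolute Pearson correlation of every feature with the target
correlations = toy.drop(columns=[target]).corrwith(toy[target]).abs()

# Features above the cutoff are candidates for exclusion
suspects = correlations[correlations > threshold].index.tolist()

# Drop the target and the suspected proxies when building X
X_toy = toy.drop(columns=[target] + suspects)
y_toy = toy[target]

print(correlations)
print("Suspected proxy features:", suspects)
```

Correlation only applies to numeric columns; for categorical features, domain knowledge about how each column was produced is usually the better guide.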

Note

train_test_split is imported from sklearn.model_selection.

Let’s load our dataset and then implement the train-test split. The data is loaded with the pandas read_csv function.

import pandas as pd
df = pd.read_csv("../data/StudentsPerformance.csv")
df.head(3)

	gender	race/ethnicity	parental level of education	lunch	test preparation course	math score	reading score	writing score
0	female	group B	bachelor's degree	standard	none	72	72	74
1	female	group C	some college	standard	completed	69	90	88
2	female	group B	master's degree	standard	none	90	95	93

df.columns

Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch',
       'test preparation course', 'math score', 'reading score',
       'writing score'],
      dtype='object')

For this dataset, we can combine the math, reading, and writing scores into a single column, total_score. Despite its name, this column holds the average of the three scores.

df['total_score'] = (df['math score'] + df['reading score'] + df['writing score'])/3

df["total_score"].head()

0    72.666667
1    82.333333
2    92.666667
3    49.333333
4    76.333333
Name: total_score, dtype: float64
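The same average can also be computed with a row-wise mean, which avoids hard-coding the divisor. This is a small sketch that rebuilds only the first three rows shown in the df.head(3) output above; the names scores and score_cols are introduced here purely for illustration.

```python
import pandas as pd

# First three rows of the score columns, copied from the df.head(3) output
scores = pd.DataFrame({
    "math score": [72, 69, 90],
    "reading score": [72, 90, 95],
    "writing score": [74, 88, 93],
})

score_cols = ["math score", "reading score", "writing score"]

# Row-wise mean; equivalent to summing the three columns and dividing by 3
scores["total_score"] = scores[score_cols].mean(axis=1)

print(scores["total_score"])
```

The first three values should match the first three rows of the total_score output above.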

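Now we can define our X and y variables. The three individual score columns are dropped along with total_score because the target is computed directly from them, so keeping them in X would leak the target.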

X = df.drop(columns=["total_score", "math score", "reading score", "writing score"])
y = df["total_score"]

X.head()

	gender	race/ethnicity	parental level of education	lunch	test preparation course
0	female	group B	bachelor's degree	standard	none
1	female	group C	some college	standard	completed
2	female	group B	master's degree	standard	none
3	male	group A	associate's degree	free/reduced	none
4	male	group C	some college	standard	none

y.head()

0    72.666667
1    82.333333
2    92.666667
3    49.333333
4    76.333333
Name: total_score, dtype: float64

X.shape

(1000, 5)

y.shape

(1000,)

Now that \(X\) and \(y\) are ready, we can split the dataset into training and test sets, with a test size of 10 percent.

from sklearn.model_selection import train_test_split

# X = df.drop(columns=["total_score", "math score", "reading score", "writing score"])
# y = df["total_score"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.1)

Interpretation

This code allocates \(90\%\) of the observations to the training subsets \(X_{\text{train}}\), \(y_{\text{train}}\) for fitting the model, and reserves the remaining \(10\%\) for the testing subsets \(X_{\text{test}}\), \(y_{\text{test}}\) to evaluate generalization performance. With 1000 rows, that is 900 training rows and 100 test rows. Setting random_state to 1 fixes the seed of the random shuffle, so the same rows land in each subset every time the notebook is executed.
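Both claims, the 90/10 sizes and the reproducibility, can be checked directly. The sketch below uses a synthetic 1000-row stand-in rather than the real CSV, so X_demo, y_demo, and the split helper are assumptions made only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 1000 rows, the same row count as the dataset above
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"feature": rng.integers(0, 5, size=1000)})
y_demo = pd.Series(rng.normal(size=1000), name="target")

def split(seed):
    # Hypothetical helper: same call as in the text, with a configurable seed
    return train_test_split(X_demo, y_demo, random_state=seed, test_size=0.1)

X_tr, X_te, y_tr, y_te = split(1)
X_tr2, X_te2, _, _ = split(1)   # same seed again
_, X_te3, _, _ = split(2)       # different seed

# Sizes of the two subsets
print(X_tr.shape, X_te.shape)

# The same seed selects the same test rows; a different seed generally does not
same_seed_same_rows = X_te.index.equals(X_te2.index)
different_seed_same_rows = X_te.index.equals(X_te3.index)
print(same_seed_same_rows, different_seed_same_rows)

# No row appears in both the training and the test subset
no_overlap = set(X_tr.index).isdisjoint(X_te.index)
print(no_overlap)
```

Comparing the row indices, rather than the values, is what shows that the same observations were selected.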