IBM HR Employee Attrition: A Statistical Analysis#

Project goal#

This notebook examines four questions about the IBM HR employee dataset. Rather than applying the same test repeatedly, each question uses a method matched to the types of variables involved.

Scenario	Research question	Method
1	Do employees who leave have shorter company tenure than employees who stay?	Mann–Whitney U + rank-biserial effect size
2	Is overtime work associated with employee attrition?	Chi-square + Cramér’s V
3	Is total working experience related to monthly income?	Spearman correlation
4	Does monthly income differ across job roles?	Kruskal–Wallis + Dunn post-hoc

Significance level#

Throughout the notebook, the significance level is:

alpha = 0.05

A statistically significant relationship does not prove that one factor causes another. Because the data are observational, this dataset supports conclusions about association only.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from itertools import combinations
from scipy.stats import (
    shapiro,
    mannwhitneyu,
    chi2_contingency,
    spearmanr,
    kruskal,
    rankdata,
    norm
)

pd.set_option("display.max_columns", None)
pd.set_option("display.float_format", lambda number: f"{number:,.4f}")

alpha = 0.05

1. Load and inspect the dataset#

df = pd.read_csv("../data/IBM_HR_Employee_Attrition.csv")

print("Dataset shape:", df.shape)
display(df.head())

Dataset shape: (1470, 35)

	Age	Attrition	BusinessTravel	DailyRate	Department	DistanceFromHome	Education	EducationField	EmployeeCount	EmployeeNumber	EnvironmentSatisfaction	Gender	HourlyRate	JobInvolvement	JobLevel	JobRole	JobSatisfaction	MaritalStatus	MonthlyIncome	MonthlyRate	NumCompaniesWorked	Over18	OverTime	PercentSalaryHike	PerformanceRating	RelationshipSatisfaction	StandardHours	StockOptionLevel	TotalWorkingYears	TrainingTimesLastYear	WorkLifeBalance	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager
0	41	Yes	Travel_Rarely	1102	Sales	1	2	Life Sciences	1	1	2	Female	94	3	2	Sales Executive	4	Single	5993	19479	8	Y	Yes	11	3	1	80	0	8	0	1	6	4	0	5
1	49	No	Travel_Frequently	279	Research & Development	8	1	Life Sciences	1	2	3	Male	61	2	2	Research Scientist	2	Married	5130	24907	1	Y	No	23	4	4	80	1	10	3	3	10	7	1	7
2	37	Yes	Travel_Rarely	1373	Research & Development	2	2	Other	1	4	4	Male	92	2	1	Laboratory Technician	3	Single	2090	2396	6	Y	Yes	15	3	2	80	0	7	3	3	0	0	0	0
3	33	No	Travel_Frequently	1392	Research & Development	3	4	Life Sciences	1	5	4	Female	56	3	1	Research Scientist	3	Married	2909	23159	1	Y	Yes	11	3	3	80	0	8	3	3	8	7	3	0
4	27	No	Travel_Rarely	591	Research & Development	2	1	Medical	1	7	1	Male	40	3	1	Laboratory Technician	2	Married	3468	16632	9	Y	No	12	3	4	80	1	6	3	3	2	2	2	2

print("Missing values in the dataset:", df.isna().sum().sum())
print("Duplicate rows:", df.duplicated().sum())

constant_columns = [
    column for column in df.columns
    if df[column].nunique() == 1
]

print("Constant columns:", constant_columns)

df = df.drop(
    columns=[
        "EmployeeNumber",
        "EmployeeCount",
        "Over18",
        "StandardHours"
    ]
)

print("Shape after removing ID and constant columns:", df.shape)

Missing values in the dataset: 0
Duplicate rows: 0
Constant columns: ['EmployeeCount', 'Over18', 'StandardHours']
Shape after removing ID and constant columns: (1470, 31)

Overall attrition summary#

Attrition is the central workplace outcome in this project. Before testing possible factors, we first check how many employees left and stayed.

attrition_summary = df["Attrition"].value_counts().rename_axis("Attrition").reset_index(name="Employees")
attrition_summary["Percentage"] = (
    attrition_summary["Employees"] / len(df) * 100
).round(1)

display(attrition_summary)

	Attrition	Employees	Percentage
0	No	1233	83.9000
1	Yes	237	16.1000

2. Scenario 1: Company tenure and attrition#

Research question#

Do employees who leave have shorter company tenure than employees who stay?

Grouping variable: Attrition with two independent groups, Yes and No
Outcome variable: YearsAtCompany
Test: Mann–Whitney U (two-sided), because tenure is right-skewed and non-normal within both attrition groups
Effect size: rank-biserial correlation
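
To make the effect size concrete, here is a minimal sketch on made-up values (not taken from the HR dataset). It checks that the rank-biserial formula r = 2U / (n1 * n2) - 1 equals the share of cross-group pairs in which the first group is larger minus the share in which it is smaller. The names `group_a` and `group_b` are illustrative only.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Toy samples (hypothetical values, not from the HR dataset)
group_a = np.array([1, 2, 3])
group_b = np.array([2, 4, 5, 6])

# SciPy returns the U statistic that corresponds to the first sample
u_stat = mannwhitneyu(group_a, group_b, alternative="two-sided").statistic

# Rank-biserial correlation computed from U
r_from_u = 2 * u_stat / (len(group_a) * len(group_b)) - 1

# The same quantity computed directly from all cross-group pairs
diffs = group_a[:, None] - group_b[None, :]
r_from_pairs = ((diffs > 0).sum() - (diffs < 0).sum()) / diffs.size

print(r_from_u, r_from_pairs)
```

Both values come out to -0.75 for this toy input: the first group is smaller in 10 of the 12 pairs, larger in 1, and tied in 1. A negative value therefore means the first sample tends to hold the smaller values.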

left_tenure = df.loc[df["Attrition"] == "Yes", "YearsAtCompany"]
stayed_tenure = df.loc[df["Attrition"] == "No", "YearsAtCompany"]

tenure_normality = pd.DataFrame({
    "Group": ["Left the company", "Stayed with the company"],
    "N": [len(left_tenure), len(stayed_tenure)],
    "Median Years": [left_tenure.median(), stayed_tenure.median()],
    "Skewness": [left_tenure.skew(), stayed_tenure.skew()],
    "Shapiro p-value": [
        shapiro(left_tenure).pvalue,
        shapiro(stayed_tenure).pvalue
    ]
})

tenure_normality["Normally Distributed?"] = np.where(
    tenure_normality["Shapiro p-value"] >= alpha,
    "Yes",
    "No"
)

display(tenure_normality.round(4))

	Group	N	Median Years	Skewness	Shapiro p-value	Normally Distributed?
0	Left the company	237	3.0000	2.6822	0.0000	No
1	Stayed with the company	1233	6.0000	1.6580	0.0000	No

plt.figure(figsize=(6, 4))

plt.boxplot(
    [left_tenure, stayed_tenure],
    tick_labels=["Left", "Stayed"]
)

plt.title("Company Tenure by Attrition Status")
plt.xlabel("Employee Group")
plt.ylabel("Years at Company")
plt.show()

[Figure: "Company Tenure by Attrition Status". Box plots of Years at Company for employees who left and employees who stayed.]

mann_whitney_result = mannwhitneyu(
    left_tenure,
    stayed_tenure,
    alternative="two-sided",
    method="asymptotic"
)

rank_biserial = (
    2 * mann_whitney_result.statistic /
    (len(left_tenure) * len(stayed_tenure))
) - 1

tenure_result = pd.DataFrame({
    "Test": ["Mann–Whitney U"],
    "Median: Left": [left_tenure.median()],
    "Median: Stayed": [stayed_tenure.median()],
    "U Statistic": [mann_whitney_result.statistic],
    "p-value": [mann_whitney_result.pvalue],
    "Rank-Biserial r": [rank_biserial],
    "Significant?": ["Yes" if mann_whitney_result.pvalue < alpha else "No"]
})

display(tenure_result.round(4))

	Test	Median: Left	Median: Stayed	U Statistic	p-value	Rank-Biserial r	Significant?
0	Mann–Whitney U	3.0000	6.0000	102,582.0000	0.0000	-0.2979	Yes

Interpretation#

Employees who left the company had a median tenure of 3 years, whereas employees who stayed had a median tenure of 6 years. The difference is statistically significant (p < 0.001), and the rank-biserial correlation of -0.30 is close to a moderate effect. The negative sign means that, across leaver/stayer pairs, the leaver more often has the shorter tenure. This suggests that retention challenges are more visible among employees with shorter organisational tenure.

3. Scenario 2: Overtime and attrition#

Research question#

Is overtime work associated with employee attrition?

Both variables are categorical:

OverTime: Yes or No
Attrition: Yes or No

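Therefore, the suitable method is the chi-square test of independence. For a 2×2 table, SciPy's chi2_contingency applies Yates' continuity correction by default, so the statistic reported in this section is the corrected one.
Cramér’s V is added to report the strength of the association.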
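
As a rough sketch of how Cramér's V follows from the chi-square statistic, the helper below (`cramers_v_from_table` is a hypothetical name, not part of SciPy) computes V with and without the continuity correction. The counts are the OverTime by Attrition crosstab from this section.

```python
import numpy as np
from scipy.stats import chi2_contingency


def cramers_v_from_table(table, correction=True):
    """Cramér's V for a contingency table given as a 2-D array of counts."""
    table = np.asarray(table)
    # Index 0 of the result is the chi-square statistic
    chi2 = chi2_contingency(table, correction=correction)[0]
    n = table.sum()
    return np.sqrt(chi2 / (n * (min(table.shape) - 1)))


# Rows: OverTime No / Yes; columns: Attrition No / Yes
observed = np.array([[944, 110],
                     [289, 127]])

v_corrected = cramers_v_from_table(observed, correction=True)
v_uncorrected = cramers_v_from_table(observed, correction=False)

print(round(v_corrected, 4), round(v_uncorrected, 4))
```

The corrected value matches the 0.2441 reported in this section. The uncorrected value is slightly larger because the continuity correction shrinks the chi-square statistic.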

overtime_table = pd.crosstab(df["OverTime"], df["Attrition"])

overtime_rates = (
    pd.crosstab(df["OverTime"], df["Attrition"], normalize="index")
    .mul(100)
    .round(1)
)

display(overtime_table)
display(overtime_rates.rename(columns={"Yes": "Left (%)", "No": "Stayed (%)"}))

Attrition	No	Yes
OverTime
No	944	110
Yes	289	127

Attrition	Stayed (%)	Left (%)
OverTime
No	89.6000	10.4000
Yes	69.5000	30.5000

attrition_rate_by_overtime = overtime_rates["Yes"]

plt.figure(figsize=(6, 4))
plt.bar(attrition_rate_by_overtime.index, attrition_rate_by_overtime.values)

plt.title("Attrition Rate by Overtime Status")
plt.xlabel("Overtime")
plt.ylabel("Attrition Rate (%)")

for position, value in enumerate(attrition_rate_by_overtime.values):
    plt.text(position, value + 0.6, f"{value:.1f}%", ha="center")

plt.ylim(0, attrition_rate_by_overtime.max() + 6)
plt.show()

[Figure: "Attrition Rate by Overtime Status". Bar chart of attrition rate (%) for employees without and with overtime.]

chi_square_result = chi2_contingency(overtime_table)

chi_square = chi_square_result.statistic
chi_p_value = chi_square_result.pvalue
expected_counts = chi_square_result.expected_freq

n = overtime_table.to_numpy().sum()
cramers_v = np.sqrt(
    chi_square / (n * (min(overtime_table.shape) - 1))
)

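# Compute the risk ratio from raw counts rather than from the rounded percentages
attrition_risk = overtime_table["Yes"] / overtime_table.sum(axis=1)
risk_ratio = attrition_risk["Yes"] / attrition_risk["No"]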

chi_square_summary = pd.DataFrame({
    "Test": ["Chi-square test"],
    "Chi-square Statistic": [chi_square],
    "p-value": [chi_p_value],
    "Cramer's V": [cramers_v],
    "Minimum Expected Count": [expected_counts.min()],
    "Significant?": ["Yes" if chi_p_value < alpha else "No"]
})

display(chi_square_summary.round(4))
print("Attrition rate risk ratio:", round(risk_ratio, 2))

	Test	Chi-square Statistic	p-value	Cramer's V	Minimum Expected Count	Significant?
0	Chi-square test	87.5643	0.0000	0.2441	67.0694	Yes

Attrition rate risk ratio: 2.93

Interpretation#

Employees working overtime had an attrition rate of 30.5%, compared with 10.4% among employees who did not work overtime. The association is statistically significant (χ² = 87.56, p < 0.001), with a Cramér’s V of 0.24, which is a small-to-moderate effect for a 2×2 table. In practical terms, the attrition rate among overtime workers is about 2.93 times the rate among non-overtime workers.
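
The risk ratio can be reproduced from the raw counts in the crosstab. The sketch below also adds an approximate 95% interval on the log scale. This interval is an addition for illustration, not part of the original analysis, and it assumes the usual large-sample normal approximation.

```python
import math

# Counts from the OverTime x Attrition crosstab in this section
left_overtime, total_overtime = 127, 127 + 289
left_no_overtime, total_no_overtime = 110, 110 + 944

rate_overtime = left_overtime / total_overtime
rate_no_overtime = left_no_overtime / total_no_overtime
risk_ratio = rate_overtime / rate_no_overtime

# Standard error of log(RR) under the large-sample approximation
se_log_rr = math.sqrt(
    1 / left_overtime - 1 / total_overtime
    + 1 / left_no_overtime - 1 / total_no_overtime
)
ci_low = math.exp(math.log(risk_ratio) - 1.96 * se_log_rr)
ci_high = math.exp(math.log(risk_ratio) + 1.96 * se_log_rr)

print(f"RR = {risk_ratio:.2f}, approx. 95% CI ({ci_low:.2f}, {ci_high:.2f})")
```

Working from counts avoids the small rounding error that comes from dividing two already-rounded percentages. Here both routes round to 2.93.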

4. Scenario 3: Work experience and monthly income#

Research question#

Is total working experience related to monthly income?

Both variables are numerical. Since they are not normally distributed, a non-parametric correlation is appropriate:

Variable 1: TotalWorkingYears
Variable 2: MonthlyIncome
Method: Spearman rank correlation
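
As a small illustration of why a rank-based correlation suits skewed data, the sketch below uses made-up values (the arrays `years` and `pay` are hypothetical). It shows that Spearman's rho is the Pearson correlation of the ranks, and that it is less affected by one extreme value than Pearson on the raw numbers.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Toy data (hypothetical): monotonic but clearly non-linear, with one tie in x
years = np.array([1, 2, 2, 5, 8, 12, 20])
pay = np.array([10, 12, 15, 40, 90, 200, 900])

rho = spearmanr(years, pay)[0]                              # Spearman rank correlation
r_on_ranks = pearsonr(rankdata(years), rankdata(pay))[0]    # Pearson on average ranks
r_raw = pearsonr(years, pay)[0]                             # Pearson on raw values

print(round(rho, 4), round(r_on_ranks, 4), round(r_raw, 4))
```

The first two numbers agree, since both are computed on the same ranks. The raw Pearson value is lower because the largest `pay` value dominates the linear fit.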

experience = df["TotalWorkingYears"]
income = df["MonthlyIncome"]

correlation_normality = pd.DataFrame({
    "Variable": ["Total Working Years", "Monthly Income"],
    "Skewness": [experience.skew(), income.skew()],
    "Shapiro p-value": [
        shapiro(experience).pvalue,
        shapiro(income).pvalue
    ]
})

correlation_normality["Normally Distributed?"] = np.where(
    correlation_normality["Shapiro p-value"] >= alpha,
    "Yes",
    "No"
)

display(correlation_normality.round(4))

	Variable	Skewness	Shapiro p-value	Normally Distributed?
0	Total Working Years	1.1172	0.0000	No
1	Monthly Income	1.3698	0.0000	No

plt.figure(figsize=(7, 5))
plt.scatter(experience, income, alpha=0.35)

trend_values = np.polyfit(experience, income, 1)
trend_line = np.poly1d(trend_values)
ordered_experience = experience.sort_values()

plt.plot(ordered_experience, trend_line(ordered_experience))

plt.title("Monthly Income and Total Working Experience")
plt.xlabel("Total Working Years")
plt.ylabel("Monthly Income")
plt.show()

[Figure: "Monthly Income and Total Working Experience". Scatter plot of Monthly Income against Total Working Years with a fitted linear trend line.]

spearman_result = spearmanr(experience, income)

spearman_summary = pd.DataFrame({
    "Test": ["Spearman Correlation"],
    "Variable 1": ["TotalWorkingYears"],
    "Variable 2": ["MonthlyIncome"],
    "Spearman rho": [spearman_result.statistic],
    "p-value": [spearman_result.pvalue],
    "Significant?": ["Yes" if spearman_result.pvalue < alpha else "No"]
})

display(spearman_summary.round(4))

	Test	Variable 1	Variable 2	Spearman rho	p-value	Significant?
0	Spearman Correlation	TotalWorkingYears	MonthlyIncome	0.7100	0.0000	Yes

Interpretation#

There is a strong, positive and statistically significant relationship between total working experience and monthly income (Spearman ρ = 0.71, p < 0.001). Employees with more years of overall work experience generally have higher monthly income.

5. Scenario 4: Job role and monthly income#

Research question#

Does monthly income differ across job roles?

Grouping variable: JobRole, containing more than two independent groups
Outcome variable: MonthlyIncome
Monthly income is skewed, so the suitable test is Kruskal–Wallis
When the overall test is significant, Dunn post-hoc comparisons with Holm adjustment identify which role pairs differ

role_income_summary = (
    df.groupby("JobRole")["MonthlyIncome"]
    .agg(Employees="count", Median_Income="median", Mean_Income="mean")
    .sort_values("Median_Income")
)

display(role_income_summary.round(2))

	Employees	Median_Income	Mean_Income
JobRole
Sales Representative	83	2,579.0000	2,626.0000
Laboratory Technician	259	2,886.0000	3,237.1700
Research Scientist	292	2,887.5000	3,239.9700
Human Resources	52	3,093.0000	4,235.7500
Sales Executive	326	6,231.0000	6,924.2800
Manufacturing Director	145	6,447.0000	7,295.1400
Healthcare Representative	131	6,811.0000	7,528.7600
Research Director	80	16,510.0000	16,033.5500
Manager	102	17,454.5000	17,181.6800

role_order = role_income_summary.index.tolist()

income_groups = [
    df.loc[df["JobRole"] == role, "MonthlyIncome"]
    for role in role_order
]

plt.figure(figsize=(10, 6))

plt.boxplot(
    income_groups,
    tick_labels=role_order,
    vert=False
)

plt.title("Monthly Income Distribution by Job Role")
plt.xlabel("Monthly Income")
plt.ylabel("Job Role")
plt.show()

[Figure: "Monthly Income Distribution by Job Role". Horizontal box plots of Monthly Income for each job role, ordered by median income.]

kruskal_result = kruskal(*income_groups)

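# Effect size from H. Note: (H - k + 1) / (n - k) is usually called the
# H-based eta-squared; epsilon-squared is more often defined as H / (n - 1).
# With n = 1470 and k = 9 the two differ only in the third decimal place.
epsilon_squared = (
    (kruskal_result.statistic - len(role_order) + 1) /
    (len(df) - len(role_order))
)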

kruskal_summary = pd.DataFrame({
    "Test": ["Kruskal–Wallis"],
    "Number of Job Roles": [len(role_order)],
    "H Statistic": [kruskal_result.statistic],
    "p-value": [kruskal_result.pvalue],
    "Epsilon-Squared": [epsilon_squared],
    "Significant?": ["Yes" if kruskal_result.pvalue < alpha else "No"]
})

display(kruskal_summary.round(4))

	Test	Number of Job Roles	H Statistic	p-value	Epsilon-Squared	Significant?
0	Kruskal–Wallis	9	1,073.4101	0.0000	0.7292	Yes
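
Two closely related effect sizes are in common use for Kruskal–Wallis, and their names are easy to mix up. As a hedged sketch using the H, n and k values reported above: the expression in this notebook's code, (H - k + 1) / (n - k), is usually described as the H-based eta-squared, while epsilon-squared is more often written as H / (n - 1). The variable names below are illustrative.

```python
# Values reported in the Kruskal-Wallis table above
h_statistic = 1073.4101
n_total = 1470
k_groups = 9

# H-based eta-squared: the expression used in this notebook's code
eta_squared_h = (h_statistic - k_groups + 1) / (n_total - k_groups)

# Epsilon-squared as it is more commonly defined for Kruskal-Wallis
epsilon_squared_alt = h_statistic / (n_total - 1)

print(round(eta_squared_h, 4), round(epsilon_squared_alt, 4))
```

This prints 0.7292 and 0.7307. Either way, job role accounts for roughly 73% of the variability in income ranks, so the interpretation does not change.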

Dunn post-hoc test with Holm correction#

Kruskal–Wallis tells us that at least one job role differs in monthly income, but it does not tell us which roles differ. The following small helper function carries out pairwise Dunn comparisons and applies Holm correction so that we do not overstate significance after testing many pairs.

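def dunn_posthoc_holm(data, group_col, outcome_col):
    """Pairwise Dunn tests on pooled ranks with Holm-adjusted p-values.

    Note: significance is judged against the notebook-level `alpha`.
    """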
    clean_data = data[[group_col, outcome_col]].dropna().copy()
    clean_data["Rank"] = rankdata(clean_data[outcome_col])

    total_n = len(clean_data)
    groups = clean_data[group_col].unique()

    group_counts = clean_data.groupby(group_col).size()
    mean_ranks = clean_data.groupby(group_col)["Rank"].mean()

    ties = clean_data[outcome_col].value_counts()
    tie_sum = ((ties ** 3) - ties).sum()

    rank_variance = (
        total_n * (total_n + 1) / 12
        - tie_sum / (12 * (total_n - 1))
    )

    comparisons = []

    for group_1, group_2 in combinations(groups, 2):
        denominator = np.sqrt(
            rank_variance *
            (1 / group_counts[group_1] + 1 / group_counts[group_2])
        )

        z_value = (
            mean_ranks[group_1] - mean_ranks[group_2]
        ) / denominator

        p_value = 2 * norm.sf(abs(z_value))

        comparisons.append({
            "Group 1": group_1,
            "Group 2": group_2,
            "z": z_value,
            "p-value": p_value
        })

    results = pd.DataFrame(comparisons).sort_values("p-value").reset_index(drop=True)

    number_of_tests = len(results)
    adjusted_values = []

    for index, p_value in enumerate(results["p-value"]):
        adjusted_values.append(
            min((number_of_tests - index) * p_value, 1)
        )

    results["Adjusted p-value"] = np.maximum.accumulate(adjusted_values)
    results["Significant after Holm?"] = np.where(
        results["Adjusted p-value"] < alpha,
        "Yes",
        "No"
    )

    return results


dunn_results = dunn_posthoc_holm(
    data=df,
    group_col="JobRole",
    outcome_col="MonthlyIncome"
)

significant_dunn_results = dunn_results[
    dunn_results["Significant after Holm?"] == "Yes"
]

print("Number of significant pairwise differences:", len(significant_dunn_results))
display(significant_dunn_results.round(4))

Number of significant pairwise differences: 27

	Group 1	Group 2	z	p-value	Adjusted p-value	Significant after Holm?
0	Research Scientist	Manager	-20.5471	0.0000	0.0000	Yes
1	Laboratory Technician	Manager	-20.2427	0.0000	0.0000	Yes
2	Research Scientist	Research Director	-18.2765	0.0000	0.0000	Yes
3	Laboratory Technician	Research Director	-18.0552	0.0000	0.0000	Yes
4	Manager	Sales Representative	17.9813	0.0000	0.0000	Yes
5	Sales Representative	Research Director	-16.6022	0.0000	0.0000	Yes
6	Sales Executive	Research Scientist	16.1414	0.0000	0.0000	Yes
7	Sales Executive	Laboratory Technician	15.6617	0.0000	0.0000	Yes
8	Research Scientist	Healthcare Representative	-13.6251	0.0000	0.0000	Yes
9	Laboratory Technician	Healthcare Representative	-13.3926	0.0000	0.0000	Yes
10	Research Scientist	Manufacturing Director	-13.3204	0.0000	0.0000	Yes
11	Laboratory Technician	Manufacturing Director	-13.0771	0.0000	0.0000	Yes
12	Sales Executive	Sales Representative	12.9765	0.0000	0.0000	Yes
13	Healthcare Representative	Sales Representative	12.3145	0.0000	0.0000	Yes
14	Manager	Human Resources	12.0272	0.0000	0.0000	Yes
15	Manufacturing Director	Sales Representative	11.9740	0.0000	0.0000	Yes
16	Research Director	Human Resources	11.1856	0.0000	0.0000	Yes
17	Sales Executive	Manager	-9.3666	0.0000	0.0000	Yes
18	Sales Executive	Research Director	-8.0612	0.0000	0.0000	Yes
19	Manufacturing Director	Manager	-7.8153	0.0000	0.0000	Yes
20	Healthcare Representative	Manager	-7.0461	0.0000	0.0000	Yes
21	Manufacturing Director	Research Director	-6.8435	0.0000	0.0000	Yes
22	Healthcare Representative	Human Resources	6.8268	0.0000	0.0000	Yes
23	Sales Executive	Human Resources	6.6078	0.0000	0.0000	Yes
24	Manufacturing Director	Human Resources	6.4304	0.0000	0.0000	Yes
25	Healthcare Representative	Research Director	-6.1566	0.0000	0.0000	Yes
26	Sales Representative	Human Resources	-3.4417	0.0006	0.0058	Yes
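
The Holm step inside the helper can be isolated on a handful of made-up p-values to see what it does. This is a minimal sketch under that assumption: `holm_adjust` is a hypothetical standalone function written for illustration, and it returns the adjusted values in the original input order rather than sorted.

```python
import numpy as np


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    p_values = np.asarray(p_values, dtype=float)
    m = len(p_values)
    order = np.argsort(p_values)                    # smallest p-value first
    scaled = (m - np.arange(m)) * p_values[order]   # multiply by m, m-1, ..., 1
    # Running maximum keeps adjusted values non-decreasing; cap at 1
    stepped = np.minimum(np.maximum.accumulate(scaled), 1.0)
    adjusted = np.empty(m)
    adjusted[order] = stepped                       # undo the sort
    return adjusted


toy_p = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw p-values
print(holm_adjust(toy_p))
```

For these four values the adjusted p-values are 0.03, 0.06, 0.06 and 0.02. The raw 0.04 would be multiplied by 1, but the running maximum lifts it to 0.06 so that a larger raw p-value never ends up with a smaller adjusted one.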

Interpretation#

Monthly income differs significantly across job roles. The median-income table makes the pattern clear: the lowest-paid roles are Sales Representative, Laboratory Technician, Research Scientist and Human Resources (medians of roughly 2,600 to 3,100), while Manager and Research Director have the highest median monthly income (above 16,000). Dunn’s post-hoc results show that 27 of the 36 possible role pairs remain significantly different after Holm correction for multiple comparisons.

6. Final findings summary#

final_summary = pd.DataFrame({
    "Business Question": [
        "Does company tenure differ by attrition?",
        "Is overtime associated with attrition?",
        "Is work experience related to income?",
        "Does income differ across job roles?"
    ],
    "Method": [
        "Mann–Whitney U",
        "Chi-square",
        "Spearman correlation",
        "Kruskal–Wallis + Dunn"
    ],
    "Main Statistical Result": [
        f"U = {mann_whitney_result.statistic:.0f}, p < 0.001, r = {rank_biserial:.3f}",
        f"χ² = {chi_square:.2f}, p < 0.001, V = {cramers_v:.3f}",
        f"ρ = {spearman_result.statistic:.3f}, p < 0.001",
        f"H = {kruskal_result.statistic:.2f}, p < 0.001, ε² = {epsilon_squared:.3f}"
    ],
    "Meaning": [
        "Employees who left had shorter median tenure: 3 vs 6 years.",
        "Overtime employees had 30.5% attrition compared with 10.4% without overtime.",
        "Greater overall experience is strongly related to higher monthly income.",
        "Monthly income differs substantially across job roles."
    ]
})

display(final_summary)

	Business Question	Method	Main Statistical Result	Meaning
0	Does company tenure differ by attrition?	Mann–Whitney U	U = 102582, p < 0.001, r = -0.298	Employees who left had shorter median tenure: ...
1	Is overtime associated with attrition?	Chi-square	χ² = 87.56, p < 0.001, V = 0.244	Overtime employees had 30.5% attrition compare...
2	Is work experience related to income?	Spearman correlation	ρ = 0.710, p < 0.001	Greater overall experience is strongly related...
3	Does income differ across job roles?	Kruskal–Wallis + Dunn	H = 1073.41, p < 0.001, ε² = 0.729	Monthly income differs substantially across jo...

Conclusion#

This analysis identifies four meaningful patterns in the employee data:

Employees who left tended to have shorter company tenure than those who stayed.
Overtime was significantly associated with attrition (Cramér’s V ≈ 0.24, a small-to-moderate effect), with overtime employees leaving at nearly three times the rate of non-overtime employees.
Total working experience was strongly positively related to monthly income.
Monthly income varied substantially across job roles.

The findings should be treated as associations rather than proof of cause. However, they point toward practical areas for further HR investigation, particularly early-tenure retention and overtime workload.
