Human Activity Recognition from Smartphone Sensors

Human Activity Recognition from Smartphone Sensors#

PCA, KMeans Clustering and Supervised Classification#

Goal#

This notebook studies whether smartphone sensor features reveal human activities in two ways:

Unsupervised exploration: PCA and KMeans discover activity-like groups without using the activity labels while building clusters.
Supervised prediction: Logistic Regression, Random Forest and XGBoost predict activity labels for volunteers in the official unseen test set.

The dataset contains motion windows from 30 volunteers performing six activities. Each row contains 561 prepared sensor features. The official train/test split is preserved: model tuning happens only with the training volunteers, and the test set is checked only at the final stage.

Source: UCI Human Activity Recognition Using Smartphones

# Run only if needed:
# !pip install umap-learn xgboost -q

1. Load the data#

from pathlib import Path
import pandas as pd

train = pd.read_csv("../data/sensor_train.csv")
test = pd.read_csv("../data/sensor_test.csv")

feature_columns = [c for c in train.columns if c not in ["subject", "Activity"]]

summary = pd.DataFrame({
    "Set": ["Train", "Test"],
    "Rows": [len(train), len(test)],
    "Sensor features": [len(feature_columns), len(feature_columns)],
    "Volunteers": [train["subject"].nunique(), test["subject"].nunique()],
    "Activities": [train["Activity"].nunique(), test["Activity"].nunique()],
    "Missing values": [train.isna().sum().sum(), test.isna().sum().sum()]
})

summary

	Set	Rows	Sensor features	Volunteers	Activities	Missing values
0	Train	7352	561	21	6	0
1	Test	2947	561	9	6	0

train["Activity"].value_counts().rename_axis("Activity").to_frame("Train rows")

	Train rows
Activity
LAYING	1407
STANDING	1374
SITTING	1286
WALKING	1226
WALKING_UPSTAIRS	1073
WALKING_DOWNSTAIRS	986

X_train = train[feature_columns]
y_train = train["Activity"]
groups_train = train["subject"]

X_test = test[feature_columns]
y_test = test["Activity"]

X_train.shape, X_test.shape

((7352, 561), (2947, 561))

The Activity label is used only after clustering for evaluation, and later as the target for classification. The subject column identifies volunteers and is used to make cross-validation group-aware.

2. PCA dimensionality reduction#

import numpy as np
from sklearn.decomposition import PCA

pca_full = PCA(random_state=42)
pca_full.fit(X_train)

cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

pca_report = pd.DataFrame({
    "PCA components": [2, 5, 10, 11, 36, 50, 69],
    "Variance retained (%)": [
        cumulative_variance[1] * 100,
        cumulative_variance[4] * 100,
        cumulative_variance[9] * 100,
        cumulative_variance[10] * 100,
        cumulative_variance[35] * 100,
        cumulative_variance[49] * 100,
        cumulative_variance[68] * 100
    ]
}).round(2)

pca_report

	PCA components	Variance retained (%)
0	2	67.47
1	5	75.16
2	10	80.50
3	11	81.27
4	36	90.55
5	50	93.09
6	69	95.23

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(9, 5))
ax.plot(range(1, 81), cumulative_variance[:80] * 100, marker="o", markersize=3)
ax.axhline(90, linestyle="--", linewidth=1, label="90%")
ax.axhline(95, linestyle="--", linewidth=1, label="95%")
ax.axvline(50, linestyle=":", linewidth=1, label="50 PCs")
ax.axvline(69, linestyle=":", linewidth=1, label="69 PCs")
ax.set_title("PCA Cumulative Explained Variance")
ax.set_xlabel("Number of PCA components")
ax.set_ylabel("Variance retained (%)")
ax.legend()
ax.grid(False)
plt.tight_layout()
plt.show()

../_images/f50fc8806a2fac79573c36dc6b4af6fcbaedd7135d3ebc203c280115444f9692.png

We report PCA with 36, 50 and 69 components for clustering. For the supervised models, we use only PCA 69, because it preserves approximately 95% of training variation while still reducing the original 561 features heavily.

3. KMeans clustering after PCA#

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score, normalized_mutual_info_score

clustering_results = []
cluster_labels = {}
pca_spaces = {}

for n_components in [36, 50, 69]:
    pca = PCA(n_components=n_components, random_state=42)
    X_pca = pca.fit_transform(X_train)

    kmeans = KMeans(n_clusters=6, n_init=20, random_state=42)
    clusters = kmeans.fit_predict(X_pca)

    clustering_results.append({
        "PCA components": n_components,
        "Variance retained (%)": pca.explained_variance_ratio_.sum() * 100,
        "Silhouette": silhouette_score(X_pca, clusters, sample_size=3000, random_state=42),
        "ARI": adjusted_rand_score(y_train, clusters),
        "NMI": normalized_mutual_info_score(y_train, clusters)
    })

    cluster_labels[n_components] = clusters
    pca_spaces[n_components] = X_pca

clustering_results = pd.DataFrame(clustering_results).round(3)
clustering_results

	PCA components	Variance retained (%)	Silhouette	ARI	NMI
0	36	90.548	0.182	0.455	0.584
1	50	93.086	0.170	0.456	0.585
2	69	95.226	0.160	0.455	0.584

best_pca_for_clustering = int(
    clustering_results.sort_values("Silhouette", ascending=False).iloc[0]["PCA components"]
)

X_cluster_pca = pca_spaces[best_pca_for_clustering]
best_clusters = cluster_labels[best_pca_for_clustering]

best_pca_for_clustering

cluster_activity = pd.crosstab(
    pd.Series(best_clusters, name="KMeans cluster"),
    y_train.reset_index(drop=True).rename("Known activity"),
    normalize="index"
).mul(100).round(1)

cluster_activity

Known activity	LAYING	SITTING	STANDING	WALKING	WALKING_DOWNSTAIRS	WALKING_UPSTAIRS
KMeans cluster
0	0.7	0.1	0.0	37.6	8.7	53.0
1	0.0	49.2	50.8	0.0	0.0	0.0
2	0.0	0.0	0.0	40.1	44.5	15.4
3	16.8	35.5	47.6	0.0	0.0	0.0
4	96.1	3.9	0.0	0.0	0.0	0.0
5	0.0	0.0	0.0	23.8	64.3	11.9

fig, ax = plt.subplots(figsize=(10, 5))
image = ax.imshow(cluster_activity.values, aspect="auto")
ax.set_title(f"Cluster Composition by Known Activity — PCA {best_pca_for_clustering}")
ax.set_xlabel("Known activity")
ax.set_ylabel("KMeans cluster")
ax.set_xticks(range(len(cluster_activity.columns)))
ax.set_xticklabels(cluster_activity.columns, rotation=45, ha="right")
ax.set_yticks(range(len(cluster_activity.index)))
ax.set_yticklabels(cluster_activity.index)
ax.grid(False)
plt.colorbar(image, ax=ax, label="Percentage within cluster")
plt.tight_layout()
plt.show()

../_images/16207ffad2ef0463f449b49dfbfce31a3e309b7ca8e41d51461557ee9686b282.png

Silhouette measures natural cluster separation without using activity labels. ARI and NMI check afterward how closely the discovered clusters align with the real six activities.

4. PCA, t-SNE and UMAP visualisation#

def plot_embedding(embedding, labels, title, x_label, y_label, legend_title):
    fig, ax = plt.subplots(figsize=(10, 7))

    for label in sorted(pd.Series(labels).unique()):
        mask = pd.Series(labels).to_numpy() == label
        ax.scatter(embedding[mask, 0], embedding[mask, 1], s=18, alpha=0.78, label=label)

    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    ax.grid(False)
    ax.legend(title=legend_title, bbox_to_anchor=(1.02, 1), loc="upper left", fontsize=8)
    plt.tight_layout()
    plt.show()


visual_sample = (
    train.groupby("Activity", group_keys=False)
    .sample(n=50, random_state=42)
    .sample(frac=1, random_state=42)
)

visual_index = visual_sample.index
visual_activities = visual_sample["Activity"]
X_visual = X_cluster_pca[visual_index]
visual_clusters = best_clusters[visual_index]

plot_embedding(
    X_visual[:, :2],
    visual_activities,
    f"PCA Projection by Known Activity — PCA {best_pca_for_clustering}",
    "Principal Component 1",
    "Principal Component 2",
    "Activity"
)

../_images/7188a66590b7981a9c1483823e9a75da047b03690064e8fd0e8a3535287c3e99.png

from sklearn.manifold import TSNE

tsne = TSNE(
    n_components=2,
    perplexity=30,
    init="pca",
    learning_rate="auto",
    max_iter=500,
    random_state=42
)

X_tsne = tsne.fit_transform(X_visual)

plot_embedding(X_tsne, visual_activities, "t-SNE by Known Activity", "t-SNE 1", "t-SNE 2", "Activity")
plot_embedding(X_tsne, visual_clusters, "t-SNE by KMeans Cluster", "t-SNE 1", "t-SNE 2", "Cluster")

../_images/f5e0a32e5e03ee9687adcfa96b7cd23e9db77de48efca9efcc945681735600ec.png

../_images/a6c5c02c9f88900fe213e68f6b63d676ecea4093e628f7187f7b08f8cf04f336.png

!pip install umap-learn

Requirement already satisfied: umap-learn in C:\Users\zirad\AppData\Local\Programs\Python\Python312\Lib\site-packages (0.5.12)
Requirement already satisfied: numpy>=1.23 in C:\Users\zirad\AppData\Local\Programs\Python\Python312\Lib\site-packages (from umap-learn) (2.0.2)
Requirement already satisfied: scipy>=1.3.1 in C:\Users\zirad\AppData\Local\Programs\Python\Python312\Lib\site-packages (from umap-learn) (1.16.1)
Requirement already satisfied: scikit-learn>=1.6 in C:\Users\zirad\AppData\Local\Programs\Python\Python312\Lib\site-packages (from umap-learn) (1.7.1)
Requirement already satisfied: numba>=0.51.2 in C:\Users\zirad\AppData\Local\Programs\Python\Python312\Lib\site-packages (from umap-learn) (0.60.0)
Requirement already satisfied: pynndescent>=0.5 in C:\Users\zirad\AppData\Local\Programs\Python\Python312\Lib\site-packages (from umap-learn) (0.6.0)
Requirement already satisfied: tqdm in C:\Users\zirad\AppData\Local\Programs\Python\Python312\Lib\site-packages (from umap-learn) (4.66.1)
Requirement already satisfied: llvmlite<0.44,>=0.43.0dev0 in C:\Users\zirad\AppData\Local\Programs\Python\Python312\Lib\site-packages (from numba>=0.51.2->umap-learn) (0.43.0)
Requirement already satisfied: joblib>=0.11 in C:\Users\zirad\AppData\Local\Programs\Python\Python312\Lib\site-packages (from pynndescent>=0.5->umap-learn) (1.5.1)
Requirement already satisfied: threadpoolctl>=3.1.0 in C:\Users\zirad\AppData\Local\Programs\Python\Python312\Lib\site-packages (from scikit-learn>=1.6->umap-learn) (3.6.0)
Requirement already satisfied: colorama in C:\Users\zirad\AppData\Local\Programs\Python\Python312\Lib\site-packages (from tqdm->umap-learn) (0.4.6)

WARNING: Ignoring invalid distribution ~tatsmodels (C:\Users\zirad\AppData\Local\Programs\Python\Python312\Lib\site-packages)
WARNING: Ignoring invalid distribution ~tatsmodels (C:\Users\zirad\AppData\Local\Programs\Python\Python312\Lib\site-packages)
WARNING: Ignoring invalid distribution ~tatsmodels (C:\Users\zirad\AppData\Local\Programs\Python\Python312\Lib\site-packages)

[notice] A new release of pip is available: 26.0.1 -> 26.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip

# UMAP can take longer on its first run while the library compiles in a fresh environment.
RUN_UMAP = True

if RUN_UMAP:
    import umap.umap_ as umap

    umap_model = umap.UMAP(
        n_neighbors=15,
        min_dist=0.20,
        n_epochs=100,
        random_state=42
    )
    X_umap = umap_model.fit_transform(X_visual)

    plot_embedding(X_umap, visual_activities, "UMAP by Known Activity", "UMAP 1", "UMAP 2", "Activity")
    plot_embedding(X_umap, visual_clusters, "UMAP by KMeans Cluster", "UMAP 1", "UMAP 2", "Cluster")
else:
    print("UMAP code is ready. Change RUN_UMAP to True when you want to generate the UMAP plots.")

c:\Users\zirad\AppData\Local\Programs\Python\Python312\Lib\site-packages\umap\umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(

../_images/4548354a2e951e3fed4843a150d863141b29f3cdb983b9d0aedf2c433ffda61a.png

../_images/915d2db88010f64f8e78ee68cc0ad472dd356f3873d1955e9b848521d342f0cd.png

KMeans is trained on PCA features. t-SNE and UMAP are used only to display the activity structure and discovered clusters in two dimensions.

5. Supervised classification with PCA 69#

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y_train_encoded = encoder.fit_transform(y_train)
y_test_encoded = encoder.transform(y_test)

encoder.classes_

array(['LAYING', 'SITTING', 'STANDING', 'WALKING', 'WALKING_DOWNSTAIRS',
       'WALKING_UPSTAIRS'], dtype=object)

from joblib import Memory
from sklearn.model_selection import GroupKFold, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, f1_score, classification_report

cv = GroupKFold(n_splits=3)
cache = Memory("har_pipeline_cache", verbose=0)

results = []
searches = {}

PCA is inside each pipeline, so it is fitted only within the training portion of every cross-validation fold. GroupKFold keeps the recordings of one volunteer together during tuning.

Logistic Regression#

from sklearn.linear_model import LogisticRegression

logistic = Pipeline([
    ("pca", PCA(n_components=69, random_state=42)),
    ("model", LogisticRegression(max_iter=2000))
], memory=cache)

logistic_search = RandomizedSearchCV(
    logistic,
    {"model__C": [0.001, 0.01, 0.1, 1, 10, 100]},
    n_iter=3,
    scoring="f1_macro",
    cv=cv,
    random_state=42,
    n_jobs=1,
    refit=True
)

logistic_search.fit(X_train, y_train_encoded, groups=groups_train)
logistic_pred = logistic_search.predict(X_test)

results.append({
    "Model": "Logistic Regression",
    "CV Macro F1": logistic_search.best_score_,
    "Test Accuracy": accuracy_score(y_test_encoded, logistic_pred),
    "Test Macro F1": f1_score(y_test_encoded, logistic_pred, average="macro")
})

searches["Logistic Regression"] = logistic_search
logistic_search.best_params_

{'model__C': 100}

Random Forest#

from sklearn.ensemble import RandomForestClassifier

forest = Pipeline([
    ("pca", PCA(n_components=69, random_state=42)),
    ("model", RandomForestClassifier(random_state=42, n_jobs=1))
], memory=cache)

forest_search = RandomizedSearchCV(
    forest,
    {
        "model__n_estimators": [50, 80, 120],
        "model__max_depth": [None, 15, 25],
        "model__min_samples_leaf": [1, 2, 4],
        "model__max_features": ["sqrt", "log2"]
    },
    n_iter=2,
    scoring="f1_macro",
    cv=cv,
    random_state=42,
    n_jobs=1,
    refit=True
)

forest_search.fit(X_train, y_train_encoded, groups=groups_train)
forest_pred = forest_search.predict(X_test)

results.append({
    "Model": "Random Forest",
    "CV Macro F1": forest_search.best_score_,
    "Test Accuracy": accuracy_score(y_test_encoded, forest_pred),
    "Test Macro F1": f1_score(y_test_encoded, forest_pred, average="macro")
})

searches["Random Forest"] = forest_search
forest_search.best_params_

{'model__n_estimators': 80,
 'model__min_samples_leaf': 2,
 'model__max_features': 'log2',
 'model__max_depth': 25}

XGBoost#

from xgboost import XGBClassifier

xgb = Pipeline([
    ("pca", PCA(n_components=69, random_state=42)),
    ("model", XGBClassifier(
        objective="multi:softprob",
        num_class=6,
        eval_metric="mlogloss",
        tree_method="hist",
        max_bin=128,
        random_state=42,
        n_jobs=1
    ))
], memory=cache)

xgb_search = RandomizedSearchCV(
    xgb,
    {
        "model__n_estimators": [30, 50, 80],
        "model__max_depth": [3, 5],
        "model__learning_rate": [0.05, 0.10, 0.20],
        "model__subsample": [0.8, 1.0],
        "model__colsample_bytree": [0.8, 1.0]
    },
    n_iter=2,
    scoring="f1_macro",
    cv=cv,
    random_state=42,
    n_jobs=1,
    refit=True
)

xgb_search.fit(X_train, y_train_encoded, groups=groups_train)
xgb_pred = xgb_search.predict(X_test)

results.append({
    "Model": "XGBoost",
    "CV Macro F1": xgb_search.best_score_,
    "Test Accuracy": accuracy_score(y_test_encoded, xgb_pred),
    "Test Macro F1": f1_score(y_test_encoded, xgb_pred, average="macro")
})

searches["XGBoost"] = xgb_search
xgb_search.best_params_

{'model__subsample': 0.8,
 'model__n_estimators': 50,
 'model__max_depth': 3,
 'model__learning_rate': 0.2,
 'model__colsample_bytree': 1.0}

6. Final test-set results#

results_table = (
    pd.DataFrame(results)
    .sort_values("Test Macro F1", ascending=False)
    .reset_index(drop=True)
)

results_table.round(4)

	Model	CV Macro F1	Test Accuracy	Test Macro F1
0	Logistic Regression	0.9002	0.9464	0.9455
1	Random Forest	0.8664	0.9084	0.9062
2	XGBoost	0.8671	0.9036	0.9014

winner = results_table.loc[0, "Model"]
final_model = searches[winner]
final_prediction = final_model.predict(X_test)

winner

'Logistic Regression'

final_report = classification_report(
    y_test_encoded,
    final_prediction,
    target_names=encoder.classes_,
    output_dict=True
)

pd.DataFrame(final_report).transpose().round(3)

	precision	recall	f1-score	support
LAYING	0.998	1.000	0.999	537.000
SITTING	0.941	0.904	0.922	491.000
STANDING	0.921	0.947	0.934	532.000
WALKING	0.909	0.986	0.946	496.000
WALKING_DOWNSTAIRS	0.961	0.950	0.956	420.000
WALKING_UPSTAIRS	0.952	0.883	0.916	471.000
accuracy	0.946	0.946	0.946	0.946
macro avg	0.947	0.945	0.946	2947.000
weighted avg	0.947	0.946	0.946	2947.000

from sklearn.metrics import ConfusionMatrixDisplay

fig, ax = plt.subplots(figsize=(9, 7))

ConfusionMatrixDisplay.from_predictions(
    y_test_encoded,
    final_prediction,
    display_labels=encoder.classes_,
    xticks_rotation=45,
    cmap="Blues",
    ax=ax
)

ax.set_title(f"Confusion Matrix — {winner}")
ax.grid(False)
plt.tight_layout()
plt.show()

../_images/5089d93c2b280179795f57f1e002b3f4c2615a995616eff56dfbb0d4281fbed4.png

Conclusion#

This notebook connects unsupervised and supervised learning in one activity-recognition workflow:

PCA reduces 561 sensor features to compact representations.
KMeans investigates whether activity patterns can be discovered without target labels.
Silhouette, ARI and NMI evaluate the clustering.
t-SNE visualises known activities and discovered groups; a ready-to-run UMAP cell is included for an additional nonlinear visual.
PCA with 69 components is used consistently for classification.
Logistic Regression, Random Forest and XGBoost are tuned with group-aware cross-validation.
The official test volunteers are used only for the final evaluation.