Imagine you’re trying to describe a person using 100 different measurements: height, weight, shoe size, eye color, hair length, favorite food, number of siblings, and 93 more characteristics. While comprehensive, this approach has a problem: too many dimensions make patterns hard to see!
This is the curse of dimensionality - a fundamental challenge in machine learning where having too many features actually makes it harder to find meaningful patterns.
💡 How Dimensionality Affects Analysis:
Low dimensions (2-3): Easy to visualize and understand relationships
Medium dimensions (4-10): Challenging but manageable with tools
High dimensions (10-100): Nearly impossible to visualize, patterns become hidden
Very high dimensions (100+): Data becomes sparse and distances lose meaning (the short sketch below demonstrates this)
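To see why distances "lose meaning", here is a small illustrative sketch (not from the original post): it compares the nearest and farthest neighbors of a random query point as the number of dimensions grows. When the ratio approaches 1, "near" and "far" are barely distinguishable.

```python
import numpy as np

rng = np.random.default_rng(0)

for dim in (2, 10, 100, 1000):
    points = rng.random((1000, dim))   # 1000 random points in the unit cube
    query = rng.random(dim)            # a random query point
    dists = np.linalg.norm(points - query, axis=1)
    # A ratio close to 1 means the nearest and farthest points are almost equally far away
    print(f"{dim:>4}D  nearest/farthest distance ratio: {dists.min() / dists.max():.3f}")
```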
🧩 What is Dimensionality Reduction?
Dimensionality reduction is the process of reducing the number of features (dimensions) in your dataset while preserving the most important information. Think of it as compressing your data without losing the essence.
🌟 Principal Component Analysis (PCA): The Magic Wand
PCA is the most popular and powerful dimensionality reduction technique. It works by finding the directions of maximum variance in your data and projecting the data onto these new directions.
The Intuition Behind PCA
Imagine you’re looking at a 3D cloud of points floating in space. PCA finds the best “viewing angles” that show you the most variation in your data:
First Principal Component (PC1): The direction where your data varies the most
Second Principal Component (PC2): The direction of second-most variation (perpendicular to PC1)
And so on, with each new component perpendicular to all the ones before it. The small sketch below shows PC1 and PC2 recovered on a toy 2D cloud.
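To make the intuition concrete, here is a small illustrative sketch (an assumption for this post, not library-specific code): it builds a correlated 2D point cloud and recovers the principal directions from the eigenvectors of the covariance matrix, which is exactly what PCA does under the hood.

```python
import numpy as np

rng = np.random.default_rng(42)

# A stretched 2D point cloud with strongly correlated features
x = rng.normal(0, 3, 500)
y = 0.5 * x + rng.normal(0, 1, 500)
X = np.column_stack([x, y])

# PCA "by hand": the eigenvectors of the covariance matrix are the principal directions
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order, so reverse to get PC1 first
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

print("PC1 direction (most variance):", eigenvectors[:, 0].round(3))
print("PC2 direction (perpendicular):", eigenvectors[:, 1].round(3))
print("Variance explained:", (eigenvalues / eigenvalues.sum()).round(3))
```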
🔍 When to Use PCA?
✅ Perfect Use Cases:
High-dimensional data (10+ features)
Correlated features (redundant information)
Data visualization needs
Feature engineering for other algorithms
Noise reduction in datasets
⚠️ When to Be Cautious:
Categorical data (PCA works best with numerical data)
Non-linear relationships (PCA assumes linear relationships)
Interpretability is crucial (PCA components are abstract)
Small datasets (estimated components can be noisy and unstable)
🚀 PCA with scikit-learn: Super Simple Example
Let’s see PCA in action with a simple, real-world example that you can run right now!
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Create sample data: 100 people with 4 correlated features
np.random.seed(42)

# Generate correlated features
n_samples = 100
age = np.random.normal(30, 10, n_samples)
income = age * 1000 + np.random.normal(0, 2000, n_samples)   # Correlated with age
experience = age - 18 + np.random.normal(0, 2, n_samples)    # Correlated with age
education_years = np.random.normal(16, 2, n_samples)         # Less correlated

# Create DataFrame
data = pd.DataFrame({
    'Age': age,
    'Income': income,
    'Experience': experience,
    'Education_Years': education_years
})

print("📊 Original Data Shape:", data.shape)
print("\n🔍 First few rows:")
print(data.head())
print("\n📈 Correlation Matrix:")
print(data.corr())
```
```
📊 Original Data Shape: (100, 4)

🔍 First few rows:
         Age        Income  Experience  Education_Years
0  34.967142  32136.400046   17.682716        14.342010
1  28.617357  27776.066343   11.738926        14.879638
2  36.476885  35791.456348   20.642988        17.494587
3  45.230299  43625.744026   29.337903        17.220741
4  27.658466  27335.894829    6.903128        15.958197

📈 Correlation Matrix:
                      Age    Income  Experience  Education_Years
Age              1.000000  0.977821    0.975781        -0.170227
Income           0.977821  1.000000    0.953640        -0.175085
Experience       0.975781  0.953640    1.000000        -0.158922
Education_Years -0.170227 -0.175085   -0.158922         1.000000
```
```python
# Visualize correlations
plt.figure(figsize=(12, 4))

# Correlation heatmap
plt.subplot(1, 2, 1)
plt.imshow(data.corr(), cmap='coolwarm', aspect='auto')
plt.colorbar()
plt.xticks(range(len(data.columns)), data.columns, rotation=45)
plt.yticks(range(len(data.columns)), data.columns)
plt.title('Feature Correlations')

# Scatter plot of Age vs Income
plt.subplot(1, 2, 2)
plt.scatter(data['Age'], data['Income'], alpha=0.6)
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Age vs Income (Correlated Features)')

plt.tight_layout()
plt.show()
```
```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np

# Step 1: Standardize the data (crucial for PCA!)
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Step 2: Apply PCA
pca = PCA()
data_pca = pca.fit_transform(data_scaled)

# Step 3: Analyze results
print("🧩 PCA Results:")
print("=" * 50)

# Explained variance for each component
print("Explained variance by component:")
for i, var_ratio in enumerate(pca.explained_variance_ratio_):
    print(f"  PC{i+1}: {var_ratio:.3f} ({var_ratio*100:.1f}%)")

# Cumulative explained variance
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
print(f"\nCumulative explained variance:")
for i, cum_var in enumerate(cumulative_variance):
    print(f"  PC1 to PC{i+1}: {cum_var:.3f} ({cum_var*100:.1f}%)")

# Feature loadings (how much each original feature contributes to each PC)
print(f"\nFeature loadings (how original features contribute to PCs):")
loadings_df = pd.DataFrame(
    pca.components_.T,
    columns=[f'PC{i+1}' for i in range(pca.components_.shape[0])],
    index=data.columns
)
print(loadings_df.round(3))
```
```
🧩 PCA Results:
==================================================
Explained variance by component:
  PC1: 0.745 (74.5%)
  PC2: 0.239 (23.9%)
  PC3: 0.012 (1.2%)
  PC4: 0.004 (0.4%)

Cumulative explained variance:
  PC1 to PC1: 0.745 (74.5%)
  PC1 to PC2: 0.985 (98.5%)
  PC1 to PC3: 0.996 (99.6%)
  PC1 to PC4: 1.000 (100.0%)

Feature loadings (how original features contribute to PCs):
                   PC1    PC2    PC3    PC4
Age              0.574  0.083 -0.031  0.814
Income           0.570  0.076 -0.692 -0.437
Experience       0.569  0.093  0.721 -0.384
Education_Years -0.145  0.989 -0.012  0.001
```
print("🔍 INTERPRETING PCA RESULTS:")print("="*50)# How many components to keep?print("1️⃣ How many components should we keep?")print(f" • PC1 + PC2 capture {cumulative_variance[1]*100:.1f}% of variance")print(f" • PC1 + PC2 + PC3 capture {cumulative_variance[2]*100:.1f}% of variance")ifcumulative_variance[1]>=0.8:print(" ✅ We can reduce from 4D to 2D and keep 80%+ of information!")elifcumulative_variance[2]>=0.9:print(" ✅ We can reduce from 4D to 3D and keep 90%+ of information!")else:print(" ⚠️ Need more components to preserve sufficient information")# What does each component represent?print(f"\n2️⃣ What does each component represent?")print(" PC1 (strongest patterns):")forfeature,loadinginzip(data.columns,loadings_df['PC1']):print(f" • {feature}: {loading:.3f}")print(" PC2 (secondary patterns):")forfeature,loadinginzip(data.columns,loadings_df['PC2']):print(f" • {feature}: {loading:.3f}")# Practical applicationprint(f"\n3️⃣ Practical application:")print(" • Instead of tracking 4 separate features, we can use 2-3 PCs")print(" • This reduces complexity while preserving most information")print(" • Great for visualization and further analysis!")
🔍 INTERPRETING PCA RESULTS:
==================================================
1️⃣ How many components should we keep?
• PC1 + PC2 capture 98.5% of variance
• PC1 + PC2 + PC3 capture 99.6% of variance
✅ We can reduce from 4D to 2D and keep 80%+ of information!
2️⃣ What does each component represent?
PC1 (strongest patterns):
• Age: 0.574
• Income: 0.570
• Experience: 0.569
• Education_Years: -0.145
PC2 (secondary patterns):
• Age: 0.083
• Income: 0.076
• Experience: 0.093
• Education_Years: 0.989
3️⃣ Practical application:
• Instead of tracking 4 separate features, we can use 2-3 PCs
• This reduces complexity while preserving most information
• Great for visualization and further analysis!
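To close the loop on the example, here is a short follow-up sketch (it reuses `data_scaled` from above; the names `pca_2d` and `data_2d` are introduced here for illustration): it keeps only the first two components and plots the compressed data.

```python
# Keep only the first two principal components
pca_2d = PCA(n_components=2)
data_2d = pca_2d.fit_transform(data_scaled)

print("Reduced data shape:", data_2d.shape)  # (100, 2) instead of (100, 4)

# Visualize the 4D dataset in 2D
plt.figure(figsize=(6, 5))
plt.scatter(data_2d[:, 0], data_2d[:, 1], alpha=0.6)
plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]*100:.1f}% variance)')
plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]*100:.1f}% variance)')
plt.title('100 people: 4 features compressed to 2 principal components')
plt.tight_layout()
plt.show()
```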
🎯 Key Concepts You Just Learned
🤔 Feature Importance vs PCA: Different Perspectives
Feature Importance: “Which original features matter most for prediction?”
PCA Components: “What combinations of features capture the most variation?”
Example: Feature importance might say “Age is most important,” while PCA might create a component that combines age, income, and experience into a “life stage” dimension.
📚 PCA Terminology Cheat Sheet
Principal Component (PC): A new feature that’s a combination of original features
Explained Variance: How much information each PC captures
Loadings: How much each original feature contributes to each PC
Scree Plot: Visual tool to decide how many PCs to keep
📏 How Many Components Should You Keep?
Keep 80%+ variance: for visualization and exploration (see the snippet after this list)
Keep 2-3 components: for easy plotting and interpretation
Use the elbow method: look for the "knee" in the scree plot
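In scikit-learn you can apply the 80% rule directly: passing a float to `PCA(n_components=...)` keeps just enough components to explain that fraction of variance. A minimal sketch, reusing the `data_scaled` array from the example above:

```python
# Keep however many components are needed to explain at least 80% of the variance
pca_80 = PCA(n_components=0.80)
data_reduced = pca_80.fit_transform(data_scaled)

print("Components kept:", pca_80.n_components_)
print("Total variance explained:", round(pca_80.explained_variance_ratio_.sum(), 3))
```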
⚠️ Common Pitfalls and How to Avoid Them
1. Forgetting to Standardize
```python
# ❌ Wrong - features on different scales
pca = PCA()
X_pca = pca.fit_transform(X)  # Age (0-100) vs Income (0-1,000,000)

# ✅ Correct - standardize first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_pca = pca.fit_transform(X_scaled)
```
2. Ignoring Non-Linear Relationships
```python
# ❌ Wrong - PCA assumes linear relationships
# If your data has curved patterns, consider Kernel PCA

# ✅ Correct - check data first
plt.scatter(X[:, 0], X[:, 1])
plt.title('Check for non-linear patterns')
plt.show()
```
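If you do spot curved structure, scikit-learn's `KernelPCA` is a common alternative. A minimal illustrative sketch on a synthetic non-linear dataset (the `gamma` value and the `make_moons` data are assumptions chosen only for demonstration):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA

# A classic non-linear dataset that plain PCA cannot untangle
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=42)

# Kernel PCA with an RBF kernel captures the curved structure
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15)
X_kpca = kpca.fit_transform(X_moons)

plt.scatter(X_kpca[:, 0], X_kpca[:, 1], c=y_moons, cmap='coolwarm', alpha=0.7)
plt.title('Kernel PCA (RBF) on a non-linear dataset')
plt.show()
```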
🚀 Now let’s combine everything into a complete analytical strategy!
Q10. 🎯 Are there any multi-feature interactions that reveal hidden Rocket operatives?
📈 Graph: Pairplot (seaborn) of selected numerical variables.
🧩 Test: PCA for dimensionality reduction.
📊 Real-World Dimensionality Analysis
In the next section, you’ll apply PCA to simplify complex Pokemon data:
Feature Reduction: “Can we reduce 15 Pokemon stats to 3-4 meaningful components?”
You’ll apply PCA and interpret explained variance
Component Interpretation: “What does ‘Component 1’ represent in Pokemon terms?”
You’ll analyze loadings to understand what each component captures
Visualization: “How do Pokemon cluster when plotted using PCA components?”
You’ll create 2D/3D plots to reveal hidden patterns
💡 The Connection: PCA will help you visualize high-dimensional Pokemon data and discover underlying structure!
In this section, we’ll create a comprehensive visualization to explore how multiple features interact with each other and how these interactions differ between Team Rocket members and regular trainers. We’ll use a pairplot to examine all possible pairwise relationships between our key numerical features, which will help us identify patterns that might not be visible when looking at features individually.
```python
# Select key numerical features for multi-feature analysis
numerical_features = ['Age', 'Average Pokemon Level', 'Win Ratio',
                      'Number of Migrations', 'Debt to Kanto', 'Number of Gym Badges']

# Create a subset of data with these features plus the target
analysis_df = pokemon_df[numerical_features + ['Team Rocket']].copy()

# Pairplot to visualize relationships between features
plt.figure(figsize=(16, 12))
pair_plot = sns.pairplot(
    analysis_df,
    hue='Team Rocket',
    palette={'No': '#2E8B57', 'Yes': '#DC143C'},  # Sea green vs Crimson
    plot_kws={'alpha': 0.6, 's': 30},
    diag_kind='hist'
)
pair_plot.fig.suptitle('Multi-Feature Interactions: Team Rocket vs Non-Team Rocket',
                       fontsize=16, y=1.02)

# Adjust legend
pair_plot.add_legend(title='Team Rocket Member', bbox_to_anchor=(1.05, 0.8))
plt.tight_layout()
plt.show()
```
Q10.2 Correlation matrix analysis
Here we’ll quantify the relationships we observed in the pairplot by calculating correlation coefficients between all feature pairs. This numerical analysis will help us identify which features are most strongly related to each other and reveal any hidden dependencies that could be important for understanding Team Rocket behavior patterns.
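A minimal sketch of that calculation, assuming the `analysis_df` and `numerical_features` defined above (the annotated seaborn heatmap is one reasonable presentation, not necessarily the original code):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Compute pairwise Pearson correlations between the numerical features
corr_matrix = analysis_df[numerical_features].corr()
print(corr_matrix.round(3))

# Visualize the correlation matrix as an annotated heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm',
            vmin=-1, vmax=1, square=True)
plt.title('Correlation Matrix of Key Numerical Features')
plt.tight_layout()
plt.show()
```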
Now we’ll apply Principal Component Analysis to reduce the dimensionality of our feature space and discover the underlying structure in our data. PCA will help us identify the most important combinations of features that capture the maximum variance in our dataset, potentially revealing hidden patterns that distinguish Team Rocket members from regular trainers.
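The scree plot and scatter plot below assume PCA has already been fit on the standardized Pokemon features. A minimal sketch of that step, reusing the `StandardScaler` + `PCA` pattern from earlier and the `analysis_df` / `numerical_features` defined above (the names `X_scaled`, `data_pca`, and `cumulative_variance` are chosen to match what the plotting code expects):

```python
# Standardize the numerical features and fit PCA on the Pokemon data
X = analysis_df[numerical_features]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA()
data_pca = pca.fit_transform(X_scaled)

cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
print("Explained variance per component:", pca.explained_variance_ratio_.round(3))
print("Cumulative explained variance:", cumulative_variance.round(3))
```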
```python
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
ax1_twin = ax1.twinx()

# Individual explained variance (bars)
bars = ax1.bar(range(1, len(pca.explained_variance_ratio_) + 1),
               pca.explained_variance_ratio_,
               alpha=0.7, color='steelblue', label='Individual Variance')

# Cumulative explained variance (line)
line = ax1_twin.plot(range(1, len(cumulative_variance) + 1),
                     cumulative_variance,
                     'ro-', linewidth=2, markersize=6, label='Cumulative Variance')

# Add percentage labels on bars
for i, (var_ratio, cum_var) in enumerate(zip(pca.explained_variance_ratio_, cumulative_variance)):
    ax1.text(i + 1, var_ratio + 0.005, f'{var_ratio*100:.1f}%',
             ha='center', va='bottom', fontsize=9)
    ax1_twin.text(i + 1, cum_var + 0.02, f'{cum_var*100:.1f}%',
                  ha='center', va='bottom', fontsize=9, color='red')

ax1.set_title('Enhanced PCA Scree Plot')
ax1.grid(True, alpha=0.3)

# Add reference lines
ax1_twin.axhline(y=0.8, color='green', linestyle='--', alpha=0.7, label='80% threshold')
ax1_twin.axhline(y=0.95, color='orange', linestyle='--', alpha=0.7, label='95% threshold')

# Legend
ax1.legend(loc='upper right')
ax1_twin.legend(loc='center right')

# First two principal components scatter plot
y = analysis_df['Team Rocket']  # Target variable
X_pca = data_pca                # PCA-transformed data
team_rocket_mask = y == 'Yes'

ax2.scatter(X_pca[~team_rocket_mask, 0], X_pca[~team_rocket_mask, 1],
            c='#2E8B57', alpha=0.6, label='Non-Team Rocket', s=30)
ax2.scatter(X_pca[team_rocket_mask, 0], X_pca[team_rocket_mask, 1],
            c='#DC143C', alpha=0.6, label='Team Rocket', s=30)

ax2.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}% variance)')
ax2.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}% variance)')
ax2.set_title('First Two Principal Components')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```
🔍 SCREE PLOT INTERPRETATION:
📊 Individual Variance (Blue bars):
• Shows how much variance each component captures
• Look for a clear 'elbow' where variance drops significantly
• In this case, variance is quite evenly distributed across components
📈 Cumulative Variance (Red line):
• Shows total variance captured by first N components
• Green line (80%): Good threshold for most applications
• Orange line (95%): Conservative threshold for critical applications
⚠️ CAUTION: No clear elbow in scree plot
• Consider using 4-5 components to capture 80%+ variance
• This dataset may not benefit much from PCA
Q10.5 Statistical Analysis: Feature Importance
Finally, we’ll use a Random Forest classifier to determine which individual features are most predictive of Team Rocket membership. This analysis will complement our PCA results by showing us which original features matter most for classification, helping us understand both the individual feature importance and the combined patterns we discovered through dimensionality reduction.
```python
from sklearn.ensemble import RandomForestClassifier

# Use Random Forest to identify most important features for classification
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_scaled, y)

# Get feature importances
feature_importance = pd.DataFrame({
    'Feature': numerical_features,
    'Importance': rf_classifier.feature_importances_
}).sort_values('Importance', ascending=False)

print(f"\n🎯 Feature Importance for Team Rocket Classification:")
print("Features ranked by predictive power:")
for idx, row in feature_importance.iterrows():
    print(f"   {row['Feature']}: {row['Importance']:.3f}")

# Visualize feature importance
plt.figure(figsize=(10, 6))
sns.barplot(data=feature_importance, x='Importance', y='Feature',
            hue='Importance', palette='viridis')
plt.title('Feature Importance for Team Rocket Detection', fontsize=14, fontweight='bold')
plt.xlabel('Importance Score')
plt.tight_layout()
plt.show()
```
🎯 Feature Importance for Team Rocket Classification:
Features ranked by predictive power:
Debt to Kanto: 0.722
Number of Migrations: 0.133
Win Ratio: 0.062
Average Pokemon Level: 0.035
Age: 0.031
Number of Gym Badges: 0.016
This blog post is part of the ML Odyssey series. Check out the previous parts to build a solid foundation in machine learning fundamentals!