ML Odyssey: Part 3 - Statistical Methods for Exploratory Analysis

Note
This is Part 3 of our EDA series. If you haven’t already, check out Part 2 where we covered the programming fundamentals with pandas, matplotlib, and seaborn. Here we’ll dive deep into the statistical concepts that power effective data analysis.

Welcome to the analytical heart of exploratory data analysis! While Part 2 taught you how to manipulate and visualize data, this post focuses on why and when to use different analytical techniques.

The Statistical Foundation of EDA

Exploratory Data Analysis isn’t just about creating pretty charts; it’s about scientifically investigating your data to uncover reliable insights. Think of the tools we’ll cover as your “data detective toolkit” that helps you determine whether the patterns you observe are real discoveries or just coincidences.

🎯 The Core Question: Every statistical test asks the same fundamental question: “Could this pattern have happened by random chance, or is there something real going on here?”

Before we dive into the details, here’s the logical progression we’ll follow to build your statistical analysis skills:

📚 The Complete EDA Learning Journey

Group 1: Analysis Methods

  1. 🔍 Test Selection by Data Type: Choose tests based on what you’re analyzing
    • 1.1 Categorical vs Categorical: Chi-Square vs Fisher’s Exact
    • 1.2 Continuous vs Categorical - Two Groups: T-Test vs Mann-Whitney U
    • 1.3 Continuous vs Categorical - 3+ Groups: ANOVA vs Kruskal-Wallis
    • 1.4 Geographic Patterns: Moran’s I Spatial Analysis
  2. 🔧 Parametric vs Non-Parametric: When to use each approach

Group 2: Evaluation

  1. 📊 Relationships: Testing relationships and understanding correlations
  2. 🎯 Complexity: Reducing complexity with dimensionality reduction (an important topic that we cover in Part 6)

Group 3: Integration

  1. 🚀 Strategy: Combining everything into a complete analytical strategy

💡 Each section builds on the previous one, creating a systematic approach to data analysis!


🎯 Now that you understand the statistical foundation, let’s learn how to choose the right tests for your data!

1. 🔍 Analysis Methods: Test Selection by Data Type

📚 Start here! The first step in statistical analysis is understanding what you’re testing and choosing the right method. Different data types require different approaches. Let’s organize tests by the questions they answer rather than by complexity.

🎯 Statistical Test Decision Tree

Step 1: What are you testing?

  • Categorical vs Categorical → Chi-Square Test or Fisher’s Exact Test
  • Continuous vs Categorical (2 groups) → T-Test or Mann-Whitney U Test
  • Continuous vs Categorical (3+ groups) → ANOVA or Kruskal-Wallis Test
  • Geographic patterns → Moran’s I Spatial Analysis

Step 2: How complex is your data?

  • Normal data, large samples → Parametric tests (more powerful)
  • Non-normal data, small samples → Non-parametric tests (more robust)

1.1 Categorical vs Categorical: Testing Independence

Question: “Are these two categorical variables related or independent?”

| Method | What It Does | Key Concepts | When to Use |
|---|---|---|---|
| Chi-Square Test | Tests relationships between categorical variables to determine if they’re independent or related | Null Hypothesis: variables are independent<br>Test Statistic: measures deviation from expected frequencies<br>Assumption: expected frequency ≥ 5 per cell<br>Effect Size: Cramer’s V (0–1, measures association strength) | • Large sample sizes<br>• All expected frequencies ≥ 5<br>• General independence testing |
| Fisher’s Exact Test | Exact test for small categorical tables with no distribution assumptions | Null Hypothesis: variables are independent<br>Test Statistic: exact probability calculation<br>Assumption: none (distribution-free)<br>Effect Size: phi coefficient or odds ratio | • Small sample sizes<br>• Expected frequencies < 5<br>• 2×2 contingency tables |

💡 Effect Size in Categorical Tests: What Do Cramer’s V and Phi Mean?

  • Cramer’s V (Chi-Square):
    Cramer’s V measures the strength of association between categorical variables. It ranges from 0 (no association) to 1 (perfect association), adjusted for the number of categories.
    Interpretation:

    • V > 0.1 = weak effect
    • V > 0.3 = moderate effect
    • V > 0.5 = strong effect
    Think of it as: "How much does knowing one variable help predict the other?"
  • Phi Coefficient (Fisher’s Exact):
    Phi coefficient measures association in 2×2 tables, similar to a correlation coefficient. It ranges from -1 to +1.
    Interpretation:

    • |φ| > 0.1 = weak effect
    • |φ| > 0.3 = moderate effect
    • |φ| > 0.5 = strong effect
    Think of it as: "How strongly are the two binary variables related?"

🔍 Interpreting Categorical Tests

📊 Chi-Square Test Results:

  • p-value < 0.05: Variables are likely NOT independent (evidence of a real relationship)
  • p-value ≥ 0.05: No evidence of a relationship (we fail to reject independence)
  • Effect Size: Cramer’s V > 0.1 = weak, > 0.3 = moderate, > 0.5 = strong
  • Practical Insight: “Pokemon type and Team Rocket membership are related (p < 0.01, V = 0.4), suggesting certain types are preferred by villains”

🔬 Fisher’s Exact Test Results:

  • p-value < 0.05: Variables are likely NOT independent (evidence of a real relationship)
  • p-value ≥ 0.05: No evidence of a relationship (we fail to reject independence)
  • Effect Size: |φ| > 0.1 = weak, > 0.3 = moderate, > 0.5 = strong
  • Practical Insight: “In small samples, Pokemon type and Team Rocket membership show moderate association (p < 0.05, φ = 0.35)”
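To make this concrete, here’s a minimal Python sketch of how you might run both tests with scipy.stats and compute Cramer’s V by hand. The DataFrame columns (`type`, `team_rocket`) and the CSV file name are hypothetical placeholders, not the actual Part 4 dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

def categorical_association(df: pd.DataFrame, col_a: str, col_b: str) -> dict:
    """Test whether two categorical columns are independent and report an effect size."""
    table = pd.crosstab(df[col_a], df[col_b])
    chi2, p, dof, expected = stats.chi2_contingency(table)

    # Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
    n = table.to_numpy().sum()
    cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

    # For small 2x2 tables with low expected counts, fall back to Fisher's exact test
    if table.shape == (2, 2) and (expected < 5).any():
        odds_ratio, p = stats.fisher_exact(table)
        return {"test": "fisher_exact", "p_value": p, "odds_ratio": odds_ratio}

    return {"test": "chi_square", "p_value": p, "cramers_v": cramers_v}

# Hypothetical usage:
# df = pd.read_csv("pokemon.csv")
# print(categorical_association(df, "type", "team_rocket"))
```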

1.2 Continuous vs Categorical - Two Groups: T-Test vs Mann-Whitney U

Question: “Do two groups have different average values?”

| Method | What It Does | Key Concepts | When to Use |
|---|---|---|---|
| T-Test | Compares means between exactly two groups | Paired vs. Independent: related vs. unrelated samples<br>T-Statistic: difference in means relative to standard error<br>Assumption: data approximately normal<br>Effect Size: Cohen’s d (standardized mean difference) | • Normal data distribution<br>• Large sample sizes<br>• Comparing two groups (e.g., before/after) |
| Mann-Whitney U Test | Non-parametric alternative to the t-test using ranks | Rank Transformation: converts values to ranks<br>U-Statistic: measures separation between group rankings<br>Assumption: none (distribution-free)<br>Effect Size: r = Z/√N (0–1, strength of ranking differences) | • Non-normal data<br>• Small sample sizes<br>• Ordinal or continuous data with outliers |

💡 Effect Size in Two Groups Comparison: What Do Cohen’s d and r Mean?

  • Cohen’s d (T-Test):
    Cohen’s d measures the standardized difference between two group means. It’s the difference in means divided by the pooled standard deviation.
    Interpretation:

    • d > 0.2 = small effect (groups overlap by 85%)
    • d > 0.5 = medium effect (groups overlap by 67%)
    • d > 0.8 = large effect (groups overlap by 53%)
    Think of it as: "How many standard deviations apart are the groups?"
  • r (Mann-Whitney U):
    The r effect size is calculated as r = Z/√N, where Z is the standardized test statistic and N is the total sample size. It ranges from 0 to 1, similar to a correlation coefficient.
    Interpretation:

    • r > 0.1 = small effect
    • r > 0.3 = medium effect
    • r > 0.5 = large effect
    Think of it as: "How much do the group rankings differ from random chance?"

🔍 Interpreting Two Groups Comparison

⚖️ T-Test Results:

  • p-value < 0.05: Groups are significantly different
  • t-statistic: Higher absolute values indicate stronger differences
  • Effect Size: Cohen’s d > 0.2 = small, > 0.5 = medium, > 0.8 = large
  • Practical Insight: “Team Rocket members have higher win ratios (t = 3.1, p < 0.01, d = 0.6), suggesting they’re more skilled battlers”

🔄 Mann-Whitney U Results:

  • p-value < 0.05: Groups differ significantly in rankings
  • Effect Size: r = Z/√N, where r > 0.1 = small, > 0.3 = medium, > 0.5 = large
  • Practical Insight: “Team Rocket migration patterns differ from regular trainers (U = 45, p < 0.05, r = 0.4), showing distinct movement strategies”
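Here’s a hedged sketch of what that comparison could look like in Python. The two input arrays are whatever numeric samples you want to compare (for example, win ratios for Team Rocket members vs. other trainers); the simple pooled standard deviation and the way r is recovered from the p-value are deliberate simplifications, not the Part 4 implementation.

```python
import numpy as np
from scipy import stats

def compare_two_groups(a, b) -> dict:
    """Parametric and non-parametric comparison of two independent samples."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)

    # Welch's t-test: a safer default that does not assume equal variances
    t_stat, t_p = stats.ttest_ind(a, b, equal_var=False)

    # Cohen's d with a simple pooled SD (assumes roughly similar group sizes)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    cohens_d = (a.mean() - b.mean()) / pooled_sd

    # Mann-Whitney U as the rank-based alternative
    u_stat, u_p = stats.mannwhitneyu(a, b, alternative="two-sided")

    # r = Z / sqrt(N), recovering |Z| from the two-sided p-value
    n = len(a) + len(b)
    z = stats.norm.isf(u_p / 2)
    r_effect = z / np.sqrt(n)

    return {"t": t_stat, "t_p": t_p, "cohens_d": cohens_d,
            "U": u_stat, "u_p": u_p, "r": r_effect}
```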

1.3 Continuous vs Categorical - 3+ Groups: ANOVA vs Kruskal-Wallis

Question: “Do multiple groups have different average values?”

| Method | What It Does | Key Concepts | When to Use |
|---|---|---|---|
| ANOVA | Compares means across multiple groups simultaneously | F-Statistic: ratio of between-group to within-group variance<br>Post-hoc Tests: identify which specific groups differ<br>Assumption: groups have similar variances<br>Effect Size: Eta² (0–1, proportion of variance explained) | • Normal data distribution<br>• 3+ groups to compare<br>• Experimental design analysis |
| Kruskal-Wallis Test | Non-parametric alternative to ANOVA using ranks | H-Statistic: measures overall group differences using ranks<br>No Distribution Assumptions: works with any data shape<br>Post-hoc Analysis: Dunn’s test for pairwise comparisons<br>Effect Size: Epsilon² (0–1, proportion of ranking variance explained) | • Non-normal data<br>• Multiple group comparisons<br>• Ordinal or continuous data |

💡 Effect Size in Multiple Groups Comparison: What Do Eta² and Epsilon² Mean?

  • Eta² (ANOVA):
    Eta² measures the proportion of variance in the dependent variable explained by the independent variable. It ranges from 0 to 1.
    Interpretation:

    • 0.01 = 1% variance explained (small effect)
    • 0.06 = 6% (medium effect)
    • 0.14 = 14% (large effect)
    Think of it as: "How much of the difference in attack stats is due to Pokemon type?"
  • Epsilon² (Kruskal-Wallis):
    Epsilon² is the non-parametric equivalent of Eta², measuring the proportion of variance explained by group differences in rankings. It also ranges from 0 to 1.
    Interpretation:

    • 0.01 = 1% variance explained (small effect)
    • 0.06 = 6% (medium effect)
    • 0.14 = 14% (large effect)
    Think of it as: "How much of the ranking differences are due to Pokemon type?"

🔍 Interpreting Multiple Groups Comparison

📈 ANOVA Results:

  • p-value < 0.05: At least one group differs significantly
  • F-statistic: Higher values indicate stronger group differences
  • Post-hoc needed: Use Tukey’s HSD to identify which specific groups differ
  • Effect Size: Eta² > 0.01 = small, > 0.06 = medium, > 0.14 = large
  • Practical Insight: “Pokemon types have different attack stats (F = 8.2, p < 0.001, η² = 0.15), with Fighting types being strongest”

🎯 Kruskal-Wallis Results:

  • p-value < 0.05: At least one group differs in rankings
  • H-statistic: Higher values indicate stronger group differences
  • Post-hoc needed: Use Dunn’s test for pairwise comparisons
  • Effect Size: Epsilon² > 0.01 = small, > 0.06 = medium, > 0.14 = large
  • Practical Insight: “Pokemon types have different strength rankings (H = 12.3, p < 0.01, ε² = 0.18), with Dragon types ranking highest”
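A minimal sketch of the multi-group comparison, assuming `groups` is a list of numeric arrays (one per Pokemon type, say). Eta-squared is computed from the ANOVA sums of squares, and epsilon-squared uses one common formulation, (H − k + 1) / (n − k); other variants exist.

```python
import numpy as np
from scipy import stats

def compare_many_groups(groups) -> dict:
    """ANOVA with eta-squared, plus Kruskal-Wallis with epsilon-squared."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)
    n = sum(len(g) for g in groups)
    pooled = np.concatenate(groups)

    # One-way ANOVA
    f_stat, f_p = stats.f_oneway(*groups)
    # Eta^2 = SS_between / SS_total
    grand_mean = pooled.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((pooled - grand_mean) ** 2).sum()
    eta_sq = ss_between / ss_total

    # Kruskal-Wallis on ranks
    h_stat, h_p = stats.kruskal(*groups)
    # Epsilon^2 = (H - k + 1) / (n - k), one common formulation
    epsilon_sq = (h_stat - k + 1) / (n - k)

    return {"F": f_stat, "anova_p": f_p, "eta_sq": eta_sq,
            "H": h_stat, "kw_p": h_p, "epsilon_sq": epsilon_sq}
```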

1.4 Geographic Patterns: Spatial Analysis

Question: “Are there geographic patterns or clusters in your data?”

| Method | What It Does | Key Concepts | When to Use |
|---|---|---|---|
| Moran’s I | Tests for spatial autocorrelation to detect geographic clustering patterns | Spatial Autocorrelation: measures whether similar values cluster geographically<br>I-Statistic: ranges from -1 (dispersed) to +1 (clustered)<br>Spatial Weights: defines what “nearby” means in your context<br>Effect Size: the absolute value of I (strength of the spatial pattern) | • Location or coordinate data<br>• Detecting geographic clusters<br>• Regional pattern analysis |

💡 Effect Size in Spatial Analysis: What Does Moran’s I Mean?

  • Moran’s I (Spatial Autocorrelation):
    Moran’s I measures spatial autocorrelation - how much similar values tend to cluster geographically. It ranges from -1 (perfect dispersion) to +1 (perfect clustering), with 0 indicating random distribution.
    Interpretation:

    • |I| > 0.1 = weak spatial pattern
    • |I| > 0.3 = moderate spatial pattern
    • |I| > 0.5 = strong spatial pattern
    Think of it as: "How much do similar values (like Team Rocket members) tend to be located near each other?"
  • Direction Matters:

    • I > 0: Positive spatial autocorrelation (similar values cluster together)
    • I < 0: Negative spatial autocorrelation (similar values are dispersed)
    • I ≈ 0: No spatial pattern (random distribution)

🔍 Interpreting Spatial Analysis

🗺️ Moran’s I Results:

  • I > 0: Positive spatial autocorrelation (similar values cluster together)
  • I < 0: Negative spatial autocorrelation (similar values are dispersed)
  • I ≈ 0: No spatial pattern (random distribution)
  • p-value < 0.05: Spatial pattern is statistically significant
  • Effect Size: |I| > 0.1 = weak, > 0.3 = moderate, > 0.5 = strong
  • Practical Insight: “Team Rocket members show strong clustering (I = 0.45, p < 0.001), indicating they operate in specific geographic regions”
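Global Moran’s I is simple enough to compute from first principles, which makes the formula easier to see. The sketch below assumes you already have one numeric value per location and an n × n spatial-weights matrix (1 for neighbours, 0 otherwise); in practice you would typically reach for libraries such as libpysal/esda, which also give you permutation-based p-values.

```python
import numpy as np

def morans_i(values, weights) -> float:
    """Global Moran's I for one variable, given an n x n spatial-weights matrix."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    n = len(x)
    z = x - x.mean()                        # deviations from the mean

    s0 = w.sum()                            # sum of all spatial weights
    numerator = (w * np.outer(z, z)).sum()  # sum_ij w_ij * z_i * z_j
    denominator = (z ** 2).sum()

    return (n / s0) * (numerator / denominator)

# Tiny made-up example: 4 locations on a line, each weighted to its neighbours
values = [10, 12, 3, 2]
weights = [[0, 1, 0, 0],
           [1, 0, 1, 0],
           [0, 1, 0, 1],
           [0, 0, 1, 0]]
print(round(morans_i(values, weights), 3))  # positive value: similar values sit near each other
```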

2. 🔧 Analysis Methods: Parametric vs Non-Parametric Decision

Now that you know which test type to use, let’s understand when to choose the parametric or non-parametric version of each test.

🎯 Key Concept: Parametric tests are more powerful when assumptions are met, but non-parametric tests are more robust when data is messy. Always check your data before choosing!

Parametric Tests

✅ Use when:
  • Data is approximately normal (bell curve shape)
  • Sample sizes are large (n > 30)
  • Groups have similar variances (homogeneity)
  • Data is continuous and well-behaved
🎯 Benefits:
  • More statistical power (better chance of detecting real effects)
  • Exact p-values from known distributions
  • Standard effect size measures (Cohen's d, Eta²)
  • Widely understood and reported in literature
⚠️ Risks:
  • Results can be misleading if assumptions are violated
  • Type I/II errors increase with assumption violations
  • May need data transformation to meet assumptions

Non-Parametric Tests

✅ Use when:
  • Data is not normally distributed (skewed, irregular)
  • Sample sizes are small (n < 30)
  • Data has outliers that can't be removed
  • Data is ordinal (rankings, ratings)
  • Variances are unequal between groups
🎯 Benefits:
  • No distribution assumptions required
  • Robust against outliers and extreme values
  • Works with any data shape or size
  • More reliable when parametric assumptions fail
⚠️ Trade-offs:
  • Slightly less powerful than parametric tests
  • May need larger samples to detect the same effect
  • Effect sizes are less standardized
  • Results are rank-based, not value-based

💡 Recommended Approach:

  1. Start with parametric tests (more powerful when assumptions are met)
  2. Check assumptions (normality, homogeneity, sample size)
  3. If assumptions fail, switch to non-parametric alternatives
  4. Report both results if possible (parametric + non-parametric)
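As a rough illustration of that workflow, here’s a small sketch that checks normality and variance homogeneity and then suggests which two-group test to run. The 0.05 cut-offs and the n ≥ 30 rule of thumb simply mirror the guidance above and are not universal constants.

```python
import numpy as np
from scipy import stats

def choose_two_group_test(a, b, alpha=0.05) -> str:
    """Suggest a test based on normality, equal-variance, and sample-size checks."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)

    # Shapiro-Wilk normality check for each group
    # (with very large samples it flags tiny deviations, so visual checks help too)
    normal = all(stats.shapiro(g).pvalue > alpha for g in (a, b))

    # Levene's test for homogeneity of variances
    equal_var = stats.levene(a, b).pvalue > alpha

    large_enough = min(len(a), len(b)) >= 30

    if normal and equal_var and large_enough:
        return "t-test (parametric)"
    return "Mann-Whitney U (non-parametric)"
```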

🔍 Real-World Application: Pokemon Team Rocket Dataset Analysis

In Part 4, you’ll use every single statistical method from Section 1 to analyze real Pokemon data. Here’s exactly what you’ll investigate:

📊 Categorical vs Categorical (Section 1.1)

  • Chi-Square/Fisher’s Exact: “Are Pokemon types independent of Team Rocket membership?”
    • You’ll choose between Chi-Square (large samples) and Fisher’s Exact (small samples)

📈 Two Groups Comparison (Section 1.2)

  • T-Test vs Mann-Whitney U: “Do Team Rocket members have higher attack stats?”
    • You’ll choose based on data normality and sample size

📊 Multiple Groups Comparison (Section 1.3)

  • ANOVA vs Kruskal-Wallis: “Do different Pokemon types have different average attack stats?”
    • You’ll compare multiple types and choose the appropriate test

🗺️ Geographic Patterns (Section 1.4)

  • Moran’s I: “Are Team Rocket members geographically clustered?”
    • You’ll analyze location data to find spatial patterns

🔧 Test Selection Strategy (Section 2)

  • Parametric vs Non-Parametric: You’ll learn when to use each approach based on your data characteristics

💡 The Connection: Every concept you just learned will be applied to real data, helping you understand not just the theory, but how to use these methods in practice!


🎯 TRANSITION: From Analysis Methods to Evaluation Criteria

📚 What You’ve Learned So Far (Group 1: ANALYSIS METHODS):

  • Section 1: How to choose tests based on data type (categorical, continuous, geographic)
    • 1.1 Categorical independence testing (Chi-Square vs Fisher’s Exact)
    • 1.2 Two groups comparison (T-Test vs Mann-Whitney U)
    • 1.3 Multiple groups comparison (ANOVA vs Kruskal-Wallis)
    • 1.4 Geographic patterns (Moran’s I)
  • Section 2: When to use parametric vs non-parametric approaches

🔍 What’s Coming Next (Group 2: EVALUATION):

  • 📊 Section 3: How to evaluate relationships between variables and avoid confusing correlation with causation
  • 🎯 Dimensionality reduction for simplifying complex, high-dimensional data (deferred to Part 6 of the series)
  • 📏 Throughout: how to weigh the practical importance of your findings (effect size), not just their statistical significance

📊 Now let’s understand how to analyze relationships between variables and evaluate whether the patterns we find are meaningful!

3. 📊 Evaluation: Correlation vs Causation (The Data Science Golden Rule)

Understanding relationships between variables is crucial, but there’s a big difference between correlation and causation!

✅ Correlation

What it is: Two things tend to change together

Example: Ice cream sales and drowning deaths both increase in summer

Measurement: Correlation coefficient (-1 to +1)

⚠️ Causation

What it is: One thing directly causes another

Reality: Hot weather causes both (confounding variable!)

Proof needed: Controlled experiments or strong theory

Here are the key types of correlation you need to know:

| Correlation Type | When to Use | What It Measures |
|---|---|---|
| Pearson | Linear relationships, normal data | Straight-line relationships |
| Spearman | Non-linear relationships, ordinal data | Monotonic relationships (consistently increasing/decreasing) |
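Both coefficients are one call away in scipy.stats. The sketch below computes them side by side; the column names (`attack`, `defense`) are hypothetical stand-ins for whatever pair of numeric variables you’re examining.

```python
import pandas as pd
from scipy import stats

def correlation_report(df: pd.DataFrame, col_x: str, col_y: str) -> dict:
    """Compute Pearson (linear) and Spearman (monotonic) correlations with p-values."""
    pearson_r, pearson_p = stats.pearsonr(df[col_x], df[col_y])
    spearman_rho, spearman_p = stats.spearmanr(df[col_x], df[col_y])
    return {"pearson_r": pearson_r, "pearson_p": pearson_p,
            "spearman_rho": spearman_rho, "spearman_p": spearman_p}

# Hypothetical usage:
# print(correlation_report(df, "attack", "defense"))
```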

🎯 Practice Preview: Correlation Analysis in Part 4

🔗 Real-World Correlation Analysis

In Part 4, you’ll apply these correlation concepts to discover relationships in Pokemon data:

  • Pearson Correlation: “How strongly do Pokemon attack and defense stats correlate?”
    • You’ll calculate r values and interpret the strength of linear relationships
  • Spearman Correlation: “Do Pokemon types have consistent strength rankings?”
    • You’ll test for monotonic relationships when data isn’t linear
  • Correlation vs Causation: “Does high attack cause high defense, or are they both influenced by Pokemon level?”
    • You’ll identify confounding variables and avoid causal misinterpretations

💡 The Connection: Understanding correlation types helps you choose the right analysis method and interpret results correctly!

🎯 Now that you can evaluate relationships, let’s combine everything into a complete analytical strategy!

4. 🚀 Putting It All Together: Your Complete EDA Strategy

Now you have all the analytical tools! Here’s how they work together in a systematic analysis:

4.1 The Complete EDA Journey with Analytical Tools

  1. 🔍 Data Understanding

    • Classify variables by type (numerical, categorical, datetime)
    • Understand the statistical properties of each variable
  2. 🧹 Preprocessing Strategy

    • Choose appropriate encoding based on variable type
    • Apply statistical feature selection (ANOVA, Chi-square)
    • Consider the implications for downstream analysis
  3. 📊 Statistical Hypothesis Testing

    • Select appropriate tests from our complete toolkit (Chi-square, ANOVA, T-test, Mann-Whitney U, Kruskal-Wallis, Moran’s I)
    • Always check both significance AND effect size
    • Use tests to validate patterns found in visualizations
  4. 📈 Advanced Pattern Recognition

    • Use correlation analysis to understand relationships
    • Apply visualization theory to reveal hidden patterns
    • Distinguish between correlation and causation
  5. 🗺️ Geographic Analysis (when location data exists)

    • Apply Moran’s I for spatial autocorrelation testing
    • Identify geographic clusters and patterns
    • Integrate spatial insights with other statistical findings
  6. 🎯 Dimensionality Strategy

    • Apply PCA for complex, high-dimensional datasets
    • Compare PCA components with statistical feature importance
    • Understand explained variance and information retention
  7. 🔬 Scientific Validation

    • Interpret results in context of domain knowledge
    • Consider practical significance (effect size), not just statistical significance
    • Build a coherent analytical narrative supported by evidence

What’s Next?

Now that you understand both the programming tools (from Part 2) and the analytical concepts (from this Part 3), you’re ready for the exciting practical application!

In Part 4, we’ll combine everything:

  1. Apply pandas programming skills to real Pokemon data
  2. Use statistical tests to validate patterns we discover
  3. Evaluate the practical usefulness of each finding
  4. Create advanced visualizations guided by theory
  5. Build a complete analytical narrative that guides ML decisions

🎯 Ready for the Real Challenge?: You now have both the programming skills AND the analytical foundation to tackle the Pokemon Team Rocket dataset in Part 4. We’ll use every single tool and concept covered in Parts 2 and 3 to uncover hidden patterns and build insights that matter!

All theoretical concepts and practical implementations are available in our ML Odyssey repository.