ML Odyssey: Part 3 - Statistical Methods for Exploratory Analysis

Welcome to the analytical heart of exploratory data analysis! While Part 2 taught you how to manipulate and visualize data, this post focuses on why and when to use different analytical techniques.
The Statistical Foundation of EDA
Exploratory Data Analysis isn’t just about creating pretty charts—it’s about scientifically investigating your data to uncover reliable insights. Think of the tools we’ll cover as your “data detective toolkit” that helps you prove whether patterns you observe are real discoveries or just coincidences.
🎯 The Core Question: Every statistical test asks the same fundamental question: “Could this pattern have happened by random chance, or is there something real going on here?”
Before we dive into the details, here’s the logical progression we’ll follow to build your statistical analysis skills:
📚 The Complete EDA Learning Journey
Group 1: Analysis Methods
- 🔍 Test Selection by Data Type: Choose tests based on what you’re analyzing
- 1.1 Categorical vs Categorical: Chi-Square vs Fisher’s Exact
- 1.2 Continuous vs Categorical - Two Groups: T-Test vs Mann-Whitney U
- 1.3 Continuous vs Categorical - 3+ Groups: ANOVA vs Kruskal-Wallis
- 1.4 Geographic Patterns: Moran’s I Spatial Analysis
- 🔧 Parametric vs Non-Parametric: When to use each approach
Group 2: Evaluation
- 📊 Relationships: Testing relationships and understanding correlations
- 🎯 Complexity: Reducing complexity with dimensionality reduction (a major topic in its own right, covered in depth in Part 6)
Group 3: Integration
- 🚀 Strategy: Combining everything into a complete analytical strategy
💡 Each section builds on the previous one, creating a systematic approach to data analysis!
🎯 Now that you understand the statistical foundation, let’s learn how to choose the right tests for your data!
1. 🔍 Analysis Methods: Test Selection by Data Type
📚 Start here! The first step in statistical analysis is understanding what you’re testing and choosing the right method. Different data types require different approaches. Let’s organize tests by the questions they answer rather than by complexity.
🎯 Statistical Test Decision Tree
Step 1: What are you testing?
- Categorical vs Categorical → Chi-Square Test or Fisher’s Exact Test
- Continuous vs Categorical (2 groups) → T-Test or Mann-Whitney U Test
- Continuous vs Categorical (3+ groups) → ANOVA or Kruskal-Wallis Test
- Geographic patterns → Moran’s I Spatial Analysis
Step 2: How complex is your data?
- Normal data, large samples → Parametric tests (more powerful)
- Non-normal data, small samples → Non-parametric tests (more robust)
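The two-step decision tree above can be sketched as a small helper function. This is a simplified sketch: real test selection also depends on expected cell counts, assumption checks, and domain context, and the function name and parameters here are illustrative.

```python
def suggest_test(outcome: str, predictor: str, n_groups: int = 2,
                 assumptions_met: bool = True) -> str:
    """Map the decision tree above to a test name (simplified sketch).

    `assumptions_met` stands in for Step 2: normal data and large
    samples favour the parametric branch.
    """
    if outcome == "categorical" and predictor == "categorical":
        return "Chi-Square" if assumptions_met else "Fisher's Exact"
    if outcome == "continuous" and predictor == "categorical":
        if n_groups == 2:
            return "T-Test" if assumptions_met else "Mann-Whitney U"
        return "ANOVA" if assumptions_met else "Kruskal-Wallis"
    if predictor == "geographic":
        return "Moran's I"
    return "unknown"
```
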
1.1 Categorical vs Categorical: Testing Independence
Question: “Are these two categorical variables related or independent?”
Method | What It Does | Key Concepts | When to Use |
---|---|---|---|
Chi-Square Test | Tests relationships between categorical variables to determine if they’re independent or related | • Null Hypothesis: Variables are independent • Test Statistic: Measures deviation from expected frequencies • Assumption: Expected frequency ≥ 5 per cell • Effect Size: Cramer’s V (0-1, measures association strength) | • Large sample sizes • All expected frequencies ≥ 5 • General independence testing |
Fisher’s Exact Test | Exact test for small categorical tables with no distribution assumptions | • Null Hypothesis: Variables are independent • Test Statistic: Exact probability calculation • Assumption: None (distribution-free) • Effect Size: Phi coefficient or odds ratio | • Small sample sizes • Expected frequencies < 5 • 2×2 contingency tables |
💡 Effect Size in Categorical Tests: What Do Cramer’s V and Phi Mean?
Cramer’s V (Chi-Square):
Cramer’s V measures the strength of association between categorical variables. It ranges from 0 (no association) to 1 (perfect association), adjusted for the number of categories.
Interpretation:
- V > 0.1 = weak effect
- V > 0.3 = moderate effect
- V > 0.5 = strong effect
Phi Coefficient (Fisher’s Exact):
Phi coefficient measures association in 2×2 tables, similar to a correlation coefficient. It ranges from -1 to +1.
Interpretation:
- |φ| > 0.1 = weak effect
- |φ| > 0.3 = moderate effect
- |φ| > 0.5 = strong effect
🔍 Interpreting Categorical Tests
📊 Chi-Square Test Results:
- p-value < 0.05: Variables are NOT independent (there’s evidence of a real relationship)
- p-value ≥ 0.05: No relationship detected (note: this is a failure to reject independence, not proof of it)
- Effect Size: Cramer’s V > 0.1 = weak, > 0.3 = moderate, > 0.5 = strong
- Practical Insight: “Pokemon type and Team Rocket membership are related (p < 0.01, V = 0.4), suggesting certain types are preferred by villains”
🔬 Fisher’s Exact Test Results:
- p-value < 0.05: Variables are NOT independent (there’s evidence of a real relationship)
- p-value ≥ 0.05: No relationship detected (note: this is a failure to reject independence, not proof of it)
- Effect Size: |φ| > 0.1 = weak, > 0.3 = moderate, > 0.5 = strong
- Practical Insight: “In small samples, Pokemon type and Team Rocket membership show moderate association (p < 0.05, φ = 0.35)”
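Both tests can be run with SciPy. The 2×2 contingency table below is hypothetical (made-up counts of Pokemon type vs Team Rocket membership, just for illustration), and Cramer’s V is computed by hand from the chi-square statistic:

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical 2x2 table: rows = Pokemon type (Poison / Psychic),
# columns = Team Rocket member (yes / no). Counts are made up.
table = np.array([[30, 10],
                  [15, 45]])

# Chi-Square: suitable when every expected frequency is >= 5
chi2, p, dof, expected = chi2_contingency(table)

# Cramer's V effect size: sqrt(chi2 / (n * (min(rows, cols) - 1)))
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

# Fisher's Exact: preferred for small samples and 2x2 tables
odds_ratio, p_exact = fisher_exact(table)

print(f"Chi-Square p = {p:.4f}, Cramer's V = {cramers_v:.2f}")
print(f"Fisher's Exact p = {p_exact:.4f}, odds ratio = {odds_ratio:.2f}")
```

Note that `chi2_contingency` also returns the expected frequencies, which is exactly what you need to check the “expected frequency ≥ 5 per cell” assumption before trusting the chi-square result.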
1.2 Continuous vs Categorical - Two Groups: T-Test vs Mann-Whitney U
Question: “Do two groups have different average values?”
Method | What It Does | Key Concepts | When to Use |
---|---|---|---|
T-Test | Compares means between exactly two groups | • Paired vs. Independent: Related vs. unrelated samples • T-Statistic: Difference in means relative to standard error • Assumption: Data approximately normal • Effect Size: Cohen’s d (standardized mean difference) | • Normal data distribution • Large sample sizes • Comparing two groups (e.g., before/after) |
Mann-Whitney U Test | Non-parametric alternative to t-test using ranks | • Rank Transformation: Converts values to ranks • U-Statistic: Measures separation between group rankings • Assumption: None (distribution-free) • Effect Size: r = Z/√N (0-1, strength of ranking differences) | • Non-normal data • Small sample sizes • Ordinal or continuous data with outliers |
💡 Effect Size in Two Groups Comparison: What Do Cohen’s d and r Mean?
Cohen’s d (T-Test):
Cohen’s d measures the standardized difference between two group means. It’s the difference in means divided by the pooled standard deviation.
Interpretation:
- d > 0.2 = small effect (groups overlap by 85%)
- d > 0.5 = medium effect (groups overlap by 67%)
- d > 0.8 = large effect (groups overlap by 53%)
r (Mann-Whitney U):
The r effect size is calculated as r = Z/√N, where Z is the standardized test statistic and N is the total sample size. It ranges from 0 to 1, similar in scale to a correlation coefficient.
Interpretation:
- r > 0.1 = small effect
- r > 0.3 = medium effect
- r > 0.5 = large effect
🔍 Interpreting Two Groups Comparison
⚖️ T-Test Results:
- p-value < 0.05: Groups are significantly different
- t-statistic: Higher absolute values indicate stronger differences
- Effect Size: Cohen’s d > 0.2 = small, > 0.5 = medium, > 0.8 = large
- Practical Insight: “Team Rocket members have higher win ratios (t = 3.1, p < 0.01, d = 0.6), suggesting they’re more skilled battlers”
🔄 Mann-Whitney U Results:
- p-value < 0.05: Groups differ significantly in rankings
- Effect Size: r = Z/√N, where r > 0.1 = small, > 0.3 = medium, > 0.5 = large
- Practical Insight: “Team Rocket migration patterns differ from regular trainers (U = 45, p < 0.05, r = 0.4), showing distinct movement strategies”
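Here is a sketch of both tests with SciPy on synthetic data (the win-ratio numbers are simulated, not from a real dataset). Cohen’s d is computed by hand, and r is recovered from the Mann-Whitney p-value via the normal approximation, which is a common but approximate convention:

```python
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu, norm

rng = np.random.default_rng(42)
# Hypothetical win ratios for two groups (synthetic, illustrative data)
rocket = rng.normal(0.65, 0.10, 40)
regular = rng.normal(0.55, 0.10, 40)

# T-Test (parametric): compares means
t_stat, p_t = ttest_ind(rocket, regular)

# Cohen's d: mean difference divided by the pooled standard deviation
pooled_sd = np.sqrt((rocket.var(ddof=1) + regular.var(ddof=1)) / 2)
d = (rocket.mean() - regular.mean()) / pooled_sd

# Mann-Whitney U (non-parametric): compares rankings
u_stat, p_u = mannwhitneyu(rocket, regular)

# r = Z / sqrt(N); recover |Z| from the two-sided p-value (approximation)
n_total = len(rocket) + len(regular)
z = norm.isf(p_u / 2)
r = z / np.sqrt(n_total)

print(f"t = {t_stat:.2f}, p = {p_t:.4f}, d = {d:.2f}")
print(f"U = {u_stat:.1f}, p = {p_u:.4f}, r = {r:.2f}")
```
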
1.3 Continuous vs Categorical - 3+ Groups: ANOVA vs Kruskal-Wallis
Question: “Do multiple groups have different average values?”
Method | What It Does | Key Concepts | When to Use |
---|---|---|---|
ANOVA | Compares means across multiple groups simultaneously | • F-Statistic: Ratio of between-group to within-group variance • Post-hoc Tests: Identify which specific groups differ • Assumption: Groups have similar variances • Effect Size: Eta² (0-1, proportion of variance explained) | • Normal data distribution • 3+ groups to compare • Experimental design analysis |
Kruskal-Wallis Test | Non-parametric alternative to ANOVA using ranks | • H-Statistic: Measures overall group differences using ranks • No Distribution Assumptions: Works with any data shape • Post-hoc Analysis: Dunn’s test for pairwise comparisons • Effect Size: Epsilon² (0-1, proportion of ranking variance explained) | • Non-normal data • Multiple group comparisons • Ordinal or continuous data |
💡 Effect Size in Multiple Groups Comparison: What Do Eta² and Epsilon² Mean?
Eta² (ANOVA):
Eta² measures the proportion of variance in the dependent variable explained by the independent variable. It ranges from 0 to 1.
Interpretation:
- 0.01 = 1% variance explained (small effect)
- 0.06 = 6% (medium effect)
- 0.14 = 14% (large effect)
Epsilon² (Kruskal-Wallis):
Epsilon² is the non-parametric equivalent of Eta², measuring the proportion of variance explained by group differences in rankings. It also ranges from 0 to 1.
Interpretation:
- 0.01 = 1% variance explained (small effect)
- 0.06 = 6% (medium effect)
- 0.14 = 14% (large effect)
🔍 Interpreting Multiple Groups Comparison
📈 ANOVA Results:
- p-value < 0.05: At least one group differs significantly
- F-statistic: Higher values indicate stronger group differences
- Post-hoc needed: Use Tukey’s HSD to identify which specific groups differ
- Effect Size: Eta² > 0.01 = small, > 0.06 = medium, > 0.14 = large
- Practical Insight: “Pokemon types have different attack stats (F = 8.2, p < 0.001, η² = 0.15), with Fighting types being strongest”
🎯 Kruskal-Wallis Results:
- p-value < 0.05: At least one group differs in rankings
- H-statistic: Higher values indicate stronger group differences
- Post-hoc needed: Use Dunn’s test for pairwise comparisons
- Effect Size: Epsilon² > 0.01 = small, > 0.06 = medium, > 0.14 = large
- Practical Insight: “Pokemon types have different strength rankings (H = 12.3, p < 0.01, ε² = 0.18), with Dragon types ranking highest”
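Both multi-group tests are one-liners in SciPy; the effect sizes take a few more lines. The attack stats below are simulated for illustration, and the Epsilon² formula used here (H divided by (n² − 1)/(n + 1)) is one common convention among several in the literature:

```python
import numpy as np
from scipy.stats import f_oneway, kruskal

rng = np.random.default_rng(0)
# Hypothetical attack stats for three Pokemon types (synthetic data)
fire = rng.normal(80, 15, 50)
water = rng.normal(70, 15, 50)
grass = rng.normal(65, 15, 50)
groups = [fire, water, grass]

# ANOVA (parametric): does at least one mean differ?
f_stat, p_anova = f_oneway(*groups)

# Eta^2: between-group sum of squares over total sum of squares
pooled = np.concatenate(groups)
grand_mean = pooled.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((pooled - grand_mean) ** 2).sum()
eta_sq = ss_between / ss_total

# Kruskal-Wallis (non-parametric): rank-based alternative
h_stat, p_kw = kruskal(*groups)

# Epsilon^2: rank-based analogue of Eta^2 (one common formulation)
n = len(pooled)
epsilon_sq = h_stat / ((n ** 2 - 1) / (n + 1))

print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.4f}, eta^2 = {eta_sq:.3f}")
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.4f}, eps^2 = {epsilon_sq:.3f}")
```

Remember that a significant result here only says “at least one group differs”; the post-hoc tests mentioned above (Tukey’s HSD, Dunn’s test) are still needed to find which ones.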
1.4 Geographic Patterns: Moran’s I Spatial Analysis
Question: “Are there geographic patterns or clusters in your data?”
Method | What It Does | Key Concepts | When to Use |
---|---|---|---|
Moran’s I | Tests for spatial autocorrelation to detect geographic clustering patterns | • Spatial Autocorrelation: Measures if similar values cluster geographically • I-Statistic: Ranges from -1 (dispersed) to +1 (clustered) • Spatial Weights: Defines what “nearby” means in your context • Effect Size: \|I\| (0-1, strength of spatial pattern) | • Location or coordinate data • Detecting geographic clusters • Regional pattern analysis |
💡 Effect Size in Spatial Analysis: What Does Moran’s I Mean?
Moran’s I (Spatial Autocorrelation):
Moran’s I measures spatial autocorrelation - how much similar values tend to cluster geographically. It ranges from -1 (perfect dispersion) to +1 (perfect clustering), with 0 indicating random distribution.
Interpretation:
- |I| > 0.1 = weak spatial pattern
- |I| > 0.3 = moderate spatial pattern
- |I| > 0.5 = strong spatial pattern
Direction Matters:
- I > 0: Positive spatial autocorrelation (similar values cluster together)
- I < 0: Negative spatial autocorrelation (similar values are dispersed)
- I ≈ 0: No spatial pattern (random distribution)
🔍 Interpreting Spatial Analysis
🗺️ Moran’s I Results:
- I > 0: Positive spatial autocorrelation (similar values cluster together)
- I < 0: Negative spatial autocorrelation (similar values are dispersed)
- I ≈ 0: No spatial pattern (random distribution)
- p-value < 0.05: Spatial pattern is statistically significant
- Effect Size: |I| > 0.1 = weak, > 0.3 = moderate, > 0.5 = strong
- Practical Insight: “Team Rocket members show strong clustering (I = 0.45, p < 0.001), indicating they operate in specific geographic regions”
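To make the I-statistic concrete, here is a from-scratch NumPy sketch of the basic (row-unstandardized) Moran’s I formula on a toy 4-location example; real analyses typically use dedicated spatial libraries such as the PySAL ecosystem, and the weights matrix below is a made-up illustration of “neighbours share an edge”:

```python
import numpy as np

def morans_i(values: np.ndarray, weights: np.ndarray) -> float:
    """Moran's I from first principles (sketch, not optimized).

    `weights` is an n x n spatial weights matrix where weights[i, j] > 0
    means locations i and j are neighbours (diagonal is zero).
    """
    n = len(values)
    z = values - values.mean()          # deviations from the mean
    numerator = n * (weights * np.outer(z, z)).sum()
    denominator = weights.sum() * (z ** 2).sum()
    return numerator / denominator

# Hypothetical 4 locations on a line: adjacent locations are neighbours
w = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
vals = np.array([10.0, 9.0, 2.0, 1.0])   # similar values sit next to each other
print(f"Moran's I = {morans_i(vals, w):.2f}")   # positive => clustering
```

Swapping in an alternating pattern like `[10, 1, 10, 1]` drives I negative, matching the “dispersed” end of the interpretation scale above.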
2. 🔧 Analysis Methods: Parametric vs Non-Parametric Decision
Now that you know which test type to use, let’s understand when to choose the parametric or non-parametric version of each test.
🎯 Key Concept: Parametric tests are more powerful when assumptions are met, but non-parametric tests are more robust when data is messy. Always check your data before choosing!
Parametric Tests
✅ Use when:
- Data is approximately normal (bell curve shape)
- Sample sizes are large (n > 30)
- Groups have similar variances (homogeneity)
- Data is continuous and well-behaved

👍 Advantages:
- More statistical power (better chance of detecting real effects)
- Exact p-values from known distributions
- Standard effect size measures (Cohen's d, Eta²)
- Widely understood and reported in literature

⚠️ Limitations:
- Results can be misleading if assumptions are violated
- Type I/II errors increase with assumption violations
- May need data transformation to meet assumptions
Non-Parametric Tests
✅ Use when:
- Data is not normally distributed (skewed, irregular)
- Sample sizes are small (n < 30)
- Data has outliers that can't be removed
- Data is ordinal (rankings, ratings)
- Variances are unequal between groups

👍 Advantages:
- No distribution assumptions required
- Robust against outliers and extreme values
- Works with any data shape or size
- More reliable when parametric assumptions fail

⚠️ Limitations:
- Slightly less powerful than parametric tests
- May need larger samples to detect the same effect
- Effect sizes are less standardized
- Results are rank-based, not value-based
💡 Recommended Approach:
- Start with parametric tests (more powerful when assumptions are met)
- Check assumptions (normality, homogeneity, sample size)
- If assumptions fail, switch to non-parametric alternatives
- Report both results if possible (parametric + non-parametric)
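The recommended approach can be wired into a small helper. This is a sketch under illustrative choices: Shapiro-Wilk for normality, Levene for homogeneity of variance, and a fall-back to Mann-Whitney U when either check fails; the function name, the alpha threshold, and the simulated data are all assumptions for demonstration:

```python
import numpy as np
from scipy.stats import shapiro, levene, ttest_ind, mannwhitneyu

def compare_two_groups(a, b, alpha=0.05):
    """Check assumptions first, then pick the parametric or
    non-parametric two-group test (illustrative sketch)."""
    # Step 1: normality check on each group (Shapiro-Wilk)
    normal = shapiro(a).pvalue > alpha and shapiro(b).pvalue > alpha
    # Step 2: homogeneity of variances (Levene's test)
    equal_var = levene(a, b).pvalue > alpha
    if normal and equal_var:
        _, p = ttest_ind(a, b)
        return "t-test", p
    # Assumptions failed: switch to the non-parametric alternative
    _, p = mannwhitneyu(a, b)
    return "mann-whitney", p

rng = np.random.default_rng(1)
name, p = compare_two_groups(rng.normal(0.0, 1.0, 40),
                             rng.normal(1.5, 1.0, 40))
print(f"chose {name}, p = {p:.4f}")
```
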
🔍 Real-World Application: Pokemon Team Rocket Dataset Analysis
In Part 4, you’ll use every single statistical method from Section 1 to analyze real Pokemon data. Here’s exactly what you’ll investigate:
📊 Categorical vs Categorical (Section 1.1)
- Chi-Square/Fisher’s Exact: “Are Pokemon types independent of Team Rocket membership?”
- You’ll choose between Chi-Square (large samples) and Fisher’s Exact (small samples)
📈 Two Groups Comparison (Section 1.2)
- T-Test vs Mann-Whitney U: “Do Team Rocket members have higher attack stats?”
- You’ll choose based on data normality and sample size
📊 Multiple Groups Comparison (Section 1.3)
- ANOVA vs Kruskal-Wallis: “Do different Pokemon types have different average attack stats?”
- You’ll compare multiple types and choose the appropriate test
🗺️ Geographic Patterns (Section 1.4)
- Moran’s I: “Are Team Rocket members geographically clustered?”
- You’ll analyze location data to find spatial patterns
🔧 Test Selection Strategy (Section 2)
- Parametric vs Non-Parametric: You’ll learn when to use each approach based on your data characteristics
💡 The Connection: Every concept you just learned will be applied to real data, helping you understand not just the theory, but how to use these methods in practice!
🎯 TRANSITION: From Analysis Methods to Evaluation Criteria
📚 What You’ve Learned So Far (Group 1: ANALYSIS METHODS):
- ✅ Section 1: How to choose tests based on data type (categorical, continuous, geographic)
- 1.1 Categorical independence testing (Chi-Square vs Fisher’s Exact)
- 1.2 Two groups comparison (T-Test vs Mann-Whitney U)
- 1.3 Multiple groups comparison (ANOVA vs Kruskal-Wallis)
- 1.4 Geographic patterns (Moran’s I)
- ✅ Section 2: When to use parametric vs non-parametric approaches
🔍 What’s Coming Next (Groups 2 and 3):
- 📊 Section 3 (EVALUATION): How to evaluate relationships between variables, and the difference between correlation and causation
- 🚀 Section 4 (INTEGRATION): How to combine everything into a complete analytical strategy
📊 Now let’s understand how to analyze relationships between variables and evaluate if our analysis was useful!
3. 📊 Evaluation: Correlation vs Causation (The Data Science Golden Rule)
Understanding relationships between variables is crucial, but there’s a big difference between correlation and causation!
✅ Correlation
What it is: Two things tend to change together
Example: Ice cream sales and drowning deaths both increase in summer
Measurement: Correlation coefficient (-1 to +1)
⚠️ Causation
What it is: One thing directly causes another
Reality: Hot weather causes both (confounding variable!)
Proof needed: Controlled experiments or strong theory
Here are the key types of correlation you need to know:
Correlation Type | When to Use | What It Measures |
---|---|---|
Pearson | Linear relationships, normal data | Straight-line relationships |
Spearman | Non-linear relationships, ordinal data | Monotonic relationships (consistently increasing/decreasing) |
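A quick synthetic demonstration of when the two coefficients diverge (the data is simulated here, not drawn from the Pokemon dataset): Pearson tracks straight-line relationships, while Spearman tracks any consistently increasing trend, even a curved one.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 100)
y_linear = 2 * x + rng.normal(0, 1, 100)        # straight-line relationship
y_curved = np.exp(x / 2) + rng.normal(0, 1, 100)  # monotonic but strongly curved

r_lin, _ = pearsonr(x, y_linear)      # Pearson shines on linear data
rho_mono, _ = spearmanr(x, y_curved)  # Spearman captures the monotonic trend
r_curved, _ = pearsonr(x, y_curved)   # Pearson understates the curved trend

print(f"Pearson on linear data:    r  = {r_lin:.2f}")
print(f"Spearman on curved data:   ρ  = {rho_mono:.2f}")
print(f"Pearson on curved data:    r  = {r_curved:.2f}")
```

On the curved data, Spearman's ρ stays close to 1 while Pearson's r drops, which is exactly the “monotonic vs straight-line” distinction in the table above.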
🎯 Practice Preview: Correlation Analysis in Part 4
🔗 Real-World Correlation Analysis
In Part 4, you’ll apply these correlation concepts to discover relationships in Pokemon data:
- Pearson Correlation: “How strongly do Pokemon attack and defense stats correlate?”
- You’ll calculate r values and interpret the strength of linear relationships
- Spearman Correlation: “Do Pokemon types have consistent strength rankings?”
- You’ll test for monotonic relationships when data isn’t linear
- Correlation vs Causation: “Does high attack cause high defense, or are they both influenced by Pokemon level?”
- You’ll identify confounding variables and avoid causal misinterpretations
💡 The Connection: Understanding correlation types helps you choose the right analysis method and interpret results correctly!
🎯 Now let’s bring everything together into a complete analytical strategy!
4. 🚀 Putting It All Together: Your Complete EDA Strategy
Now you have all the analytical tools! Here’s how they work together in a systematic analysis:
4.1 The Complete EDA Journey with Analytical Tools
🔍 Data Understanding
- Classify variables by type (numerical, categorical, datetime)
- Understand the statistical properties of each variable
🧹 Preprocessing Strategy
- Choose appropriate encoding based on variable type
- Apply statistical feature selection (ANOVA, Chi-square)
- Consider the implications for downstream analysis
📊 Statistical Hypothesis Testing
- Select appropriate tests from our complete toolkit (Chi-square, ANOVA, T-test, Mann-Whitney U, Kruskal-Wallis, Moran’s I)
- Always check both significance AND effect size
- Use tests to validate patterns found in visualizations
📈 Advanced Pattern Recognition
- Use correlation analysis to understand relationships
- Apply visualization theory to reveal hidden patterns
- Distinguish between correlation and causation
🗺️ Geographic Analysis (when location data exists)
- Apply Moran’s I for spatial autocorrelation testing
- Identify geographic clusters and patterns
- Integrate spatial insights with other statistical findings
🎯 Dimensionality Strategy
- Apply PCA for complex, high-dimensional datasets (covered in depth in Part 6)
- Compare PCA components with statistical feature importance
- Understand explained variance and information retention
🔬 Scientific Validation
- Interpret results in context of domain knowledge
- Consider practical significance (effect size), not just statistical significance
- Build a coherent analytical narrative supported by evidence
What’s Next?
Now that you understand both the programming tools (from Part 2) and the analytical concepts (from this Part 3), you’re ready for the exciting practical application!
In Part 4, we’ll combine everything:
- Apply pandas programming skills to real Pokemon data
- Use statistical tests to validate patterns we discover
- Evaluate the practical usefulness of each finding
- Create advanced visualizations guided by theory
- Build a complete analytical narrative that guides ML decisions
🎯 Ready for the Real Challenge?: You now have both the programming skills AND the analytical foundation to tackle the Pokemon Team Rocket dataset in Part 4. We’ll use every single tool and concept covered in Parts 2 and 3 to uncover hidden patterns and build insights that matter!
All theoretical concepts and practical implementations are available in our ML Odyssey repository.