
ML Odyssey: Part 2 - Data Manipulation & Visualization with Python

Note: This is Part 2 of our EDA series, focusing on the essential programming tools for data analysis. After mastering these hands-on skills, check out Part 3 for the statistical concepts that power effective data analysis. 🚀

💻 Hands-On Code Execution

Want to run all the code examples from this blog post?
Follow along with the complete Jupyter notebook!

📓 Open Jupyter Notebook →

✨ All code examples are ready to run and experiment with!

As a developer transitioning into machine learning, you’ll spend significant time exploring and understanding your data before training any models. This post focuses on the programming fundamentals—the essential coding skills with Python’s data science stack that form the foundation of every ML project.

  1. 📊 Data Processing: NumPy & Pandas
  2. 📈 Data Visualization: Matplotlib & Seaborn
  3. 🤖 Traditional ML: Scikit-learn
  4. 🧠 Deep Learning: PyTorch
🎯 Today’s Focus: We’re mastering the foundation of any successful ML project—understanding your data through processing and visualization. These skills are essential before any model training begins!

🗺️ Your Learning Roadmap: From EDA Theory to Practical Programming

Before we dive into the details, here’s the logical progression we’ll follow to build your data manipulation skills:

📚 The Complete Learning Journey

  1. 🔍 EDA Foundation (Section 1): Understanding what EDA is and why it matters
  2. 🧹 Data Preparation (Section 2): Learning how to clean and prepare your data
  3. 📊 Data Processing (Section 3): Mastering pandas for data manipulation
  4. 📈 Advanced Pandas (Section 4): Summarizing and reshaping data with pivot tables and grouping
  5. 🎨 Data Visualization (Section 5): Creating insights with matplotlib and seaborn

💡 Each section builds on the previous one, creating a systematic approach from theory to practice!


🔍 Now let’s start with the foundation: understanding what EDA is and why it matters!

1. 🔍 What is Exploratory Data Analysis (EDA)?

Before diving into the tools, let’s understand what Exploratory Data Analysis really means and why it’s the cornerstone of successful machine learning projects.

Exploratory Data Analysis (EDA) is the critical first step in any data science project where you investigate, visualize, and summarize your dataset to understand its main characteristics, often using visual methods. Think of it as getting to know your data before asking it to perform in a machine learning model.

1.1 The EDA Workflow: Your Data Investigation Process

EDA follows a systematic approach to understanding your data. Here’s the typical workflow that data scientists follow:

🗺️ The Complete EDA Journey

  1. First Look 📊

    • Data shape, types, and structure
    • Quick overview of what you’re working with
  2. Data Quality Assessment 🔍

    • Missing values and how much
    • Duplicates and inconsistencies
    • Data type mismatches
  3. Univariate Analysis 📈

    • Distribution of individual variables
    • Summary statistics for each feature
    • Identify outliers and anomalies
  4. Bivariate Analysis 🔗

    • Relationships between pairs of variables
    • Correlations and associations
    • Feature interactions
  5. Multivariate Analysis 🌐

    • Complex relationships between multiple variables
    • Patterns across the entire dataset
    • Feature importance insights
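
To make this concrete, here’s a minimal sketch of how each stage maps onto typical pandas/seaborn calls. It uses seaborn’s built-in tips dataset purely as a stand-in for your own data:

import seaborn as sns
import matplotlib.pyplot as plt

# Stand-in dataset; replace with your own DataFrame
df = sns.load_dataset('tips')

# 1. First Look: shape, types, and a quick peek at the rows
print(df.shape)
print(df.dtypes)
print(df.head())

# 2. Data Quality Assessment: missing values and duplicates
print(df.isnull().sum())
print("Duplicate rows:", df.duplicated().sum())

# 3. Univariate Analysis: distribution of a single variable
print(df['total_bill'].describe())
sns.histplot(data=df, x='total_bill')
plt.show()

# 4. Bivariate Analysis: relationship between two variables
sns.scatterplot(data=df, x='total_bill', y='tip')
plt.show()

# 5. Multivariate Analysis: correlations across all numeric features
print(df.select_dtypes(include='number').corr())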

1.2 Critical Questions EDA Helps Answer

Before training any ML model, EDA helps you answer these fundamental questions:

Essential EDA Questions

  • Data Quality: Are there missing values? Outliers? Inconsistent formats?
  • Feature Relationships: Which features are correlated? Which ones might be redundant?
  • Data Distribution: Is the data balanced? Are there skewed distributions?
  • Target Insights: How does your target variable behave? What influences it?
  • Feature Engineering: What new features could be created from existing ones?
  • Model Selection: What type of algorithm might work best for this data?

💡 Why EDA Matters for ML Success: Without proper EDA, you’re flying blind. It prevents the “garbage in, garbage out” problem and helps you choose the right preprocessing steps, algorithms, and evaluation metrics for your specific dataset.

1.3 Key Data Concepts for EDA

Before we dive into the tools, let’s establish the fundamental vocabulary you’ll use throughout your data analysis journey. Understanding these concepts is essential for effective EDA.

📊

Dataset

A collection of structured data, typically organized in rows and columns like a spreadsheet or database table.

Example: Customer information with age, income, purchases, and preferences
🏷️

Feature

Individual measurable properties of the observed phenomena. Also called input variables, attributes, or columns.

Example: In Pokemon data: Attack, Defense, Speed, Type
🎯

Label

The target variable you want to predict or classify. Also called target, dependent variable, or output.

Example: Is Pokemon legendary? Email spam/not spam? House price?

1.4 Quick Visual Example

Here’s how these concepts work together in a real dataset:

# Example: Pokemon dataset structure
import pandas as pd

# Sample Pokemon dataset
pokemon_data = pd.DataFrame({
    'name': ['Pikachu', 'Charizard', 'Bulbasaur'],      # Identifier (not typically a feature)
    'attack': [55, 84, 49],                             # Feature 1
    'defense': [40, 78, 49],                            # Feature 2  
    'speed': [90, 100, 45],                             # Feature 3
    'type': ['Electric', 'Fire', 'Grass'],              # Feature 4 (categorical)
    'is_legendary': [False, False, False]               # Label (what we want to predict)
})

print("Dataset shape:", pokemon_data.shape)  # (3 rows, 6 columns)
print("\nFeatures (input):", ['attack', 'defense', 'speed', 'type'])
print("Label (target):", 'is_legendary')

🔍 EDA Focus: During exploratory data analysis, you’ll examine each feature’s distribution, relationships between features, and how features relate to your label. This understanding guides your ML model choices!
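
Continuing with the pokemon_data DataFrame defined above, here’s a quick sketch of what that examination might look like:

# Distribution and summary statistics of each numeric feature
print(pokemon_data[['attack', 'defense', 'speed']].describe())

# Relationship between two features
print("Attack-Defense correlation:", pokemon_data['attack'].corr(pokemon_data['defense']))

# How a feature relates to the label (share of legendaries per type)
print(pokemon_data.groupby('type')['is_legendary'].mean())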

1.5 The EDA Stack

NumPy: The Numerical Foundation 📚 Documentation
  • The backbone of scientific computing in Python
  • Provides efficient array operations and mathematical functions
  • Powers many higher-level libraries like pandas and PyTorch
  • Essential for handling numerical computations efficiently
Pandas: Your Data Wrangler 📚 Documentation
  • Built on top of NumPy for structured data handling
  • Think of it as "Excel on steroids" for Python
  • Perfect for loading, cleaning, and transforming data
  • Adds labels and indexes to NumPy arrays
SciPy: Scientific Computing Powerhouse 📚 Documentation
  • Built on NumPy for advanced scientific computing
  • Provides statistical tests (ANOVA, Chi-square, t-tests)
  • Essential for hypothesis testing in EDA
  • Validates patterns found through visualization
Matplotlib: The Visualization Foundation 📚 Documentation
  • The grandfather of Python plotting libraries
  • Gives you precise control over plot elements
  • Creates publication-quality figures
  • Powers many higher-level plotting libraries
Seaborn: Statistical Visualization Made Easy 📚 Documentation
  • Built on top of matplotlib
  • Provides beautiful, modern statistical graphics
  • Integrates perfectly with pandas
  • Makes complex visualizations accessible

These libraries complement each other perfectly:

  • NumPy handles the numerical heavy lifting
  • Pandas provides data structures and manipulation tools
  • Matplotlib gives you visualization building blocks
  • Seaborn adds statistical insight and polish
  • SciPy provides scientific validation and statistical testing

1.6 Why This Combination?

The Complete EDA Toolkit Integration

🧰 How Our Tools Work Together in EDA:

  1. 📊 NumPy → Efficient numerical computations for data processing
  2. 🗃️ Pandas → Load, clean, and manipulate your datasets
  3. 📈 Matplotlib → Create base visualizations to see patterns
  4. 🎨 Seaborn → Generate beautiful statistical graphics
  5. 🔬 SciPy → Validate insights with statistical tests

🔄 The EDA Workflow:

  • Explore with Pandas → Visualize with Matplotlib/Seaborn → Validate with SciPy
  • Example: Find a pattern in your data → Create a plot to show it → Use a statistical test to confirm it’s significant!

This combination gives you both visual insights and scientific rigor!

For example, here’s how they work together:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

# Raw numerical data (named so it doesn't shadow the 'stats' module alias above)
raw_stats_data = np.array([[55, 40], [84, 78], [49, 49]])

# Create a structured DataFrame
data = pd.DataFrame(
    raw_stats_data,
    columns=['Attack', 'Defense'],
    index=['Pikachu', 'Charizard', 'Bulbasaur']
)

# Add Pokemon types
data['Type'] = ['Electric', 'Fire/Flying', 'Grass/Poison']

# Display the DataFrame
print("Full DataFrame:\n", data)

# --- Visualization: Attack Stats Bar Plot ---
sns.set_theme(style="whitegrid")  # Apply a clean seaborn theme
plt.figure(figsize=(8, 5))

# Bar plot of attack stats; passing hue='index' with legend=False lets us use a
# palette without triggering seaborn's FutureWarning about palette-without-hue
sns.barplot(data=data.reset_index(), x='index', y='Attack', hue='index', legend=False, palette='viridis')
plt.title('Pokemon Attack Stats', fontsize=16)
plt.xlabel('Pokemon', fontsize=12)
plt.ylabel('Attack Stat', fontsize=12)
plt.xticks(rotation=0)  # Keep labels horizontal
plt.tight_layout()
plt.show()

# --- Descriptive Statistics ---
print("\nAverage Stats:")
print(f"Mean Attack: {np.mean(data['Attack']):.1f}")
print(f"Standard Deviation Attack: {np.std(data['Attack']):.1f}") # Clarified for which stat

# --- Statistical Validation ---
# Example: are attack and defense significantly correlated? (Pearson test from scipy.stats)
correlation, p_value = stats.pearsonr(data['Attack'], data['Defense'])
print(f"\nStatistical Validation:")
print(f"Attack-Defense correlation: {correlation:.3f}")
print(f"P-value: {p_value:.3f} ({'Significant' if p_value < 0.05 else 'Not significant'} relationship)")

Let’s explore each library in detail and see how they can help us understand our Pokemon dataset…

2. 🧹 Data Preprocessing: From Messy to ML-Ready

Before diving into pandas operations, let’s understand how to prepare your data for analysis. Real-world data is messy, and proper preprocessing is crucial for successful machine learning!

2.1 Understanding Variable Types

🔍 The Great Variable Classification

  • 🔢 Numerical: Numbers you can do math with (e.g., Age, Price, Temperature). How ML sees it: ✅ Ready to use!
  • 📝 Categorical (Nominal): Categories with no order (e.g., Color, City, Pokemon Type). How ML sees it: ❌ Needs encoding
  • 📊 Categorical (Ordinal): Categories with meaningful order (e.g., Low/Medium/High, Grades). How ML sees it: ❌ Needs careful encoding
  • 📅 Datetime: Dates and times (e.g., Birth date, Transaction time). How ML sees it: 🔄 Extract features first
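
Here’s a small sketch showing how pandas reports each of these types, and how a datetime column gets turned into usable numeric features (the column names are purely illustrative):

import pandas as pd

# Toy dataset with one column of each variable type
df = pd.DataFrame({
    'age': [25, 32, 47],                                  # numerical
    'city': ['Seoul', 'Tokyo', 'Berlin'],                 # categorical (nominal)
    'grade': ['low', 'high', 'medium'],                   # categorical (ordinal)
    'signup': pd.to_datetime(['2024-01-05', '2024-02-10', '2024-03-20'])  # datetime
})

# How pandas sees each column
print(df.dtypes)

# Datetime columns: extract numeric features before modeling
df['signup_month'] = df['signup'].dt.month
df['signup_dayofweek'] = df['signup'].dt.dayofweek
print(df[['signup_month', 'signup_dayofweek']])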

2.2 The Encoding Dilemma: How to Convert Categories to Numbers

🏷️ Label Encoding

What it does: Assigns numbers to categories (Red=0, Blue=1, Green=2)

Best for: Ordinal data with natural order

⚠️ Warning: ML might think Blue > Red because 1 > 0!

🎯 One-Hot Encoding

What it does: Creates one binary column per category (Red → 1,0,0; Blue → 0,1,0)

Best for: Nominal data with no natural order

⚠️ Warning: Can create many columns when a feature has high cardinality!

💡 Pro Tip: Always keep a copy of your original data! Encoding is often irreversible, and you might need to try different approaches.
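
Here’s a minimal sketch of both approaches using plain pandas (pd.get_dummies for one-hot encoding and an explicit mapping for the ordinal column; scikit-learn’s encoders work too, but aren’t needed here). The column names are illustrative:

import pandas as pd

df = pd.DataFrame({
    'color': ['Red', 'Blue', 'Green', 'Blue'],   # nominal: no natural order
    'size': ['Low', 'High', 'Medium', 'Low']     # ordinal: has a natural order
})

# Keep a copy of the original data before encoding
df_original = df.copy()

# Label encoding for the ordinal column: map each category to its rank
size_order = {'Low': 0, 'Medium': 1, 'High': 2}
df['size_encoded'] = df['size'].map(size_order)

# One-hot encoding for the nominal column: one binary column per category
df = pd.get_dummies(df, columns=['color'], prefix='color')

print(df)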


📊 Now let’s dive into pandas to see these concepts in action!

3. 📊 Pandas in 10 Minutes

Let’s explore the essential features of pandas that you’ll use in 90% of your data analysis tasks:

3.1 Creating Data

Pandas makes it easy to create labeled one-dimensional arrays (Series) and two-dimensional tables (DataFrames) from scratch or from existing data. This is the foundation for all data analysis in pandas.

import pandas as pd
import numpy as np

# Creating a Series (1D array with labels)
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print("Series example:\n", s)

# Creating a DataFrame (2D table)
dates = pd.date_range('20250101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4),
                  index=dates,
                  columns=['A', 'B', 'C', 'D'])
print("\nDataFrame example:\n", df)

3.2 Viewing Data

Pandas provides convenient methods to quickly inspect your data, check its structure, and get summary statistics, helping you understand what you’re working with.

# Quick overview
print("First 5 rows:\n", df.head())
print("\nDataFrame info:\n", df.info())
print("\nQuick statistics:\n", df.describe())

# Index and columns
print("\nIndex:", df.index)
print("Columns:", df.columns)

3.3 Selection and Indexing

You can easily select columns, rows, or specific values using labels, positions, or boolean conditions. This flexibility makes slicing and dicing your data straightforward.

# Getting a column
print("Column 'A':\n", df['A'])

# Selecting by position
print("First 3 rows:\n", df[:3])

# Selection by label
print("By labels:\n", df.loc['20250102':'20250104'])

# Selection by position
print("By position:\n", df.iloc[3:5, 0:2])

# Boolean indexing
print("Values > 0:\n", df[df > 0])

3.4 Operations

Pandas supports a wide range of operations, from basic statistics to applying custom functions and grouping data for aggregation—making data analysis both powerful and concise.

# Statistics
print("Mean by column:\n", df.mean())
print("Mean by row:\n", df.mean(axis=1))

# Applying functions
print("\nApply custom function:")
print(df.apply(lambda x: x.max() - x.min()))

# Grouping
df['E'] = ['one', 'one', 'two', 'three', 'two', 'one']
grouped = df.groupby('E')
print("\nGroup sums:\n", grouped.sum())

3.5 Handling Missing Data

Real-world data is often incomplete. Pandas offers simple ways to detect, remove, or fill in missing values so you can clean your datasets efficiently.

# Create some missing data
df2 = df.copy()
df2.iloc[0:2, 0] = np.nan

# Drop rows with missing data
print("Drop NA rows:\n", df2.dropna())

# Fill missing data
print("Fill NA with 0:\n", df2.fillna(0))

3.6 Merging and Reshaping

Combining multiple datasets and reshaping tables is a breeze with pandas, allowing you to join, concatenate, and merge data just like in SQL or Excel.

# Concatenating
pieces = [df[:2], df[2:4], df[4:]]
print("Concatenated:\n", pd.concat(pieces))

# Merging
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
print("\nMerged:\n", pd.merge(left, right, on='key'))

💡 Pro Tip: These are the most common pandas operations you’ll use daily. For more advanced features, check the pandas documentation.

4. 📈 Advanced Pandas with Pivot Tables

What is a Pivot Table? Think of it as a smart way to reorganize and summarize your data. Instead of looking at hundreds of individual rows, pivot tables let you quickly see patterns by grouping and aggregating data in a cross-tabulated format.

🔄 The Magic: Pivot tables take long, repetitive data and transform it into a concise summary table where you can easily compare values across different categories.

Let’s see this transformation in action:

import pandas as pd
import numpy as np

# Create sample sales data - this is your "normal" table
np.random.seed(42)  # For consistent results
sales_data = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=20, freq='D'),
    'product': np.random.choice(['Laptop', 'Phone', 'Tablet'], 20),
    'sales': np.random.randint(50, 150, 20),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 20),
    'salesperson': np.random.choice(['Alice', 'Bob', 'Charlie'], 20)
})

print("📊 ORIGINAL DATA (first 10 rows):")
print(sales_data.head(10))
print(f"\nTotal rows: {len(sales_data)}")

4.1 The Problem with Normal Tables

Looking at the raw data above, can you quickly answer these questions?

  • Which product sells best in each region?
  • What’s the average sales per product?
  • Which region performs better overall?

It’s hard to see patterns when data is in long format! 🤔

4.2 The Pivot Table Solution

# 🔥 PIVOT TABLE MAGIC: Transform rows into a summary
pivot_table = sales_data.pivot_table(
    values='sales',           # What to summarize
    index='product',          # Rows: Group by product
    columns='region',         # Columns: Split by region  
    aggfunc='mean',          # How to summarize: average
    margins=True             # Add totals
)

print("📈 PIVOT TABLE - Average Sales by Product and Region:")
print(pivot_table.round(1))

# Compare with manual grouping (more verbose, and without the built-in totals row/column)
print("\n🔄 Similar result using traditional grouping (more complex):")
manual_summary = sales_data.groupby(['product', 'region'], observed=False)['sales'].mean().unstack(fill_value=0)
print(manual_summary.round(1))

4.3 Pivot Table Benefits vs Normal Tables

❌ Normal Table Challenges

  • Hard to spot patterns in long data
  • Need complex groupby operations
  • Difficult to compare across categories
  • Requires multiple steps for insights

✅ Pivot Table Advantages

  • Instant visual patterns and comparisons
  • One line of code for complex summaries
  • Easy to spot highest/lowest values
  • Built-in totals and subtotals

4.4 Real-World Pivot Table Examples

# Example 1: Multiple metrics in one table
multi_metric_pivot = sales_data.pivot_table(
    values='sales',
    index='product',
    columns='region',
    aggfunc=['mean', 'sum', 'count'],  # Multiple aggregations!
    margins=True
)
print("📊 Multiple metrics (mean, sum, count):")
print(multi_metric_pivot)

# Example 2: Time-based analysis
sales_data['month'] = sales_data['date'].dt.month
time_pivot = sales_data.pivot_table(
    values='sales',
    index='month',
    columns='product',
    aggfunc='sum',
    fill_value=0
)
print("\n📅 Sales by month and product:")
print(time_pivot)

🎯 When to Use Pivot Tables: Perfect for answering questions like “What’s the average X by Y and Z?” or “How do sales compare across regions and products?” They transform analysis questions that would take multiple steps into single, readable operations.
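
For instance, using the sales_data DataFrame created above, “average sales by region and salesperson” becomes a single call:

# "What's the average sales by region and salesperson?"
region_person_pivot = sales_data.pivot_table(
    values='sales',
    index='region',
    columns='salesperson',
    aggfunc='mean'
)
print(region_person_pivot.round(1))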

4.5 Advanced Grouping Techniques

# 1. Rolling windows for trends
sales_data['rolling_avg'] = sales_data.groupby('product', observed=False)['sales'].rolling(7).mean().reset_index(0, drop=True)

# 2. String operations
sales_data['product_category'] = sales_data['product'].str.upper() + '_CATEGORY'

# 3. Advanced groupby with multiple functions
summary = sales_data.groupby('product', observed=False)['sales'].agg(['mean', 'std', 'count', 'sum'])
print("\nAdvanced groupby:\n", summary)

💡 Pro Tip: Pivot tables are incredibly powerful for reshaping data. They’re like Excel pivot tables but with the full power of Python behind them!


📈 Now let’s learn how to visualize our data and create compelling insights!

5. 📈 Advanced Matplotlib and Seaborn

5.1 Finding the Right Visualization

Let’s see how all these tools work together in a practical example. We’ll analyze restaurant tip data to uncover insights:

# Complete data exploration pipeline
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load real data (seaborn includes several datasets)
tips = sns.load_dataset('tips')

# Quick exploration pipeline
print("Dataset shape:", tips.shape)
print("\nData types:\n", tips.dtypes)
print("\nMissing values:\n", tips.isnull().sum())
print("\nBasic statistics:\n", tips.describe())

# Create insights with pandas + seaborn
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Distribution of tips
sns.histplot(data=tips, x='tip', ax=axes[0,0])
axes[0,0].set_title('Distribution of Tips')

# Relationship between total bill and tip
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='time', ax=axes[0,1])
axes[0,1].set_title('Bill vs Tip by Time')

# Average tip by day and time
tip_summary = tips.groupby(['day', 'time'], observed=False)['tip'].mean().unstack()
sns.heatmap(tip_summary, annot=True, fmt='.2f', ax=axes[1,0])
axes[1,0].set_title('Average Tip by Day and Time')

# Box plot showing tip distribution by day
sns.boxplot(data=tips, x='day', y='tip', ax=axes[1,1])
axes[1,1].set_title('Tip Distribution by Day')

plt.tight_layout()
plt.show()

# Generate insights using pandas
print("\n🔍 Key Insights:")
print(f"• Average tip: ${tips['tip'].mean():.2f}")
print(f"• Best tipping day: {tips.groupby('day', observed=False)['tip'].mean().idxmax()}")
print(f"• Tip percentage: {(tips['tip'] / tips['total_bill'] * 100).mean():.1f}%")

This workflow demonstrates the power of combining pandas for data manipulation with seaborn for visualization — you can quickly move from raw data to actionable insights.

5.2 Statistical Insights with Seaborn

Seaborn shines when you need to understand statistical relationships in your data. Let’s explore the famous iris dataset:

# Using seaborn for statistical insights
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load a dataset with interesting relationships
iris = sns.load_dataset('iris')

# Create a comprehensive statistical overview
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. Correlation heatmap
correlation_matrix = iris.select_dtypes(include=[np.number]).corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, ax=axes[0,0])
axes[0,0].set_title('Feature Correlations')

# 2. Distribution comparison
sns.violinplot(data=iris, x='species', y='petal_length', ax=axes[0,1])
axes[0,1].set_title('Petal Length Distribution by Species')

# 3. Statistical regression
sns.regplot(data=iris, x='petal_length', y='petal_width', ax=axes[1,0])
axes[1,0].set_title('Petal Length vs Width (with regression)')

# 4. Categorical relationships
sns.boxplot(data=iris, x='species', y='sepal_width', ax=axes[1,1])
axes[1,1].set_title('Sepal Width by Species')

plt.tight_layout()
plt.show()

# The famous pairplot - shows all relationships at once
# (pairplot creates its own figure, so no extra plt.figure() call is needed)
sns.pairplot(iris, hue='species', height=2.5)
plt.suptitle('Iris Dataset: All Variable Relationships', y=1.02)
plt.show()

# Statistical insights from the analysis
print("🔍 Key insights from our analysis:")
print("• Strong correlation between petal length and width (r=0.96)")
print("• Clear species separation based on petal measurements")
print("• Setosa species has distinctly different characteristics")
print("• Petal features are more discriminative than sepal features")

🎯 Why This Matters: This type of exploratory data analysis is crucial before building ML models. Understanding feature relationships helps you choose the right algorithms and preprocessing steps.

5.3 Time Series Analysis Preview

Since time series data is common in real-world projects, here’s a taste of pandas’ time series capabilities:

# Time series analysis with pandas
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Create sample time series data
dates = pd.date_range('2023-01-01', '2024-12-31', freq='D')
ts_data = pd.DataFrame({
    'date': dates,
    'value': np.cumsum(np.random.randn(len(dates))) + 100,
    'category': np.random.choice(['A', 'B'], len(dates))
})

# Set date as index for time series operations
ts_data.set_index('date', inplace=True)

# Time series specific operations
ts_data['month'] = ts_data.index.month
ts_data['weekday'] = ts_data.index.day_name()
ts_data['rolling_7d'] = ts_data['value'].rolling(7).mean()
ts_data['rolling_30d'] = ts_data['value'].rolling(30).mean()

# Visualization
fig, axes = plt.subplots(2, 1, figsize=(12, 8))

# Time series plot with multiple rolling averages
ts_data[['value', 'rolling_7d', 'rolling_30d']].plot(ax=axes[0])
axes[0].set_title('Time Series with Rolling Averages')
axes[0].legend(['Original', '7-day Average', '30-day Average'])

# Monthly patterns
monthly_avg = ts_data.groupby('month')['value'].mean()
sns.barplot(x=monthly_avg.index, y=monthly_avg.values, ax=axes[1])
axes[1].set_title('Average Values by Month')
axes[1].set_xlabel('Month')

plt.tight_layout()
plt.show()

print("📅 Time series insights:")
print(f"• Data spans {len(ts_data)} days")
print(f"• Monthly peak: {monthly_avg.idxmax()}")
print(f"• Trend: {'Rising' if ts_data['value'].iloc[-30:].mean() > ts_data['value'].iloc[:30].mean() else 'Falling'}")

What’s Next?

In Part 3, we’ll dive deep into the statistical concepts that power effective data analysis:

  • Statistical testing and significance
  • Advanced analytical techniques
  • When and why to use different methods
  • The theory behind correlation, spatial analysis, and dimensionality reduction

Then in Part 4 and 5, we’ll combine both programming skills AND statistical knowledge:

  1. Apply these pandas techniques to real Pokemon data
  2. Use advanced visualizations to reveal hidden patterns
  3. Create a complete data analysis workflow
  4. Build insights that guide machine learning decisions