
Data Preparation for Machine Learning

Before you can build great models, you need great data. Learn the essential steps to collect, clean, and prepare your data for machine learning.

Why Data Preparation Matters

Data quality sets the upper limit on model accuracy—and bad data quietly destroys performance.

If your dataset has duplicated rows, mislabeled categories, missing values, or inconsistent formats, the model doesn't "fix" these issues; it learns them. For example, if house-price data contains typos like a 2,000 sq ft home recorded as 20,000 sq ft, the model will treat it as a legitimate example and stretch its predictions accordingly. If customer churn labels are incorrect, the model will confidently learn the wrong reasons customers leave. And if timestamps aren't aligned properly, time-series forecasting can drift completely off course.

The algorithm can only detect patterns in what it's given—good or bad. This is why data preparation (cleaning, validating, deduplicating, normalizing) usually determines whether a project succeeds or ends up producing unreliable, misleading predictions.

In this chapter, you'll learn how to:

  • Collect data from various sources (CSV files, databases, APIs)
  • Explore datasets to understand their structure and characteristics
  • Handle missing values, outliers, and inconsistencies
  • Engineer meaningful features from raw data
  • Preprocess data: scaling, encoding, and normalization
  • Split datasets properly for training, validation, and testing

The Complete ML Workflow:

    1. Collect Data        ←  You are here!
         ↓
    2. Explore Data        ←  Understanding what you have
         ↓
    3. Clean Data          ←  Fix problems
         ↓
    4. Feature Engineering ←  Create useful features
         ↓
    5. Preprocess Data     ←  Scale, encode, normalize
         ↓
    6. Split Data          ←  Train/validation/test sets
         ↓
    7. Choose Algorithm    ←  (Next chapter!)
         ↓
    8. Train Model
         ↓
    9. Evaluate & Iterate
          

Step 1: Data Collection

Common Data Sources

Data collection varies based on your application. Here are the most common sources:

Databases

SQL queries, data warehouses, company records

CSV/Excel Files

Spreadsheets, exported data, manual records

APIs & Web Scraping

REST APIs, social media feeds, website data

Sensors & IoT

Cameras, temperature sensors, GPS devices

Surveys & Forms

User input, questionnaires, feedback systems

Public Datasets

Kaggle, UCI repository, government open data

Data Collection Best Practices

Collect More Than You Need

It's easier to filter out data than to go back and collect more later

Document Everything

Record where data came from, when it was collected, and how

Ensure Data Quality

Validate data at collection time - check for obvious errors immediately

Don't Ignore Privacy & Ethics

Always get permission, anonymize sensitive data, follow GDPR/privacy laws

Don't Use Biased Data

If your training data is biased, your model will be too (e.g., training facial recognition only on photos of light-skinned people produces a system that fails on darker skin tones)

import pandas as pd

# From CSV file
data = pd.read_csv('customer_data.csv')

# From Excel
data = pd.read_excel('sales_data.xlsx')

# From database (SQL)
import sqlite3
conn = sqlite3.connect('company.db')
data = pd.read_sql_query("SELECT * FROM customers", conn)

# From API (example: weather data)
import requests
response = requests.get('https://api.weather.com/data')
data = pd.DataFrame(response.json())

# First look at your data
print(data.head())  # First 5 rows
data.info()         # Column types and non-null counts (info() prints directly)
print(data.shape)   # (rows, columns)

Step 2: Exploratory Data Analysis (EDA)

Understanding Your Data

Exploratory Data Analysis (EDA) is the critical first step in understanding your dataset. By examining distributions, relationships, and anomalies, you gain insights that inform every subsequent decision in your machine learning pipeline—from feature engineering to model selection. This systematic exploration reveals patterns, outliers, and data quality issues that must be addressed before training begins.

Key Questions to Answer During EDA:

  • How many rows and columns do I have?
  • What types of data? (numbers, categories, dates)
  • Are there missing values?
  • What's the distribution of values?
  • Are there outliers or unusual values?
  • How are features related to each other?

Common EDA Visualizations:

    Histogram (Distribution):           Box Plot (Outliers):

        Frequency                           Max ─────●
         │                                       │
    20 ──┤     ████                             │
         │     ████                        Q3 ───┬─────┐
    15 ──┤     ████                             │     │
         │ ████████                      Median ─┼─────┤
    10 ──┤ ████████ ████                        │     │
         │ ████████ ████                   Q1 ───┴─────┘
     5 ──┤ ████████ ████ ████                   │
         │ ████████ ████ ████              Min ─────●
     0 ──┴─────────────────────
         0  10  20  30  40  50


    Scatter Plot (Relationships):        Correlation Heatmap:

    y │                                    Feature  A    B    C
      │              ●                       A    1.0  0.8 -0.3
    8 ┤          ●       ●                   B    0.8  1.0 -0.1
      │      ●   ●                           C   -0.3 -0.1  1.0
    6 ┤    ●   ●
      │  ●   ●                            Dark = Strong correlation
    4 ┤●   ●
      └─────────────→ x
          
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
data = pd.read_csv('data.csv')

# 1. Basic Info
print(data.head())           # First few rows
data.info()                  # Data types, null counts (info() prints directly)
print(data.describe())       # Statistics (mean, std, min, max)

# 2. Check for Missing Values
print(data.isnull().sum())   # Count nulls per column

# 3. Visualize Distribution
data['age'].hist(bins=20)
plt.title('Age Distribution')
plt.show()

# 4. Box plot for outliers
data.boxplot(column='salary')
plt.show()

# 5. Correlation heatmap
correlation = data.corr(numeric_only=True)  # numeric columns only
sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.show()

# 6. Scatter plot for relationships
plt.scatter(data['experience'], data['salary'])
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

What to Look For:

Distribution Shape

Is data normal? Skewed? Uniform?

Outliers

Extreme values that don't fit the pattern

Correlations

Which features move together?

Class Balance

For classification: are classes balanced?

Step 3: Data Cleaning

Handling Data Quality Issues

Raw data from real-world sources often contains missing values, duplicates, inconsistencies, and outliers. Systematic data cleaning addresses these quality issues to ensure your models train on reliable information. Each cleaning decision affects downstream model performance, making this a critical phase that requires both statistical rigor and domain knowledge.

Common Data Issues & Solutions:

1. Missing Values

Some cells are empty (NaN, null, None)

    Customer  Age  Income  City
    John      25   50000   NYC
    Sarah     NaN  65000   LA
    Mike      32   NaN     Chicago
    Emma      28   55000   NaN
                

Solutions:

1. Delete Rows

Best when missing data is minimal (less than 5% of rows) and removing rows won't significantly reduce your dataset size or introduce bias.

2. Fill with Mean/Median

Use for continuous numerical features like age, salary, or measurements. Median is more robust to outliers than mean.

3. Fill with Mode

Appropriate for categorical variables like city names, product categories, or any discrete values with clear most-frequent options.

4. Forward/Backward Fill

Ideal for time series data where values change gradually—like stock prices or sensor readings—where the previous/next value is a reasonable estimate.

2. Outliers

Extreme values that don't fit the pattern

    Salaries: [40k, 45k, 50k, 48k, 52k, 1000k ← Outlier!]

    Box Plot Method:
                              ●  ← Outlier (1000k)
                              │
         ┌──────────────┐     │
         │              │     │
    ─────┴──────┬───────┴─────┘
         │      │       │
        Q1   Median    Q3
                

Solutions:

1. IQR Method

Statistical approach that flags values more than 1.5 times the interquartile range below Q1 or above Q3. Works well when the data follows a roughly normal distribution.

2. Z-Score Method

Identifies values more than 3 standard deviations from the mean. Best for normally distributed data with symmetric outliers on both ends.

3. Cap Values

Set maximum/minimum thresholds based on domain knowledge rather than removing data. Preserves all records while limiting extreme influence.

4. Domain Knowledge

Always investigate outliers first. A CEO's $1M salary is valid; a negative age is not. Context determines whether to keep or remove.
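The cleaning code later in this chapter applies the IQR method; as a complement, here is a minimal sketch of the z-score and capping approaches, assuming a pandas DataFrame df with a numeric 'salary' column (the column name is just for illustration).

import numpy as np

# Z-score method: flag values more than 3 standard deviations from the mean
z_scores = (df['salary'] - df['salary'].mean()) / df['salary'].std()
df_no_outliers = df[np.abs(z_scores) <= 3]

# Capping: clip extremes to chosen percentiles instead of dropping rows
lower, upper = df['salary'].quantile([0.01, 0.99])
df['salary_capped'] = df['salary'].clip(lower, upper)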

3. Duplicate Rows

Same record appears multiple times

    ID   Name    Age  City
    1    John    25   NYC
    2    Sarah   30   LA
    1    John    25   NYC  ← Duplicate!
    3    Mike    28   Chicago
                

Solutions:

1. Remove All Duplicates

Standard approach when you want to keep unique records only. Automatically keeps the first occurrence of each duplicate group.

2. Keep First Occurrence

Useful when chronological order matters and the first record represents the initial state or original entry in your system.

3. Keep Last Occurrence

Appropriate when you want the most recent update, like keeping the latest address or phone number for a customer record.

4. Based on Specific Columns

Use when comparing entire rows is too strict: for example, the same user appears with different timestamps, or the same product with varying prices. Deduplicate on the key columns instead (see the sketch below).
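A short sketch of these options with pandas drop_duplicates; the 'customer_id' column is a hypothetical key used for illustration.

df = df.drop_duplicates()                                     # remove exact duplicates (keeps first by default)
df = df.drop_duplicates(keep='last')                          # keep the most recent occurrence instead
df = df.drop_duplicates(subset=['customer_id'], keep='last')  # deduplicate on specific columns only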

4. Inconsistent Data

Same thing represented differently

    Country column has:
    "USA", "U.S.A.", "United States", "US", "usa"

    All mean the same thing!
                

Solutions:

1. Standardize Text

Convert all text to lowercase or uppercase to ensure case-insensitive matching. Essential for text analysis and categorical grouping.

2. Strip Whitespace

Remove accidental spaces from user input that cause "USA" and " USA " to be treated as different values. Always apply before matching.

3. Replace Variants

Create explicit mappings for known variations. Effective when you have a finite set of alternatives like country names or product codes.

4. Regex Patterns

Use regular expressions for systematic transformations like extracting numbers from phone formats or standardizing date strings across different formats.

import pandas as pd
import numpy as np

# Load data
df = pd.read_csv('messy_data.csv')

# 1. Check for missing values
print(df.isnull().sum())

# 2. Handle missing values
# Option A: Drop rows with any missing values
df_clean = df.dropna()

# Option B: Fill numeric columns with mean
df['age'] = df['age'].fillna(df['age'].mean())
df['income'] = df['income'].fillna(df['income'].median())

# Option C: Fill categorical with mode
df['city'] = df['city'].fillna(df['city'].mode()[0])

# 3. Remove duplicates
df = df.drop_duplicates()

# 4. Handle outliers using IQR method
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1

# Remove outliers
df = df[(df['salary'] >= Q1 - 1.5*IQR) &
        (df['salary'] <= Q3 + 1.5*IQR)]

# 5. Standardize text data
df['country'] = df['country'].str.lower().str.strip()
df['country'] = df['country'].replace({
    'u.s.a.': 'usa',
    'united states': 'usa'
})

# 6. Check data types
print(df.dtypes)
df['date'] = pd.to_datetime(df['date'])

print("Clean data shape:", df.shape)

Step 4: Feature Engineering

Creating Better Features

Feature engineering transforms raw data into meaningful inputs that improve model performance. By extracting domain-relevant information, combining related attributes, and creating derived metrics, you give your models the patterns they need to learn effectively. Often, well-engineered features have more impact on accuracy than algorithm selection itself.

Common Feature Engineering Techniques:

Mathematical Transformations

Purpose: Derive meaningful relationships between existing numerical features that capture domain knowledge.

When to use: When you understand how variables relate mathematically (e.g., physics formulas, financial ratios, geometric calculations).

Examples:

  • Area = Length × Width — Captures space better than dimensions alone for real estate pricing
  • BMI = Weight / Height² — Standard health metric more informative than raw weight
  • Profit Margin = (Revenue - Cost) / Revenue — Relative profitability is more comparable across businesses
  • Log(Income) — Reduces skewness in salary data, making distributions more normal
Date/Time Features

Purpose: Extract temporal patterns and cyclical behaviors hidden in timestamp data.

When to use: When behavior varies by time—shopping patterns differ on weekends vs weekdays, sales spike in certain months, traffic varies by hour.

From "2024-03-15 14:30:00" extract:

  • Year — Captures long-term trends and inflation
  • Month/Quarter — Seasonal patterns (holiday shopping, tax season)
  • Day of week — Weekly cycles (weekend vs weekday behavior)
  • Hour — Intraday patterns (morning commute, lunch rush)
  • Is weekend — Binary flag for weekend-specific behavior
  • Days since event — Time elapsed since last purchase, registration, etc.
Text Features

Purpose: Convert unstructured text into numerical signals that capture sentiment, urgency, and content characteristics.

When to use: With customer reviews, support tickets, emails, social media posts, or any text where content style and sentiment matter.

From reviews/comments extract:

  • Word/character count — Detailed reviews may indicate stronger opinions
  • Capital letters ratio — EXCESSIVE CAPS may signal urgency or anger
  • Sentiment score — Positive/negative tone predicts satisfaction
  • Keyword presence — "refund", "broken", "love", "recommend" are strong signals
  • Question marks — Indicates confusion or need for clarification
Binning (Discretization)

Purpose: Convert continuous variables into categorical groups to capture non-linear relationships and reduce noise.

When to use: When the effect isn't linear (e.g., insurance risk doesn't increase smoothly with age—it spikes at certain thresholds).

Age → Age Group:

  • 0-18: Minor — Different legal/purchasing restrictions
  • 19-35: Young Adult — Peak spending on entertainment/tech
  • 36-60: Middle Age — Family-focused purchases, peak earning
  • 60+: Senior — Healthcare focus, retirement spending

Other examples: Income brackets, credit score tiers, temperature ranges

Interaction Features

Purpose: Capture combined effects where two features together have different impact than either alone.

When to use: When relationships depend on context—coffee sales differ by location AND time, not just one factor.

Examples:

  • Location × Time — "NYC_Morning" has different commute patterns than "Suburbs_Morning"
  • Category × Price — "Electronics_Premium" buyers behave differently than "Clothing_Premium"
  • Day × Weather — "Weekend_Sunny" drives outdoor retail, "Weekday_Rain" boosts delivery services
  • Age × Income — Young high-earners vs older high-earners have distinct spending patterns
Aggregation Features

Purpose: Summarize historical behavior into features that represent patterns, trends, and entity-level characteristics.

When to use: With transactional data to create customer/product profiles, or when past behavior predicts future actions.

Customer-level aggregations:

  • Total purchases — Customer lifetime value indicator
  • Average order value — Spending tier identification
  • Days since last purchase — Churn risk signal
  • Purchase frequency — Loyalty and engagement metric
  • Category diversity — Cross-category shoppers vs specialists
  • Trend (recent vs historical avg) — Increasing or decreasing engagement
import pandas as pd
import numpy as np

# 1. Mathematical transformations
df['bmi'] = df['weight'] / (df['height'] ** 2)
df['price_per_sqft'] = df['price'] / df['square_feet']
df['profit'] = df['revenue'] - df['cost']

# 2. Date/time features
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['hour'] = df['date'].dt.hour

# 3. Text features
df['review_length'] = df['review'].str.len()
df['word_count'] = df['review'].str.split().str.len()
df['has_exclamation'] = df['review'].str.contains('!').astype(int)

# 4. Binning
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 18, 35, 60, 100],
                         labels=['Child', 'Young Adult', 'Adult', 'Senior'])

# 5. Interaction features
df['location_time'] = df['city'] + '_' + df['time_of_day']

# 6. Aggregation features (customer-level)
customer_stats = df.groupby('customer_id').agg({
    'purchase_amount': ['sum', 'mean', 'count'],
    'date': lambda x: (pd.Timestamp.now() - x.max()).days
})
customer_stats.columns = ['total_spent', 'avg_purchase',
                          'num_purchases', 'days_since_last']

Step 5: Feature Selection

Choosing the Right Features

Not all features contribute equally to model performance. Irrelevant or redundant features introduce noise, increase computational cost, and can cause overfitting. Feature selection identifies and retains only the most informative variables, improving both accuracy and interpretability while reducing training time.

Why Feature Selection Matters:

Faster Training
Fewer features reduce computational cost and training time, enabling faster experimentation and iteration.

Reduce Overfitting
Removing noisy or irrelevant features prevents models from learning spurious patterns that don't generalize.

Better Interpretability
Smaller feature sets make model decisions easier to understand, audit, and explain to stakeholders.

Better Accuracy
Eliminating redundant and irrelevant features allows models to focus on truly predictive signals.

Feature Selection Methods:

1. Correlation Analysis

Purpose: Identify and remove redundant features that provide the same information.

When to use: When you suspect multiple features measure the same underlying property (e.g., house square footage and number of rooms often correlate highly). Keeping both adds no new information but increases dimensionality.

How it works: Calculate pairwise correlation between all features. If two features have correlation above a threshold (typically 0.9 or 0.95), remove one. This reduces multicollinearity and speeds up training without losing predictive power.
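A minimal sketch of this filter, assuming a numeric-only DataFrame df and a 0.9 threshold:

import numpy as np

corr = df.corr().abs()                                               # absolute pairwise correlations
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))   # keep upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_reduced = df.drop(columns=to_drop)                                # drop one feature from each correlated pair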

2. Variance Threshold

Purpose: Remove features with very low variance that carry minimal information.

When to use: When features have nearly constant values across samples (e.g., a country column where 99% of entries are "USA"). Such features cannot help distinguish between different outcomes and only add computational overhead.

How it works: Calculate variance for each feature. Remove any features where variance falls below a threshold. A feature where all or most values are identical has near-zero variance and provides no discriminative power for predictions.
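In scikit-learn this is VarianceThreshold; a sketch assuming X is a numeric feature DataFrame:

from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.01)        # drop near-constant features
X_reduced = selector.fit_transform(X)
kept_columns = X.columns[selector.get_support()]    # names of the surviving features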

3. Feature Importance (Tree-Based)

Purpose: Rank features by their contribution to prediction accuracy using tree-based models.

When to use: When you want data-driven evidence of which features matter most. Particularly effective for non-linear relationships that correlation analysis might miss. Works well with Random Forest, XGBoost, or LightGBM.

How it works: Train a tree-based model on your data. The model calculates importance scores based on how much each feature reduces prediction error when used for splitting. Features appearing higher in trees and reducing error more get higher importance scores. Keep only the top-scoring features.
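A sketch using a Random Forest's built-in importance scores (X_train and y_train are assumed from an earlier split):

from sklearn.ensemble import RandomForestClassifier
import pandas as pd

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

importances = pd.Series(model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))   # top 10 features by importance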

4. Recursive Feature Elimination (RFE)

Purpose: Systematically identify the optimal subset of features through iterative elimination.

When to use: When you need a specific number of features (e.g., limited by memory or regulatory requirements) and want to find the best combination. More thorough than simple importance ranking because it tests feature subsets together.

How it works: Start with all features and train a model. Rank features by importance and remove the least important one. Retrain the model with remaining features. Repeat until you reach the desired number of features. This accounts for feature interactions that might not be apparent when evaluating features individually.
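A sketch with scikit-learn's RFE, here keeping 10 features with a logistic regression base model (both choices are illustrative):

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)
rfe.fit(X_train, y_train)
selected_features = X_train.columns[rfe.support_]   # rfe.support_ is a boolean mask of kept features
print(selected_features)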

Step 6: Data Preprocessing

Preparing Data for Algorithms

Different machine learning algorithms have specific input requirements and assumptions. Preprocessing ensures your data meets these requirements by normalizing scales, encoding categories as numbers, and handling class imbalance. Proper preprocessing is essential—even the best algorithm will fail with poorly formatted data.

Essential Preprocessing Steps:

1. Feature Scaling

Why? Features on different scales can bias algorithms (e.g., age: 20-80, income: 20000-200000)

Standardization (Z-Score)

What it does: Transforms features to have mean=0 and standard deviation=1. Each value is converted to how many standard deviations it is from the mean.

Use when: Features follow roughly normal distributions. Essential for algorithms sensitive to feature scale like SVM, Logistic Regression, Neural Networks, and KNN. Also preferred when outliers are meaningful and shouldn't be compressed.

Formula: z = (x - mean) / standard_deviation

Min-Max Normalization

What it does: Scales all features to a fixed range, typically [0, 1]. The minimum value becomes 0, the maximum becomes 1, and everything else scales proportionally between them.

Use when: You need bounded ranges (like image pixels already in 0-255). Ideal for Neural Networks and algorithms that don't assume normal distributions. Works well when you know there won't be new values outside the training range.

Formula: x_scaled = (x - min) / (max - min)

Robust Scaling

What it does: Uses median and interquartile range (IQR) instead of mean and standard deviation. Centers data around the median and scales by the range between 25th and 75th percentiles.

Use when: Your data contains many outliers that you want to preserve (unlike outlier removal). Median and IQR are not affected by extreme values, so outliers don't distort the scaling. Better than standardization when data has heavy tails or extreme values.

Formula: x_scaled = (x - median) / IQR

Visual Comparison:
Original:           Standardized:        Min-Max:
[20, 40, 60]        [-1, 0, 1]          [0, 0.5, 1]
[10000, 50000]      [-1, 1]             [0, 1]

Same scale now! ✓
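All three scalers share the same fit/transform API in scikit-learn; a sketch assuming X_train and X_test from an earlier split:

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    X_train_scaled = scaler.fit_transform(X_train)   # learn statistics from training data only
    X_test_scaled = scaler.transform(X_test)         # reuse those statistics on the test set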
              
2. Encoding Categorical Variables

Why? Algorithms need numbers, not text!

Label Encoding

What it does: Assigns each unique category a sequential integer (0, 1, 2, etc.). Simple and memory-efficient.

Use when: Encoding ordinal variables with natural order (Small < Medium < Large, Low < High). Also commonly used for target variables in classification. Tree-based algorithms (Random Forest, XGBoost) can handle label-encoded features well even without order.

Warning: DO NOT use for nominal categories (cities, colors, product names) with non-tree algorithms. The numeric encoding (NYC=0, LA=1, Chicago=2) implies Chicago > LA > NYC, which creates false relationships the model will learn.

One-Hot Encoding

What it does: Creates a separate binary (0/1) column for each category. If a sample belongs to that category, its column gets 1; all other category columns get 0.

Use when: Encoding nominal categories with no inherent order (cities, colors, departments, product types). Essential for linear models, neural networks, and SVM. Each category becomes its own independent feature with equal weight.

Caution: Creates many columns if a feature has hundreds of unique values (high cardinality). For example, 1000 zip codes become 1000 columns. Consider grouping rare categories or using alternative encodings (target encoding, embeddings) for high-cardinality features.

Original:           One-Hot Encoded:
City                City_NYC  City_LA  City_Chicago
NYC                 1         0        0
LA                  0         1        0
Chicago             0         0        1
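A small sketch of both encodings in pandas, using hypothetical 'size' (ordinal) and 'city' (nominal) columns:

import pandas as pd

# Ordinal category: an explicit mapping preserves the natural order
df['size_encoded'] = df['size'].map({'Small': 0, 'Medium': 1, 'Large': 2})

# Nominal category: one-hot encode into binary columns
df = pd.get_dummies(df, columns=['city'])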
                  
3. Handling Class Imbalance

Challenge: 95% class A, 5% class B → model just predicts A!

Example: Imbalanced Dataset
Class Distribution:
Class A (Normal): ████████████████████ 950 samples (95%)
Class B (Fraud):  █ 50 samples (5%)

Result: Model predicts "Normal" for everything → 95% accuracy!
But it misses ALL fraud cases!

Solutions:
1. Random Oversampling

What it does: Randomly duplicates minority class samples until classes are balanced.

Before:  A A A A A A A A A B
After:   A A A A A A A A A B B B B B B B B B B
         └─────────────────┘ └─────────────────┘
         (Keep original)      (Duplicate minority)
                  

Pros: Simple to implement, no data loss from majority class, preserves all information from minority class.

Cons: Risk of overfitting since model sees exact duplicates. Model may memorize specific minority examples rather than learning general patterns.

Use when: Quick baseline for imbalanced data, when minority class has sufficient diversity, or combined with other techniques.

2. Random Undersampling

What it does: Randomly removes majority class samples until classes are balanced.

Before:  A A A A A A A A A B
After:   A B
         └┘ └┘
         (Random sample)  (Keep all)
                  

Pros: Faster training on smaller dataset, reduces risk of overfitting to majority patterns, computationally efficient.

Cons: Discards potentially useful information from majority class. May lose important patterns or edge cases present in removed samples.

Use when: You have enormous amounts of majority class data, training time is critical, or severe imbalance (99%+ majority).

3. SMOTE (Synthetic Minority Over-sampling Technique) ⭐

What it does: Creates synthetic (new, artificial) minority examples by interpolating between existing minority samples. Generates new points along the lines connecting nearby minority samples.

How SMOTE Works:
Step 1: Pick a minority sample (●)
Step 2: Find its K nearest neighbors (○)
Step 3: Draw line between ● and one ○
Step 4: Create new sample (★) somewhere on that line

    ○           Original minority samples: ●, ○
     \
      ★         New synthetic sample: ★
       \        (Combination of ● and ○)
        ●
                  

Example: If sample A has [age=25, income=50k] and sample B has [age=35, income=70k], SMOTE might create [age=30, income=60k]

Pros: No exact duplicates, increases minority class diversity, reduces overfitting compared to random oversampling. Synthetic samples fill gaps in feature space.

Cons: Can create unrealistic or noisy samples, especially near class boundaries. May amplify labeling errors in minority class. Doesn't work well with categorical features.

Best for: Moderate imbalance (60-40 to 80-20 ratios), numerical features, when minority class forms clusters in feature space.
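SMOTE is not part of scikit-learn itself; a common choice is the imbalanced-learn package. A minimal sketch, assuming it is installed (pip install imbalanced-learn) and applied to the training data only:

from collections import Counter
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)  # never resample the test set
print(Counter(y_resampled))   # classes are now balanced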

4. Class Weights

What it does: Modifies the learning algorithm to penalize mistakes on minority class more heavily. Instead of changing the data, adjusts how much the model "cares" about each class during training.

Normal mistake:  Error cost = 1
Fraud mistake:   Error cost = 19  ← 19x more important!

Model learns: "Better to get fraud detection right!"
                  

Pros: No data manipulation needed, preserves original dataset, simple to implement, works with most algorithms (logistic regression, SVM, tree-based models, neural networks).

Cons: Requires algorithm support for sample weights or class weights. May need hyperparameter tuning to find optimal weight ratios. Doesn't add minority class diversity like SMOTE.

Best for: When you can't change the dataset (regulatory requirements), extreme imbalance (99%+), or as first approach before trying sampling methods. Often combined with other techniques.
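In scikit-learn this is typically the class_weight parameter; a sketch with logistic regression (the 19x ratio mirrors the example above, and labels 0 = normal, 1 = fraud are assumed):

from sklearn.linear_model import LogisticRegression

# Let scikit-learn set weights inversely proportional to class frequencies
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train_scaled, y_train)

# Or set the penalty ratio explicitly
model = LogisticRegression(class_weight={0: 1, 1: 19}, max_iter=1000)
model.fit(X_train_scaled, y_train)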

🤔 Which Technique to Use?
  • Small dataset (<10k samples) → Use SMOTE or class weights (don't undersample!)
  • Large dataset (>100k) → Undersampling is OK (fast training)
  • Extreme imbalance (99:1) → Combine techniques (SMOTE + undersampling)
  • Can't change data → Use class weights (no resampling needed)

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
import pandas as pd

# Load clean data
df = pd.read_csv('clean_data.csv')

# Separate features and target
X = df.drop('target', axis=1)
y = df['target']

# 1. Encode categorical variables
categorical_cols = X.select_dtypes(include=['object']).columns
X_encoded = pd.get_dummies(X, columns=categorical_cols)

# 2. Split data BEFORE scaling (important!)
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, random_state=42
)

# 3. Scale features (fit only on training data!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Only transform, don't fit!

# Now ready for modeling!
print("Training shape:", X_train_scaled.shape)
print("Test shape:", X_test_scaled.shape)

Step 7: Train/Test/Validation Split

Splitting Data Properly

Imagine studying for a test by memorizing the exact questions that will be on it. You'd get 100%, but did you really learn? Machine learning models do the same thing! We need separate data to train on and test on to measure real performance.

The Three Datasets:

    Complete Dataset (100%)
    ┌─────────────────────────────────────────────────────┐
    │ All Your Data                                       │
    └─────────────────────────────────────────────────────┘
                      ↓  SPLIT  ↓
    ┌──────────────────────┬────────────┬──────────────┐
    │   Training (60%)     │  Val (20%) │  Test (20%)  │
    │                      │            │              │
    │   Model learns       │  Tune      │  Final       │
    │   from this          │  settings  │  evaluation  │
    └──────────────────────┴────────────┴──────────────┘
          
Training Set (60-80%)

Purpose: Model learns patterns from this data

Usage: Fit your model here

Analogy: Practice problems before the test

Validation Set (10-20%)

Purpose: Tune hyperparameters, select models

Usage: Evaluate during development

Analogy: Practice test to check progress

Test Set (10-20%)

Purpose: Final, unbiased performance metric

Usage: Touch ONCE at the very end

Analogy: The actual exam

⚠️ Critical Rules:

1. Split BEFORE preprocessing!

Fit scalers and encoders on the training set only, then apply the same transformation to the test set, so no test-set information leaks into training

2. NEVER use test data for training

Test set must remain completely unseen until final evaluation

3. Stratify for classification

Maintain class proportions in all splits (if 30% positive in full data, keep 30% in each split)

4. Random seed for reproducibility

Use random_state parameter to get same split every time

5. Time-aware for time series

Don't shuffle time series! Train on past, test on future

from sklearn.model_selection import train_test_split

# Basic split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# With stratification (maintains class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Three-way split (60% train, 20% val, 20% test)
# First split: 80% train+val, 20% test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Second split: 75% of 80% = 60% train, 25% of 80% = 20% val
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

print(f"Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")

Step 8: Cross-Validation

Getting More Reliable Estimates

A single train/test split provides just one performance measurement, which can be misleading if the split happens to be lucky or unlucky. Cross-validation addresses this by systematically testing your model on multiple different data splits, providing a more robust and reliable estimate of how well your model will generalize to unseen data. This technique is essential for understanding the true performance of your model and detecting overfitting before deployment.

K-Fold Cross-Validation (K=5):

    Fold 1: [TEST][train][train][train][train]  → Accuracy: 85%
    Fold 2: [train][TEST][train][train][train]  → Accuracy: 87%
    Fold 3: [train][train][TEST][train][train]  → Accuracy: 83%
    Fold 4: [train][train][train][TEST][train]  → Accuracy: 86%
    Fold 5: [train][train][train][train][TEST]  → Accuracy: 84%
                                                    ──────────
                                   Average:         85% ± 1.4%

    Final: More reliable than single 85% score!
          

Cross-Validation Techniques:

1. K-Fold Cross-Validation

What it does: Divides your dataset into K equal-sized parts (folds). The model trains on K-1 folds and tests on the remaining fold. This process repeats K times, with each fold serving as the test set exactly once. Final performance is the average across all K iterations.

How it works: With K=5, your data splits into 5 parts. First iteration uses folds 2-5 for training and fold 1 for testing. Second iteration uses folds 1,3-5 for training and fold 2 for testing. This continues until all folds have been test sets. You get 5 performance scores that average into your final metric.

When to use: Default choice for most machine learning problems with balanced datasets. Works well when you have moderate to large datasets (1000+ samples) and no special data characteristics like time ordering or severe class imbalance. Common values: K=5 for quick validation, K=10 for more thorough testing.

Trade-offs: Larger K means more training data per fold (better) but more computational cost. K=5 is fast with decent reliability. K=10 is more robust but takes twice as long. Avoid K > 10 unless you have specific reasons.
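A sketch of 5-fold cross-validation with scikit-learn's cross_val_score (the logistic regression model is illustrative; X and y are the full feature matrix and target):

from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)
print(f"Accuracy: {scores.mean():.2f} ± {scores.std():.2f}")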

2. Stratified K-Fold

What it does: Similar to regular K-Fold but ensures each fold maintains the same class distribution as the original dataset. If your dataset has 80% class A and 20% class B, every fold will have approximately this 80-20 split.

Why it matters: Prevents unlucky splits where one fold might accidentally get too many or too few samples of a minority class. Regular K-Fold could create a fold with 90% class A / 10% class B, making that iteration's score unrepresentative. Stratification eliminates this variance.

When to use: Classification problems with imbalanced classes (not 50-50 split). Essential when minority class is less than 30% of data. Also recommended for any classification task as it provides more stable results with no downside. Not applicable to regression problems (no classes to stratify).

Real-world example: Fraud detection with 95% normal / 5% fraud. Regular K-Fold might create one fold with only 2% fraud cases, giving misleading results. Stratified K-Fold ensures every fold has approximately 5% fraud cases for consistent evaluation.
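The same pattern with StratifiedKFold; scoring with F1 instead of accuracy is one illustrative choice for an imbalanced binary target:

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=skf, scoring='f1')
print(f"F1: {scores.mean():.2f} ± {scores.std():.2f}")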

3. Leave-One-Out Cross-Validation (LOOCV)

What it does: Extreme case where K equals the number of samples. Each iteration trains on all data except one sample, then tests on that single excluded sample. If you have 100 samples, you train 100 times, each time holding out a different sample.

When to use: Very small datasets (fewer than 100 samples) where you cannot afford to exclude much training data. Medical studies with 30 patients, rare event analysis, or expensive experiments where data collection is limited. Provides maximum use of available data.

Major drawback: Computationally expensive for large datasets. With 10,000 samples, you train your model 10,000 times. This is prohibitively slow for complex models like deep neural networks or large random forests. Also provides high variance in estimates since each test set is just one sample.

Warning: Only use LOOCV when data is genuinely scarce (N < 100). For larger datasets, use K-Fold with K=10 instead, which provides similar benefits with far less computation.
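For completeness, a LOOCV sketch with scikit-learn, intended only for small datasets (X_small and y_small are placeholder names):

from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(max_iter=1000), X_small, y_small, cv=loo)
print(f"LOOCV accuracy: {scores.mean():.2f}")   # one model trained per held-out sample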

4. Time Series Cross-Validation

What it does: Respects temporal ordering by always training on past data and testing on future data. Never allows the model to see future information during training. Uses expanding or rolling window approach where training set grows or shifts forward with each fold.

How it differs: Standard K-Fold shuffles data randomly, which breaks time dependencies and causes data leakage in time series. If you train on 2023 data and test on 2022 data (which K-Fold might do), you're cheating. Time Series CV ensures you always predict forward in time, just like real-world deployment.

When to use: Any problem where data has temporal ordering: stock prices, sales forecasting, weather prediction, sensor data, user behavior over time. Essential when past events influence future ones or when you need to simulate real-time prediction scenarios.

Implementation approach: Start with first 20% as training, next 20% as test (fold 1). Then use first 40% as training, next 20% as test (fold 2). Continue until all data is used. This mimics how your model will be used in production, making predictions on future unseen data.
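scikit-learn provides TimeSeriesSplit for this; a sketch assuming X and y are pandas objects already sorted by time:

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):        # training window always precedes the test window
    X_tr, X_te = X.iloc[train_idx], X.iloc[test_idx]
    y_tr, y_te = y.iloc[train_idx], y.iloc[test_idx]
    # fit the model on X_tr/y_tr, evaluate on X_te/y_te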

Key Benefits of Cross-Validation:

More Reliable Performance Estimates

Single train/test split gives one data point. Cross-validation provides multiple measurements that average into a more trustworthy metric. Instead of "accuracy is 85%", you get "accuracy is 85% ± 2%" showing both performance and consistency.

Efficient Use of Limited Data

Every sample gets used for both training and testing across different folds. With small datasets, this maximizes information extraction. A single 80-20 split never trains on the held-out 20%; 5-fold CV eventually uses every sample for training (just not within the same fold).

Better Overfitting Detection

If your model gets 95% on training but scores vary wildly across CV folds (70%, 85%, 90%, 75%, 80%), you know it's overfitting. Consistent scores across folds indicate robust learning. High variance across folds is a red flag before you ever deploy.

Model Comparison and Selection

When comparing algorithms, CV prevents lucky/unlucky splits from misleading you. Model A might beat Model B on one random split but lose on 4 out of 5 CV folds. CV reveals which model truly performs better on average, not just on one favorable data sample.

Practical Guidance:

For most projects: Start with 5-fold cross-validation. It balances computational cost with reliability. Use stratified K-fold for classification tasks.

For small datasets (N < 500): Consider 10-fold CV or LOOCV to maximize data usage.

For time series: Always use Time Series CV. Regular K-Fold will give misleadingly optimistic results.

For large datasets (N > 100k): A single well-chosen train/test split may suffice. CV's benefits diminish with abundant data, and computational cost increases.

Best Practices & Common Pitfalls

Don't Make These Mistakes!

⚠️ Common Pitfalls to Avoid:

Data Leakage

What happens: Information from test set leaks into training

Examples:

  • Scaling before splitting (fit on all data including test)
  • Feature selection using all data
  • Using future information to predict past (time series)

Solution: Always split first, then preprocess!

Ignoring Class Imbalance

What happens: 95% class A, model just predicts A for everything

Solution: Use stratified split, oversample/undersample, adjust class weights

Not Handling Missing Values Properly

What happens: Blindly filling with mean can introduce bias

Solution: Understand WHY data is missing. Sometimes missingness is informative!

Removing All Outliers

What happens: Some outliers are real and important (CEO salary, rare disease)

Solution: Investigate outliers with domain knowledge before removing

One-Hot Encoding High Cardinality Features

What happens: 1000 zip codes → 1000 new columns!

Solution: Use target encoding, embedding, or group rare categories

Not Setting random_state

What happens: Results change every time you run code

Solution: Always set random_state for reproducibility

✨ Golden Rules for Data Preparation:

1. Document Everything: Keep notes on what you changed and why

2. Split Before Preprocessing: Avoid data leakage at all costs

3. Understand Your Data: EDA before making decisions

4. Domain Knowledge Matters: Statistics + context = good decisions

5. Start Simple: Try basic cleaning before complex transformations

6. Validate Results: Always check if preprocessing improved model performance

With clean, prepared data in place, the next step is selecting and training appropriate algorithms to extract insights and make predictions from your dataset.

Next: Machine Learning Algorithms →