Chapter 1

Python & ML Libraries: Your Toolkit

Python is the most widely used language for machine learning. This chapter introduces the essential libraries—NumPy, Pandas, Matplotlib, Scikit-learn, and TensorFlow/PyTorch—explaining what each one does and when to use them.

Why Python?

Python is the dominant language for machine learning and data science for several reasons:

  • Readable syntax: Python code is written in a clear, English-like syntax that's easy to learn and maintain.
  • Rich ecosystem: Extensive libraries (NumPy, Pandas, Scikit-learn, TensorFlow, PyTorch) provide pre-built functionality for ML tasks.
  • Industry adoption: Used by major tech companies (Google, Meta, Microsoft) and throughout the ML research community.
  • Large community: Extensive documentation, tutorials, and support through forums like Stack Overflow and Reddit.

What is a Python Library?

A library is a collection of pre-written code that provides specific functionality. Instead of writing code from scratch for common tasks, libraries offer ready-made functions and tools.

Example: Instead of writing and testing your own code to calculate the average of a list of numbers, you can call a single NumPy function: np.mean()

How libraries work:

  1. Import: Load the library into the program (import numpy)
  2. Use: Call its functions to perform tasks (numpy.mean([1, 2, 3]))
  3. Benefit: Save time and use tested, optimized code
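
The three steps above can be sketched in a few lines. This is only a minimal illustration: the hand-written average helper is hypothetical and exists purely for comparison with np.mean.

```python
import numpy as np  # Step 1: import the library (np is the conventional alias)


def average(values):
    """Hand-written equivalent, shown only for comparison (hypothetical helper)."""
    return sum(values) / len(values)


scores = [1, 2, 3]

manual_result = average(scores)    # our own code
library_result = np.mean(scores)   # Step 2: call the library's function

# Step 3: both give the same answer, but np.mean is already tested and optimized
print(manual_result, library_result)
```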

For machine learning, Python has specialized libraries that handle data manipulation, visualization, training models, and deployment. The following five libraries are essential for ML development.

The 5 Essential Libraries

Machine learning projects follow a common workflow: load data, clean it, explore patterns, train a model, and evaluate results. Each step requires specialized tools.

The following five libraries work together to handle this workflow. Think of them as tools in a toolbox—each has a specific purpose, and they're designed to work seamlessly with one another.

Quick overview:

  • NumPy: Fast numerical computing with arrays and matrices
  • Pandas: Data loading, cleaning, and manipulation
  • Matplotlib: Creating charts and visualizations
  • Scikit-learn: Training and evaluating ML models
  • TensorFlow/PyTorch: Deep learning and neural networks

The sections below explain what each library does, when to use it, and provide code examples.

1. NumPy

Fast numerical computing with arrays and matrices

What is NumPy?

NumPy is a super-fast calculator that works on entire lists of numbers at once.

Instead of Python looping through numbers one-by-one (slow), NumPy hands the entire operation to optimized, compiled C code that runs the loop internally (often 10-100× faster).

The magic happens through N-dimensional arrays:

Vectors (1D arrays):

A sequence of numbers. Example: [25, 50000, 720] representing [age, income, credit score] for one person.

Matrices (2D arrays):

Rows and columns of numbers. Your entire dataset is typically a matrix—each row is one person, each column is one feature.

Higher dimensions (3D, 4D, etc.):

Images are 3D arrays (height × width × color channels). Videos are 4D arrays (frames × height × width × channels). These are called tensors.
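
As a rough sketch of how these dimensions look in code, the snippet below builds one array of each kind and prints its ndim and shape. The second person's values, the 32×32 image size, and the 10-frame clip are made-up illustration values.

```python
import numpy as np

# Vector: one person described by [age, income, credit score]
person = np.array([25, 50000, 720])

# Matrix: one row per person, one column per feature (second row is made up)
dataset = np.array([
    [25, 50000, 720],
    [40, 82000, 690],
])

# Tensors: a hypothetical 32x32 RGB image and a 10-frame clip of such images
image = np.zeros((32, 32, 3), dtype=np.uint8)      # height x width x channels
video = np.zeros((10, 32, 32, 3), dtype=np.uint8)  # frames x height x width x channels

for name, arr in [("person", person), ("dataset", dataset),
                  ("image", image), ("video", video)]:
    print(f"{name}: ndim={arr.ndim}, shape={arr.shape}")
```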

Why NumPy is fast: Python lists store scattered references to objects in memory. NumPy arrays store all data in one continuous block, and NumPy's core operations are written in C (a compiled, low-level language). This combination allows NumPy to process entire arrays at once—instead of Python looping through items one by one—making it 10-100x faster for numerical computations.

Here's this speed difference in action:

# Python way (slow): Loop through each element
numbers = [1, 2, 3, 4, 5]
result = []
for num in numbers:
    result.append(num * 2)
# Takes roughly 0.1 seconds for 1 million numbers (varies by machine)

# NumPy way (fast): Process entire array at once
import numpy as np
numbers = np.array([1, 2, 3, 4, 5])
result = numbers * 2
# Takes roughly 0.001 seconds for 1 million numbers (often ~100x faster; varies by machine)
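
Timings like these depend on hardware and library versions, so it is worth measuring them yourself. The sketch below uses the standard-library timeit module on one million numbers. The function names are illustrative, and the exact speed-up you see will differ.

```python
import timeit

import numpy as np

n = 1_000_000
py_numbers = list(range(n))
np_numbers = np.arange(n)


def double_with_loop():
    """Double every element with an explicit Python loop."""
    result = []
    for num in py_numbers:
        result.append(num * 2)
    return result


def double_with_numpy():
    """Double every element with one vectorized NumPy operation."""
    return np_numbers * 2


# Average over 5 runs each to smooth out noise
loop_seconds = timeit.timeit(double_with_loop, number=5) / 5
numpy_seconds = timeit.timeit(double_with_numpy, number=5) / 5

print(f"Python loop: {loop_seconds:.4f} s per run")
print(f"NumPy:       {numpy_seconds:.4f} s per run")
print(f"Speed-up:    {loop_seconds / numpy_seconds:.0f}x")
```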

Arrays: The Core Data Structure

A NumPy array (ndarray) is a grid of values, all of the same type. Arrays can be 1D (vector), 2D (matrix), or higher dimensions.

import numpy as np

# 1D array (vector) - like a single row or column
prices = np.array([100, 150, 200, 250])
print(prices.shape)  # (4,) - 4 elements

# 2D array (matrix) - like a spreadsheet
data = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
print(data.shape)  # (2, 3) - 2 rows, 3 columns

# 3D array - like stacked matrices (images, video)
images = np.array([
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]]
])
print(images.shape)  # (2, 2, 2)

Creating Arrays: Common Methods

import numpy as np

# From a Python list
arr = np.array([1, 2, 3, 4, 5])

# Array of zeros (useful for initialization)
zeros = np.zeros((3, 4))  # 3 rows, 4 columns of zeros

# Array of ones
ones = np.ones((2, 3))    # 2 rows, 3 columns of ones

# Range of numbers (like Python's range)
sequence = np.arange(0, 10, 2)  # [0, 2, 4, 6, 8]

# Evenly spaced numbers
spaced = np.linspace(0, 1, 5)   # [0.0, 0.25, 0.5, 0.75, 1.0]

# Random numbers (very common in ML)
random_arr = np.random.rand(3, 3)  # 3x3 array of random numbers [0, 1)
normal = np.random.randn(100)       # 100 random numbers from normal distribution

# Identity matrix (diagonal of 1s)
identity = np.eye(3)  # Used in linear algebra

Indexing and Slicing: Accessing Data

Indexing works like Python lists, but extends to multiple dimensions.

import numpy as np

# 1D indexing (like Python lists)
arr = np.array([10, 20, 30, 40, 50])
print(arr[0])      # 10 (first element)
print(arr[-1])     # 50 (last element)
print(arr[1:4])    # [20, 30, 40] (slice: start:end)

# 2D indexing
data = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])
print(data[0, 0])      # 1 (row 0, col 0)
print(data[1, 2])      # 6 (row 1, col 2)
print(data[0])         # [1, 2, 3] (entire first row)
print(data[:, 0])      # [1, 4, 7] (entire first column)
print(data[0:2, 1:3])  # [[2, 3], [5, 6]] (submatrix)

# Boolean indexing (very powerful for filtering)
arr = np.array([1, 2, 3, 4, 5, 6])
mask = arr > 3         # [False, False, False, True, True, True]
filtered = arr[mask]   # [4, 5, 6]
# Or in one line:
filtered = arr[arr > 3]  # [4, 5, 6]

Array Operations: Element-wise Math

Operations apply to every element automatically (no loops needed).

import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# Arithmetic operations (element-wise)
print(arr + 10)     # [11, 12, 13, 14, 15]
print(arr * 2)      # [2, 4, 6, 8, 10]
print(arr ** 2)     # [1, 4, 9, 16, 25] (squared)

# Operations between arrays (same shape)
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
print(arr1 + arr2)  # [5, 7, 9]
print(arr1 * arr2)  # [4, 10, 18] (element-wise multiplication)

# Mathematical functions
arr = np.array([1, 4, 9, 16])
print(np.sqrt(arr))      # [1., 2., 3., 4.]
print(np.log(arr))       # Natural logarithm
print(np.exp(arr))       # e^x
print(np.sin(arr))       # Sine

Broadcasting: Operating on Different Shapes

Broadcasting allows NumPy to work with arrays of different shapes during arithmetic operations.

import numpy as np

# Adding a scalar to an array (broadcasts the scalar)
arr = np.array([1, 2, 3, 4])
result = arr + 10  # 10 is "broadcast" to [10, 10, 10, 10]
print(result)  # [11, 12, 13, 14]

# Broadcasting with 2D arrays
matrix = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
row = np.array([10, 20, 30])

# Add row to each row of matrix
result = matrix + row
print(result)
# [[11, 22, 33],
#  [14, 25, 36]]

# Normalizing data (common in ML)
data = np.array([[1, 2], [3, 4], [5, 6]])
mean = data.mean(axis=0)  # Mean of each column
std = data.std(axis=0)    # Std of each column
normalized = (data - mean) / std  # Broadcasting!

Essential Functions for Machine Learning

import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Statistical functions
print(np.mean(data))      # 5.5 (average)
print(np.median(data))    # 5.5 (middle value)
print(np.std(data))       # 2.87 (standard deviation)
print(np.var(data))       # 8.25 (variance)
print(np.min(data))       # 1 (minimum)
print(np.max(data))       # 10 (maximum)

# Aggregation along axes (for 2D)
matrix = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
print(matrix.sum(axis=0))  # [5, 7, 9] (sum each column)
print(matrix.sum(axis=1))  # [6, 15] (sum each row)

# Reshaping (critical for ML)
arr = np.array([1, 2, 3, 4, 5, 6])
reshaped = arr.reshape(2, 3)  # Convert to 2x3 matrix
# [[1, 2, 3],
#  [4, 5, 6]]

# Flattening (opposite of reshape)
flat = reshaped.flatten()  # [1, 2, 3, 4, 5, 6]

# Stacking arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
vstacked = np.vstack([a, b])  # Vertical: [[1,2,3], [4,5,6]]
hstacked = np.hstack([a, b])  # Horizontal: [1,2,3,4,5,6]

When to use NumPy in ML:

  • Data storage: scikit-learn works directly with NumPy arrays, and TensorFlow and PyTorch can convert NumPy arrays to and from their own tensor types
  • Feature engineering: Creating new features from existing data using mathematical operations
  • Data preprocessing: Normalization, standardization, reshaping data
  • Mathematical computations: Linear algebra, statistics, matrix operations
  • Memory efficiency: NumPy arrays use less memory than Python lists

Real Machine Learning Example

Here's how NumPy is used in a typical ML preprocessing pipeline:

import numpy as np

# Simulated dataset: House prices
# Columns: [square_feet, bedrooms, age_years, price]
raw_data = np.array([
    [1500, 3, 10, 300000],
    [2000, 4, 5, 400000],
    [1200, 2, 20, 250000],
    [1800, 3, 8, 350000]
])

# Separate features (X) and target (y)
X = raw_data[:, 0:3]  # First 3 columns (features)
y = raw_data[:, 3]     # Last column (target)

# Normalize features (important for ML algorithms)
# Formula: (value - mean) / std
X_mean = X.mean(axis=0)
X_std = X.std(axis=0)
X_normalized = (X - X_mean) / X_std

print("Original features:\n", X)
print("\nNormalized features:\n", X_normalized)
print("\nTarget values:", y)

# Now X_normalized and y are ready for ML training!

Best Practices

  • Avoid loops: Write arr * 2 instead of looping. NumPy operations are 10-100× faster because they run in compiled C code.
  • Slicing creates a window, not a copy: When you write subset = arr[0:3], both subset and arr point to the same memory. Change one, the other changes too. Use subset = arr[0:3].copy() to create an independent copy. Think of it like two people sharing a document (view) vs. each person having their own copy.
  • Use appropriate data types: dtype=np.float32 uses half the memory of float64. For large datasets, this can help avoid "out of memory" errors, at the cost of some numerical precision.
  • Double-check your axis: axis=0 aggregates down the rows (one result per column), axis=1 aggregates across the columns (one result per row). Getting this wrong silently produces incorrect results.
  • Broadcasting (NumPy's automatic shape matching): NumPy compares array shapes from right to left. Each pair of dimensions must either (1) be the same size, or (2) include a dimension of size 1. Size-1 dimensions get "stretched" to match. Example: (3, 1) + (3, 4) → (3, 4) works. (4,) + (3, 4) also works (it is treated as (1, 4) + (3, 4) → (3, 4)). But (3,) + (3, 4) fails, because the trailing dimensions differ (3 ≠ 4) and neither is 1; (2, 3) + (3, 4) fails for the same reason. Think of it like auto-repeating values to fill missing dimensions.

Why this matters: NumPy is the universal data format for ML. Pandas DataFrames convert to NumPy arrays, scikit-learn models accept NumPy arrays, and deep learning frameworks build on top of similar array concepts. Mastering NumPy fundamentals is essential for all ML work in Python.
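
A few of the best practices above are easy to check interactively. The following is a minimal sketch, not a complete test of NumPy's behavior. It assumes NumPy 1.20 or newer for np.broadcast_shapes, and the array sizes are arbitrary.

```python
import numpy as np

# 1. Slices are views: writing through the slice changes the original array
arr = np.array([1, 2, 3, 4, 5])
view = arr[0:3]
view[0] = 99
print(arr)                        # the first element of arr has changed too

independent = arr[0:3].copy()     # .copy() makes an independent array
independent[1] = -1
print(arr)                        # arr is unaffected this time

# 2. float32 needs half the bytes of float64
big64 = np.zeros(1_000_000, dtype=np.float64)
big32 = big64.astype(np.float32)
print(big64.nbytes, big32.nbytes)

# 3. Broadcasting: shapes are compared from the right
print(np.broadcast_shapes((3, 1), (3, 4)))   # compatible
print(np.broadcast_shapes((4,), (3, 4)))     # compatible
try:
    np.broadcast_shapes((2, 3), (3, 4))      # trailing dimensions 3 and 4 clash
except ValueError as err:
    print("incompatible:", err)
```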

2. Pandas

Data manipulation and analysis in Python

What is Pandas?

Pandas lets you work with data like a spreadsheet, but with code instead of mouse clicks.

It's built on NumPy but adds row/column labels, making it easy to filter, sort, group, and transform real-world datasets. Think Excel meets Python's power and automation.

Pandas works with two main data structures:

Series (1D labeled array):

A single column of data with labels. Example: stock prices over time—each price has a date label.

DataFrame (2D labeled table):

Multiple columns with row and column labels. Think of it as a spreadsheet where you can reference data by meaningful names instead of just positions.

Why Pandas matters: In real-world ML projects, a large share of the time (60-80% is a commonly cited estimate) is spent on data preparation: loading files, handling missing values, filtering rows, creating new features, and merging datasets. Pandas makes these operations intuitive and efficient.

Core Data Structures: Series and DataFrames

Series: A 1-dimensional labeled array (like a single column in a spreadsheet)

import pandas as pd
import numpy as np

# Create a Series
prices = pd.Series([100, 150, 200, 250], index=['Mon', 'Tue', 'Wed', 'Thu'])
print(prices)
# Mon    100
# Tue    150
# Wed    200
# Thu    250

# Access by index
print(prices['Wed'])  # 200
print(prices.mean())  # 175.0

DataFrame: A 2-dimensional labeled table (like an entire spreadsheet)

import pandas as pd

# Create a DataFrame
data = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 60000, 70000, 55000],
    'Department': ['Sales', 'IT', 'IT', 'Sales']
})

print(data)
#       Name  Age  Salary Department
# 0    Alice   25   50000      Sales
# 1      Bob   30   60000         IT
# 2  Charlie   35   70000         IT
# 3    Diana   28   55000      Sales

# DataFrame properties
print(data.shape)        # (4, 4) - 4 rows, 4 columns
print(data.columns)      # Column names
print(data.dtypes)       # Data types of each column

Loading Data from Various Sources

import pandas as pd

# From CSV file (most common)
df = pd.read_csv('data.csv')

# From Excel file
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# From JSON
df = pd.read_json('data.json')

# From SQL database
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM table_name', conn)

# From dictionary (for testing)
df = pd.DataFrame({
    'col1': [1, 2, 3],
    'col2': ['a', 'b', 'c']
})

# From NumPy array
import numpy as np
arr = np.array([[1, 2], [3, 4]])
df = pd.DataFrame(arr, columns=['A', 'B'])

Exploring and Inspecting Data

Always start by understanding the data structure and content.

import pandas as pd

# Load example data
df = pd.read_csv('sales.csv')

# First and last rows
print(df.head())       # First 5 rows
print(df.tail(10))     # Last 10 rows

# Dataset info
df.info()              # Prints column names, types, non-null counts (returns None, so no print() needed)
print(df.shape)        # (rows, columns)
print(df.columns)      # Column names

# Summary statistics (for numerical columns)
print(df.describe())   # count, mean, std, min, max, quartiles

# Check for missing values
print(df.isnull().sum())  # Count of missing values per column
print(df.isna().sum())    # Same as isnull()

# Value counts for categorical columns
print(df['category'].value_counts())  # Frequency of each category

# Unique values
print(df['product'].unique())         # Array of unique values
print(df['product'].nunique())        # Count of unique values

Selecting and Filtering Data

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'Salary': [50000, 60000, 70000, 55000, 65000],
    'Department': ['Sales', 'IT', 'IT', 'Sales', 'HR']
})

# Select single column (returns Series)
ages = df['Age']

# Select multiple columns (returns DataFrame)
subset = df[['Name', 'Salary']]

# Select rows by index
first_row = df.iloc[0]        # First row
first_three = df.iloc[0:3]    # First 3 rows

# Select by label
row = df.loc[0]               # Row with index 0
specific = df.loc[0:2, ['Name', 'Age']]  # Rows labeled 0 through 2 (end label included), specific columns

# Boolean filtering (most common)
high_salary = df[df['Salary'] > 60000]
it_dept = df[df['Department'] == 'IT']

# Multiple conditions (use & for AND, | for OR)
young_high_earners = df[(df['Age'] < 30) & (df['Salary'] > 55000)]
sales_or_it = df[(df['Department'] == 'Sales') | (df['Department'] == 'IT')]

# isin() for multiple values
selected_depts = df[df['Department'].isin(['IT', 'HR'])]

# String operations
starts_with_a = df[df['Name'].str.startswith('A')]
contains_li = df[df['Name'].str.contains('li')]
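
One detail that often trips people up is that iloc slices by position and excludes the end, while loc slices by label and includes it. The sketch below makes the difference visible by giving the DataFrame a hypothetical non-default index (10, 20, 30, 40).

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie", "Diana"], "Age": [25, 30, 35, 28]},
    index=[10, 20, 30, 40],   # hypothetical labels, chosen to differ from positions
)

by_position = df.iloc[0:2]    # positions 0 and 1 (end position excluded)
by_label = df.loc[10:30]      # labels 10, 20 and 30 (end label included)

print(by_position)
print(by_label)
```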

Handling Missing Data

Missing values (NaN, None, null) are common in real-world data and must be handled before ML.

import pandas as pd
import numpy as np

# Create data with missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

# Check for missing values
print(df.isnull().sum())  # Count per column

# Drop rows with ANY missing values
df_dropped = df.dropna()

# Drop rows where ALL values are missing
df_dropped = df.dropna(how='all')

# Drop columns with missing values
df_dropped = df.dropna(axis=1)

# Fill missing values with a constant
df_filled = df.fillna(0)

# Fill with column mean (common for numerical data)
df['A'] = df['A'].fillna(df['A'].mean())

# Forward fill (use previous value); fillna(method='ffill') is deprecated in recent pandas
df_filled = df.ffill()

# Backward fill (use next value); fillna(method='bfill') is deprecated in recent pandas
df_filled = df.bfill()

# Fill different columns differently
df_filled = df.fillna({
    'A': df['A'].mean(),
    'B': df['B'].median()
})
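
To see what each strategy actually does, the sketch below applies a few of them to the same small DataFrame and prints the results. It is an illustration only. Which strategy is appropriate depends on the data and the model.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [5, np.nan, np.nan, 8],
    "C": [9, 10, 11, 12],
})

rows_kept = df.dropna()              # keeps only rows without any gap
cols_kept = df.dropna(axis=1)        # keeps only columns without any gap
mean_filled = df.fillna(df.mean())   # fills each column with its own mean
forward = df.ffill()                 # carries the previous value forward

print(len(df), "rows ->", len(rows_kept), "rows after dropna()")
print("columns without gaps:", list(cols_kept.columns))
print(mean_filled)
print(forward)
```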

Data Manipulation and Transformation

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000]
})

# Create new column
df['Tax'] = df['Salary'] * 0.3
df['Net_Salary'] = df['Salary'] - df['Tax']

# Apply function to column
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Senior')

# Apply function to entire row
def categorize(row):
    if row['Salary'] > 60000:
        return 'High'
    else:
        return 'Low'
df['Salary_Category'] = df.apply(categorize, axis=1)

# Rename columns
df_renamed = df.rename(columns={'Name': 'Employee_Name', 'Age': 'Years'})

# Drop columns
df_dropped = df.drop(['Tax'], axis=1)

# Sort values
df_sorted = df.sort_values('Salary', ascending=False)
df_sorted = df.sort_values(['Age', 'Salary'])  # Multiple columns

# Remove duplicates
df_unique = df.drop_duplicates()
df_unique = df.drop_duplicates(subset=['Name'])  # Based on specific column

GroupBy: Aggregating Data

GroupBy is one of the most powerful Pandas features—similar to SQL's GROUP BY.

import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'IT', 'IT', 'HR'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Salary': [50000, 60000, 70000, 65000, 55000],
    'Age': [25, 30, 35, 28, 32]
})

# Group by department and calculate mean
dept_avg = df.groupby('Department')['Salary'].mean()
print(dept_avg)
# Department
# HR       55000.0
# IT       67500.0
# Sales    55000.0

# Multiple aggregations
dept_stats = df.groupby('Department').agg({
    'Salary': ['mean', 'min', 'max'],
    'Age': 'mean'
})

# Group by and count
dept_counts = df.groupby('Department').size()

# Multiple grouping columns
grouped = df.groupby(['Department', 'Age'])['Salary'].sum()

# Apply custom function
def salary_range(x):
    return x.max() - x.min()
salary_ranges = df.groupby('Department')['Salary'].apply(salary_range)
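
The separate aggregations above can also be combined into one flat summary table. This sketch uses pandas' named aggregation syntax; the output column names (headcount, avg_salary, salary_range) are illustrative choices, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    "Department": ["Sales", "Sales", "IT", "IT", "HR"],
    "Employee": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "Salary": [50000, 60000, 70000, 65000, 55000],
})

# Each keyword is an output column: (input column, aggregation)
summary = df.groupby("Department").agg(
    headcount=("Employee", "count"),
    avg_salary=("Salary", "mean"),
    salary_range=("Salary", lambda s: s.max() - s.min()),
).reset_index()

print(summary)
```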

Merging and Joining DataFrames

Combine multiple datasets (like SQL JOINs).

import pandas as pd

# Two separate DataFrames
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'dept_id': [10, 20, 10, 30]
})

departments = pd.DataFrame({
    'dept_id': [10, 20, 30],
    'dept_name': ['Sales', 'IT', 'HR']
})

# Inner join (only matching rows)
merged = pd.merge(employees, departments, on='dept_id', how='inner')

# Left join (all from left, matching from right)
merged = pd.merge(employees, departments, on='dept_id', how='left')

# Outer join (all rows from both)
merged = pd.merge(employees, departments, on='dept_id', how='outer')

# Concatenate DataFrames vertically (stack rows)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
combined = pd.concat([df1, df2], ignore_index=True)

# Concatenate horizontally (add columns)
combined = pd.concat([df1, df2], axis=1)
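
The join types only differ when some keys fail to match, which the data above never exercises. The sketch below uses a modified, hypothetical employee table in which one dept_id (40) has no matching department, and passes indicator=True to show where each row came from.

```python
import pandas as pd

employees = pd.DataFrame({
    "name": ["Alice", "Bob", "Eve"],
    "dept_id": [10, 20, 40],        # 40 has no matching department (hypothetical)
})
departments = pd.DataFrame({
    "dept_id": [10, 20, 30],        # 30 has no employees
    "dept_name": ["Sales", "IT", "HR"],
})

inner = pd.merge(employees, departments, on="dept_id", how="inner")
left = pd.merge(employees, departments, on="dept_id", how="left", indicator=True)
outer = pd.merge(employees, departments, on="dept_id", how="outer")

print(len(inner), len(left), len(outer))   # row counts differ per join type
print(left)                                # Eve's dept_name is missing (NaN)
```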

Feature Engineering for ML

Creating new features from existing data to improve model performance.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'date': ['2024-01-15', '2024-02-20', '2024-03-10'],
    'price': [100, 150, 120],
    'quantity': [5, 3, 7],
    'category': ['A', 'B', 'A']
})

# Convert date to datetime
df['date'] = pd.to_datetime(df['date'])

# Extract date features
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['day_of_week'] = df['date'].dt.dayofweek  # 0=Monday, 6=Sunday

# Create interaction features
df['total_revenue'] = df['price'] * df['quantity']

# Binning continuous variables
df['price_category'] = pd.cut(df['price'],
                               bins=[0, 100, 150, 200],
                               labels=['Low', 'Medium', 'High'])

# One-hot encoding (categorical to numerical)
df_encoded = pd.get_dummies(df, columns=['category'], prefix='cat', dtype=int)
print(df_encoded)
# Creates new columns: cat_A, cat_B with 0s and 1s (without dtype=int, recent pandas versions produce True/False)

# Label encoding (for ordinal categories)
category_mapping = {'A': 0, 'B': 1, 'C': 2}
df['category_encoded'] = df['category'].map(category_mapping)
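
Two details of the transformations above are worth checking on small inputs: pd.cut includes the right edge of each bin by default, and get_dummies produces 0/1 integers only when asked to via dtype. The values below are made up to sit exactly on the bin edges.

```python
import pandas as pd

# Bins are (0, 100], (100, 150], (150, 200]: the right edge belongs to the bin
prices = pd.Series([100, 101, 150, 151])
binned = pd.cut(prices, bins=[0, 100, 150, 200], labels=["Low", "Medium", "High"])
print(binned.tolist())

# One-hot encoding with explicit integer output
categories = pd.DataFrame({"category": ["A", "B", "A"]})
encoded = pd.get_dummies(categories, columns=["category"], prefix="cat", dtype=int)
print(encoded)
```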

When to use Pandas in ML:

  • Data loading: Import data from CSV, Excel, SQL, JSON, and other formats
  • Exploratory Data Analysis (EDA): Understand data distribution, patterns, and quality
  • Data cleaning: Handle missing values, duplicates, outliers, and inconsistencies
  • Feature engineering: Create new features, transform existing ones, encode categories
  • Data aggregation: Group data and compute statistics (GroupBy operations)
  • Data merging: Combine multiple data sources into a single dataset
  • Preprocessing: Prepare data before converting to NumPy arrays for ML models

Real Machine Learning Example

Complete data preprocessing pipeline for a customer churn prediction model:

import pandas as pd
import numpy as np

# 1. Load data
df = pd.read_csv('customer_data.csv')

# 2. Initial exploration
print(f"Dataset shape: {df.shape}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nData types:\n{df.dtypes}")

# 3. Handle missing values
df['age'] = df['age'].fillna(df['age'].median())        # Assign back; inplace=True on a selected column is deprecated
df['income'] = df['income'].fillna(df['income'].mean())
df.dropna(subset=['customer_id'], inplace=True)  # Drop if ID is missing

# 4. Remove duplicates
df.drop_duplicates(subset=['customer_id'], keep='first', inplace=True)

# 5. Feature engineering
# Calculate customer lifetime value
df['lifetime_value'] = df['monthly_charges'] * df['tenure_months']

# Create age groups
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 50, 100], labels=['Young', 'Middle', 'Senior'])

# One-hot encode categorical variables (dtype=int gives 0/1 columns instead of True/False)
df_encoded = pd.get_dummies(df, columns=['gender', 'contract_type'], drop_first=True, dtype=int)

# 6. Filter relevant data
# Keep only active customers for analysis
df_active = df_encoded[df_encoded['status'] == 'Active']

# 7. Prepare for ML
# Separate features (X) and target (y)
feature_cols = ['age', 'income', 'tenure_months', 'lifetime_value', 'gender_M', 'contract_type_Monthly']
X = df_active[feature_cols]
y = df_active['churn']  # Target variable (1=churned, 0=retained)

# 8. Convert to NumPy for ML library
X_array = X.to_numpy(dtype=float)  # to_numpy() is the recommended replacement for .values
y_array = y.to_numpy()

print(f"\nFinal dataset ready for ML:")
print(f"Features shape: {X_array.shape}")
print(f"Target shape: {y_array.shape}")

# Now ready to feed into scikit-learn or other ML libraries!
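
The pipeline above depends on a customer_data.csv file that is not included here. As a runnable stand-in, the sketch below applies a shortened version of the same steps to a tiny made-up DataFrame. All values are hypothetical, and only a subset of the columns is used.

```python
import numpy as np
import pandas as pd

# Tiny made-up dataset standing in for customer_data.csv (all values hypothetical)
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [25, np.nan, np.nan, 47, 61],
    "monthly_charges": [30.0, 55.0, 55.0, 80.0, 20.0],
    "tenure_months": [12, 3, 3, 40, 60],
    "contract_type": ["Monthly", "Monthly", "Monthly", "Yearly", "Yearly"],
    "churn": [0, 1, 1, 0, 0],
})

# Clean: drop duplicate customers, then fill missing ages with the median
df = df.drop_duplicates(subset=["customer_id"], keep="first").reset_index(drop=True)
df["age"] = df["age"].fillna(df["age"].median())

# Engineer a feature and one-hot encode the contract type as 0/1
df["lifetime_value"] = df["monthly_charges"] * df["tenure_months"]
df = pd.get_dummies(df, columns=["contract_type"], drop_first=True, dtype=int)

# Separate features and target, then convert to NumPy
feature_cols = ["age", "tenure_months", "lifetime_value", "contract_type_Yearly"]
X = df[feature_cols].to_numpy(dtype=float)
y = df["churn"].to_numpy()

print(X.shape, y.shape)
```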

Best Practices

  • Use vectorized operations: Avoid Python loops; use Pandas built-in functions
  • Check dtypes: Ensure columns have correct data types (int, float, datetime, category)
  • Use inplace carefully: inplace=True modifies the original DataFrame and can't be undone, and it rarely saves memory; reassigning (df = df.dropna()) is usually clearer
  • Chain operations: Use method chaining for cleaner code (df.dropna().sort_values())
  • Profile memory: Use df.memory_usage() to check memory consumption for large datasets
  • Category dtype: Convert repeated string columns to 'category' dtype to save memory

Why this matters: Pandas is the bridge between raw data and ML models. It handles the messy, real-world data that comes from databases, spreadsheets, and APIs. Mastering Pandas means spending less time fighting with data and more time building models. Many ML projects on tabular data start with Pandas.
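
Two of the practices above, the category dtype and method chaining, can be tried on a synthetic DataFrame. This is a rough sketch: the column names and sizes are made up, and the exact byte counts depend on the pandas version.

```python
import pandas as pd

# 100,000 rows with only three distinct department names (synthetic data)
df = pd.DataFrame({
    "department": ["Sales", "IT", "HR", "IT"] * 25_000,
    "salary": [50000, 60000, 55000, 70000] * 25_000,
})

# Compare memory before and after converting to 'category'
before = df["department"].memory_usage(deep=True)
after = df["department"].astype("category").memory_usage(deep=True)
print(f"department column: {before:,} bytes -> {after:,} bytes as category")

# Method chaining: each step returns a new DataFrame, so steps read top to bottom
top_paid = (
    df.drop_duplicates()
      .sort_values("salary", ascending=False)
      .head(2)
)
print(top_paid)
```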

3. Matplotlib

Data visualization and plotting in Python

What is Matplotlib?

Matplotlib turns numbers into pictures—charts, graphs, and visualizations that make patterns visible.

It's Python's core plotting library. Before building any ML model, you visualize your data to spot outliers, see distributions, and understand relationships. Matplotlib makes that possible.

Common visualization types for ML:

Line plots and scatter plots:

Show trends over time or relationships between variables. Essential for exploring correlations and understanding how features relate to your target variable.

Histograms and distributions:

Reveal data distribution, outliers, and whether features are normally distributed—critical for choosing the right ML algorithms.

Heatmaps and subplots:

Compare multiple features at once, visualize correlation matrices, and create complex multi-panel figures for analysis.

Why visualization matters: Before training models, you must understand your data. Visualizations reveal patterns, outliers, skewed distributions, and missing data that statistics alone can't show. Every ML workflow includes exploratory data analysis (EDA) through visualization.

Two Ways to Use Matplotlib

Matplotlib has two interfaces: pyplot (simple, MATLAB-style) and object-oriented (more control, recommended for complex plots).

import matplotlib.pyplot as plt
import numpy as np

# Style 1: pyplot interface (simple, good for quick plots)
plt.plot([1, 2, 3, 4], [1, 4, 2, 3])
plt.title('Simple Plot')
plt.show()

# Style 2: Object-oriented interface (recommended)
fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [1, 4, 2, 3])
ax.set_title('OO Plot')
plt.show()

# The OO style gives more control, especially with subplots

Essential Plot Types

1. Line Plot - Show trends over time or continuous data

import matplotlib.pyplot as plt
import numpy as np

# Sample data
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)

# Create line plot
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(x, y1, label='sin(x)', linewidth=2, color='blue')
ax.plot(x, y2, label='cos(x)', linewidth=2, color='red', linestyle='--')
ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_title('Line Plot Example')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()

2. Scatter Plot - Show relationships between two variables

import matplotlib.pyplot as plt
import numpy as np

# Generate data
np.random.seed(42)
x = np.random.randn(100)
y = 2 * x + np.random.randn(100) * 0.5

# Create scatter plot
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(x, y, alpha=0.6, s=50, c='blue', edgecolors='black')
ax.set_xlabel('Feature X')
ax.set_ylabel('Target Y')
ax.set_title('Scatter Plot: Relationship between X and Y')
ax.grid(True, alpha=0.3)
plt.show()

3. Bar Chart - Compare categories

import matplotlib.pyplot as plt

# Data
categories = ['Model A', 'Model B', 'Model C', 'Model D']
accuracies = [0.85, 0.92, 0.88, 0.95]

# Create bar chart
fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.bar(categories, accuracies, color=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728'])
ax.set_ylabel('Accuracy')
ax.set_title('Model Performance Comparison')
ax.set_ylim(0, 1.0)

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{height:.2f}', ha='center', va='bottom')
plt.show()

4. Histogram - Show data distribution

import matplotlib.pyplot as plt
import numpy as np

# Generate data from normal distribution
data = np.random.normal(100, 15, 1000)

# Create histogram
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(data, bins=30, color='skyblue', edgecolor='black', alpha=0.7)
ax.set_xlabel('Value')
ax.set_ylabel('Frequency')
ax.set_title('Distribution of Test Scores')
ax.axvline(data.mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {data.mean():.1f}')
ax.legend()
plt.show()

5. Box Plot - Show distribution and outliers

import matplotlib.pyplot as plt
import numpy as np

# Generate data for multiple groups
data1 = np.random.normal(100, 10, 100)
data2 = np.random.normal(90, 20, 100)
data3 = np.random.normal(110, 15, 100)

# Create box plot
fig, ax = plt.subplots(figsize=(10, 6))
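ax.boxplot([data1, data2, data3])
ax.set_xticklabels(['Group A', 'Group B', 'Group C'])  # Works across versions; boxplot's labels= argument was renamed tick_labels in Matplotlib 3.9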
ax.set_ylabel('Values')
ax.set_title('Distribution Comparison Across Groups')
ax.grid(True, alpha=0.3, axis='y')
plt.show()

Customization: Making Plots Beautiful

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
y = np.sin(x)

fig, ax = plt.subplots(figsize=(12, 6))

# Plot with customization
ax.plot(x, y, linewidth=2.5, color='#2E86AB', label='Sine Wave')
ax.fill_between(x, y, alpha=0.3, color='#A23B72')

# Labels and title
ax.set_xlabel('Time (seconds)', fontsize=12, fontweight='bold')
ax.set_ylabel('Amplitude', fontsize=12, fontweight='bold')
ax.set_title('Customized Sine Wave Plot', fontsize=14, fontweight='bold', pad=20)

# Grid and legend
ax.grid(True, linestyle='--', alpha=0.5, color='gray')
ax.legend(loc='upper right', framealpha=0.9, fontsize=10)

# Styling
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_facecolor('#F8F9FA')
fig.patch.set_facecolor('white')

plt.tight_layout()
plt.show()

Subplots: Multiple Plots in One Figure

Create multiple plots side-by-side for comparison.

import matplotlib.pyplot as plt
import numpy as np

# Generate data
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)
y3 = np.sin(x) * np.exp(-x/10)
y4 = np.cos(x) * np.exp(-x/10)

# Create 2x2 grid of subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Plot 1: Top-left
axes[0, 0].plot(x, y1, 'b-', linewidth=2)
axes[0, 0].set_title('Sine Wave')
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Top-right
axes[0, 1].plot(x, y2, 'r-', linewidth=2)
axes[0, 1].set_title('Cosine Wave')
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: Bottom-left
axes[1, 0].plot(x, y3, 'g-', linewidth=2)
axes[1, 0].set_title('Damped Sine')
axes[1, 0].grid(True, alpha=0.3)

# Plot 4: Bottom-right
axes[1, 1].plot(x, y4, 'm-', linewidth=2)
axes[1, 1].set_title('Damped Cosine')
axes[1, 1].grid(True, alpha=0.3)

# Overall title
fig.suptitle('Comparison of Trigonometric Functions', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

ML-Specific Visualizations

Confusion Matrix - Visualize classification results

import matplotlib.pyplot as plt
import numpy as np

# Confusion matrix data
confusion_matrix = np.array([
    [50, 10],
    [5, 35]
])

fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(confusion_matrix, cmap='Blues')

# Labels
classes = ['Negative', 'Positive']
ax.set_xticks(np.arange(len(classes)))
ax.set_yticks(np.arange(len(classes)))
ax.set_xticklabels(classes)
ax.set_yticklabels(classes)
ax.set_xlabel('Predicted Label', fontsize=12)
ax.set_ylabel('True Label', fontsize=12)
ax.set_title('Confusion Matrix', fontsize=14, fontweight='bold')

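# Add text annotations (white text on dark cells, black text on light cells)
threshold = confusion_matrix.max() / 2
for i in range(len(classes)):
    for j in range(len(classes)):
        color = "white" if confusion_matrix[i, j] > threshold else "black"
        ax.text(j, i, confusion_matrix[i, j],
                ha="center", va="center", color=color, fontsize=14, fontweight='bold')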

plt.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()

Feature Importance - Show which features matter most

import matplotlib.pyplot as plt
import numpy as np

# Sample feature importance data
features = ['Age', 'Income', 'Credit Score', 'Loan Amount', 'Employment Years']
importance = [0.25, 0.35, 0.20, 0.15, 0.05]

# Sort by importance
indices = np.argsort(importance)

fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(np.arange(len(features)), np.array(importance)[indices], color='steelblue')
ax.set_yticks(np.arange(len(features)))
ax.set_yticklabels(np.array(features)[indices])
ax.set_xlabel('Importance Score', fontsize=12)
ax.set_title('Feature Importance in Loan Prediction Model', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()
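The importance scores in the plot above are hard-coded sample values. As a rough sketch of where such numbers can come from, the snippet below fits a scikit-learn random forest and reads its `feature_importances_` attribute. The synthetic dataset from `make_classification` and the reuse of the loan feature names are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 5 columns, labeled with the names from the plot (assumption)
features = ['Age', 'Income', 'Credit Score', 'Loan Amount', 'Employment Years']
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)

# Impurity-based importances: one non-negative score per feature, summing to 1
importance = forest.feature_importances_

# Print features from most to least important
for idx in np.argsort(importance)[::-1]:
    print(f"{features[idx]:<18} {importance[idx]:.3f}")
```

The resulting `importance` array can be passed to the `barh` call above in place of the hard-coded list.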

Learning Curves - Monitor model training

import matplotlib.pyplot as plt
import numpy as np

# Simulated training history
epochs = np.arange(1, 51)
train_loss = 2.5 * np.exp(-epochs/10) + np.random.randn(50) * 0.05
val_loss = 2.5 * np.exp(-epochs/10) + 0.3 + np.random.randn(50) * 0.08

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(epochs, train_loss, label='Training Loss', linewidth=2, color='blue')
ax.plot(epochs, val_loss, label='Validation Loss', linewidth=2, color='red')
ax.set_xlabel('Epoch', fontsize=12)
ax.set_ylabel('Loss', fontsize=12)
ax.set_title('Model Training Progress', fontsize=14, fontweight='bold')
ax.legend(loc='upper right', fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Styles and Themes

Matplotlib has built-in styles for consistent, professional-looking plots.

import matplotlib.pyplot as plt
import numpy as np

# See available styles
print(plt.style.available)
# ['seaborn-v0_8', 'ggplot', 'dark_background', 'bmh', 'fivethirtyeight', ...]

# Use a style
plt.style.use('seaborn-v0_8-darkgrid')

# Or use temporarily
with plt.style.context('dark_background'):
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.plot([1, 2, 3, 4], [1, 4, 2, 3], linewidth=2)
    ax.set_title('Plot with Dark Background')
    plt.show()

# Reset to default
plt.style.use('default')

When to use Matplotlib in ML:

  • Exploratory Data Analysis (EDA): Understand distributions, detect outliers, find correlations
  • Feature analysis: Visualize relationships between features and target variables
  • Model evaluation: Confusion matrices, ROC curves, precision-recall curves
  • Training monitoring: Plot loss curves, accuracy over epochs
  • Results presentation: Create publication-ready figures for reports and papers
  • Debugging: Visualize intermediate results to understand model behavior
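The list above mentions ROC curves, which the earlier examples do not draw. Here is a minimal sketch, assuming a scikit-learn classifier trained on synthetic data from `make_classification`; the dataset, the model choice, and the variable names are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (assumption for illustration)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_score = clf.predict_proba(X_test)[:, 1]   # probability of the positive class

# One (false positive rate, true positive rate) point per decision threshold
fpr, tpr, thresholds = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)

fig, ax = plt.subplots(figsize=(6, 6))
ax.plot(fpr, tpr, linewidth=2, label=f'ROC curve (AUC = {auc:.2f})')
ax.plot([0, 1], [0, 1], 'k--', label='Chance level')
ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate', fontsize=12)
ax.set_title('ROC Curve', fontsize=14, fontweight='bold')
ax.legend(loc='lower right')
ax.grid(True, alpha=0.3)
plt.close(fig)  # in an interactive session, call plt.show() instead
```

A curve that hugs the top-left corner indicates a model that separates the classes well; the dashed diagonal is what random guessing would produce.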

Real Machine Learning Example

Complete EDA visualization workflow for a classification problem:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Simulated dataset
np.random.seed(42)
n_samples = 200

data = pd.DataFrame({
    'age': np.random.randint(18, 70, n_samples),
    'income': np.random.randint(30000, 150000, n_samples),
    'credit_score': np.random.randint(300, 850, n_samples),
    'approved': np.random.choice([0, 1], n_samples, p=[0.4, 0.6])
})

# Create comprehensive EDA dashboard
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Plot 1: Age distribution by approval status
approved = data[data['approved'] == 1]['age']
rejected = data[data['approved'] == 0]['age']
axes[0, 0].hist([approved, rejected], bins=20, label=['Approved', 'Rejected'], alpha=0.7)
axes[0, 0].set_xlabel('Age')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Age Distribution by Approval Status')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Income vs Credit Score scatter
colors = ['red' if x == 0 else 'green' for x in data['approved']]
axes[0, 1].scatter(data['income'], data['credit_score'], c=colors, alpha=0.6)
axes[0, 1].set_xlabel('Income ($)')
axes[0, 1].set_ylabel('Credit Score')
axes[0, 1].set_title('Income vs Credit Score (Red=Rejected, Green=Approved)')
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: Approval rate by age group
age_bins = [18, 30, 40, 50, 70]
data['age_group'] = pd.cut(data['age'], bins=age_bins,
                           labels=['18-30', '31-40', '41-50', '51-70'],
                           include_lowest=True)  # Without this, age 18 falls outside the first bin
approval_rate = data.groupby('age_group', observed=False)['approved'].mean()
axes[1, 0].bar(approval_rate.index.astype(str), approval_rate.values, color='steelblue')
axes[1, 0].set_xlabel('Age Group')
axes[1, 0].set_ylabel('Approval Rate')
axes[1, 0].set_title('Loan Approval Rate by Age Group')
axes[1, 0].grid(True, alpha=0.3, axis='y')

# Plot 4: Credit score box plot by approval
rejected_scores = data[data['approved'] == 0]['credit_score']
approved_scores = data[data['approved'] == 1]['credit_score']
axes[1, 1].boxplot([rejected_scores, approved_scores])
axes[1, 1].set_xticklabels(['Rejected', 'Approved'])
axes[1, 1].set_ylabel('Credit Score')
axes[1, 1].set_title('Credit Score Distribution by Approval Status')
axes[1, 1].grid(True, alpha=0.3, axis='y')

# Overall title
fig.suptitle('Loan Approval EDA Dashboard', fontsize=16, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()

# What to look for in each panel (this dataset is random, so expect no real signal):
# 1. Whether age distributions differ between approved and rejected applicants
# 2. Whether higher credit scores go together with approval
# 3. Whether income alone separates the two classes
# 4. Whether approval rates vary across age groups

Best Practices

  • Use OO interface: fig, ax = plt.subplots() gives more control than pyplot
  • Set figure size: Always specify figsize for consistent, readable plots
  • Label everything: Axis labels, titles, and legends make plots understandable
  • Choose colors wisely: Use colorblind-friendly palettes and sufficient contrast
  • Save high-resolution: plt.savefig('plot.png', dpi=300, bbox_inches='tight')
  • Close figures: Use plt.close() to free memory when creating many plots

Why this matters: Data visualization is the bridge between raw numbers and human understanding. Before building any ML model, visualization helps identify data quality issues, understand feature relationships, and communicate findings. Matplotlib is the foundation: Seaborn is built directly on top of it, and its core concepts (figures, axes, labels, legends) carry over to tools like Plotly.

4. Scikit-learn

Machine learning algorithms and tools

What is Scikit-learn?

Scikit-learn is your ML algorithm toolkit, with dozens of ready-to-use models that all follow the same pattern: fit(), predict(), score().

Need to predict house prices? Use LinearRegression. Classify emails as spam? Try RandomForestClassifier. Every estimator shares the same interface, so switching between models usually means changing one line of code. This is where your prepared data becomes predictions.

Scikit-learn provides three main components:

Estimators (models that learn):

Algorithms like LinearRegression, RandomForestClassifier, or SVC (a support vector machine) that learn patterns from training data and make predictions on new data.

Transformers (data preprocessing):

Tools like StandardScaler or OneHotEncoder that transform your data into the format ML algorithms need—scaling numbers, encoding categories, reducing dimensions.

Pipelines (workflow automation):

Chain preprocessing and modeling steps together so you can apply the entire workflow in one command—essential for clean, reproducible ML code.

Why the consistent API matters: Every algorithm follows the same pattern—fit(X, y) to train, predict(X) to use. This means switching from linear regression to random forest requires changing one line of code. You learn the API once, and can experiment with dozens of algorithms.
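To make the consistent API concrete, here is a short sketch that trains three different classifiers through exactly the same calls. The synthetic dataset from `make_classification` and the `candidates` dictionary are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data standing in for a real tabular dataset (assumption)
X, y = make_classification(n_samples=300, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Three unrelated algorithms, one shared interface
candidates = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'svm_rbf': SVC(kernel='rbf'),
}

results = {}
for name, estimator in candidates.items():
    estimator.fit(X_train, y_train)                    # train
    results[name] = estimator.score(X_test, y_test)    # mean accuracy on held-out data
    print(f"{name}: {results[name]:.3f}")
```

Adding another algorithm to the comparison only requires one more entry in the dictionary; the loop itself does not change.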

Data Splitting: The Foundation of ML

Before training any model, data must be split into training and testing sets to evaluate performance on unseen data.

from sklearn.model_selection import train_test_split
import numpy as np

# Sample dataset: 20 samples, 10 per class
# (large enough for a stratified split and for 5-fold cross-validation)
X = np.arange(1, 41).reshape(20, 2)   # [[1, 2], [3, 4], ..., [39, 40]]
y = np.array([0] * 10 + [1] * 10)

# Split 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% for testing
    random_state=42,    # Reproducible split
    stratify=y          # Maintain class balance
)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

Cross-Validation: For robust evaluation, use k-fold cross-validation to test the model on multiple train/test splits.

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5)  # 5-fold CV (stratified for classifiers)
print(f"Accuracy per fold: {scores}")
print(f"Mean accuracy: {scores.mean():.3f}")

Preprocessing: Preparing Data for ML

Raw data rarely works directly with ML algorithms. Scikit-learn provides powerful preprocessing tools.

# 1. Scaling Numerical Features
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# StandardScaler: mean=0, std=1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# MinMaxScaler: scale to [0, 1]
minmax = MinMaxScaler()
X_normalized = minmax.fit_transform(X_train)

# 2. Encoding Categorical Features
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

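# LabelEncoder: convert categories to integers
# (designed for target labels; for input features, OrdinalEncoder or OneHotEncoder is the usual choice)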
le = LabelEncoder()
colors = np.array(['red', 'blue', 'green', 'blue', 'red'])
encoded = le.fit_transform(colors)  # [2, 0, 1, 0, 2]

# OneHotEncoder: create binary columns
ohe = OneHotEncoder(sparse_output=False)
categories = np.array([['cat'], ['dog'], ['cat']])
one_hot = ohe.fit_transform(categories)
# [[1, 0],    # cat
#  [0, 1],    # dog
#  [1, 0]]    # cat

# 3. Handling Missing Values
from sklearn.impute import SimpleImputer

# Fill missing values with mean
imputer = SimpleImputer(strategy='mean')
X_with_nan = np.array([[1, 2], [np.nan, 4], [7, 6]])
X_filled = imputer.fit_transform(X_with_nan)
# [[1, 2],
#  [4, 4],    # NaN replaced with mean (4)
#  [7, 6]]
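To show what StandardScaler actually computes, the sketch below reproduces it by hand with NumPy. The small arrays are made-up values for illustration. The statistics come from the training data only and are then reused on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test data with two features on very different scales
X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [4.0, 800.0]])
X_test = np.array([[2.5, 500.0]])

# Library version: learn mean and std from the training data only
scaler = StandardScaler().fit(X_train)

# Manual version: z = (x - mean) / std, computed per column
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)          # population std (ddof=0), which StandardScaler uses
manual_train = (X_train - mean) / std
manual_test = (X_test - mean) / std   # test data reuses the training statistics

print(np.allclose(scaler.transform(X_train), manual_train))  # True
print(manual_test)  # [[0. 0.]] because the test row equals the training mean
```

Calling `transform` (not `fit_transform`) on the test set is what keeps test-set statistics out of the preprocessing.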

ML Algorithms: Regression, Classification, Clustering

Scikit-learn provides dozens of algorithms. Here are the most common ones with examples.

# 1. REGRESSION: Predict Continuous Values
# (here X_train / y_train stand for a dataset whose target is a continuous number)
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor

# Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
predictions = lr.predict(X_test)

# Ridge Regression (L2 regularization)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

# Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# 2. CLASSIFICATION: Predict Categories
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# Logistic Regression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

# Decision Tree
tree = DecisionTreeClassifier(max_depth=5)
tree.fit(X_train, y_train)

# Random Forest
rf_clf = RandomForestClassifier(n_estimators=100)
rf_clf.fit(X_train, y_train)

# Support Vector Machine
svm = SVC(kernel='rbf', C=1.0)
svm.fit(X_train, y_train)

# Gradient Boosting
gb = GradientBoostingClassifier(n_estimators=100)
gb.fit(X_train, y_train)

# 3. CLUSTERING: Find Groups in Data
from sklearn.cluster import KMeans, DBSCAN

# K-Means Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X)  # No labels needed

# DBSCAN (density-based clustering)
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(X)

Model Evaluation: Measuring Performance

Training a model is only half the work—evaluating its performance is equally critical.

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report,
    mean_squared_error, mean_absolute_error, r2_score,
    roc_auc_score, roc_curve
)

# Classification Metrics
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)      # Overall correctness
precision = precision_score(y_true, y_pred)    # True positives / predicted positives
recall = recall_score(y_true, y_pred)          # True positives / actual positives
f1 = f1_score(y_true, y_pred)                  # Harmonic mean of precision & recall

# Confusion Matrix
cm = confusion_matrix(y_true, y_pred)
# [[2, 0],    # True negatives, False positives
#  [1, 2]]    # False negatives, True positives

# Full Report
report = classification_report(y_true, y_pred)
print(report)

# ROC-AUC for probability predictions (uses the log_reg classifier fitted earlier;
# y_test must contain both classes)
y_proba = log_reg.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_proba)

# Regression Metrics
y_true_reg = [3.0, -0.5, 2.0, 7.0]
y_pred_reg = [2.5, 0.0, 2.0, 8.0]

mse = mean_squared_error(y_true_reg, y_pred_reg)    # Mean squared error
rmse = mse ** 0.5                                    # Root MSE (squared=False is deprecated in newer scikit-learn releases)
mae = mean_absolute_error(y_true_reg, y_pred_reg)   # Mean absolute error
r2 = r2_score(y_true_reg, y_pred_reg)                # R-squared (1.0 is perfect; can be negative for very poor fits)
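The regression metrics above are simple formulas over the residuals. The following sketch recomputes them with plain NumPy on the same four values, so the definitions are visible; the variable names `ss_res` and `ss_tot` are illustrative.

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

residuals = y_true - y_pred                        # [0.5, -0.5, 0.0, -1.0]

mse = np.mean(residuals ** 2)                      # average squared error
rmse = np.sqrt(mse)                                # same units as the target
mae = np.mean(np.abs(residuals))                   # average absolute error

ss_res = np.sum(residuals ** 2)                    # variation the model fails to explain
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total variation in the target
r2 = 1 - ss_res / ss_tot                           # share of variation explained

print(f"MSE:  {mse:.3f}")   # 0.375
print(f"RMSE: {rmse:.3f}")  # 0.612
print(f"MAE:  {mae:.3f}")   # 0.500
print(f"R2:   {r2:.3f}")    # 0.949
```

Because R-squared compares the model against always predicting the mean, a model that does worse than that baseline gets a negative score.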

Hyperparameter Tuning: Finding Optimal Settings

Every ML algorithm has hyperparameters (settings) that dramatically affect performance. Scikit-learn automates the search for optimal values.

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define model
rf = RandomForestClassifier(random_state=42)

# Grid Search: Try Every Combination
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    rf, param_grid,
    cv=5,                    # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1                # Use all CPU cores
)
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")

# Use best model
best_model = grid_search.best_estimator_
predictions = best_model.predict(X_test)

# Randomized Search: Sample Random Combinations (Faster)
from scipy.stats import randint

param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(5, 50)
}

random_search = RandomizedSearchCV(
    rf, param_dist,
    n_iter=20,               # Try 20 random combinations
    cv=5,
    random_state=42
)
random_search.fit(X_train, y_train)
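Before launching a search it helps to estimate how many models will be trained. This sketch counts the work for the grid above using scikit-learn's `ParameterGrid` helper; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
}

cv_folds = 5

# Grid search trains one model per parameter combination per fold
n_candidates = len(ParameterGrid(param_grid))   # 3 * 4 * 3 = 36 combinations
grid_fits = n_candidates * cv_folds             # 36 * 5 = 180 fits

# Randomized search with n_iter=20 trains a fixed number of candidates
random_fits = 20 * cv_folds                     # 20 * 5 = 100 fits

print(f"Grid search:       {n_candidates} candidates, {grid_fits} fits")
print(f"Randomized search: 20 candidates, {random_fits} fits")
```

Both searches also refit the best candidate once on the full training set by default (`refit=True`). The grid cost multiplies with every added parameter, while the randomized cost stays fixed at `n_iter`.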

Pipelines: Preventing Data Leakage

One of the most common mistakes in ML is data leakage—when information from the test set accidentally influences training. Pipelines prevent this by ensuring all preprocessing steps are learned only from training data.

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Without Pipeline (WRONG - causes data leakage)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # ❌ Uses ALL data (train + test)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
model = LogisticRegression()
model.fit(X_train, y_train)

# With Pipeline (CORRECT - no leakage)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

# Split BEFORE preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Scaler learns ONLY from X_train
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)

# Complex Pipeline with Multiple Steps
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

complex_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100))
])

complex_pipeline.fit(X_train, y_train)
predictions = complex_pipeline.predict(X_test)

Pipeline + GridSearch: Combine pipelines with hyperparameter tuning for production-ready workflows.

param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [5, 10, None]
}

grid = GridSearchCV(complex_pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)
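One way to check that a pipeline really learns its preprocessing from the training data only is to inspect the fitted scaler. The sketch below does that on synthetic data; the dataset and the names `matches_train` and `matches_all` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption); split BEFORE any preprocessing
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])
pipeline.fit(X_train, y_train)

# The fitted scaler is reachable by its step name
learned_mean = pipeline.named_steps['scaler'].mean_

matches_train = np.allclose(learned_mean, X_train.mean(axis=0))  # statistics of the training rows
matches_all = np.allclose(learned_mean, X.mean(axis=0))          # statistics of train + test rows

print(f"Scaler mean equals training mean: {matches_train}")
print(f"Scaler mean equals full-data mean: {matches_all}")
```

The scaler's mean matches the training rows and not the full dataset, which is the property that prevents leakage. Inside GridSearchCV the same thing happens per fold: the whole pipeline is refit on each training fold.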

When to Use Scikit-learn in ML Workflows

  • Tabular data problems: Customer churn, fraud detection, price prediction, medical diagnosis
  • Baseline models: Always start with Scikit-learn before trying deep learning
  • Small to medium datasets: Works efficiently on datasets that fit in memory (<1M rows)
  • Interpretable models: Decision trees and linear models are easier to explain than neural networks
  • Production systems: Models can be saved and loaded for deployment
  • Preprocessing pipelines: Even if using TensorFlow/PyTorch, Scikit-learn handles data prep
  • Quick experimentation: Test multiple algorithms in minutes with consistent API

Real ML Example: End-to-End Credit Approval System

Complete workflow from raw data to deployed model.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

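# 1. Load Data (synthetic stand-in: 200 rows, so the stratified split and 5-fold CV
#    below have enough samples per class; a handful of rows would make them fail)
rng = np.random.default_rng(42)
n = 200
data = pd.DataFrame({
    'income': rng.integers(30000, 120000, n).astype(float),
    'age': rng.integers(21, 65, n).astype(float),
    'employment': rng.choice(['self', 'employed'], n),
    'loan_amount': rng.integers(5000, 25000, n),
})
data['approved'] = (data['income'] > 3 * data['loan_amount']).astype(int)  # Target (toy rule)
data.loc[::25, 'income'] = np.nan    # Inject some missing values for the imputers
data.loc[5::25, 'age'] = np.nan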

# 2. Split Features and Target
X = data.drop('approved', axis=1)
y = data['approved']

# 3. Identify Column Types
numeric_features = ['income', 'age', 'loan_amount']
categorical_features = ['employment']

# 4. Create Preprocessing Pipelines
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# 5. Combine Transformers
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# 6. Create Full Pipeline
model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# 7. Split Data (BEFORE preprocessing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 8. Train Model
model.fit(X_train, y_train)

# 9. Evaluate
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

print("Classification Report:")
print(classification_report(y_test, y_pred))
print(f"\nROC-AUC Score: {roc_auc_score(y_test, y_proba):.3f}")

# 10. Hyperparameter Tuning
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [5, 10, None]
}

grid_search = GridSearchCV(model, param_grid, cv=5, scoring='roc_auc')
grid_search.fit(X_train, y_train)

print(f"\nBest Parameters: {grid_search.best_params_}")
print(f"Best ROC-AUC: {grid_search.best_score_:.3f}")

# 11. Save Model for Production
import joblib
joblib.dump(grid_search.best_estimator_, 'credit_model.pkl')

# 12. Load and Use in Production
loaded_model = joblib.load('credit_model.pkl')
new_applicant = pd.DataFrame({
    'income': [55000],
    'age': [30],
    'employment': ['employed'],
    'loan_amount': [15000]
})
prediction = loaded_model.predict(new_applicant)
probability = loaded_model.predict_proba(new_applicant)[:, 1]
print(f"Approval: {prediction[0]}, Probability: {probability[0]:.2f}")

Best Practices (2024-2025)

  • Always use pipelines: Prevents data leakage and makes code reproducible
  • Start simple, then increase complexity: Try Logistic Regression before Random Forests
  • Use cross-validation: Never trust a single train/test split—use k-fold CV
  • Scale numerical features: Distance-based and gradient-based algorithms (SVMs, k-NN, linear models) perform better with standardized data; tree-based models do not need it
  • Handle class imbalance: Use stratify in train_test_split and class_weight='balanced' in models
  • Feature engineering matters: Good features > fancy algorithms
  • Monitor multiple metrics: Accuracy alone is misleading—check precision, recall, F1, ROC-AUC
  • Save trained models: Use joblib to save and load models for production

Why this matters: Scikit-learn is the workhorse of machine learning. It covers the large majority of traditional ML tasks with a clean, consistent API. While deep learning frameworks like TensorFlow and PyTorch dominate headlines, many real-world ML problems on tabular data (fraud detection, customer churn, price prediction) are solved with Scikit-learn. Master this library, and the transition to advanced frameworks becomes natural. Even when using deep learning, Scikit-learn handles preprocessing, evaluation, and baseline models.
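The class_weight='balanced' option mentioned in the best practices is a simple reweighting formula. The sketch below shows the weights it produces for a made-up label vector with a 90/10 imbalance, using scikit-learn's `compute_class_weight` helper alongside the same calculation by hand.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
classes = np.unique(y)

# Library helper: the weights that class_weight='balanced' would apply
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)

# Same formula by hand: n_samples / (n_classes * samples_in_class)
manual = len(y) / (len(classes) * np.bincount(y))

for cls, w in zip(classes, weights):
    print(f"class {cls}: weight {w:.3f}")
# class 0: weight 0.556
# class 1: weight 5.000
```

Each rare-class sample counts nine times as much as a majority-class sample here, so mistakes on the minority class are no longer cheap for the model to make.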

5. TensorFlow / PyTorch

Deep learning and neural networks

What are TensorFlow and PyTorch?

TensorFlow and PyTorch are deep learning frameworks—tools for building neural networks that learn from images, text, audio, and video.

Scikit-learn handles traditional ML on tabular data. When you need to process unstructured data (recognize faces, understand language, generate images), you need neural networks. These frameworks handle the complex math, GPU acceleration, and automatic differentiation that make deep learning possible.

Core capabilities of deep learning frameworks:

Tensors (multi-dimensional arrays):

Like NumPy arrays but optimized for GPU computation. Images are 3D tensors, batches of images are 4D tensors. Everything in deep learning is a tensor operation.
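The shapes described above can be sketched with plain NumPy arrays, which behave like framework tensors as far as dimensions go. The 28x28 RGB image size and the batch size of 32 are arbitrary assumptions for illustration.

```python
import numpy as np

# One color image: height x width x channels -> 3D
image = np.zeros((28, 28, 3), dtype=np.float32)

# A batch of 32 such images: batch x height x width x channels -> 4D
batch = np.stack([image] * 32)

# Flattening each image into a vector, as a dense layer expects
flat = batch.reshape(len(batch), -1)

print(image.shape)  # (28, 28, 3)
print(batch.shape)  # (32, 28, 28, 3)
print(flat.shape)   # (32, 2352)
```

TensorFlow uses this channels-last layout by default, while PyTorch's convention is channels-first, so the same batch would have shape (32, 3, 28, 28) there.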

Automatic differentiation (autograd):

Automatically computes gradients for backpropagation. You define the forward pass (how data flows through the network), and the framework calculates how to update weights.
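To give a feel for what the framework automates, here is a framework-free sketch in NumPy. It computes the gradient of a mean-squared-error loss for a linear model in two ways: with a hand-derived formula (what autograd would produce for you) and with a numerical finite-difference check. The data and function names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # 50 samples, 3 features
y = rng.normal(size=50)        # targets
w = rng.normal(size=3)         # current weights

def loss(w):
    """Forward pass: predictions, then mean squared error."""
    return np.mean((X @ w - y) ** 2)

def analytic_grad(w):
    """Hand-derived gradient of the loss with respect to w."""
    return 2 * X.T @ (X @ w - y) / len(y)

def numeric_grad(w, eps=1e-6):
    """Central finite differences: nudge each weight and watch the loss change."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (loss(w + step) - loss(w - step)) / (2 * eps)
    return grad

print(analytic_grad(w))
print(numeric_grad(w))
print(np.allclose(analytic_grad(w), numeric_grad(w), atol=1e-5))  # True
```

Deriving `analytic_grad` by hand is manageable for one linear layer but not for a network with millions of parameters, which is why the frameworks record the forward pass and derive the gradients automatically.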

GPU acceleration:

Neural networks have millions of parameters. GPUs run the underlying matrix operations in parallel, often 10-100x faster than CPUs, which makes deep learning training practical.

When to use deep learning: Start with Scikit-learn for tabular data (spreadsheets, databases). Move to TensorFlow/PyTorch when working with images, text, audio, video, or when you need custom neural architectures. Deep learning requires more data, more compute, and more expertise.

TensorFlow vs PyTorch: 2024-2025 Comparison

As of 2025, both frameworks are mature, highly optimized, and support dynamic computation. The choice depends on use case, not technical superiority.

PyTorch Strengths (PyTorch 2.x):
✓ Pythonic & intuitive: Write neural networks like normal Python classes
✓ Dynamic computation graphs: Debug with standard Python tools
✓ Research-friendly: Most academic papers use PyTorch
✓ Faster prototyping: Less boilerplate code
✓ Better for custom architectures and experimental models
✓ Strong training performance: torch.compile (PyTorch 2.x) can speed up many models, though gains vary by model and hardware

TensorFlow Strengths (TensorFlow 2.x):
✓ Production ecosystem: TF Serving, TF Lite (mobile), TF.js (browser)
✓ Enterprise deployment: Mature MLOps tools
✓ Keras API integrated: High-level, beginner-friendly
✓ TPU support: Google's Tensor Processing Units
✓ Mobile/edge deployment: Better tooling for iOS/Android
✓ TensorBoard: Superior visualization for training metrics

Reality in 2025:
Both frameworks support static and dynamic modes. Both scale to production.
Both have excellent documentation. Pick based on your team's preference.

Core Concepts: Tensors, Layers, and Models

Both frameworks share fundamental concepts, though syntax differs.

# 1. TENSORS: Multidimensional Arrays (like NumPy, but GPU-ready)

# TensorFlow
import tensorflow as tf
tensor_tf = tf.constant([[1, 2], [3, 4]])
print(tensor_tf.shape)  # (2, 2)

# PyTorch
import torch
tensor_pt = torch.tensor([[1, 2], [3, 4]])
print(tensor_pt.shape)  # torch.Size([2, 2])

# 2. LAYERS: Building Blocks of Neural Networks

# TensorFlow (Keras API)
from tensorflow.keras import layers
dense_layer = layers.Dense(64, activation='relu')

# PyTorch (torch.nn)
import torch.nn as nn
dense_layer = nn.Linear(in_features=32, out_features=64)

# 3. MODELS: Complete Neural Networks

# TensorFlow Sequential API
model_tf = tf.keras.Sequential([
    layers.Dense(128, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])

# PyTorch Class-based API
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

model_pt = SimpleNet()
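The size of a fully connected network follows directly from its layer widths. This short sketch counts the trainable parameters of the SimpleNet layout above (784 → 128 → 64 → 10) in plain Python; the helper name `count_dense_params` is hypothetical.

```python
# Layer widths of SimpleNet: input, two hidden layers, output
layer_sizes = [784, 128, 64, 10]

def count_dense_params(sizes):
    """Each dense layer has n_in * n_out weights plus n_out biases."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    print(f"{n_in:>4} -> {n_out:<4} {n_in * n_out + n_out:>7} parameters")
# 784 -> 128   100480 parameters
# 128 -> 64      8256 parameters
#  64 -> 10       650 parameters

total = count_dense_params(layer_sizes)
print(f"Total: {total}")  # Total: 109386
```

Almost all of the parameters sit in the first layer, because that is where the 784 input pixels are connected to every hidden unit.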

Building Neural Networks: Complete Examples

Side-by-side comparison of building, training, and evaluating a neural network for image classification.

# TensorFlow Example: MNIST Digit Classification
import tensorflow as tf
from tensorflow.keras import layers, models
import numpy as np

# 1. Load Data
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()

# 2. Preprocess
X_train = X_train.reshape(-1, 784).astype('float32') / 255.0
X_test = X_test.reshape(-1, 784).astype('float32') / 255.0

# 3. Build Model
model = models.Sequential([
    layers.Input(shape=(784,)),  # Explicit Input layer (passing input_shape to Dense triggers a warning in Keras 3)
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])

# 4. Compile
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# 5. Train
history = model.fit(
    X_train, y_train,
    epochs=10,
    batch_size=32,
    validation_split=0.2,
    verbose=1
)

# 6. Evaluate
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"Test accuracy: {test_acc:.3f}")

# 7. Predict
predictions = model.predict(X_test[:5])
print(f"Predicted classes: {np.argmax(predictions, axis=1)}")

# 8. Save Model (native Keras format; the older .h5 format is considered legacy)
model.save('mnist_model.keras')

# PyTorch Example: MNIST Digit Classification
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# 1. Load Data
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])
train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST(root='./data', train=False, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)

# 2. Build Model
class MNISTNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.dropout = nn.Dropout(0.2)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        x = x.view(-1, 784)  # Flatten
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

model = MNISTNet()

# 3. Define Loss and Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# 4. Train
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

for epoch in range(10):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)

        # Forward pass
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)

        # Backward pass
        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch+1}/10, Loss: {loss.item():.4f}")

# 5. Evaluate
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for data, target in test_loader:
        data, target = data.to(device), target.to(device)
        output = model(data)
        predicted = output.argmax(dim=1)  # Index of the highest logit = predicted class
        total += target.size(0)
        correct += (predicted == target).sum().item()

print(f"Test accuracy: {correct / total:.3f}")

# 6. Save Model
torch.save(model.state_dict(), 'mnist_model.pth')

Common Neural Network Architectures

Different problems require different architectures. Here are the most common patterns.

# 1. CONVOLUTIONAL NEURAL NETWORKS (CNNs) - For Images

# TensorFlow
cnn_model = tf.keras.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])

# PyTorch
class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(64 * 5 * 5, 64)
        self.fc2 = nn.Linear(64, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        x = x.view(-1, 64 * 5 * 5)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# 2. RECURRENT NEURAL NETWORKS (RNNs) - For Sequential Data

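# TensorFlow (timesteps and features are placeholders for the sequence length
# and the number of features per timestep)
rnn_model = tf.keras.Sequential([
    layers.LSTM(128, input_shape=(timesteps, features)),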
    layers.Dense(64, activation='relu'),
    layers.Dense(1)
])

# PyTorch
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out, _ = self.lstm(x)
        out = self.fc(out[:, -1, :])  # Take last timestep
        return out

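# 3. TRANSFORMERS - For Language (Modern Approach)

# Using TensorFlow with Keras (vocab_size, embedding_dim, num_classes are placeholders).
# MultiHeadAttention takes separate query and value inputs, so it cannot sit in a
# Sequential stack; the functional API is used instead.
inputs = layers.Input(shape=(None,))                           # Variable-length token IDs
x = layers.Embedding(vocab_size, embedding_dim)(inputs)
x = layers.MultiHeadAttention(num_heads=8, key_dim=64)(x, x)   # Self-attention: query = value = x
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dense(64, activation='relu')(x)
outputs = layers.Dense(num_classes, activation='softmax')(x)
transformer_model = tf.keras.Model(inputs, outputs)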
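The `64 * 5 * 5` in the PyTorch CNN above is not arbitrary: it is the number of values left after two convolution and pooling stages on a 28x28 input. The sketch below traces that arithmetic in plain Python, assuming unpadded 3x3 convolutions with stride 1 and 2x2 max pooling as in the example; the helper function names are hypothetical.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_output_size(size, window=2):
    """Spatial size after non-overlapping max pooling (rounds down)."""
    return size // window

size = 28                              # MNIST images are 28x28
size = conv_output_size(size, 3)       # conv1: 28 -> 26
size = pool_output_size(size)          # pool:  26 -> 13
size = conv_output_size(size, 3)       # conv2: 13 -> 11
size = pool_output_size(size)          # pool:  11 -> 5

channels = 64                          # output channels of conv2
flattened = channels * size * size

print(f"Final feature map: {channels} x {size} x {size}")  # 64 x 5 x 5
print(f"Inputs to first dense layer: {flattened}")          # 1600
```

If the input resolution, kernel size, or padding changes, this number changes too, and the first `nn.Linear` layer has to be updated to match. Keras infers it automatically in `Flatten`.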

The Training Process: Forward Pass, Loss, Backpropagation

Understanding what happens inside model.fit() (TensorFlow) or the training loop (PyTorch).

# Manual Training Loop (PyTorch) - Shows What's Happening

for epoch in range(num_epochs):
    for batch_data, batch_labels in data_loader:

        # 1. FORWARD PASS
        # Pass data through the network
        predictions = model(batch_data)

        # 2. COMPUTE LOSS
        # Measure how wrong the predictions are
        loss = loss_function(predictions, batch_labels)

        # 3. ZERO GRADIENTS
        # Clear previous gradients (PyTorch accumulates them)
        optimizer.zero_grad()

        # 4. BACKWARD PASS (Backpropagation)
        # Compute gradients for all parameters
        loss.backward()

        # 5. UPDATE WEIGHTS
        # Adjust parameters using gradients
        optimizer.step()

    print(f"Epoch {epoch}, Loss: {loss.item()}")

# TensorFlow hides this in model.fit(), but the steps are the same:
# 1. Forward pass → 2. Loss → 3. Backpropagation (compute gradients) → 4. Update weights
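The same loop can be run end to end without any deep learning framework. Below is a minimal sketch in NumPy that fits a two-weight linear model with gradient descent. The gradient is written by hand in place of `loss.backward()`, and the data, learning rate, and step count are assumptions chosen for illustration.

```python
import numpy as np

# Noise-free synthetic data generated from known weights (assumption)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w

w = np.zeros(2)          # start from zero weights
learning_rate = 0.1
losses = []

for epoch in range(200):
    # 1. FORWARD PASS
    predictions = X @ w

    # 2. COMPUTE LOSS (mean squared error)
    error = predictions - y
    loss = np.mean(error ** 2)
    losses.append(loss)

    # 3-4. GRADIENT (derived by hand here; recomputed from scratch every step,
    #      so there is nothing to zero out as in PyTorch)
    grad = 2 * X.T @ error / len(y)

    # 5. UPDATE WEIGHTS
    w -= learning_rate * grad

print(f"First loss: {losses[0]:.4f}, last loss: {losses[-1]:.2e}")
print(f"Learned weights: {w}")
```

The learned weights end up very close to `true_w`. A neural network follows the same cycle, only with more parameters and with the gradient step delegated to the framework's automatic differentiation.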

GPU Acceleration: 10-100x Faster Training

Deep learning requires GPUs for practical training times. Both frameworks make this simple.

# TensorFlow: Automatic GPU Detection
import tensorflow as tf

# TensorFlow automatically uses GPU if available
print("GPUs Available:", len(tf.config.list_physical_devices('GPU')))

# Manual device placement (rarely needed)
with tf.device('/GPU:0'):
    model = tf.keras.Sequential([...])
    model.fit(X_train, y_train)

# PyTorch: Explicit Device Management
import torch

# Check GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Move model and data to GPU
model = SimpleNet().to(device)
data = data.to(device)
labels = labels.to(device)

# All operations now run on GPU
output = model(data)

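# Performance Impact (illustrative figures; actual speedups depend on hardware, batch size, and model):
# CPU: Training 1 epoch on MNIST might take ~60 seconds
# GPU: The same epoch might take ~6 seconds (about 10x faster)
# For large models (ResNet, BERT): the GPU advantage can reach 50-100x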

When to Use TensorFlow/PyTorch

  • Computer vision: Image classification, object detection, segmentation (CNNs)
  • Natural language processing: Text classification, translation, LLMs (Transformers)
  • Speech and audio: Speech recognition, audio generation (RNNs, CNNs)
  • Time series: Stock prediction, sensor data forecasting (LSTMs, GRUs)
  • Generative models: GANs for image generation, VAEs for data synthesis
  • Reinforcement learning: Game AI, robotics control
  • When Scikit-learn falls short: Unstructured data (images, text, audio), very large datasets, or tasks where deep learning outperforms classical models

Real Example: Image Classification with Transfer Learning

Using pre-trained models (trained on millions of images) for custom tasks—the most practical approach for real projects.

# TensorFlow: Transfer Learning with MobileNetV2
import tensorflow as tf
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras import layers, models

# 1. Load Pre-trained Model (trained on ImageNet)
base_model = MobileNetV2(
    input_shape=(224, 224, 3),
    include_top=False,  # Remove classification head
    weights='imagenet'
)

# 2. Freeze Base Model (don't retrain it)
base_model.trainable = False

# 3. Add Custom Classification Head
model = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(num_classes, activation='softmax')  # Custom classes
])

# 4. Compile and Train
model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

# Only trains the new layers, not the pre-trained base
model.fit(train_data, epochs=10, validation_data=val_data)

# Why Transfer Learning?
# - Training from scratch: requires millions of images, weeks of GPU time
# - Transfer learning: works with 100s of images, hours of training
# - Can often reach 90%+ accuracy on custom tasks with minimal data

# PyTorch: Transfer Learning with ResNet
import torch
import torch.nn as nn
from torchvision import models, transforms

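# 1. Load Pre-trained Model (ImageNet weights)
# torchvision >= 0.13 uses the `weights` argument; `pretrained=True` is deprecated
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)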

# 2. Freeze Base Layers
for param in resnet.parameters():
    param.requires_grad = False

# 3. Replace Final Layer
num_features = resnet.fc.in_features
resnet.fc = nn.Linear(num_features, num_classes)

# 4. Only the final layer will be trained
criterion = nn.CrossEntropyLoss()  # Standard loss for multi-class classification
optimizer = torch.optim.Adam(resnet.fc.parameters(), lr=0.001)

# 5. Train
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
resnet = resnet.to(device)

for epoch in range(10):
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = resnet(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

torch.save(resnet.state_dict(), 'custom_resnet.pth')

Best Practices (2024-2025)

  • Start with pre-trained models: When data is limited, transfer learning usually beats training from scratch
  • Use data augmentation: Rotate, flip, crop images to create more training data artificially
  • Monitor validation loss: If training loss decreases but validation increases, the model is overfitting
  • Use GPUs: Deep learning on CPU is impractical for real projects—use Google Colab (free GPU) or cloud services
  • Batch normalization: Add BatchNormalization layers to stabilize training
  • Learning rate scheduling: Reduce learning rate when validation loss plateaus
  • Early stopping: Stop training when validation performance stops improving
  • Save checkpoints: Save model weights during training to recover from crashes
  • TensorFlow for production, PyTorch for research: A common rule of thumb, although both frameworks are now used in both settings
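
Three of the practices above (monitoring validation loss, reducing the learning rate on a plateau, and early stopping) come down to the same bookkeeping. The sketch below shows that logic in plain Python. `PlateauMonitor`, its parameter names, and the validation losses are all hypothetical and exist only for illustration; in practice both frameworks ship their own versions of this (for example Keras callbacks and PyTorch learning-rate schedulers).

```python
class PlateauMonitor:
    """Tracks validation loss; reduces the learning rate and signals early stopping.

    Hypothetical helper for illustration, not a framework API.
    """

    def __init__(self, lr, lr_patience=2, stop_patience=5, factor=0.5, min_delta=1e-4):
        self.lr = lr                        # Current learning rate
        self.lr_patience = lr_patience      # Stagnant epochs before each LR reduction
        self.stop_patience = stop_patience  # Stagnant epochs before stopping
        self.factor = factor                # Multiplier applied to the LR on a plateau
        self.min_delta = min_delta          # Smallest change that counts as improvement
        self.best_loss = float("inf")
        self.best_epoch = None
        self.epochs_without_improvement = 0

    def update(self, epoch, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: a real training loop would also save a checkpoint here
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.epochs_without_improvement = 0
            return False

        self.epochs_without_improvement += 1
        # Reduce the learning rate after every `lr_patience` stagnant epochs
        if self.epochs_without_improvement % self.lr_patience == 0:
            self.lr *= self.factor
        return self.epochs_without_improvement >= self.stop_patience


# Made-up validation losses: improving at first, then rising (a typical overfitting pattern)
val_losses = [0.90, 0.70, 0.55, 0.50, 0.49, 0.51, 0.52, 0.55, 0.58, 0.60, 0.65]

monitor = PlateauMonitor(lr=0.001)
for epoch, val_loss in enumerate(val_losses):
    if monitor.update(epoch, val_loss):
        print(f"Stopping at epoch {epoch}; best was epoch {monitor.best_epoch} "
              f"(val_loss={monitor.best_loss:.2f}, final lr={monitor.lr})")
        break
```

With these numbers the loop stops once five epochs in a row fail to improve on the best validation loss, and the learning rate has been halved twice along the way. The patience values are a trade-off: too small and training stops on noise, too large and compute is wasted on an overfitting model.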

Deployment: From Training to Production

# TensorFlow Deployment Options

# 1. TensorFlow Lite (Mobile/Edge)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)

# 2. TensorFlow.js (Browser)
# tensorflowjs_converter --input_format=keras model.h5 tfjs_model/

# 3. TensorFlow Serving (Production API)
# docker run -p 8501:8501 --mount type=bind,source=/models/my_model,target=/models/my_model -e MODEL_NAME=my_model -t tensorflow/serving

# PyTorch Deployment Options

# 1. TorchServe (Production API)
# torch-model-archiver --model-name mnist --version 1.0 --model-file model.py --serialized-file model.pth
# torchserve --start --model-store model_store --models mnist=mnist.mar

# 2. ONNX (Cross-platform)
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "model.onnx")

# 3. TorchScript (Production PyTorch)
scripted_model = torch.jit.script(model)
scripted_model.save("model_scripted.pt")

Why this matters: TensorFlow and PyTorch underpin much of modern AI. Breakthroughs such as GPT language models, DALL-E image generation, and AlphaFold protein folding were built on deep learning frameworks like these. While Scikit-learn handles traditional ML, deep learning tackles problems that are out of reach for classical approaches. The learning curve is steeper, but the capabilities are transformative. Start with Scikit-learn to build intuition, then move to TensorFlow or PyTorch when solving problems with images, text, or massive datasets. In 2025, both frameworks are production-ready, and the "best" choice depends on team preference and deployment target, not technical superiority.

Setting Up Your Environment

Quick Setup (5 minutes)

Python and these libraries are required. Here's the setup process:

1. Install Python

Download from python.org or install via Anaconda (recommended for beginners: it bundles Python with most of the libraries used in this chapter).

Anaconda: anaconda.com/download

2. Install Libraries

Open your terminal/command prompt and run:

# If using pip (Python's package manager)
pip install numpy pandas matplotlib scikit-learn

# For deep learning (optional for now)
pip install tensorflow
# OR
pip install torch torchvision

If you installed Anaconda, most of these come pre-installed!

3. Test Your Setup

Create a file called test.py and run this:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

print("✅ All libraries installed successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

Run with: python test.py
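
The test script above stops at the first missing library. As a complementary sketch, the check below uses only the standard library to report every package, including the optional deep learning ones, without importing any of them. The function name `check_environment` and the package table are illustrative assumptions, not part of any library.

```python
import importlib.util
from importlib import metadata

# Import name -> pip distribution name (they differ for scikit-learn)
PACKAGES = {
    "numpy": "numpy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "sklearn": "scikit-learn",
    "tensorflow": "tensorflow",  # Optional
    "torch": "torch",            # Optional
}


def check_environment(packages=PACKAGES):
    """Return {import_name: version string, or None if the package is not installed}."""
    report = {}
    for import_name, dist_name in packages.items():
        # find_spec locates a module without actually importing it
        if importlib.util.find_spec(import_name) is None:
            report[import_name] = None
            continue
        try:
            report[import_name] = metadata.version(dist_name)
        except metadata.PackageNotFoundError:
            # Importable, but installed under a different distribution name
            report[import_name] = "unknown"
    return report


if __name__ == "__main__":
    for name, version in check_environment().items():
        status = version if version is not None else "not installed"
        print(f"{name:<12} {status}")
```

A "not installed" line for `tensorflow` or `torch` is expected if you skipped the optional deep learning install.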

IDE / Code Editor

A code editor is needed to write Python code. Choose one:

Jupyter Notebook

Best for learning and experimentation. Run code in cells, see output immediately, mix code with notes.

Install: pip install notebook
Run: jupyter notebook

Recommended for beginners

VS Code

Professional code editor. Great for building real projects. Install Python extension for best experience.

Download from: code.visualstudio.com

Google Colab

Jupyter notebook in your browser. No installation needed. Free GPUs for deep learning. Perfect for trying things out.

Just go to: colab.research.google.com

Easiest to start

Your First ML Program

Build a Complete ML Pipeline (about 40 lines of code)

Let's put it all together. This program loads data, trains a model, and makes predictions. Copy this into a notebook and run it.

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Step 1: Create sample data (house size vs price)
np.random.seed(42)
house_sizes = np.random.randint(800, 3500, 100)  # Square feet
base_prices = house_sizes * 150  # Base: $150 per sq ft
noise = np.random.normal(0, 50000, 100)  # Add randomness
house_prices = base_prices + noise

# Step 2: Prepare data
X = house_sizes.reshape(-1, 1)  # Features (must be 2D for sklearn)
y = house_prices                 # Target

# Step 3: Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 4: Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Step 5: Make predictions
y_pred = model.predict(X_test)

# Step 6: Evaluate
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # Same units as price (dollars); MSE itself is in dollars squared
r2 = r2_score(y_test, y_pred)
print("Model Performance:")
print(f"  Root Mean Squared Error: ${rmse:,.0f}")
print(f"  R² Score: {r2:.3f}")
print(f"\nModel learned: Price = ${model.coef_[0]:.2f} × Size + ${model.intercept_:,.0f}")

# Step 7: Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X_test, y_test, alpha=0.5, label='Actual Prices')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predictions')
plt.xlabel('House Size (sq ft)')
plt.ylabel('Price ($)')
plt.title('House Price Predictions')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()  # Note: When run as a script, execution pauses here until you close the plot window

# Predict for a new house
new_house_size = np.array([[2000]])
predicted_price = model.predict(new_house_size)
print(f"\nPrediction: A 2000 sq ft house is predicted to cost about ${predicted_price[0]:,.0f}")

What this code does:

  1. Creates fake data: 100 houses with sizes and prices (in real projects, data is loaded from CSV)
  2. Splits data: 80% for training, 20% for testing
  3. Trains model: Learns the relationship between size and price
  4. Makes predictions: Predicts prices for test houses
  5. Evaluates: Calculates how accurate the model is
  6. Visualizes: Creates a graph showing predictions vs reality
  7. Uses the model: Predicts price for a new 2000 sq ft house
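
To see what "learns the relationship" means, note that for a single feature ordinary least squares has a closed-form solution. The sketch below computes it by hand in NumPy and cross-checks it against `np.polyfit`. It follows the same data recipe as the program above but draws its own random numbers and fits all 100 points (no train/test split), so the exact values will differ from the program's output.

```python
import numpy as np

# Same recipe as the synthetic data above: price is about $150 per sq ft plus noise
rng = np.random.default_rng(42)
sizes = rng.integers(800, 3500, size=100).astype(float)
prices = 150.0 * sizes + rng.normal(0.0, 50_000.0, size=100)

# Ordinary least squares for one feature:
#   slope = cov(x, y) / var(x),  intercept = mean(y) - slope * mean(x)
x_mean, y_mean = sizes.mean(), prices.mean()
slope = np.sum((sizes - x_mean) * (prices - y_mean)) / np.sum((sizes - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Cross-check against NumPy's least-squares polynomial fit (degree 1 = a line)
np_slope, np_intercept = np.polyfit(sizes, prices, deg=1)

# RMSE is in dollars, which is easier to interpret than MSE (dollars squared)
residuals = prices - (slope * sizes + intercept)
rmse = np.sqrt(np.mean(residuals ** 2))

print(f"slope = {slope:.2f} $/sq ft, intercept = {intercept:,.0f} $")
print(f"RMSE  = {rmse:,.0f} $")
```

The fitted slope should come out near the $150 per square foot used to generate the data, and the RMSE near the $50,000 noise level. No model can do much better than that on this data, because the remaining error is random.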

The Standard ML Workflow

Most ML projects follow these steps:

  1. Load & Explore Data — Use Pandas to load CSV files, examine the first few rows, and understand the data structure
  2. Clean Data — Remove missing values, filter invalid entries, and handle duplicates using Pandas and NumPy
  3. Prepare Features — Select relevant columns as features (X) and the target variable (y) for prediction
  4. Split Data — Divide data into training and testing sets (typically 80/20 split) using Scikit-learn
  5. Train Model — Choose an algorithm (Linear Regression, Random Forest, etc.) and train it on the training data
  6. Make Predictions — Use the trained model to predict outcomes on the test data
  7. Evaluate Performance — Measure performance with metrics suited to the task: accuracy, precision, and recall for classification, or RMSE and R² for regression
  8. Improve & Iterate — Try different models, tune parameters, engineer new features, and repeat until performance is acceptable
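
Steps 1 through 7 can be sketched end to end as follows. Everything specific here is an assumption made for illustration: the column names, the synthetic data, the in-memory CSV (standing in for a real file passed to `pd.read_csv`), and the choice of a random forest.

```python
import io

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# 1. Load & explore (an in-memory CSV stands in for a real file; columns are made up)
rng = np.random.default_rng(0)
sizes = rng.integers(800, 3500, size=200)
bedrooms = rng.integers(1, 6, size=200)
prices = 150 * sizes + 10_000 * bedrooms + rng.normal(0, 20_000, size=200)
csv_text = pd.DataFrame(
    {"size_sqft": sizes, "bedrooms": bedrooms, "price": prices}
).to_csv(index=False)
csv_text += "1200,,180000\n1200,,180000\n"  # A row with a missing value, duplicated

df = pd.read_csv(io.StringIO(csv_text))
print(df.head())

# 2. Clean: drop duplicate rows, rows with missing values, and invalid prices
df = df.drop_duplicates().dropna()
df = df[df["price"] > 0]

# 3. Prepare features (X) and target (y)
X = df[["size_sqft", "bedrooms"]]
y = df["price"]

# 4. Split: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 5. Train
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 6. Predict
y_pred = model.predict(X_test)

# 7. Evaluate (RMSE is in the same units as the target)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Test RMSE: ${rmse:,.0f}")
```

Step 8 is then a matter of changing one piece at a time, such as the model, its parameters, or the feature columns, and comparing the test RMSE.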

Learn More

Official Documentation

📖 NumPy

numpy.org/doc

📖 Scikit-learn

scikit-learn.org/stable

📖 TensorFlow

tensorflow.org/tutorials