Python is the most widely used language for machine learning. This chapter introduces the essential libraries—NumPy, Pandas, Matplotlib, Scikit-learn, and TensorFlow/PyTorch—explaining what each one does and when to use them.
Python is the dominant language for machine learning and data science, thanks largely to its readable syntax and, above all, its ecosystem of specialized libraries.
A library is a collection of pre-written code that provides specific functionality. Instead of writing code from scratch for common tasks, libraries offer ready-made functions and tools.
Example: Instead of writing hundreds of lines of code to calculate the average of numbers, a library like NumPy provides a single function: np.mean()
How libraries work:
import numpy
print(numpy.mean([1, 2, 3]))  # 2.0
For machine learning, Python has specialized libraries that handle data manipulation, visualization, model training, and deployment. The following five libraries are essential for ML development.
Machine learning projects follow a common workflow: load data, clean it, explore patterns, train a model, and evaluate results. Each step requires specialized tools.
The following five libraries work together to handle this workflow. Think of them as tools in a toolbox—each has a specific purpose, and they're designed to work seamlessly with one another.
Quick overview:
- NumPy: fast numerical computing with arrays and matrices
- Pandas: data manipulation and analysis
- Matplotlib: data visualization and plotting
- Scikit-learn: machine learning algorithms and tools
- TensorFlow/PyTorch: deep learning and neural networks
The sections below explain what each library does, when to use it, and provide code examples. As a preview, the short sketch below shows how the first four fit together in one tiny workflow.
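A minimal sketch of that workflow, assuming a hypothetical houses.csv file with size and price columns (TensorFlow/PyTorch only enter the picture when you need neural networks):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

df = pd.read_csv('houses.csv')            # Pandas: load the data
df = df.dropna()                          # ...and clean it
df.plot.scatter(x='size', y='price')      # Matplotlib: explore patterns
plt.show()
X = df[['size']].values                   # NumPy arrays feed the model
y = df['price'].values
model = LinearRegression().fit(X, y)      # Scikit-learn: train a model
print(model.predict(np.array([[2000]])))  # ...and make a prediction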
Fast numerical computing with arrays and matrices
NumPy is a super-fast calculator that works on entire lists of numbers at once.
Instead of Python looping through numbers one-by-one (slow), NumPy sends the entire operation to optimized C code that processes everything in parallel (100× faster).
The magic happens through N-dimensional arrays:
Vectors (1D arrays):
A sequence of numbers. Example: [25, 50000, 720] representing [age, income, credit score] for one person.
Matrices (2D arrays):
Rows and columns of numbers. Your entire dataset is typically a matrix—each row is one person, each column is one feature.
Higher dimensions (3D, 4D, etc.):
Images are 3D arrays (height × width × color channels). Videos are 4D arrays (frames × height × width × channels). These are called tensors; the short example below shows their shapes.
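A quick sketch (with made-up sizes) of how these look as NumPy arrays:
import numpy as np

image = np.zeros((64, 64, 3))      # one 64×64 RGB image: height × width × channels
video = np.zeros((30, 64, 64, 3))  # 30 frames: frames × height × width × channels
print(image.ndim, video.ndim)      # 3 4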
Here's the speed difference from vectorization in action:
# Python way (slow): Loop through each element
numbers = [1, 2, 3, 4, 5]
result = []
for num in numbers:
result.append(num * 2)
# Takes: ~0.1 seconds for 1 million numbers
# NumPy way (fast): Process entire array at once
import numpy as np
numbers = np.array([1, 2, 3, 4, 5])
result = numbers * 2
# Takes: ~0.001 seconds for 1 million numbers (100x faster!)
A NumPy array (ndarray) is a grid of values, all of the same type. Arrays can be 1D (vector), 2D (matrix), or higher dimensions.
import numpy as np
# 1D array (vector) - like a single row or column
prices = np.array([100, 150, 200, 250])
print(prices.shape) # (4,) - 4 elements
# 2D array (matrix) - like a spreadsheet
data = np.array([
[1, 2, 3],
[4, 5, 6]
])
print(data.shape) # (2, 3) - 2 rows, 3 columns
# 3D array - like stacked matrices (images, video)
images = np.array([
[[1, 2], [3, 4]],
[[5, 6], [7, 8]]
])
print(images.shape)  # (2, 2, 2)
There are several common ways to create arrays:
import numpy as np
# From a Python list
arr = np.array([1, 2, 3, 4, 5])
# Array of zeros (useful for initialization)
zeros = np.zeros((3, 4)) # 3 rows, 4 columns of zeros
# Array of ones
ones = np.ones((2, 3)) # 2 rows, 3 columns of ones
# Range of numbers (like Python's range)
sequence = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
# Evenly spaced numbers
spaced = np.linspace(0, 1, 5) # [0.0, 0.25, 0.5, 0.75, 1.0]
# Random numbers (very common in ML)
random_arr = np.random.rand(3, 3) # 3x3 array of random numbers [0, 1)
normal = np.random.randn(100) # 100 random numbers from normal distribution
# Identity matrix (diagonal of 1s)
identity = np.eye(3)  # Used in linear algebra
Indexing works like Python lists, but extends to multiple dimensions.
import numpy as np
# 1D indexing (like Python lists)
arr = np.array([10, 20, 30, 40, 50])
print(arr[0]) # 10 (first element)
print(arr[-1]) # 50 (last element)
print(arr[1:4]) # [20, 30, 40] (slice: start:end)
# 2D indexing
data = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
])
print(data[0, 0]) # 1 (row 0, col 0)
print(data[1, 2]) # 6 (row 1, col 2)
print(data[0]) # [1, 2, 3] (entire first row)
print(data[:, 0]) # [1, 4, 7] (entire first column)
print(data[0:2, 1:3]) # [[2, 3], [5, 6]] (submatrix)
# Boolean indexing (very powerful for filtering)
arr = np.array([1, 2, 3, 4, 5, 6])
mask = arr > 3 # [False, False, False, True, True, True]
filtered = arr[mask] # [4, 5, 6]
# Or in one line:
filtered = arr[arr > 3]  # [4, 5, 6]
Operations apply to every element automatically (no loops needed).
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
# Arithmetic operations (element-wise)
print(arr + 10) # [11, 12, 13, 14, 15]
print(arr * 2) # [2, 4, 6, 8, 10]
print(arr ** 2) # [1, 4, 9, 16, 25] (squared)
# Operations between arrays (same shape)
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
print(arr1 + arr2) # [5, 7, 9]
print(arr1 * arr2) # [4, 10, 18] (element-wise multiplication)
# Mathematical functions
arr = np.array([1, 4, 9, 16])
print(np.sqrt(arr)) # [1., 2., 3., 4.]
print(np.log(arr)) # Natural logarithm
print(np.exp(arr)) # e^x
print(np.sin(arr))   # Sine
Broadcasting allows NumPy to work with arrays of different shapes during arithmetic operations.
import numpy as np
# Adding a scalar to an array (broadcasts the scalar)
arr = np.array([1, 2, 3, 4])
result = arr + 10 # 10 is "broadcast" to [10, 10, 10, 10]
print(result) # [11, 12, 13, 14]
# Broadcasting with 2D arrays
matrix = np.array([
[1, 2, 3],
[4, 5, 6]
])
row = np.array([10, 20, 30])
# Add row to each row of matrix
result = matrix + row
print(result)
# [[11, 22, 33],
# [14, 25, 36]]
# Normalizing data (common in ML)
data = np.array([[1, 2], [3, 4], [5, 6]])
mean = data.mean(axis=0) # Mean of each column
std = data.std(axis=0) # Std of each column
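# data has shape (3, 2); mean and std have shape (2,), so the
# subtraction and division below broadcast across all three rows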
normalized = (data - mean) / std  # Broadcasting!
NumPy also provides statistics, reshaping, and stacking helpers:
import numpy as np
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# Statistical functions
print(np.mean(data)) # 5.5 (average)
print(np.median(data)) # 5.5 (middle value)
print(np.std(data)) # 2.87 (standard deviation)
print(np.var(data)) # 8.25 (variance)
print(np.min(data)) # 1 (minimum)
print(np.max(data)) # 10 (maximum)
# Aggregation along axes (for 2D)
matrix = np.array([
[1, 2, 3],
[4, 5, 6]
])
print(matrix.sum(axis=0)) # [5, 7, 9] (sum each column)
print(matrix.sum(axis=1)) # [6, 15] (sum each row)
# Reshaping (critical for ML)
arr = np.array([1, 2, 3, 4, 5, 6])
reshaped = arr.reshape(2, 3) # Convert to 2x3 matrix
# [[1, 2, 3],
# [4, 5, 6]]
# Flattening (opposite of reshape)
flat = reshaped.flatten() # [1, 2, 3, 4, 5, 6]
# Stacking arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
vstacked = np.vstack([a, b]) # Vertical: [[1,2,3], [4,5,6]]
hstacked = np.hstack([a, b])  # Horizontal: [1,2,3,4,5,6]
Here's how NumPy is used in a typical ML preprocessing pipeline:
import numpy as np
# Simulated dataset: House prices
# Columns: [square_feet, bedrooms, age_years, price]
raw_data = np.array([
[1500, 3, 10, 300000],
[2000, 4, 5, 400000],
[1200, 2, 20, 250000],
[1800, 3, 8, 350000]
])
# Separate features (X) and target (y)
X = raw_data[:, 0:3] # First 3 columns (features)
y = raw_data[:, 3] # Last column (target)
# Normalize features (important for ML algorithms)
# Formula: (value - mean) / std
X_mean = X.mean(axis=0)
X_std = X.std(axis=0)
X_normalized = (X - X_mean) / X_std
print("Original features:\n", X)
print("\nNormalized features:\n", X_normalized)
print("\nTarget values:", y)
# Now X_normalized and y are ready for ML training!
Key NumPy habits to remember:
- Vectorize: write arr * 2 instead of looping. NumPy operations are 10-100× faster because they run in compiled C code.
- Slices are views, not copies: after subset = arr[0:3], both subset and arr point to the same memory. Change one, the other changes too. Use subset = arr[0:3].copy() to create an independent copy. Think of it like two people sharing a document (view) vs. each person having their own copy.
- Pick dtypes deliberately: dtype=np.float32 uses half the memory of float64. For large datasets, this prevents "out of memory" errors.
- Mind the axis: axis=0 operates down rows (column-wise), axis=1 operates across columns (row-wise). Getting this wrong silently produces incorrect results.
- Know the broadcasting rules: (3, 1) + (3, 4) → (3, 4) works. (4,) + (3, 4) works (the (4,) behaves like (1, 4), repeated down the rows). But (2, 3) + (3, 4) fails (the trailing dimensions 3 and 4 don't match). Think of it like auto-repeating values to fill missing dimensions.
Data manipulation and analysis in Python
Pandas lets you work with data like a spreadsheet, but with code instead of mouse clicks.
It's built on NumPy but adds row/column labels, making it easy to filter, sort, group, and transform real-world datasets. Think Excel meets Python's power and automation.
Pandas works with two main data structures:
Series (1D labeled array):
A single column of data with labels. Example: stock prices over time—each price has a date label.
DataFrame (2D labeled table):
Multiple columns with row and column labels. Think of it as a spreadsheet where you can reference data by meaningful names instead of just positions.
Series: A 1-dimensional labeled array (like a single column in a spreadsheet)
import pandas as pd
import numpy as np
# Create a Series
prices = pd.Series([100, 150, 200, 250], index=['Mon', 'Tue', 'Wed', 'Thu'])
print(prices)
# Mon 100
# Tue 150
# Wed 200
# Thu 250
# Access by index
print(prices['Wed']) # 200
print(prices.mean())  # 175.0
DataFrame: A 2-dimensional labeled table (like an entire spreadsheet)
import pandas as pd
# Create a DataFrame
data = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'Age': [25, 30, 35, 28],
'Salary': [50000, 60000, 70000, 55000],
'Department': ['Sales', 'IT', 'IT', 'Sales']
})
print(data)
# Name Age Salary Department
# 0 Alice 25 50000 Sales
# 1 Bob 30 60000 IT
# 2 Charlie 35 70000 IT
# 3 Diana 28 55000 Sales
# DataFrame properties
print(data.shape) # (4, 4) - 4 rows, 4 columns
print(data.columns) # Column names
print(data.dtypes)  # Data types of each column
Pandas can load data from many sources:
import pandas as pd
# From CSV file (most common)
df = pd.read_csv('data.csv')
# From Excel file
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# From JSON
df = pd.read_json('data.json')
# From SQL database
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM table_name', conn)
# From dictionary (for testing)
df = pd.DataFrame({
'col1': [1, 2, 3],
'col2': ['a', 'b', 'c']
})
# From NumPy array
import numpy as np
arr = np.array([[1, 2], [3, 4]])
df = pd.DataFrame(arr, columns=['A', 'B'])
Always start by understanding the data structure and content.
import pandas as pd
# Load example data
df = pd.read_csv('sales.csv')
# First and last rows
print(df.head()) # First 5 rows
print(df.tail(10)) # Last 10 rows
# Dataset info
print(df.info()) # Column names, types, non-null counts
print(df.shape) # (rows, columns)
print(df.columns) # Column names
# Summary statistics (for numerical columns)
print(df.describe()) # count, mean, std, min, max, quartiles
# Check for missing values
print(df.isnull().sum()) # Count of missing values per column
print(df.isna().sum()) # Same as isnull()
# Value counts for categorical columns
print(df['category'].value_counts()) # Frequency of each category
# Unique values
print(df['product'].unique()) # Array of unique values
print(df['product'].nunique())  # Count of unique values
Selecting and filtering rows and columns:
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [25, 30, 35, 28, 32],
'Salary': [50000, 60000, 70000, 55000, 65000],
'Department': ['Sales', 'IT', 'IT', 'Sales', 'HR']
})
# Select single column (returns Series)
ages = df['Age']
# Select multiple columns (returns DataFrame)
subset = df[['Name', 'Salary']]
# Select rows by index
first_row = df.iloc[0] # First row
first_three = df.iloc[0:3] # First 3 rows
# Select by label
row = df.loc[0] # Row with index 0
specific = df.loc[0:2, ['Name', 'Age']] # Rows 0-2, specific columns
# Boolean filtering (most common)
high_salary = df[df['Salary'] > 60000]
it_dept = df[df['Department'] == 'IT']
# Multiple conditions (use & for AND, | for OR)
young_high_earners = df[(df['Age'] < 30) & (df['Salary'] > 55000)]
sales_or_it = df[(df['Department'] == 'Sales') | (df['Department'] == 'IT')]
# isin() for multiple values
selected_depts = df[df['Department'].isin(['IT', 'HR'])]
# String operations
starts_with_a = df[df['Name'].str.startswith('A')]
contains_li = df[df['Name'].str.contains('li')]
Missing values (NaN, None, null) are common in real-world data and must be handled before ML.
import pandas as pd
import numpy as np
# Create data with missing values
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8],
'C': [9, 10, 11, 12]
})
# Check for missing values
print(df.isnull().sum()) # Count per column
# Drop rows with ANY missing values
df_dropped = df.dropna()
# Drop rows where ALL values are missing
df_dropped = df.dropna(how='all')
# Drop columns with missing values
df_dropped = df.dropna(axis=1)
# Fill missing values with a constant
df_filled = df.fillna(0)
# Fill with column mean (common for numerical data)
df['A'] = df['A'].fillna(df['A'].mean())
# Forward fill (use previous value)
df_filled = df.ffill()  # (fillna(method='ffill') is deprecated in recent pandas)
# Backward fill (use next value)
df_filled = df.bfill()
# Fill different columns differently
df_filled = df.fillna({
'A': df['A'].mean(),
'B': df['B'].median()
})
Creating, transforming, and reorganizing columns:
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'Salary': [50000, 60000, 70000]
})
# Create new column
df['Tax'] = df['Salary'] * 0.3
df['Net_Salary'] = df['Salary'] - df['Tax']
# Apply function to column
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Senior')
# Apply function to entire row
def categorize(row):
if row['Salary'] > 60000:
return 'High'
else:
return 'Low'
df['Salary_Category'] = df.apply(categorize, axis=1)
# Rename columns
df_renamed = df.rename(columns={'Name': 'Employee_Name', 'Age': 'Years'})
# Drop columns
df_dropped = df.drop(['Tax'], axis=1)
# Sort values
df_sorted = df.sort_values('Salary', ascending=False)
df_sorted = df.sort_values(['Age', 'Salary']) # Multiple columns
# Remove duplicates
df_unique = df.drop_duplicates()
df_unique = df.drop_duplicates(subset=['Name'])  # Based on specific column
GroupBy is one of the most powerful Pandas features—similar to SQL's GROUP BY.
import pandas as pd
df = pd.DataFrame({
'Department': ['Sales', 'Sales', 'IT', 'IT', 'HR'],
'Employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Salary': [50000, 60000, 70000, 65000, 55000],
'Age': [25, 30, 35, 28, 32]
})
# Group by department and calculate mean
dept_avg = df.groupby('Department')['Salary'].mean()
print(dept_avg)
# Department
# HR 55000.0
# IT 67500.0
# Sales 55000.0
# Multiple aggregations
dept_stats = df.groupby('Department').agg({
'Salary': ['mean', 'min', 'max'],
'Age': 'mean'
})
# Group by and count
dept_counts = df.groupby('Department').size()
# Multiple grouping columns
grouped = df.groupby(['Department', 'Age'])['Salary'].sum()
# Apply custom function
def salary_range(x):
return x.max() - x.min()
salary_ranges = df.groupby('Department')['Salary'].apply(salary_range)
Combine multiple datasets (like SQL JOINs).
import pandas as pd
# Two separate DataFrames
employees = pd.DataFrame({
'emp_id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'dept_id': [10, 20, 10, 30]
})
departments = pd.DataFrame({
'dept_id': [10, 20, 30],
'dept_name': ['Sales', 'IT', 'HR']
})
# Inner join (only matching rows)
merged = pd.merge(employees, departments, on='dept_id', how='inner')
# Left join (all from left, matching from right)
merged = pd.merge(employees, departments, on='dept_id', how='left')
# Outer join (all rows from both)
merged = pd.merge(employees, departments, on='dept_id', how='outer')
# Concatenate DataFrames vertically (stack rows)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
combined = pd.concat([df1, df2], ignore_index=True)
# Concatenate horizontally (add columns)
combined = pd.concat([df1, df2], axis=1)
Creating new features from existing data to improve model performance.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'date': ['2024-01-15', '2024-02-20', '2024-03-10'],
'price': [100, 150, 120],
'quantity': [5, 3, 7],
'category': ['A', 'B', 'A']
})
# Convert date to datetime
df['date'] = pd.to_datetime(df['date'])
# Extract date features
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['day_of_week'] = df['date'].dt.dayofweek # 0=Monday, 6=Sunday
# Create interaction features
df['total_revenue'] = df['price'] * df['quantity']
# Binning continuous variables
df['price_category'] = pd.cut(df['price'],
bins=[0, 100, 150, 200],
labels=['Low', 'Medium', 'High'])
# One-hot encoding (categorical to numerical)
df_encoded = pd.get_dummies(df, columns=['category'], prefix='cat')
print(df_encoded)
# Creates new columns: cat_A, cat_B with 0s and 1s
# Label encoding (for ordinal categories)
category_mapping = {'A': 0, 'B': 1, 'C': 2}
df['category_encoded'] = df['category'].map(category_mapping)
Complete data preprocessing pipeline for a customer churn prediction model:
import pandas as pd
import numpy as np
# 1. Load data
df = pd.read_csv('customer_data.csv')
# 2. Initial exploration
print(f"Dataset shape: {df.shape}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nData types:\n{df.dtypes}")
# 3. Handle missing values
# (column-level inplace=True is deprecated in recent pandas; assign back instead)
df['age'] = df['age'].fillna(df['age'].median())
df['income'] = df['income'].fillna(df['income'].mean())
df.dropna(subset=['customer_id'], inplace=True) # Drop if ID is missing
# 4. Remove duplicates
df.drop_duplicates(subset=['customer_id'], keep='first', inplace=True)
# 5. Feature engineering
# Calculate customer lifetime value
df['lifetime_value'] = df['monthly_charges'] * df['tenure_months']
# Create age groups
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 50, 100], labels=['Young', 'Middle', 'Senior'])
# One-hot encode categorical variables
df_encoded = pd.get_dummies(df, columns=['gender', 'contract_type'], drop_first=True)
# 6. Filter relevant data
# Keep only active customers for analysis
df_active = df_encoded[df_encoded['status'] == 'Active']
# 7. Prepare for ML
# Separate features (X) and target (y)
feature_cols = ['age', 'income', 'tenure_months', 'lifetime_value', 'gender_M', 'contract_type_Monthly']
X = df_active[feature_cols]
y = df_active['churn'] # Target variable (1=churned, 0=retained)
# 8. Convert to NumPy for ML library
X_array = X.values
y_array = y.values
print(f"\nFinal dataset ready for ML:")
print(f"Features shape: {X_array.shape}")
print(f"Target shape: {y_array.shape}")
# Now ready to feed into scikit-learn or other ML libraries!
Data visualization and plotting in Python
Matplotlib turns numbers into pictures—charts, graphs, and visualizations that make patterns visible.
It's Python's core plotting library. Before building any ML model, you visualize your data to spot outliers, see distributions, and understand relationships. Matplotlib makes that possible.
Common visualization types for ML:
Line plots and scatter plots:
Show trends over time or relationships between variables. Essential for exploring correlations and understanding how features relate to your target variable.
Histograms and distributions:
Reveal data distribution, outliers, and whether features are normally distributed—critical for choosing the right ML algorithms.
Heatmaps and subplots:
Compare multiple features at once, visualize correlation matrices, and create complex multi-panel figures for analysis.
Matplotlib has two interfaces: pyplot (simple, MATLAB-style) and object-oriented (more control, recommended for complex plots).
import matplotlib.pyplot as plt
import numpy as np
# Style 1: pyplot interface (simple, good for quick plots)
plt.plot([1, 2, 3, 4], [1, 4, 2, 3])
plt.title('Simple Plot')
plt.show()
# Style 2: Object-oriented interface (recommended)
fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [1, 4, 2, 3])
ax.set_title('OO Plot')
plt.show()
# The OO style gives more control, especially with subplots
1. Line Plot - Show trends over time or continuous data
import matplotlib.pyplot as plt
import numpy as np
# Sample data
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)
# Create line plot
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(x, y1, label='sin(x)', linewidth=2, color='blue')
ax.plot(x, y2, label='cos(x)', linewidth=2, color='red', linestyle='--')
ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_title('Line Plot Example')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()
2. Scatter Plot - Show relationships between two variables
import matplotlib.pyplot as plt
import numpy as np
# Generate data
np.random.seed(42)
x = np.random.randn(100)
y = 2 * x + np.random.randn(100) * 0.5
# Create scatter plot
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(x, y, alpha=0.6, s=50, c='blue', edgecolors='black')
ax.set_xlabel('Feature X')
ax.set_ylabel('Target Y')
ax.set_title('Scatter Plot: Relationship between X and Y')
ax.grid(True, alpha=0.3)
plt.show()
3. Bar Chart - Compare categories
import matplotlib.pyplot as plt
# Data
categories = ['Model A', 'Model B', 'Model C', 'Model D']
accuracies = [0.85, 0.92, 0.88, 0.95]
# Create bar chart
fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.bar(categories, accuracies, color=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728'])
ax.set_ylabel('Accuracy')
ax.set_title('Model Performance Comparison')
ax.set_ylim(0, 1.0)
# Add value labels on bars
for bar in bars:
height = bar.get_height()
ax.text(bar.get_x() + bar.get_width()/2., height,
f'{height:.2f}', ha='center', va='bottom')
plt.show()
4. Histogram - Show data distribution
import matplotlib.pyplot as plt
import numpy as np
# Generate data from normal distribution
data = np.random.normal(100, 15, 1000)
# Create histogram
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(data, bins=30, color='skyblue', edgecolor='black', alpha=0.7)
ax.set_xlabel('Value')
ax.set_ylabel('Frequency')
ax.set_title('Distribution of Test Scores')
ax.axvline(data.mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {data.mean():.1f}')
ax.legend()
plt.show()
5. Box Plot - Show distribution and outliers
import matplotlib.pyplot as plt
import numpy as np
# Generate data for multiple groups
data1 = np.random.normal(100, 10, 100)
data2 = np.random.normal(90, 20, 100)
data3 = np.random.normal(110, 15, 100)
# Create box plot
fig, ax = plt.subplots(figsize=(10, 6))
ax.boxplot([data1, data2, data3], labels=['Group A', 'Group B', 'Group C'])  # (Matplotlib ≥3.9 renames labels= to tick_labels=)
ax.set_ylabel('Values')
ax.set_title('Distribution Comparison Across Groups')
ax.grid(True, alpha=0.3, axis='y')
plt.show()
Nearly every element of a plot can be customized:
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
y = np.sin(x)
fig, ax = plt.subplots(figsize=(12, 6))
# Plot with customization
ax.plot(x, y, linewidth=2.5, color='#2E86AB', label='Sine Wave')
ax.fill_between(x, y, alpha=0.3, color='#A23B72')
# Labels and title
ax.set_xlabel('Time (seconds)', fontsize=12, fontweight='bold')
ax.set_ylabel('Amplitude', fontsize=12, fontweight='bold')
ax.set_title('Customized Sine Wave Plot', fontsize=14, fontweight='bold', pad=20)
# Grid and legend
ax.grid(True, linestyle='--', alpha=0.5, color='gray')
ax.legend(loc='upper right', framealpha=0.9, fontsize=10)
# Styling
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_facecolor('#F8F9FA')
fig.patch.set_facecolor('white')
plt.tight_layout()
plt.show()
Create multiple plots side-by-side for comparison.
import matplotlib.pyplot as plt
import numpy as np
# Generate data
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)
y3 = np.sin(x) * np.exp(-x/10)
y4 = np.cos(x) * np.exp(-x/10)
# Create 2x2 grid of subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Plot 1: Top-left
axes[0, 0].plot(x, y1, 'b-', linewidth=2)
axes[0, 0].set_title('Sine Wave')
axes[0, 0].grid(True, alpha=0.3)
# Plot 2: Top-right
axes[0, 1].plot(x, y2, 'r-', linewidth=2)
axes[0, 1].set_title('Cosine Wave')
axes[0, 1].grid(True, alpha=0.3)
# Plot 3: Bottom-left
axes[1, 0].plot(x, y3, 'g-', linewidth=2)
axes[1, 0].set_title('Damped Sine')
axes[1, 0].grid(True, alpha=0.3)
# Plot 4: Bottom-right
axes[1, 1].plot(x, y4, 'm-', linewidth=2)
axes[1, 1].set_title('Damped Cosine')
axes[1, 1].grid(True, alpha=0.3)
# Overall title
fig.suptitle('Comparison of Trigonometric Functions', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
Confusion Matrix - Visualize classification results
import matplotlib.pyplot as plt
import numpy as np
# Confusion matrix data
confusion_matrix = np.array([
[50, 10],
[5, 35]
])
fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(confusion_matrix, cmap='Blues')
# Labels
classes = ['Negative', 'Positive']
ax.set_xticks(np.arange(len(classes)))
ax.set_yticks(np.arange(len(classes)))
ax.set_xticklabels(classes)
ax.set_yticklabels(classes)
ax.set_xlabel('Predicted Label', fontsize=12)
ax.set_ylabel('True Label', fontsize=12)
ax.set_title('Confusion Matrix', fontsize=14, fontweight='bold')
# Add text annotations
for i in range(len(classes)):
for j in range(len(classes)):
text = ax.text(j, i, confusion_matrix[i, j],
ha="center", va="center", color="white", fontsize=14, fontweight='bold')
plt.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()
Feature Importance - Show which features matter most
import matplotlib.pyplot as plt
import numpy as np
# Sample feature importance data
features = ['Age', 'Income', 'Credit Score', 'Loan Amount', 'Employment Years']
importance = [0.25, 0.35, 0.20, 0.15, 0.05]
# Sort by importance
indices = np.argsort(importance)
fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(np.arange(len(features)), np.array(importance)[indices], color='steelblue')
ax.set_yticks(np.arange(len(features)))
ax.set_yticklabels(np.array(features)[indices])
ax.set_xlabel('Importance Score', fontsize=12)
ax.set_title('Feature Importance in Loan Prediction Model', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()
Learning Curves - Monitor model training
import matplotlib.pyplot as plt
import numpy as np
# Simulated training history
epochs = np.arange(1, 51)
train_loss = 2.5 * np.exp(-epochs/10) + np.random.randn(50) * 0.05
val_loss = 2.5 * np.exp(-epochs/10) + 0.3 + np.random.randn(50) * 0.08
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(epochs, train_loss, label='Training Loss', linewidth=2, color='blue')
ax.plot(epochs, val_loss, label='Validation Loss', linewidth=2, color='red')
ax.set_xlabel('Epoch', fontsize=12)
ax.set_ylabel('Loss', fontsize=12)
ax.set_title('Model Training Progress', fontsize=14, fontweight='bold')
ax.legend(loc='upper right', fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Matplotlib has built-in styles for consistent, professional-looking plots.
import matplotlib.pyplot as plt
import numpy as np
# See available styles
print(plt.style.available)
# ['seaborn-v0_8', 'ggplot', 'dark_background', 'bmh', 'fivethirtyeight', ...]
# Use a style
plt.style.use('seaborn-v0_8-darkgrid')
# Or use temporarily
with plt.style.context('dark_background'):
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot([1, 2, 3, 4], [1, 4, 2, 3], linewidth=2)
ax.set_title('Plot with Dark Background')
plt.show()
# Reset to default
plt.style.use('default')
Complete EDA visualization workflow for a classification problem:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Simulated dataset
np.random.seed(42)
n_samples = 200
data = pd.DataFrame({
'age': np.random.randint(18, 70, n_samples),
'income': np.random.randint(30000, 150000, n_samples),
'credit_score': np.random.randint(300, 850, n_samples),
'approved': np.random.choice([0, 1], n_samples, p=[0.4, 0.6])
})
# Create comprehensive EDA dashboard
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
# Plot 1: Age distribution by approval status
approved = data[data['approved'] == 1]['age']
rejected = data[data['approved'] == 0]['age']
axes[0, 0].hist([approved, rejected], bins=20, label=['Approved', 'Rejected'], alpha=0.7)
axes[0, 0].set_xlabel('Age')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Age Distribution by Approval Status')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
# Plot 2: Income vs Credit Score scatter
colors = ['red' if x == 0 else 'green' for x in data['approved']]
axes[0, 1].scatter(data['income'], data['credit_score'], c=colors, alpha=0.6)
axes[0, 1].set_xlabel('Income ($)')
axes[0, 1].set_ylabel('Credit Score')
axes[0, 1].set_title('Income vs Credit Score (Red=Rejected, Green=Approved)')
axes[0, 1].grid(True, alpha=0.3)
# Plot 3: Approval rate by age group
age_bins = [18, 30, 40, 50, 70]
data['age_group'] = pd.cut(data['age'], bins=age_bins, labels=['18-30', '30-40', '40-50', '50-70'])
approval_rate = data.groupby('age_group')['approved'].mean()
axes[1, 0].bar(approval_rate.index.astype(str), approval_rate.values, color='steelblue')
axes[1, 0].set_xlabel('Age Group')
axes[1, 0].set_ylabel('Approval Rate')
axes[1, 0].set_title('Loan Approval Rate by Age Group')
axes[1, 0].grid(True, alpha=0.3, axis='y')
# Plot 4: Credit score box plot by approval
rejected_scores = data[data['approved'] == 0]['credit_score']
approved_scores = data[data['approved'] == 1]['credit_score']
axes[1, 1].boxplot([rejected_scores, approved_scores], labels=['Rejected', 'Approved'])
axes[1, 1].set_ylabel('Credit Score')
axes[1, 1].set_title('Credit Score Distribution by Approval Status')
axes[1, 1].grid(True, alpha=0.3, axis='y')
# Overall title
fig.suptitle('Loan Approval EDA Dashboard', fontsize=16, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()
# Key insights from visualization:
# 1. Age distribution shows approval patterns
# 2. Higher credit scores correlate with approval
# 3. Income alone may not be decisive factor
# 4. Age groups show different approval rates
Machine learning algorithms and tools
Scikit-learn is your ML algorithm toolkit—dozens of ready-to-use models that all work the same way: fit(), predict(), evaluate().
Need to predict house prices? Use LinearRegression. Classify emails as spam? Try RandomForest. Every algorithm has the same interface, so switching between models takes one line of code. This is where your prepared data becomes predictions.
Scikit-learn provides three main components:
Estimators (models that learn):
Algorithms like LinearRegression, RandomForest, or SVM that learn patterns from training data and make predictions on new data.
Transformers (data preprocessing):
Tools like StandardScaler or OneHotEncoder that transform your data into the format ML algorithms need—scaling numbers, encoding categories, reducing dimensions.
Pipelines (workflow automation):
Chain preprocessing and modeling steps together so you can apply the entire workflow in one command—essential for clean, reproducible ML code.
Every estimator shares the same interface: fit(X, y) to train, predict(X) to use. This means switching from linear regression to random forest requires changing one line of code. You learn the API once and can experiment with dozens of algorithms, as the short sketch below shows.
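A minimal sketch of that uniform API, using a tiny made-up dataset:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import numpy as np

X = np.array([[1], [2], [3], [4]])   # toy feature
y = np.array([2.0, 4.1, 5.9, 8.2])   # toy target

for model in (LinearRegression(), RandomForestRegressor(n_estimators=10)):
    model.fit(X, y)                                     # same call for every estimator
    print(type(model).__name__, model.predict([[5]]))   # same call for predictions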
Before training any model, data must be split into training and testing sets to evaluate performance on unseen data.
from sklearn.model_selection import train_test_split
import numpy as np
# Sample dataset (10 rows so a stratified 20% test split keeps both classes)
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10],
              [11, 12], [13, 14], [15, 16], [17, 18], [19, 20]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
# Split 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2, # 20% for testing
random_state=42, # Reproducible split
stratify=y # Maintain class balance
)
print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}") from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5) # 5-fold CV
print(f"Accuracy per fold: {scores}")
print(f"Mean accuracy: {scores.mean():.3f}") Raw data rarely works directly with ML algorithms. Scikit-learn provides powerful preprocessing tools.
# 1. Scaling Numerical Features
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# StandardScaler: mean=0, std=1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
# MinMaxScaler: scale to [0, 1]
minmax = MinMaxScaler()
X_normalized = minmax.fit_transform(X_train)
# 2. Encoding Categorical Features
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# LabelEncoder: convert categories to numbers
le = LabelEncoder()
colors = np.array(['red', 'blue', 'green', 'blue', 'red'])
encoded = le.fit_transform(colors) # [2, 0, 1, 0, 2]
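# (LabelEncoder sorts classes alphabetically: blue=0, green=1, red=2)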
# OneHotEncoder: create binary columns
ohe = OneHotEncoder(sparse_output=False)
categories = np.array([['cat'], ['dog'], ['cat']])
one_hot = ohe.fit_transform(categories)
# [[1, 0], # cat
# [0, 1], # dog
# [1, 0]] # cat
# 3. Handling Missing Values
from sklearn.impute import SimpleImputer
# Fill missing values with mean
imputer = SimpleImputer(strategy='mean')
X_with_nan = np.array([[1, 2], [np.nan, 4], [7, 6]])
X_filled = imputer.fit_transform(X_with_nan)
# [[1, 2],
# [4, 4], # NaN replaced with mean (4)
#  [7, 6]]
Scikit-learn provides dozens of algorithms. Here are the most common ones with examples.
# 1. REGRESSION: Predict Continuous Values
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
# Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
predictions = lr.predict(X_test)
# Ridge Regression (L2 regularization)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
# Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
# 2. CLASSIFICATION: Predict Categories
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
# Logistic Regression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)
# Decision Tree
tree = DecisionTreeClassifier(max_depth=5)
tree.fit(X_train, y_train)
# Random Forest
rf_clf = RandomForestClassifier(n_estimators=100)
rf_clf.fit(X_train, y_train)
# Support Vector Machine
svm = SVC(kernel='rbf', C=1.0)
svm.fit(X_train, y_train)
# Gradient Boosting
gb = GradientBoostingClassifier(n_estimators=100)
gb.fit(X_train, y_train)
# 3. CLUSTERING: Find Groups in Data
from sklearn.cluster import KMeans, DBSCAN
# K-Means Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X) # No labels needed
# DBSCAN (density-based clustering)
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(X)
Training a model is only half the work—evaluating its performance is equally critical.
from sklearn.metrics import (
accuracy_score, precision_score, recall_score, f1_score,
confusion_matrix, classification_report,
mean_squared_error, mean_absolute_error, r2_score,
roc_auc_score, roc_curve
)
# Classification Metrics
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
accuracy = accuracy_score(y_true, y_pred) # Overall correctness
precision = precision_score(y_true, y_pred) # True positives / predicted positives
recall = recall_score(y_true, y_pred) # True positives / actual positives
f1 = f1_score(y_true, y_pred) # Harmonic mean of precision & recall
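# Worked numbers for this example: TP=2, FP=0, FN=1, so
# precision = 2/2 = 1.0, recall = 2/3 ≈ 0.667, F1 = 2·(1.0·0.667)/(1.0+0.667) = 0.8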
# Confusion Matrix
cm = confusion_matrix(y_true, y_pred)
# [[2, 0], # True negatives, False positives
# [1, 2]] # False negatives, True positives
# Full Report
report = classification_report(y_true, y_pred)
print(report)
# ROC-AUC for probability predictions
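# (assumes 'model' is a classifier already fitted on the training data)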
y_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_proba)
# Regression Metrics
y_true_reg = [3.0, -0.5, 2.0, 7.0]
y_pred_reg = [2.5, 0.0, 2.0, 8.0]
mse = mean_squared_error(y_true_reg, y_pred_reg) # Mean squared error
rmse = mean_squared_error(y_true_reg, y_pred_reg) ** 0.5  # Root MSE (the squared=False flag is deprecated in newer scikit-learn)
mae = mean_absolute_error(y_true_reg, y_pred_reg) # Mean absolute error
r2 = r2_score(y_true_reg, y_pred_reg)  # R-squared (1.0 is perfect; can be negative for poor fits)
Every ML algorithm has hyperparameters (settings) that dramatically affect performance. Scikit-learn automates the search for optimal values.
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
# Define model
rf = RandomForestClassifier(random_state=42)
# Grid Search: Try Every Combination
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [5, 10, 15, None],
'min_samples_split': [2, 5, 10]
}
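# 3 × 4 × 3 = 36 combinations; with the 5-fold CV below, that's 180 model fits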
grid_search = GridSearchCV(
rf, param_grid,
cv=5, # 5-fold cross-validation
scoring='accuracy',
n_jobs=-1 # Use all CPU cores
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
# Use best model
best_model = grid_search.best_estimator_
predictions = best_model.predict(X_test)
# Randomized Search: Sample Random Combinations (Faster)
from scipy.stats import randint
param_dist = {
'n_estimators': randint(50, 500),
'max_depth': randint(5, 50)
}
random_search = RandomizedSearchCV(
rf, param_dist,
n_iter=20, # Try 20 random combinations
cv=5,
random_state=42
)
random_search.fit(X_train, y_train)
One of the most common mistakes in ML is data leakage—when information from the test set accidentally influences training. Pipelines prevent this by ensuring all preprocessing steps are learned only from training data.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Without Pipeline (WRONG - causes data leakage)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # ❌ Uses ALL data (train + test)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
model = LogisticRegression()
model.fit(X_train, y_train)
# With Pipeline (CORRECT - no leakage)
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', LogisticRegression())
])
# Split BEFORE preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Scaler learns ONLY from X_train
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
# Complex Pipeline with Multiple Steps
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
complex_pipeline = Pipeline([
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(n_estimators=100))
])
complex_pipeline.fit(X_train, y_train)
predictions = complex_pipeline.predict(X_test)
Pipelines also work with grid search; prefix each parameter with the step name and a double underscore:
param_grid = {
'classifier__n_estimators': [50, 100, 200],
'classifier__max_depth': [5, 10, None]
}
grid = GridSearchCV(complex_pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)
Complete workflow from raw data to deployed model.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
# 1. Load Data (a tiny inline table for illustration; a real project needs
#    far more rows for the cross-validation steps below to run)
data = pd.DataFrame({
'income': [50000, 30000, np.nan, 80000, 45000, 60000],
'age': [25, 45, 35, 50, np.nan, 40],
'employment': ['self', 'employed', 'employed', 'employed', 'self', 'employed'],
'loan_amount': [10000, 5000, 15000, 20000, 8000, 12000],
'approved': [1, 0, 1, 1, 0, 1] # Target
})
# 2. Split Features and Target
X = data.drop('approved', axis=1)
y = data['approved']
# 3. Identify Column Types
numeric_features = ['income', 'age', 'loan_amount']
categorical_features = ['employment']
# 4. Create Preprocessing Pipelines
numeric_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline([
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
# 5. Combine Transformers
preprocessor = ColumnTransformer([
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# 6. Create Full Pipeline
model = Pipeline([
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(random_state=42))
])
# 7. Split Data (BEFORE preprocessing)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# 8. Train Model
model.fit(X_train, y_train)
# 9. Evaluate
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]
print("Classification Report:")
print(classification_report(y_test, y_pred))
print(f"\nROC-AUC Score: {roc_auc_score(y_test, y_proba):.3f}")
# 10. Hyperparameter Tuning
param_grid = {
'classifier__n_estimators': [50, 100, 200],
'classifier__max_depth': [5, 10, None]
}
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='roc_auc')
grid_search.fit(X_train, y_train)
print(f"\nBest Parameters: {grid_search.best_params_}")
print(f"Best ROC-AUC: {grid_search.best_score_:.3f}")
# 11. Save Model for Production
import joblib
joblib.dump(grid_search.best_estimator_, 'credit_model.pkl')
# 12. Load and Use in Production
loaded_model = joblib.load('credit_model.pkl')
new_applicant = pd.DataFrame({
'income': [55000],
'age': [30],
'employment': ['employed'],
'loan_amount': [15000]
})
prediction = loaded_model.predict(new_applicant)
probability = loaded_model.predict_proba(new_applicant)[:, 1]
print(f"Approval: {prediction[0]}, Probability: {probability[0]:.2f}") stratify in train_test_split and class_weight='balanced' in modelsjoblib to save and load models for productionDeep learning and neural networks
TensorFlow and PyTorch are deep learning frameworks—tools for building neural networks that learn from images, text, audio, and video.
Scikit-learn handles traditional ML on tabular data. When you need to process unstructured data (recognize faces, understand language, generate images), you need neural networks. These frameworks handle the complex math, GPU acceleration, and automatic differentiation that make deep learning possible.
Core capabilities of deep learning frameworks:
Tensors (multi-dimensional arrays):
Like NumPy arrays but optimized for GPU computation. Images are 3D tensors, batches of images are 4D tensors. Everything in deep learning is a tensor operation.
Automatic differentiation (autograd):
Automatically computes gradients for backpropagation. You define the forward pass (how data flows through the network), and the framework calculates how to update the weights (see the short sketch after this list).
GPU acceleration:
Neural networks have millions of parameters. GPUs perform parallel matrix operations 10-100x faster than CPUs, making deep learning training practical.
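A minimal PyTorch sketch of autograd in action (TensorFlow's tf.GradientTape plays the same role):
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2 + 2 * x        # forward pass: y = x² + 2x
y.backward()              # autograd computes dy/dx automatically
print(x.grad)             # tensor(8.) because dy/dx = 2x + 2 = 8 at x = 3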
As of 2025, both frameworks are mature, highly optimized, and support dynamic computation. The choice depends on use case, not technical superiority.
PyTorch Strengths (PyTorch 2.x):
✓ Pythonic & intuitive: Write neural networks like normal Python classes
✓ Dynamic computation graphs: Debug with standard Python tools
✓ Research-friendly: Most academic papers use PyTorch
✓ Faster prototyping: Less boilerplate code
✓ Better for custom architectures and experimental models
✓ Reported as ~25% faster in some 2024 CNN training benchmarks
TensorFlow Strengths (TensorFlow 2.x):
✓ Production ecosystem: TF Serving, TF Lite (mobile), TF.js (browser)
✓ Enterprise deployment: Mature MLOps tools
✓ Keras API integrated: High-level, beginner-friendly
✓ TPU support: Google's Tensor Processing Units
✓ Mobile/edge deployment: Better tooling for iOS/Android
✓ TensorBoard: Superior visualization for training metrics
Reality in 2025:
Both frameworks support static and dynamic modes. Both scale to production.
Both have excellent documentation. Pick based on your team's preference.
Both frameworks share fundamental concepts, though syntax differs.
# 1. TENSORS: Multidimensional Arrays (like NumPy, but GPU-ready)
# TensorFlow
import tensorflow as tf
tensor_tf = tf.constant([[1, 2], [3, 4]])
print(tensor_tf.shape) # (2, 2)
# PyTorch
import torch
tensor_pt = torch.tensor([[1, 2], [3, 4]])
print(tensor_pt.shape) # torch.Size([2, 2])
# 2. LAYERS: Building Blocks of Neural Networks
# TensorFlow (Keras API)
from tensorflow.keras import layers
dense_layer = layers.Dense(64, activation='relu')
# PyTorch (torch.nn)
import torch.nn as nn
dense_layer = nn.Linear(in_features=32, out_features=64)
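# (PyTorch requires in_features explicitly; Keras infers the input size at build time)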
# 3. MODELS: Complete Neural Networks
# TensorFlow Sequential API
model_tf = tf.keras.Sequential([
layers.Dense(128, activation='relu'),
layers.Dense(64, activation='relu'),
layers.Dense(10, activation='softmax')
])
# PyTorch Class-based API
class SimpleNet(nn.Module):
def __init__(self):
super().__init__()
self.fc1 = nn.Linear(784, 128)
self.fc2 = nn.Linear(128, 64)
self.fc3 = nn.Linear(64, 10)
def forward(self, x):
x = torch.relu(self.fc1(x))
x = torch.relu(self.fc2(x))
x = self.fc3(x)
return x
model_pt = SimpleNet()
Side-by-side comparison of building, training, and evaluating a neural network for image classification.
# TensorFlow Example: MNIST Digit Classification
import tensorflow as tf
from tensorflow.keras import layers, models
import numpy as np
# 1. Load Data
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
# 2. Preprocess
X_train = X_train.reshape(-1, 784).astype('float32') / 255.0
X_test = X_test.reshape(-1, 784).astype('float32') / 255.0
# 3. Build Model
model = models.Sequential([
layers.Dense(128, activation='relu', input_shape=(784,)),
layers.Dropout(0.2),
layers.Dense(64, activation='relu'),
layers.Dense(10, activation='softmax')
])
# 4. Compile
model.compile(
optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy']
)
# 5. Train
history = model.fit(
X_train, y_train,
epochs=10,
batch_size=32,
validation_split=0.2,
verbose=1
)
# 6. Evaluate
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"Test accuracy: {test_acc:.3f}")
# 7. Predict
predictions = model.predict(X_test[:5])
print(f"Predicted classes: {np.argmax(predictions, axis=1)}")
# 8. Save Model
model.save('mnist_model.h5')  # (newer Keras versions prefer the native 'mnist_model.keras' format)
# PyTorch Example: MNIST Digit Classification
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from torchvision import datasets, transforms
# 1. Load Data
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.5,), (0.5,))
])
train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST(root='./data', train=False, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)
# 2. Build Model
class MNISTNet(nn.Module):
def __init__(self):
super().__init__()
self.fc1 = nn.Linear(784, 128)
self.dropout = nn.Dropout(0.2)
self.fc2 = nn.Linear(128, 64)
self.fc3 = nn.Linear(64, 10)
def forward(self, x):
x = x.view(-1, 784) # Flatten
x = torch.relu(self.fc1(x))
x = self.dropout(x)
x = torch.relu(self.fc2(x))
x = self.fc3(x)
return x
model = MNISTNet()
# 3. Define Loss and Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# 4. Train
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
for epoch in range(10):
model.train()
for batch_idx, (data, target) in enumerate(train_loader):
data, target = data.to(device), target.to(device)
# Forward pass
optimizer.zero_grad()
output = model(data)
loss = criterion(output, target)
# Backward pass
loss.backward()
optimizer.step()
print(f"Epoch {epoch+1}/10, Loss: {loss.item():.4f}")
# 5. Evaluate
model.eval()
correct = 0
total = 0
with torch.no_grad():
for data, target in test_loader:
data, target = data.to(device), target.to(device)
output = model(data)
_, predicted = torch.max(output.data, 1)
total += target.size(0)
correct += (predicted == target).sum().item()
print(f"Test accuracy: {correct / total:.3f}")
# 6. Save Model
torch.save(model.state_dict(), 'mnist_model.pth')
Different problems require different architectures. Here are the most common patterns.
# 1. CONVOLUTIONAL NEURAL NETWORKS (CNNs) - For Images
# TensorFlow
cnn_model = tf.keras.Sequential([
layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
layers.MaxPooling2D((2, 2)),
layers.Conv2D(64, (3, 3), activation='relu'),
layers.MaxPooling2D((2, 2)),
layers.Flatten(),
layers.Dense(64, activation='relu'),
layers.Dense(10, activation='softmax')
])
# PyTorch
class CNN(nn.Module):
def __init__(self):
super().__init__()
self.conv1 = nn.Conv2d(1, 32, kernel_size=3)
self.conv2 = nn.Conv2d(32, 64, kernel_size=3)
self.pool = nn.MaxPool2d(2, 2)
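        # For a 28×28 input: conv1 (k=3) → 26, pool → 13, conv2 (k=3) → 11, pool → 5,
        # so the flattened feature size below is 64 channels × 5 × 5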
self.fc1 = nn.Linear(64 * 5 * 5, 64)
self.fc2 = nn.Linear(64, 10)
def forward(self, x):
x = self.pool(torch.relu(self.conv1(x)))
x = self.pool(torch.relu(self.conv2(x)))
x = x.view(-1, 64 * 5 * 5)
x = torch.relu(self.fc1(x))
x = self.fc2(x)
return x
# 2. RECURRENT NEURAL NETWORKS (RNNs) - For Sequential Data
# TensorFlow
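# (timesteps and features are placeholders for your sequence length and feature count)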
rnn_model = tf.keras.Sequential([
layers.LSTM(128, input_shape=(timesteps, features)),
layers.Dense(64, activation='relu'),
layers.Dense(1)
])
# PyTorch
class RNN(nn.Module):
def __init__(self, input_size, hidden_size, output_size):
super().__init__()
self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
self.fc = nn.Linear(hidden_size, output_size)
def forward(self, x):
out, _ = self.lstm(x)
out = self.fc(out[:, -1, :]) # Take last timestep
return out
# 3. TRANSFORMERS - For Language (Modern Approach)
# MultiHeadAttention needs explicit query/value inputs, so it can't live
# inside a Sequential model; use the Keras functional API instead.
# (seq_len, vocab_size, embedding_dim, and num_classes are placeholders.)
inputs = tf.keras.Input(shape=(seq_len,))
x = layers.Embedding(vocab_size, embedding_dim)(inputs)
x = layers.MultiHeadAttention(num_heads=8, key_dim=64)(x, x)  # self-attention
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dense(64, activation='relu')(x)
outputs = layers.Dense(num_classes, activation='softmax')(x)
transformer_model = tf.keras.Model(inputs, outputs)
Understanding what happens inside model.fit() (TensorFlow) or the training loop (PyTorch).
# Manual Training Loop (PyTorch) - Shows What's Happening
for epoch in range(num_epochs):
for batch_data, batch_labels in data_loader:
# 1. FORWARD PASS
# Pass data through the network
predictions = model(batch_data)
# 2. COMPUTE LOSS
# Measure how wrong the predictions are
loss = loss_function(predictions, batch_labels)
# 3. ZERO GRADIENTS
# Clear previous gradients (PyTorch accumulates them)
optimizer.zero_grad()
# 4. BACKWARD PASS (Backpropagation)
# Compute gradients for all parameters
loss.backward()
# 5. UPDATE WEIGHTS
# Adjust parameters using gradients
optimizer.step()
print(f"Epoch {epoch}, Loss: {loss.item()}")
# TensorFlow hides this in model.fit(), but the steps are identical:
# 1. Forward pass → 2. Loss → 3. Zero gradients → 4. Backward pass → 5. Update
Deep learning usually requires GPUs for practical training times. Both frameworks make this simple.
# TensorFlow: Automatic GPU Detection
import tensorflow as tf
# TensorFlow automatically uses GPU if available
print("GPUs Available:", len(tf.config.list_physical_devices('GPU')))
# Manual device placement (rarely needed)
with tf.device('/GPU:0'):
model = tf.keras.Sequential([...])
model.fit(X_train, y_train)
# PyTorch: Explicit Device Management
import torch
# Check GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
# Move model and data to GPU
model = SimpleNet().to(device)
data = data.to(device)
labels = labels.to(device)
# All operations now run on GPU
output = model(data)
# Performance Impact:
# CPU: Training 1 epoch on MNIST takes ~60 seconds
# GPU: Training 1 epoch on MNIST takes ~6 seconds (10x faster)
# For large models (ResNet, BERT): GPU is 50-100x faster
Using pre-trained models (trained on millions of images) for custom tasks—the most practical approach for real projects.
# TensorFlow: Transfer Learning with MobileNetV2
import tensorflow as tf
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras import layers, models
# 1. Load Pre-trained Model (trained on ImageNet)
base_model = MobileNetV2(
input_shape=(224, 224, 3),
include_top=False, # Remove classification head
weights='imagenet'
)
# 2. Freeze Base Model (don't retrain it)
base_model.trainable = False
# 3. Add Custom Classification Head
model = models.Sequential([
base_model,
layers.GlobalAveragePooling2D(),
layers.Dense(128, activation='relu'),
layers.Dropout(0.5),
layers.Dense(num_classes, activation='softmax') # Custom classes
])
# 4. Compile and Train
model.compile(
optimizer='adam',
loss='categorical_crossentropy',
metrics=['accuracy']
)
# Only trains the new layers, not the pre-trained base
model.fit(train_data, epochs=10, validation_data=val_data)
# Why Transfer Learning?
# - Training from scratch: requires millions of images, weeks of GPU time
# - Transfer learning: works with 100s of images, hours of training
# - Can often reach 90%+ accuracy on custom tasks with minimal data
# PyTorch: Transfer Learning with ResNet
import torch
import torch.nn as nn
from torchvision import models, transforms
# 1. Load Pre-trained Model
resnet = models.resnet50(weights='IMAGENET1K_V1')  # (pretrained=True is deprecated in newer torchvision)
# 2. Freeze Base Layers
for param in resnet.parameters():
param.requires_grad = False
# 3. Replace Final Layer
num_features = resnet.fc.in_features
resnet.fc = nn.Linear(num_features, num_classes)
# 4. Only the final layer will be trained
optimizer = torch.optim.Adam(resnet.fc.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
# (assumes train_loader is a DataLoader over your labeled images)
# 5. Train
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
resnet = resnet.to(device)
for epoch in range(10):
for images, labels in train_loader:
images, labels = images.to(device), labels.to(device)
optimizer.zero_grad()
outputs = resnet(images)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
torch.save(resnet.state_dict(), 'custom_resnet.pth')
Training tip: adding BatchNormalization layers can help stabilize training.
Both frameworks offer several routes from trained model to production:
# TensorFlow Deployment Options
# 1. TensorFlow Lite (Mobile/Edge)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
with open('model.tflite', 'wb') as f:
f.write(tflite_model)
# 2. TensorFlow.js (Browser)
# tensorflowjs_converter --input_format=keras model.h5 tfjs_model/
# 3. TensorFlow Serving (Production API)
# docker run -p 8501:8501 --mount type=bind,source=/models/my_model,target=/models/my_model -e MODEL_NAME=my_model -t tensorflow/serving
# PyTorch Deployment Options
# 1. TorchServe (Production API)
# torch-model-archiver --model-name mnist --version 1.0 --model-file model.py --serialized-file model.pth
# torchserve --start --model-store model_store --models mnist=mnist.mar
# 2. ONNX (Cross-platform)
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "model.onnx")
# 3. TorchScript (Production PyTorch)
scripted_model = torch.jit.script(model)
scripted_model.save("model_scripted.pt")
Python and these libraries are required. Here's the setup process:
Download from python.org or install via Anaconda (recommended for beginners—it includes everything).
Anaconda: anaconda.com/download
Open your terminal/command prompt and run:
# If using pip (Python's package manager)
pip install numpy pandas matplotlib scikit-learn
# For deep learning (optional for now)
pip install tensorflow
# OR
pip install torch torchvision
If you installed Anaconda, most of these come pre-installed!
Create a file called test.py and run this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
print("✅ All libraries installed successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}") Run with: python test.py
A code editor is needed to write Python code. Choose one:
Best for learning and experimentation. Run code in cells, see output immediately, mix code with notes.
Install: pip install notebook
Run: jupyter notebook
Professional code editor. Great for building real projects. Install Python extension for best experience.
Download from: code.visualstudio.com
Jupyter notebook in your browser. No installation needed. Free GPUs for deep learning. Perfect for trying things out.
Just go to: colab.research.google.com
Easiest way to start.
Let's put it all together. This program loads data, trains a model, and makes predictions. Copy this into a notebook and run it.
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Step 1: Create sample data (house size vs price)
np.random.seed(42)
house_sizes = np.random.randint(800, 3500, 100) # Square feet
base_prices = house_sizes * 150 # Base: $150 per sq ft
noise = np.random.normal(0, 50000, 100) # Add randomness
house_prices = base_prices + noise
# Step 2: Prepare data
X = house_sizes.reshape(-1, 1) # Features (must be 2D for sklearn)
y = house_prices # Target
# Step 3: Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Step 4: Train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Step 5: Make predictions
y_pred = model.predict(X_test)
# Step 6: Evaluate
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5  # RMSE is in dollars; raw MSE is in squared dollars
r2 = r2_score(y_test, y_pred)
print(f"Model Performance:")
print(f"  Root Mean Squared Error: ${rmse:,.0f}")
print(f"  R² Score: {r2:.3f}")
print(f"\nModel learned: Price = ${model.coef_[0]:.2f} × Size + ${model.intercept_:,.0f}")
# Step 7: Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X_test, y_test, alpha=0.5, label='Actual Prices')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predictions')
plt.xlabel('House Size (sq ft)')
plt.ylabel('Price ($)')
plt.title('House Price Predictions')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show() # Note: Execution pauses here until you close the plot window
# Predict for a new house
new_house_size = np.array([[2000]])
predicted_price = model.predict(new_house_size)
print(f"\nPrediction: A 2000 sq ft house will cost ${predicted_price[0]:,.0f}")