What is a Neural Network?

A neural network is a computational model inspired by how biological neurons work in the brain. At its core, it’s a system that learns to map inputs to outputs by finding patterns in data.

Simple Neural Network - Housing Price Prediction

The most basic neural network has this structure:

Input → Neuron → Output

For housing price prediction:

House Size (sq ft) → [Neuron] → Predicted Price ($)

How It Works

Input: Feed the house size into the neuron
Processing: The neuron applies a mathematical function
Output: Produces a predicted price

This is essentially a sophisticated version of finding the “best fit line” through your data points (like least squares regression).

ReLU Function - The Popular Choice

ReLU = “Rectified Linear Unit”

What ReLU Looks Like:

Price |
      |         /
      |        /
      |       /
      |      /
      |     /
      |    /
------+---/---------> Size
      |

The Math Behind ReLU:

ReLU(x) = max(0, x)
If input is negative → output is 0
If input is positive → output equals the input

Why “Rectified”?

“Rectify” means “to make right” or “correct.” In this context, it means we’re “correcting” negative values by setting them to zero. This makes sense for housing prices - you can’t have a negative price!

Building Bigger Networks

To create more powerful neural networks, you stack these basic “neurons” together like building blocks:

Size → [Neuron] → [Neuron] → [Neuron] → Price
       Layer 1    Layer 2    Layer 3

Each layer can have multiple neurons, and each neuron typically uses an activation function like ReLU to introduce non-linearity, allowing the network to learn complex patterns.

Multiple Input Example - More Realistic

Size           ────┐
# Bedrooms     ────┼──→ [Hidden Layer] ──→ [Output] ──→ Price
Zip Code       ────┤      Neurons              y
Wealth         ────┘
     │
     x (inputs)

Key Insight

Simple networks: Learn basic relationships (like a straight line)
Complex networks: Can learn intricate patterns but risk overfitting
Sweet spot: Finding the right complexity for your specific problem and data size

True or false? As explained in this lecture, every input layer feature is interconnected with every hidden layer feature. → True Imagine you have 4 friends (the inputs) talking to 3 other friends (the hidden neurons). In a neural network, every input friend talks to every hidden friend.

How Supervised Learning Works:

You give the AI:

Input (x): The data you feed in
Output (y): The correct answer you want
The AI learns the pattern and can predict outputs for new inputs

The Examples Explained:

🏠 Real Estate (Standard NN)

Input: House features (size, bedrooms, location)
Output: Price
AI learns: “Houses with these features typically cost this much”

📱 Online Advertising (Standard NN)

Input: Ad content + user information
Output: Will user click? (0 = no, 1 = yes)
AI learns: “Users like this typically click on ads like that”

📸 Photo Tagging (CNN - Convolutional Neural Network)

Input: Image
Output: What objects are in it (1 to 1000 different things)
AI learns: “This pattern of pixels means ‘dog’, that pattern means ‘car’”

🎤 Speech Recognition (RNN - Recurrent Neural Network)

Input: Audio file
Output: Text transcript
AI learns: “These sound waves correspond to these words”

🌍 Machine Translation (RNN)

Input: English text
Output: Chinese text
AI learns: “This English phrase means this in Chinese”

🚗 Autonomous Driving (Custom/Hybrid)

Input: Camera images + radar data
Output: Where other cars are positioned
AI learns: “From sensor data, other vehicles are located here”

Key Point:

Each application uses different types of neural networks (NN, CNN, RNN) because different problems need different approaches!

Types of Neural Networks Explained

1. Standard NN (Fully Connected Neural Network)

What it is: The basic neural network where every neuron connects to every neuron in the next layer.

Structure:

Input → Hidden Layer → Output
  |         |           |
Every input connects to every hidden neuron
Every hidden neuron connects to every output

Best for:

Tabular data (spreadsheet-like data)
Simple predictions with clear features
Housing prices, loan approval, basic classification

Why it works: Good at finding relationships between different features, but treats all inputs equally.

2. CNN (Convolutional Neural Network)

What it is: Specialized for images and spatial data. Uses “filters” that slide across the image to detect patterns.

How it works:

Image → [Detect edges] → [Detect shapes] → [Detect objects] → Classification

Key Features:

Filters/Kernels: Small windows that scan the image looking for specific patterns
Layer by layer: First layer finds edges, second finds shapes, third finds objects
Shared weights: Same filter used across entire image

Best for:

Image recognition (cats vs dogs)
Photo tagging
Medical imaging
Self-driving cars (object detection)

Why it’s special: Understands that nearby pixels are related (spatial relationships).

3. RNN (Recurrent Neural Network)

What it is: Designed for sequential data - things that happen in order over time.

Key Feature - Memory:

Input₁ → [Neuron + Memory] → Output₁
                ↓
Input₂ → [Neuron + Memory] → Output₂
                ↓
Input₃ → [Neuron + Memory] → Output₃

How it works:

Processes data one piece at a time
Remembers what it saw before
Uses past information to understand current input

Best for:

Speech recognition (words depend on previous words)
Language translation (sentence structure matters)
Time series (stock prices, weather)
Text generation (ChatGPT-style)

Why it’s special: Has “memory” - can remember what happened earlier in the sequence.

Quick Comparison:

Type	Best for	Key Strength
Standard NN	Spreadsheet data	Finds feature relationships
CNN	Images	Understands spatial patterns
RNN	Sequential data	Has memory for order/time

Think of it like:

Standard NN: A calculator that’s really good at math
CNN: Eyes that recognize what they see
RNN: A brain that remembers conversations

Structured vs Unstructured Data in Supervised Learning

This image shows the two main types of data used in machine learning:

Structured Data

What it is: Data that’s organized in rows and columns like a spreadsheet or database table.

Examples shown:

Housing Data:
- Size: 2104 sq ft
- Bedrooms: 3
- Price: $400,000
Advertising Data:
- User Age: 41
- Ad ID: 93242
- Click: 1 (yes, they clicked)

Characteristics:

Organized in clear categories
Numbers and categories that fit neatly in tables
Easy to read by humans
Predictable format

Best Neural Network: Standard NN (fully connected)

Unstructured Data

What it is: Data that doesn’t fit neatly into rows and columns. It’s “messy” or complex.

Examples shown:

Audio: Sound waves (for speech recognition)
Image: Photo of a cat
Text: “Four scores and seven years ago…” (natural language)

Characteristics:

No clear structure
Rich and complex information
Hard to put in spreadsheets
Requires special processing

Best Neural Networks:

Images: CNN (Convolutional NN)
Audio/Text: RNN (Recurrent NN)

Key Difference:

Structured	Unstructured
📊 Spreadsheet-ready	🎨 Rich, complex
Numbers & categories	Images, audio, text
Standard NN works great	Needs specialized NNs
Like filling out a form	Like understanding art

Think of it like:

Structured: Filing cabinet with labeled folders
Unstructured: A messy artist’s studio full of creative works

The type of data determines which neural network architecture works best!

Why Small Training Sets Show Little Difference

Looking at the left side of the graph, you can see that when training sets are small, all the different approaches (traditional learning algorithms, small NN, medium NN, large NN) perform at roughly the same level.

Why This Happens:

1. Not Enough Data to Learn Complex Patterns

Small datasets don’t have enough examples for neural networks to learn intricate relationships
It’s like trying to learn a language from just 10 sentences - you can’t pick up much complexity

2. Traditional Algorithms Work Fine with Small Data

Traditional algorithms (like logistic regression, SVM) are designed to work well with limited data
They use simpler mathematical approaches that don’t need massive amounts of examples

3. Neural Networks Are “Data Hungry”

Neural networks, especially large ones, need lots of data to show their true power
With small datasets, they can’t utilize their ability to find complex patterns
It’s like having a Ferrari but only driving in a parking lot - you can’t use its full potential

4. All Methods Hit the Same “Ceiling”

With limited data, there’s only so much any algorithm can learn
The performance ceiling is determined by the amount of information available, not the sophistication of the method

The Key Insight:

“If you have a small dataset, don’t bother with complex neural networks. Use simpler, traditional methods instead.”

Scale drives deep learning progress

Data (more datasets)
Computation(faster computational power)
Algorithms (algorithmic innovation)

Speed Comparison: Sigmoid vs ReLU

Algorithmic Innovation → Help Computation

Aspect	Sigmoid	ReLU
Gradient	Gets very small	Stays at 1
Computation	Expensive (exponential)	Cheap (max function)
Deep Learning	Slow/stuck	Fast training
Vanishing Gradient	❌ Big problem	✅ No problem

Bottom Line:

ReLU = Faster training + Better gradients + Simpler math This is why ReLU became the standard activation function and helped make deep learning practical!

Binary Classification

Input of image → 1(cat) vs 0(non cat) RGB(64 64 64) → 64 x 64 x 3 = 12288

Machine Learning Notation Guide

Basic Notation

Training Data

(x, y) = A single training example
- x = Input features (e.g., house size, bedrooms)
- y = Output/target (e.g., price)

Dataset Sizes

m = Number of training examples
M = M_train = Training set size
M_test = Number of test examples

Multiple Training Examples

Training set: {(x⁽¹⁾, y⁽¹⁾), (x⁽²⁾, y⁽²⁾), …, (x⁽ᵐ⁾, y⁽ᵐ⁾)} Where:

(x⁽¹⁾, y⁽¹⁾) = First training example
(x⁽²⁾, y⁽²⁾) = Second training example
(x⁽ᵐ⁾, y⁽ᵐ⁾) = m-th (last) training example

Matrix Notation

Input Matrix X

X = [x⁽¹⁾ x⁽²⁾ ... x⁽ᵐ⁾]

X ∈ ℝⁿˣᵐ = X is a matrix with n features and m examples
X.shape = (nₓ, m)
- nₓ = number of input features
- m = number of training examples

Output Vector Y

Y = [y⁽¹⁾ y⁽²⁾ ... y⁽ᵐ⁾]

Y ∈ ℝ¹ˣᵐ = Y is a row vector with m examples
Y.shape = (1, m)

Key Concepts

Data Organization

Columns = Different training examples
Rows = Different features
Each x⁽ⁱ⁾ = One column representing all features of example i
Each y⁽ⁱ⁾ = Output for example i

Dimensions

nₓ = Number of input features (height of X matrix)
m = Number of training examples (width of matrices)
X is nₓ × m matrix
Y is 1 × m matrix (assuming single output)

Example

If predicting house prices with 3 features (size, bedrooms, age) and 1000 examples:

nₓ = 3 (features)
m = 1000 (examples)
X.shape = (3, 1000)
Y.shape = (1, 1000)

Logistic Regression

This image shows how Logistic Regression works - a method for binary classification (predicting yes/no, 0/1 outcomes). It’s a way to predict YES or NO questions (like “Will this email be spam?” or “Will this person buy our product?“)

The Problem

Given x, want ŷ = P(y=1|x)

Given input features x, we want to predict the probability that y = 1
The output must be between 0 ≤ ŷ ≤ 1 (since it’s a probability)

The Components

Parameters

w ∈ ℝⁿˣ = Weight vector (same dimension as input features)
b ∈ ℝ = Bias term (single number)

The Formula

ŷ = σ(wᵀx + b) Where z = wᵀx + b (linear combination)

The Sigmoid Function σ(z)

Mathematical Definition:

σ(z) = 1/(1 + e⁻ᶻ)

What the Sigmoid Does:

Takes any real number z (-∞ to +∞)
Squashes it to probability (0 to 1)

Sigmoid Curve Properties:

When z = 0: σ(0) = 0.5
When z → large positive: σ(z) → 1
When z → large negative: σ(z) → 0
When z → +∞: σ(z) ≈ 1
When z → -∞: σ(z) ≈ 0

How It Works

Linear Combination: z = wᵀx + b (can be any value)
Sigmoid Transformation: σ(z) = probability between 0 and 1
Interpretation:
- If ŷ > 0.5 → predict class 1
- If ŷ < 0.5 → predict class 0

Key Insight

Sigmoid converts any linear output into a valid probability! This is why logistic regression works for classification - it gives you meaningful probabilities instead of just raw numbers.

Logistic Regression Cost Function Explained

This shows how to measure and minimize the error in logistic regression.

The Setup

ŷ⁽ⁱ⁾ = σ(wᵀx⁽ⁱ⁾ + b) = predicted probability for example i
y⁽ⁱ⁾ = actual label (0 or 1)
Goal: Make ŷ⁽ⁱ⁾ ≈ y⁽ⁱ⁾ for all examples

Why Not Use Squared Error?

Squared Error: L(ŷ,y) = ½(ŷ-y)²

Problem: This creates a non-convex cost function (wavy with multiple local minima) when combined with sigmoid. Hard to optimize!

Gradient Derivation for Logistic Regression with Squared Error

Overview

This explains why the gradient ∇_w J contains the terms (ŷ - y) and x when using squared error loss with sigmoid activation.

The Chain Rule Breakdown

For J(w) = ½(ŷ - y)² and ŷ = σ(w^T x), we use the chain rule:

∇_w J = ∂J/∂ŷ × ∂ŷ/∂z × ∂z/∂w

where z = w^T x (the linear combination before sigmoid).

Three Key Components

1. Error Term: ∂J/∂ŷ = (ŷ - y)

From: Squared error loss J = ½(ŷ - y)²
Meaning: How wrong is our prediction?
Example: If ŷ = 0.9 and y = 1, error = -0.1

2. Sigmoid Derivative: ∂ŷ/∂z = ŷ(1 - ŷ)

From: Sigmoid function ŷ = σ(z) = 1/(1 + e^(-z))
Meaning: How “adjustable” is our current prediction?
Key insight:
- Maximum when ŷ = 0.5 (most adjustable)
- Minimum when ŷ ≈ 0 or ŷ ≈ 1 (least adjustable)

3. Input Features: ∂z/∂w = x

From: Linear layer z = w^T x
Meaning: Which input features contributed to the error?
Purpose: Determines how much to adjust each weight

Final Gradient Formula

∇_w J = (ŷ - y) × ŷ(1 - ŷ) × x
         ↑        ↑         ↑
      Error   Sigmoid    Input
              derivative features

Why Each Term Matters

Term	Origin	Role
(ŷ - y)	Loss function	Measures prediction error
ŷ(1 - ŷ)	Sigmoid derivative	Controls update sensitivity
x	Linear layer	Identifies which features to adjust

The Problem with Sigmoid Derivative

The ŷ(1 - ŷ) term creates issues:

Vanishing gradients: When ŷ ≈ 0 or ŷ ≈ 1, gradient ≈ 0
Slow learning: Updates become very small near certainty

Why Cross-Entropy is Better

Cross-entropy loss eliminates the problematic ŷ(1 - ŷ) term, giving:

∇_w J = (ŷ - y) × x

This provides:

Stable gradients regardless of prediction confidence
Faster convergence
No vanishing gradient problem

Key Takeaway

The terms (ŷ - y) and x arise naturally from the chain rule:

(ŷ - y) comes from differentiating the loss function
x comes from the linear transformation w^T x
The sigmoid derivative ŷ(1 - ŷ) is the “bottleneck” that causes training difficulties

The Logistic Loss Function

L(ŷ,y) = -[y log ŷ + (1-y) log(1-ŷ)]

How This Works for Different Cases:

Case 1: If y = 1 (actual label is 1)

L(ŷ,y) = -log ŷ
Want ŷ large (close to 1) → want log ŷ large → want -log ŷ small
When ŷ → 1: loss → 0 ✅
When ŷ → 0: loss → ∞ ❌ (heavily penalized)

Case 2: If y = 0 (actual label is 0)

L(ŷ,y) = -log(1-ŷ)
Want ŷ small (close to 0) → want log(1-ŷ) large → want -log(1-ŷ) small
When ŷ → 0: loss → 0 ✅
When ŷ → 1: loss → ∞ ❌ (heavily penalized)

The Complete Cost Function

J(w,b) = (1/m) Σᵢ₌₁ᵐ L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾) J(w,b) = -(1/m) Σᵢ₌₁ᵐ [y⁽ⁱ⁾ log ŷ⁽ⁱ⁾ + (1-y⁽ⁱ⁾) log(1-ŷ⁽ⁱ⁾)]

What This Means:

Average the loss over all m training examples
Convex function → guaranteed to find global minimum with gradient descent
Maximum likelihood estimation → principled statistical approach

Key Insight:

The logistic loss function heavily penalizes wrong confident predictions while being gentle on correct predictions. This creates a smooth, convex optimization landscape that gradient descent can solve efficiently!

Loss Function vs Cost Function

Loss Function

Definition: Measures the error for a SINGLE training example

Notation: L(ŷ, y)

ŷ = predicted value for one example
y = actual value for that same example

Examples:

Squared Loss: L(ŷ, y) = ½(ŷ - y)²
Logistic Loss: L(ŷ, y) = -[y log ŷ + (1-y) log(1-ŷ)]

Think of it as: “How wrong am I on this one prediction?”

Cost Function

Definition: Measures the total error across ALL training examples

Notation: J(w, b) or C(w, b)

Takes the average (or sum) of all individual losses
Depends on parameters w and b

Formula: J(w, b) = (1/m) Σᵢ₌₁ᵐ L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾)

Think of it as: “How wrong is my model overall?”

Simple Analogy

Loss Function = Grade on One Test

You take a math test → get 75/100
That’s your “loss” for that one test

Cost Function = Your Overall GPA

Average of ALL your test grades
Represents your overall performance

In Practice

For One Example:

Actual: y = 1
Predicted: ŷ = 0.8
Loss: L(0.8, 1) = -log(0.8) = 0.22

For Entire Dataset:

Example 1: L(ŷ⁽¹⁾, y⁽¹⁾) = 0.22
Example 2: L(ŷ⁽²⁾, y⁽²⁾) = 0.15
Example 3: L(ŷ⁽³⁾, y⁽³⁾) = 0.31
...

Cost: J(w,b) = (0.22 + 0.15 + 0.31 + ...) / m

Key Points:

Loss = individual error
Cost = overall error (what we minimize during training)
Goal: Minimize the cost function to get the best model!

Gradient Descent for Logistic Regression

These images show how to optimize logistic regression using gradient descent.

The Setup:

ŷ = σ(wᵀx + b) where σ(z) = 1/(1+e⁻ᶻ)
Cost function: J(w,b) = -(1/m)Σ[y⁽ⁱ⁾log ŷ⁽ⁱ⁾ + (1-y⁽ⁱ⁾)log(1-ŷ⁽ⁱ⁾)]

The Goal:

Find w, b that minimize J(w,b)

The Visualization:

3D bowl shape: Shows how cost J(w,b) changes with different w and b values
Global optimum: The bottom of the bowl (red dot) = best parameters
Convex shape: Guarantees we can find the global minimum

Key Insight:

The right side shows the difference between:

Convex function (smooth bowl) ✅ - Easy to optimize
Non-convex function (wavy with multiple dips) ❌ - Hard to optimize

Image 2: The Algorithm

The Gradient Descent Steps:

Repeat until convergence:

Calculate gradients (slopes):
- ∂J(w,b)/∂w = slope with respect to w
- ∂J(w,b)/∂b = slope with respect to b
Update parameters:
- w := w - α(∂J(w,b)/∂w)
- b := b - α(∂J(w,b)/∂b)

Key Components:

α (alpha) = Learning rate

Controls how big steps we take
Too big → might overshoot minimum
Too small → very slow learning

∂J(w,b)/∂w = Partial derivative

Tells us which direction to move w
If positive → decrease w
If negative → increase w

How It Works (Simple Explanation):

Think of it like rolling a ball down a hill:

Look around: Calculate which direction is steepest downhill (gradients)
Take a step: Move in that direction by amount α
Repeat: Keep doing this until you reach the bottom

The Process:

Start somewhere → Calculate slope → Take step downhill → Repeat

Why It Works:

Convex cost function means there’s only one bottom
Gradients point toward the steepest descent
Learning rate α controls step size
Eventually converges to global minimum

The Math Behind the Scenes:

The partial derivatives ∂J(w,b)/∂w and ∂J(w,b)/∂b tell us:

How much the cost changes when we change w or b slightly
Which direction to move to reduce the cost most quickly

Bottom line: Gradient descent automatically finds the best w and b values by following the mathematical “downhill” direction!

Complete Guide: Computation Graphs & Neural Network Learning

What is a Computation Graph?

A computation graph is a visual representation of mathematical operations that shows:

Nodes: Variables or operations
Edges: Flow of data
Direction: Order of computation

Think of it as:

📝 Recipe with steps - Each box is an instruction
🏭 Factory assembly line - Data flows through processing stations
🧩 Puzzle pieces - Complex calculations broken into simple parts

Simple Example Walkthrough

The Function: J(a,b,c) = 3(a + bc)

Given values: a = 5, b = 3, c = 2

Visual Representation:

a = 5 ──┐
        ├──→ [v = a + u] ──→ [J = 3v] ──→ Final Result
b = 3 ──┤
        └──→ [u = bc]
c = 2 ───────┘

Step-by-Step Calculation:

u = bc = 3 × 2 = 6
v = a + u = 5 + 6 = 11
J = 3v = 3 × 11 = 33

Forward Propagation

Definition

Forward propagation is the process of computing the output by flowing data from inputs to outputs through the graph.

Process:

Start with input values
Compute each operation in sequence
Flow results to the next operation
Arrive at final output

Example Trace:

Step 1: u = b × c
        u = 3 × 2 = 6

Step 2: v = a + u  
        v = 5 + 6 = 11

Step 3: J = 3 × v
        J = 3 × 11 = 33

Real-World Analogy: Cooking Recipe

Prep ingredients (inputs)
Follow steps in order (operations)
Get final dish (output)

Backward Propagation

Definition

Backward propagation computes gradients by flowing error signals backwards through the graph to determine how much each input affects the final output.

Purpose: Learning and Optimization

Find out which inputs have the biggest impact
Adjust parameters to improve results
Enable machine learning

The Math: Chain Rule

Starting from J = 33, work backwards: ∂J/∂v = 3 (since J = 3v) ∂J/∂a = ∂J/∂v × ∂v/∂a = 3 × 1 = 3 ∂J/∂u = ∂J/∂v × ∂v/∂u = 3 × 1 = 3 ∂J/∂b = ∂J/∂u × ∂u/∂b = 3 × c = 3 × 2 = 6 ∂J/∂c = ∂J/∂u × ∂u/∂c = 3 × b = 3 × 3 = 9

Understanding the Gradients:

∂J/∂a = 3: If I increase ‘a’ by 0.1, J increases by 0.3
∂J/∂b = 6: If I increase ‘b’ by 0.1, J increases by 0.6
∂J/∂c = 9: If I increase ‘c’ by 0.1, J increases by 0.9

Key Insight: Larger gradient = More influence on the final result

‘c’ has the biggest impact (gradient = 9)
‘b’ has medium impact (gradient = 6)
‘a’ has smallest impact (gradient = 3)

Real-World Analogy: Recipe Improvement

Taste the final dish (evaluate output)
Identify what went wrong (compute error)
Trace back to ingredients (backward pass)
Adjust recipe for next time (parameter updates)

The Learning Process (Parameter Updates)

The Update Rule:

new_parameter = old_parameter - learning_rate × gradient

Why the Minus Sign?

Gradient points uphill (direction of steepest increase)
We want to go downhill (minimize cost/error)
Minus sign flips the direction

Detailed Learning Example

Goal: Minimize J (treat J as our cost/error function) Current State:

a = 5, b = 3, c = 2
J = 33 (our current error - too high!)

Step 1: Calculate Gradients (already done above)

∂J/∂a = 3, ∂J/∂b = 6, ∂J/∂c = 9

Step 2: Choose Learning Rate

α = 0.1 (learning rate)

Step 3: Update Parameters

a_new = a - α × ∂J/∂a = 5 - 0.1 × 3 = 4.7
b_new = b - α × ∂J/∂b = 3 - 0.1 × 6 = 2.4
c_new = c - α × ∂J/∂c = 2 - 0.1 × 9 = 1.1

Step 4: Check New Cost

J_new = 3(4.7 + 2.4 × 1.1) = 3(7.34) = 22.02
Result: J decreased from 33 to 22.02! ✅

The Chain Rule in Detail

For parameter ‘c’:

c affects u: u = bc, so ∂u/∂c = b = 3
u affects v: v = a + u, so ∂v/∂u = 1
v affects J: J = 3v, so ∂J/∂v = 3

Chain rule: ∂J/∂c = ∂J/∂v × ∂v/∂u × ∂u/∂c = 3 × 1 × 3 = 9

This tells us: “A small change in ‘c’ gets multiplied by 3 (through the bc operation), then by 1 (through addition), then by 3 again (through the final multiplication)“

Neural Network Application

Simple Neural Network Structure:

Input → [Weight × Input + Bias] → [Activation] → Output → [Loss] → Error
  x              z = wx + b         a = σ(z)      ŷ      L(ŷ,y)

Forward Propagation in Neural Networks:

# Pseudocode
z = w * x + b          # Linear combination
a = sigmoid(z)         # Activation function  
y_hat = a             # Prediction
loss = compute_loss(y_hat, y_true)  # Error calculation

Backward Propagation in Neural Networks:

# Pseudocode
dL_da = gradient_of_loss_wrt_activation
da_dz = gradient_of_activation_wrt_z  
dz_dw = gradient_of_z_wrt_weight
dz_db = gradient_of_z_wrt_bias
 
# Chain rule application
dL_dw = dL_da * da_dz * dz_dw
dL_db = dL_da * da_dz * dz_db
 
# Parameter updates
w = w - learning_rate * dL_dw
b = b - learning_rate * dL_db

Real Neural Network Example: Email Spam Detection

Setup:

Email features (x) → [w×x + b] → [sigmoid] → Spam probability (ŷ) → Error

Forward Pass:

Input: x = [3, 5] (caps count, exclamation marks)
Weights: w = [0.2, 0.4], bias b = 0.1
Linear: z = 0.2×3 + 0.4×5 + 0.1 = 2.7
Sigmoid: ŷ = 1/(1+e^(-2.7)) = 0.94 (94% spam)
True label: y = 0 (actually not spam)
Error: Big mistake! We predicted 94% spam but it’s not spam

Backward Pass:

Error gradient: ∂L/∂ŷ = large positive number (big error)
Through sigmoid: ∂L/∂z = ∂L/∂ŷ × ∂ŷ/∂z = error × sigmoid_derivative
To weights: ∂L/∂w = ∂L/∂z × x
To bias: ∂L/∂b = ∂L/∂z

Parameter Updates:

w_new = w - learning_rate × ∂L/∂w
b_new = b - learning_rate × ∂L/∂b

The weights adjust to reduce the spam probability for emails with similar features.

Real-World Examples

1. Email Spam Detection

Forward Propagation:

Email Text → [Feature Extraction] → [Neural Network] → Spam Probability
"Buy now!" → [Word Count, Caps, etc.] → [Processing] → 95% Spam

Backward Propagation:

Wrong Prediction ← [Adjust Weights] ← [Calculate Gradients] ← Error Signal
"Actually Not Spam" ← [Learning] ← [Backprop] ← Large Error

2. Image Recognition

Forward Propagation:

Image Pixels → [Convolution] → [Pooling] → [Dense Layer] → "Cat" or "Dog"

Backward Propagation:

Incorrect Label ← [Update Filters] ← [Gradient Flow] ← Prediction Error

3. Stock Price Prediction

Forward Propagation:

Historical Prices → [LSTM Network] → [Dense Layer] → Tomorrow's Price

Backward Propagation:

Actual Price ← [Adjust Memory] ← [Time-based Gradients] ← Prediction Error

Why This Matters

1. Automatic Differentiation

Computers can automatically calculate gradients
No need to manually derive complex formulas
Scales to networks with millions of parameters

2. Modular Design

Break complex functions into simple operations
Easy to debug and understand
Reusable components

3. Efficient Computation

Reuse intermediate calculations
Parallel processing possibilities
Memory optimization

4. Foundation of Deep Learning

All neural networks use this principle
Enables training of very deep networks
Basis for modern AI systems

Key Learning Insights

1. Gradients are Directions

Positive gradient: Parameter makes error worse → Decrease it
Negative gradient: Parameter makes error better → Increase it
Zero gradient: Parameter doesn’t affect error → Leave unchanged

2. Learning Rate is Crucial

Too big: Skip over the minimum, unstable learning
Too small: Learn very slowly
Just right: Steady improvement

3. Chain Rule Connects Everything

Changes propagate backwards through all layers
Each operation contributes its piece to the gradient
Complex networks become manageable

Key Takeaways

The Complete Learning Loop:

1. Forward Pass: Make prediction
2. Calculate Error: Compare with truth
3. Backward Pass: Calculate gradients
4. Update Parameters: Improve weights
5. Repeat: Get better over time

Why It Works:

Large errors → Large gradients → Big weight changes
Small errors → Small gradients → Small weight changes
Automatic adjustment based on how wrong we are

Visual Summary:

Forward Pass (Making Predictions):
Input → Layer 1 → Layer 2 → ... → Output → Error

Backward Pass (Learning):
Gradient ← Layer 1 ← Layer 2 ← ... ← Output ← Error

Update (Improving):
New Weights = Old Weights - Learning_Rate × Gradients

Real-World Impact:

Search engines understanding your queries
Social media showing relevant content
Voice assistants recognizing speech
Medical AI diagnosing diseases
Self-driving cars navigating roads

Summary

Computation Graph = Smart Recipe System

Ingredients (inputs) → Cooking steps (operations) → Final dish (output)
Taste test (error) → Figure out improvements (gradients) → Better recipe (updated parameters)
Automatic improvement over time through practice!

The magic: This process automatically figures out how to adjust millions of parameters in complex neural networks to reduce errors and improve performance!

This is how neural networks learn everything from recognizing images to understanding language - through the systematic application of gradients computed via backpropagation. It’s the mathematical foundation that makes modern AI possible, turning the complex process of machine learning into an automatic, scalable system that can learn from data and improve over time.

Small simple notation

dJ/dv → dv
da ← dJ/da
db ← dJ/db

Logistic Regression: Complete Forward & Backward Pass

These images show the complete computation graph for logistic regression and how gradients flow through it.

Image 1: Forward Propagation (Logistic Regression Recap)

The Complete Forward Flow:

Inputs (x₁, x₂) → Linear Combination (z) → Sigmoid (a) → Loss (L)

Step-by-Step:

Linear Combination: z = w₁x₁ + w₂x₂ + b
Sigmoid Activation: ŷ = a = σ(z) = 1/(1+e⁻ᶻ)
Loss Calculation: L(a,y) = -(y log(a) + (1-y) log(1-a))

The Computation Graph:

Inputs: x₁, w₁, x₂, w₂, b flow into the linear combination
Output: Final loss L(a,y) that we want to minimize

Image 2: Backward Propagation (Derivatives)

The Backward Flow:

∂L/∂L ← ∂L/∂a ← ∂L/∂z ← ∂L/∂w₁, ∂L/∂w₂, ∂L/∂b

Key Derivatives:

1. Loss to Activation:

∂L/∂a = -y/a + (1-y)/(1-a)

2. Activation to Linear:

∂a/∂z = a(1-a) (sigmoid derivative)

3. Linear to Parameters:

∂z/∂w₁ = x₁
∂z/∂w₂ = x₂
∂z/∂b = 1

Chain Rule Application:

Using the chain rule to get gradients for parameter updates: ∂L/∂w₁ = ∂L/∂a × ∂a/∂z × ∂z/∂w₁ = ∂L/∂a × a(1-a) × x₁ ∂L/∂w₂ = ∂L/∂a × ∂a/∂z × ∂z/∂w₂ = ∂L/∂a × a(1-a) × x₂ ∂L/∂b = ∂L/∂a × ∂a/∂z × ∂z/∂b = ∂L/∂a × a(1-a) × 1

Parameter Updates:

w₁ := w₁ - α × ∂L/∂w₁ w₂ := w₂ - α × ∂L/∂w₂ b := b - α × ∂L/∂b

Understanding the Flow

Forward Pass (Image 1):

Start with input features and current parameters
Compute linear combination z
Apply sigmoid to get probability a
Calculate loss L comparing prediction to true label

Backward Pass (Image 2):

Start from the loss
Calculate how loss changes with respect to activation (∂L/∂a)
Propagate through sigmoid to get ∂L/∂z
Distribute to all parameters using chain rule
Update parameters to reduce loss

Key Insights

The Beauty of the Chain Rule:

Each operation contributes its “piece” to the gradient:

Loss function contributes ∂L/∂a
Sigmoid function contributes a(1-a)
Linear combination contributes x₁, x₂, or 1

Why This Works:

Large errors (∂L/∂a is big) → Big parameter updates
Small errors (∂L/∂a is small) → Small parameter updates
Input features (x₁, x₂) determine which weights get updated most

The Learning Process:

Make prediction (forward pass)
Calculate error (loss)
Find blame (backward pass with gradients)
Fix parameters (gradient descent updates)
Repeat until network learns

This is exactly how logistic regression (and all neural networks) learn - through the systematic application of the chain rule to propagate errors backward and update parameters!

Logistic Regression on m Examples: Complete Training Process

This image shows how to train logistic regression on multiple training examples and the complete gradient descent algorithm.

The Setup

Initialization:

J = 0; dw₁ = 0; dw₂ = 0; db = 0

Start with zero cost and zero gradient accumulators

Training Loop: For i = 1 to m

For each training example, perform:

Forward Pass (for example i):

z⁽ⁱ⁾ = w₁ᵀx⁽ⁱ⁾ + b
a⁽ⁱ⁾ = σ(z⁽ⁱ⁾)
J += -[y⁽ⁱ⁾ log a⁽ⁱ⁾ + (1-y⁽ⁱ⁾) log(1-a⁽ⁱ⁾)]

Backward Pass (for example i):

dz⁽ⁱ⁾ = a⁽ⁱ⁾ - y⁽ⁱ⁾ (derivative of loss w.r.t. z)
dw₁ += x₁⁽ⁱ⁾ dz⁽ⁱ⁾ (accumulate gradient for w₁)
dw₂ += x₂⁽ⁱ⁾ dz⁽ⁱ⁾ (accumulate gradient for w₂)
db += dz⁽ⁱ⁾ (accumulate gradient for b)

After Processing All Examples:

Average the Gradients:

J /= m (average cost) dw₁ /= m; dw₂ /= m; db /= m (average gradients)

Update Parameters:

w₁ := w₁ - α dw₁ w₂ := w₂ - α dw₂
b := b - α db

Key Insights

1. Accumulation Process

Forward pass: Calculate prediction and add to total cost
Backward pass: Calculate gradients and add to running totals
Average: Divide by m to get average cost and gradients

2. Why We Average

Fair representation: Each example contributes equally
Stable learning: Prevents any single example from dominating
Consistent scaling: Works regardless of dataset size

3. The Magic Formula: dz⁽ⁱ⁾ = a⁽ⁱ⁾ - y⁽ⁱ⁾

This simple formula captures the complete backward pass through sigmoid and loss:

If a⁽ⁱ⁾ = 0.9 and y⁽ⁱ⁾ = 1: dz = -0.1 (small error, small update)
If a⁽ⁱ⁾ = 0.9 and y⁽ⁱ⁾ = 0: dz = +0.9 (big error, big update)

4. Vectorization Hint

The right side shows this can be vectorized for efficiency rather than using explicit loops.

Complete Algorithm Summary

# Initialization
J = 0
dw1, dw2, db = 0, 0, 0
 
# Process all training examples
for i in range(m):
    # Forward pass
    z_i = w1 * x1_i + w2 * x2_i + b
    a_i = sigmoid(z_i)
    J += -(y_i * log(a_i) + (1-y_i) * log(1-a_i))
    
    # Backward pass
    dz_i = a_i - y_i
    dw1 += x1_i * dz_i
    dw2 += x2_i * dz_i
    db += dz_i
 
# Average over all examples
J /= m
dw1 /= m
dw2 /= m  
db /= m
 
# Update parameters
w1 = w1 - alpha * dw1
w2 = w2 - alpha * dw2
b = b - alpha * db

Why This Works

1. Batch Processing

Process all examples before updating parameters
Get a “consensus” view of what changes are needed
More stable than updating after each example

2. Error Accumulation

Large errors contribute more to gradient updates
Small errors contribute less
Automatic weighting based on how wrong we are

3. Democratic Learning

Each training example gets a “vote” on parameter changes
Final update is the average of all votes
Prevents overfitting to individual examples

This is the complete training algorithm for logistic regression - the foundation that scales up to massive neural networks with millions of parameters!

🪴 Jisang Yun

Table of Contents

Course 1_ Neural Networks and Deep Learning

What is a Neural Network?

Simple Neural Network - Housing Price Prediction

How It Works

ReLU Function - The Popular Choice

What ReLU Looks Like:

The Math Behind ReLU:

Why “Rectified”?

Building Bigger Networks

Multiple Input Example - More Realistic

Key Insight

How Supervised Learning Works:

The Examples Explained:

Key Point:

Types of Neural Networks Explained

1. Standard NN (Fully Connected Neural Network)

2. CNN (Convolutional Neural Network)

3. RNN (Recurrent Neural Network)

Quick Comparison:

Structured vs Unstructured Data in Supervised Learning

Structured Data

Unstructured Data

Key Difference:

Why Small Training Sets Show Little Difference

Why This Happens:

1. Not Enough Data to Learn Complex Patterns

2. Traditional Algorithms Work Fine with Small Data

3. Neural Networks Are “Data Hungry”

4. All Methods Hit the Same “Ceiling”

The Key Insight:

Scale drives deep learning progress

Speed Comparison: Sigmoid vs ReLU

Bottom Line:

Binary Classification

Machine Learning Notation Guide

Basic Notation

Training Data

Dataset Sizes

Multiple Training Examples

Matrix Notation

Input Matrix X

Output Vector Y

Key Concepts

Data Organization

Dimensions

Example

Logistic Regression

The Problem

The Components

Parameters

The Formula

The Sigmoid Function σ(z)

Mathematical Definition:

What the Sigmoid Does:

Sigmoid Curve Properties:

How It Works

Key Insight

Logistic Regression Cost Function Explained

The Setup

Why Not Use Squared Error?

Gradient Derivation for Logistic Regression with Squared Error

Overview

The Chain Rule Breakdown

Three Key Components

1. Error Term: ∂J/∂ŷ = (ŷ - y)

2. Sigmoid Derivative: ∂ŷ/∂z = ŷ(1 - ŷ)

3. Input Features: ∂z/∂w = x

Final Gradient Formula

Why Each Term Matters

The Problem with Sigmoid Derivative

Why Cross-Entropy is Better

Key Takeaway

The Logistic Loss Function

How This Works for Different Cases:

The Complete Cost Function

What This Means:

Key Insight:

Loss Function vs Cost Function