AI: Through an Architect's Lens
Part 0A: Enterprise AI Architecture Tutorial Series

Neural Networks & The Learning Mechanism

Building intuition from first principles—how neural networks actually learn, explained for engineers who want to understand the machinery before architecting with it

Updated February 23, 2026 · 28 min read



  • Target Audience: Senior engineers building foundational ML understanding
  • Prerequisites: Basic calculus (derivatives), linear algebra (matrix multiplication), Python
  • Reading Time: 50-60 minutes
  • Series Context: Foundation for Sequences & Transformers (Part 0B), then LLM Architecture (Part 1)


Introduction: Why Engineers Should Understand the Learning Mechanism

You can use PyTorch or TensorFlow without understanding backpropagation. Many do. But when you’re architecting AI systems at scale—debugging why a model won’t converge, explaining cost implications to stakeholders, or making fine-tuning decisions—surface-level knowledge fails you.

This tutorial builds your intuition from the ground up. We’ll derive the key ideas, implement them in NumPy, and connect each concept to practical implications. By the end, you’ll understand not just what neural networks do, but why they work and when they break.

Let’s start with the smallest unit: the artificial neuron.


1. The Neuron as a Decision Unit

1.1 From Biology to Mathematics

Biological neurons receive signals through dendrites, process them in the cell body, and fire (or don’t) through the axon. The artificial neuron is a dramatic simplification:

graph LR
    subgraph Inputs
        x1[x₁]
        x2[x₂]
        x3[x₃]
    end
    subgraph Neuron
        sum((Σ))
        act[activation]
    end
    x1 -->|w₁| sum
    x2 -->|w₂| sum
    x3 -->|w₃| sum
    b[bias b] --> sum
    sum --> act
    act --> output[output y]

The mathematical formulation:

z = w_1 x_1 + w_2 x_2 + w_3 x_3 + b = \sum_{i=1}^{n} w_i x_i + b = \mathbf{w}^T \mathbf{x} + b

y = \sigma(z)

Where:

  • x = input vector (features)
  • w = weight vector (learned parameters)
  • b = bias (learned parameter)
  • z = weighted sum (pre-activation)
  • σ = activation function
  • y = output

Let’s implement this:

import numpy as np

def neuron(x, w, b, activation_fn):
    """
    A single artificial neuron.

    Args:
        x: Input vector, shape (n_features,)
        w: Weight vector, shape (n_features,)
        b: Bias scalar
        activation_fn: Activation function
    Returns:
        Output scalar
    """
    z = np.dot(w, x) + b  # Weighted sum
    y = activation_fn(z)  # Apply activation
    return y

# Example: A neuron deciding if an email is spam
# Features: [word_count, has_urgent, num_links]
x = np.array([150, 1, 5])
w = np.array([0.01, 2.0, 0.5])  # Learned weights
b = -3.0                        # Learned bias

# Step activation (simplest case)
step = lambda z: 1 if z > 0 else 0

output = neuron(x, w, b, step)
print(f"Pre-activation z = {np.dot(w, x) + b}")  # z = 1.5 + 2.0 + 2.5 - 3.0 = 3.0
print(f"Output (spam?): {output}")               # 1 (spam)

1.2 Why Activation Functions Matter

Without activation functions, a neural network is just linear regression—no matter how many layers you stack:

\text{Layer 1: } h = W_1 x + b_1

\text{Layer 2: } y = W_2 h + b_2 = W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2)

This collapses to a single linear transformation. Non-linear activations break this, allowing networks to learn complex decision boundaries.
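
To make the collapse concrete, here is a minimal NumPy sketch (arbitrary shapes, random values) verifying that two stacked linear layers equal a single linear layer:

import numpy as np

np.random.seed(0)
W1, b1 = np.random.randn(4, 3), np.random.randn(4)
W2, b2 = np.random.randn(2, 4), np.random.randn(2)
x = np.random.randn(3)

two_layer = W2 @ (W1 @ x + b1) + b2         # "deep" linear network
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)  # equivalent single layer

print(np.allclose(two_layer, one_layer))  # True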

Common Activation Functions:

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    """Squashes to (0, 1). Good for probabilities."""
    return 1 / (1 + np.exp(-z))

def tanh(z):
    """Squashes to (-1, 1). Zero-centered."""
    return np.tanh(z)

def relu(z):
    """Rectified Linear Unit. Simple, effective, fast."""
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    """ReLU variant that doesn't 'die' for negative inputs."""
    return np.where(z > 0, z, alpha * z)

def softmax(z):
    """Converts vector to probability distribution."""
    exp_z = np.exp(z - np.max(z))  # Subtract max for numerical stability
    return exp_z / exp_z.sum()

# Visualize
z = np.linspace(-5, 5, 100)
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].plot(z, sigmoid(z)); axes[0, 0].set_title('Sigmoid: σ(z) = 1/(1+e⁻ᶻ)')
axes[0, 1].plot(z, tanh(z)); axes[0, 1].set_title('Tanh: tanh(z)')
axes[1, 0].plot(z, relu(z)); axes[1, 0].set_title('ReLU: max(0, z)')
axes[1, 1].plot(z, leaky_relu(z)); axes[1, 1].set_title('Leaky ReLU')
for ax in axes.flat:
    ax.axhline(y=0, color='k', linewidth=0.5)
    ax.axvline(x=0, color='k', linewidth=0.5)
    ax.grid(True, alpha=0.3)

Decision Framework: Choosing Activations

| Activation | Use Case | Pros | Cons |
|---|---|---|---|
| ReLU | Hidden layers (default) | Fast, sparse, no vanishing gradient for z>0 | "Dead neurons" if z always <0 |
| Leaky ReLU | Hidden layers | No dead neurons | Slightly more complex |
| Sigmoid | Binary classification output | Outputs probability | Vanishing gradient, not zero-centered |
| Tanh | Hidden layers (older architectures) | Zero-centered | Vanishing gradient at extremes |
| Softmax | Multi-class output | Probability distribution | Only for output layer |

Practical Note: For modern deep networks, start with ReLU in hidden layers. The “vanishing gradient problem” we’ll discuss later is why sigmoid/tanh fell out of favor for deep networks.


2. Building Networks from Neurons

2.1 Layers and Depth

Individual neurons are weak. Stacking them creates layers, and stacking layers creates depth—the “deep” in deep learning.

graph LR
    subgraph Input Layer
        x1((x₁))
        x2((x₂))
        x3((x₃))
    end
    subgraph Hidden Layer 1
        h1((h₁))
        h2((h₂))
        h3((h₃))
        h4((h₄))
    end
    subgraph Hidden Layer 2
        h5((h₅))
        h6((h₆))
    end
    subgraph Output Layer
        y1((y₁))
    end
    x1 --> h1 & h2 & h3 & h4
    x2 --> h1 & h2 & h3 & h4
    x3 --> h1 & h2 & h3 & h4
    h1 --> h5 & h6
    h2 --> h5 & h6
    h3 --> h5 & h6
    h4 --> h5 & h6
    h5 --> y1
    h6 --> y1

Terminology:

  • Input layer: Raw features (not counted in “depth”)
  • Hidden layers: Intermediate representations (the “learning” happens here)
  • Output layer: Final prediction
  • Width: Number of neurons per layer
  • Depth: Number of hidden layers + output layer

A network with architecture [3, 4, 2, 1] means:

  • 3 input features
  • 4 neurons in hidden layer 1
  • 2 neurons in hidden layer 2
  • 1 output neuron
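
A handy corollary of this notation: parameter counts fall straight out of it. A quick sketch for the [3, 4, 2, 1] network above:

def count_params(layer_dims):
    """Each layer l has a (dims[l] x dims[l-1]) weight matrix plus dims[l] biases."""
    return sum(
        layer_dims[l] * layer_dims[l - 1] + layer_dims[l]
        for l in range(1, len(layer_dims))
    )

print(count_params([3, 4, 2, 1]))  # 16 + 10 + 3 = 29 parameters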

2.2 The Forward Pass: Matrix Operations

Instead of computing each neuron separately, we use matrix operations:

\mathbf{Z}^{[l]} = \mathbf{W}^{[l]} \mathbf{A}^{[l-1]} + \mathbf{b}^{[l]}

\mathbf{A}^{[l]} = \sigma(\mathbf{Z}^{[l]})

Where:

  • \mathbf{A}^{[0]} = \mathbf{X} (input)
  • \mathbf{W}^{[l]} has shape (neurons in layer l, neurons in layer l-1)
  • \mathbf{b}^{[l]} has shape (neurons in layer l, 1)
import numpy as np

class DenseLayer:
    """A fully-connected (dense) layer."""

    def __init__(self, input_dim, output_dim, activation='relu'):
        # He initialization (important for training stability with ReLU)
        scale = np.sqrt(2.0 / input_dim)
        self.W = np.random.randn(output_dim, input_dim) * scale
        self.b = np.zeros((output_dim, 1))
        self.activation = activation
        self.cache = {}  # Store values for backprop

    def forward(self, A_prev):
        """
        Forward pass through the layer.

        Args:
            A_prev: Activations from previous layer, shape (input_dim, batch_size)
        Returns:
            A: Activations from this layer, shape (output_dim, batch_size)
        """
        Z = np.dot(self.W, A_prev) + self.b
        if self.activation == 'relu':
            A = np.maximum(0, Z)
        elif self.activation == 'sigmoid':
            A = 1 / (1 + np.exp(-Z))
        elif self.activation == 'tanh':
            A = np.tanh(Z)
        else:  # linear
            A = Z
        # Cache for backpropagation
        self.cache = {'A_prev': A_prev, 'Z': Z, 'A': A}
        return A

class NeuralNetwork:
    """A simple feedforward neural network."""

    def __init__(self, layer_dims, activations):
        """
        Args:
            layer_dims: List of layer dimensions, e.g., [784, 128, 64, 10]
            activations: List of activations for each layer (except input)
        """
        self.layers = []
        for i in range(1, len(layer_dims)):
            self.layers.append(
                DenseLayer(layer_dims[i-1], layer_dims[i], activations[i-1])
            )

    def forward(self, X):
        """Forward pass through entire network."""
        A = X
        for layer in self.layers:
            A = layer.forward(A)
        return A

# Example: Create a network for MNIST digit classification
# Input: 784 pixels, Output: 10 digit probabilities
network = NeuralNetwork(
    layer_dims=[784, 256, 128, 10],
    activations=['relu', 'relu', 'sigmoid']  # Last should be softmax for multiclass
)

# Forward pass with dummy data
X = np.random.randn(784, 32)  # 32 images, 784 pixels each
output = network.forward(X)
print(f"Output shape: {output.shape}")  # (10, 32) - 10 scores for 32 images

2.3 What Different Layers Learn

This is the magic of deep learning: hierarchical feature learning.

graph TB
    subgraph "Layer 1: Edges"
        e1["/"]
        e2["|"]
        e3["—"]
        e4["\"]
    end
    subgraph "Layer 2: Shapes"
        s1["○"]
        s2["□"]
        s3["△"]
    end
    subgraph "Layer 3: Parts"
        p1["👁"]
        p2["👃"]
        p3["👄"]
    end
    subgraph "Layer 4: Objects"
        o1["🐱"]
        o2["🐶"]
    end
    e1 & e2 & e3 & e4 --> s1 & s2 & s3
    s1 & s2 & s3 --> p1 & p2 & p3
    p1 & p2 & p3 --> o1 & o2

In image recognition:

  • Early layers: Edges, textures, colors
  • Middle layers: Shapes, patterns, object parts
  • Later layers: High-level concepts, objects

In language models (preview of Part 0B):

  • Early layers: Character patterns, word pieces
  • Middle layers: Syntax, phrase structure
  • Later layers: Semantics, context, meaning

Architectural Implication: This is why transfer learning works. Early layers learn generalizable features (edges are edges everywhere), while later layers specialize. You can reuse early layers and fine-tune later ones.
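
As a rough sketch of the idea (hypothetical usage of the DenseLayer and NeuralNetwork classes from Section 2.2, not a full fine-tuning recipe):

# Keep the trained early layers, swap in a fresh head for the new task.
pretrained = NeuralNetwork(
    layer_dims=[784, 256, 128, 10],
    activations=['relu', 'relu', 'sigmoid']
)
# ... imagine `pretrained` has been trained on task A ...

new_head = DenseLayer(128, 5, activation='sigmoid')  # new task: 5 classes
finetuned_layers = pretrained.layers[:-1] + [new_head]
# During fine-tuning you would update only new_head (or use a much
# smaller learning rate for the reused early layers).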


3. Loss Functions: Defining Success

The network has made predictions. How do we measure “how wrong” they are? This is the loss function (also called cost function or objective function).

3.1 Regression Losses

For continuous outputs (predicting a number):

Mean Squared Error (MSE):

\mathcal{L}_{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

def mse_loss(y_true, y_pred):
    """Mean Squared Error - penalizes large errors heavily."""
    return np.mean((y_true - y_pred) ** 2)

# Example
y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.8, 5.2, 2.0])
print(f"MSE: {mse_loss(y_true, y_pred):.4f}")  # 0.1100

Mean Absolute Error (MAE):

\mathcal{L}_{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

def mae_loss(y_true, y_pred):
    """Mean Absolute Error - more robust to outliers."""
    return np.mean(np.abs(y_true - y_pred))

When to use which:

  • MSE: When large errors are especially bad (penalizes quadratically)
  • MAE: When outliers exist and shouldn’t dominate training
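
A quick numeric illustration of the difference, reusing mse_loss and mae_loss from above: a single outlier dominates MSE but only nudges MAE.

y_true = np.array([3.0, 5.0, 2.5, 4.0])
y_clean = np.array([2.8, 5.2, 2.0, 4.1])
y_outlier = y_clean.copy()
y_outlier[3] = 14.0  # one wild prediction

print(f"MSE clean: {mse_loss(y_true, y_clean):.3f}, with outlier: {mse_loss(y_true, y_outlier):.3f}")
print(f"MAE clean: {mae_loss(y_true, y_clean):.3f}, with outlier: {mae_loss(y_true, y_outlier):.3f}")
# MSE jumps ~300x (0.085 -> 25.08); MAE grows ~11x (0.25 -> 2.73)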

3.2 Classification Losses

For discrete outputs (predicting a class):

Binary Cross-Entropy (for binary classification):

\mathcal{L}_{BCE} = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)]

def binary_cross_entropy(y_true, y_pred, epsilon=1e-15):
    """
    Binary Cross-Entropy loss.

    Args:
        y_true: Ground truth labels (0 or 1)
        y_pred: Predicted probabilities (0 to 1)
        epsilon: Small value to prevent log(0)
    """
    # Clip predictions to avoid log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    loss = -np.mean(
        y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    )
    return loss

# Example: Spam classification
y_true = np.array([1, 0, 1, 1, 0])            # Actual labels
y_pred = np.array([0.9, 0.1, 0.8, 0.7, 0.2])  # Predicted probabilities
print(f"BCE Loss: {binary_cross_entropy(y_true, y_pred):.4f}")  # ~0.2027

Intuition: Cross-entropy heavily penalizes confident wrong predictions. If you predict 0.99 for class 1 but the true label is 0, the loss is enormous: -\log(0.01) \approx 4.6. This drives the network to be calibrated.
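
You can verify the asymmetry in one line each with the binary_cross_entropy function above:

# Confident and right vs. confident and wrong (true label = 1)
print(binary_cross_entropy(np.array([1]), np.array([0.99])))  # ~0.01
print(binary_cross_entropy(np.array([1]), np.array([0.01])))  # ~4.61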

Categorical Cross-Entropy (for multi-class):

\mathcal{L}_{CCE} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})

def categorical_cross_entropy(y_true, y_pred, epsilon=1e-15):
    """
    Categorical Cross-Entropy for multi-class classification.

    Args:
        y_true: One-hot encoded ground truth, shape (n_samples, n_classes)
        y_pred: Predicted probabilities, shape (n_samples, n_classes)
    """
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    loss = -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
    return loss

# Example: Digit classification (0-9)
y_true = np.array([
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],  # True label: 3
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],  # True label: 7
])
y_pred = np.array([
    [0.01, 0.01, 0.05, 0.85, 0.02, 0.01, 0.02, 0.01, 0.01, 0.01],  # Confident 3
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.55, 0.05, 0.05],  # Less confident 7
])
print(f"CCE Loss: {categorical_cross_entropy(y_true, y_pred):.4f}")  # ~0.3802

3.3 Why Loss Function Choice Matters

The loss function defines what “success” means. Different losses lead to different learned behaviors:

| Loss Function | Output Activation | Use Case | Behavior |
|---|---|---|---|
| MSE | Linear | Regression | Minimizes squared errors |
| MAE | Linear | Robust regression | Less sensitive to outliers |
| Binary CE | Sigmoid | Binary classification | Calibrated probabilities |
| Categorical CE | Softmax | Multi-class | Calibrated class probabilities |
| Focal Loss | Sigmoid/Softmax | Imbalanced data | Down-weights easy examples |

Architectural Implication: When you see a model producing poorly calibrated probabilities (e.g., always predicting 0.51 vs 0.49), the loss function and output activation pairing might be wrong.


4. Backpropagation: How Networks Learn

Now for the core algorithm that makes neural networks trainable. Backpropagation is just the chain rule applied systematically.

4.1 The Goal: Find the Gradient

We want to minimize the loss by adjusting weights. For that, we need:

\frac{\partial \mathcal{L}}{\partial w}

For every weight in the network. The gradient tells us: "If I increase this weight slightly, how does the loss change?"

4.2 Chain Rule Refresher

If y = f(g(x)), then:

\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}

This chains through any number of nested functions:

\frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}

4.3 Backprop Through a Simple Network

Let’s trace through a tiny network step by step:

graph LR
    x[x] -->|w₁| z1[z₁]
    z1 -->|σ| a1[a₁]
    a1 -->|w₂| z2[z₂]
    z2 -->|σ| a2["a₂ = ŷ"]
    a2 --> L["L(y, ŷ)"]
    style L fill:#ffcccc

Forward pass equations:

z_1 = w_1 \cdot x
a_1 = \sigma(z_1)
z_2 = w_2 \cdot a_1
a_2 = \sigma(z_2) = \hat{y}
\mathcal{L} = (y - \hat{y})^2

Backward pass (computing gradients):

Starting from the loss and working backward:

\frac{\partial \mathcal{L}}{\partial \hat{y}} = -2(y - \hat{y})

\frac{\partial \mathcal{L}}{\partial z_2} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_2} = -2(y - \hat{y}) \cdot \sigma'(z_2)

\frac{\partial \mathcal{L}}{\partial w_2} = \frac{\partial \mathcal{L}}{\partial z_2} \cdot \frac{\partial z_2}{\partial w_2} = \frac{\partial \mathcal{L}}{\partial z_2} \cdot a_1

\frac{\partial \mathcal{L}}{\partial a_1} = \frac{\partial \mathcal{L}}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_1} = \frac{\partial \mathcal{L}}{\partial z_2} \cdot w_2

\frac{\partial \mathcal{L}}{\partial w_1} = \frac{\partial \mathcal{L}}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1} = \frac{\partial \mathcal{L}}{\partial a_1} \cdot \sigma'(z_1) \cdot x

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    """Derivative of sigmoid: σ'(z) = σ(z)(1 - σ(z))"""
    s = sigmoid(z)
    return s * (1 - s)

def forward_backward_demo():
    """
    Demonstrate backpropagation through a 2-layer network.
    """
    # Initialize
    np.random.seed(42)
    x = 0.5   # Input
    y = 1.0   # Target
    w1 = 0.8  # Weight 1
    w2 = 0.6  # Weight 2

    # ============ FORWARD PASS ============
    z1 = w1 * x
    a1 = sigmoid(z1)
    z2 = w2 * a1
    a2 = sigmoid(z2)  # This is ŷ
    loss = (y - a2) ** 2

    print("=== Forward Pass ===")
    print(f"z1 = w1 * x = {w1} * {x} = {z1:.4f}")
    print(f"a1 = σ(z1) = σ({z1:.4f}) = {a1:.4f}")
    print(f"z2 = w2 * a1 = {w2} * {a1:.4f} = {z2:.4f}")
    print(f"a2 = σ(z2) = σ({z2:.4f}) = {a2:.4f}")
    print(f"Loss = (y - a2)² = ({y} - {a2:.4f})² = {loss:.4f}")

    # ============ BACKWARD PASS ============
    # Start from loss, work backward

    # dL/da2 = -2(y - a2)
    dL_da2 = -2 * (y - a2)
    # dL/dz2 = dL/da2 * da2/dz2 = dL/da2 * σ'(z2)
    dL_dz2 = dL_da2 * sigmoid_derivative(z2)
    # dL/dw2 = dL/dz2 * dz2/dw2 = dL/dz2 * a1
    dL_dw2 = dL_dz2 * a1
    # dL/da1 = dL/dz2 * dz2/da1 = dL/dz2 * w2
    dL_da1 = dL_dz2 * w2
    # dL/dz1 = dL/da1 * da1/dz1 = dL/da1 * σ'(z1)
    dL_dz1 = dL_da1 * sigmoid_derivative(z1)
    # dL/dw1 = dL/dz1 * dz1/dw1 = dL/dz1 * x
    dL_dw1 = dL_dz1 * x

    print("\n=== Backward Pass ===")
    print(f"∂L/∂a2 = -2(y - a2) = {dL_da2:.4f}")
    print(f"∂L/∂z2 = ∂L/∂a2 * σ'(z2) = {dL_da2:.4f} * {sigmoid_derivative(z2):.4f} = {dL_dz2:.4f}")
    print(f"∂L/∂w2 = ∂L/∂z2 * a1 = {dL_dz2:.4f} * {a1:.4f} = {dL_dw2:.4f}")
    print(f"∂L/∂a1 = ∂L/∂z2 * w2 = {dL_dz2:.4f} * {w2} = {dL_da1:.4f}")
    print(f"∂L/∂z1 = ∂L/∂a1 * σ'(z1) = {dL_da1:.4f} * {sigmoid_derivative(z1):.4f} = {dL_dz1:.4f}")
    print(f"∂L/∂w1 = ∂L/∂z1 * x = {dL_dz1:.4f} * {x} = {dL_dw1:.4f}")

    return {'dL_dw1': dL_dw1, 'dL_dw2': dL_dw2}

gradients = forward_backward_demo()

Output:

=== Forward Pass ===
z1 = w1 * x = 0.8 * 0.5 = 0.4000
a1 = σ(z1) = σ(0.4000) = 0.5987
z2 = w2 * a1 = 0.6 * 0.5987 = 0.3592
a2 = σ(z2) = σ(0.3592) = 0.5889
Loss = (y - a2)² = (1.0 - 0.5889)² = 0.1690
=== Backward Pass ===
∂L/∂a2 = -2(y - a2) = -0.8223
∂L/∂z2 = ∂L/∂a2 * σ'(z2) = -0.8223 * 0.2421 = -0.1991
∂L/∂w2 = ∂L/∂z2 * a1 = -0.1991 * 0.5987 = -0.1192
∂L/∂a1 = ∂L/∂z2 * w2 = -0.1991 * 0.6 = -0.1195
∂L/∂z1 = ∂L/∂a1 * σ'(z1) = -0.1195 * 0.2403 = -0.0287
∂L/∂w1 = ∂L/∂z1 * x = -0.0287 * 0.5 = -0.0143
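
A good habit when implementing backprop by hand is gradient checking: estimate the same derivative with finite differences and compare. A minimal sketch reusing the functions above:

def numerical_gradient(x, y, w1, w2, eps=1e-6):
    """Central-difference estimate of dL/dw1 for the same 2-layer network."""
    def loss_fn(w1_val):
        a1 = sigmoid(w1_val * x)
        a2 = sigmoid(w2 * a1)
        return (y - a2) ** 2
    return (loss_fn(w1 + eps) - loss_fn(w1 - eps)) / (2 * eps)

num_grad = numerical_gradient(x=0.5, y=1.0, w1=0.8, w2=0.6)
print(f"Numerical ∂L/∂w1 = {num_grad:.4f}")  # ≈ -0.0143, matching backprop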

4.4 Computational Graph Perspective

Modern frameworks (PyTorch, TensorFlow) build a computational graph and automatically compute gradients. This is automatic differentiation.

graph TB
    subgraph "Forward Pass (build graph)"
        x[x=0.5] --> mul1[×]
        w1[w1=0.8] --> mul1
        mul1 --> z1[z1=0.4]
        z1 --> sig1[σ]
        sig1 --> a1[a1=0.599]
        a1 --> mul2[×]
        w2[w2=0.6] --> mul2
        mul2 --> z2[z2=0.359]
        z2 --> sig2[σ]
        sig2 --> a2["a2=0.589"]
        a2 --> loss_node["-"]
        y[y=1.0] --> loss_node
        loss_node --> sq["()²"]
        sq --> L["L=0.169"]
    end

graph BT
    subgraph "Backward Pass (traverse graph)"
        L["∂L/∂L=1"] --> sq
        sq["×2(y-a2)"] --> loss_node
        loss_node --> a2["∂L/∂a2"]
        a2 --> sig2["×σ'(z2)"]
        sig2 --> z2["∂L/∂z2"]
        z2 --> mul2_w["×a1"]
        z2 --> mul2_a["×w2"]
        mul2_w --> dw2["∂L/∂w2"]
        mul2_a --> a1["∂L/∂a1"]
        a1 --> sig1["×σ'(z1)"]
        sig1 --> z1["∂L/∂z1"]
        z1 --> mul1_w["×x"]
        mul1_w --> dw1["∂L/∂w1"]
    end
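
To connect this to a real framework: the same tiny network in PyTorch (assuming torch is installed) produces identical gradients, with loss.backward() doing the graph traversal for us.

import torch

x, y = torch.tensor(0.5), torch.tensor(1.0)
w1 = torch.tensor(0.8, requires_grad=True)
w2 = torch.tensor(0.6, requires_grad=True)

a1 = torch.sigmoid(w1 * x)
a2 = torch.sigmoid(w2 * a1)
loss = (y - a2) ** 2

loss.backward()  # automatic differentiation through the recorded graph
print(w1.grad, w2.grad)  # matches the hand-computed -0.0143 and -0.1192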

4.5 Backprop for a Full Layer (Matrix Form)

In practice, we compute gradients for entire layers at once:

class DenseLayerWithBackprop:
    """Dense layer with forward and backward pass."""

    def __init__(self, input_dim, output_dim, activation='relu'):
        self.W = np.random.randn(output_dim, input_dim) * np.sqrt(2.0 / input_dim)
        self.b = np.zeros((output_dim, 1))
        self.activation = activation
        self.cache = {}
        self.grads = {}

    def forward(self, A_prev):
        """
        Forward pass.

        Args:
            A_prev: shape (input_dim, batch_size)
        Returns:
            A: shape (output_dim, batch_size)
        """
        self.cache['A_prev'] = A_prev
        Z = np.dot(self.W, A_prev) + self.b
        self.cache['Z'] = Z
        if self.activation == 'relu':
            A = np.maximum(0, Z)
        elif self.activation == 'sigmoid':
            A = 1 / (1 + np.exp(-Z))
        else:
            A = Z
        self.cache['A'] = A
        return A

    def backward(self, dA):
        """
        Backward pass.

        Args:
            dA: Gradient of loss w.r.t. this layer's output,
                shape (output_dim, batch_size)
        Returns:
            dA_prev: Gradient of loss w.r.t. previous layer's output,
                shape (input_dim, batch_size)
        """
        A_prev = self.cache['A_prev']
        Z = self.cache['Z']
        m = A_prev.shape[1]  # batch size

        # Compute dZ based on activation
        if self.activation == 'relu':
            dZ = dA * (Z > 0).astype(float)  # ReLU derivative
        elif self.activation == 'sigmoid':
            s = 1 / (1 + np.exp(-Z))
            dZ = dA * s * (1 - s)  # Sigmoid derivative
        else:
            dZ = dA

        # Compute gradients
        self.grads['dW'] = (1/m) * np.dot(dZ, A_prev.T)
        self.grads['db'] = (1/m) * np.sum(dZ, axis=1, keepdims=True)

        # Compute gradient to pass to previous layer
        dA_prev = np.dot(self.W.T, dZ)
        return dA_prev
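
A brief usage sketch (with made-up shapes and targets): chain two layers forward, then propagate the loss gradient back through them in reverse order.

np.random.seed(0)
layer1 = DenseLayerWithBackprop(4, 8, activation='relu')
layer2 = DenseLayerWithBackprop(8, 1, activation='sigmoid')

X = np.random.randn(4, 16)            # batch of 16 samples
Y = np.random.randint(0, 2, (1, 16))  # binary targets

A1 = layer1.forward(X)
A2 = layer2.forward(A1)

dA2 = 2 * (A2 - Y)          # gradient of squared-error loss w.r.t. output
dA1 = layer2.backward(dA2)  # gradients now stored in layer2.grads
_ = layer1.backward(dA1)    # ... and in layer1.grads
print(layer1.grads['dW'].shape, layer2.grads['dW'].shape)  # (8, 4) (1, 8)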

5. Optimization: Finding Good Weights

We have gradients. Now we need to update weights to reduce the loss.

5.1 Gradient Descent Intuition

Imagine you’re blindfolded on a hilly landscape, trying to find the lowest point. You can feel the slope under your feet. The strategy: take a step downhill.

w_{new} = w_{old} - \eta \cdot \frac{\partial \mathcal{L}}{\partial w}

Where η is the learning rate—how big a step you take.

def gradient_descent_demo():
    """
    Demonstrate gradient descent on a simple function: f(x) = x²
    Minimum is at x = 0
    """
    x = 5.0  # Starting point
    learning_rate = 0.1
    history = [x]
    for i in range(20):
        gradient = 2 * x  # Derivative of x² is 2x
        x = x - learning_rate * gradient
        history.append(x)
        if i < 5 or i >= 18:
            print(f"Step {i+1}: x = {x:.6f}, gradient = {gradient:.6f}")
    return history

print("Gradient Descent on f(x) = x²:")
history = gradient_descent_demo()

Output:

Gradient Descent on f(x) = x²:
Step 1: x = 4.000000, gradient = 10.000000
Step 2: x = 3.200000, gradient = 8.000000
Step 3: x = 2.560000, gradient = 6.400000
Step 4: x = 2.048000, gradient = 5.120000
Step 5: x = 1.638400, gradient = 4.096000
Step 19: x = 0.014412, gradient = 0.036029
Step 20: x = 0.011529, gradient = 0.028823

5.2 Learning Rate: The Critical Hyperparameter

graph TD
    subgraph "Learning Rate Effects"
        lr_small["Too Small (η=0.001)"]
        lr_good["Good (η=0.1)"]
        lr_large["Too Large (η=1.0)"]
    end
    lr_small --> slow["Slow convergence<br/>May get stuck"]
    lr_good --> converge["Steady convergence<br/>Reaches minimum"]
    lr_large --> diverge["Oscillation<br/>May diverge"]
import numpy as np

def learning_rate_comparison():
    """Compare different learning rates on f(x) = x²"""
    learning_rates = [0.01, 0.1, 0.5, 1.0]
    for lr in learning_rates:
        x = 5.0
        print(f"\nLearning rate = {lr}:")
        for i in range(10):
            gradient = 2 * x
            x = x - lr * gradient
            if abs(x) > 1000:  # Diverging
                print(f"  Step {i+1}: DIVERGED (x = {x:.2f})")
                break
            elif i < 3 or i >= 8:
                print(f"  Step {i+1}: x = {x:.6f}")

learning_rate_comparison()

5.3 Beyond Vanilla Gradient Descent: Modern Optimizers

  • Problem 1: Slow convergence with flat gradients
  • Problem 2: Getting stuck in local minima
  • Problem 3: Different features need different learning rates

Solution Evolution:

graph LR
    SGD[SGD] --> Momentum[SGD + Momentum]
    Momentum --> RMSprop[RMSprop]
    Momentum --> Adam[Adam]
    RMSprop --> Adam

SGD with Momentum:

Idea: Build up velocity in consistent gradient directions.

v_t = \beta \cdot v_{t-1} + \eta \cdot \nabla \mathcal{L}

w_t = w_{t-1} - v_t

class SGDMomentum:
    """SGD with momentum."""

    def __init__(self, learning_rate=0.01, momentum=0.9):
        self.lr = learning_rate
        self.momentum = momentum
        self.velocity = {}

    def update(self, params, grads):
        """
        Update parameters.

        Args:
            params: Dict of parameters {'W1': ..., 'b1': ..., ...}
            grads: Dict of gradients {'dW1': ..., 'db1': ..., ...}
        """
        for key in params:
            grad_key = 'd' + key
            # Initialize velocity if first update
            if key not in self.velocity:
                self.velocity[key] = np.zeros_like(params[key])
            # Update velocity
            self.velocity[key] = (
                self.momentum * self.velocity[key] +
                self.lr * grads[grad_key]
            )
            # Update parameter
            params[key] -= self.velocity[key]

Adam (Adaptive Moment Estimation):

The go-to optimizer for most applications. Combines momentum with per-parameter adaptive learning rates.

m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla \mathcal{L} (first moment: mean of gradients)

v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla \mathcal{L})^2 (second moment: variance of gradients)

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} (bias correction)

w_t = w_{t-1} - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}

class Adam:
    """Adam optimizer - the practical default."""

    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = {}  # First moment
        self.v = {}  # Second moment
        self.t = 0   # Timestep

    def update(self, params, grads):
        """Update parameters using Adam."""
        self.t += 1
        for key in params:
            grad_key = 'd' + key
            # Initialize moments if first update
            if key not in self.m:
                self.m[key] = np.zeros_like(params[key])
                self.v[key] = np.zeros_like(params[key])
            # Update biased first moment
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[grad_key]
            # Update biased second moment
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * (grads[grad_key] ** 2)
            # Bias correction
            m_hat = self.m[key] / (1 - self.beta1 ** self.t)
            v_hat = self.v[key] / (1 - self.beta2 ** self.t)
            # Update parameters
            params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)

Optimizer Selection Guide:

| Optimizer | When to Use | Default Hyperparameters |
|---|---|---|
| Adam | Default choice, works well almost always | lr=0.001, β1=0.9, β2=0.999 |
| AdamW | When you need weight decay (regularization) | lr=0.001 + weight_decay |
| SGD+Momentum | Large-scale training, when you have time to tune | lr=0.01, momentum=0.9 |
| RMSprop | RNNs (historical), some specific cases | lr=0.001 |

Practical Tip: Start with Adam (lr=0.001). If training is unstable, reduce learning rate. If training is slow, try increasing or use learning rate scheduling.
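
Learning rate scheduling deserves its own discussion, but a minimal step-decay sketch looks like this (the decay factor and interval are illustrative, not tuned values):

def step_decay_lr(initial_lr, epoch, drop=0.5, epochs_per_drop=20):
    """Halve the learning rate every `epochs_per_drop` epochs (illustrative values)."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))

for epoch in [0, 19, 20, 40, 60]:
    print(f"epoch {epoch:2d}: lr = {step_decay_lr(0.001, epoch):.6f}")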


6. Practical Training Concerns

Understanding theory is necessary but not sufficient. Real training involves managing several practical challenges.

6.1 Vanishing and Exploding Gradients

Remember the chain rule? In deep networks, gradients multiply through many layers:

\frac{\partial \mathcal{L}}{\partial w^{[1]}} = \frac{\partial \mathcal{L}}{\partial a^{[L]}} \cdot \frac{\partial a^{[L]}}{\partial z^{[L]}} \cdot \frac{\partial z^{[L]}}{\partial a^{[L-1]}} \cdots \frac{\partial z^{[2]}}{\partial a^{[1]}} \cdot \frac{\partial a^{[1]}}{\partial z^{[1]}} \cdot \frac{\partial z^{[1]}}{\partial w^{[1]}}

Vanishing Gradients:

If each multiplication term is < 1 (e.g., sigmoid derivative max is 0.25), gradients shrink exponentially:

0.25^{10} \approx 9.5 \times 10^{-7}

Early layers learn extremely slowly or not at all.

Exploding Gradients:

If terms > 1, gradients grow exponentially. Weights update wildly, loss becomes NaN.

def demonstrate_vanishing_gradient():
    """Show how gradients vanish through sigmoid layers."""
    # Sigmoid derivative: max value is 0.25 (at z=0)
    sigmoid_deriv_max = 0.25
    print("Gradient magnitude through layers (sigmoid):")
    print("=" * 50)
    gradient = 1.0  # Start with gradient = 1
    for layer in range(1, 21):
        gradient *= sigmoid_deriv_max  # Multiply by derivative
        if layer <= 5 or layer >= 16:
            print(f"Layer {layer:2d}: gradient magnitude = {gradient:.2e}")
    print("\nThis is why deep networks with sigmoid don't train well!")

demonstrate_vanishing_gradient()

Output:

Gradient magnitude through layers (sigmoid):
==================================================
Layer 1: gradient magnitude = 2.50e-01
Layer 2: gradient magnitude = 6.25e-02
Layer 3: gradient magnitude = 1.56e-02
Layer 4: gradient magnitude = 3.91e-03
Layer 5: gradient magnitude = 9.77e-04
Layer 16: gradient magnitude = 2.33e-10
Layer 17: gradient magnitude = 5.82e-11
Layer 18: gradient magnitude = 1.46e-11
Layer 19: gradient magnitude = 3.64e-12
Layer 20: gradient magnitude = 9.09e-13
This is why deep networks with sigmoid don't train well!

Solutions:

| Solution | How It Helps |
|---|---|
| ReLU activation | Gradient is 1 for positive inputs (no shrinking) |
| Proper initialization | Xavier/He initialization keeps gradients stable |
| Batch normalization | Normalizes layer inputs, stabilizes gradients |
| Residual connections | Gradient highway that bypasses layers |
| Gradient clipping | Caps exploding gradients (essential for RNNs) |
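
Gradient clipping, from the last row, takes only a few lines. A sketch of clip-by-global-norm over a gradient dict in the style used throughout this tutorial:

def clip_gradients(grads, max_norm=5.0):
    """Rescale all gradients if their global L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-8)
        grads = {key: g * scale for key, g in grads.items()}
    return grads

grads = {'dW1': np.full((4, 4), 10.0), 'db1': np.full((4, 1), 10.0)}
clipped = clip_gradients(grads)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped.values())))  # ~5.0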

6.2 Weight Initialization

Random initialization isn’t just random—the scale matters enormously.

Bad: Too Large

W = np.random.randn(256, 256) * 1.0 # Too large
# Activations explode, gradients explode

Bad: Too Small

W = np.random.randn(256, 256) * 0.001 # Too small
# Activations collapse to 0, gradients vanish

Xavier/Glorot Initialization (for tanh, sigmoid):

W \sim \mathcal{N}\left(0, \frac{1}{n_{in}}\right) \text{ or } W \sim \mathcal{N}\left(0, \frac{2}{n_{in} + n_{out}}\right)

def xavier_init(fan_in, fan_out):
    """Xavier initialization for sigmoid/tanh."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.randn(fan_out, fan_in) * std

He Initialization (for ReLU):

W \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right)

def he_init(fan_in, fan_out):
    """He initialization for ReLU."""
    std = np.sqrt(2.0 / fan_in)
    return np.random.randn(fan_out, fan_in) * std

Practical Rule: Use He initialization with ReLU, Xavier with tanh/sigmoid. Most frameworks do this automatically.

6.3 Batch Normalization

One of the most impactful techniques for training stability. Normalizes each layer’s inputs to have zero mean and unit variance.

\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}

y = \gamma \hat{x} + \beta

Where μ_B and σ_B are batch statistics, and γ, β are learnable parameters.

class BatchNorm:
    """Batch Normalization layer."""

    def __init__(self, dim, epsilon=1e-5, momentum=0.9):
        self.gamma = np.ones((dim, 1))  # Scale
        self.beta = np.zeros((dim, 1))  # Shift
        self.epsilon = epsilon
        self.momentum = momentum
        # Running statistics for inference
        self.running_mean = np.zeros((dim, 1))
        self.running_var = np.ones((dim, 1))

    def forward(self, x, training=True):
        """
        Args:
            x: shape (dim, batch_size)
            training: Whether in training mode
        """
        if training:
            # Compute batch statistics
            mean = np.mean(x, axis=1, keepdims=True)
            var = np.var(x, axis=1, keepdims=True)
            # Update running statistics
            self.running_mean = (
                self.momentum * self.running_mean +
                (1 - self.momentum) * mean
            )
            self.running_var = (
                self.momentum * self.running_var +
                (1 - self.momentum) * var
            )
        else:
            # Use running statistics for inference
            mean = self.running_mean
            var = self.running_var
        # Normalize
        x_norm = (x - mean) / np.sqrt(var + self.epsilon)
        # Scale and shift
        out = self.gamma * x_norm + self.beta
        return out

Why BatchNorm Works:

  • Reduces internal covariate shift (layer inputs are more stable)
  • Acts as regularization (batch statistics add noise)
  • Allows higher learning rates
  • Makes training less sensitive to initialization

6.4 Dropout: Regularization Through Randomness

Randomly “drops” neurons during training, preventing over-reliance on any single neuron.

class Dropout:
    """Dropout layer for regularization."""

    def __init__(self, drop_prob=0.5):
        self.drop_prob = drop_prob
        self.mask = None

    def forward(self, x, training=True):
        """
        Args:
            x: Input tensor
            training: Whether in training mode
        """
        if training:
            # Create random mask
            self.mask = (np.random.rand(*x.shape) > self.drop_prob)
            # Apply mask and scale
            # Scaling by 1/(1-p) keeps expected value same
            return x * self.mask / (1 - self.drop_prob)
        else:
            # No dropout during inference
            return x
graph LR subgraph "Training (with dropout)" i1((x₁)) --> h1((h₁)) i2((x₂)) --> h1 i1 --> h2(( )) i2 --> h2 i1 --> h3((h₃)) i2 --> h3 h1 --> o((y)) h3 --> o style h2 fill:#ff6666,stroke:#ff0000 end

Dropout Rates by Layer Type:

| Layer Type | Typical Dropout Rate |
|---|---|
| Input layer | 0.0 - 0.2 |
| Hidden layers | 0.2 - 0.5 |
| Before output | 0.0 - 0.3 |

6.5 Overfitting and the Bias-Variance Trade-off

graph TD subgraph "Underfitting" u1["High training error"] u2["High test error"] u3["Model too simple"] end subgraph "Good Fit" g1["Low training error"] g2["Low test error"] g3["Balanced complexity"] end subgraph "Overfitting" o1["Very low training error"] o2["High test error"] o3["Model memorized data"] end

Signs of Overfitting:

  • Training loss keeps decreasing
  • Validation loss starts increasing
  • Large gap between training and validation metrics

Regularization Techniques:

| Technique | How It Helps | When to Use |
|---|---|---|
| Dropout | Prevents co-adaptation | Large networks, lots of data |
| L2 Regularization | Penalizes large weights | Always a reasonable default |
| Early Stopping | Stop before overfitting | When validation loss increases |
| Data Augmentation | Effectively more data | When data is limited |
| Batch Normalization | Implicit regularization | Almost always |
def l2_regularization_loss(params, lambda_reg=0.01):
    """
    L2 regularization term to add to loss.
    Penalizes large weights: L_total = L_data + λ * Σ||W||²
    """
    reg_loss = 0
    for key in params:
        if 'W' in key:  # Only regularize weights, not biases
            reg_loss += np.sum(params[key] ** 2)
    return lambda_reg * reg_loss / 2
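
Early stopping, from the table above, is just bookkeeping around the training loop. A minimal sketch (the patience value is illustrative):

def should_stop(val_losses, patience=5):
    """Stop if validation loss hasn't improved in `patience` epochs."""
    if len(val_losses) <= patience:
        return False
    best_recent = min(val_losses[-patience:])
    best_before = min(val_losses[:-patience])
    return best_recent >= best_before

# Example: loss improves, then plateaus
print(should_stop([0.9, 0.7, 0.5, 0.51, 0.52, 0.53, 0.54, 0.55]))  # True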

7. Putting It All Together: A Complete Training Loop

Let’s combine everything into a working training loop:

import numpy as np
from typing import Dict, List, Tuple

class NeuralNetworkComplete:
    """
    A complete neural network implementation with:
    - Forward and backward passes
    - Multiple activation functions
    - Adam optimizer
    - L2 regularization
    """

    def __init__(self, layer_dims: List[int], activations: List[str]):
        """
        Args:
            layer_dims: [input_dim, hidden1_dim, ..., output_dim]
            activations: ['relu', 'relu', ..., 'sigmoid'] for each layer
        """
        self.params = {}
        self.cache = {}
        self.grads = {}
        self.activations = activations
        self.L = len(layer_dims) - 1  # Number of layers

        # Initialize parameters with He initialization
        for l in range(1, self.L + 1):
            self.params[f'W{l}'] = np.random.randn(
                layer_dims[l], layer_dims[l-1]
            ) * np.sqrt(2.0 / layer_dims[l-1])
            self.params[f'b{l}'] = np.zeros((layer_dims[l], 1))

    def _activate(self, Z: np.ndarray, activation: str) -> np.ndarray:
        """Apply activation function."""
        if activation == 'relu':
            return np.maximum(0, Z)
        elif activation == 'sigmoid':
            return 1 / (1 + np.exp(-np.clip(Z, -500, 500)))
        elif activation == 'tanh':
            return np.tanh(Z)
        else:
            return Z

    def _activate_backward(self, dA: np.ndarray, Z: np.ndarray,
                           activation: str) -> np.ndarray:
        """Compute gradient through activation."""
        if activation == 'relu':
            return dA * (Z > 0).astype(float)
        elif activation == 'sigmoid':
            s = 1 / (1 + np.exp(-np.clip(Z, -500, 500)))
            return dA * s * (1 - s)
        elif activation == 'tanh':
            return dA * (1 - np.tanh(Z) ** 2)
        else:
            return dA

    def forward(self, X: np.ndarray) -> np.ndarray:
        """
        Forward pass through the network.

        Args:
            X: Input data, shape (n_features, batch_size)
        Returns:
            Output predictions, shape (n_outputs, batch_size)
        """
        self.cache['A0'] = X
        A = X
        for l in range(1, self.L + 1):
            Z = np.dot(self.params[f'W{l}'], A) + self.params[f'b{l}']
            A = self._activate(Z, self.activations[l-1])
            self.cache[f'Z{l}'] = Z
            self.cache[f'A{l}'] = A
        return A

    def backward(self, Y: np.ndarray, lambda_reg: float = 0.0) -> None:
        """
        Backward pass to compute gradients.

        Args:
            Y: True labels, shape (n_outputs, batch_size)
            lambda_reg: L2 regularization strength
        """
        m = Y.shape[1]  # batch size
        AL = self.cache[f'A{self.L}']

        # Gradient of cross-entropy loss w.r.t. final activation
        # (assuming sigmoid output with binary cross-entropy)
        dA = -(np.divide(Y, AL + 1e-15) - np.divide(1 - Y, 1 - AL + 1e-15))

        for l in reversed(range(1, self.L + 1)):
            Z = self.cache[f'Z{l}']
            A_prev = self.cache[f'A{l-1}']
            dZ = self._activate_backward(dA, Z, self.activations[l-1])
            self.grads[f'dW{l}'] = (1/m) * np.dot(dZ, A_prev.T)
            self.grads[f'db{l}'] = (1/m) * np.sum(dZ, axis=1, keepdims=True)
            # Add L2 regularization gradient
            if lambda_reg > 0:
                self.grads[f'dW{l}'] += (lambda_reg / m) * self.params[f'W{l}']
            # Gradient for previous layer
            if l > 1:
                dA = np.dot(self.params[f'W{l}'].T, dZ)

    def compute_loss(self, Y: np.ndarray, lambda_reg: float = 0.0) -> float:
        """
        Compute binary cross-entropy loss.

        Args:
            Y: True labels
            lambda_reg: L2 regularization strength
        """
        m = Y.shape[1]
        AL = self.cache[f'A{self.L}']
        # Cross-entropy loss
        cross_entropy = -(1/m) * np.sum(
            Y * np.log(AL + 1e-15) + (1 - Y) * np.log(1 - AL + 1e-15)
        )
        # L2 regularization
        l2_reg = 0
        if lambda_reg > 0:
            for l in range(1, self.L + 1):
                l2_reg += np.sum(self.params[f'W{l}'] ** 2)
            l2_reg = (lambda_reg / (2 * m)) * l2_reg
        return cross_entropy + l2_reg

class AdamOptimizer:
    """Adam optimizer with support for any parameter dict."""

    def __init__(self, params: Dict, lr: float = 0.001,
                 beta1: float = 0.9, beta2: float = 0.999,
                 epsilon: float = 1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.t = 0
        # Initialize moments
        self.m = {key: np.zeros_like(val) for key, val in params.items()}
        self.v = {key: np.zeros_like(val) for key, val in params.items()}

    def step(self, params: Dict, grads: Dict) -> None:
        """Update parameters in-place."""
        self.t += 1
        for key in params:
            grad_key = 'd' + key
            if grad_key not in grads:
                continue
            # Update moments
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[grad_key]
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * (grads[grad_key] ** 2)
            # Bias correction
            m_hat = self.m[key] / (1 - self.beta1 ** self.t)
            v_hat = self.v[key] / (1 - self.beta2 ** self.t)
            # Update
            params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)

def train(model: NeuralNetworkComplete,
          X_train: np.ndarray, Y_train: np.ndarray,
          X_val: np.ndarray, Y_val: np.ndarray,
          epochs: int = 100, batch_size: int = 32,
          learning_rate: float = 0.001, lambda_reg: float = 0.01,
          verbose: bool = True) -> Dict:
    """
    Complete training loop with validation.

    Returns:
        History dict with losses
    """
    m = X_train.shape[1]
    optimizer = AdamOptimizer(model.params, lr=learning_rate)
    history = {'train_loss': [], 'val_loss': [], 'train_acc': [], 'val_acc': []}

    for epoch in range(epochs):
        # Shuffle training data
        permutation = np.random.permutation(m)
        X_shuffled = X_train[:, permutation]
        Y_shuffled = Y_train[:, permutation]

        epoch_loss = 0
        num_batches = m // batch_size

        for i in range(num_batches):
            # Get mini-batch
            start = i * batch_size
            end = start + batch_size
            X_batch = X_shuffled[:, start:end]
            Y_batch = Y_shuffled[:, start:end]

            # Forward pass
            _ = model.forward(X_batch)
            batch_loss = model.compute_loss(Y_batch, lambda_reg)
            epoch_loss += batch_loss

            # Backward pass
            model.backward(Y_batch, lambda_reg)

            # Update parameters
            optimizer.step(model.params, model.grads)

        # Compute metrics
        avg_train_loss = epoch_loss / num_batches

        # Validation
        val_pred = model.forward(X_val)
        val_loss = model.compute_loss(Y_val, lambda_reg)

        # Accuracy
        train_pred = model.forward(X_train)
        train_acc = np.mean((train_pred > 0.5).astype(float) == Y_train)
        val_acc = np.mean((val_pred > 0.5).astype(float) == Y_val)

        history['train_loss'].append(avg_train_loss)
        history['val_loss'].append(val_loss)
        history['train_acc'].append(train_acc)
        history['val_acc'].append(val_acc)

        if verbose and (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}/{epochs} - "
                  f"Train Loss: {avg_train_loss:.4f}, Val Loss: {val_loss:.4f}, "
                  f"Train Acc: {train_acc:.4f}, Val Acc: {val_acc:.4f}")

    return history

# Example usage: Binary classification
if __name__ == "__main__":
    np.random.seed(42)

    # Generate synthetic data (two spirals)
    n_samples = 1000
    noise = 0.1

    def generate_spiral_data(n_samples, noise=0.1):
        n = n_samples // 2
        # Class 0: one spiral
        theta0 = np.linspace(0, 4*np.pi, n) + np.random.randn(n) * noise
        r0 = theta0 / (4*np.pi)
        x0 = r0 * np.cos(theta0) + np.random.randn(n) * noise
        y0 = r0 * np.sin(theta0) + np.random.randn(n) * noise
        # Class 1: opposite spiral
        theta1 = np.linspace(0, 4*np.pi, n) + np.pi + np.random.randn(n) * noise
        r1 = theta1 / (4*np.pi)
        x1 = r1 * np.cos(theta1) + np.random.randn(n) * noise
        y1 = r1 * np.sin(theta1) + np.random.randn(n) * noise
        X = np.vstack([np.hstack([x0, x1]), np.hstack([y0, y1])])
        Y = np.hstack([np.zeros(n), np.ones(n)]).reshape(1, -1)
        return X, Y

    X, Y = generate_spiral_data(n_samples)

    # Shuffle before splitting so both classes appear in train and val
    perm = np.random.permutation(n_samples)
    X, Y = X[:, perm], Y[:, perm]

    # Split data
    split = int(0.8 * n_samples)
    X_train, X_val = X[:, :split], X[:, split:]
    Y_train, Y_val = Y[:, :split], Y[:, split:]

    print(f"Training data: {X_train.shape[1]} samples")
    print(f"Validation data: {X_val.shape[1]} samples")

    # Create and train model
    model = NeuralNetworkComplete(
        layer_dims=[2, 64, 32, 1],  # 2 inputs, 2 hidden layers, 1 output
        activations=['relu', 'relu', 'sigmoid']
    )

    history = train(
        model, X_train, Y_train, X_val, Y_val,
        epochs=100, batch_size=32, learning_rate=0.01, lambda_reg=0.001
    )

    print(f"\nFinal Training Accuracy: {history['train_acc'][-1]:.4f}")
    print(f"Final Validation Accuracy: {history['val_acc'][-1]:.4f}")

Summary: Key Takeaways for Part 0B

You now understand:

  1. Neurons compute weighted sums and apply non-linear activations
  2. Networks stack layers to learn hierarchical representations
  3. Loss functions define what “success” means mathematically
  4. Backpropagation computes gradients using the chain rule
  5. Optimizers (especially Adam) update weights to minimize loss
  6. Practical concerns like vanishing gradients, initialization, and regularization

What’s Coming in Part 0B:

Now that you understand how networks learn, we’ll tackle the sequence problem:

  • Why standard networks fail on sequences
  • RNNs and their limitations
  • The attention mechanism breakthrough
  • The full Transformer architecture

This sets the stage for Part 1, where we’ll examine these architectures through the lens of enterprise cost and capability decisions.


Quick Reference: Formulas

| Concept | Formula |
|---|---|
| Neuron | y = \sigma(w^T x + b) |
| MSE Loss | \mathcal{L} = \frac{1}{n}\sum(y - \hat{y})^2 |
| Cross-Entropy | \mathcal{L} = -\frac{1}{n}\sum[y\log\hat{y} + (1-y)\log(1-\hat{y})] |
| Gradient Descent | w_{new} = w_{old} - \eta \frac{\partial \mathcal{L}}{\partial w} |
| Adam Update | w_t = w_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} |
| He Init | W \sim \mathcal{N}(0, 2/n_{in}) |
| Batch Norm | \hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} |

Next in series: Part 0B — From Sequences to Transformers


About this series: A foundational tutorial series for senior engineers transitioning to AI Architect roles. Each part builds toward production-depth understanding of modern AI systems.