Introduction To Artificial Intelligence and Machine Learning

Aniketh Chenjeri, Andrew Doyle, Swarnim Ghimire, Mr. Igor Tomcej

2026-02-24
Contents

Foreword .... 4
How Is This Book Structured? .... 5
Part I: Supervised Learning .... 8
1 Linear Regression .... 9
  1.1 What is Linear Regression? .... 9
  1.2 The Limits of Simple Estimation .... 10
  1.3 Linear Regression Formally Defined .... 11
  1.4 Understanding the Algorithm .... 12
  1.5 Measuring and Interpreting Error .... 12
  1.6 Optimization with Gradient Descent .... 14
  1.7 Understanding Gradients .... 15
  1.8 1/16/26: Problem Packet .... 19
  1.9 Problem Packet .... 20
    Theory Questions .... 20
    Practice Problems .... 20
2 Pseudoinverse & Multiple Linear Regression .... 21
  2.1 Linear Algebra Primer .... 21
    Sets .... 21
    Common Number Sets .... 21
    Relationships Between Sets .... 22
    Vectors, Vector Addition, and Scalar Multiplication .... 22
    Matrices, Notation and Dimensionality .... 23
    Sum and Difference of Matrices .... 24
    Dot Product .... 25
    Matrix Multiplication .... 26
    Transpose .... 26
    Identity Matrix .... 27
    Determinant .... 27
    Inverse Matrix .... 28
  2.2 Multiple Linear Regression Formally .... 30
  2.3 The Normal Equation and Pseudoinverses .... 30
    Decomposition of the Matrix Form .... 31
  2.4 Gradient Descent for Multiple Variables .... 32
  2.5 Feature Scaling .... 32
    Start with a contrived failure case .... 33
    Training run walkthrough: good, bad, and ugly .... 34
    Why this happens geometrically .... 35
    Generalizing beyond this toy dataset .... 36
  2.6 Interpreting Weights After Scaling .... 36
  2.7 Linear Algebra Practice Problems .... 36
    Sets and Number Sets .... 36
    Vectors and Vector Operations .... 37
    Notation and Dimensionality .... 37
    Dot Product .... 38
    Matrix Multiplication .... 38
    Transpose .... 38
    Determinant .... 39
    Identity Matrix and Inverse .... 39
    Mixed Practice .... 39
  2.8 Problem Packet .... 39
    Data Representation and Design Matrix .... 39
    Matrix Operations (Reference) .... 40
    Normal Equation (Derivation Provided) .... 40
    MSE and Gradient (Derivation Provided) .... 40
    RSS and MSE Application .... 41
    Collinearity and Remedies .... 41
Appendix: Programming Reference .... 42
3 Python and Libraries .... 42
  3.1 NumPy .... 42
    Accessing NumPy .... 42
    Arrays and Vectors .... 42
    Matrices as Two-Dimensional Arrays .... 42
    Methods and Functions .... 42
    Views and Shared Memory .... 43
    Transpose, ndim, and shape .... 43
    Elementwise Operations .... 43
    Random Numbers .... 43
    Reproducibility with the Generator API .... 44
    Mean, Variance, and Standard Deviation .... 44
    Axis Arguments and Row/Column Operations .... 44
    Graphics with Matplotlib .... 44
    Practical Notes .... 45
Appendix: Math Fundamentals .... 46
4 Calculus .... 46
  4.1 Limits .... 46
  4.2 Derivatives .... 46
    Limit Definition of Derivative .... 46
  4.3 Gradients .... 46
    Vector Valued Functions .... 46
    Gradient Definition .... 46
    Partial Derivatives and Rules .... 46
5 Linear Algebra .... 46
6 Statistics and Probability .... 46
Reference .... 47
Glossary of Definitions .... 48
Foreword
The Introduction to Artificial Intelligence and Machine Learning class has had an
interesting history. It started at Creek during the 2024-2025 school year with the
intention of giving students a gentle on-ramp to the field. The course dodged heavy
math and focused on using libraries such as TensorFlow, NumPy, Pandas, and Scikit-Learn.
Due to unforeseen circumstances, the 2025-2026 school year began without a solid
foundation for the course. This prompted a comprehensive reflection and led us to
rebuild the syllabus from scratch. While our original intention, a gentle introduction
to machine learning, remained unchanged, everyone involved agreed we needed a
more rigorous approach to the class alongside an emphasis on intuition.
This task proved difficult because the subject is inherently interdisciplinary. We needed
to make the course more intuitive and less mathematical while still covering the
field’s fundamentals.
This textbook represents a collection of dedicated lessons, lecture notes, and
exercises designed to serve as the course’s foundation. At the time of writing, there
are no mathematical prerequisites, yet mathematics permeates the fields of Artificial
Intelligence and Machine Learning. We faced a choice: omit the mathematics and
build a less rigorous course emphasizing breadth, or include it and slow our pace.
We chose the latter.
This textbook is written for readers with no calculus background. Rather than
requiring formal mathematical preparation, it assumes basic algebraic
understanding and builds intuition about which mathematical operations are
necessary. We will never ask you to compute complex formulas without teaching
them to you; instead, we focus on understanding and developing an intuition for
what the math represents. This is a practical course, not a theoretical one. It will
give you a solid foundation to build upon if you decide to pursue a career in the
field. However, this book is by no means a replacement for the mathematics you’ll
need to learn; we instead hope to give you a starting point on which you can build.
We gratefully acknowledge the following contributors; this book would not have
been possible without their efforts:
• Primary Writers:
  ‣ Aniketh Chenjeri (CCHS ‘26)
  ‣ Andrew Doyle (CCHS ‘26)
  ‣ Mr. Igor Tomcej
• Reviewers:
  ‣ Hariprasad Gridharan (CCHS ‘25, Cornell ‘29)
  ‣ Siddharth Menon (CCHS ‘26)
  ‣ Ani Gadepalli (CCHS ‘26)
How Is This Book Structured?
NOTE: This text is a living document, currently undergoing active development
as part of our commitment to pedagogical excellence. In order to ensure
rigorous academic standards, chapters are released sequentially following
comprehensive peer review. This is to say the version of the text you are
viewing right now is not the final one; expect updates to various parts of the
book as we continue to refine and improve the content.
While every effort is made to provide an accurate and authoritative resource,
please note that early editions may contain errata. We encourage students to
actively engage with the material; should you identify any discrepancies or
technical inaccuracies, please report them to your teacher or teacher’s assistant
for correction in future revisions.
We appreciate your cooperation in refining this resource for future cohorts.
Your feedback is instrumental in ensuring the highest quality of instruction and
material.
This book is written to emphasize an understanding of fundamental concepts in AI/
ML. We will begin with supervised learning and its applications. We are starting
with supervised learning because it’s one of the most common techniques used by
practitioners. It’s also the most intuitive and easy to understand. In this section
we’ll also slowly introduce mathematical concepts and give you exercises to solidify
your understanding. We will never require you to do any calculus; however, it
becomes impossible to understand many algorithms without it. As such we will
always “solve” any calculus involved and ask you to interpret and apply it. This
doesn’t mean this course will omit math entirely; we will learn a lot of applied
linear algebra, as it is the core of how machine learning works. Under supervised
learning you will also learn important concepts for assessing the accuracy of a
model and some insight into which architectures are used for which scenarios. To
truly accomplish this we need to understand many statistics concepts; as a result
this book will cover many topics in statistics and probability. Some of these
concepts will be familiar to you if you have taken an AP Statistics class or equivalent.
We will then move on to unsupervised learning, where we’ll make a brief stop with
the k-means clustering algorithm and then move to neural networks. Understanding
neural networks is the ultimate goal of this class, as they are a ubiquitous and
powerful tool. If you gain an understanding of neural networks you will be able to
understand many complex algorithms such as Large Language Models, which are
the foundation for tools like ChatGPT, Google Gemini, Anthropic’s Claude, and
more.
Towards the end of this book we’ve compiled a series of documentation-like
chapters for various libraries, frameworks, and mathematical concepts. If you find
yourself not understanding certain concepts, tools, etc., you can always refer to
these documents.
The only assumption we make in the writing of this book is some familiarity with
Python (and programming in general) and Algebra 2. Even though we will cover
theory in this class, it will be a programming class first and foremost. You will write
a lot of code but will also be asked to understand theory and math.
In the compilation of this book we’ve pulled from various resources:
1. Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
2. The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
3. Various books by Justin Skycak
4. Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong
5. The Matrix Cookbook by Kaare Brandt Petersen and Michael Syskind Pedersen
6. The Deep Learning Book by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
Part I: Supervised Learning
In supervised learning, we are provided with some input $X$ and a corresponding response $Y$. Our fundamental objective is to estimate the relationship between these variables, typically modeled as:

$$Y = f(X) + \varepsilon$$

where $f$ is an unknown fixed function of the predictors and $\varepsilon$ is a random error term, independent of $X$, with a mean of zero. Using a training dataset $\mathcal{T} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, we aim to apply a learning method to estimate the function $\hat{f}$. This estimate allows us to predict the response for previously unseen observations. It is also common to express this relationship as $\hat{y} = \hat{f}(X)$.
Definition 0.1.
A feature is an individual, measurable property or characteristic of a phenomenon being observed. In machine learning, it serves as the input variable $x$ that a model uses to identify patterns and make predictions. For example, if we want to predict the price of a house based on its size, number of bedrooms, and location, the features are size, number of bedrooms, and location.
Definition 0.2.
A label is the output variable $y$ that a model learns to predict. For example, if we want to predict the price of a house based on its size, number of bedrooms, and location, the label is the price of the house.
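To make the feature/label split concrete, here is a minimal sketch of the house example as arrays (the specific house numbers are invented for illustration):

```python
import numpy as np

# Each row is one house (one observation); each column is one feature:
# size in square feet, number of bedrooms, and a numeric location code.
X = np.array([
    [1400, 3, 0],
    [2100, 4, 1],
    [ 950, 2, 0],
])

# The label is the quantity the model learns to predict: the price.
y = np.array([240_000, 410_000, 180_000])

print(X.shape)  # (3, 3): three observations, three features
print(y.shape)  # (3,): one label per observation
```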
In the supervised learning portion of this textbook, we will learn the following:
• Linear Regression via the Pseudoinverse: We will learn the closed-form analytical solution for parameter estimation using the Moore-Penrose pseudoinverse and the Normal Equation. This includes multiple linear regression, where we will learn how to handle multiple predictors simultaneously.
• Optimization via Gradient Descent and Stochastic Gradient Descent (SGD): We will learn the notation for and how to implement Gradient and Stochastic Gradient Descent for optimization.
• The Bias-Variance Tradeoff: Understanding the decomposition of prediction error into reducible and irreducible components, and navigating the relationship between overfitting and underfitting.
• Polynomial Regression: Extending the linear framework to capture non-linear relationships by mapping predictors into higher-dimensional feature spaces.
• Shrinkage Methods (Ridge and Lasso): Applying $L_1$ and $L_2$ regularization to minimize the Residual Sum of Squares (RSS) while controlling model complexity and performing variable selection.
• Logistic Regression: Transitioning to classification by modeling the conditional probability of qualitative responses using the logit transformation.
• k-Nearest Neighbors (k-NN): Utilizing a non-parametric, memory-based approach to prediction based on local density and spatial proximity in the feature space.

Don’t worry if none of that makes sense to you; we’ll be covering it in detail in the coming chapters.
1 Linear Regression

1.1 What is Linear Regression?
To recall, supervised learning starts with known input data $X$ and known output data $Y$, and we are asked to fit an equation to this dataset.

One of the simplest approaches is to assume a linear relationship between inputs and outputs. This assumption is usually wrong and somewhat contrived, but linear models serve an important purpose: they give you a baseline. Once you have a baseline model, you can test more complex models against it and see if they actually perform better.

Traditionally, linear regression requires heavy use of linear algebra. We’re not going to get too into the weeds in this course, since it doesn’t have a math prerequisite. Instead, we’ll use Sklearn, a popular machine learning library. But before we jump into Sklearn’s code, we need to build intuition about how its algorithms actually work. Once you understand the underlying process, using the library itself becomes straightforward. This means we’ll need to do some math along the way.
Example 1.1.1.
Let’s engage in a hypothetical. Suppose you’re given two thermometers, one reading Celsius and one reading Fahrenheit, and asked to record the temperature on both scales. Let’s say our results look like this:

Celsius (x)    Fahrenheit (y)
0              31.8
5              41.9
10             49.2
15             60.1
20             67.4
25             78.9
30             87.5

Let’s plot the data:
Now we know that the equation for converting Celsius to Fahrenheit is

$$y = 1.8x + 32$$

But assume you don’t know this equation and are asked to find it purely based on that data. After sitting with the problem for a while you’ll probably realize that you can use the equation

$$m = \frac{y_2 - y_1}{x_2 - x_1}$$

to estimate the coefficient in our equation, and we’ve been given a y-intercept at $(0, 31.8)$; so from our given data, after computing, we can say our equation is

$$\hat{y} = 1.85x + 31.8$$
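The two-point slope estimate is easy to compute directly. As a quick sketch, this uses the first and last rows of the table; a different pair of points would give a slightly different slope, which hints at the problem addressed in the next section:

```python
# Two thermometer readings from Example 1.1.1.
x1, y1 = 0, 31.8    # first measurement (x = 0 also gives the intercept)
x2, y2 = 30, 87.5   # last measurement

m = (y2 - y1) / (x2 - x1)  # rise over run
b = y1                     # y-intercept, read off at x = 0

print(round(m, 2), b)  # 1.86 31.8
```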
Note: In this equation, the hat symbol (^) indicates a predicted value or an estimated parameter, not a measured or input variable. So in our model

$$\hat{y} = 1.85x + 31.8$$

$\hat{y}$ represents the predicted Fahrenheit temperature based on the input $x$ (in Celsius). The hat shows that this value comes from our model’s estimation, not directly from observed data.

You’ll often see this notation in statistics and machine learning to distinguish predicted outputs $\hat{y}$ and estimated coefficients $(\hat{w}, \hat{b})$ from true or observed values.
Congratulations! You’ve just made your first statistical model.
1.2 The Limits of Simple Estimation
But there are a few problems with this simple approach to estimating linear
equations from a given dataset. Let’s look at what might happen if you actually
went around campus measuring temperature.
The real data that was collected by the AI class in 2025 looks like this:
This data is much more realistic, and it highlights why our simplistic approach
doesn’t scale. Thermometers aren’t perfect. Maybe one was a little old, maybe you
were standing in direct sunlight for one measurement, or maybe a gust of wind hit
one of them.
That’s exactly why we use methods like linear regression and error measures.
Instead of trusting a single pair of points, regression finds the line that best fits all
our noisy data, balancing those little errors out. The goal isn’t to make every point
perfect (which you can’t do anyway), it’s to minimize the total amount of error
across the whole dataset.
1.3 Linear Regression Formally Defined
Definition 1.3.1.
To restate our problem: we have a given dataset composed of an input $X$ for which we have an output $Y$, and our job is to develop an equation that encapsulates the relationship between $X$ and $Y$ in some equation $\hat{y} = wx + b$, where $w$ and $b$ are accurate estimates of the real values (in this case 1.8 and 32 respectively).
Let's say that this is our data. It's randomly generated noisy data, and we are using it as a proxy for a relatively large amount of real data.
1.4 Understanding the Algorithm
Let's break down what sklearn does under the hood. The core method it uses to estimate $w$ and $b$ is actually quite simple. It starts by picking a random value for $w$ and $b$, checks how accurate those values are, then keeps adjusting those numbers until our model becomes accurate.

But we need to slow down and somewhat rigorously lay out what each of those statements means.

The statement "picks a random value for $w$ and $b$" is intuitive, but the big questions are how it measures the accuracy of $w$ and $b$ and how it changes those values.
1.5 Measuring and Interpreting Error
Let's start with the first question: how do we measure error?
Example 1.5.1.
Let's say we have this data and some predicted values in $\hat{y}$:

$x$ | $y$ | $\hat{y}$
1 | 3 | 2.5
2 | 5 | 5.2
3 | 4 | 4.1
4 | 7 | 6.8
5 | 6 | 6.3
To measure how accurate $\hat{y}$ is to $y$, we introduce the RSS metric.

Definition 1.5.1.
RSS stands for Residual Sum of Squares and is defined by the following equation:

$$\mathrm{RSS} = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
Don't be scared: this equation simply means that we are taking every single $y$ value in the given dataset and subtracting our model's estimated value from it. Squaring is done for the following reasons:

1. Squaring makes all residuals positive, so large underpredictions and overpredictions both contribute to the total error.
2. It penalizes larger errors more heavily: a residual of 4 counts far more (16) than a residual of 2 (which counts as 4). This makes the regression more sensitive to large deviations.
3. Squaring makes the loss function smooth and differentiable, which makes our life a lot easier later on.
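As a quick sanity check, here is the RSS for the table in Example 1.5.1 computed directly in Python:

```python
# RSS for the data in Example 1.5.1
y = [3, 5, 4, 7, 6]                 # observed values
y_hat = [2.5, 5.2, 4.1, 6.8, 6.3]   # model predictions
rss = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
print(round(rss, 2))  # 0.43
```

The largest single contribution comes from the first point, whose residual of 0.5 squares to 0.25.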
If this still doesn't make sense, we can use this graphic to gain an intuition.
But RSS gives us the total error summed over every point, and that total grows with the size of the dataset. We want one number that measures the average error. To do this we can define a function:
Definition 1.5.2.
The Mean Squared Error (MSE) is defined as:

$$L(w, b) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

or we can expand this to be:

$$L(w, b) = \frac{1}{N} \sum_{i=1}^{N} (y_i - (wx_i + b))^2$$

The $\frac{1}{N}$ here just averages the error over every point.

So great! We now have a really solid way to measure error.
If we were to implement this in pure python we would do something like this:
#
Assume X is a spreadsheet of our data and X[i] is the ith row
def
mse
(
W
,
b
,
X
,
y
)
:
n
=
len
(
X
)
total_error
=
0
for
i
in
range
(
n
)
:
prediction
=
W
*
X
[
i
]
+
b
total_error
+=
(
y
[
i
]
-
prediction
)
*
*
2
return
total_error
/
n
1.6 Optimization with Gradient Descent
So we have a way to measure error with MSE. But now we face a new problem: how do we actually find the best values for $w$ and $b$?

We can't just guess randomly forever. Instead, we need a systematic way to improve our guesses. This is where derivatives come in.
For those who have taken a calculus class you will be familiar with the concept of a
derivative.
A derivative measures how much a function changes when you change its input
slightly. Think of it like the slope of a hill. If you’re standing on a hill and you want
to know which direction is steepest, the slope tells you. A positive slope means the
hill goes up in that direction, and a negative slope means it goes down.
In our case, we want to know: if I change $w$ slightly, does my error go up or down? The derivative of the loss function with respect to $w$ answers exactly that question. It tells us the slope of the error landscape.

If the derivative is positive, increasing $w$ increases error, so we should decrease $w$. If the derivative is negative, increasing $w$ decreases error, so we should increase $w$. By moving in the opposite direction of the derivative, we're moving downhill toward lower error.

This process is called gradient descent, and we update our weights using this rule:

$$w \leftarrow w - \eta \frac{\partial L}{\partial w}$$
Here, $\eta$ is the learning rate, which controls how big each step is. Too small and learning is painfully slow. Too large and you might overshoot the best values entirely.

We do the same for $b$:

$$b \leftarrow b - \eta \frac{\partial L}{\partial b}$$

We repeat this process over and over, each time getting closer to the optimal $w$ and $b$ that minimize our error.
Since this class doesn't require you to do the math, we're going to give you the values of $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$:

$$\frac{\partial L}{\partial w} = -\frac{2}{N} \sum_{i=1}^{N} (y_i - (wx_i + b))\, x_i$$

$$\frac{\partial L}{\partial b} = -\frac{2}{N} \sum_{i=1}^{N} (y_i - (wx_i + b))$$
In Python this code looks like this:

dw = (-2.0 / N) * np.sum(error * x_scaled, dtype=np.float64)
w -= lr * dw
db = (-2.0 / N) * np.sum(error, dtype=np.float64)
b -= lr * db
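Putting the error measure and the update rule together gives a complete training loop. The sketch below uses synthetic Celsius-to-Fahrenheit data with a fixed random seed; the learning rate, epoch count, and variable names are illustrative choices, not the lab's actual settings:

```python
import numpy as np

# Synthetic noisy Celsius -> Fahrenheit data (for illustration only)
x = np.array([0.0, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0])
y = 1.8 * x + 32.0 + np.random.default_rng(0).normal(0, 0.5, size=x.shape)

w, b = 0.0, 0.0   # starting guesses for the weight and bias
lr = 0.001        # learning rate (eta)
N = len(x)

for epoch in range(20000):
    error = y - (w * x + b)              # residuals under the current w, b
    dw = (-2.0 / N) * np.sum(error * x)  # dL/dw
    db = (-2.0 / N) * np.sum(error)      # dL/db
    w -= lr * dw
    b -= lr * db

print(w, b)  # lands near the true values 1.8 and 32
```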
1.7 Understanding Gradients
If you're still struggling with the concept of gradients, let's break it down even more. Instead of a function with 2 inputs, let's start with a function with 1 input, something like $f(x) = x^2$. We've all seen this function; when graphed it looks like this:
Let's say we have some model: $\hat{y} = wx$.

Now let's say $f(x) = x^2$ represents the error for some weight (coefficient) $w$. So if $w = 2$ our model's error is $4$, and so on.

We want to find the lowest error value, so we want the absolute minimum (the lowest $y$ value) of the function.

For this function there are various ways we could find it. Since a parabola that opens upward has its lowest point at the vertex, you can write the function in vertex form to find the minimum. However, let's consider a more complex loss landscape:
If we look at the graph we can find the lowest point, but that isn't always possible. Gradient descent is a way of finding that lowest point. For functions with 1 input we have the derivative.

The derivative is the slope of the tangent line at a point. This literally means a straight line that indicates the direction of the function at that point. Visually, the derivative is the purple line below.
Figure 1: Example tangent line in purple for a loss function
This is the tangent line at $x = 0.4$. Notice that if we were to display the purple line as a linear equation in the form $y = mx + b$, then $m$, the slope of the line, would be negative. The actual value of $m$ at $x = 0.4$ is $-2.26528158057$.
That $m$ value for the purple line is the derivative. The derivative of the original function $f(x)$ is written $f'(x)$. So in this instance our gradient descent algorithm, for some weight $w$, would simply look like this:

$$w \leftarrow w - \eta\, f'(w)$$
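To see this 1-input update rule in action, here is a tiny sketch that minimizes $f(x) = x^2$ using its derivative $f'(x) = 2x$ (the starting point and learning rate are arbitrary choices):

```python
def grad_descent_1d(f_prime, x0, eta, steps):
    """Minimize a 1-input function given its derivative f_prime."""
    x = x0
    for _ in range(steps):
        x -= eta * f_prime(x)  # step downhill: x <- x - eta * f'(x)
    return x

# f(x) = x^2 has derivative f'(x) = 2x and its minimum at x = 0
x_min = grad_descent_1d(lambda x: 2 * x, x0=10.0, eta=0.1, steps=100)
print(x_min)  # very close to 0
```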
We now have a solid understanding of what derivatives in 2d are. This assumes that we are trying to model the equation $\hat{y} = wx$. However, this approach isn't extremely useful on its own: often we have multiple weights we want to estimate with our data, which makes the problem more complex. This is where the idea of gradients comes from.
Definition 1.7.1.
A gradient is a generalization of the derivative for functions with multiple inputs. If your function depends on several variables, like $f(x, y) = x^2 + y^2$, then the gradient is a vector that collects all the partial derivatives, one for each variable:

$$\nabla f = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right) = (2x, 2y)$$
Each of the items inside $\nabla f$ is a partial derivative, that is, the derivative of the function $f$ with respect to one of its variables, which in this instance are $x$ and $y$. So in our gradient descent algorithm:

$$\frac{\partial L}{\partial w} = -\frac{2}{N} \sum_{i=1}^{N} (y_i - (wx_i + b))\, x_i$$

$$\frac{\partial L}{\partial b} = -\frac{2}{N} \sum_{i=1}^{N} (y_i - (wx_i + b))$$

We are simply finding the equivalent of $f'$ for the loss function $L(w, b)$ with respect to the variables $w$ and $b$.
Below, the function $L(w, b)$ is graphed out for a model you'll train in your next lab:
Figure 2: The loss landscape for our model: $\hat{y} = 1.85x + 31.47$
The red line represents the various weights our model tried and its path to reach the
optimal weights with gradient descent: click on
this link
if you want to see a video
of the model being trained and gradient descent working in real time.
1.8 1/16/26: Problem Packet
Problem 1.
You're modeling Fahrenheit from Celsius. You try three baselines:
• Constant model: $\hat{y} = c$
• Linear model: $\hat{y} = Wx + b$
• Piecewise linear with one breakpoint at $x = 10$: $\hat{y} = W_1 x + b_1$ for $x \le 10$, and $\hat{y} = W_2 x + b_2$ for $x > 10$
Explain, without using any code, which baseline is "stronger" and why a stronger baseline can sometimes make your project harder (but better science). Your answer must include what "strong baseline" means operationally, how it impacts comparisons to later models, and at least two failure modes if your baseline is too weak.
Problem 2.
In the text, $\hat{y}$ is the predicted output, and $\hat{W}$, $\hat{b}$ are estimated parameters. Explain the difference between observed $y$, predicted $\hat{y}$, the "true" but unknown relationship (call it $y$), estimated parameters ($\hat{W}$, $\hat{b}$), and measurement noise. Then answer: if you re-collect the dataset tomorrow with the same thermometers, which of these are expected to change and why?
Problem 3.
The chapter lists reasons to square residuals; explain why these reasons are important. Then argue for one scenario where squaring residuals is a bad idea, and name a better alternative loss. (You can use the internet to find the answer to this question.)
Problem 4.
You train the same model form $\hat{y} = Wx + b$ on two datasets A and B.
• Dataset A has $N = 20$ points, RSS = 120.
• Dataset B has $N = 200$ points, RSS = 950.
1. Compute MSE for both.
2. Explain why the RSS value can mislead you across dataset sizes.
3. Give one situation where RSS is still useful or preferred (be specific).
Problem 5.
Suppose your fitted line gives small MSE, but when you plot residuals $r_i = y_i - \hat{y}_i$ versus $x_i$, you see a clear U-shape. Explain what this implies about: the linearity assumption, whether the bias term $\hat{b}$ is "wrong", what kind of model change would address it (at least two options), and why MSE alone didn't warn you.
Problem 6.
Given: $x = [-5, 0, 5, 10]$, $y = [20.0, 31.8, 40.0, 55.0]$, Model: $\hat{y} = 1.6x + 31.8$
Compute: $\hat{y}$ for each $x$; the residuals $r_i = y_i - \hat{y}_i$; RSS and MSE. Identify which point contributes most to RSS and explain why.
Problem 8.
Dataset: $x = [0, 5, 10, 15, 20]$, $y = [32.0, 41.0, 50.5, 60.0, 68.0]$
Two candidate models: A: $\hat{y} = 1.8x + 32$; B: $\hat{y} = 1.9x + 31$. Compute RSS for both and decide which is better under RSS/MSE. Then answer: which model is more plausible physically, and can plausibility disagree with MSE here?
1.9 Problem Packet

Theory Questions
Problem 1.
The text describes linear models as a “baseline.” Explain the importance
of establishing a baseline model before moving on to more complex machine
learning algorithms.
Problem 2.
In the equation $\hat{y} = 1.85x + 31.8$, explain what the "hat" notation signifies and why it is crucial for distinguishing between types of data in statistics.
Problem 3.
The lesson provides three specific reasons for squaring residuals in the
RSS formula. List them and explain why making the loss function “smooth and
differentiable” is beneficial for optimization.
Problem 4.
What is the mathematical difference between Residual Sum of Squares (RSS) and Mean Squared Error (MSE)? Why is MSE generally preferred when working with datasets of varying sizes?
Practice Problems
Problem 5.
You are given the coefficients $a = 1$, $b = 4$, and $c = 2$ for the function $f(x) = x^2 + 4x + 2$. Using the derivative $f'(x) = 2x + 4$, write a Python function to find the minimum of $f(x)$ using gradient descent. Start at $x = 10$, use $\eta = 0.1$, and run for 10 iterations.
Problem 6.
Calculate the RSS and MSE by hand for the following dataset given the model $\hat{y} = 2x + 1$:
• $x = [1, 2, 3]$
• $y = [3, 6, 7]$
Problem 7.
Given $x = [1, 2, 3]$ and $y = [2, 3, 4]$, and initial parameters $W = 0$ and $b = 0$, compute:
• The predicted values $\hat{y}$
• The residuals $(y_i - \hat{y}_i)$
• The current MSE
Problem 8.
Using the data and initial parameters from Problem 7, perform one full batch gradient descent update to find $W_{\text{new}}$ and $b_{\text{new}}$. Use $\eta = 0.1$ and the formulas:

$$\frac{\partial L}{\partial W} = -\frac{2}{N} \sum_{i=1}^{N} (y_i - (Wx_i + b))\, x_i$$

$$\frac{\partial L}{\partial b} = -\frac{2}{N} \sum_{i=1}^{N} (y_i - (Wx_i + b))$$

Note: Use the sign convention from the provided Python code where the gradient is subtracted.
Problem 9.
A thermometer model is trained to $\hat{y} = 1.85x + 31.8$. If the actual temperature is $0\,°C$ and the observed Fahrenheit reading is $31.8$, what is the residual? If the actual temperature is $30\,°C$ and the observed reading is $87.5$, what is the residual?
Problem 10.
Write a Python function get_error(y_true, y_pred) that returns the Mean Squared Error using only the standard library (no numpy). Assume both inputs are lists of equal length.
Go to https://github.com/CreekCS/ai-ml-textbook-labs/blob/main/intro-to-assignments.ipynb for the remaining problems.
2 Pseudoinverse & Multiple Linear Regression
So far we've assumed a very simplistic relationship between $X$ and $Y$. But what if we have more than one predictor? For example, what if we wanted to predict the price of a house based on its size, number of bedrooms, and location?
To answer this question we need to introduce some linear algebra.
2.1 Linear Algebra Primer
NOTE: This section is very heavy on the mathematical notation and concepts.
We recommend you consult the
Essence of Linear Algebra
video series by
3Blue1Brown for a more visual and intuitive understanding of the concepts.
Many of the concepts covered in those videos are beyond the scope of this
class, however, many videos in the series can be used to supplement your
understanding of the concepts covered in this section.
Sets
As you have learned in the Data Structures class, a set is an abstract data structure
designed for the efficient storage and retrieval of unique elements, often prioritizing
computational performance. This mirrors the mathematical definition where a set is
a distinct collection of objects, as both systems prohibit duplicate members and lack
a fundamental requirement for ordering.
Common Number Sets
In mathematics, we categorize data into specific sets based on their properties. This
hierarchy allows us to define exactly what kind of “values” a variable is allowed to
hold:
• Natural Numbers $\mathbb{N}$: The set of positive counting numbers $\{1, 2, 3, \dots\}$. These are used for indices or counting items where you cannot have zero or negatives.
• Integers $\mathbb{Z}$: Whole numbers including zero and negatives $\{\dots, -2, -1, 0, 1, 2, \dots\}$. These represent discrete quantities, like the number of bedrooms in a house.
• Rational Numbers $\mathbb{Q}$: Numbers that can be expressed as a fraction $\frac{p}{q}$ where $p, q \in \mathbb{Z}$ and $q \neq 0$. This includes terminating decimals like $0.75$.
• Real Numbers $\mathbb{R}$: The set of all possible points on a continuous number line. This includes everything in $\mathbb{Q}$ plus "irrational" numbers like $\pi$ or $\sqrt{2}$. We use these for measurements requiring high precision, like square footage or temperature.
• Complex Numbers $\mathbb{C}$: Numbers that include an imaginary unit $i$. While less common in basic data sets, they are vital for signal processing and advanced physics simulations.
Relationships Between Sets
It is helpful to visualize these sets as nested boxes. Every Natural number is also an Integer; every Integer is also a Rational number; and every Rational number is also a Real number. We use the symbol $\subset$ ("subset") to represent this relationship:

$$\mathbb{N} \subset \mathbb{Z} \subset \mathbb{Q} \subset \mathbb{R}$$
Vectors, Vector Addition, and Scalar Multiplication
Let’s say you’re trying to predict the price of a house. To describe its value, we
might look at:
•
the total square footage of the living area;
•
the number of bedrooms available;
•
the age of the property in years.
The characteristics of the house can now be written as those 3 numbers: (square footage, number of bedrooms, age).
If the attributes are $(2500, 4, 10)$, it means the house has $2500$ square feet, $4$ bedrooms, and is $10$ years old. If the house is instead a brand new construction (and keeps the same size and room count), then the attributes are $(2500, 4, 0)$.
What we have just described is a vector that shows the features of the property. This is an example of a "3-vector".

We can write vectors in a "top to bottom" format. So the above 3-vector is written as:

$$\begin{pmatrix} 2500 \\ 4 \\ 10 \end{pmatrix}$$

To distinguish ordinary numbers from vectors, we will use the word scalar to describe a single number. For example, the number $200$ is a scalar, but the vector $(200, 300, 25)$ is not.
In algebra so far, we've used symbols like $x, y, z, a, b, \dots$ to denote scalars. In many linear algebra texts, boldface symbols are used to denote vectors ($\mathbf{x}, \mathbf{y}, \mathbf{z}, \mathbf{a}, \mathbf{b}, \dots$).
Definition 2.1.1.
For a whole number $n$, an n-vector is a list of $n$ real numbers. We denote by $\mathbb{R}^n$ the collection of all possible n-vectors.
For this class we can also think of vectors as arrows in a space; the number $n$ here represents the number of dimensions in the space. We can visualize this for $n = 2$ or $n = 3$ as arrows. For example, with $n = 2$ we can visualize the vector $\mathbf{v} = \begin{pmatrix} v_1 \\ v_2 \end{pmatrix}$ as an arrow in the coordinate plane $\mathbb{R}^2$ pointing from the origin $(0, 0)$ to the point $(v_1, v_2)$, and similarly for a 3-vector by working in $\mathbb{R}^3$. Examples are shown below:
In the physical sciences, vectors are used to represent quantities that have both a magnitude and a direction (e.g. displacement, velocity, force). However, from the perspective of machine learning, vectors are used to keep track of collections of numerical data. This means we will nearly always have a very large $n$.
Example 2.1.1.
Suppose there are 100 students in the AI class. We can keep track of all their grades on the first test by using a 100-vector

$$\mathbf{E} = \begin{pmatrix} E_1 \\ E_2 \\ \vdots \\ E_{100} \end{pmatrix}$$

Here $E_1$ is the first exam grade of the first student, $E_2$ the first exam grade of the second student, and so on.
Now that we have an understanding of vectors, we can learn how to conduct
algebraic operations on vectors. Primarily we will learn vector addition and scalar
multiplication.
Definition 2.1.2.
The sum $\mathbf{v} + \mathbf{w}$ of two vectors is defined only when $\mathbf{v}$ and $\mathbf{w}$ are both $n$-vectors. In that case, we define their sum by the rule

$$\begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix} + \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{pmatrix} = \begin{pmatrix} v_1 + w_1 \\ v_2 + w_2 \\ \vdots \\ v_n + w_n \end{pmatrix}.$$
Definition 2.1.3.
We can multiply some scalar $c$ against an $n$-vector $\mathbf{v}$ by the rule

$$c \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix} = \begin{pmatrix} c v_1 \\ c v_2 \\ \vdots \\ c v_n \end{pmatrix}$$
Example 2.1.2.
Practice problem: Let $\mathbf{v} = \begin{pmatrix} 2 \\ -1 \\ 3 \end{pmatrix}$ and $\mathbf{w} = \begin{pmatrix} 5 \\ 4 \\ -2 \end{pmatrix}$. Compute $\mathbf{v} + \mathbf{w}$ and $-2\mathbf{v}$.

$$\mathbf{v} + \mathbf{w} = \begin{pmatrix} 2 + 5 \\ -1 + 4 \\ 3 + (-2) \end{pmatrix} = \begin{pmatrix} 7 \\ 3 \\ 1 \end{pmatrix}, \qquad -2\mathbf{v} = \begin{pmatrix} -4 \\ 2 \\ -6 \end{pmatrix}.$$
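Both operations are one-liners in NumPy. This sketch checks the practice problem above:

```python
import numpy as np

v = np.array([2, -1, 3])
w = np.array([5, 4, -2])

print(v + w)   # component-wise sum: [7 3 1]
print(-2 * v)  # scalar multiplication: [-4 2 -6]
```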
Matrices, Notation and Dimensionality
The concept of a matrix is something that you should have briefly touched upon in
Algebra 2. As such, we’re assuming some basic knowledge of matrices here.
To recall:
Definition 2.1.4.
A matrix is a rectangular array of numbers or other mathematical objects with elements or entries arranged in rows and columns. A matrix with $p$ rows and $d$ columns is called a $p \times d$ matrix.
The symbol $\in$ ("is an element of") is the bridge between a single value and its set. However, in Linear Algebra, we use this notation to define the shape and domain of entire matrices.

When we write $X \in \mathbb{R}^{p \times d}$, we are not just saying $X$ is a real number. We are using the set $\mathbb{R}$ as a building block to describe a high-dimensional space:

1. The $\mathbb{R}$ tells us that every single entry inside the matrix is a real number.
2. The exponent $(p \times d)$ defines the "container" size: $p$ rows and $d$ columns.
If we expand the matrix $X \in \mathbb{R}^{p \times d}$ it would look something like this:

$$X = \begin{pmatrix} x_{11} & x_{12} & \dots & x_{1d} \\ x_{21} & x_{22} & \dots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{p1} & x_{p2} & \dots & x_{pd} \end{pmatrix}$$

Notice that we have a way to identify the elements of the matrix based on their position in the rows and columns. The first element is $x_{11}$: the first subscript represents the row and the second represents the column. So $x_{11}$ is the element in the first row and first column, and so on.
For example, if you have a spreadsheet of 100 houses ($p = 100$) and each house has 5 features ($d = 5$), your data matrix $X$ exists in the space $\mathbb{R}^{100 \times 5}$. This tells any reader immediately that your data consists of 500 unique real-valued measurements organized in a specific grid.
Recall, features are distinct measurements that describe a single observation.
For example, the feature “size” might be the height of a house, and “bedrooms”
the count of rooms. Features are represented as columns in a matrix.
Labels are the target values we aim to predict for an observation. For example,
the label “price” is the house’s market value. Labels are typically represented as
a single column vector, where each entry corresponds to the observation in that
row.
Example 2.1.3.
Practice problem: Suppose $A \in \mathbb{R}^{2 \times 3}$ and $B \in \mathbb{R}^{3 \times 4}$. What is the shape of $AB$?

Since the inner dimensions match ($3$), the product is defined and $AB \in \mathbb{R}^{2 \times 4}$.
Sum and Difference of Matrices
Definition 2.1.5.
The sum of 2 matrices

$$A = \begin{pmatrix} a_{11} & a_{12} & \dots & a_{1d} \\ a_{21} & a_{22} & \dots & a_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ a_{p1} & a_{p2} & \dots & a_{pd} \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} b_{11} & b_{12} & \dots & b_{1d} \\ b_{21} & b_{22} & \dots & b_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ b_{p1} & b_{p2} & \dots & b_{pd} \end{pmatrix}$$

is defined only when $A$ and $B$ are of the same size. In that case, we define their sum by the rule

$$A + B = \begin{pmatrix} a_{11} + b_{11} & a_{12} + b_{12} & \dots & a_{1d} + b_{1d} \\ a_{21} + b_{21} & a_{22} + b_{22} & \dots & a_{2d} + b_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ a_{p1} + b_{p1} & a_{p2} + b_{p2} & \dots & a_{pd} + b_{pd} \end{pmatrix}.$$
Example 2.1.4.
Let's say we have the matrices $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}$. Then

$$A + B = \begin{pmatrix} 1 + 5 & 2 + 6 \\ 3 + 7 & 4 + 8 \end{pmatrix} = \begin{pmatrix} 6 & 8 \\ 10 & 12 \end{pmatrix}.$$
.
Definition 2.1.6.
The difference of 2 matrices $A$ and $B$ is also defined only when $A$ and $B$ are of the same size (in other words $A \in \mathbb{R}^{p \times d}$ and $B \in \mathbb{R}^{p \times d}$). In that case, we define their difference using sums. First we multiply the second matrix by the scalar $-1$:

$$-B = \begin{pmatrix} -b_{11} & -b_{12} & \dots & -b_{1d} \\ -b_{21} & -b_{22} & \dots & -b_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ -b_{p1} & -b_{p2} & \dots & -b_{pd} \end{pmatrix}$$

then we add the two matrices:

$$A - B = A + (-B) = \begin{pmatrix} a_{11} - b_{11} & a_{12} - b_{12} & \dots & a_{1d} - b_{1d} \\ a_{21} - b_{21} & a_{22} - b_{22} & \dots & a_{2d} - b_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ a_{p1} - b_{p1} & a_{p2} - b_{p2} & \dots & a_{pd} - b_{pd} \end{pmatrix}$$
Dot Product
Definition 2.1.7.
The dot product of two vectors $\mathbf{a} = (a_1, a_2, \dots, a_n)$ and $\mathbf{b} = (b_1, b_2, \dots, b_n)$ is defined as:

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \dots + a_n b_n$$
Example 2.1.5.
We can visualize this: assume we have $\mathbf{a} = (1, 2, 3)$ and $\mathbf{b} = (4, 5, 6)$. The dot product looks like this:

$$\mathbf{a} \cdot \mathbf{b} = (1 \times 4) + (2 \times 5) + (3 \times 6) = 4 + 10 + 18 = 32$$
Example 2.1.6.
Practice problem: Let $\mathbf{u} = (2, -1, 4)$ and $\mathbf{v} = (3, 0, -2)$. Compute $\mathbf{u} \cdot \mathbf{v}$.

$$\mathbf{u} \cdot \mathbf{v} = (2)(3) + (-1)(0) + (4)(-2) = 6 + 0 - 8 = -2.$$
The dot product is the tool we’ll use to multiply vectors and matrices.
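Definition 2.1.7 translates directly into code. This sketch computes the practice problem both by hand and with NumPy:

```python
import numpy as np

def dot(a, b):
    # Sum of element-wise products, exactly as in Definition 2.1.7
    assert len(a) == len(b), "dot product needs equal-length vectors"
    return sum(ai * bi for ai, bi in zip(a, b))

u = [2, -1, 4]
v = [3, 0, -2]
print(dot(u, v))     # -2
print(np.dot(u, v))  # -2, the same result via NumPy
```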
Matrix Multiplication
Definition 2.1.8.
Suppose that we have $A \in \mathbb{R}^{r \times d}$ and $B \in \mathbb{R}^{d \times s}$. Then the product of $A$ and $B$ is denoted $AB$. The $(i, j)$th element of $AB$ is computed by multiplying each element of the $i$th row of $A$ by the corresponding element of the $j$th column of $B$ and summing the results. That is,

$$(AB)_{ij} = \sum_{k=1}^{d} a_{ik} b_{kj}.$$
Example 2.1.7.
As an example, consider $\mathbf{A} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $\mathbf{B} = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}$. Then

$$\mathbf{A}\mathbf{B} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix} = \begin{pmatrix} 1 \times 5 + 2 \times 7 & 1 \times 6 + 2 \times 8 \\ 3 \times 5 + 4 \times 7 & 3 \times 6 + 4 \times 8 \end{pmatrix} = \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}.$$
Example 2.1.8.
Practice problem: Let $A = \begin{pmatrix} 2 & -1 & 0 \\ 3 & 4 & 1 \end{pmatrix}$ and $B = \begin{pmatrix} 1 & 2 \\ -2 & 0 \\ 5 & -1 \end{pmatrix}$. Compute $AB$.

First check dimensions: $A \in \mathbb{R}^{2 \times 3}$ and $B \in \mathbb{R}^{3 \times 2}$, so $AB \in \mathbb{R}^{2 \times 2}$.

$$AB = \begin{pmatrix} 2 \times 1 + (-1) \times (-2) + 0 \times 5 & 2 \times 2 + (-1) \times 0 + 0 \times (-1) \\ 3 \times 1 + 4 \times (-2) + 1 \times 5 & 3 \times 2 + 4 \times 0 + 1 \times (-1) \end{pmatrix} = \begin{pmatrix} 4 & 4 \\ 0 & 5 \end{pmatrix}.$$
Note that this operation produces an $r \times s$ matrix. It is only possible to compute $\mathbf{A}\mathbf{B}$ if the number of columns of $\mathbf{A}$ is the same as the number of rows of $\mathbf{B}$.
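Definition 2.1.8 becomes a triple loop in code. This sketch checks the practice problem above; NumPy's @ operator performs the same computation far faster:

```python
import numpy as np

def matmul(A, B):
    r, d = len(A), len(A[0])
    d2, s = len(B), len(B[0])
    assert d == d2, "columns of A must match rows of B"
    # (AB)_ij = sum over k of a_ik * b_kj
    return [[sum(A[i][k] * B[k][j] for k in range(d)) for j in range(s)]
            for i in range(r)]

A = [[2, -1, 0], [3, 4, 1]]
B = [[1, 2], [-2, 0], [5, -1]]
print(matmul(A, B))               # [[4, 4], [0, 5]]
print(np.array(A) @ np.array(B))  # the same answer with NumPy
```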
Transpose
Definition 2.1.9.
To take the transpose of a matrix, we interchange each of its columns with the corresponding row. That is, row $1$ becomes column $1$, row $2$ becomes column $2$, and so on. A superscript $T$ is used to denote the transpose operation.
So if we have a matrix $X \in \mathbb{R}^{p \times d}$ we can take its transpose $X^T \in \mathbb{R}^{d \times p}$ by swapping the rows and columns. Element-wise, $(X^T)_{ij} = X_{ji}$.
Example 2.1.9.
For example, if we have a matrix $X \in \mathbb{R}^{3 \times 2}$ we can take its transpose $X^T \in \mathbb{R}^{2 \times 3}$ by swapping the rows and columns:

$$X = \begin{pmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \end{pmatrix} \qquad X^T = \begin{pmatrix} x_{11} & x_{21} & x_{31} \\ x_{12} & x_{22} & x_{32} \end{pmatrix}$$
Example 2.1.10.
Practice problem: If $C = \begin{pmatrix} 0 & 3 & -1 \\ 2 & 5 & 4 \end{pmatrix}$, compute $C^T$.

$$C^T = \begin{pmatrix} 0 & 2 \\ 3 & 5 \\ -1 & 4 \end{pmatrix}.$$
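Transposing is a simple index swap. This sketch verifies the practice problem above; NumPy exposes the same operation as the .T attribute:

```python
import numpy as np

def transpose(M):
    # (M^T)_ij = M_ji: rows become columns
    return [[M[j][i] for j in range(len(M))] for i in range(len(M[0]))]

C = [[0, 3, -1], [2, 5, 4]]
print(transpose(C))   # [[0, 2], [3, 5], [-1, 4]]
print(np.array(C).T)  # the same result via NumPy
```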
Identity Matrix
Before we can understand the inverse of a matrix, we need to understand the identity matrix. Recall that multiplying any scalar (number) by 1 gives you the same scalar back. The identity matrix plays the same role for matrices.
Definition 2.1.10.
The identity matrix $I_n$ (or just $I$ when the size is clear) is a square $n \times n$ matrix with 1s on the diagonal and 0s everywhere else:

$$I_3 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

For any matrix $A$ of compatible size:

$$AI = IA = A$$
Example 2.1.11.
Let's verify this works:

$$\begin{pmatrix} 2 & 3 \\ 4 & 5 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 2 \times 1 + 3 \times 0 & 2 \times 0 + 3 \times 1 \\ 4 \times 1 + 5 \times 0 & 4 \times 0 + 5 \times 1 \end{pmatrix} = \begin{pmatrix} 2 & 3 \\ 4 & 5 \end{pmatrix}$$

The matrix came out unchanged, just like multiplying a number by 1.
Determinant
Before we can compute the inverse of a matrix, we need to understand the
determinant. The determinant is a single number that tells us important
information about a matrix — most critically, whether the matrix has an inverse.
Definition 2.1.11.
For a $2 \times 2$ matrix $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$, the determinant is defined as:

$$\det(A) = ad - bc$$

The determinant is often written as $|A|$ or $\det(A)$.
Example 2.1.12.
For $A = \begin{pmatrix} 3 & 2 \\ 1 & 4 \end{pmatrix}$:

$$\det(A) = (3)(4) - (2)(1) = 12 - 2 = 10$$
The determinant has a geometric interpretation: it tells you how much a matrix "stretches" or "squishes" space. If $\det(A) = 2$, multiplying by $A$ doubles areas. If $\det(A) = 0$, the matrix collapses space into a lower dimension, and this is exactly when the matrix has no inverse.

Definition 2.1.12.
A matrix $A$ is invertible (has an inverse) if and only if $\det(A) \neq 0$.
Example 2.1.13.
Let's check if $B = \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}$ has an inverse:

$$\det(B) = (1)(4) - (2)(2) = 4 - 4 = 0$$

Since $\det(B) = 0$, this matrix does not have an inverse. Notice that the second row is exactly twice the first row: the rows are linearly dependent.
Inverse Matrix
Now we can define and compute the inverse. Just like division "undoes" multiplication for numbers (since $5 \times \frac{1}{5} = 1$), an inverse matrix "undoes" matrix multiplication.

Definition 2.1.13.
For a square matrix $A$, its inverse $A^{-1}$ is the matrix such that:

$$AA^{-1} = A^{-1}A = I$$

Not every matrix has an inverse. A matrix that has an inverse is called invertible or non-singular.
Computing the Inverse of a 2×2 Matrix
For a $2 \times 2$ matrix, there's a simple formula:

Definition 2.1.14.
If $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$ and $\det(A) \neq 0$, then:

$$A^{-1} = \frac{1}{\det(A)} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$$

In words: swap the diagonal elements, negate the off-diagonal elements, and divide everything by the determinant.
Example 2.1.14.
Let's compute the inverse of $A = \begin{pmatrix} 3 & 2 \\ 1 & 4 \end{pmatrix}$ step by step.

Step 1: Compute the determinant.

$$\det(A) = (3)(4) - (2)(1) = 12 - 2 = 10$$

Since $\det(A) = 10 \neq 0$, the inverse exists.

Step 2: Apply the formula.

$$A^{-1} = \frac{1}{10} \begin{pmatrix} 4 & -2 \\ -1 & 3 \end{pmatrix} = \begin{pmatrix} \frac{4}{10} & \frac{-2}{10} \\ \frac{-1}{10} & \frac{3}{10} \end{pmatrix} = \begin{pmatrix} 0.4 & -0.2 \\ -0.1 & 0.3 \end{pmatrix}$$
Step 3: Verify by computing $AA^{-1}$.

$$AA^{-1} = \begin{pmatrix} 3 & 2 \\ 1 & 4 \end{pmatrix} \begin{pmatrix} 0.4 & -0.2 \\ -0.1 & 0.3 \end{pmatrix} = \begin{pmatrix} 3(0.4) + 2(-0.1) & 3(-0.2) + 2(0.3) \\ 1(0.4) + 4(-0.1) & 1(-0.2) + 4(0.3) \end{pmatrix} = \begin{pmatrix} 1.2 - 0.2 & -0.6 + 0.6 \\ 0.4 - 0.4 & -0.2 + 1.2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = I \;\checkmark$$
Example 2.1.15.
Let's compute the inverse of $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$.

Step 1: $\det(A) = (1)(4) - (2)(3) = 4 - 6 = -2$

Step 2: Apply the formula:

$$A^{-1} = \frac{1}{-2} \begin{pmatrix} 4 & -2 \\ -3 & 1 \end{pmatrix} = \begin{pmatrix} -2 & 1 \\ \frac{3}{2} & -\frac{1}{2} \end{pmatrix}$$
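The 2×2 formula is easy to code directly. This sketch reproduces Example 2.1.14 and refuses singular matrices:

```python
def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular (det = 0), no inverse exists")
    # Swap the diagonal, negate the off-diagonal, divide by the determinant
    return [[d / det, -b / det], [-c / det, a / det]]

print(inverse_2x2([[3, 2], [1, 4]]))  # [[0.4, -0.2], [-0.1, 0.3]]
```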
Larger Matrices
For matrices larger than $2 \times 2$, computing inverses by hand becomes much more tedious. There are methods like Gaussian elimination and cofactor expansion, but they require many more steps.

For a $3 \times 3$ matrix, the determinant formula involves 6 terms. For a $4 \times 4$ matrix, it involves 24 terms. In general, computing the determinant of an $n \times n$ matrix by the basic formula involves $n!$ (factorial) terms; that's 120 terms for a $5 \times 5$ matrix!
In practice, we let computers handle matrix inverses. Libraries like NumPy provide efficient algorithms:

import numpy as np

A = np.array([[3, 2], [1, 4]])
A_inv = np.linalg.inv(A)
print(A_inv)
# [[ 0.4 -0.2]
#  [-0.1  0.3]]
Due to time constraints and the complexity of the calculations, we will not be computing the inverse of matrices larger than $2 \times 2$ by hand in this class. Instead we will introduce various techniques to compute the inverse of matrices but will not require you to ever compute them by hand.
2.2 Multiple Linear Regression Formally
Now that we've established a foundation in Linear Algebra, we can finally tackle the problem of predicting house pricing with more than one feature. As you saw in our previous house table, a single weight $W$ isn't enough when we have size, bedrooms, and location all affecting the pricing of houses.

In Multiple Linear Regression, we don't just have one coefficient to optimize; we have to manipulate multiple.
Definition 2.2.1.
To make this work, we give every feature its own weight. If we have $p$ features, our prediction formula $\hat{y}$ looks like this:

$$\hat{y} = \beta_0 + X_1 \beta_1 + X_2 \beta_2 + \dots + X_p \beta_p$$

In this equation:
• $\beta_0$ is our intercept, replacing $b$ from our previous chapter
• $X_1, X_2, \dots$ are our features (size, bedrooms, etc.)
• $\beta_1, \beta_2, \dots$ are the specific weights for those features

If you reference library documentation you'll often see this represented as a single matrix multiplication:

$$\hat{y} = X\beta$$
2.3 The Normal Equation and Pseudoinverses
In the previous chapter, we talked about finding the optimal weights using gradient descent. Now that we know some basic linear algebra, however, we can introduce a "perfect" mathematical solution that finds the best weights immediately, without iterative gradient descent.

Definition 2.3.1.
This is called the Normal Equation. If we want to minimize our total error (RSS) across all houses, we can use this formula:

$$\hat{\beta} = (X^T X)^{-1} X^T y$$

When you combine these pieces, the math spits out the exact set of weights that creates the lowest possible error.
To understand what this equation does, let's consider a simple equation $y = \beta x$. If we wanted to find the value of $\beta$ we would simply divide both sides by $x$, like so: $\frac{y}{x} = \beta$. This can also be written as $\beta = x^{-1} y$. This isn't too dissimilar to the normal equation. Let's break down all its parts one-by-one:
Decomposition of the Matrix Form:
Recall that when dealing with multiple observations and features, we represent our system as $y = X\beta$. However, because $X$ is typically a non-square matrix (more observations than features), we cannot simply invert it. We derive the solution:
1. The Gram Matrix
Since $X$ is rectangular, we multiply it by its transpose $X^T$ to create a symmetric, square matrix $X^T X$.
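A quick NumPy check (using the house data from the example that follows) shows what this step buys us: $X$ itself is rectangular, but $X^T X$ is square and symmetric.

```python
import numpy as np

# Rectangular design matrix: 4 observations, 3 columns (intercept, size, bedrooms).
X = np.array([[1, 1000, 2],
              [1, 2000, 3],
              [1, 3000, 4],
              [1, 4000, 5]], dtype=float)

G = X.T @ X                  # the Gram matrix
print(G.shape)               # (3, 3): square even though X is 4 x 3
print(np.allclose(G, G.T))   # True: symmetric
```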
Example 2.3.1.
Let's actually calculate the optimal coefficients for the following dataset:

Size   Bedrooms   Price
1,000   2   100,000
2,000   3   200,000
3,000   4   300,000
4,000   5   400,000

We can first make our data a matrix $X$ (with an intercept column of ones) and define some labels $y$:
$$X = \begin{pmatrix} 1 & 1{,}000 & 2 \\ 1 & 2{,}000 & 3 \\ 1 & 3{,}000 & 4 \\ 1 & 4{,}000 & 5 \end{pmatrix}, \qquad y = \begin{pmatrix} 100{,}000 \\ 200{,}000 \\ 300{,}000 \\ 400{,}000 \end{pmatrix}$$
First we can transpose our matrix:
$$X^T = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1{,}000 & 2{,}000 & 3{,}000 & 4{,}000 \\ 2 & 3 & 4 & 5 \end{pmatrix}$$
Then we can calculate $X^T X$ by multiplying the transpose of $X$ by $X$:
$$X^T X = \begin{pmatrix} 4 & 10{,}000 & 14 \\ 10{,}000 & 30{,}000{,}000 & 40{,}000 \\ 14 & 40{,}000 & 54 \end{pmatrix}$$
We multiply the flipped matrix by the house prices. This shows how each feature relates to the price:
$$X^T y = \begin{pmatrix} 1{,}000{,}000 \\ 3{,}000{,}000{,}000 \\ 4{,}000{,}000 \end{pmatrix}$$
Then finally we can find the inverse and the final solution. This is the part where the math gets very tedious to do by hand, so we will omit the calculation and just give you the result. (Strictly speaking, in this contrived dataset the bedrooms column is an exact linear function of the size column, so $X^T X$ is not invertible; a pseudoinverse, which this section's title hints at, gives the solution below.)
$$\hat{\beta} = \begin{pmatrix} 0 \\ 100 \\ 0 \end{pmatrix}$$
• Intercept $\beta_0 = 0$: the baseline price starts at \$0.
• Size $\beta_1 = 100$: for every 1 square foot, the price increases by \$100.
• Bedrooms $\beta_2 = 0$: in this specific (contrived) dataset, the number of bedrooms didn't add extra value beyond what the square footage already explained.
Our final equation would be:
$$\hat{y} = 0 + 100 x_1 + 0 x_2$$
or in its simplest form: $\hat{y} = 100 x_1$.
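You can verify this result in NumPy. A hedge worth stating: because bedrooms here is an exact linear function of size, $X^T X$ cannot be inverted directly, so the sketch below uses `np.linalg.lstsq`, which solves the same least-squares problem via an SVD-based pseudoinverse and recovers essentially $(0, 100, 0)$.

```python
import numpy as np

# The contrived housing data from Example 2.3.1.
X = np.array([[1, 1000, 2],
              [1, 2000, 3],
              [1, 3000, 4],
              [1, 4000, 5]], dtype=float)
y = np.array([100_000, 200_000, 300_000, 400_000], dtype=float)

# lstsq solves the same least-squares problem as the normal equation but
# stays stable even when X^T X cannot be inverted directly.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(X @ beta))   # [100000. 200000. 300000. 400000.]
```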
2.4 Gradient Descent for Multiple Variables
Sometimes we do not want to take a matrix inverse (or it is too expensive). In that case we can minimize the error iteratively using gradient descent.
Definition 2.4.1.
In matrix form, the prediction is still $\hat{y} = X\beta$. We define a single loss over all houses (batch loss):
$$J(\beta) = \frac{1}{2n} \lVert X\beta - y \rVert^2$$
The gradient of this loss tells us the downhill direction:
$$\nabla_\beta J = \frac{1}{n} X^T (X\beta - y)$$
Then we repeatedly update every weight at once:
$$\beta := \beta - \alpha \nabla_\beta J$$
Example 2.4.1.
Batch gradient descent in practice:
1. Start with a guess for $\beta$ (often all zeros).
2. Compute the predictions $X\beta$.
3. Measure the error $X\beta - y$.
4. Use the gradient formula above to update all weights.
5. Repeat until the loss stops changing much.
This method is slower than the normal equation, but it scales to large datasets
and does not require matrix inversion.
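The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration on the contrived house data, with the features standardized first so that a single learning rate works for both weights (the next section explains why that matters):

```python
import numpy as np

X_raw = np.array([[1000, 2], [2000, 3], [3000, 4], [4000, 5]], dtype=float)
y = np.array([100_000, 200_000, 300_000, 400_000], dtype=float)

# Standardize each feature column, then prepend an intercept column of ones.
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
X = np.column_stack([np.ones(len(X)), X])

beta = np.zeros(X.shape[1])          # step 1: start at all zeros
alpha, n = 0.1, len(y)
for _ in range(2000):
    residual = X @ beta - y          # steps 2-3: predict and measure error
    grad = (1 / n) * X.T @ residual  # gradient of the batch loss
    beta -= alpha * grad             # step 4: update every weight at once

print(np.round(X @ beta))            # recovers the prices
```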
2.5 Feature Scaling
If gradient descent is “walking downhill,” feature scale determines whether that
walk is smooth or chaotic. In a housing dataset, square footage might be around
3000 while bedrooms might be around 3. Both features matter, but they live on very
different numeric scales.
Without scaling, one weight gets giant updates while another barely moves. That is
how you get zig-zagging, very slow progress, or complete divergence.
Figure 3: A literal 3D loss surface view of gradient descent. Left: unscaled features
create a thin valley and chaotic zig-zag steps. Right: scaled features create a rounder
bowl and smoother progress.
Definition 2.5.1.
Feature scaling transforms each feature column so that columns are numerically comparable. The most common method is standardization (z-score scaling):
$$x'_{i,j} = \frac{x_{i,j} - \mu_j}{\sigma_j}$$
with
$$\mu_j = \frac{1}{n} \sum_{i=1}^{n} x_{i,j}, \qquad \sigma_j = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( x_{i,j} - \mu_j \right)^2}$$
Here, $\mu_j$ is the center of feature $j$, and $\sigma_j$ is the feature's typical distance from that center.
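The definition translates directly into NumPy; note that `np.std` divides by $n$, matching the $\sigma_j$ above:

```python
import numpy as np

# Standardize each column of a small feature matrix by hand,
# following x'_{i,j} = (x_{i,j} - mu_j) / sigma_j.
X = np.array([[1000.0, 2.0],
              [2000.0, 3.0],
              [3000.0, 4.0],
              [4000.0, 5.0]])

mu = X.mean(axis=0)      # per-column mean mu_j
sigma = X.std(axis=0)    # per-column std sigma_j (divides by n)
X_scaled = (X - mu) / sigma

print(X_scaled.mean(axis=0))  # ~[0. 0.]: each column is centered
print(X_scaled.std(axis=0))   # [1. 1.]: each column has unit spread
```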
Start with a contrived failure case
Imagine a model with only two input features:
• $x_1$ = square footage (roughly 800 to 4500)
• $x_2$ = bedrooms (roughly 1 to 5)
At initialization, many models start with $\beta = 0$. Then the prediction error is roughly $-y$, and each gradient component is proportional to:
$$\frac{\partial J}{\partial \beta_j} \propto -\frac{1}{n} \sum_{i=1}^{n} y_i \, x_{i,j}$$
That means each gradient component is scaled by feature size itself. A feature with values in the thousands naturally creates much larger updates than one with values near 1 to 5.
Example 2.5.1.
A quick numeric intuition:
• If a typical home has about 2500 square feet and 3 bedrooms, then the raw magnitude ratio is about $\frac{2500}{3} \approx 833$.
• So, before scaling, the square-footage gradient can easily be hundreds of times larger than the bedrooms gradient.
This does not mean bedrooms are unimportant. It means the optimizer is being biased by units.
Figure 4: Raw feature space vs standardized feature space. After z-score scaling,
both features occupy comparable numeric ranges.
Read this figure left to right:
• Left panel: the model sees one axis with values in the thousands and another near single digits.
• Right panel: both axes are centered around 0 with similar spread, so optimization is more balanced.
Figure 5: Step-0 gradient magnitudes (log scale). Raw features create an extreme
update imbalance; scaled features reduce that gap.
This is the mechanism, not just the symptom: when feature scales differ wildly, one
weight receives most of the update budget.
Training run walkthrough: good, bad, and ugly
Now look at an entire training run for the same dataset under different setups.
Figure 6: Three gradient descent runs: unscaled with tiny learning rate (slow),
unscaled with larger learning rate (diverges), and scaled with larger learning rate
(fast + stable).
Example 2.5.2.
How to read the three curves:
1. Raw + tiny learning rate: stable, but painfully slow. You need very small steps to avoid blowing up.
2. Raw + slightly larger learning rate: instability appears quickly and training diverges.
3. Scaled + much larger learning rate: loss drops quickly and smoothly.
In other words, scaling often expands the range of learning rates that actually work.
Why this happens geometrically
Gradient descent is navigating a loss surface in parameter space. With poor scaling,
that surface is often a thin valley. With better scaling, the valley becomes rounder.
Figure 7: Loss contours and parameter paths. Unscaled features create a thin valley
and zig-zag path; scaled features produce a rounder bowl and a more direct path.
Walkthrough:
• Left: narrow contours force the optimizer to bounce side-to-side (zig-zag).
• Right: contours are more circular, so the path can head toward the minimum more directly.
Generalizing beyond this toy dataset
This pattern shows up in almost every multifeature model trained with gradient-based methods:
• If features use different units (dollars, years, square feet, counts), scaling is usually necessary.
• Standardization is a strong default for linear models and neural networks.
• Always fit the scaling parameters ($\mu$, $\sigma$) on training data only, then reuse them for validation/test data.
If you skip that last step and recompute scaling independently on test data, you change the meaning of each feature and your evaluation becomes unreliable.
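Here is a sketch of the correct workflow, with synthetic data standing in for real features: compute $\mu$ and $\sigma$ on the training split, then apply those same numbers everywhere else.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=2500, scale=800, size=(100, 1))  # e.g. square footage
X_test = rng.normal(loc=2500, scale=800, size=(20, 1))

# Fit the scaling parameters on the TRAINING data only...
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)

# ...then reuse the same mu and sigma for validation/test data.
X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma

# Recomputing mu and sigma on the test set would silently redefine
# the features and make the evaluation unreliable.
```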
2.6 Interpreting Weights After Scaling
When you scale features, the model learns weights for the scaled version of each feature. That means the coefficients are no longer in the original units (square feet, bedrooms, years). This is why a scaled gradient descent model can look very different from the normal equation or scikit-learn if those were trained on raw features.
Definition 2.6.1.
If we standardize each feature with
$$x'_j = \frac{x_j - \mu_j}{\sigma_j}$$
and train a model with weights $\beta'$ and intercept $\beta'_0$, then the equivalent weights in the original feature units are:
$$\beta_j = \frac{\beta'_j}{\sigma_j}, \qquad \beta_0 = \beta'_0 - \sum_j \frac{\beta'_j \, \mu_j}{\sigma_j}$$
Example 2.6.1.
Suppose your trained scaled model has:
• $\beta'_0 = 420{,}000$
• $\beta'_{\text{sqft}} = 126{,}500$ and $\sigma_{\text{sqft}} = 1100$
• $\beta'_{\text{bed}} = 19{,}000$ and $\sigma_{\text{bed}} = 0.9$
• $\mu_{\text{sqft}} = 2500$
• $\mu_{\text{bed}} = 3.4$
Convert back:
• $\beta_{\text{sqft}} = \frac{126{,}500}{1100} \approx 115$ dollars per square foot
• $\beta_{\text{bed}} = \frac{19{,}000}{0.9} \approx 21{,}111$ dollars per additional bedroom
So the scaled model is still interpretable, but only after converting back to original units.
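The conversion can be checked with a few lines of arithmetic, plugging in the numbers from this example:

```python
# Convert weights learned on standardized features back to original units,
# using the values from Example 2.6.1.
beta0_s = 420_000                                   # scaled-model intercept
beta_sqft_s, sigma_sqft, mu_sqft = 126_500, 1100, 2500
beta_bed_s, sigma_bed, mu_bed = 19_000, 0.9, 3.4

beta_sqft = beta_sqft_s / sigma_sqft                # dollars per square foot
beta_bed = beta_bed_s / sigma_bed                   # dollars per extra bedroom
beta0 = beta0_s - (beta_sqft_s * mu_sqft / sigma_sqft
                   + beta_bed_s * mu_bed / sigma_bed)

print(round(beta_sqft, 1), round(beta_bed), round(beta0))  # 115.0 21111 60722
```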
2.7 Linear Algebra Practice Problems
Sets and Number Sets
Problem 1. Classify each of the following values into the most specific number set ($\mathbb{N}$, $\mathbb{Z}$, $\mathbb{Q}$, $\mathbb{R}$, or $\mathbb{C}$):
• $7$
• $-3$
• $0.75$
• $\sqrt{2}$
• $3 + 2i$
Problem 2. Classify each value into the most specific number set ($\mathbb{N}$, $\mathbb{Z}$, $\mathbb{Q}$, $\mathbb{R}$, or $\mathbb{C}$):
• $0$
• $-\frac{7}{3}$
• $\sqrt{9}$
• $5 + 0i$
Problem 3.
True or False: Every natural number is also a rational number. Explain
your reasoning using the subset relationships.
Problem 4.
True or False: Every real number is also a rational number. If false, give
a counterexample.
Problem 5.
A machine learning dataset contains the following columns: “number
of bedrooms” (values like 2, 3, 4) and “house price” (values like $245,000.50). Which
number set would you use to describe each column?
Problem 6. If $A \subset B$ and $B \subset C$, what can you conclude about the relationship between $A$ and $C$? Apply this to explain why $\mathbb{N} \subset \mathbb{R}$.
Problem 7. Give an example of a number that is in $\mathbb{R}$ but not in $\mathbb{Q}$. Why does this distinction matter for computer representations of numbers?
Vectors and Vector Operations
Problem 8. A data point for a student has the following features: GPA (3.5), SAT score (1400), and number of extracurriculars (4). Write this as a 3-vector in column notation.
Problem 9. Given two vectors $\boldsymbol{a} = (2, 5, -1)^T$ and $\boldsymbol{b} = (3, -2, 4)^T$, compute $\boldsymbol{a} + \boldsymbol{b}$.
Problem 10. Compute $3\boldsymbol{v}$ where $\boldsymbol{v} = (4, -2, 7)^T$.
Problem 11. Compute $-2\boldsymbol{a} + \boldsymbol{b}$ where $\boldsymbol{a} = (3, 1, -4)^T$ and $\boldsymbol{b} = (-1, 5, 2)^T$.
Problem 12. Given $\boldsymbol{u} = (1, 2)^T$ and $\boldsymbol{w} = (4, 6)^T$, compute $2\boldsymbol{u} + 3\boldsymbol{w}$.
Problem 13. State whether each expression is defined. If it is, compute it.
1. $\boldsymbol{p} + \boldsymbol{q}$ where $\boldsymbol{p} = (1, 2, 3)^T$ and $\boldsymbol{q} = (4, 5, 6)^T$
2. $\boldsymbol{r} + \boldsymbol{s}$ where $\boldsymbol{r} = (1, 2)^T$ and $\boldsymbol{s} = (3, 4, 5)^T$
Problem 14. Why can't you add the vectors $\boldsymbol{p} = (1, 2, 3)^T$ and $\boldsymbol{q} = (4, 5)^T$? In a machine learning context, what would this situation represent?
Notation and Dimensionality
Problem 15. If a matrix $M \in \mathbb{R}^{50 \times 7}$, how many rows does it have? How many columns? How many total entries?
Problem 16. If $A \in \mathbb{R}^{4 \times 2}$, how many entries are in $A$?
Problem 17. Write the general form of a matrix $A \in \mathbb{R}^{2 \times 3}$ using subscript notation for each element.
Problem 18. Write the general form of a matrix $B \in \mathbb{R}^{3 \times 2}$ using subscript notation.
Problem 19. You have a dataset of 1000 images, where each image is represented by 784 pixel values. What is the shape of the data matrix $X$ if each row is one image? Write it in the form $X \in \mathbb{R}^{p \times d}$.
Problem 20. Given $X \in \mathbb{R}^{3 \times 4}$, what element is located at row 2, column 3? Write it using subscript notation.
Problem 21. A machine learning model takes in a data matrix $X \in \mathbb{R}^{n \times d}$ and outputs predictions $\hat{y} \in \mathbb{R}^n$. Explain in plain English what $n$ and $d$ represent.
Dot Product
Problem 22. Compute the dot product of $\boldsymbol{a} = (2, 3, 1)^T$ and $\boldsymbol{b} = (4, -1, 5)^T$.
Problem 23. Compute the dot product of $\boldsymbol{u} = (-2, 0, 3)^T$ and $\boldsymbol{v} = (5, 4, -1)^T$.
Problem 24. If $\boldsymbol{x} = (1, 0, 0)^T$ and $\boldsymbol{y} = (0, 1, 0)^T$, compute $\boldsymbol{x} \cdot \boldsymbol{y}$.
Problem 25. A house has features $\boldsymbol{x} = (1, 2000, 3)^T$ (intercept, square footage, bedrooms) and the model weights are $\boldsymbol{\beta} = (50{,}000, 100, 5000)^T$. Compute the predicted price using the dot product.
Problem 26. Compute $\boldsymbol{v} \cdot \boldsymbol{v}$ where $\boldsymbol{v} = (3, 4)^T$.
Problem 27. Given $\boldsymbol{p} = (2, -1)^T$ and $\boldsymbol{q} = (4, 3)^T$, compute $\boldsymbol{p} \cdot \boldsymbol{q}$.
Problem 28. Why must two vectors have the same dimension for the dot product to be defined? Give a practical example where this constraint matters.
Matrix Multiplication
Problem 29. Given $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}$, compute the element in row 1, column 2 of $AB$.
Problem 30. If $A \in \mathbb{R}^{3 \times 4}$ and $B \in \mathbb{R}^{4 \times 2}$, what is the shape of the product $AB$?
Problem 31. Can you multiply $P \in \mathbb{R}^{2 \times 3}$ by $Q \in \mathbb{R}^{2 \times 3}$? Explain why or why not.
Problem 32. Compute the full matrix product:
$$\begin{pmatrix} 2 & 0 \\ 1 & 3 \end{pmatrix} \begin{pmatrix} 1 & 4 \\ 2 & 5 \end{pmatrix}$$
Problem 33. Compute the full matrix product:
$$\begin{pmatrix} 1 & -1 & 2 \\ 0 & 3 & 1 \end{pmatrix} \begin{pmatrix} 2 & 1 \\ -1 & 0 \\ 4 & -2 \end{pmatrix}$$
Problem 34. Let $A \in \mathbb{R}^{2 \times 3}$ and $B \in \mathbb{R}^{3 \times 1}$. What is the shape of $AB$?
Transpose
Problem 35. Compute the transpose of $A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}$.
Problem 36. If $M \in \mathbb{R}^{10 \times 3}$, what is the shape of $M^T$?
Problem 37. Given the column vector $\boldsymbol{v} = (2, 5, 8)^T$, write $\boldsymbol{v}^T$.
Problem 38. Verify that $(A^T)^T = A$ for $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{pmatrix}$.
Problem 39. Let $A = \begin{pmatrix} 1 & 2 \\ -3 & 0 \end{pmatrix}$ and $B = \begin{pmatrix} 4 & -1 \\ 2 & 5 \end{pmatrix}$. Compute $(A + B)^T$.
Determinant
Problem 40. Compute the determinant of $A = \begin{pmatrix} 5 & 3 \\ 2 & 4 \end{pmatrix}$.
Problem 41. Compute the determinant of $D = \begin{pmatrix} 7 & -1 \\ 3 & 2 \end{pmatrix}$.
Problem 42. Compute the determinant of $B = \begin{pmatrix} -2 & 6 \\ 3 & -9 \end{pmatrix}$. Does this matrix have an inverse?
Problem 43. For the matrix $C = \begin{pmatrix} a & 2a \\ b & 2b \end{pmatrix}$, compute the determinant. What does this tell you about matrices where one column is a multiple of the other?
Problem 44. If $\det(A) = 5$, what is $\det(2A)$ for a $2 \times 2$ matrix? (Hint: work out an example.)
Problem 45. The determinant has a geometric interpretation: it tells us how a matrix scales area. If $\det(A) = 3$, what happens to the area of a unit square when transformed by $A$? What if $\det(A) = -2$?
Identity Matrix and Inverse
Problem 46. Write the $2 \times 2$ identity matrix $I_2$ and the $3 \times 3$ identity matrix $I_3$.
Problem 47. Compute $I_2 A$ for $A = \begin{pmatrix} -1 & 4 \\ 2 & 0 \end{pmatrix}$.
Problem 48. Verify that $A I_2 = A$ for $A = \begin{pmatrix} 3 & 7 \\ 2 & 5 \end{pmatrix}$.
Problem 49. Compute the inverse of $A = \begin{pmatrix} 4 & 3 \\ 2 & 2 \end{pmatrix}$ by hand using the formula $A^{-1} = \frac{1}{\det(A)} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$.
Problem 50. Compute the inverse of $B = \begin{pmatrix} 5 & 2 \\ 7 & 3 \end{pmatrix}$ by hand. Show all steps.
Problem 51. Compute the inverse of $C = \begin{pmatrix} 1 & 2 \\ 3 & 5 \end{pmatrix}$ by hand.
Problem 52. Attempt to compute the inverse of $D = \begin{pmatrix} 6 & 3 \\ 4 & 2 \end{pmatrix}$. What happens and why?
Problem 53. If $A^{-1} = \begin{pmatrix} 2 & -1 \\ -3 & 2 \end{pmatrix}$, find $A$.
Problem 54. For which values of $k$ is the matrix $A = \begin{pmatrix} 1 & k \\ 2 & 4 \end{pmatrix}$ invertible?
Mixed Practice
Problem 55. Let $\boldsymbol{a} = (2, -1, 0)^T$ and $\boldsymbol{b} = (1, 3, -2)^T$. Compute $(\boldsymbol{a} + \boldsymbol{b}) \cdot \boldsymbol{b}$.
Problem 56. Let $A = \begin{pmatrix} 1 & 3 & -1 \\ 0 & 2 & 4 \end{pmatrix}$. Compute $A^T$ and then compute $A^T A$.
Problem 57. Let $B = \begin{pmatrix} 2 & 1 \\ -4 & -2 \end{pmatrix}$. Determine whether $B$ is invertible and explain why.
2.8 Problem Packet
All required calculus derivations are provided in full. For each derivation, you will
be asked to apply or interpret the result rather than re-derive it. Show computations
where requested and explain interpretations in complete sentences.
Data Representation and Design Matrix
Problem 1.
Given the dataset below, write the design matrix
𝑋
including an
intercept column and the label vector
𝑦
. Then interpret the meaning of each column
in one sentence.
Size (ft²)   Bedrooms   Price ($)
1,000   2   100,000
2,000   3   200,000
3,000   4   300,000
4,000   5   400,000

Problem 2. State the shape of $X$ in the form $X \in \mathbb{R}^{p \times d}$ and explain what $p$ and $d$ represent in this dataset.
Matrix Operations (Reference)
Problem 3. Using the design matrix from Problem 1, compute $X^T X$ and $X^T y$. Arithmetic may be shown succinctly. Then explain in one sentence what each of these matrices/vectors measures.
Normal Equation (Derivation Provided)
The loss is the residual sum of squares:
$$L(\beta) = (y - X\beta)^T (y - X\beta)$$
Worked derivation:
$$L(\beta) = y^T y - 2\beta^T X^T y + \beta^T X^T X \beta$$
Taking the gradient with respect to $\beta$:
$$\nabla_\beta L(\beta) = -2 X^T y + 2 X^T X \beta$$
Setting to zero:
$$-2 X^T y + 2 X^T X \beta = 0 \;\Longrightarrow\; X^T X \beta = X^T y$$
The solution is:
$$\hat{\beta} = (X^T X)^{-1} X^T y$$
Problem 4. Using the formula above, compute $\hat{\beta}$ for Problem 1. Interpret each coefficient in a single clear sentence.
Problem 5. Explain why $\hat{\beta}$ is a minimizer of the loss using geometric or algebraic intuition.
MSE and Gradient (Derivation Provided)
$$\mathcal{L}(\beta) = \frac{1}{N} (y - X\beta)^T (y - X\beta)$$
Worked derivation:
$$\nabla_\beta \mathcal{L}(\beta) = \frac{1}{N} \left( -2 X^T y + 2 X^T X \beta \right) = -\frac{2}{N} X^T (y - X\beta)$$
The gradient descent update with learning rate $\eta$ is:
$$\beta \leftarrow \beta - \eta \, \nabla_\beta \mathcal{L}(\beta) = \beta + \frac{2\eta}{N} X^T (y - X\beta)$$
Problem 6. Apply the formula. Given $N = 3$, $\eta = 0.1$:
$$X = \begin{pmatrix} 1 & 2 & 1 \\ 1 & 4 & 2 \\ 1 & 1 & 0 \end{pmatrix}, \qquad \beta = \begin{pmatrix} 50 \\ 5 \\ 2 \end{pmatrix}, \qquad y = \begin{pmatrix} 65 \\ 75 \\ 52 \end{pmatrix}$$
Compute $y - X\beta$, then $X^T (y - X\beta)$, and the update increment $\Delta\beta$.
Problem 7. Explain in two sentences what the vector $X^T (y - X\beta)$ represents.
RSS and MSE Application
Problem 8. Using the results from Problem 6 (where $e = (3, 1, -3)^T$), explain whether the model under- or over-predicts for each observation and state what an MSE of 6.33 means.
Collinearity and Remedies
Problem 9. Explain in two sentences how near-collinearity affects the stability of $\hat{\beta}$. Propose a practical remedy with a one-sentence explanation.
Appendix: Programming
Reference
3
Python and Libraries
3.1
NumPy
Accessing NumPy
To use NumPy you must first import it. It is common practice to import NumPy
with the alias np so that subsequent function calls are concise.
import numpy as np
Arrays and Vectors
In NumPy an array is a generic term for a multidimensional set of numbers. One-
dimensional NumPy arrays act like vectors. The following code creates two one-
dimensional arrays and adds them elementwise. If you attempted the same with
plain Python lists you would not get elementwise addition.
x = np.array([3, 4, 5])
y = np.array([4, 9, 7])
print(x + y)  # array([ 7, 13, 12])
Matrices as Two-Dimensional Arrays
Matrices in NumPy are typically represented as two-dimensional arrays. The object
returned by np.array has attributes such as ndim for the number of dimensions,
dtype for the data type, and shape for the size of each axis.
x = np.array([[1, 2], [3, 4]])
print(x)        # array([[1, 2], [3, 4]])
print(x.ndim)   # 2
print(x.dtype)  # e.g. dtype('int64')
print(x.shape)  # (2, 2)
If any element passed into np.array is a floating point number, NumPy upcasts the
whole array to a floating point dtype.
print(np.array([[1, 2], [3.0, 4]]).dtype)       # dtype('float64')
print(np.array([[1, 2], [3, 4]], float).dtype)  # dtype('float64')
Methods and Functions
Methods are functions bound to objects. Calling x.sum() calls the sum method with
x as the implicit first argument. The module-level function np.sum(x) does the same
computation but is not bound to x.
x = np.array([1, 2, 3, 4])
print(x.sum())    # method on the array object
print(np.sum(x))  # module-level function
The reshape method returns a new view with the same data arranged into a new
shape. You pass a tuple that specifies the new dimensions.
x = np.array([1, 2, 3, 4, 5, 6])
print("beginning x:\n", x)
x_reshape = x.reshape((2, 3))
print("reshaped x:\n", x_reshape)
NumPy uses zero-based indexing. The first row and first column entry of x_reshape
is accessed with x_reshape[0, 0]. The entry in the second row and third column is
x_reshape[1, 2]. The third element of the original one-dimensional x is x[2].
print(x_reshape[0, 0])  # 1
print(x_reshape[1, 2])  # 6
print(x[2])             # third element of x
Views and Shared Memory
Reshaping often returns a view rather than a copy. Modifying a view will modify
the original array because they share the same memory. This behavior is important
when you expect independent copies.
print("x before modification:\n", x)
print("x_reshape before modification:\n", x_reshape)
x_reshape[0, 0] = 5
print("x_reshape after modification:\n", x_reshape)
print("x after modification:\n", x)
If you need an independent copy, call x.copy() explicitly.
x = np.array([1, 2, 3, 4, 5, 6])
x_copy = x.copy()
x_reshape_copy = x_copy.reshape((2, 3))
x_reshape_copy[0, 0] = 99
print("x remains unchanged:\n", x)
print("x_reshape_copy changed:\n", x_reshape_copy)
Tuples are immutable sequences in Python and will raise a TypeError if you try to
modify an element. This differs from NumPy arrays and Python lists.
my_tuple = (3, 4, 5)
# my_tuple[0] = 2  # would raise TypeError: 'tuple' object does not support item assignment
Transpose, ndim, and shape
You can request several attributes at once. The transpose T flips axes and is useful
for matrix algebra.
print(x_reshape.shape, x_reshape.ndim, x_reshape.T)
# For example: ((2, 3), 2, array([[5, 4], [2, 5], [3, 6]]))
Elementwise Operations
NumPy supports elementwise arithmetic and universal functions such as np.sqrt.
Raising an array to a power is elementwise.
print(np.sqrt(x))  # elementwise square root
print(x**2)        # elementwise square
print(x**0.5)      # alternative for square root
Random Numbers
NumPy provides random number generation. The signature for rng.normal is
normal(loc=0.0, scale=1.0, size=None). The arguments loc and scale are keyword
arguments for mean and standard deviation and size controls the shape of the
output.
x = np.random.normal(size=50)
print(x)  # random sample from N(0,1), different each run
To create a dependent array, add a random variable with a different mean to each
element.
y = x + np.random.normal(loc=50, scale=1, size=50)
print(np.corrcoef(x, y))  # correlation matrix between x and y
Reproducibility with the Generator API
To produce identical random numbers across runs, use np.random.default_rng with
an integer seed to create a Generator object and then call its methods. The
Generator API is the recommended approach for reproducibility.
rng = np.random.default_rng(1303)
print(rng.normal(scale=5, size=2))
rng2 = np.random.default_rng(1303)
print(rng2.normal(scale=5, size=2))
# Both prints produce the same arrays because the same seed was used.
When you use rng.standard_normal or rng.normal you are using the Generator
instance, which ensures reproducibility if you control the seed.
rng = np.random.default_rng(3)
y = rng.standard_normal(10)
print(np.mean(y), y.mean())
Mean, Variance, and Standard Deviation
NumPy provides np.mean, np.var, and np.std as module-level functions. Arrays also
have methods mean, var, and std. By default np.var divides by n. If you need the
sample variance that divides by n minus 1, provide ddof=1.
rng = np.random.default_rng(3)
y = rng.standard_normal(10)
print(np.var(y), y.var(), np.mean((y - y.mean())**2))
print(np.sqrt(np.var(y)), np.std(y))
# Use np.var(y, ddof=1) for sample variance dividing by n-1.
Axis Arguments and Row/Column Operations
NumPy arrays are row-major ordered. The first axis, axis=0, refers to rows and the
second axis, axis=1, refers to columns. Passing axis into reduction methods lets you
compute means, sums, and other statistics along rows or columns.
rng = np.random.default_rng(3)
X = rng.standard_normal((10, 3))
print(X)               # 10 by 3 matrix
print(X.mean(axis=0))  # column means
print(X.mean(0))       # same as previous
When you compute X.mean(axis=1) you obtain a one-dimensional array of row
means. When you compute X.sum(axis=0) you obtain column sums.
Graphics with Matplotlib
Matplotlib is the standard plotting library. A plot consists of a figure and one or
more axes. The subplots function returns a tuple containing the figure and the axes.
The axes object has a plot method and other methods to customize titles, labels, and
markers.
from matplotlib.pyplot import subplots

fig, ax = subplots(figsize=(8, 8))
rng = np.random.default_rng(3)
x = rng.standard_normal(100)
y = rng.standard_normal(100)
ax.plot(x, y)       # default line plot
ax.plot(x, y, 'o')  # scatter-like circles
# To save: fig.savefig("scatter.png")
# To display in an interactive session: import matplotlib.pyplot as plt; plt.show()
Practical Notes
Using np.random.default_rng for all random generation in these examples makes
results reproducible across runs on the same NumPy version. As NumPy changes
across versions there may be minor differences in outputs for some operations.
When computing variance note the ddof argument if you expect sample variance
rather than population variance.
Appendix: Math Fundamentals
4
Calculus
4.1
Limits
4.2
Derivatives
Limit Definition of Derivative
4.3
Gradients
Vector Valued Functions
Gradient Definition
Partial Derivatives and Rules
5
Linear Algebra
6
Statistics and Probability
Reference
Glossary of Definitions
Definition 0.1, p. 8
Definition 0.2, p. 8
Definition 1.3.1, p. 11
Definition 1.5.1, p. 13
Definition 1.5.2, p. 14
Definition 1.7.1, p. 18
Definition 2.1.1, p. 22
Definition 2.1.2, p. 23
Definition 2.1.3, p. 23
Definition 2.1.4, p. 24
Definition 2.1.5, p. 24
Definition 2.1.6, p. 25
Definition 2.1.7, p. 25
Definition 2.1.8, p. 26
Definition 2.1.9, p. 26
Definition 2.1.10, p. 27
Definition 2.1.11, p. 28
Definition 2.1.12, p. 28
Definition 2.1.13, p. 28
Definition 2.1.14, p. 28
Definition 2.2.1, p. 30
Definition 2.3.1, p. 30
Definition 2.4.1, p. 32
Definition 2.5.1, p. 33
Definition 2.6.1, p. 36