Introduction to Artificial Intelligence and Machine Learning
Aniketh Chenjeri, Andrew Doyle, Swarnim Ghimire, Mr. Igor Tomcej
2026-01-07
Contents
Foreword ........ 4
How Is This Book Structured? ........ 5
Part I: Supervised Learning ........ 7
1 Linear Regression ........ 8
  1.1 What is Linear Regression? ........ 8
  1.2 The Limits of Simple Estimation ........ 9
  1.3 Linear Regression Formally Defined ........ 10
  1.4 Understanding the Algorithm ........ 11
  1.5 Measuring and Interpreting Error ........ 11
  1.6 Optimization with Gradient Descent ........ 13
  1.7 Understanding Gradients ........ 14
  1.8 Lab: Building Linear Regression from Scratch ........ 18
  1.9 Problem Packet ........ 18
    Theory Questions ........ 18
    Practice Problems ........ 19
Appendix: Programming Reference ........ 20
2 Python and Libraries ........ 20
  2.1 NumPy ........ 20
    Accessing NumPy ........ 20
    Arrays and Vectors ........ 20
    Matrices as Two-Dimensional Arrays ........ 20
    Methods and Functions ........ 20
    Views and Shared Memory ........ 21
    Transpose, ndim, and shape ........ 21
    Elementwise Operations ........ 21
    Random Numbers ........ 21
    Reproducibility with the Generator API ........ 22
    Mean, Variance, and Standard Deviation ........ 22
    Axis Arguments and Row/Column Operations ........ 22
    Graphics with Matplotlib ........ 22
    Practical Notes ........ 23
Appendix: Math Fundamentals ........ 24
3 Calculus ........ 24
  3.1 Limits ........ 24
  3.2 Derivatives ........ 24
    Limit Definition of Derivative ........ 24
  3.3 Gradients ........ 24
    Vector Valued Functions ........ 24
    Gradient Definition ........ 24
    Partial Derivatives and Rules ........ 24
4 Linear Algebra ........ 24
5 Statistics and Probability ........ 24
Reference ........ 25
Glossary of Definitions ........ 26
Foreword
The Introduction to Artificial Intelligence (AI) and Machine Learning (ML) class has
had an interesting history. It started at Creek during the 2024-2025 school year with
the intention of giving students a gentle on-ramp to the field. The course dodged
heavy math and focused on teaching practical libraries: TensorFlow, NumPy,
Pandas, Scikit-Learn.
Due to unforeseen circumstances, the 2025-2026 school year began without a solid
foundation for the course. This prompted a comprehensive reflection and led us to
rebuild the syllabus from scratch. While our original intention (a gentle
introduction to machine learning) stayed unchanged, everyone involved agreed we
needed a more rigorous approach alongside an emphasis on intuition.
This task proved difficult because the subject matter of the course is inherently
interdisciplinary. We needed to make the course more intuitive and less
mathematical while still covering the field’s fundamentals.
This textbook represents a collection of lessons, notes, and exercises designed to
serve as the course’s foundation. At the time of writing, there are no mathematical
prerequisites, yet mathematics permeates the field of AI and ML. We faced a choice:
omit the mathematics and build a less rigorous course emphasizing breadth, or
include it and slow our pace. We chose the latter.
This textbook is written for readers with no calculus background. Rather than
requiring formal mathematical preparation, it assumes basic algebraic
understanding and builds intuition about which mathematical operations are
necessary. We will never ask you to compute complex formulas without first
teaching you what each part represents; we focus on understanding and developing
an intuition for what the math represents. This is a practical course, not a
theoretical one. It will give you a solid foundation to build upon if you decide to
pursue a career in the field. However, this book is by no means a replacement for
the mathematics you’ll need to learn if you decide to continue studying this field;
we instead hope to give you a starting point on which you can build.
We gratefully acknowledge the following contributors:
• Primary Writers:
  ‣ Aniketh Chenjeri (CCHS ‘26)
  ‣ Andrew Doyle (CCHS ‘26)
  ‣ Mr. Igor Tomcej
• Reviewers:
  ‣ Hariprasad Gridharan (CCHS ‘25, Cornell ‘29)
  ‣ Siddharth Menon (CCHS ‘26)
  ‣ Ani Gadepalli (CCHS ‘26)
How Is This Book Structured?
NOTE: This text is a living document, currently undergoing active development as part of our commitment to pedagogical excellence. In order to ensure rigorous academic standards, chapters are released sequentially following comprehensive peer review. This is to say that the version of the text you are viewing right now is not the final one; expect updates to various parts of the book as we continue to refine and improve the content.

While every effort is made to provide an accurate and authoritative resource, please note that early editions may contain errata. We encourage students to actively engage with the material; should you identify any discrepancies or technical inaccuracies, please report them to your teacher or teaching assistant(s) for correction in future revisions.

We appreciate your cooperation in refining this resource for future cohorts. Your feedback is instrumental in ensuring the highest quality of instruction and material.
This book is written to emphasize an understanding of fundamental concepts in
Artificial Intelligence and Machine Learning. We will begin with supervised
learning and its applications. We are starting with supervised learning because it’s
one of the most common techniques used by practitioners. It’s also the most
intuitive and easy to understand. In this section we’ll also slowly introduce
mathematical concepts and give you exercises to solidify your understanding. We
will never require you to do any calculus; however, it becomes impossible to
understand many algorithms without it. As such we will always “solve” any
calculus involved and ask you to interpret and apply it. This doesn’t mean this
course will omit math entirely; we will learn a lot of applied linear algebra, as it is
the core of how machine learning works. Under supervised learning you will also
learn important concepts for assessing the accuracy of a model and some insight
into which architectures are used for which scenarios. To truly accomplish this we
need to understand many statistics concepts; as a result this book will cover many
topics in statistics and probability. Some of these concepts will be familiar to you if
you take an AP Statistics class or equivalent.
We will then move on to unsupervised learning, where we’ll make a brief stop with
the k-means clustering algorithm and then move to neural networks. Understanding
neural networks is the ultimate goal of this class, as they are a ubiquitous and
powerful tool. If you gain an understanding of neural networks you will be able to
understand many complex algorithms such as Large Language Models, which are
the foundation for tools like ChatGPT, Google Gemini, Anthropic’s Claude, and
more.
Towards the end of this book we’ve compiled a series of documentation-like
chapters for various libraries, frameworks, and mathematical concepts. If you find
yourself not understanding certain concepts, tools, etc., you can always refer to
these documents.
The only assumption we make in the writing of this book is some familiarity with
Python (and programming in general) and Algebra 2. Even though we will cover
theory in this class, it will be a programming class first and foremost. You will write
a lot of code but will also be asked to understand theory and math.
In the compilation of this book we’ve pulled from various resources:
• An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
• The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
• Various books by Justin Skyzak (including Introduction to Algorithms and Machine Learning)
• Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong
• The Matrix Cookbook by Kaare Brandt Petersen and Michael Syskind Pedersen (Note: Often associated with the Joseph Montanez online version)
Part I: Supervised Learning
In supervised learning, we are provided with some input $X$ and a corresponding response $Y$. Our fundamental objective is to estimate the relationship between these variables, typically modeled as:

$$Y = f(X) + \varepsilon$$

where $f$ is an unknown fixed function of the predictors and $\varepsilon$ is a random error term, independent of $X$, with a mean of zero. Using a training dataset $\mathcal{T} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, we aim to apply a learning method to estimate the function $\hat{f}$. This estimate allows us to predict the response for previously unseen observations. It is also common to express this relationship as $\hat{y} = \hat{f}(X)$.
Definition 0.1.
A feature is an individual, measurable property or characteristic of a phenomenon being observed. In machine learning, it serves as the input variable $x$ that a model uses to identify patterns and make predictions. For example if we want to predict the price of a house based on its size, number of bedrooms, and location, the features are size, number of bedrooms, and location.
Definition 0.2.
A label is the output variable $y$ that a model is trained to predict. For example if we want to predict the price of a house based on its size, number of bedrooms, and location, the label is the price of the house.
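To make these two definitions concrete, here is a minimal sketch of how features and a label are typically laid out in code (the numbers and the location encoding are made up purely for illustration):

import numpy as np

# Each row is one house; the columns are the features: size (sq ft), bedrooms, location code.
X = np.array([
    [1400, 3, 0],
    [2100, 4, 1],
    [ 950, 2, 0],
])

# The label: the price of each house, in thousands of dollars.
y = np.array([250, 410, 180])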
In the supervised learning portion of this textbook, we will learn the following:
• Linear Regression via the Pseudoinverse: We will learn the closed-form analytical solution for parameter estimation using the Moore-Penrose pseudoinverse and the Normal Equation. This includes multiple linear regression, where we will learn how to handle multiple predictors simultaneously.
• Optimization via Gradient Descent and Stochastic Gradient Descent (SGD): We will learn the notation for and how to implement Gradient Descent and Stochastic Gradient Descent for optimization.
• The Bias-Variance Tradeoff: Understanding the decomposition of prediction error into reducible and irreducible components, and navigating the relationship between overfitting and underfitting.
• Polynomial Regression: Extending the linear framework to capture non-linear relationships by mapping predictors into higher-dimensional feature spaces.
• Shrinkage Methods (Ridge and Lasso): Applying $L_1$ and $L_2$ regularization to minimize the Residual Sum of Squares (RSS) while controlling model complexity and performing variable selection.
• Logistic Regression: Transitioning to classification by modeling the conditional probability of qualitative responses using the logit transformation.
• k-Nearest Neighbors (k-NN): Utilizing a non-parametric, memory-based approach to prediction based on local density and spatial proximity in the feature space.
Don’t worry if none of that makes sense to you; we’ll be covering it in detail in the coming chapters.
1 Linear Regression
1.1 What is Linear Regression?
To recall, supervised learning starts with known input data $X$ and known output data $Y$, and we are asked to fit an equation to this dataset.
One of the simplest approaches is to assume a linear relationship between inputs
and outputs. This assumption is usually wrong and somewhat contrived, but linear
models serve an important purpose: they give you a baseline. Once you have a
baseline model, you can test more complex models against it and see if they actually
perform better.
Traditionally, linear regression requires heavy use of linear algebra. We’re not going to get too far into the weeds here, since this course doesn’t have a math prerequisite. Instead, we’ll use Sklearn (scikit-learn), a popular machine learning library. But before we jump into Sklearn’s code, we need to build intuition about how its algorithms actually work. Once you understand the underlying process, using the library itself becomes straightforward. This means we’ll need to do some math along the way.
Example 1.1.1.
Let’s engage in a hypothetical. Suppose you’re given two thermometers, one reading in Celsius and one in Fahrenheit, and asked to measure the temperature with both. Let’s say our results look like this:

Celsius (x)    Fahrenheit (y)
0              31.8
5              41.9
10             49.2
15             60.1
20             67.4
25             78.9
30             87.5

Let’s plot the data:
Now we know that the equation for converting Celsius to Fahrenheit is

$$y = 1.8x + 32$$

But assume you don’t know this equation and are asked to find it purely based on the data. After sitting with the problem for a while you’ll probably realize that you can use the equation $m = \frac{y_2 - y_1}{x_2 - x_1}$ to estimate the coefficient in the equation, and we’ve been given a y-intercept at $(0, 31.8)$, so from our given data, after computing, we can say our equation is

$$\hat{y} = 1.85x + 31.8$$
Note: In this equation, the hat symbol ($\hat{\ }$) indicates a predicted value or an estimated parameter, not a measured or input variable. So in our model $\hat{y} = 1.85x + 31.8$, $\hat{y}$ represents the predicted Fahrenheit temperature based on the input $x$ (in Celsius). The hat shows that this value comes from our model’s estimation, not directly from observed data.
You’ll often see this notation in statistics and machine learning to distinguish predicted outputs $\hat{y}$ and estimated coefficients $(\hat{w}, \hat{b})$ from true or observed values.
Congratulations, you’ve just made your first statistical model.
1.2 The Limits of Simple Estimation
But there are a few problems with this simple approach to estimating linear
equations from a given dataset. Let’s look at what might happen if you actually
went around campus measuring temperature.
The real data that was collected by the AI class in 2025 looks like this:
This data is much more realistic, and it highlights why our simplistic approach doesn’t scale: thermometers aren’t perfect. Maybe one was a little old, maybe you were standing in direct sunlight for one measurement, or maybe a gust of wind hit one of them.
That’s exactly why we use methods like linear regression and error measures.
Instead of trusting a single pair of points, regression finds the line that best fits all
our noisy data, balancing those little errors out. The goal isn’t to make every point
perfect (which you can’t do anyway), it’s to minimize the total amount of error
across the whole dataset.
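For a preview of where we’re headed, this is roughly what the Sklearn version looks like, sketched with made-up noisy data (we won’t lean on Sklearn until we understand the algorithm underneath it):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
celsius = rng.uniform(0, 30, size=50).reshape(-1, 1)   # sklearn expects a 2-D feature array
fahrenheit = 1.8 * celsius.ravel() + 32 + rng.normal(scale=2.0, size=50)

model = LinearRegression()
model.fit(celsius, fahrenheit)
print(model.coef_, model.intercept_)   # should come out near 1.8 and 32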
1.3 Linear Regression Formally Defined
Definition 1.3.1.
To restate our problem: we have a given dataset composed of an input $X$ for which we have an output $Y$, and our job is to develop an equation that encapsulates the relationship between $X$ and $Y$ in some equation

$$\hat{y} = Wx + b$$

where $W$ and $b$ are accurate estimates of the real values (in this case 1.8 and 32, respectively).
Let’s say that this is our data; it’s randomly generated noisy data that we are using as a proxy for a relatively large amount of real data.
1.4 Understanding the Algorithm
Let’s break down what sklearn does under the hood. The core method it uses to estimate $W$ and $b$ is actually quite simple: it starts by picking a random value for $W$ and $b$, checks how accurate those guesses are, and then keeps adjusting those numbers until our model becomes accurate.
But we need to slow down and somewhat rigorously lay out what each of those statements means.
The statement “picks a random value for $W$ and $b$” is intuitive, but the big questions are how it measures the accuracy of $W$ and $b$ and how it changes those values.
1.5 Measuring and Interpreting Error
Let’s start with the first question: how do we measure error?
Example 1.5.1.
Let’s say we have this data and some predicted values in $\hat{y}$:

x    y    ŷ
1    3    2.5
2    5    5.2
3    4    4.1
4    7    6.8
5    6    6.3
To measure how accurate $\hat{y}$ is relative to $y$, we introduce RSS.

Definition 1.5.1.
RSS stands for Residual Sum of Squares and is defined by the following equation:

$$\text{RSS} = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
Don’t be scared; this equation simply means that we take every single $y$ value in the given dataset, subtract our model’s estimated value from it, and square the result. Squaring is done for the following reasons:
1. Squaring makes all residuals positive, so large underpredictions and overpredictions both contribute to the total error.
2. It penalizes larger errors more heavily: a residual of 4 counts far more (16) than a residual of 2 (which counts as 4). This makes the regression more sensitive to large deviations.
3. Squaring makes the loss function smooth and differentiable, which makes our life a lot easier later on.
If this still doesn’t make sense we can use this graphic to gain an intuition:
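Alongside that intuition, a quick worked check using the table from Example 1.5.1 may help:

$$\text{RSS} = (3 - 2.5)^2 + (5 - 5.2)^2 + (4 - 4.1)^2 + (7 - 6.8)^2 + (6 - 6.3)^2 = 0.25 + 0.04 + 0.01 + 0.04 + 0.09 = 0.43$$

Dividing by the number of points (the MSE defined next) gives $0.43 / 5 = 0.086$.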
But RSS adds up the error over every point, so its size depends on how many points we have. We want one number that measures the average error overall. To do this we can define a function:
Definition 1.5.2.
The Mean Squared Error (MSE) is defined as:

$$L(W, b) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

or we can expand this to be:

$$L(W, b) = \frac{1}{N} \sum_{i=1}^{N} \big(y_i - (Wx_i + b)\big)^2$$

The $\frac{1}{N}$ here just averages the error over every point.
So great! We now have a really solid way to measure errors.
If we were to implement this in pure Python, we would do something like this:
# Assume X is a spreadsheet of our data and X[i] is the ith row
def mse(W, b, X, y):
    n = len(X)
    total_error = 0
    for i in range(n):
        prediction = W * X[i] + b
        total_error += (y[i] - prediction) ** 2
    return total_error / n
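For larger datasets the same computation is usually written with NumPy instead of a loop. This is a sketch assuming X and y are one-dimensional NumPy arrays of equal length:

import numpy as np

def mse_vectorized(W, b, X, y):
    predictions = W * X + b            # y-hat for every point at once
    return np.mean((y - predictions) ** 2)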
1.6 Optimization with Gradient Descent
So we have a way to measure error with MSE. But now we face a new problem: how do we actually find the best values for $W$ and $b$?
We can’t just guess randomly forever. Instead, we need a systematic way to improve our guesses. This is where derivatives come in.
For those of you who have taken a calculus class, you will be familiar with the concept of a derivative.
A derivative measures how much a function changes when you change its input
slightly. Think of it like the slope of a hill. If you’re standing on a hill and you want
to know which direction is steepest, the slope tells you. A positive slope means the
hill goes up in that direction, and a negative slope means it goes down.
In our case, we want to know: if I change $W$ slightly, does my error go up or down? The derivative of the loss function with respect to $W$ answers exactly that question. It tells us the slope of the error landscape.
If the derivative is positive, increasing $W$ increases error, so we should decrease $W$. If the derivative is negative, increasing $W$ decreases error, so we should increase $W$. By moving in the opposite direction of the derivative, we’re moving downhill toward lower error.
This process is called gradient descent, and we update our weights using this rule:

$$W \leftarrow W - \eta \frac{\partial L}{\partial W}$$

Here, $\eta$ is the learning rate, which controls how big each step is. Too small and learning is painfully slow. Too large and you might overshoot the best values entirely.
We do the same for $b$:

$$b \leftarrow b - \eta \frac{\partial L}{\partial b}$$

We repeat this process over and over, each time getting closer to the optimal $W$ and $b$ that minimize our error.
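To see why the learning rate matters, here is a small sketch on the simple loss $f(x) = x^2$ (whose derivative is $f'(x) = 2x$), comparing a reasonable step size with one that is too large:

def gradient_descent_1d(lr, start=10.0, steps=10):
    # Minimize f(x) = x^2 using the update x <- x - lr * f'(x)
    x = start
    for _ in range(steps):
        x -= lr * 2 * x
    return x

print(gradient_descent_1d(lr=0.1))   # steadily approaches the minimum at 0
print(gradient_descent_1d(lr=1.1))   # overshoots: |x| grows every step and diverges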
Since this class doesn’t require you to do the math, we’re going to give you the values of $\frac{\partial L}{\partial W}$ and $\frac{\partial L}{\partial b}$:

$$\frac{\partial L}{\partial W} = -\frac{2}{N} \sum_{i=1}^{N} \big(y_i - (Wx_i + b)\big)\, x_i$$

$$\frac{\partial L}{\partial b} = -\frac{2}{N} \sum_{i=1}^{N} \big(y_i - (Wx_i + b)\big)$$
In Python this code looks like this:
# Here `error` is assumed to be the array of (prediction - y) values,
# i.e. (W * x_scaled + b) - y, so that subtracting the gradient moves W and b downhill.
dW = (2.0 / N) * np.sum(error * x_scaled, dtype=np.float64)
W -= lr * dW
db = (2.0 / N) * np.sum(error, dtype=np.float64)
b -= lr * db
1.7 Understanding Gradients
If you’re still struggling with the concept of gradients, let’s break it down even more. Instead of a function with 2 inputs, let’s start with a function with 1 input, something like $f(x) = x^2$. We’ve all seen this function; when graphed it looks like this:
Let’s say we have some model:

$$\hat{y} = Wx$$

Now let’s say $f(x) = x^2$ represents the error for some weight (coefficient) $W$. So if $W = 2$ our model’s error is $4$, and so on.
We want to find the lowest error value, so we want the absolute minimum (the lowest $y$ value) of the function.
For this function there are various ways we could find it. Since we know that the lowest point of a parabola that opens upward is its vertex, you could write the function in vertex form and find it there. But let’s say our loss landscape looks something like this:
If we look at the graph we can find the lowest point, but that isn’t always possible. Gradient descent is a way of finding that lowest point. For functions with 1 input we have the derivative.
The derivative is the slope of the tangent line at a point. This literally means a straight line that indicates the direction of the function at that point. Visually, the derivative is the purple line below.
Figure 1: Example tangent line in purple for a loss function
This is the tangent line at $x = 0.4$. Notice that if we were to display the purple line as a linear equation in the form $y = mx + b$, then $m$, the slope of the line, would be negative. The actual value of $m$ at $x = 0.4$ is $-2.26528158057$.
That $m$ value for the purple line is the derivative. The derivative of the original function $f(x)$ is written $f'(x)$. So, for a function that represents our loss in terms of a single weight $W$, our gradient descent algorithm would simply look like this:

$$W \leftarrow W - \eta\, f'(W)$$
We now have a solid understanding of what derivatives are in the 2D case. This assumes that we are trying to model the equation $\hat{y} = Wx$. However, this approach isn’t extremely useful on its own: often we have multiple weights we want to estimate from our data, and that makes things more complicated. This is where the idea of gradients comes in.
Definition 1.7.1.
A gradient is a generalization of the derivative for functions with multiple inputs. If your function depends on several variables, like $f(x, y) = x^2 + y^2$, then the gradient is a vector that collects all the partial derivatives, one for each variable:

$$\nabla f = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right) = (2x, 2y)$$
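As a quick numerical check of this definition: at the point $(x, y) = (3, -1)$ the gradient of $f(x, y) = x^2 + y^2$ is $\nabla f = (2 \cdot 3,\ 2 \cdot (-1)) = (6, -2)$. The partial derivative with respect to $x$ is positive, so decreasing $x$ lowers $f$; the partial derivative with respect to $y$ is negative, so increasing $y$ lowers $f$. Stepping against the gradient therefore moves the point toward the minimum at $(0, 0)$.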
Each of the items inside $\nabla f$ is a partial derivative, that is, the derivative of the function $f$ with respect to one of its variables, which in this instance are $x$ and $y$. So in our gradient descent algorithm:

$$\frac{\partial L}{\partial W} = -\frac{2}{N} \sum_{i=1}^{N} \big(y_i - (Wx_i + b)\big)\, x_i$$

$$\frac{\partial L}{\partial b} = -\frac{2}{N} \sum_{i=1}^{N} \big(y_i - (Wx_i + b)\big)$$

we are simply finding the equivalent of $f'$ for the loss function $L(W, b)$ with respect to the variables $W$ and $b$.
Below is the function $L(W, b)$ graphed out for a model that we’ll be training in your next lab:
Figure 2: The loss landscape for our model: $\hat{y} = 1.85x + 31.47$
The red line represents the various weights that our model tried and its path to reach the optimal weights with gradient descent. Click on this link if you want to see a video of the model being trained and gradient descent working in real time.
1.8 Lab: Building Linear Regression from Scratch
In this lab you’re going to build a linear regression from scratch. It is a one-to-one application of the algorithm we’ve been building in the previous lessons. You will use just the numpy library and no other libraries. This is a good exercise to get you familiar with the code and the math.
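A minimal sketch of what such a from-scratch implementation might look like is shown below. Treat it as a reference for the shape of the solution rather than the lab answer key; the data, learning rate, and iteration count here are made-up illustrations.

import numpy as np

# Made-up noisy Celsius-to-Fahrenheit data, standing in for the lab's dataset
rng = np.random.default_rng(0)
x = rng.uniform(0, 30, size=100)
y = 1.8 * x + 32 + rng.normal(scale=2.0, size=100)

W, b = 0.0, 0.0     # initial guesses for the weight and bias
lr = 0.001          # learning rate (eta)
N = len(x)

for step in range(20000):
    predictions = W * x + b
    error = predictions - y                 # (y-hat minus y), matching the sign convention above
    dW = (2.0 / N) * np.sum(error * x)      # dL/dW
    db = (2.0 / N) * np.sum(error)          # dL/db
    W -= lr * dW
    b -= lr * db

print(W, b)   # should land near the true values 1.8 and 32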
1.9 Problem Packet
Theory Questions
Problem 1.
The text describes linear models as a “baseline.” Explain the importance
of establishing a baseline model before moving on to more complex machine
learning algorithms.
Problem 2.
In the equation $\hat{y} = 1.85x + 31.8$, explain what the “hat” notation ($\hat{\ }$) signifies and why it is crucial for distinguishing between types of data in statistics.
Problem 3.
The lesson provides three specific reasons for squaring residuals in the
RSS formula. List them and explain why making the loss function “smooth and
differentiable” is beneficial for optimization.
Problem 4.
What is the mathematical difference between Residual Sum of Squares (RSS) and Mean Squared Error (MSE)? Why is MSE generally preferred when working with datasets of varying sizes?
Problem 5.
Explain the role of the learning rate $\eta$. Based on the “hill” analogy, what physically happens to our “steps” if $\eta$ is too large versus too small?
Problem 6.
Define a gradient in the context of multiple variables ($W$ and $b$). How does a gradient differ from a standard 2D derivative?
Practice Problems
Problem 7.
You are given the coefficients $a = 1$, $b = 4$, and $c = 2$ for the function $f(x) = x^2 + 4x + 2$. Using the derivative $f'(x) = 2x + 4$, write a Python function to find the minimum of $f(x)$ using gradient descent. Start at $x = 10$, use $\eta = 0.1$, and run for 10 iterations.
Problem 8.
Calculate the RSS and MSE by hand for the following dataset given the model $\hat{y} = 2x + 1$:
• $x = [1, 2, 3]$
• $y = [3, 6, 7]$
Problem 9.
Given $x = [1, 2, 3]$ and $y = [2, 3, 4]$, and initial parameters $W = 0$ and $b = 0$, compute:
• The predicted values $\hat{y}$
• The residuals $(y_i - \hat{y}_i)$
• The current MSE
Problem 10.
Using the data and initial parameters from Problem 9, perform one full batch gradient descent update to find $W_{\text{new}}$ and $b_{\text{new}}$. Use $\eta = 0.1$ and the formulas:

$$\frac{\partial L}{\partial W} = -\frac{2}{N} \sum_{i=1}^{N} \big(y_i - (Wx_i + b)\big)\, x_i$$

$$\frac{\partial L}{\partial b} = -\frac{2}{N} \sum_{i=1}^{N} \big(y_i - (Wx_i + b)\big)$$

Note: Use the sign convention from the provided Python code where the gradient is subtracted.
Problem 11.
A thermometer model is trained to $\hat{y} = 1.85x + 31.8$. If the actual temperature is 0 °C and the observed Fahrenheit reading is 31.8, what is the residual? If the actual temperature is 30 °C and the observed reading is 87.5, what is the residual?
Problem 12.
Write a Python function get_error(y_true, y_pred) that returns the Mean Squared Error using only the standard library (no numpy). Assume both inputs are lists of equal length.
Appendix: Programming Reference
2 Python and Libraries
2.1 NumPy
Accessing NumPy
To use NumPy you must first import it. It is common practice to import NumPy
with the alias np so that subsequent function calls are concise.
import numpy as np
Arrays and Vectors
In NumPy an array is a generic term for a multidimensional set of numbers. One-
dimensional NumPy arrays act like vectors. The following code creates two one-
dimensional arrays and adds them elementwise. If you attempted the same with
plain Python lists you would not get elementwise addition.
x = np.array([3, 4, 5])
y = np.array([4, 9, 7])
print(x + y)
# array([ 7, 13, 12])
Matrices as Two-Dimensional Arrays
Matrices in NumPy are typically represented as two-dimensional arrays. The object
returned by np.array has attributes such as ndim for the number of dimensions,
dtype for the data type, and shape for the size of each axis.
x = np.array([[1, 2], [3, 4]])
print(x)        # array([[1, 2], [3, 4]])
print(x.ndim)   # 2
print(x.dtype)  # e.g. dtype('int64')
print(x.shape)  # (2, 2)
If any element passed into np.array is a floating point number, NumPy upcasts the
whole array to a floating point dtype.
print(np.array([[1, 2], [3.0, 4]]).dtype)        # dtype('float64')
print(np.array([[1, 2], [3, 4]], float).dtype)   # dtype('float64')
Methods and Functions
Methods are functions bound to objects. Calling x.sum() calls the sum method with
x as the implicit first argument. The module-level function np.sum(x) does the same
computation but is not bound to x.
x = np.array([1, 2, 3, 4])
print(x.sum())    # method on the array object
print(np.sum(x))  # module-level function
The reshape method returns a new view with the same data arranged into a new
shape. You pass a tuple that specifies the new dimensions.
x = np.array([1, 2, 3, 4, 5, 6])
print("beginning x:\n", x)
x_reshape = x.reshape((2, 3))
print("reshaped x:\n", x_reshape)
NumPy uses zero-based indexing. The first row and first column entry of x_reshape
is accessed with x_reshape[0, 0]. The entry in the second row and third column is
x_reshape[1, 2]. The third element of the original one-dimensional x is x[2].
print(x_reshape[0, 0])  # 1
print(x_reshape[1, 2])  # 6
print(x[2])             # third element of x
Views and Shared Memory
Reshaping often returns a view rather than a copy. Modifying a view will modify
the original array because they share the same memory. This behavior is important
when you expect independent copies.
print("x before modification:\n", x)
print("x_reshape before modification:\n", x_reshape)
x_reshape[0, 0] = 5
print("x_reshape after modification:\n", x_reshape)
print("x after modification:\n", x)
If you need an independent copy, call x.copy() explicitly.
x = np.array([1, 2, 3, 4, 5, 6])
x_copy = x.copy()
x_reshape_copy = x_copy.reshape((2, 3))
x_reshape_copy[0, 0] = 99
print("x remains unchanged:\n", x)
print("x_reshape_copy changed:\n", x_reshape_copy)
Tuples are immutable sequences in Python and will raise a TypeError if you try to
modify an element. This differs from NumPy arrays and Python lists.
my_tuple = (3, 4, 5)
# my_tuple[0] = 2  # would raise TypeError: 'tuple' object does not support item assignment
Transpose, ndim, and shape
You can request several attributes at once. The transpose T flips axes and is useful
for matrix algebra.
print(x_reshape.shape, x_reshape.ndim, x_reshape.T)
# For example: ((2, 3), 2, array([[5, 4], [2, 5], [3, 6]]))
Elementwise Operations
NumPy supports elementwise arithmetic and universal functions such as np.sqrt.
Raising an array to a power is elementwise.
print(np.sqrt(x))  # elementwise square root
print(x ** 2)      # elementwise square
print(x ** 0.5)    # alternative for square root
Random Numbers
NumPy provides random number generation. The signature for rng.normal is
normal(loc=0.0, scale=1.0, size=None). The arguments loc and scale are keyword
arguments for mean and standard deviation and size controls the shape of the
output.
x = np.random.normal(size=50)
print(x)  # random sample from N(0,1), different each run
To create a dependent array, add a random variable with a different mean to each
element.
y = x + np.random.normal(loc=50, scale=1, size=50)
print(np.corrcoef(x, y))  # correlation matrix between x and y
Reproducibility with the Generator API
To produce identical random numbers across runs, use np.random.default_rng with
an integer seed to create a Generator object and then call its methods. The
Generator API is the recommended approach for reproducibility.
rng = np.random.default_rng(1303)
print(rng.normal(scale=5, size=2))
rng2 = np.random.default_rng(1303)
print(rng2.normal(scale=5, size=2))
# Both prints produce the same arrays because the same seed was used.
When you use rng.standard_normal or rng.normal you are using the Generator
instance, which ensures reproducibility if you control the seed.
rng = np.random.default_rng(3)
y = rng.standard_normal(10)
print(np.mean(y), y.mean())
Mean, Variance, and Standard Deviation
NumPy provides np.mean, np.var, and np.std as module-level functions. Arrays also
have methods mean, var, and std. By default np.var divides by n. If you need the
sample variance that divides by n minus 1, provide ddof=1.
rng = np.random.default_rng(3)
y = rng.standard_normal(10)
print(np.var(y), y.var(), np.mean((y - y.mean()) ** 2))
print(np.sqrt(np.var(y)), np.std(y))
# Use np.var(y, ddof=1) for sample variance dividing by n-1.
Axis Arguments and Row/Column Operations
NumPy arrays are row-major ordered. The first axis, axis=0, refers to rows and the
second axis, axis=1, refers to columns. Passing axis into reduction methods lets you
compute means, sums, and other statistics along rows or columns.
rng = np.random.default_rng(3)
X = rng.standard_normal((10, 3))
print(X)               # 10 by 3 matrix
print(X.mean(axis=0))  # column means
print(X.mean(0))       # same as previous
When you compute X.mean(axis=1) you obtain a one-dimensional array of row
means. When you compute X.sum(axis=0) you obtain column sums.
Graphics with Matplotlib
Matplotlib is the standard plotting library. A plot consists of a figure and one or
more axes. The subplots function returns a tuple containing the figure and the axes.
The axes object has a plot method and other methods to customize titles, labels, and
markers.
from matplotlib.pyplot import subplots
fig, ax = subplots(figsize=(8, 8))
rng = np.random.default_rng(3)
x = rng.standard_normal(100)
y = rng.standard_normal(100)
ax.plot(x, y)       # default line plot
ax.plot(x, y, 'o')  # scatter-like circles
# To save: fig.savefig("scatter.png")
# To display in an interactive session: import matplotlib.pyplot as plt; plt.show()
Practical Notes
Using np.random.default_rng for all random generation in these examples makes
results reproducible across runs on the same NumPy version. As NumPy changes
across versions there may be minor differences in outputs for some operations.
When computing variance note the ddof argument if you expect sample variance
rather than population variance.
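For instance, the two conventions can be compared directly (a quick sketch; the exact numbers depend on the random draw):

import numpy as np

rng = np.random.default_rng(3)
y = rng.standard_normal(10)
print(np.var(y))           # population variance: divides by n
print(np.var(y, ddof=1))   # sample variance: divides by n - 1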
Appendix: Math Fundamentals
3 Calculus
3.1 Limits
3.2 Derivatives
Limit Definition of Derivative
3.3 Gradients
Vector Valued Functions
Gradient Definition
Partial Derivatives and Rules
4 Linear Algebra
5 Statistics and Probability
Reference
Glossary of Definitions
Definition 0.1 ........ p. 7
Definition 0.2 ........ p. 7
Definition 1.3.1 ........ p. 10
Definition 1.5.1 ........ p. 12
Definition 1.5.2 ........ p. 13
Definition 1.7.1 ........ p. 17