Introduction to Artificial Intelligence and Machine Learning

Aniketh Chenjeri, Andrew Doyle, Swarnim Ghimire, Mr. Igor Tomcej

2026-02-24

Contents

Foreword
How Is This Book Structured?

Part I: Supervised Learning

1 Linear Regression
  1.1 What is Linear Regression?
  1.2 The Limits of Simple Estimation
  1.3 Linear Regression Formally Defined
  1.4 Understanding the Algorithm
  1.5 Measuring and Interpreting Error
  1.6 Optimization with Gradient Descent
  1.7 Understanding Gradients
  1.8 1/16/26: Problem Packet
  1.9 Problem Packet
      Theory Questions
      Practice Problems
2 Pseudoinverse & Multiple Linear Regression
  2.1 Linear Algebra Primer
      Sets
      Common Number Sets
      Relationships Between Sets
      Vectors, Vector Addition, and Scalar Multiplication
      Matrices, Notation and Dimensionality
      Sum and Difference of Matrices
      Dot Product
      Matrix Multiplication
      Transpose
      Identity Matrix
      Determinant
      Inverse Matrix
  2.2 Multiple Linear Regression Formally
  2.3 The Normal Equation and Pseudoinverses
      Decomposition of the Matrix Form
  2.4 Gradient Descent for Multiple Variables
  2.5 Feature Scaling
      Start with a contrived failure case
      Training run walkthrough: good, bad, and ugly
      Why this happens geometrically
      Generalizing beyond this toy dataset
  2.6 Interpreting Weights After Scaling
  2.7 Linear Algebra Practice Problems
      Sets and Number Sets
      Vectors and Vector Operations
      Notation and Dimensionality
      Dot Product
      Matrix Multiplication
      Transpose
      Determinant
      Identity Matrix and Inverse
      Mixed Practice
  2.8 Problem Packet
      Data Representation and Design Matrix
      Matrix Operations (Reference)
      Normal Equation (Derivation Provided)
      MSE and Gradient (Derivation Provided)
      RSS and MSE Application
      Collinearity and Remedies

Appendix: Programming Reference

3 Python and Libraries
  3.1 NumPy
      Accessing NumPy
      Arrays and Vectors
      Matrices as Two-Dimensional Arrays
      Methods and Functions
      Views and Shared Memory
      Transpose, ndim, and shape
      Elementwise Operations
      Random Numbers
      Reproducibility with the Generator API
      Mean, Variance, and Standard Deviation
      Axis Arguments and Row/Column Operations
      Graphics with Matplotlib
      Practical Notes

Appendix: Math Fundamentals

4 Calculus
  4.1 Limits
  4.2 Derivatives
      Limit Definition of Derivative
  4.3 Gradients
      Vector Valued Functions
      Gradient Definition
      Partial Derivatives and Rules
5 Linear Algebra
6 Statistics and Probability

Reference
Glossary of Definitions

Foreword

The Introduction to Artificial Intelligence and Machine Learning class has had an interesting history. It started at Creek during the 2024-2025 school year with the intention of giving students a gentle on-ramp to the field. The course dodged heavy math and focused on using libraries such as TensorFlow, NumPy, Pandas, and Scikit-Learn.

Due to unforeseen circumstances, the 2025-2026 school year began without a solid foundation for the course. This prompted a comprehensive reflection and led us to rebuild the syllabus from scratch. While our original intention, a gentle introduction to machine learning, remained unchanged, everyone involved agreed we needed a more rigorous approach to the class alongside an emphasis on intuition. This task proved difficult because the subject is inherently interdisciplinary.
We needed to make the course more intuitive and less mathematical while still covering the field's fundamentals.

This textbook represents a collection of dedicated lessons, lecture notes, and exercises designed to serve as the course's foundation. At the time of writing, there are no mathematical prerequisites, yet mathematics permeates the fields of Artificial Intelligence and Machine Learning. We faced a choice: omit the mathematics and build a less rigorous course emphasizing breadth, or include it and slow our pace. We chose the latter.

This textbook is written for readers with no calculus background. Rather than requiring formal mathematical preparation, it assumes basic algebraic understanding and builds intuition about which mathematical operations are necessary. We will never ask you to compute complex formulas without teaching them to you; instead, we focus on understanding and developing an intuition for what the math represents. This is a practical course, not a theoretical one. It will give you a solid foundation to build upon if you decide to pursue a career in the field. However, this book is by no means a replacement for the mathematics you'll need to learn; we instead hope to give you a starting point on which you can build.

We gratefully acknowledge the following contributors; this book would not have been possible without their efforts:

Primary Writers:
Aniketh Chenjeri (CCHS '26)
Andrew Doyle (CCHS '26)
Mr. Igor Tomcej

Reviewers:
Hariprasad Gridharan (CCHS '25, Cornell '29)
Siddharth Menon (CCHS '26)
Ani Gadepalli (CCHS '26)

How Is This Book Structured?

NOTE: This text is a living document, currently undergoing active development as part of our commitment to pedagogical excellence. To ensure rigorous academic standards, chapters are released sequentially following comprehensive peer review.
This is to say that the version of the text you are viewing right now is not the final one; expect updates to various parts of the book as we continue to refine and improve the content. While every effort is made to provide an accurate and authoritative resource, please note that early editions may contain errata. We encourage students to actively engage with the material; should you identify any discrepancies or technical inaccuracies, please report them to your teacher or teacher's assistant for correction in future revisions. We appreciate your cooperation in refining this resource for future cohorts. Your feedback is instrumental in ensuring the highest quality of instruction and material.

This book is written to emphasize an understanding of fundamental concepts in AI/ML. We will begin with supervised learning and its applications. We are starting with supervised learning because it's one of the most common techniques used by practitioners. It's also the most intuitive and easiest to understand. In this section we'll also slowly introduce mathematical concepts and give you exercises to solidify your understanding. We will never require you to do any calculus; however, it becomes impossible to understand many algorithms without it. As such, we will always "solve" any calculus involved and ask you to interpret and apply it. This doesn't mean this course will omit math entirely; we will learn a lot of applied linear algebra, as it is the core of how machine learning works.

Under supervised learning you will also learn important concepts for assessing the accuracy of a model, along with some insight into which architectures are used for which scenarios. To truly accomplish this we need to understand many statistics concepts; as a result, this book will cover many topics in statistics and probability.
Some of these concepts will be familiar to you if you have taken an AP Statistics class or equivalent.

We will then move on to unsupervised learning, where we'll make a brief stop with the k-means clustering algorithm and then move to neural networks. Understanding neural networks is the ultimate goal of this class, as they are a ubiquitous and powerful tool. If you gain an understanding of neural networks, you will be able to understand many complex algorithms such as Large Language Models, which are the foundation for tools like ChatGPT, Google Gemini, Anthropic's Claude, and more.

Towards the end of this book we've compiled a series of documentation-like chapters for various libraries, frameworks, and mathematical concepts. If you find yourself not understanding certain concepts, tools, etc., you can always refer to these documents.

The only assumption we make in the writing of this book is some familiarity with Python (and programming in general) and Algebra 2. Even though we will cover theory in this class, it will be a programming class first and foremost. You will write a lot of code but will also be asked to understand theory and math.

In the compilation of this book we've pulled from various resources:

1. Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
2. The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
3. Various books by Justin Skycak
4. Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong
5. The Matrix Cookbook by Kaare Brandt Petersen and Michael Syskind Pedersen
6. The Deep Learning Book by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
Part I: Supervised Learning

In supervised learning, we are provided with some input 𝑋 and a corresponding response 𝑌. Our fundamental objective is to estimate the relationship between these variables, typically modeled as:

Y = f(X) + \varepsilon

where 𝑓 is an unknown fixed function of the predictors and 𝜀 is a random error term, independent of 𝑋, with a mean of zero. Using a training dataset 𝒯 = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, we aim to apply a learning method to estimate the function f̂. This estimate allows us to predict the response for previously unseen observations. It is also common to express this relationship as ŷ = f̂(X).

Definition 0.1. A feature is an individual, measurable property or characteristic of a phenomenon being observed. In machine learning, it serves as the input variable 𝑥 that a model uses to identify patterns and make predictions. For example, if we want to predict the price of a house based on its size, number of bedrooms, and location, the features are size, number of bedrooms, and location.

Definition 0.2. A label is the output variable 𝑦 that a model is trained to predict.
For example, if we want to predict the price of a house based on its size, number of bedrooms, and location, the label is the price of the house.

In the supervised learning portion of this textbook, we will learn the following:

Linear Regression via the Pseudoinverse: We will learn the closed-form analytical solution for parameter estimation using the Moore-Penrose pseudoinverse and the Normal Equation. This includes multiple linear regression, where we will learn how to handle multiple predictors simultaneously.

Optimization via Gradient Descent and Stochastic Gradient Descent (SGD): We will learn the notation for, and how to implement, Gradient Descent and Stochastic Gradient Descent for optimization.

The Bias-Variance Tradeoff: Understanding the decomposition of prediction error into reducible and irreducible components, and navigating the relationship between overfitting and underfitting.

Polynomial Regression: Extending the linear framework to capture non-linear relationships by mapping predictors into higher-dimensional feature spaces.

Shrinkage Methods (Ridge and Lasso): Applying L1 and L2 regularization to minimize the Residual Sum of Squares (RSS) while controlling model complexity and performing variable selection.

Logistic Regression: Transitioning to classification by modeling the conditional probability of qualitative responses using the logit transformation.

k-Nearest Neighbors (k-NN): Utilizing a non-parametric, memory-based approach to prediction based on local density and spatial proximity in the feature space.

Don't worry if none of that makes sense to you; we'll be covering it in detail in the coming chapters.

1 Linear Regression

1.1 What is Linear Regression?

To recall, supervised learning starts with known input data 𝑋 and known output data 𝑌, and we are asked to fit an equation to this dataset. One of the simplest approaches is to assume a linear relationship between inputs and outputs.
This assumption is usually wrong and somewhat contrived, but linear models serve an important purpose: they give you a baseline. Once you have a baseline model, you can test more complex models against it and see if they actually perform better.

Traditionally, linear regression requires heavy use of linear algebra. We're not going to get too into the weeds in this course, since it doesn't have a math prerequisite. Instead, we'll use scikit-learn (sklearn), a popular machine learning library. But before we jump into sklearn code, we need to build intuition about how its algorithms actually work. Once you understand the underlying process, using the library itself becomes straightforward. This means we'll need to do some math along the way.

Example 1.1.1. Let's engage in a hypothetical. Suppose you're given two thermometers, one reading in Celsius and one in Fahrenheit, and asked to record the temperature on both scales. Let's say our results look like this:

Celsius (x) | Fahrenheit (y)
0           | 31.8
5           | 41.9
10          | 49.2
15          | 60.1
20          | 67.4
25          | 78.9
30          | 87.5

Let's plot the data.

Now, we know that the equation for converting Celsius to Fahrenheit is

y = 1.8x + 32

But assume you don't know this equation and are asked to find it purely based on the data. After sitting with the problem for a while, you'll probably realize that you can use the slope formula

m = \frac{y_2 - y_1}{x_2 - x_1}

to estimate the coefficient in the equation, and we've been given a y-intercept at (0, 31.8). So from our given data, after computing, we can say our equation is

ŷ = 1.85x + 31.8

Note: In this equation, the hat symbol (^) indicates a predicted value or an estimated parameter, not a measured or input variable. So in our model ŷ = 1.85x + 31.8, ŷ represents the predicted Fahrenheit temperature based on the input 𝑥 (in Celsius). The hat shows that this value comes from our model's estimation, not directly from observed data. You'll often see this notation in statistics and machine learning to distinguish predicted outputs ŷ and estimated coefficients (ŵ, b̂) from true or observed values.

Congratulations!
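The two-point estimate above can be reproduced in a few lines of Python. This is a sketch: the data comes from the table in Example 1.1.1, and the choice of which two points to use for the slope is ours, not the text's.

```python
# Thermometer readings from Example 1.1.1.
celsius = [0, 5, 10, 15, 20, 25, 30]
fahrenheit = [31.8, 41.9, 49.2, 60.1, 67.4, 78.9, 87.5]

# Slope estimate m = (y2 - y1) / (x2 - x1), here using the first and last points.
m = (fahrenheit[-1] - fahrenheit[0]) / (celsius[-1] - celsius[0])

# The y-intercept is read directly from the data point (0, 31.8).
b = fahrenheit[0]

def predict(x_celsius):
    """Predicted Fahrenheit value (y-hat) for a Celsius input."""
    return m * x_celsius + b
```

Different point pairs give slightly different estimates (this pair gives m ≈ 1.86 rather than the 1.85 worked out above), which is exactly the fragility the next section discusses.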
You've just made your first statistical model.

1.2 The Limits of Simple Estimation

But there are a few problems with this simple approach to estimating linear equations from a given dataset. Let's look at what might happen if you actually went around campus measuring temperature. The real data that was collected by the AI class in 2025 looks like this:

This data is much more realistic, and it highlights why our simplistic approach doesn't scale. Thermometers aren't perfect. Maybe one was a little old, maybe you were standing in direct sunlight for one measurement, or maybe a gust of wind hit one of them. That's exactly why we use methods like linear regression and error measures. Instead of trusting a single pair of points, regression finds the line that best fits all our noisy data, balancing those little errors out. The goal isn't to make every point perfect (which you can't do anyway); it's to minimize the total amount of error across the whole dataset.

1.3 Linear Regression Formally Defined

Definition 1.3.1. To restate our problem: we have a given dataset composed of an input 𝑋 for which we have an output 𝑌, and our job is to develop an equation that encapsulates the relationship between 𝑋 and 𝑌 in some equation ŷ = wx + b, where 𝑤 and 𝑏 are accurate estimates of the real values (in this case 1.8 and 32, respectively).

Let's say that this is our data. It's randomly generated noisy data, and we are using it as a proxy for a relatively large amount of real data.

1.4 Understanding the Algorithm

Let's break down what sklearn does under the hood. The core method it uses to estimate 𝑤 and 𝑏 is actually quite simple.
It starts by picking random values for 𝑤 and 𝑏, checks how accurate those values are, and then keeps adjusting them until the model becomes accurate. But we need to slow down and somewhat rigorously lay out what each of those statements means. The statement "picks random values for 𝑤 and 𝑏" is intuitive, but the big question is how it measures the accuracy of 𝑤 and 𝑏 and how it changes those values.

1.5 Measuring and Interpreting Error

Let's start with the first question: how do we measure error?

Example 1.5.1. Let's say we have this data and some predicted values ŷ:

x | y | ŷ
1 | 3 | 2.5
2 | 5 | 5.2
3 | 4 | 4.1
4 | 7 | 6.8
5 | 6 | 6.3

To measure how close ŷ is to 𝑦, we introduce the RSS metric.

Definition 1.5.1. RSS stands for Residual Sum of Squares and is defined by the following equation:

\mathrm{RSS} = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2

Don't be scared: this equation simply means that for every y value in the given dataset, we subtract the model's estimated value and square the difference. Squaring the residuals is done for the following reasons:

1. Squaring makes all residuals positive, so large underpredictions and overpredictions both contribute to the total error.
2. It penalizes larger errors more heavily: a residual of 4 counts far more (16) than a residual of 2 (which counts as 4). This makes the regression more sensitive to large deviations.
3. Squaring makes the loss function smooth and differentiable, which makes our life a lot easier later on.

If this still doesn't make sense, we can use this graphic to gain an intuition.

But RSS is a total that grows as we add more points. We want one number that measures error regardless of dataset size. To do this we can define a function:

Definition 1.5.2. The Mean Squared Error (MSE) is defined as:

L(w, b) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2

or we can expand this to be:

L(w, b) = \frac{1}{N} \sum_{i=1}^{N} \bigl(y_i - (w x_i + b)\bigr)^2

The 1/N here just averages the error across all points. So great!
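Before writing a general function, we can sanity-check these definitions on the table from Example 1.5.1 (a quick sketch; the numbers are copied from the table):

```python
# Observed values y and model predictions y-hat from Example 1.5.1.
y     = [3, 5, 4, 7, 6]
y_hat = [2.5, 5.2, 4.1, 6.8, 6.3]

# RSS: sum of squared residuals (y_i - yhat_i)^2.
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

# MSE: RSS divided by the number of points.
mse = rss / len(y)

print(rss, mse)  # RSS = 0.43, MSE = 0.086 (up to floating-point rounding)
```

Note how the first point, with the largest residual (0.5), contributes 0.25 of the 0.43 total: squaring amplifies the biggest errors, exactly as reason 2 above says.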
We now have a really solid way to measure error. If we were to implement this in pure Python, we would do something like this:

# Assume X is a list of input values, X[i] is the ith input,
# and y[i] is the corresponding observed output.
def mse(W, b, X, y):
    n = len(X)
    total_error = 0
    for i in range(n):
        prediction = W * X[i] + b
        total_error += (y[i] - prediction) ** 2
    return total_error / n

1.6 Optimization with Gradient Descent

So we have a way to measure error with MSE. But now we face a new problem: how do we actually find the best values for 𝑤 and 𝑏? We can't just guess randomly forever. Instead, we need a systematic way to improve our guesses. This is where derivatives come in. If you have taken a calculus class, you will be familiar with the concept of a derivative.

A derivative measures how much a function changes when you change its input slightly. Think of it like the slope of a hill. If you're standing on a hill and you want to know which direction is steepest, the slope tells you. A positive slope means the hill goes up in that direction, and a negative slope means it goes down.

In our case, we want to know: if I change 𝑤 slightly, does my error go up or down? The derivative of the loss function with respect to 𝑤 answers exactly that question. It tells us the slope of the error landscape. If the derivative is positive, increasing 𝑤 increases error, so we should decrease 𝑤. If the derivative is negative, increasing 𝑤 decreases error, so we should increase 𝑤. By moving in the opposite direction of the derivative, we're moving downhill toward lower error.

This process is called gradient descent, and we update our weights using this rule:

w \leftarrow w - \eta \frac{\partial L}{\partial w}

Here, 𝜂 is the learning rate, which controls how big each step is. Too small and learning is painfully slow.
Too large and you might overshoot the best values entirely. We do the same for 𝑏:

b \leftarrow b - \eta \frac{\partial L}{\partial b}

We repeat this process over and over, each time getting closer to the optimal 𝑤 and 𝑏 that minimize our error. Since this class doesn't require you to do the math, we're going to give you the values of ∂L/∂w and ∂L/∂b:

\frac{\partial L}{\partial w} = -\frac{2}{N} \sum_{i=1}^{N} \bigl(y_i - (w x_i + b)\bigr) x_i

\frac{\partial L}{\partial b} = -\frac{2}{N} \sum_{i=1}^{N} \bigl(y_i - (w x_i + b)\bigr)

In Python this code looks like this:

# error is the vector of residuals, y - (w * x_scaled + b)
dw = (-2.0 / N) * np.sum(error * x_scaled, dtype=np.float64)
w -= lr * dw
db = (-2.0 / N) * np.sum(error, dtype=np.float64)
b -= lr * db

1.7 Understanding Gradients

If you're still struggling with the concept of gradients, let's break it down even more. Instead of a function with two inputs, let's start with a function with one input, something like f(x) = x². We've all seen this function; when graphed, it is a parabola.

Let's say we have some model ŷ = wx, and let's say f(x) = x² represents the error for some weight (coefficient) 𝑤. So if w = 2, our model's error is 4, and so on. We want to find the lowest error value, so we want the absolute minimum (the lowest 𝑦 value) of the function.

For this function there are various ways we could find it. Since the parabola opens upward, its lowest point is the vertex, so you can write the function in vertex form to find the minimum. However, let's consider a more complex loss landscape.

If we look at the graph we can find the lowest point by eye, but that isn't always possible. Gradient descent is a way of finding that lowest point. For functions with one input we have the derivative. The derivative is the slope of the tangent line at a point: a straight line that indicates the direction of the function at that point. Visually, the derivative is the purple line below.

Figure 1: Example tangent line in purple for a loss function

This is the tangent line at x = 0.4, and notice that if we were to display the purple line as a linear equation in the form y = mx + b, then 𝑚, the slope of the line, would be negative.
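This sign logic — negative slope means step right, positive slope means step left — can be tried end to end on the one-input function f(x) = x², whose derivative is f′(x) = 2x. The starting point, learning rate, and step count below are our own choices, not values from the text:

```python
def gradient_descent_1d(x_start=5.0, lr=0.1, steps=50):
    """Minimize f(x) = x^2 by repeatedly stepping against the derivative."""
    x = x_start
    for _ in range(steps):
        slope = 2 * x        # f'(x) = 2x, the tangent slope at the current x
        x = x - lr * slope   # the update rule: x <- x - eta * f'(x)
    return x

print(gradient_descent_1d())  # approaches 0, the true minimum of x^2
```

With these numbers, each update multiplies x by (1 − 2·0.1) = 0.8, so the iterates slide smoothly into the valley; a learning rate above 1.0 would instead make them overshoot and diverge.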
The actual value of 𝑚 at x = 0.4 is −2.26528158057. That 𝑚 value for the purple line is the derivative at that point. For a one-input function f(x), the derivative is written f′(x). So when the loss depends on a single weight 𝑤, our gradient descent algorithm would simply look like this:

w \leftarrow w - \eta f'(w)

We now have a solid understanding of what derivatives of one-input functions are. This assumes that we are trying to model the equation ŷ = wx. However, this approach isn't extremely useful on its own: often we have multiple weights we want to estimate from our data, which makes things more complex. This is where the idea of gradients comes from.

Definition 1.7.1. A gradient is a generalization of the derivative for functions with multiple inputs. If your function depends on several variables, like f(x, y) = x² + y², then the gradient is a vector that collects all the partial derivatives — one for each variable:

\nabla f = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right) = (2x, 2y)

Each item inside ∇f is a partial derivative: the derivative of the function 𝑓 with respect to one of its variables, which in this instance are 𝑥 and 𝑦. So in our gradient descent algorithm:

\frac{\partial L}{\partial w} = -\frac{2}{N} \sum_{i=1}^{N} \bigl(y_i - (w x_i + b)\bigr) x_i

\frac{\partial L}{\partial b} = -\frac{2}{N} \sum_{i=1}^{N} \bigl(y_i - (w x_i + b)\bigr)

we are simply finding the equivalent of f′ for the loss function L(w, b) with respect to the variables 𝑤 and 𝑏.

Below is the function L(w, b) graphed for a model you'll be training in your next lab:

Figure 2: The loss landscape for our model: ŷ = 1.85x + 31.47

The red line represents the various weights our model tried and its path to reach the optimal weights with gradient descent: click on this link if you want to see a video of the model being trained and gradient descent working in real time.

1.8 1/16/26: Problem Packet

Problem 1. You're modeling Fahrenheit from Celsius.
You try three baselines:

- Constant model: ŷ = c
- Linear model: ŷ = Wx + b
- Piecewise linear with one breakpoint at x = 10: ŷ = W₁x + b₁ for x ≤ 10, and ŷ = W₂x + b₂ for x > 10

Explain, without using any code, which baseline is "stronger" and why a stronger baseline can sometimes make your project harder (but better science). Your answer must include what "strong baseline" means operationally, how it impacts comparisons to later models, and at least two failure modes if your baseline is too weak.

Problem 2. In the text, ŷ is the predicted output, and Ŵ, b̂ are estimated parameters. Explain the difference between the observed 𝑦, the predicted ŷ, the "true" but unknown relationship (call it y*), the estimated parameters (Ŵ, b̂), and measurement noise. Then answer: if you re-collect the dataset tomorrow with the same thermometers, which of these are expected to change and why?

Problem 3. The chapter lists reasons to square residuals; explain why these reasons are important. Then argue for one scenario where squaring residuals is a bad idea, and name a better alternative loss. (You can use the internet to find the answer to this question.)

Problem 4. You train the same model form ŷ = Wx + b on two datasets A and B. Dataset A has N = 20 points, RSS = 120. Dataset B has N = 200 points, RSS = 950.
1. Compute MSE for both.
2. Explain why the RSS value can mislead you across dataset sizes.
3. Give one situation where RSS is still useful or preferred (be specific).

Problem 5. Suppose your fitted line gives small MSE, but when you plot the residuals rᵢ = yᵢ − ŷᵢ versus xᵢ, you see a clear U-shape. Explain what this implies about: the linearity assumption, whether the bias term b̂ is "wrong", what kind of model change would address it (at least two options), and why MSE alone didn't warn you.

Problem 6. Given: x = [−5, 0, 5, 10], y = [20.0, 31.8, 40.0, 55.0], model: ŷ = 1.6x + 31.8. Compute: ŷ for each 𝑥, the residuals rᵢ = yᵢ − ŷᵢ, RSS, and MSE. Identify which point contributes most to RSS and explain why.

Problem 7. Dataset: x = [0, 5, 10, 15, 20], y = [32.0, 41.0, 50.5, 60.0, 68.0]. Two candidate models: A: ŷ = 1.8x + 32, B: ŷ = 1.9x + 31. Compute RSS for both and decide which is better under RSS/MSE. Then answer: which model is more plausible physically, and can plausibility disagree with MSE here?

1.9 Problem Packet

Theory Questions

Problem 1. The text describes linear models as a "baseline." Explain the importance of establishing a baseline model before moving on to more complex machine learning algorithms.

Problem 2. In the equation ŷ = 1.85x + 31.8, explain what the "hat" notation signifies and why it is crucial for distinguishing between types of data in statistics.

Problem 3. The lesson provides three specific reasons for squaring residuals in the RSS formula. List them and explain why making the loss function "smooth and differentiable" is beneficial for optimization.

Problem 4. What is the mathematical difference between Residual Sum of Squares (RSS) and Mean Squared Error (MSE)? Why is MSE generally preferred when working with datasets of varying sizes?

Practice Problems

Problem 5. You are given the coefficients a = 1, b = 4, and c = 2 for the function f(x) = x² + 4x + 2. Using the derivative f′(x) = 2x + 4, write a Python function to find the minimum of f(x) using gradient descent. Start at x = 10, use η = 0.1, and run for 10 iterations.

Problem 6. Calculate the RSS and MSE by hand for the following dataset, given the model ŷ = 2x + 1: x = [1, 2, 3], y = [3, 6, 7].

Problem 7. Given x = [1, 2, 3] and y = [2, 3, 4], and initial parameters W = 0 and b = 0, compute: the predicted values ŷ, the residuals (yᵢ − ŷᵢ), and the current MSE.

Problem 8. Using the data and initial parameters from Problem 7, perform one full batch gradient descent update to find W_new and b_new.
Use $\eta = 0.1$ and the formulas:

$$\frac{\partial L}{\partial W} = -\frac{2}{N}\sum_{i=1}^{N}\bigl(y_i - (W x_i + b)\bigr)x_i \qquad \frac{\partial L}{\partial b} = -\frac{2}{N}\sum_{i=1}^{N}\bigl(y_i - (W x_i + b)\bigr)$$

Note: Use the sign convention from the provided Python code where the gradient is subtracted.

Problem 9. A thermometer model is trained to $\hat{y} = 1.85x + 31.8$. If the actual temperature is $0\,°C$ and the observed Fahrenheit reading is 31.8, what is the residual? If the actual temperature is $30\,°C$ and the observed reading is 87.5, what is the residual?

Problem 10. Write a Python function get_error(y_true, y_pred) that returns the Mean Squared Error using only the standard library (no numpy). Assume both inputs are lists of equal length.

Go to: https://github.com/CreekCS/ai-ml-textbook-labs/blob/main/intro-to-assignments.ipynb for the remaining problems.

2 Pseudoinverse & Multiple Linear Regression

So far we've assumed a very simplistic relationship between $X$ and $Y$. But what if we have more than one predictor? For example, what if we wanted to predict the price of a house based on its size, number of bedrooms, and location? To answer this question we need to introduce some linear algebra.

2.1 Linear Algebra Primer

NOTE: This section is very heavy on mathematical notation and concepts. We recommend you consult the Essence of Linear Algebra video series by 3Blue1Brown for a more visual and intuitive understanding of the concepts. Many of the concepts covered in those videos are beyond the scope of this class; however, many videos in the series can be used to supplement your understanding of the concepts covered in this section.

Sets

As you have learned in the Data Structures class, a set is an abstract data structure designed for the efficient storage and retrieval of unique elements, often prioritizing computational performance. This mirrors the mathematical definition, where a set is a distinct collection of objects: both systems prohibit duplicate members and lack a fundamental requirement for ordering.

Common Number Sets

In mathematics, we categorize data into specific sets based on their properties.
This hierarchy allows us to define exactly what kind of "values" a variable is allowed to hold:

Natural Numbers $\mathbb{N}$: The set of positive counting numbers $\{1, 2, 3, \ldots\}$. These are used for indices or counting items where you cannot have zero or negatives.

Integers $\mathbb{Z}$: Whole numbers including zero and negatives $\{\ldots, -2, -1, 0, 1, 2, \ldots\}$. These represent discrete quantities, like the number of bedrooms in a house.

Rational Numbers $\mathbb{Q}$: Numbers that can be expressed as a fraction $\frac{p}{q}$ where $p, q \in \mathbb{Z}$ (and $q \neq 0$). This includes terminating decimals like 0.75.

Real Numbers $\mathbb{R}$: The set of all possible points on a continuous number line. This includes everything in $\mathbb{Q}$ plus "irrational" numbers like $\pi$ or $\sqrt{2}$. We use these for measurements requiring high precision, like square footage or temperature.

Complex Numbers $\mathbb{C}$: Numbers that include an imaginary unit $i$. While less common in basic data sets, they are vital for signal processing and advanced physics simulations.

Relationships Between Sets

It is helpful to visualize these sets as nested boxes. Every Natural number is also an Integer; every Integer is also a Rational number; and every Rational number is also a Real number. We use the symbol $\subset$ ("subset") to represent this relationship:

$$\mathbb{N} \subset \mathbb{Z} \subset \mathbb{Q} \subset \mathbb{R} \subset \mathbb{C}$$

Vectors, Vector Addition, and Scalar Multiplication

Let's say you're trying to predict the price of a house. To describe its value, we might look at:

the total square footage of the living area;
the number of bedrooms available;
the age of the property in years.

The characteristics of the house can now be written as those 3 numbers: (square footage, number of bedrooms, age). If the attributes are $(2500, 4, 10)$, it means the house has 2500 square feet, 4 bedrooms, and is 10 years old. If the house is instead a brand new construction (and keeps the same size and room count), then the attributes are $(2500, 4, 0)$.

What we have just described is a vector that shows the features of the property. This is an example of a "3-vector". We can write vectors in a "top to bottom" format.
So the above 3-vector is written as:

$$\begin{pmatrix} 2500 \\ 4 \\ 10 \end{pmatrix}$$

To distinguish ordinary numbers from vectors, we will use the word scalar to describe a single number. For example, the number 200 is a scalar, but the vector $(200, 300, 25)$ is not.

In algebra so far, we've used symbols like $x, y, z, a, b, \ldots$ to denote scalars. In many linear algebra texts, boldface symbols are used to denote vectors ($\mathbf{x}, \mathbf{y}, \mathbf{z}, \mathbf{a}, \mathbf{b}, \ldots$).

Definition 2.1.1. For a whole number $n$, an n-vector is a list of $n$ real numbers. We denote by $\mathbb{R}^n$ the collection of all possible n-vectors.

For this class we can also think of vectors as arrows in a space; the number $n$ here represents the number of dimensions of the space. We can visualize this for $n = 2$ or $n = 3$ as arrows. For example with $n = 2$ we can visualize the vector $\mathbf{v} = \begin{pmatrix} v_1 \\ v_2 \end{pmatrix}$ as an arrow in the coordinate plane $\mathbb{R}^2$ pointing from the origin $(0, 0)$ to the point $(v_1, v_2)$, and similarly for a 3-vector by working in $\mathbb{R}^3$. Examples are shown below.

In the physical sciences, vectors are used to represent quantities that have both a magnitude and a direction (e.g. displacement, velocity, force). From the perspective of machine learning, however, vectors are used to keep track of collections of numerical data. This means we will nearly always have a very large $n$.

Example 2.1.1. Suppose there are 100 students in the AI class. We can keep track of all their grades on the first test by using a 100-vector

$$\mathbf{E} = \begin{pmatrix} E_1 \\ E_2 \\ \vdots \\ E_{100} \end{pmatrix}$$

Here $E_1$ is the first exam grade of the first student, $E_2$ the first exam grade of the second student, and so on.

Now that we have an understanding of vectors, we can learn how to conduct algebraic operations on vectors. Primarily we will learn vector addition and scalar multiplication.

Definition 2.1.2. The sum $\mathbf{v} + \mathbf{w}$ of two vectors is defined only when $\mathbf{v}$ and $\mathbf{w}$ are both $n$-vectors.
In that case, we define their sum by the rule

$$\begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix} + \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{pmatrix} = \begin{pmatrix} v_1 + w_1 \\ v_2 + w_2 \\ \vdots \\ v_n + w_n \end{pmatrix}$$

Definition 2.1.3. We can multiply some scalar $c$ against an $n$-vector $\mathbf{v} = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix}$ by the rule

$$c \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix} = \begin{pmatrix} c v_1 \\ c v_2 \\ \vdots \\ c v_n \end{pmatrix}$$

Example 2.1.2. Practice problem: Let $\mathbf{v} = \begin{pmatrix} 2 \\ -1 \\ 3 \end{pmatrix}$ and $\mathbf{w} = \begin{pmatrix} 5 \\ 4 \\ -2 \end{pmatrix}$. Compute $\mathbf{v} + \mathbf{w}$ and $2\mathbf{v}$.

$$\mathbf{v} + \mathbf{w} = \begin{pmatrix} 2 + 5 \\ -1 + 4 \\ 3 + (-2) \end{pmatrix} = \begin{pmatrix} 7 \\ 3 \\ 1 \end{pmatrix}, \qquad 2\mathbf{v} = \begin{pmatrix} 4 \\ -2 \\ 6 \end{pmatrix}$$

Matrices, Notation and Dimensionality

The concept of a matrix is something that you should have briefly touched upon in Algebra 2. As such, we're assuming some basic knowledge of matrices here. To recall:

Definition 2.1.4. A matrix is a rectangular array of numbers or other mathematical objects with elements or entries arranged in rows and columns. A matrix with $p$ rows and $d$ columns is called a $p \times d$ matrix.

The symbol $\in$ ("is an element of") is the bridge between a single value and its set. However, in Linear Algebra, we use this notation to define the shape and domain of entire matrices. When we write $X \in \mathbb{R}^{p \times d}$, we are not just saying $X$ is a real number. We are using the set $\mathbb{R}$ as a building block to describe a high-dimensional space:

1. The $\mathbb{R}$ tells us that every single entry inside the matrix is a real number.
2. The exponent ($p \times d$) defines the "container" size: $p$ rows and $d$ columns.

If we expand the matrix $X \in \mathbb{R}^{p \times d}$ it would look something like this:

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{p1} & x_{p2} & \cdots & x_{pd} \end{pmatrix}$$

Notice that we have a way to identify the elements of the matrix based on their position in the row and column. The first element would be $x_{11}$. The first 1 represents the row and the second 1 represents the column. So $x_{11}$ is the element in the first row and first column, and so on.

For example, if you have a spreadsheet of 100 houses ($p = 100$) and each house has 5 features ($d = 5$), your data matrix $X$ exists in the space $\mathbb{R}^{100 \times 5}$. This tells any reader immediately that your data consists of 500 unique real-valued measurements organized in a specific grid.

Recall, features are distinct measurements that describe a single observation.
For example, the feature "size" might be the height of a house, and "bedrooms" the count of rooms. Features are represented as columns in a matrix. Labels are the target values we aim to predict for an observation. For example, the label "price" is the house's market value. Labels are typically represented as a single column vector, where each entry corresponds to the observation in that row.

Example 2.1.3. Practice problem: Suppose $A \in \mathbb{R}^{2 \times 3}$ and $B \in \mathbb{R}^{3 \times 4}$. What is the shape of $AB$? Since the inner dimensions match (3), the product is defined and $AB \in \mathbb{R}^{2 \times 4}$.

Sum and Difference of Matrices

Definition 2.1.5. The sum of two matrices $A = (a_{ij}) \in \mathbb{R}^{p \times d}$ and $B = (b_{ij}) \in \mathbb{R}^{p \times d}$ is defined only when $A$ and $B$ are of the same size. In that case, we define their sum entrywise by the rule

$$A + B = \begin{pmatrix} a_{11} + b_{11} & \cdots & a_{1d} + b_{1d} \\ \vdots & \ddots & \vdots \\ a_{p1} + b_{p1} & \cdots & a_{pd} + b_{pd} \end{pmatrix}$$

Example 2.1.4. Let's say we have the matrices $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}$. Then

$$A + B = \begin{pmatrix} 1 + 5 & 2 + 6 \\ 3 + 7 & 4 + 8 \end{pmatrix} = \begin{pmatrix} 6 & 8 \\ 10 & 12 \end{pmatrix}$$

Definition 2.1.6. The difference of two matrices $A$ and $B$ is also defined only when $A$ and $B$ are of the same size (in other words $A \in \mathbb{R}^{p \times d}$ and $B \in \mathbb{R}^{p \times d}$). In that case, we define their difference using sums: first we multiply the second matrix by the scalar $-1$ to obtain $-B$, then we add the two matrices:

$$A - B = A + (-B) = \begin{pmatrix} a_{11} - b_{11} & \cdots & a_{1d} - b_{1d} \\ \vdots & \ddots & \vdots \\ a_{p1} - b_{p1} & \cdots & a_{pd} - b_{pd} \end{pmatrix}$$

Dot Product

Definition 2.1.7. The dot product of two vectors $\mathbf{a} = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix}$ and $\mathbf{b} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix}$ is defined as:

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$$

Example 2.1.5. We can visualize this: assume we have $\mathbf{a} = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}$ and $\mathbf{b} = \begin{pmatrix} 4 \\ 5 \\ 6 \end{pmatrix}$. The dot product looks like this:

$$\mathbf{a} \cdot \mathbf{b} = (1 \times 4) + (2 \times 5) + (3 \times 6) = 4 + 10 + 18 = 32$$

Example 2.1.6. Practice problem: Let $\mathbf{u} = \begin{pmatrix} 2 \\ 1 \\ 4 \end{pmatrix}$ and $\mathbf{v} = \begin{pmatrix} 3 \\ 0 \\ -2 \end{pmatrix}$. Compute $\mathbf{u} \cdot \mathbf{v}$.

$$\mathbf{u} \cdot \mathbf{v} = (2)(3) + (1)(0) + (4)(-2) = 6 + 0 - 8 = -2$$

The dot product is the tool we'll use to multiply vectors and matrices.

Matrix Multiplication

Definition 2.1.8. Suppose that we have $A \in \mathbb{R}^{r \times d}$ and $B \in \mathbb{R}^{d \times s}$. Then the product of $A$ and $B$ is denoted $AB$.
The $(i,j)$th element of $AB$ is computed by multiplying each element of the $i$th row of $A$ by the corresponding element of the $j$th column of $B$ and summing. That is,

$$(AB)_{ij} = \sum_{k=1}^{d} a_{ik} b_{kj}$$

Example 2.1.7. As an example, consider

$$A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}$$

Then

$$AB = \begin{pmatrix} 1 \times 5 + 2 \times 7 & 1 \times 6 + 2 \times 8 \\ 3 \times 5 + 4 \times 7 & 3 \times 6 + 4 \times 8 \end{pmatrix} = \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}$$

Example 2.1.8. Practice problem: Let $A = \begin{pmatrix} 2 & -1 & 0 \\ 3 & 4 & 1 \end{pmatrix}$ and $B = \begin{pmatrix} 1 & 2 \\ -2 & 0 \\ 5 & -1 \end{pmatrix}$. Compute $AB$.

First check dimensions: $A \in \mathbb{R}^{2 \times 3}$ and $B \in \mathbb{R}^{3 \times 2}$, so $AB \in \mathbb{R}^{2 \times 2}$.

$$AB = \begin{pmatrix} 2 \times 1 + (-1) \times (-2) + 0 \times 5 & 2 \times 2 + (-1) \times 0 + 0 \times (-1) \\ 3 \times 1 + 4 \times (-2) + 1 \times 5 & 3 \times 2 + 4 \times 0 + 1 \times (-1) \end{pmatrix} = \begin{pmatrix} 4 & 4 \\ 0 & 5 \end{pmatrix}$$

Note that this operation produces an $r \times s$ matrix. It is only possible to compute $AB$ if the number of columns of $A$ is the same as the number of rows of $B$.

Transpose

Definition 2.1.9. To take the transpose of a matrix, we interchange each of its columns with the corresponding row. That is, row 1 becomes column 1, row 2 becomes column 2, and so on. A superscript $T$ is used to denote the transpose operation.

So if we have a matrix $X \in \mathbb{R}^{p \times d}$ we can take its transpose $X^T \in \mathbb{R}^{d \times p}$ by swapping the rows and columns. Element-wise, $(X^T)_{ij} = X_{ji}$.

Example 2.1.9. For example, if we have a matrix $X \in \mathbb{R}^{3 \times 2}$ we can take its transpose $X^T \in \mathbb{R}^{2 \times 3}$ by swapping the rows and columns:

$$X = \begin{pmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \end{pmatrix} \qquad X^T = \begin{pmatrix} x_{11} & x_{21} & x_{31} \\ x_{12} & x_{22} & x_{32} \end{pmatrix}$$

Example 2.1.10. Practice problem: If $C = \begin{pmatrix} 0 & 2 \\ 3 & 5 \\ 1 & 4 \end{pmatrix}$, compute $C^T$.

$$C^T = \begin{pmatrix} 0 & 3 & 1 \\ 2 & 5 & 4 \end{pmatrix}$$

Identity Matrix

Before we can understand the inverse of a matrix, we need to understand the identity matrix. Recall that multiplying any scalar (number) by 1 gives you the same scalar back. The identity matrix plays the same role for matrices.

Definition 2.1.10. The identity matrix $I_n$ (or just $I$ when the size is clear) is a square $n \times n$ matrix with 1s on the diagonal and 0s everywhere else:

$$I_3 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

For any matrix $A$ of compatible size: $AI = IA = A$.

Example 2.1.11. Let's verify this works:

$$\begin{pmatrix} 2 & 3 \\ 4 & 5 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 2 \times 1 + 3 \times 0 & 2 \times 0 + 3 \times 1 \\ 4 \times 1 + 5 \times 0 & 4 \times 0 + 5 \times 1 \end{pmatrix} = \begin{pmatrix} 2 & 3 \\ 4 & 5 \end{pmatrix}$$

The matrix came out unchanged, just like multiplying a number by 1.

Determinant

Before we can compute the inverse of a matrix, we need to understand the determinant.
The determinant is a single number that tells us important information about a matrix: most critically, whether the matrix has an inverse.

Definition 2.1.11. For a $2 \times 2$ matrix $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$, the determinant is defined as:

$$\det(A) = ad - bc$$

The determinant is often written as $|A|$ or $\det(A)$.

Example 2.1.12. For $A = \begin{pmatrix} 3 & 2 \\ 1 & 4 \end{pmatrix}$:

$$\det(A) = (3)(4) - (2)(1) = 12 - 2 = 10$$

The determinant has a geometric interpretation: it tells you how much a matrix "stretches" or "squishes" space. If $\det(A) = 2$, multiplying by $A$ doubles areas. If $\det(A) = 0$, the matrix collapses space into a lower dimension, and this is exactly when the matrix has no inverse.

Definition 2.1.12. A matrix $A$ is invertible (has an inverse) if and only if $\det(A) \neq 0$.

Example 2.1.13. Let's check if $B = \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}$ has an inverse:

$$\det(B) = (1)(4) - (2)(2) = 4 - 4 = 0$$

Since $\det(B) = 0$, this matrix does not have an inverse. Notice that the second row is exactly twice the first row: the rows are linearly dependent.

Inverse Matrix

Now we can define and compute the inverse. Just like division "undoes" multiplication for numbers (since $5 \times \frac{1}{5} = 1$), an inverse matrix "undoes" matrix multiplication.

Definition 2.1.13. For a square matrix $A$, its inverse $A^{-1}$ is the matrix such that:

$$A A^{-1} = A^{-1} A = I$$

Not every matrix has an inverse.
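The determinant test from Definition 2.1.12 is also how you would check invertibility in code. A minimal sketch using NumPy (which is introduced in the appendix), applied to the two matrices from the examples above; the small tolerance is our own choice to absorb floating-point error:

```python
import numpy as np

def is_invertible(matrix):
    # A square matrix is invertible iff its determinant is nonzero;
    # we compare against a small tolerance because det is computed in floating point.
    return abs(np.linalg.det(matrix)) > 1e-12

A = np.array([[3, 2], [1, 4]])  # det = 10, so invertible
B = np.array([[1, 2], [2, 4]])  # det = 0, rows linearly dependent

print(np.linalg.det(A))  # close to 10.0
print(is_invertible(A))  # True
print(is_invertible(B))  # False
```

Running this confirms by machine exactly what we computed by hand: $A$ passes the test and $B$ fails it.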
A matrix that has an inverse is called invertible or non-singular.

Computing the Inverse of a 2×2 Matrix

For a $2 \times 2$ matrix, there's a simple formula:

Definition 2.1.14. If $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$ and $\det(A) \neq 0$, then:

$$A^{-1} = \frac{1}{\det(A)} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$$

In words: swap the diagonal elements, negate the off-diagonal elements, and divide everything by the determinant.

Example 2.1.14. Let's compute the inverse of $A = \begin{pmatrix} 3 & 2 \\ 1 & 4 \end{pmatrix}$ step by step.

Step 1: Compute the determinant.

$$\det(A) = (3)(4) - (2)(1) = 12 - 2 = 10$$

Since $\det(A) = 10 \neq 0$, the inverse exists.

Step 2: Apply the formula.

$$A^{-1} = \frac{1}{10} \begin{pmatrix} 4 & -2 \\ -1 & 3 \end{pmatrix} = \begin{pmatrix} \frac{4}{10} & -\frac{2}{10} \\ -\frac{1}{10} & \frac{3}{10} \end{pmatrix} = \begin{pmatrix} 0.4 & -0.2 \\ -0.1 & 0.3 \end{pmatrix}$$

Step 3: Verify by computing $A A^{-1}$.

$$A A^{-1} = \begin{pmatrix} 3 & 2 \\ 1 & 4 \end{pmatrix} \begin{pmatrix} 0.4 & -0.2 \\ -0.1 & 0.3 \end{pmatrix} = \begin{pmatrix} 1.2 - 0.2 & -0.6 + 0.6 \\ 0.4 - 0.4 & -0.2 + 1.2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = I$$

Example 2.1.15. Let's compute the inverse of $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$.

Step 1: $\det(A) = (1)(4) - (2)(3) = 4 - 6 = -2$

Step 2: Apply the formula:

$$A^{-1} = \frac{1}{-2} \begin{pmatrix} 4 & -2 \\ -3 & 1 \end{pmatrix} = \begin{pmatrix} -2 & 1 \\ \frac{3}{2} & -\frac{1}{2} \end{pmatrix}$$

Larger Matrices

For matrices larger than $2 \times 2$, computing inverses by hand becomes much more tedious. There are methods like Gaussian elimination and cofactor expansion, but they require many more steps. For a $3 \times 3$ matrix, the determinant formula involves 6 terms. For a $4 \times 4$ matrix, it involves 24 terms. In general, computing the determinant of an $n \times n$ matrix by the basic formula involves $n!$ (factorial) terms: that's 120 terms for a $5 \times 5$ matrix!

In practice, we let computers handle matrix inverses. Libraries like NumPy provide efficient algorithms:

import numpy as np

A = np.array([[3, 2], [1, 4]])
A_inv = np.linalg.inv(A)
print(A_inv)
# [[ 0.4 -0.2]
#  [-0.1  0.3]]

Due to time constraints and the complexity of the calculations, we will not be computing the inverse of matrices larger than $2 \times 2$ by hand in this class. Instead we will introduce various techniques to compute the inverse of matrices but will not require you to ever compute them by hand.

2.2 Multiple Linear Regression Formally

Now that we've established a foundation in Linear Algebra, we can finally tackle the problem of predicting house pricing with more than one feature.
As you saw in our previous house table, a single weight $W$ isn't enough when we have size, bedrooms, and location all affecting the pricing of houses. In multiple linear regression, we don't just have one coefficient to optimize; we have to manipulate multiple.

Definition 2.2.1. To make this work, we give every feature its own weight. If we have $p$ features, our prediction formula $\hat{y}$ looks like this:

$$\hat{y} = \beta_0 + X_1 \beta_1 + X_2 \beta_2 + \cdots + X_p \beta_p$$

In this equation:

$\beta_0$ is our intercept, replacing $b$ from the previous chapter;
$X_1, X_2, \ldots$ are our features (size, bedrooms, etc.);
$\beta_1, \beta_2, \ldots$ are the specific weights for those features.

If you reference library documentation you'll often see this represented as a single matrix multiplication:

$$\hat{y} = X\beta$$

2.3 The Normal Equation and Pseudoinverses

In the previous chapter, we talked about finding the optimal weights using gradient descent. Now that we know some basic linear algebra, however, we can introduce a "perfect" mathematical solution that finds the best weights immediately, without iterative gradient descent.

Definition 2.3.1. This is called the Normal Equation. If we want to minimize our total error (RSS) across all houses, we can use this formula:

$$\hat{\beta} = (X^T X)^{-1} X^T y$$

When you combine the pieces, the math spits out the exact set of weights that creates the lowest possible error.

To understand what this equation does, let's consider a simple equation $y = \beta x$. If we wanted to find the value of $\beta$ we would simply divide both sides by $x$ like so: $\frac{y}{x} = \beta$. This can also be written as $\beta = x^{-1} y$. This isn't too dissimilar to the normal equation. Let's break down all its parts one by one:

Decomposition of the Matrix Form:

Recall that when dealing with multiple observations and features, we represent our system as $y = X\beta$. However, because $X$ is typically a non-square matrix (more observations than features), we cannot simply invert it.
We derive the solution:

1. The Gram Matrix. Since $X$ is rectangular, we multiply it by its transpose $X^T$ to create a symmetric, square matrix $X^T X$.

Example 2.3.1. Let's actually calculate the optimal coefficients for the following dataset:

Size     Bedrooms    Price
1,000    2           100,000
2,000    3           200,000
3,000    4           300,000
4,000    5           400,000

We can first make our data a matrix $X$ (with an intercept column of 1s) and define some labels $y$:

$$X = \begin{pmatrix} 1 & 1000 & 2 \\ 1 & 2000 & 3 \\ 1 & 3000 & 4 \\ 1 & 4000 & 5 \end{pmatrix}, \qquad y = \begin{pmatrix} 100000 \\ 200000 \\ 300000 \\ 400000 \end{pmatrix}$$

First we can transpose our matrix:

$$X^T = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1000 & 2000 & 3000 & 4000 \\ 2 & 3 & 4 & 5 \end{pmatrix}$$

Then we can calculate $X^T X$ by multiplying the transpose of $X$ by $X$:

$$X^T X = \begin{pmatrix} 4 & 10000 & 14 \\ 10000 & 30000000 & 40000 \\ 14 & 40000 & 54 \end{pmatrix}$$

We multiply the transposed matrix by the house prices. This shows how each feature relates to the price:

$$X^T y = \begin{pmatrix} 1000000 \\ 3000000000 \\ 4000000 \end{pmatrix}$$

Then finally we can find the inverse and the final solution. This is the part where the math gets very tedious to do by hand, so we will omit the calculation and just give you the result:

$$\hat{\beta} = \begin{pmatrix} 0 \\ 100 \\ 0 \end{pmatrix}$$

Intercept $\beta_0 = 0$: The baseline price starts at zero.
Size $\beta_1 = 100$: For every 1 square foot, the price increases by 100 dollars.
Bedrooms $\beta_2 = 0$: In this specific (contrived) dataset, the number of bedrooms didn't add extra value beyond what the square footage already explained.

Our final equation would be $\hat{y} = 0 + 100x_1 + 0x_2$, or in its simplest form, $\hat{y} = 100x_1$.

2.4 Gradient Descent for Multiple Variables

Sometimes we do not want to take a matrix inverse (or it is too expensive). In that case we can minimize the error iteratively using gradient descent.

Definition 2.4.1. In matrix form, the prediction is still $\hat{y} = X\beta$.
We define a single loss over all houses (batch loss):

$$J(\beta) = \frac{1}{2n} \lVert X\beta - y \rVert^2$$

The gradient of this loss tells us the downhill direction:

$$\nabla_\beta J = \frac{1}{n} X^T (X\beta - y)$$

Then we repeatedly update every weight at once:

$$\beta \leftarrow \beta - \alpha \nabla_\beta J$$

Example 2.4.1. Batch gradient descent in practice:

1. Start with a guess for $\beta$ (often all zeros).
2. Compute the predictions $X\beta$.
3. Measure the error $X\beta - y$.
4. Use the gradient formula above to update all weights.
5. Repeat until the loss stops changing much.

This method is slower than the normal equation, but it scales to large datasets and does not require matrix inversion.

2.5 Feature Scaling

If gradient descent is "walking downhill," feature scale determines whether that walk is smooth or chaotic. In a housing dataset, square footage might be around 3000 while bedrooms might be around 3. Both features matter, but they live on very different numeric scales. Without scaling, one weight gets giant updates while another barely moves. That is how you get zig-zagging, very slow progress, or complete divergence.

Figure 3: A literal 3D loss surface view of gradient descent. Left: unscaled features create a thin valley and chaotic zig-zag steps. Right: scaled features create a rounder bowl and smoother progress.

Definition 2.5.1. Feature scaling transforms each feature column so that columns are numerically comparable. The most common method is standardization (z-score scaling):

$$x'_{i,j} = \frac{x_{i,j} - \mu_j}{\sigma_j} \quad \text{with} \quad \mu_j = \frac{1}{n} \sum_{i=1}^{n} x_{i,j}, \qquad \sigma_j = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \bigl(x_{i,j} - \mu_j\bigr)^2}$$

Here, $\mu_j$ is the center of feature $j$, and $\sigma_j$ is the feature's typical distance from that center.

Start with a contrived failure case

Imagine a model with only two input features:

$x_1$ = square footage (roughly 800 to 4500)
$x_2$ = bedrooms (roughly 1 to 5)

At initialization, many models start with $\beta = 0$. Then the prediction error is roughly $-y$, and each gradient component is proportional to:

$$\frac{\partial J}{\partial \beta_j} \propto -\frac{1}{n} \sum_{i=1}^{n} y_i \, x_{i,j}$$

That means each gradient component is scaled by feature size itself.
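This imbalance is easy to see numerically. A minimal sketch with NumPy; the house sizes, bedroom counts, and prices below are invented purely for illustration:

```python
import numpy as np

# Invented data for illustration: columns are [square footage, bedrooms].
X = np.array([[2400.0, 3.0],
              [3100.0, 4.0],
              [1400.0, 2.0],
              [4200.0, 5.0]])
y = np.array([300_000.0, 410_000.0, 190_000.0, 520_000.0])
n = len(y)

# Step-0 gradient with beta = 0: each component is -(1/n) * sum_i y_i * x_ij.
grad_raw = -(1.0 / n) * X.T @ y
print(grad_raw)  # the square-footage component dwarfs the bedrooms component

# Standardize each column (z-score), then recompute the same gradient.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
grad_scaled = -(1.0 / n) * X_scaled.T @ y
print(grad_scaled)  # both components now have comparable magnitudes
```

With the raw columns, the square-footage gradient component is hundreds of times larger than the bedrooms component; after standardization the two are on the same order of magnitude, which is the fix developed in the rest of this section.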
A feature with values in the thousands naturally creates much larger updates than one with values near 1 to 5.

Example 2.5.1. A quick numeric intuition: if a typical home has about 2500 square feet and 3 bedrooms, then the raw magnitude ratio is about $\frac{2500}{3} \approx 833$. So, before scaling, the square-footage gradient can easily be hundreds of times larger than the bedrooms gradient. This does not mean bedrooms are unimportant. It means the optimizer is being biased by units.

Figure 4: Raw feature space vs standardized feature space. After z-score scaling, both features occupy comparable numeric ranges.

Read this figure left to right:

Left panel: the model sees one axis with values in the thousands and another near single digits.
Right panel: both axes are centered around 0 with similar spread, so optimization is more balanced.

Figure 5: Step-0 gradient magnitudes (log scale). Raw features create an extreme update imbalance; scaled features reduce that gap.

This is the mechanism, not just the symptom: when feature scales differ wildly, one weight receives most of the update budget.

Training run walkthrough: good, bad, and ugly

Now look at an entire training run for the same dataset under different setups.

Figure 6: Three gradient descent runs: unscaled with tiny learning rate (slow), unscaled with larger learning rate (diverges), and scaled with larger learning rate (fast + stable).

Example 2.5.2. How to read the three curves:

1. Raw + tiny learning rate: stable, but painfully slow. You need very small steps to avoid blowing up.
2. Raw + slightly larger learning rate: instability appears quickly and training diverges.
3. Scaled + much larger learning rate: loss drops quickly and smoothly.

In other words, scaling often expands the range of learning rates that actually work.

Why this happens geometrically

Gradient descent is navigating a loss surface in parameter space. With poor scaling, that surface is often a thin valley.
With better scaling, the valley becomes rounder.

Figure 7: Loss contours and parameter paths. Unscaled features create a thin valley and zig-zag path; scaled features produce a rounder bowl and a more direct path.

Walkthrough:

Left: narrow contours force the optimizer to bounce side-to-side (zig-zag).
Right: contours are more circular, so the path can head toward the minimum more directly.

Generalizing beyond this toy dataset

This pattern shows up in almost every multifeature model trained with gradient-based methods:

If features use different units (dollars, years, square feet, counts), scaling is usually necessary.
Standardization is a strong default for linear models and neural networks.
Always fit scaling parameters ($\mu$, $\sigma$) on training data only, then reuse them for validation/test data.

If you skip that last step and recompute scaling independently on test data, you change the meaning of each feature and your evaluation becomes unreliable.

2.6 Interpreting Weights After Scaling

When you scale features, the model learns weights for the scaled version of each feature. That means the coefficients are no longer in the original units (square feet, bedrooms, years). This is why a scaled gradient descent model can look very different from the normal equation or scikit-learn if those were trained on raw features.

Definition 2.6.1. If we standardize each feature with $x'_j = \frac{x_j - \mu_j}{\sigma_j}$ and train a model with weights $\beta'_j$ and intercept $\beta'_0$, then the equivalent weights in the original feature units are:

$$\beta_j = \frac{\beta'_j}{\sigma_j} \qquad \beta_0 = \beta'_0 - \sum_j \frac{\beta'_j \mu_j}{\sigma_j}$$

Example 2.6.1. Suppose your trained scaled model has:

$\beta'_0 = 420{,}000$
$\beta'_{\text{sqft}} = 126{,}500$ and $\sigma_{\text{sqft}} = 1100$
$\beta'_{\text{bed}} = 19{,}000$ and $\sigma_{\text{bed}} = 0.9$
$\mu_{\text{sqft}} = 2500$
$\mu_{\text{bed}} = 3.4$

Convert back:

$\beta_{\text{sqft}} = \frac{126{,}500}{1100} = 115$ dollars per square foot
$\beta_{\text{bed}} = \frac{19{,}000}{0.9} \approx 21{,}111$ dollars per additional bedroom

So the scaled model is still interpretable, but only after converting back to original units.

2.7 Linear Algebra Practice Problems

Sets and Number Sets

Problem 1.
Classify each of the following values into the most specific number set ($\mathbb{N}$, $\mathbb{Z}$, $\mathbb{Q}$, $\mathbb{R}$, or $\mathbb{C}$): $7$; $-3$; $\frac{7}{3}$; $0.75$; $\sqrt{2}$; $3 + 2i$.

Problem 2. Classify each value into the most specific number set ($\mathbb{N}$, $\mathbb{Z}$, $\mathbb{Q}$, $\mathbb{R}$, or $\mathbb{C}$): $0$; $-7$; $\frac{3}{9}$; $5 + 0i$.

Problem 3. True or False: Every natural number is also a rational number. Explain your reasoning using the subset relationships.

Problem 4. True or False: Every real number is also a rational number. If false, give a counterexample.

Problem 5. A machine learning dataset contains the following columns: "number of bedrooms" (values like 2, 3, 4) and "house price" (values like $245,000.50). Which number set would you use to describe each column?

Problem 6. If $A \subset B$ and $B \subset C$, what can you conclude about the relationship between $A$ and $C$? Apply this to explain why $\mathbb{N} \subset \mathbb{R}$.

Problem 7. Give an example of a number that is in $\mathbb{R}$ but not in $\mathbb{Q}$. Why does this distinction matter for computer representations of numbers?

Vectors and Vector Operations

Problem 8. A data point for a student has the following features: GPA (3.5), SAT score (1400), and number of extracurriculars (4). Write this as a 3-vector in column notation.

Problem 9. Given two vectors $\mathbf{a} = \begin{pmatrix} 2 \\ 5 \\ 1 \end{pmatrix}$ and $\mathbf{b} = \begin{pmatrix} 3 \\ 2 \\ 4 \end{pmatrix}$, compute $\mathbf{a} + \mathbf{b}$.

Problem 10. Compute $3\mathbf{v}$ where $\mathbf{v} = \begin{pmatrix} 4 \\ 2 \\ 7 \end{pmatrix}$.

Problem 11. Compute $2\mathbf{a} + \mathbf{b}$ where $\mathbf{a} = \begin{pmatrix} 3 \\ 1 \\ 4 \end{pmatrix}$ and $\mathbf{b} = \begin{pmatrix} 1 \\ 5 \\ 2 \end{pmatrix}$.

Problem 12. Given $\mathbf{u} = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$ and $\mathbf{w} = \begin{pmatrix} 4 \\ 6 \end{pmatrix}$, compute $2\mathbf{u} + 3\mathbf{w}$.

Problem 13. State whether each expression is defined. If it is, compute it.

1. $\mathbf{p} + \mathbf{q}$ where $\mathbf{p} = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}$ and $\mathbf{q} = \begin{pmatrix} 4 \\ 5 \\ 6 \end{pmatrix}$
2. $\mathbf{r} + \mathbf{s}$ where $\mathbf{r} = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$ and $\mathbf{s} = \begin{pmatrix} 3 \\ 4 \\ 5 \end{pmatrix}$

Problem 14. Why can't you add the vectors $\mathbf{p} = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}$ and $\mathbf{q} = \begin{pmatrix} 4 \\ 5 \end{pmatrix}$? In a machine learning context, what would this situation represent?

Notation and Dimensionality

Problem 15. If a matrix $M \in \mathbb{R}^{50 \times 7}$, how many rows does it have? How many columns? How many total entries?

Problem 16. If $A \in \mathbb{R}^{4 \times 2}$, how many entries are in $A$?

Problem 17. Write the general form of a matrix $A \in \mathbb{R}^{2 \times 3}$ using subscript notation for each element.

Problem 18. Write the general form of a matrix $B \in \mathbb{R}^{3 \times 2}$ using subscript notation.

Problem 19. You have a dataset of 1000 images, where each image is represented by 784 pixel values.
What is the shape of the data matrix $X$ if each row is one image? Write it in the form $X \in \mathbb{R}^{p \times d}$.

Problem 20. Given $X \in \mathbb{R}^{3 \times 4}$, what element is located at row 2, column 3? Write it using subscript notation.

Problem 21. A machine learning model takes in a data matrix $X \in \mathbb{R}^{n \times d}$ and outputs predictions $\hat{y} \in \mathbb{R}^n$. Explain in plain English what $n$ and $d$ represent.

Dot Product

Problem 22. Compute the dot product of $\mathbf{a} = \begin{pmatrix} 2 \\ 3 \\ 1 \end{pmatrix}$ and $\mathbf{b} = \begin{pmatrix} 4 \\ 1 \\ 5 \end{pmatrix}$.

Problem 23. Compute the dot product of $\mathbf{u} = \begin{pmatrix} 2 \\ 0 \\ 3 \end{pmatrix}$ and $\mathbf{v} = \begin{pmatrix} 5 \\ 4 \\ 1 \end{pmatrix}$.

Problem 24. If $\mathbf{x} = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}$ and $\mathbf{y} = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}$, compute $\mathbf{x} \cdot \mathbf{y}$.

Problem 25. A house has features $\mathbf{x} = \begin{pmatrix} 1 \\ 2000 \\ 3 \end{pmatrix}$ (intercept, square footage, bedrooms) and the model weights are $\boldsymbol{\beta} = \begin{pmatrix} 50000 \\ 100 \\ 5000 \end{pmatrix}$. Compute the predicted price using the dot product.

Problem 26. Compute $\mathbf{v} \cdot \mathbf{v}$ where $\mathbf{v} = \begin{pmatrix} 3 \\ 4 \end{pmatrix}$.

Problem 27. Given $\mathbf{p} = \begin{pmatrix} 2 \\ 1 \end{pmatrix}$ and $\mathbf{q} = \begin{pmatrix} 4 \\ 3 \end{pmatrix}$, compute $\mathbf{p} \cdot \mathbf{q}$.

Problem 28. Why must two vectors have the same dimension for the dot product to be defined? Give a practical example where this constraint matters.

Matrix Multiplication

Problem 29. Given $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}$, compute the element in row 1, column 2 of $AB$.

Problem 30. If $A \in \mathbb{R}^{3 \times 4}$ and $B \in \mathbb{R}^{4 \times 2}$, what is the shape of the product $AB$?

Problem 31. Can you multiply $P \in \mathbb{R}^{2 \times 3}$ by $Q \in \mathbb{R}^{2 \times 3}$? Explain why or why not.

Problem 32. Compute the full matrix product:

$$\begin{pmatrix} 2 & 1 \\ 0 & 3 \end{pmatrix} \begin{pmatrix} 1 & 2 \\ 4 & 5 \end{pmatrix}$$

Problem 33. Compute the full matrix product:

$$\begin{pmatrix} 1 & 0 & 1 \\ 3 & 2 & 1 \end{pmatrix} \begin{pmatrix} 2 & 1 \\ 4 & 1 \\ 0 & 2 \end{pmatrix}$$

Problem 34. Let $A \in \mathbb{R}^{2 \times 3}$ and $B \in \mathbb{R}^{3 \times 1}$. What is the shape of $AB$?

Transpose

Problem 35. Compute the transpose of $A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}$.

Problem 36. If $M \in \mathbb{R}^{10 \times 3}$, what is the shape of $M^T$?

Problem 37. Given the column vector $\mathbf{v} = \begin{pmatrix} 2 \\ 5 \\ 8 \end{pmatrix}$, write $\mathbf{v}^T$.

Problem 38. Verify that $(A^T)^T = A$ for $A = \begin{pmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{pmatrix}$.

Problem 39. Let $A = \begin{pmatrix} 1 & 3 \\ 2 & 0 \end{pmatrix}$ and $B = \begin{pmatrix} 4 & 2 \\ 1 & 5 \end{pmatrix}$. Compute $(A + B)^T$.

Determinant

Problem 40. Compute the determinant of $A = \begin{pmatrix} 5 & 2 \\ 3 & 4 \end{pmatrix}$.

Problem 41. Compute the determinant of $D = \begin{pmatrix} 7 & 3 \\ 1 & 2 \end{pmatrix}$.

Problem 42. Compute the determinant of $B = \begin{pmatrix} 2 & 3 \\ 6 & 9 \end{pmatrix}$. Does this matrix have an inverse?

Problem 43. For the matrix $C = \begin{pmatrix} a & 2a \\ b & 2b \end{pmatrix}$, compute the determinant. What does this tell you about matrices where one column is a multiple of the other?

Problem 44. If $\det(A) = 5$, what is $\det(2A)$ for a $2 \times 2$ matrix?
(Hint: work out an example.)

Problem 45. The determinant has a geometric interpretation: it tells us how a matrix scales area. If $\det(A) = 3$, what happens to the area of a unit square when transformed by $A$? What if $\det(A) = -2$?

Identity Matrix and Inverse

Problem 46. Write the $2 \times 2$ identity matrix $I_2$ and the $3 \times 3$ identity matrix $I_3$.

Problem 47. Compute $I_2 A$ for $A = \begin{pmatrix} 1 & 2 \\ 4 & 0 \end{pmatrix}$.

Problem 48. Verify that $A I_2 = A$ for $A = \begin{pmatrix} 3 & 2 \\ 7 & 5 \end{pmatrix}$.

Problem 49. Compute the inverse of $A = \begin{pmatrix} 4 & 2 \\ 3 & 2 \end{pmatrix}$ by hand using the formula $A^{-1} = \frac{1}{\det(A)} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$.

Problem 50. Compute the inverse of $B = \begin{pmatrix} 5 & 7 \\ 2 & 3 \end{pmatrix}$ by hand. Show all steps.

Problem 51. Compute the inverse of $C = \begin{pmatrix} 1 & 3 \\ 2 & 5 \end{pmatrix}$ by hand.

Problem 52. Attempt to compute the inverse of $D = \begin{pmatrix} 6 & 4 \\ 3 & 2 \end{pmatrix}$. What happens and why?

Problem 53. If $A^{-1} = \begin{pmatrix} 2 & 3 \\ 1 & 2 \end{pmatrix}$, find $A$.

Problem 54. For which values of $k$ is the matrix $A = \begin{pmatrix} 1 & 2 \\ k & 4 \end{pmatrix}$ invertible?

Mixed Practice

Problem 55. Let $\mathbf{a} = \begin{pmatrix} 2 \\ 1 \\ 0 \end{pmatrix}$ and $\mathbf{b} = \begin{pmatrix} 1 \\ 3 \\ 2 \end{pmatrix}$. Compute $(\mathbf{a} + \mathbf{b}) \cdot \mathbf{b}$.

Problem 56. Let $A = \begin{pmatrix} 1 & 0 \\ 3 & 2 \\ 1 & 4 \end{pmatrix}$. Compute $A^T$ and then compute $A^T A$.

Problem 57. Let $B = \begin{pmatrix} 2 & 4 \\ 1 & 2 \end{pmatrix}$. Determine whether $B$ is invertible and explain why.

2.8 Problem Packet

All required calculus derivations are provided in full. For each derivation, you will be asked to apply or interpret the result rather than re-derive it. Show computations where requested and explain interpretations in complete sentences.

Data Representation and Design Matrix

Problem 1. Given the dataset below, write the design matrix $X$ including an intercept column and the label vector $y$. Then interpret the meaning of each column in one sentence.

Size (ft²)    Bedrooms    Price ($)
1,000         2           100,000
2,000         3           200,000
3,000         4           300,000
4,000         5           400,000

Problem 2. State the shape of $X$ in the form $X \in \mathbb{R}^{p \times d}$ and explain what $p$ and $d$ represent in this dataset.

Matrix Operations (Reference)

Problem 3. Using the design matrix from Problem 1, compute $X^T X$ and $X^T y$. Arithmetic may be shown succinctly.
Then explain in one sentence what each of these matrices/vectors measures.

Normal Equation (Derivation Provided)

The loss is the residual sum of squares:

$$L(\beta) = (y - X\beta)^T (y - X\beta)$$

Worked derivation:

$$L(\beta) = y^T y - 2\beta^T X^T y + \beta^T X^T X \beta$$

Taking the gradient with respect to $\beta$:

$$\nabla_\beta L(\beta) = -2X^T y + 2X^T X \beta$$

Setting to zero:

$$-2X^T y + 2X^T X \beta = 0 \implies X^T X \beta = X^T y$$

The solution is:

$$\hat{\beta} = (X^T X)^{-1} X^T y$$

Problem 4. Using the formula above, compute $\hat{\beta}$ for Problem 1. Interpret each coefficient in a single clear sentence.

Problem 5. Explain why $\hat{\beta}$ is a minimizer of the loss using geometric or algebraic intuition.

MSE and Gradient (Derivation Provided)

$$\mathcal{L}(\beta) = \frac{1}{N} (y - X\beta)^T (y - X\beta)$$

Worked derivation:

$$\nabla_\beta \mathcal{L}(\beta) = \frac{1}{N} \bigl(-2X^T y + 2X^T X \beta\bigr) = -\frac{2}{N} X^T (y - X\beta)$$

The gradient descent update with learning rate $\eta$ is:

$$\beta \leftarrow \beta - \eta \nabla_\beta \mathcal{L}(\beta) = \beta + \frac{2\eta}{N} X^T (y - X\beta)$$

Problem 6. Apply the formula. Given $N = 3$, $\eta = 0.1$:

$$X = \begin{pmatrix} 1 & 2 & 1 \\ 1 & 4 & 2 \\ 1 & 1 & 0 \end{pmatrix}, \qquad \beta = \begin{pmatrix} 50 \\ 5 \\ 2 \end{pmatrix}, \qquad y = \begin{pmatrix} 65 \\ 75 \\ 52 \end{pmatrix}$$

Compute $y - X\beta$, then $X^T (y - X\beta)$, and the update increment $\Delta\beta$.

Problem 7. Explain in two sentences what the vector $X^T (y - X\beta)$ represents.

RSS and MSE Application

Problem 8. Using the results from Problem 6 (where $e = (3, 1, -3)^T$), explain whether the model under- or over-predicts for each observation and state what an MSE of 6.33 means.

Collinearity and Remedies

Problem 9. Explain in two sentences how near-collinearity affects the stability of $\hat{\beta}$. Propose a practical remedy with a one-sentence explanation.

Appendix: Programming Reference

3 Python and Libraries

3.1 NumPy

Accessing NumPy

To use NumPy you must first import it. It is common practice to import NumPy with the alias np so that subsequent function calls are concise.

import numpy as np

Arrays and Vectors

In NumPy an array is a generic term for a multidimensional set of numbers. One-dimensional NumPy arrays act like vectors. The following code creates two one-dimensional arrays and adds them elementwise.
If you attempted the same with plain Python lists you would not get elementwise addition.

```python
x = np.array([3, 4, 5])
y = np.array([4, 9, 7])
print(x + y)  # array([ 7, 13, 12])
```

Matrices as Two-Dimensional Arrays

Matrices in NumPy are typically represented as two-dimensional arrays. The object returned by np.array has attributes such as ndim for the number of dimensions, dtype for the data type, and shape for the size of each axis.

```python
x = np.array([[1, 2], [3, 4]])
print(x)        # array([[1, 2], [3, 4]])
print(x.ndim)   # 2
print(x.dtype)  # e.g. dtype('int64')
print(x.shape)  # (2, 2)
```

If any element passed into np.array is a floating point number, NumPy upcasts the whole array to a floating point dtype.

```python
print(np.array([[1, 2], [3.0, 4]]).dtype)       # dtype('float64')
print(np.array([[1, 2], [3, 4]], float).dtype)  # dtype('float64')
```

Methods and Functions

Methods are functions bound to objects. Calling x.sum() calls the sum method with x as the implicit first argument. The module-level function np.sum(x) does the same computation but is not bound to x.

```python
x = np.array([1, 2, 3, 4])
print(x.sum())    # method on the array object
print(np.sum(x))  # module-level function
```

The reshape method returns an array with the same data arranged into a new shape (a view when possible). You pass a tuple that specifies the new dimensions.

```python
x = np.array([1, 2, 3, 4, 5, 6])
print("beginning x:\n", x)
x_reshape = x.reshape((2, 3))
print("reshaped x:\n", x_reshape)
```

NumPy uses zero-based indexing. The first row and first column entry of x_reshape is accessed with x_reshape[0, 0]. The entry in the second row and third column is x_reshape[1, 2]. The third element of the original one-dimensional x is x[2].

```python
print(x_reshape[0, 0])  # 1
print(x_reshape[1, 2])  # 6
print(x[2])             # third element of x
```

Views and Shared Memory

Reshaping often returns a view rather than a copy. Modifying a view will modify the original array because they share the same memory.
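Whether two arrays share memory can be checked directly with np.shares_memory; a minimal sketch:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
x_reshape = x.reshape((2, 3))
# Reshaping a contiguous array returns a view, so both refer to the same buffer.
print(np.shares_memory(x, x_reshape))  # True
```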
This behavior is important when you expect independent copies.

```python
print("x before modification:\n", x)
print("x_reshape before modification:\n", x_reshape)
x_reshape[0, 0] = 5
print("x_reshape after modification:\n", x_reshape)
print("x after modification:\n", x)
```

If you need an independent copy, call x.copy() explicitly.

```python
x = np.array([1, 2, 3, 4, 5, 6])
x_copy = x.copy()
x_reshape_copy = x_copy.reshape((2, 3))
x_reshape_copy[0, 0] = 99
print("x remains unchanged:\n", x)
print("x_reshape_copy changed:\n", x_reshape_copy)
```

Tuples are immutable sequences in Python and will raise a TypeError if you try to modify an element. This differs from NumPy arrays and Python lists.

```python
my_tuple = (3, 4, 5)
# my_tuple[0] = 2  # would raise TypeError: 'tuple' object does not support item assignment
```

Transpose, ndim, and shape

You can request several attributes at once. The transpose T flips axes and is useful for matrix algebra.

```python
print(x_reshape.shape, x_reshape.ndim, x_reshape.T)
# For example: (2, 3), 2, and the transpose [[5, 4], [2, 5], [3, 6]]
```

Elementwise Operations

NumPy supports elementwise arithmetic and universal functions such as np.sqrt. Raising an array to a power is elementwise.

```python
print(np.sqrt(x))  # elementwise square root
print(x ** 2)      # elementwise square
print(x ** 0.5)    # alternative for square root
```

Random Numbers

NumPy provides random number generation. The signature for rng.normal is normal(loc=0.0, scale=1.0, size=None).
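As a quick illustration of that signature (using the seeded Generator API described below; the seed 0 is an arbitrary choice), here is a 2×3 sample from a normal distribution with mean 10 and standard deviation 2:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed 0 is an arbitrary choice
sample = rng.normal(loc=10.0, scale=2.0, size=(2, 3))
print(sample.shape)  # (2, 3)
```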
The arguments loc and scale are keyword arguments for the mean and standard deviation, and size controls the shape of the output.

```python
x = np.random.normal(size=50)
print(x)  # random sample from N(0, 1), different each run
```

To create a dependent array, add a random variable with a different mean to each element.

```python
y = x + np.random.normal(loc=50, scale=1, size=50)
print(np.corrcoef(x, y))  # correlation matrix between x and y
```

Reproducibility with the Generator API

To produce identical random numbers across runs, use np.random.default_rng with an integer seed to create a Generator object and then call its methods. The Generator API is the recommended approach for reproducibility.

```python
rng = np.random.default_rng(1303)
print(rng.normal(scale=5, size=2))
rng2 = np.random.default_rng(1303)
print(rng2.normal(scale=5, size=2))
# Both prints produce the same arrays because the same seed was used.
```

When you use rng.standard_normal or rng.normal you are using the Generator instance, which ensures reproducibility if you control the seed.

```python
rng = np.random.default_rng(3)
y = rng.standard_normal(10)
print(np.mean(y), y.mean())
```

Mean, Variance, and Standard Deviation

NumPy provides np.mean, np.var, and np.std as module-level functions. Arrays also have methods mean, var, and std. By default np.var divides by n. If you need the sample variance that divides by n - 1, provide ddof=1.

```python
rng = np.random.default_rng(3)
y = rng.standard_normal(10)
print(np.var(y), y.var(), np.mean((y - y.mean())**2))
print(np.sqrt(np.var(y)), np.std(y))
# Use np.var(y, ddof=1) for the sample variance dividing by n - 1.
```

Axis Arguments and Row/Column Operations

NumPy arrays are stored in row-major order. The first axis, axis=0, refers to rows and the second axis, axis=1, refers to columns.
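A tiny example makes the axis convention concrete; the axis you pass to a reduction is the one that disappears:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
print(A.sum(axis=0))  # [4 6] : collapse rows, one sum per column
print(A.sum(axis=1))  # [3 7] : collapse columns, one sum per row
```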
Passing axis into reduction methods lets you compute means, sums, and other statistics along rows or columns.

```python
rng = np.random.default_rng(3)
X = rng.standard_normal((10, 3))
print(X)               # 10 by 3 matrix
print(X.mean(axis=0))  # column means
print(X.mean(0))       # same as previous
```

When you compute X.mean(axis=1) you obtain a one-dimensional array of row means. When you compute X.sum(axis=0) you obtain column sums.

Graphics with Matplotlib

Matplotlib is the standard plotting library. A plot consists of a figure and one or more axes. The subplots function returns a tuple containing the figure and the axes. The axes object has a plot method and other methods to customize titles, labels, and markers.

```python
from matplotlib.pyplot import subplots

fig, ax = subplots(figsize=(8, 8))
rng = np.random.default_rng(3)
x = rng.standard_normal(100)
y = rng.standard_normal(100)
ax.plot(x, y)       # default line plot
ax.plot(x, y, 'o')  # scatter-like circles
# To save: fig.savefig("scatter.png")
# To display in an interactive session: import matplotlib.pyplot as plt; plt.show()
```

Practical Notes

Using np.random.default_rng for all random generation in these examples makes results reproducible across runs on the same NumPy version. As NumPy changes across versions there may be minor differences in outputs for some operations. When computing variance, note the ddof argument if you expect sample variance rather than population variance.

Appendix: Math Fundamentals

4 Calculus

4.1 Limits

4.2 Derivatives

Limit Definition of Derivative

4.3 Gradients

Vector Valued Functions

Gradient Definition

Partial Derivatives and Rules

5 Linear Algebra

6 Statistics and Probability

Reference

Glossary of Definitions

Definition 0.1 .......... p. 8
Definition 0.2 .......... p. 8
Definition 1.3.1 ........ p. 11
Definition 1.5.1 ........ p. 13
Definition 1.5.2 ........ p. 14
Definition 1.7.1 ........ p. 18
Definition 2.1.1 ........ p. 22
Definition 2.1.2 ........ p. 23
Definition 2.1.3 ........ p. 23
Definition 2.1.4 ........ p. 24
Definition 2.1.5 ........ p. 24
Definition 2.1.6 ........ p. 25
Definition 2.1.7 ........ p. 25
Definition 2.1.8 ........ p. 26
Definition 2.1.9 ........ p. 26
Definition 2.1.10 ....... p. 27
Definition 2.1.11 ....... p. 28
Definition 2.1.12 ....... p. 28
Definition 2.1.13 ....... p. 28
Definition 2.1.14 ....... p. 28
Definition 2.2.1 ........ p. 30
Definition 2.3.1 ........ p. 30
Definition 2.4.1 ........ p. 32
Definition 2.5.1 ........ p. 33
Definition 2.6.1 ........ p. 36