Introduction To Artificial Intelligence and Machine Learning

Aniketh Chenjeri, Andrew Doyle, Swarnim Ghimire, Mr. Igor Tomcej

2026-01-07

Contents

Foreword
How Is This Book Structured?
Part I: Supervised Learning
1 Linear Regression
  1.1 What is Linear Regression?
  1.2 The Limits of Simple Estimation
  1.3 Linear Regression Formally Defined
  1.4 Understanding the Algorithm
  1.5 Measuring and Interpreting Error
  1.6 Optimization with Gradient Descent
  1.7 Understanding Gradients
  1.8 Lab: Building Linear Regression from Scratch
  1.9 Problem Packet
    Theory Questions
    Practice Problems
Appendix: Programming Reference
2 Python and Libraries
  2.1 NumPy
    Accessing NumPy
    Arrays and Vectors
    Matrices as Two-Dimensional Arrays
    Methods and Functions
    Views and Shared Memory
    Transpose, ndim, and shape
    Elementwise Operations
    Random Numbers
    Reproducibility with the Generator API
    Mean, Variance, and Standard Deviation
    Axis Arguments and Row/Column Operations
    Graphics with Matplotlib
    Practical Notes
Appendix: Math Fundamentals
3 Calculus
  3.1 Limits
  3.2 Derivatives
    Limit Definition of Derivative
  3.3 Gradients
    Vector Valued Functions
    Gradient Definition
    Partial Derivatives and Rules
4 Linear Algebra
5 Statistics and Probability
Reference
Glossary of Definitions

Foreword

The Introduction to Artificial Intelligence (AI) and Machine Learning (ML) class has had an interesting history. It started at Creek during the 2024-2025 school year with the intention of giving students a gentle on-ramp to the field. The course dodged heavy math and focused on teaching practical libraries: TensorFlow, NumPy, Pandas, and Scikit-Learn.

Due to unforeseen circumstances, the 2025-2026 school year began without a solid foundation for the course. This prompted a comprehensive reflection and led us to rebuild the syllabus from scratch. While our original intention (a gentle introduction to machine learning) stayed unchanged, everyone involved agreed we needed a more rigorous approach alongside an emphasis on intuition.

This task proved difficult because the subject matter of the course is inherently interdisciplinary. We needed to make the course more intuitive and less mathematical while still covering the field's fundamentals.

This textbook represents a collection of lessons, notes, and exercises designed to serve as the course's foundation. At the time of writing, there are no mathematical prerequisites, yet mathematics permeates the field of AI and ML. We faced a choice: omit the mathematics and build a less rigorous course emphasizing breadth, or include it and slow our pace. We chose the latter.

This textbook is written for readers with no calculus background. Rather than requiring formal mathematical preparation, it assumes basic algebraic understanding and builds intuition about which mathematical operations are necessary. We will never ask you to compute complex formulas without first teaching you what each part represents; we focus on understanding and developing an intuition for what the math represents. This is a practical course, not a theoretical one. It will give you a solid foundation to build upon if you decide to pursue a career in the field. However, this book is by no means a replacement for the mathematics you'll need to learn if you decide to continue studying this field; we instead hope to give you a starting point on which you can build.

We gratefully acknowledge the following contributors:

Primary Writers:
Aniketh Chenjeri (CCHS '26)
Andrew Doyle (CCHS '26)
Mr. Igor Tomcej
Reviewers:
Hariprasad Gridharan (CCHS '25, Cornell '29)
Siddharth Menon (CCHS '26)
Ani Gadepalli (CCHS '26)

How Is This Book Structured?

NOTE: This text is a living document, currently undergoing active development as part of our commitment to pedagogical excellence. In order to ensure rigorous academic standards, chapters are released sequentially following comprehensive peer review. This is to say that the version of the text you are viewing right now is not the final one; expect updates to various parts of the book as we continue to refine and improve the content.

While every effort is made to provide an accurate and authoritative resource, please note that early editions may contain errata. We encourage students to actively engage with the material; should you identify any discrepancies or technical inaccuracies, please report them to your teacher or teaching assistant(s) for correction in future revisions.

We appreciate your cooperation in refining this resource for future cohorts. Your feedback is instrumental in ensuring the highest quality of instruction and material.

This book is written to emphasize an understanding of fundamental concepts in Artificial Intelligence and Machine Learning. We will begin with supervised learning and its applications. We are starting with supervised learning because it's one of the most common techniques used by practitioners. It's also the most intuitive and easy to understand. In this section we'll also slowly introduce mathematical concepts and give you exercises to solidify your understanding. We will never require you to do any calculus; however, it becomes impossible to understand many algorithms without it. As such we will always "solve" any calculus involved and ask you to interpret and apply it. This doesn't mean this course will omit math entirely; we will learn a lot of applied linear algebra, as it is the core of how machine learning works. Under supervised learning you will also learn important concepts for assessing the accuracy of a model and gain some insight into which architectures are used for which scenarios. To truly accomplish this we need to understand many statistics concepts; as a result this book will cover many topics in statistics and probability. Some of these concepts will be familiar to you if you have taken an AP Statistics class or equivalent.

We will then move on to unsupervised learning, where we'll make a brief stop with the k-means clustering algorithm and then move to neural networks. Understanding neural networks is the ultimate goal of this class, as they are a ubiquitous and powerful tool. If you gain an understanding of neural networks you will be able to understand many complex algorithms such as Large Language Models, which are the foundation for tools like ChatGPT, Google Gemini, Anthropic's Claude, and more.

Towards the end of this book we've compiled a series of documentation-like chapters for various libraries, frameworks, and mathematical concepts. If you find yourself not understanding certain concepts, tools, etc., you can always refer to these documents.

The only assumption we make in the writing of this book is some familiarity with Python (and programming in general) and Algebra 2. Even though we will cover theory in this class, it will be a programming class first and foremost.
You will write a lot of code, but you will also be asked to understand theory and math.

In the compilation of this book we've pulled from various resources:

An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
Various books by Justin Skycak (including Introduction to Algorithms and Machine Learning)
Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong
The Matrix Cookbook by Kaare Brandt Petersen and Michael Syskind Pedersen (Note: Often associated with the Joseph Montanez online version)

Part I: Supervised Learning

In supervised learning, we are provided with some input X and a corresponding response Y. Our fundamental objective is to estimate the relationship between these variables, typically modeled as

Y = f(X) + ε

where f is an unknown fixed function of the predictors and ε is a random error term, independent of X, with a mean of zero. Using a training dataset T = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, we aim to apply a learning method to estimate the function f̂. This estimate allows us to predict the response for previously unseen observations. It is also common to express this relationship as ŷ = f̂(X).

Definition 0.1. A feature is an individual, measurable property or characteristic of a phenomenon being observed. In machine learning, it serves as the input variable x that a model uses to identify patterns and make predictions. For example, if we want to predict the price of a house based on its size, number of bedrooms, and location, the features are size, number of bedrooms, and location.

Definition 0.2. A label is the output variable y that a model is trying to predict. For example, if we want to predict the price of a house based on its size, number of bedrooms, and location, the label is the price of the house.
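To make these two definitions concrete, here is a minimal sketch (with made-up numbers) of how the house-price features and labels could be stored as NumPy arrays. The specific values and the numeric encoding of location are purely illustrative assumptions, not data from the text.

import numpy as np

# Hypothetical data: three houses, each described by the features from
# Definition 0.1 (size in square feet, number of bedrooms, location code)
# and paired with the label from Definition 0.2 (sale price).
X = np.array([
    [1400, 3, 1],
    [2100, 4, 2],
    [ 900, 2, 1],
])
y = np.array([250_000, 410_000, 180_000])

print(X.shape)  # (3, 3): one row per house, one column per feature
print(y.shape)  # (3,): one label (price) per house

Each row of X is one observation x_i and the matching entry of y is its label y_i, exactly the pairs (x_i, y_i) that make up the training set T above.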
In the supervised learning portion of this textbook, we will learn the following:

Linear Regression via the Pseudoinverse: We will learn the closed-form analytical solution for parameter estimation using the Moore-Penrose pseudoinverse and the Normal Equation. This includes multiple linear regression, where we will learn how to handle multiple predictors simultaneously.

Optimization via Gradient Descent and Stochastic Gradient Descent (SGD): We will learn the notation for and how to implement Gradient and Stochastic Gradient Descent for optimization.

The Bias-Variance Tradeoff: Understanding the decomposition of prediction error into reducible and irreducible components, and navigating the relationship between overfitting and underfitting.

Polynomial Regression: Extending the linear framework to capture non-linear relationships by mapping predictors into higher-dimensional feature spaces.

Shrinkage Methods (Ridge and Lasso): Applying L1 and L2 regularization to minimize the Residual Sum of Squares (RSS) while controlling model complexity and performing variable selection.

Logistic Regression: Transitioning to classification by modeling the conditional probability of qualitative responses using the logit transformation.

k-Nearest Neighbors (k-NN): Utilizing a non-parametric, memory-based approach to prediction based on local density and spatial proximity in the feature space.

Don't worry if none of that makes sense to you; we'll be covering it in detail in the coming chapters.

1 Linear Regression

1.1 What is Linear Regression?

To recall: supervised learning starts with known input data X and known output data Y, and we are asked to fit an equation to this dataset.

One of the simplest approaches is to assume a linear relationship between inputs and outputs. This assumption is usually wrong and somewhat contrived, but linear models serve an important purpose: they give you a baseline. Once you have a baseline model, you can test more complex models against it and see if they actually perform better.

Traditionally, linear regression requires heavy use of linear algebra. We're not going to get too into the weeds in this course since this course doesn't have a math prerequisite. Instead, we'll use Sklearn, a popular machine learning library. But before we jump into Sklearn's code, we need to build intuition about how its algorithms actually work. Once you understand the underlying process, using the library itself becomes straightforward. This means we'll need to do some math along the way.

Example 1.1.1. Let's engage in a hypothetical. Suppose you're given two thermometers and asked to measure the temperature on both scales. Let's say our results look like this:

Celsius (x)    Fahrenheit (y)
0              31.8
5              41.9
10             49.2
15             60.1
20             67.4
25             78.9
30             87.5

Let's plot the data.

Now we know that the equation for converting Celsius to Fahrenheit is

y = 1.8x + 32

But assume you don't know this equation and are asked to find it purely based on the data. After sitting with the problem for a while, you'll probably realize that you can use the equation m = (y_2 - y_1) / (x_2 - x_1) to estimate the coefficient in the equation, and we've been given a y-intercept at (0, 31.8). So from our given data, after computing, we can say our equation is

ŷ = 1.85x + 31.8

Note: In this equation, the hat symbol (^) indicates a predicted value or an estimated parameter, not a measured or input variable. So in our model ŷ = 1.85x + 31.8, ŷ represents the predicted Fahrenheit temperature based on the input x (in Celsius). The hat shows that this value comes from our model's estimation, not directly from observed data. You'll often see this notation in statistics and machine learning to distinguish predicted outputs (ŷ) and estimated coefficients (Ŵ, b̂) from true or observed values.

Congratulations, you've just made your first statistical model.
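Here is a minimal sketch of that two-point estimate in Python, using the thermometer table above. Which pair of points to use is an assumption; this snippet takes the first and last rows and gets a slope of roughly 1.86, close to the 1.85 quoted above, and different pairs give slightly different answers, which is exactly the weakness discussed in the next section.

# Two-point estimate from Example 1.1.1: slope from the first and last
# measurements, intercept read off the reading at 0 degrees Celsius.
celsius    = [0, 5, 10, 15, 20, 25, 30]
fahrenheit = [31.8, 41.9, 49.2, 60.1, 67.4, 78.9, 87.5]

m = (fahrenheit[-1] - fahrenheit[0]) / (celsius[-1] - celsius[0])  # rise over run
b = fahrenheit[0]                                                  # y-intercept at (0, 31.8)

print(m, b)        # about 1.857 and 31.8
print(m * 20 + b)  # predicted Fahrenheit at 20 degrees C, about 68.9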
1.2 The Limits of Simple Estimation

But there are a few problems with this simple approach to estimating linear equations from a given dataset. Let's look at what might happen if you actually went around campus measuring temperature.

The real data that was collected by the AI class in 2025 looks like this:

This data is much more realistic, and it highlights why our simplistic approach doesn't scale: thermometers aren't perfect. Maybe one was a little old, maybe you were standing in direct sunlight for one measurement, or maybe a gust of wind hit one of them. That's exactly why we use methods like linear regression and error measures. Instead of trusting a single pair of points, regression finds the line that best fits all our noisy data, balancing those little errors out. The goal isn't to make every point perfect (which you can't do anyway); it's to minimize the total amount of error across the whole dataset.

1.3 Linear Regression Formally Defined

Definition 1.3.1. To restate our problem: we have a given dataset composed of an input X for which we have an output Y, and our job is to develop an equation that encapsulates the relationship between X and Y in some equation ŷ = Wx + b, where W and b are accurate estimates of the real values (in this case 1.8 and 32 respectively).

Let's say that this is our data; it's randomly generated noisy data and we are using it as a proxy for a relatively large amount of real data.

1.4 Understanding the Algorithm

Let's break down what sklearn does under the hood. The core method it uses to estimate W and b is actually quite simple. It starts by picking a random value for W and b, then checks how accurate those values are, then keeps adjusting those numbers until our model becomes accurate.

But we need to slow down and somewhat rigorously lay out what each of those statements means. The statement "picks a random value for W and b" is intuitive, but the big question is: how does it measure the accuracy of W and b, and how does it change those values?

1.5 Measuring and Interpreting Error

Let's start with the first question: how do we measure error?

Example 1.5.1. Let's say we have this data and some predicted values in ŷ:

x    y    ŷ
1    3    2.5
2    5    5.2
3    4    4.1
4    7    6.8
5    6    6.3

To measure how accurate ŷ is relative to y, we introduce a quantity called the RSS.

Definition 1.5.1. RSS stands for Residual Sum of Squares and is defined by the following equation:

RSS = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2

Don't be scared; this equation simply means that we take every single y value in the given dataset and subtract our model's estimated value from it. Squaring the residuals is done for the following reasons:

1. Squaring makes all residuals positive, so large underpredictions and overpredictions both contribute to the total error.
2. It penalizes larger errors more heavily: a residual of 4 counts far more (16) than a residual of 2 (which counts as 4). This makes the regression more sensitive to large deviations.
3. Squaring makes the loss function smooth and differentiable, which makes our life a lot easier later on.

If this still doesn't make sense, we can use this graphic to gain an intuition.

But RSS gives us a total error that grows with the number of points; we want one number that measures the average error overall. To do this we can define a function:

Definition 1.5.2. The Mean Squared Error (MSE) is defined as:

L(W, b) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2

or we can expand this to be:

L(W, b) = \frac{1}{N} \sum_{i=1}^{N} \big(y_i - (Wx_i + b)\big)^2

The 1/N here just averages the error over every point.
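As a quick check on these two definitions, here is the RSS and MSE worked out by hand for the five points in Example 1.5.1:

y_i - \hat{y}_i: \quad 0.5, \; -0.2, \; -0.1, \; 0.2, \; -0.3

\mathrm{RSS} = 0.5^2 + (-0.2)^2 + (-0.1)^2 + 0.2^2 + (-0.3)^2 = 0.43

\mathrm{MSE} = \frac{\mathrm{RSS}}{N} = \frac{0.43}{5} = 0.086

Notice that the MSE is just the RSS divided by the number of points, which is what lets us compare errors across datasets of different sizes.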
So great! We now have a really solid way to measure error. If we were to implement this in pure Python we would do something like this:

# Assume X is a spreadsheet of our data and X[i] is the ith row
def mse(W, b, X, y):
    n = len(X)
    total_error = 0
    for i in range(n):
        prediction = W * X[i] + b
        total_error += (y[i] - prediction) ** 2
    return total_error / n

1.6 Optimization with Gradient Descent

So we have a way to measure error with MSE. But now we face a new problem: how do we actually find the best values for W and b?

We can't just guess randomly forever. Instead, we need a systematic way to improve our guesses. This is where derivatives come in. Those of you who have taken a calculus class will be familiar with the concept of a derivative.

A derivative measures how much a function changes when you change its input slightly. Think of it like the slope of a hill. If you're standing on a hill and you want to know which direction is steepest, the slope tells you. A positive slope means the hill goes up in that direction, and a negative slope means it goes down.

In our case, we want to know: if I change W slightly, does my error go up or down? The derivative of the loss function with respect to W answers exactly that question. It tells us the slope of the error landscape.

If the derivative is positive, increasing W increases error, so we should decrease W. If the derivative is negative, increasing W decreases error, so we should increase W. By moving in the opposite direction of the derivative, we're moving downhill toward lower error.

This process is called gradient descent, and we update our weights using this rule:

W \leftarrow W - \eta \frac{\partial L}{\partial W}

Here, η is the learning rate, which controls how big each step is. Too small and learning is painfully slow. Too large and you might overshoot the best values entirely.

We do the same for b:

b \leftarrow b - \eta \frac{\partial L}{\partial b}

We repeat this process over and over, each time getting closer to the optimal W and b that minimize our error.

Since this class doesn't require you to do the math, we're going to give you the values of ∂L/∂W and ∂L/∂b:

\frac{\partial L}{\partial W} = -\frac{2}{N} \sum_{i=1}^{N} \big(y_i - (Wx_i + b)\big) x_i

\frac{\partial L}{\partial b} = -\frac{2}{N} \sum_{i=1}^{N} \big(y_i - (Wx_i + b)\big)

In Python this code looks like this:

# Here `error` holds the model's predictions minus the observed y values,
# so subtracting lr * dW and lr * db moves W and b downhill.
dW = (2.0 / N) * np.sum(error * x_scaled, dtype=np.float64)
W -= lr * dW
db = (2.0 / N) * np.sum(error, dtype=np.float64)
b -= lr * db
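Putting the pieces of this section together, here is a minimal sketch of a complete training loop (not the lab's official solution). It assumes x and y are one-dimensional NumPy arrays of the same length; the synthetic data, learning rate, and iteration count are illustrative choices. The snippet above refers to a rescaled input x_scaled; this sketch instead uses the raw x with a small learning rate.

import numpy as np

def fit_linear(x, y, lr, epochs):
    """Fit y_hat = W * x + b by gradient descent on the MSE loss."""
    N = len(x)
    W, b = 0.0, 0.0                # arbitrary starting guesses
    for _ in range(epochs):
        error = (W * x + b) - y    # predictions minus observed values
        dW = (2.0 / N) * np.sum(error * x)
        db = (2.0 / N) * np.sum(error)
        W -= lr * dW               # step opposite the gradient
        b -= lr * db
    return W, b

# Illustrative noisy data shaped like the thermometer example.
rng = np.random.default_rng(0)
x = rng.uniform(0, 30, size=100)
y = 1.8 * x + 32 + rng.normal(scale=1.0, size=100)

W, b = fit_linear(x, y, lr=0.001, epochs=20_000)
print(W, b)  # should land near 1.8 and 32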
1.7 Understanding Gradients

If you're still struggling with the concept of gradients, let's break it down even more. Instead of a function with two inputs, let's start with a function with one input, something like f(x) = x^2. We've all seen this function; when graphed it is a parabola.

Let's say we have some model ŷ = Wx, and let's say f(x) = x^2 represents the error for some weight (coefficient) W. So if W = 2 our model's error is 4, and so on. We want to find the lowest error value, so we want the absolute minimum (the lowest y value) of the function.

For this function there are various ways we could find it. Since we know that the lowest point of a parabola that opens upward is its vertex, you could write the function in vertex form and find it there. But let's say our loss landscape looks something like this: if we look at the graph we can find the lowest point by eye, but that isn't always possible. Gradient descent is a way of finding that lowest point. For functions with one input we have the derivative.

The derivative is the slope of the tangent line at a point. This literally means a straight line that indicates the direction of the function at that point. Visually, the derivative is the purple line below.

Figure 1: Example tangent line in purple for a loss function

This is the tangent line at x = 0.4, and notice that if we were to try and display the purple line as a linear equation in the form y = mx + b, then m, the slope of the line, would be negative. The actual value of m at x = 0.4 is -2.26528158057. That m value for the purple line is the derivative: it tells us the slope of our loss at that single point.

For a one-input function, the derivative of the original function f(x) is written f'(x). So in this instance our gradient descent algorithm would simply look like this for some weight W:

W \leftarrow W - \eta f'(W)

We now have a solid understanding of what derivatives in 2D are. This assumes that we are trying to model the equation ŷ = Wx. However, this approach isn't extremely useful on its own. Often we have multiple weights we want to estimate from our data. This makes things more complicated, and it is where the idea of gradients comes from.

Definition 1.7.1. A gradient is a generalization of the derivative for functions with multiple inputs. If your function depends on several variables, like f(x, y) = x^2 + y^2, then the gradient is a vector that collects all the partial derivatives, one for each variable:

\nabla f = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right) = (2x, 2y)

Each of the items inside \nabla f is a partial derivative, that is, the derivative of the function f with respect to one of its variables, which in this instance are x and y. So in our gradient descent algorithm,

\frac{\partial L}{\partial W} = -\frac{2}{N} \sum_{i=1}^{N} \big(y_i - (Wx_i + b)\big) x_i

\frac{\partial L}{\partial b} = -\frac{2}{N} \sum_{i=1}^{N} \big(y_i - (Wx_i + b)\big)

we are simply finding the equivalent of \nabla f for the loss function L(W, b) with respect to the variables W and b.

Below is the function L(W, b) graphed out for the model that we're training in your next lab.

Figure 2: The loss landscape for our model ŷ = 1.85x + 31.47. The red line represents the various weights our model tried and its path to reach the optimal weights with gradient descent. Click on this link if you want to see a video of the model being trained and gradient descent working in real time.
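The surface in Figure 2 can be reproduced numerically. The sketch below evaluates the MSE loss L(W, b) over a grid of candidate (W, b) pairs on some illustrative data; the data, grid ranges, and resolution are assumptions, not values from the lab.

import numpy as np

def mse_loss(W, b, x, y):
    # L(W, b) = (1/N) * sum((y_i - (W * x_i + b))^2)
    return np.mean((y - (W * x + b)) ** 2)

# Illustrative noisy data standing in for the lab's dataset.
rng = np.random.default_rng(1)
x = rng.uniform(0, 30, size=100)
y = 1.8 * x + 32 + rng.normal(scale=1.0, size=100)

# Evaluate the loss at every point of a grid of candidate weights.
Ws = np.linspace(0.0, 4.0, 81)
bs = np.linspace(0.0, 64.0, 81)
losses = np.array([[mse_loss(W, b, x, y) for b in bs] for W in Ws])

# The grid cell with the smallest loss should sit near W = 1.8, b = 32.
i, j = np.unravel_index(np.argmin(losses), losses.shape)
print(Ws[i], bs[j], losses[i, j])

Plotting losses as a contour or 3-D surface produces the bowl-shaped landscape shown in Figure 2, and overlaying the (W, b) values visited during training traces out the red descent path.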
1.8 Lab: Building Linear Regression from Scratch

In this lab you're going to build a linear regression from scratch. It is a one-to-one application of the algorithm we've been building in the previous lesson. You will use just the numpy library and no other libraries. This is a good exercise to get you familiar with the code and the math.

1.9 Problem Packet

Theory Questions

Problem 1. The text describes linear models as a "baseline." Explain the importance of establishing a baseline model before moving on to more complex machine learning algorithms.

Problem 2. In the equation ŷ = 1.85x + 31.8, explain what the "hat" notation (^) signifies and why it is crucial for distinguishing between types of data in statistics.

Problem 3. The lesson provides three specific reasons for squaring residuals in the RSS formula. List them and explain why making the loss function "smooth and differentiable" is beneficial for optimization.

Problem 4. What is the mathematical difference between Residual Sum of Squares (RSS) and Mean Squared Error (MSE)? Why is MSE generally preferred when working with datasets of varying sizes?

Problem 5. Explain the role of the learning rate η. Based on the "hill" analogy, what physically happens to our "steps" if η is too large versus too small?

Problem 6. Define a gradient in the context of multiple variables (W and b). How does a gradient differ from a standard 2D derivative?

Practice Problems

Problem 7. You are given the coefficients a = 1, b = 4, and c = 2 for the function f(x) = x^2 + 4x + 2. Using the derivative f'(x) = 2x + 4, write a Python function to find the minimum of f(x) using gradient descent. Start at x = 10, use η = 0.1, and run for 10 iterations.

Problem 8. Calculate the RSS and MSE by hand for the following dataset given the model ŷ = 2x + 1:

x = [1, 2, 3]
y = [3, 6, 7]

Problem 9. Given x = [1, 2, 3] and y = [2, 3, 4], and initial parameters W = 0 and b = 0, compute:

- The predicted values ŷ
- The residuals (y_i - ŷ_i)
- The current MSE

Problem 10. Using the data and initial parameters from Problem 9, perform one full batch gradient descent update to find W_new and b_new. Use η = 0.1 and the formulas:

\frac{\partial L}{\partial W} = -\frac{2}{N} \sum_{i=1}^{N} \big(y_i - (Wx_i + b)\big) x_i

\frac{\partial L}{\partial b} = -\frac{2}{N} \sum_{i=1}^{N} \big(y_i - (Wx_i + b)\big)

Note: Use the sign convention from the provided Python code where the gradient is subtracted.

Problem 11. A thermometer model is trained to ŷ = 1.85x + 31.8. If the actual temperature is 0 °C and the observed Fahrenheit reading is 31.8, what is the residual? If the actual temperature is 30 °C and the observed reading is 87.5, what is the residual?

Problem 12. Write a Python function get_error(y_true, y_pred) that returns the Mean Squared Error using only the standard library (no numpy). Assume both inputs are lists of equal length.

Appendix: Programming Reference

2 Python and Libraries

2.1 NumPy

Accessing NumPy

To use NumPy you must first import it. It is common practice to import NumPy with the alias np so that subsequent function calls are concise.

import numpy as np

Arrays and Vectors

In NumPy an array is a generic term for a multidimensional set of numbers. One-dimensional NumPy arrays act like vectors. The following code creates two one-dimensional arrays and adds them elementwise. If you attempted the same with plain Python lists you would not get elementwise addition.

x = np.array([3, 4, 5])
y = np.array([4, 9, 7])
print(x + y)  # array([ 7, 13, 12])

Matrices as Two-Dimensional Arrays

Matrices in NumPy are typically represented as two-dimensional arrays. The object returned by np.array has attributes such as ndim for the number of dimensions, dtype for the data type, and shape for the size of each axis.

x = np.array([[1, 2], [3, 4]])
print(x)        # array([[1, 2], [3, 4]])
print(x.ndim)   # 2
print(x.dtype)  # e.g. dtype('int64')
print(x.shape)  # (2, 2)

If any element passed into np.array is a floating point number, NumPy upcasts the whole array to a floating point dtype.

print(np.array([[1, 2], [3.0, 4]]).dtype)       # dtype('float64')
print(np.array([[1, 2], [3, 4]], float).dtype)  # dtype('float64')

Methods and Functions

Methods are functions bound to objects. Calling x.sum() calls the sum method with x as the implicit first argument. The module-level function np.sum(x) does the same computation but is not bound to x.

x = np.array([1, 2, 3, 4])
print(x.sum())    # method on the array object
print(np.sum(x))  # module-level function

The reshape method returns a new view with the same data arranged into a new shape. You pass a tuple that specifies the new dimensions.

x = np.array([1, 2, 3, 4, 5, 6])
print("beginning x:\n", x)
x_reshape = x.reshape((2, 3))
print("reshaped x:\n", x_reshape)

NumPy uses zero-based indexing. The first row and first column entry of x_reshape is accessed with x_reshape[0, 0]. The entry in the second row and third column is x_reshape[1, 2]. The third element of the original one-dimensional x is x[2].

print(x_reshape[0, 0])  # 1
print(x_reshape[1, 2])  # 6
print(x[2])             # third element of x

Views and Shared Memory

Reshaping often returns a view rather than a copy. Modifying a view will modify the original array because they share the same memory.
This behavior is important when you expect independent copies.

print("x before modification:\n", x)
print("x_reshape before modification:\n", x_reshape)
x_reshape[0, 0] = 5
print("x_reshape after modification:\n", x_reshape)
print("x after modification:\n", x)

If you need an independent copy, call x.copy() explicitly.

x = np.array([1, 2, 3, 4, 5, 6])
x_copy = x.copy()
x_reshape_copy = x_copy.reshape((2, 3))
x_reshape_copy[0, 0] = 99
print("x remains unchanged:\n", x)
print("x_reshape_copy changed:\n", x_reshape_copy)

Tuples are immutable sequences in Python and will raise a TypeError if you try to modify an element. This differs from NumPy arrays and Python lists.

my_tuple = (3, 4, 5)
# my_tuple[0] = 2  # would raise TypeError: 'tuple' object does not support item assignment

Transpose, ndim, and shape

You can request several attributes at once. The transpose T flips axes and is useful for matrix algebra.

print(x_reshape.shape, x_reshape.ndim, x_reshape.T)
# For example: ((2, 3), 2, array([[5, 4], [2, 5], [3, 6]]))

Elementwise Operations

NumPy supports elementwise arithmetic and universal functions such as np.sqrt. Raising an array to a power is elementwise.

print(np.sqrt(x))  # elementwise square root
print(x ** 2)      # elementwise square
print(x ** 0.5)    # alternative for square root

Random Numbers

NumPy provides random number generation. The signature for rng.normal is normal(loc=0.0, scale=1.0, size=None). The arguments loc and scale are keyword arguments for the mean and standard deviation, and size controls the shape of the output.

x = np.random.normal(size=50)
print(x)  # random sample from N(0,1), different each run

To create a dependent array, add a random variable with a different mean to each element.

y = x + np.random.normal(loc=50, scale=1, size=50)
print(np.corrcoef(x, y))  # correlation matrix between x and y

Reproducibility with the Generator API

To produce identical random numbers across runs, use np.random.default_rng with an integer seed to create a Generator object and then call its methods. The Generator API is the recommended approach for reproducibility.

rng = np.random.default_rng(1303)
print(rng.normal(scale=5, size=2))
rng2 = np.random.default_rng(1303)
print(rng2.normal(scale=5, size=2))
# Both prints produce the same arrays because the same seed was used.

When you use rng.standard_normal or rng.normal you are using the Generator instance, which ensures reproducibility if you control the seed.

rng = np.random.default_rng(3)
y = rng.standard_normal(10)
print(np.mean(y), y.mean())

Mean, Variance, and Standard Deviation

NumPy provides np.mean, np.var, and np.std as module-level functions. Arrays also have methods mean, var, and std. By default np.var divides by n. If you need the sample variance that divides by n minus 1, provide ddof=1.

rng = np.random.default_rng(3)
y = rng.standard_normal(10)
print(np.var(y), y.var(), np.mean((y - y.mean()) ** 2))
print(np.sqrt(np.var(y)), np.std(y))
# Use np.var(y, ddof=1) for sample variance dividing by n-1.

Axis Arguments and Row/Column Operations

NumPy arrays are row-major ordered. The first axis, axis=0, refers to rows and the second axis, axis=1, refers to columns. Passing axis into reduction methods lets you compute means, sums, and other statistics along rows or columns.

rng = np.random.default_rng(3)
X = rng.standard_normal((10, 3))
print(X)               # 10 by 3 matrix
print(X.mean(axis=0))  # column means
print(X.mean(0))       # same as previous

When you compute X.mean(axis=1) you obtain a one-dimensional array of row means, and when you compute X.sum(axis=0) you obtain column sums.
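As a quick illustration of those last two reductions, reusing the X defined just above:

print(X.mean(axis=1))  # row means: one value per row, shape (10,)
print(X.sum(axis=0))   # column sums: one value per column, shape (3,)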
Graphics with Matplotlib

Matplotlib is the standard plotting library. A plot consists of a figure and one or more axes. The subplots function returns a tuple containing the figure and the axes. The axes object has a plot method and other methods to customize titles, labels, and markers.

from matplotlib.pyplot import subplots

fig, ax = subplots(figsize=(8, 8))
rng = np.random.default_rng(3)
x = rng.standard_normal(100)
y = rng.standard_normal(100)
ax.plot(x, y)       # default line plot
ax.plot(x, y, 'o')  # scatter-like circles
# To save: fig.savefig("scatter.png")
# To display in an interactive session: import matplotlib.pyplot as plt; plt.show()

Practical Notes

Using np.random.default_rng for all random generation in these examples makes results reproducible across runs on the same NumPy version. As NumPy changes across versions there may be minor differences in outputs for some operations. When computing variance, note the ddof argument if you expect sample variance rather than population variance.

Appendix: Math Fundamentals

3 Calculus

3.1 Limits

3.2 Derivatives

Limit Definition of Derivative

3.3 Gradients

Vector Valued Functions

Gradient Definition

Partial Derivatives and Rules

4 Linear Algebra

5 Statistics and Probability

Reference

Glossary of Definitions

Definition 0.1 (feature)
Definition 0.2 (label)
Definition 1.3.1 (linear regression)
Definition 1.5.1 (Residual Sum of Squares)
Definition 1.5.2 (Mean Squared Error)
Definition 1.7.1 (gradient)