FAQs
Moving forward, we're going to try and update this page each week to provide answers to questions asked (1) live in lecture, (2) at q.dsc40a.com during lecture, (3) on Ed, and (4) in the relevant Reflection and Feedback Form. If you have other related questions, feel free to post them on Ed.
Jump to:
- Weeks 3-5: Regression and Linear Algebra
- Week 2: Loss Functions, Center and Spread, and Simple Linear Regression
Weeks 3-5: Regression and Linear Algebra
Coming soon...
- Can you recap the proof of the formula for \(w_1^*\) that includes \(r\)?
- What do you mean by "the inner dimensions need to match in order to perform matrix multiplication"?
- What's the relationship between spans, projections, and multiple linear regression?
- Why does the design matrix have a column of all 1s?
- What is the projection of \(\vec{y}\) onto \(\text{span}(\vec{x})\): is it \(w^*\) or \(w^* \vec{x}\)?
- Do the normal equations work even when there is only one column in the matrix \(X\)?
- When do two vectors in \(\mathbb{R}^2\) span all of \(\mathbb{R}^2\)? When do \(n\) vectors in \(\mathbb{R}^n\) span all of \(\mathbb{R}^n\)?
- When \(X^TX\) isn't invertible, how do we solve the normal equations?
- What does it mean for a matrix to be full rank?
- In multiple linear regression, is \(\vec{h}^*\) orthogonal to \(\vec{y}\)?
- Why does the multiple linear regression model with two features look like a plane?
Is there a more detailed version of the MSE proof shown in Lecture 5?
Yes. Here's a proof of the fact that \(R_\text{sq}(w_0^*, w_1^*) = \sigma_y^2 (1 - r^2)\).
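First, note that since \(\sigma_x^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2\), we have that \(\sum_{i = 1}^n (x_i - \bar{x})^2 = n \sigma_x^2\). We'll also use the facts that \(w_0^* = \bar{y} - w_1^* \bar{x}\) and \(w_1^* = r \frac{\sigma_y}{\sigma_x}\), which means \(w_1^{*2} \sigma_x^2 = r^2 \sigma_y^2\). Then: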
\[\begin{aligned}
R_{\text{sq}}( w_0^*, w_1^* ) &= \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y} - w_1^*(x_i - \bar{x}))^2 \\
&= \frac{1}{n} \sum_{i=1}^{n} \left[ (y_i - \bar{y})^2 - 2 w_1^*(x_i - \bar{x})(y_i - \bar{y}) + w_1^{*2} (x_i - \bar{x})^2 \right] \\
&= \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 - \frac{2w_1^*}{n} \sum_{i=1}^{n} ((x_i - \bar{x})(y_i - \bar{y})) + \frac{w_1^{*2}}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
&= \sigma_y^2 - \frac{2w_1^*}{n} \sum_{i=1}^{n} ((x_i - \bar{x})(y_i - \bar{y})) + w_1^{*2} \sigma_x^2 \\
&= \sigma_y^2 - \frac{2w_1^*}{n} \frac{\sum_{i=1}^{n} ((x_i - \bar{x})(y_i - \bar{y}))}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right) + r^2 \sigma_y^2 \\
&= \sigma_y^2 - 2w_1^* \frac{\sum_{i=1}^{n} ((x_i - \bar{x})(y_i - \bar{y}))}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \cdot \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} + r^2 \sigma_y^2 \\
&= \sigma_y^2 - 2w_1^{*2} \sigma_x^2 + r^2 \sigma_y^2 \\
&= \sigma_y^2 - 2\left(r^2\frac{\sigma_y^2}{\sigma_x^2}\right) \sigma_x^2 + r^2 \sigma_y^2 \\
&= \sigma_y^2 - 2r^2\sigma_y^2 + r^2 \sigma_y^2 \\
&= \sigma_y^2 - r^2 \sigma_y^2 \\
&= \sigma_y^2 (1 - r^2)
\end{aligned}\]
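If you'd like to sanity-check the identity numerically, here's a minimal sketch in plain Python. The data is made up for illustration, and the helper names (`fit_line`, `mse`, `correlation`) are hypothetical, not something from lecture.

```python
from statistics import fmean, pvariance

def fit_line(xs, ys):
    """Return (w0*, w1*) from the closed-form least squares formulas."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    w1 = numerator / denominator
    w0 = y_bar - w1 * x_bar
    return w0, w1

def mse(xs, ys, w0, w1):
    """Mean squared error of the line H(x) = w0 + w1 * x."""
    return fmean([(y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys)])

def correlation(xs, ys):
    """Correlation coefficient r, using population (1/n) standard deviations."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    covariance = fmean([(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)])
    return covariance / (pvariance(xs) ** 0.5 * pvariance(ys) ** 0.5)

# Made-up data, for illustration only.
xs = [8.0, 8.5, 9.0, 9.5, 10.0, 10.5]
ys = [92.0, 85.0, 80.0, 71.0, 66.0, 58.0]

w0_star, w1_star = fit_line(xs, ys)
r = correlation(xs, ys)

left_side = mse(xs, ys, w0_star, w1_star)   # R_sq(w0*, w1*)
right_side = pvariance(ys) * (1 - r ** 2)   # sigma_y^2 * (1 - r^2)

# The two values should agree up to floating point rounding.
print(left_side, right_side)
```

Note that `pvariance` divides by \(n\), which matches the definition of \(\sigma_y^2\) used in the proof.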
Week 2: Loss Functions, Center and Spread, and Simple Linear Regression
Isn't the mean affected by outliers? How is it the best prediction?
A prediction is only the "best" relative to some loss function. When using the constant model, \(H(x) = h\), the mean is the best prediction only if we choose to use the squared loss function, \(L_\text{sq}(y_i, h) = (y_i - h)^2\). If we choose another loss function, like absolute loss \(L_\text{abs}(y_i, h) = \lvert y_i - h \rvert\), the mean is no longer the best prediction.
The key idea is that different loss functions lead to different "best" parameters.
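To see this concretely, here's a small sketch that searches a grid of candidate predictions for the dataset 72, 90, 61, 85, 92 and reports which candidate is best under each loss. The grid search and the function names are my own choices for illustration. In lecture we found the minimizers with calculus and a counting argument instead.

```python
from statistics import fmean, median

def mean_squared_error(ys, h):
    """Average squared loss of the constant prediction h."""
    return fmean([(y - h) ** 2 for y in ys])

def mean_absolute_error(ys, h):
    """Average absolute loss of the constant prediction h."""
    return fmean([abs(y - h) for y in ys])

ys = [72, 90, 61, 85, 92]

# Candidate predictions 60.0, 60.1, ..., 95.0.
candidates = [k / 10 for k in range(600, 951)]

best_sq = min(candidates, key=lambda h: mean_squared_error(ys, h))
best_abs = min(candidates, key=lambda h: mean_absolute_error(ys, h))

print("best under squared loss: ", best_sq, "| mean:  ", fmean(ys))
print("best under absolute loss:", best_abs, "| median:", median(ys))

# Replace 92 with a made-up outlier, 292: the mean moves, the median doesn't.
ys_outlier = [72, 90, 61, 85, 292]
print("with outlier, mean:", fmean(ys_outlier), "| median:", median(ys_outlier))
```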
Does empirical risk = mean squared error?
"Empirical risk" is another term for "average loss for whatever loss function you're using." Any loss function \(L(y_i, h)\) can be used to create an empirical risk function \(R(h)\). We've seen two common loss function choices:
- When using absolute loss, \(L_\text{abs}(y_i, h) = \lvert y_i - h\rvert\), the empirical risk, \(R_\text{abs}(h) = \frac{1}{n} \sum_{i = 1}^n \lvert y_i - h\rvert\), has a special name: "mean absolute error."
- When using squared loss, \(L_\text{sq}(y_i, h) = (y_i - h)^2\), the empirical risk, \(R_\text{sq}(h) = \frac{1}{n} \sum_{i = 1}^n (y_i - h)^2\), has a special name: "mean squared error."
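In code, the "plug in any loss function" idea might look like the sketch below. The function `empirical_risk` and the choice of \(h = 75\) are hypothetical and just for illustration.

```python
def empirical_risk(loss, ys, h):
    """Average of loss(y_i, h) over all data points, i.e. R(h)."""
    return sum(loss(y, h) for y in ys) / len(ys)

def absolute_loss(y, h):
    return abs(y - h)

def squared_loss(y, h):
    return (y - h) ** 2

ys = [72, 90, 61, 85, 92]
h = 75  # an arbitrary constant prediction

mean_absolute_error = empirical_risk(absolute_loss, ys, h)
mean_squared_error = empirical_risk(squared_loss, ys, h)

print("mean absolute error at h = 75:", mean_absolute_error)
print("mean squared error at h = 75: ", mean_squared_error)
```

The same `empirical_risk` function works for any other loss you pass in, which is the sense in which mean squared error is just one example of empirical risk.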
What does it mean for a minimizer to be unique?
Let's suppose we're working with the constant model, \(H(x) = h\).
The minimizer of mean squared error is unique, because the minimizer of mean squared error for the constant model is the mean, and the mean of a collection of numbers \(y_1, y_2, ..., y_n\) is always just a single number. Specifically, it's the number \(\frac{y_1 + y_2 + ... + y_n}{n}\).
The minimizer of mean absolute error is not necessarily unique. It's unique when there's an odd number of data points: if the data points are sorted in order, with \(y_1\) being the smallest and \(y_n\) being the largest, then the minimizer of mean absolute error is the median, \(y_{\frac{n+1}{2}}\). But if there's an even number of data points, then any of the infinitely many numbers on the number line between \(y_{\frac{n}{2}}\) and \(y_{\frac{n}{2} + 1}\) minimizes mean absolute error.
For example, in the dataset 72, 90, 61, 85, 92, 75, there are infinitely many predictions that minimize mean absolute error. 75 is one of them, but so are 75.001, 76, 79.913, and so on. Anything between 75 and 85, inclusive, minimizes mean absolute error.
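Here's a quick sketch that checks this for the dataset above by evaluating mean absolute error at a few predictions inside and just outside the interval from 75 to 85. The particular values of \(h\) are the ones mentioned above, plus 74 and 86, which I added for comparison.

```python
def mean_absolute_error(ys, h):
    """Average absolute loss of the constant prediction h."""
    return sum(abs(y - h) for y in ys) / len(ys)

ys = [72, 90, 61, 85, 92, 75]

inside = [75, 75.001, 76, 79.913, 85]   # between the two middle values
outside = [74, 86]                      # just outside that interval

risks_inside = [mean_absolute_error(ys, h) for h in inside]
risks_outside = [mean_absolute_error(ys, h) for h in outside]

for h, risk in zip(inside + outside, risks_inside + risks_outside):
    print(h, risk)
```

Every prediction in `inside` gives the same mean absolute error (up to floating point rounding), while the predictions in `outside` give a larger one.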
What was the point of plugging in \(h^*\) into \(R(h)\)?
We spent the first week of class minimizing empirical risk, \(R(h)\). We found that, depending on our choice of loss function, \(h^*\) ended up being a different measure of the center of our dataset. The point was to show that the values of \(R(h)\) actually have some meaning as well, and in particular, the smallest possible value of \(R(h)\) (which is \(R(h^*)\)) happens to describe the spread of our dataset.
In the image above, \(h^*\) is the \(x\)-coordinate of the vertex (80 and 85). We know what 80 and 85 mean: they're the mean and median of the dataset 72, 90, 61, 85, 92, respectively. What we were trying to give context to is what 150 and 9.9 mean: they're the variance and the mean absolute deviation from the median of our dataset. Both the variance and mean absolute deviation from the median are measurements of spread.
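In symbols, for the constant model, plugging \(h^*\) back into each empirical risk function gives the following. This is just a restatement of the standard definitions, with \(\bar{y}\) denoting the mean of the data.

\[R_\text{sq}(h^*) = R_\text{sq}(\bar{y}) = \frac{1}{n} \sum_{i = 1}^n (y_i - \bar{y})^2 = \sigma_y^2\]

\[R_\text{abs}(h^*) = \frac{1}{n} \sum_{i = 1}^n \lvert y_i - \text{Median}(y_1, y_2, ..., y_n) \rvert\]

The first expression is the variance, and the second is the mean absolute deviation from the median.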
Are there more loss functions outside of what we learned in class?
There are plenty! For example, there's Huber loss, which is like a smoothed version of absolute loss (it's absolute loss, with the corner at the bottom replaced with the bottom of a parabola). There's also cross-entropy loss, also known as "log loss", which is designed for models that predict probabilities (like logistic regression). These, and many more, will come up in future ML classes, like DSC 140A and CSE 158/DSC 148.
How do I know which loss function to choose in practice?
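As we've seen, different loss functions have different properties. At least with regard to the constant model:
- Squared loss is minimized by the mean, which is sensitive to outliers.
- Absolute loss is minimized by the median, which is robust to outliers.
- Infinity loss is minimized by the midrange, which is extremely sensitive to outliers.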
In practice, various models have a "default" choice of loss function. Regression usually uses squared loss, not just because squared loss is easily differentiable, but also because squared loss comes with lots of nice theoretical properties (which you'll learn about in DSC 140A, like the fact that using squared loss implicitly assumes that the distribution of errors is normal/Gaussian). But depending on your model, you can just try different loss functions and see which ends up creating the model with the best performance!
What was the point of the midrange and infinity loss? Will I actually use that in practice?
I've never heard of anyone using \(\lvert y_i - h\rvert^p\) with \(p \rightarrow \infty\) as a loss function in practice, so no. But the point of us studying that was for us to get a better understanding of how different loss functions penalize different kinds of errors, and in particular, how the optimal constant prediction is influenced by outliers.
Again, for the constant model \(H(x) = h\):
- Absolute loss, \(\lvert y_i - h\rvert\), isn't sensitive to outliers; it's very robust. Remember, the minimizer (the median) was found by finding the \(h\) where (# points to the left of \(h\) = # points to the right of \(h\)).
- Squared loss, \((y_i - h)^2\), is more sensitive to outliers. Remember, the minimizer (the mean) was found by finding the \(h\) where \(-\frac{2}{n} \sum_{i = 1}^n (y_i - h)= 0\), because \(-\frac{2}{n} \sum_{i = 1}^n (y_i - h)\) is the derivative of \(R_\text{sq}(h) = \frac{1}{n} \sum_{i = 1}^n (y_i - h)^2\). Since this is the case, the mean is "pulled" in the direction of the outliers, since it needs to balance the deviations.
- Following the pattern, \(\lvert y_i - h\rvert^3\) would be even more sensitive to outliers.
As we keep increasing the exponent, \(\lvert y_i - h\rvert^p\) creates a prediction that's extremely sensitive to outliers, to the point where its goal is to minimize the worst-case (maximum) distance from any one point. That's where the midrange comes in: it's in the middle of the data, so it's not too far from any one point.
So while no, you won't really use the idea of "infinity loss" in practice, I hope that by deeply understanding how it works, you'll better understand how loss functions (including those we haven't seen in class, but do exist in the real world) work and impact your predictions.
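If you want to watch this happen, here's a rough sketch that finds the best constant prediction for \(\lvert y_i - h\rvert^p\) by grid search as \(p\) grows. The dataset is made up (I replaced 92 with an outlier, 192), and the grid search and the names `power_risk` and `best_constant` are my own simplifications, not something from lecture.

```python
def power_risk(ys, h, p):
    """Average of |y_i - h|^p over the data."""
    return sum(abs(y - h) ** p for y in ys) / len(ys)

def best_constant(ys, p, candidates):
    """The candidate h with the smallest average |y_i - h|^p."""
    return min(candidates, key=lambda h: power_risk(ys, h, p))

# Made-up data with one large outlier.
ys = [72, 90, 61, 85, 192]

# Candidate predictions 61.0, 61.1, ..., 192.0.
candidates = [k / 10 for k in range(610, 1921)]

best = {p: best_constant(ys, p, candidates) for p in [1, 2, 3, 10, 50]}
midrange = (min(ys) + max(ys)) / 2

for p, h in best.items():
    print(f"p = {p}: best h on the grid is {h}")
print("midrange:", midrange)
```

With \(p = 1\) the best prediction sits at the median, with \(p = 2\) it sits at the mean, and as \(p\) increases it keeps moving toward the outlier until it settles near the midrange.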
In Lecture 4, is the \(x_i\) not part of the summation since it is out of the parentheses?
The question was referring to a summation like this one:
\[\sum_{i = 1}^n (y_i - w_0 - w_1 x_i) x_i\]
Here, \(x_i\) is indeed a part of the summation. The sum is of \(n\) terms, each of which is of the form \((y_i - w_0 - w_1 x_i) \cdot x_i\). That is, the summation above is equivalent to:
\[\sum_{i = 1}^n \left( (y_i - w_0 - w_1 x_i) x_i \right)\]
On the other hand, the following expression is invalid, since \(x_i\) doesn't have any meaning when not part of a summation over \(i\):
\[\left( \sum_{i = 1}^n (y_i - w_0 - w_1 x_i) \right) x_i\]
What was the 3D graph in Lecture 4 about?
On the left, we have the graph of the mean squared error of a constant prediction, \(h\), on the dataset 72, 90, 61, 85, 92. It shows us that there is some best \(h\), which we've been calling \(h^*\), that makes the mean squared error as small as possible. We showed, using calculus, that the value of \(h^*\) for any dataset is \(\text{Mean}(y_1, y_2, ..., y_n)\).
On the right, we have the graph of mean squared error of the line \(H(x) = w_0 + w_1 x\). The dataset is the dataset of departure times and commute times we've been using as our running example. Specifically:
- The two axes on the "ground" of the plot represent different intercepts, \(w_0\), and slopes, \(w_1\), that we could be using for making predictions.
- The height of the graph above any \((w_0, w_1)\) pair is \(\frac{1}{n} \sum_{i = 1}^n (y_i - (w_0 + w_1 x_i))^2\). \(x_i\) represents the \(i\)th departure time (e.g. 8.5, corresponding to 8:30AM) and \(y_i\) represents the \(i\)th actual commute time (e.g. 75 minutes).
The point was to show what the function \(R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i = 1}^n (y_i - (w_0 + w_1 x_i))^2\) actually looks like, before we went to use calculus to minimize it. It kind of looks like a bowl, and has a clearly defined minimum. Calculus helped us find that minimum, which occurs at \(w_0^* = \bar{y} - w_1^* \bar{x}\) and \(w_1^* = \frac{\sum_{i = 1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i = 1}^n (x_i - \bar{x})^2}\).
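To get a feel for the bowl shape without the plot, here's a minimal sketch that computes \(w_0^*\) and \(w_1^*\) with the formulas above and then compares the height of the surface there to the height at a few nearby \((w_0, w_1)\) pairs. The departure and commute times are made up, not the actual lecture dataset, and the step sizes are arbitrary.

```python
def risk(xs, ys, w0, w1):
    """R_sq(w0, w1): mean squared error of the line w0 + w1 * x."""
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up departure times (hours) and commute times (minutes).
xs = [7.0, 7.5, 8.0, 8.5, 9.0, 10.0]
ys = [95.0, 88.0, 84.0, 75.0, 70.0, 60.0]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# Closed-form minimizers of mean squared error.
w1_star = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
           / sum((x - x_bar) ** 2 for x in xs))
w0_star = y_bar - w1_star * x_bar
lowest = risk(xs, ys, w0_star, w1_star)

# Heights of the surface at points surrounding (w0*, w1*) on the "ground".
nearby = {
    (dw0, dw1): risk(xs, ys, w0_star + dw0, w1_star + dw1)
    for dw0 in (-10, 0, 10)
    for dw1 in (-1, 0, 1)
}

for (dw0, dw1), height in nearby.items():
    print(f"w0* {dw0:+}, w1* {dw1:+}: height {height:.2f}")
print(f"height at (w0*, w1*): {lowest:.2f}")
```

Every shifted pair gives a larger mean squared error than \((w_0^*, w_1^*)\), which is what you'd expect at the bottom of a bowl.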
Can we minimize the mean absolute error of the simple linear regression model?
Yes, we can! The issue is just that there doesn't exist a closed-form solution, i.e. a formula, for the optimal \(w_0^*\) and \(w_1^*\) in:
\[R_\text{abs}(w_0, w_1) = \frac{1}{n} \sum_{i = 1}^n \lvert y_i - (w_0 + w_1 x_i) \rvert\]
So, we have to use the computer to approximate the answer. Regression with squared loss is called "least squares regression," and regression with absolute loss is called "least absolute deviations regression." You can learn more here.
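As a rough illustration of "using the computer," here's one simple approach for small datasets. It's not what we did in lecture, and it's not necessarily what real libraries do. It relies on a known property of least absolute deviations regression: as long as the \(x_i\) aren't all equal, some minimizing line passes through at least two of the data points. So the sketch below tries the line through every pair of points and keeps the one with the lowest mean absolute error. The data and the function name `lad_line` are made up.

```python
from itertools import combinations

def mean_absolute_error(xs, ys, w0, w1):
    """R_abs(w0, w1): mean absolute error of the line w0 + w1 * x."""
    return sum(abs(y - (w0 + w1 * x)) for x, y in zip(xs, ys)) / len(xs)

def lad_line(xs, ys):
    """Try the line through each pair of points; return (risk, w0, w1) of the best."""
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue  # a vertical line can't be written as w0 + w1 * x
        w1 = (y2 - y1) / (x2 - x1)
        w0 = y1 - w1 * x1
        risk = mean_absolute_error(xs, ys, w0, w1)
        if best is None or risk < best[0]:
            best = (risk, w0, w1)
    return best

# Made-up data; the last commute time is a deliberate outlier.
xs = [7.0, 7.5, 8.0, 8.5, 9.0, 10.0]
ys = [95.0, 88.0, 84.0, 75.0, 70.0, 130.0]

best_risk, w0, w1 = lad_line(xs, ys)
print(f"w0 = {w0:.2f}, w1 = {w1:.2f}, mean absolute error = {best_risk:.2f}")
```

This checks every pair of points, so it gets slow quickly as \(n\) grows. For larger datasets you'd want a numerical optimizer instead.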
Can you post the slides earlier than 20 minutes before lecture?
I'll try, but I'm making lots of changes to the lectures this quarter, and that usually takes me until right before lecture.