This interactive demo helps you understand why we use mean squared error (or other averages) rather than simple sums of errors in regression problems. More importantly, it illustrates how feature space and parameter space are intrinsically connected through the concept of loss functions. You can interactively adjust model parameters and see how they affect both the model fit and the optimization landscape.
When evaluating regression models, we typically use averaged error metrics (like Mean Squared Error) rather than simple sums of signed errors. Here's why:
For raw errors (predicted - actual), positive and negative errors cancel each other out: $$\sum_{i=1}^n (\hat{y}_i - y_i)$$ A model could make huge errors in both directions, but still have a near-zero sum!
We therefore use squared errors or absolute errors so that every term is non-negative and errors cannot cancel: $$\text{Squared Error: } \sum_{i=1}^n (\hat{y}_i - y_i)^2 \quad \text{or} \quad \text{Absolute Error: } \sum_{i=1}^n |\hat{y}_i - y_i|$$
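To make the cancellation concrete, here is a minimal NumPy sketch with made-up numbers (the arrays `y_true` and `y_pred` are purely illustrative and not data from the demo):

```python
import numpy as np

# Made-up example: the model overshoots two points by 3 and
# undershoots the other two by 3.
y_true = np.array([10.0, 10.0, 10.0, 10.0])
y_pred = np.array([13.0, 7.0, 13.0, 7.0])

raw = y_pred - y_true                 # [ 3., -3.,  3., -3.]

print(np.sum(raw))                    # 0.0  -> the misfit is hidden
print(np.sum(raw ** 2))               # 36.0 -> squared error exposes it
print(np.sum(np.abs(raw)))            # 12.0 -> absolute error exposes it
```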
In machine learning, we work in two complementary spaces that are connected through our loss function:
Feature Space: This is where your data lives and where you visualize your model's predictions. For regression, this typically shows input features (x-axis) and output values (y-axis). Each point represents an observation, and the line/curve shows your model's predictions across the input domain.
Parameter Space: This is the space of all possible parameter values for your model. For a constant model it is just the single value c; for a linear model y = mx + b it is the (m, b) plane, where each point corresponds to one candidate line in feature space.
The error surface in parameter space shows how the error changes as you vary the model parameters. Each point in parameter space corresponds to a specific model configuration in feature space. The key insight is that certain loss functions (like squared error) create nice, convex error surfaces with a single global minimum, making optimization straightforward.
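As a rough sketch of that correspondence, assuming a simple linear model y ≈ m·x + b with squared error and synthetic data (all names and values below are illustrative, not part of the demo), you can evaluate the loss over a grid of parameter values and locate the single minimum of the bowl:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)   # noisy line

# Each (m, b) pair is one point in parameter space and one
# candidate line in feature space.
m_vals = np.linspace(0.0, 4.0, 81)
b_vals = np.linspace(-2.0, 4.0, 81)
m_grid, b_grid = np.meshgrid(m_vals, b_vals)

# Mean squared error for every (m, b) at once (broadcast over the data axis).
preds = m_grid[..., None] * x + b_grid[..., None]          # shape (81, 81, 50)
mse = np.mean((preds - y) ** 2, axis=-1)                   # shape (81, 81)

# The surface is a convex bowl with a single minimum near (m, b) = (2, 1).
i, j = np.unravel_index(np.argmin(mse), mse.shape)
print(m_grid[i, j], b_grid[i, j])
```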
When you select "Raw Error" as your error metric, you'll observe some interesting behavior:
For the constant model: The parameter space shows a straight line rather than a curve. This happens because the raw error function for a constant model is: $$\frac{1}{n}\sum_{i=1}^n (c - y_i) = c - \frac{1}{n}\sum_{i=1}^n y_i$$ This is a linear function of the parameter c, so it traces a straight line in parameter space. The line crosses zero at c = mean(y), but a linear function has no minimum: the raw error simply keeps decreasing as c decreases, which is the first hint that raw error is a poor optimization target.
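A short sketch with toy targets (purely illustrative values) makes the linearity easy to verify:

```python
import numpy as np

y = np.array([2.0, 5.0, 6.0, 7.0])          # toy targets, mean(y) = 5.0

def mean_raw_error(c, y):
    """Mean raw error of the constant model y_hat = c."""
    return np.mean(c - y)

for c in [3.0, 5.0, 7.0]:
    print(c, mean_raw_error(c, y))          # -2.0, 0.0, 2.0: linear in c

# The error keeps decreasing as c decreases, so there is no minimum,
# only a zero crossing at c = mean(y).
```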
For the linear model: The parameter space shows a flat, planar surface with a line along which the error is zero. This happens because the raw error function for a linear model is: $$\frac{1}{n}\sum_{i=1}^n (mx_i + b - y_i) = m\cdot\text{mean}(x) + b - \text{mean}(y)$$ This is linear in (m, b), so the error surface is a plane. Any combination of m and b that satisfies m·mean(x) + b = mean(y) has zero raw error! This is why raw error is problematic for optimization: it does not pick out a unique set of model parameters.
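The following toy sketch (illustrative values only) checks that very different (m, b) pairs on that line all produce zero mean raw error:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])           # mean(x) = 2.5
y = np.array([3.0, 5.0, 7.0, 9.0])           # mean(y) = 6.0

def mean_raw_error(m, b, x, y):
    """Mean raw error of the linear model y_hat = m*x + b."""
    return np.mean(m * x + b - y)

# Wildly different lines, all chosen so that m * mean(x) + b = mean(y):
for m in [0.0, 1.0, 2.0, 10.0]:
    b = 6.0 - m * 2.5
    print(m, b, mean_raw_error(m, b, x, y))  # 0.0 in every case
```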
Try it yourself: Experiment with the sliders to see how changing parameters affects both spaces simultaneously. Notice how squared error creates a bowl-shaped (paraboloid) surface in parameter space, absolute error creates a more angular surface with sharp creases, and raw error creates flat, linear surfaces (a line or a plane) with no unique minimum.