Feature vs. Parameter Spaces

A Match Made in Heaven?

This interactive demo helps you understand why we use mean squared error (or other averages) rather than simple sums of errors in regression problems. More importantly, it illustrates how feature space and parameter space are intrinsically connected through the concept of loss functions. You can interactively adjust model parameters and see how they affect both the model fit and the optimization landscape.


[Interactive demo: parameter sliders, live error readouts (Sum of Errors, Mean Error, Sum of Squared Errors, Mean Squared Error), and side-by-side Feature Space and Parameter Space plots.]

Why Do We Use Average Errors Instead of Summing?

When evaluating regression models, we typically use averaged error metrics (like Mean Squared Error) rather than simple sums of errors. Here's why:

The Problem with Summing Raw Errors:

For raw errors (predicted - actual), positive and negative errors cancel each other out: $$\sum_{i=1}^n (\hat{y}_i - y_i)$$ A model could make huge errors in both directions, but still have a near-zero sum!
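To make the cancellation concrete, here is a minimal sketch in Python (with made-up toy numbers, not data from the demo):

```python
import numpy as np

# Made-up example: the model misses every point by 5 units,
# overshooting half the targets and undershooting the other half.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([15.0, 15.0, 35.0, 35.0])   # raw errors: +5, -5, +5, -5

raw_errors = y_pred - y_true
print(raw_errors.sum())   # 0.0 -- large errors in both directions cancel exactly
```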

Squared or Absolute Errors:

We often use squared errors or absolute errors to make all errors positive: $$\text{Squared Error: } \sum_{i=1}^n (\hat{y}_i - y_i)^2 \quad \text{or} \quad \text{Absolute Error: } \sum_{i=1}^n |\hat{y}_i - y_i|$$
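Continuing the same toy example, squaring or taking absolute values keeps every error positive, so nothing cancels:

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([15.0, 15.0, 35.0, 35.0])   # raw errors: +5, -5, +5, -5
raw_errors = y_pred - y_true

print((raw_errors ** 2).sum())    # 100.0 -- sum of squared errors
print(np.abs(raw_errors).sum())   # 20.0  -- sum of absolute errors
```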

Why Average (Mean) Instead of Sum?

  1. Comparability Across Dataset Sizes: The error sum grows with the number of data points, making it difficult to compare models evaluated on datasets of different sizes (see the sketch after this list).
  2. Interpretation: Means are easier to interpret - they represent the typical error per data point.
  3. Mathematical Properties: For gradient-based optimization, using the mean only rescales the loss and its gradients by a constant factor (1/n); the shape of the optimization landscape and the location of its minimum are unchanged.
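As a rough illustration of the first point, this sketch (with synthetic residuals at a fixed noise level) shows the sum of squared errors ballooning with dataset size while the mean stays comparable:

```python
import numpy as np

rng = np.random.default_rng(0)

for n in (100, 10_000):
    errors = rng.normal(scale=2.0, size=n)   # residuals with standard deviation 2
    sse = (errors ** 2).sum()
    mse = (errors ** 2).mean()
    print(f"{n:>6} points: SSE = {sse:10.1f}   MSE = {mse:.2f}")
# SSE is roughly 100x larger on the bigger dataset; MSE stays near 4 in both cases.
```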

Parameter Space vs. Feature Space

In machine learning, we work in two complementary spaces that are connected through our loss function:

Feature Space: This is where your data lives and where you visualize your model's predictions. For regression, this typically shows input features (x-axis) and output values (y-axis). Each point represents an observation, and the line/curve shows your model's predictions across the input domain.

Parameter Space: This is the space of all possible parameter values for your model.

  • For a constant model: 1-dimensional space (just the constant c)
  • For a linear model: 2-dimensional space (slope m and intercept b)
  • For more complex models: Higher dimensional spaces

The error surface in parameter space shows how the error changes as you vary the model parameters. Each point in parameter space corresponds to a specific model configuration in feature space. The key insight is that certain loss functions (like squared error) create nice, convex error surfaces with a single global minimum, making optimization straightforward.
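As a rough sketch of that correspondence (using hypothetical data, not the demo's own code), you can evaluate the mean squared error of a linear model over a grid of (m, b) values and inspect the resulting surface:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)   # points scattered around y = 2x + 1

# A grid over parameter space: candidate slopes m and intercepts b.
m_grid, b_grid = np.meshgrid(np.linspace(0, 4, 200), np.linspace(-3, 5, 200))

# Predictions for every (m, b) pair, broadcast across all data points.
preds = m_grid[..., None] * x + b_grid[..., None]
mse = ((preds - y) ** 2).mean(axis=-1)   # one MSE value per point in parameter space

# The surface is a bowl (paraboloid); its single minimum sits near the true parameters.
i, j = np.unravel_index(mse.argmin(), mse.shape)
print(f"best m ≈ {m_grid[i, j]:.2f}, best b ≈ {b_grid[i, j]:.2f}, MSE ≈ {mse[i, j]:.2f}")
```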

Special Case: Raw Error Loss Landscapes

When you select "Raw Error" as your error metric, you'll observe some interesting behavior:

For the constant model: The parameter space shows a straight line rather than a curve. This happens because the mean raw error for a constant model is: $$\frac{1}{n}\sum_{i=1}^n (c - y_i) = c - \frac{1}{n}\sum_{i=1}^n y_i$$ This is a linear function of the parameter c, so it appears as a straight line in parameter space. The line crosses zero at c = mean(y), but a straight line has no minimum: the raw error can always be driven lower by decreasing c, so minimizing it never settles on a sensible constant.
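A quick numeric check of that identity, with toy values chosen only for illustration:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])   # mean(y) = 6.0

def mean_raw_error(c):
    return (c - y).mean()            # equals c - mean(y)

for c in (4.0, 6.0, 8.0):
    print(c, mean_raw_error(c))      # -2.0, 0.0, 2.0: linear in c, zero at the mean
# Note there is no minimum: making c ever more negative keeps pushing the raw error down.
```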

For the linear model: The parameter space shows a tilted plane containing a line where the error is zero. This happens because the mean raw error for a linear model is: $$\frac{1}{n}\sum_{i=1}^n (mx_i + b - y_i) = m\cdot\text{mean}(x) + b - \text{mean}(y)$$ This is linear in both m and b, so it forms a plane in (m, b) space. Any combination of m and b that satisfies m·mean(x) + b = mean(y) has zero raw error, and the error can again be driven arbitrarily low by moving off that line. This is why raw error is problematic for optimization: it provides neither a unique solution nor a true minimum.
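And a similar sketch for the linear model, checking that every (m, b) pair on the line m·mean(x) + b = mean(y) gives zero mean raw error:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])   # mean(x) = 2.5, mean(y) = 5.0

def mean_raw_error(m, b):
    return (m * x + b - y).mean()    # equals m * mean(x) + b - mean(y)

# Very different lines, all satisfying m * 2.5 + b = 5.0:
for m in (0.0, 2.0, 10.0):
    b = 5.0 - m * 2.5
    print(m, b, mean_raw_error(m, b))   # 0.0 in every case
```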

Try it yourself: Experiment with the sliders to see how changing parameters affects both spaces simultaneously. Notice how squared error creates a bowl-shaped (paraboloid) surface in parameter space, absolute error creates a more angular, piecewise-linear surface, and raw error produces only a tilted line or plane with no minimum at all.