Weighted Least Squares and locally weighted linear regression

From the post on Closed Form Solution for Linear regression, we computed the parameter vector $\theta$ which minimizes the square of the error between the predicted value $h_{\theta}(x)$ and the actual output $y$ for all $j$ values in the training set. In that model all the $j$ values in the training set is given equal importance. Let us consider the case where it is known some observations are important than the other. This post attempts to the discuss the case where some observations need to be given more weights than others (also known as weighted least squares).

Table of Contents

Notations

Let’s revisit the notations.

$m$ be the number of training set (in our case top 50 articles),

$x$ be the input sequence (the page index),

$y$ be the output sequence (the page views for each page index)

$n$ be the number of features/parameters (=2 for our example).

The value of $(x^j,y^j)$ corresponds to the $j^{th}$ training set.

Let $w^j$ be the weight given to the $j^{th}$ training set.

The predicted the number of page views for a given page index using a hypothesis $h_{\theta}(x)$ defined as :

$\begin{array}{lll}h_{\theta}(x)&=&\theta{_0}x_0 + \theta{_1}x_1\\&=&\sum_{i=0}^{n-1}\theta{_i}x_i\\&=&\theta^Tx\\\end{array}$

Goal is to find the parameter vector $\theta$ which minimizes the square of the error between the predicted value $h_{\theta}(x)$ and the actual output $y$ for all $j$ values in the training set with weight $w^j$ i.e.

$\min_{\theta} \sum_{j=1}^m w^j\[h_{\theta}(x^j) - y^j\]^2$ .

From matrix algebra, we know that

$\sum_{j=1}^mw^j\[h_{\theta}(x^j) - y^j\]^2=$X\theta-Y$^TW$X\theta-Y$$ .

where,

$W=\[\begin{array}{mmm}w^1 & 0 & 0 \\ 0 & w^2 & 0 \\\vdots &\ddots&\vdots \\ 0 & 0 & w^m\end{array}\]$ is the diagonal matrix of dimension [m x m].

$X = \[\begin{array}{mm}x_0^1 & x_1^1 \\x_0^2 & x_1^2\\ \vdots & \vdots \\x_0^m & x_1^m\end{array}\]$ is the input sequence of dimension [m x n]

$Y = \[\begin{array}{m}y^1\\y^2\\\vdots \\y^m\end{array}\]$ is the measured values of dimension [m x 1]

$\theta=\[\begin{array}{m}\theta_0\\\theta_1\end{array}\]$ is the parameter vector of dimension [n x 1].

Defining the cost function $J$\theta$$ as,

$J$\theta$=\frac{1}{2}\sum_{j=1}^m w^j\[h_{\theta}(x^j) - y^j\]^2 = \frac{1}{2}$X\theta-Y$^TW$X\theta-Y$$ .

To find the value of $\theta$ which minimizes $J$\theta$$ , we can differentiate $J$\theta$$ with respect to $\theta$ , i.e.

$\begin{array}{lll}\frac{\partial}{\partial\theta}J$\theta$& =& \frac{1}{2}\frac{\partial}{\partial\theta}$X\theta-Y$^TW$X\theta-Y$\\&=&\frac{1}{2}\frac{\partial}{\partial\theta}$\theta^TX^TWX\theta - \theta^TX^TWY -Y^TWX\theta+Y^TWY $\\&=&$X^TWX\theta - X^TWY$\end{array}}$

To find the value of $\theta$ which minimizes $\theta$ , we set

$\frac{\partial}{\partial\theta}J(\theta)=0$ ,

$\begin{array}{lll}$X^TWX\theta - X^TWY$&=&0\end{array}$ .

The weighted least squares solution is,

$\begin{array}{lll}\theta & = & $X^TWX$^{-1}X^TWY\end{array}$

Local weights using exponential function

As given in Chapter 4 of CS229 Lecture notes1, Probabilistic Interpretation, Prof. Andrew Ng. let us assume a weighting function defined as,

$w^j = e^{-\frac{(x^j-x)^2}{2\tau^2}}$ .

When computing the predicted value for an observation $x^j$ , less weightage is given to observation far away from $x^j$ .

Further an additional parameter, $\tau$ controls the width of the weighting function. Higher the value of $\tau$ , wider the weight function.

Figure: Plot of the exponential weighting function for different values of $\tau$

Matlab/Octave code snippet

clear ;
close all;
x = [1:50].';
y = [4554 3014 2171 1891 1593 1532 1416 1326 1297 1266 ...
	1248 1052 951 936 918 797 743 665 662 652 ...
	629 609 596 590 582 547 486 471 462 435 ...
	424 403 400 386 386 384 384 383 370 365 ...
	360 358 354 347 320 319 318 311 307 290 ].';

m = length(y); % store the number of training examples
x = [ ones(m,1) x]; % Add a column of ones to x
n = size(x,2); % number of features
theta_vec = inv(x'*x)*x'*y;
tau = [1 10 25 ];
y_est = zeros(length(tau),length(x));
for kk = 1:length(tau)
	for ii = 1:length(x);
		w_ii = exp(-(x(ii,2) - x(:,2)).^2./(2*tau(kk)^2));
		W = diag(w_ii);
		theta_vec = inv(x'*W*x)*x'*W*y;
		y_est(kk, ii) = x(ii,:)*theta_vec;
	end
end

figure;
plot(x(:,2),y,'ks-'); hold on
plot(x(:,2),y_est(1,:),'bp-');
plot(x(:,2),y_est(2,:),'rx-');
plot(x(:,2),y_est(3,:),'go-');
legend('measured', 'predicted, tau=1', 'predicted, tau=10','predicted, tau=25');
grid on;
xlabel('Page index, x');
ylabel('Page views, y');
title('Measured and predicted page views with weighted least squares');

Observations

a) For a smaller value of $\tau$ (=1), the measured and predicted values are almost on top of each other

b) For a higher value of $\tau$ (=25), the predicted value is close to the curve obtained from the no weighting case.

c) When predicting using the locally weighted least squares case, we need to have the training set handy to compute the weighting function. In contrast, for the unweighted case one could have ignored the training set once parameter vector $\theta$ is computed.

References

CS229 Lecture notes1, Chapter 3 Locally weighted linear regression, Prof. Andrew Ng

Notations

Local weights using exponential function

References

Leave a Reply Cancel reply