Minimize the square loss function:
$$L(w, b) = \sum_i [y_i - (w \cdot x_i + b)]^2$$
Predictor/independent variable (x): used to predict the response variable.
Response variable (y): the variable that we want to predict.
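As a quick illustration, here is a minimal NumPy sketch of the square loss above; the function name `square_loss` and the toy data are illustrative, not from the original.

```python
import numpy as np

def square_loss(w, b, x, y):
    """Sum of squared residuals for the line y_hat = w*x + b."""
    residuals = y - (w * x + b)
    return np.sum(residuals ** 2)

# Toy data that lies roughly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
print(square_loss(2.0, 1.0, x, y))  # small value, since this line fits well
```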
In one dimension, the problem becomes fitting a straight line of the form $\hat{y} = wx + b$, where $w$ is the slope and $b$ is the intercept. Given training data points $(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)$, our task is to find the line ($w$ and $b$) for which the square error is minimum.
We can find the minimum of the square loss function and obtain solutions for $w$ and $b$ by setting both derivatives to zero:

$$\frac{dL}{dw} = \frac{dL}{db} = 0$$

Setting $\frac{dL}{db} = 0$ gives:

$$\Rightarrow \sum_i^n 2 (y_i - (w x_i + b)) = 0$$

$$\Rightarrow \sum_i^n y_i = w \sum_i^n x_i + nb$$

$$\Rightarrow b = \frac{1}{n} \sum_i^n y_i - \frac{w}{n} \sum_i^n x_i = \bar{y} - w\bar{x}$$

We can then solve for $w$ by setting $\frac{dL}{dw} = 0$ and substituting $b = \bar{y} - w\bar{x}$:

$$w = \frac{\sum_i^n (y_i - \bar{y})(x_i - \bar{x})}{\sum_i^n (x_i - \bar{x})^2}$$
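A minimal sketch of these closed-form estimates in NumPy (the function name `fit_line` and the toy data are our own):

```python
import numpy as np

def fit_line(x, y):
    """Closed-form least-squares fit of y_hat = w*x + b."""
    x_bar, y_bar = x.mean(), y.mean()
    w = np.sum((y - y_bar) * (x - x_bar)) / np.sum((x - x_bar) ** 2)
    b = y_bar - w * x_bar
    return w, b

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
w, b = fit_line(x, y)
print(w, b)  # close to 2 and 1 for this toy data
```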
The above method can be extended to more than one predictor variable. In that case,

$$\hat{y} = w^{(1)} x^{(1)} + w^{(2)} x^{(2)} + \cdots + w^{(k)} x^{(k)} + b = w \cdot x + b$$
We can incorporate $b$ into $w$ by assuming an extra predictor variable:

$$\hat{y} = w \cdot x + b = \tilde{w} \cdot \tilde{x}$$

where $\tilde{x} = (1, x)$ and $\tilde{w} = (b, w)$.
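In code, absorbing the intercept amounts to prepending a column of ones to the data matrix so that each row is $\tilde{x}_i = (1, x_i)$; a small sketch, assuming NumPy arrays:

```python
import numpy as np

# x has shape (n, k); prepend a column of ones so that
# each row becomes x_tilde[i] = (1, x[i]) and w_tilde = (b, w)
x = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
x_tilde = np.hstack([np.ones((x.shape[0], 1)), x])
print(x_tilde)
```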
Our variables can be written as matrices:

$$X = \begin{pmatrix} \leftarrow & \tilde{x}_1 & \rightarrow \\ \leftarrow & \tilde{x}_2 & \rightarrow \\ & \vdots & \\ \leftarrow & \tilde{x}_n & \rightarrow \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$$
The loss function is minimized at: $\tilde{w} = (X^T X)^{-1} (X^T y)$.
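A minimal sketch of this normal-equation solution (using `np.linalg.solve` instead of an explicit inverse, which is equivalent here but numerically more stable; the function and variable names are illustrative):

```python
import numpy as np

def fit_linear(X_raw, y):
    """Solve (X^T X) w_tilde = X^T y, where X has a leading column of ones."""
    X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
    w_tilde = np.linalg.solve(X.T @ X, X.T @ y)
    b, w = w_tilde[0], w_tilde[1:]
    return w, b

X_raw = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.1, 2.9, 5.2, 6.8])
w, b = fit_linear(X_raw, y)
print(w, b)  # roughly [2.] and 1. for this toy data
```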
Scaling of the data is not important when we have multiple variables in linear regression: rescaling a feature only rescales the corresponding weight in the closed-form solution, so the fitted predictions are unchanged.
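A quick numerical check of this claim, as a sketch with made-up data: fitting on rescaled features changes the weights but not the predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 + rng.normal(scale=0.1, size=100)

def fit(X_raw, y):
    """Normal-equation fit with an intercept column of ones."""
    X1 = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
    return np.linalg.solve(X1.T @ X1, X1.T @ y)

scale = np.array([10.0, 0.01])          # arbitrary per-feature rescaling
w_orig = fit(X, y)
w_scaled = fit(X * scale, y)

pred_orig = np.hstack([np.ones((100, 1)), X]) @ w_orig
pred_scaled = np.hstack([np.ones((100, 1)), X * scale]) @ w_scaled
print(np.allclose(pred_orig, pred_scaled))  # True: predictions are identical
```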