As a prelude to getting deeper into ML, I'd first like to point out why linear algebra is so important. It's a magical way to collapse tons of sample data and parameters into one short, flat, succinct equation: prediction = data matrix * parameters. Below is a single-feature (size) housing price prediction using a linear regression model.
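The "one-line equation" idea can be sketched in NumPy as follows. The house sizes and parameter values here are made up for illustration; the point is that one matrix product predicts prices for every sample at once:

```python
import numpy as np

# Hypothetical sample data: house sizes (sq ft), with a leading
# column of 1s so theta_0 acts as the intercept term.
X = np.array([[1.0, 2104.0],
              [1.0, 1416.0],
              [1.0, 1534.0]])

# Hypothetical parameters: theta_0 (base price), theta_1 (price per sq ft)
theta = np.array([50.0, 0.1])

# prediction = data matrix * parameters, for all samples in one line
predictions = X @ theta
print(predictions)  # [260.4 191.6 203.4]
```

No loops over samples: the matrix product does all the per-house arithmetic in one shot.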

Now we delve into multivariate regression. It's just an extension of the single-variable linear regression model below: instead of dealing with one feature/variable, multiple features/variables are allowed in the equation. Usually a subscript j is used to index the features.
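The multivariate hypothesis is the same matrix product as before, just with more columns. A small sketch with made-up features (size, bedrooms, age) and made-up theta values:

```python
import numpy as np

# Hypothetical rows: [1, size, bedrooms, age]; the leading 1 pairs with theta_0
X = np.array([[1.0, 2104.0, 3.0, 10.0],
              [1.0, 1416.0, 2.0, 30.0]])

# One theta_j per feature, j = 0..n (values are made up for illustration)
theta = np.array([40.0, 0.1, 5.0, -0.5])

# h_theta(x) = theta_0 + theta_1*x_1 + ... + theta_n*x_n, vectorized
predictions = X @ theta
print(predictions)  # [260.4 176.6]
```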

Performing the math deduction of the partial derivative of J here, by combining (substituting parts of) the three equations above, we get
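The resulting update rule, ∂J/∂θ_j = (1/m) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) x_j⁽ⁱ⁾, applied simultaneously to all θ_j, can be sketched as below. The toy data (y = 2x, so the true theta is [0, 2]) and the learning rate are my own choices for illustration:

```python
import numpy as np

def gradient_descent_step(X, y, theta, alpha):
    """One simultaneous update of every theta_j using
    dJ/dtheta_j = (1/m) * sum_i (h(x_i) - y_i) * x_ij."""
    m = len(y)
    errors = X @ theta - y          # h_theta(x) - y for every sample
    gradient = (X.T @ errors) / m   # all partial derivatives at once
    return theta - alpha * gradient

# Hypothetical toy data: y = 2*x exactly, so theta should approach [0, 2]
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

theta = np.zeros(2)
for _ in range(2000):
    theta = gradient_descent_step(X, y, theta, alpha=0.1)
```

Note the many iterations needed even on three data points, which is exactly the cost the normal equation avoids.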

Gradient Descent is a great tool for reaching the minimizing theta value (vector), but it's very computationally expensive and time-consuming. The Normal Equation comes to the rescue if you want a quick and dirty solution.

So what is the normal equation? It's basically just a shortcut to the theta values: mathematically set the partial derivatives to 0 and solve, rather than iterating numerous times over the training data points.

Written in linear algebra language, it is
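The closed-form solution θ = (XᵀX)⁻¹Xᵀy is a one-liner in NumPy. A minimal sketch, reusing the same made-up y = 2x toy data (solving the linear system is preferred over explicitly inverting XᵀX):

```python
import numpy as np

# Hypothetical toy data: y = 2*x, so the exact answer is theta = [0, 2]
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

# Normal equation: solve (X^T X) theta = X^T y in one step, no iterations,
# no learning rate to tune
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # approximately [0., 2.]
```

One call replaces the thousands of gradient steps, as long as XᵀX is invertible and n is small enough.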

Using the normal equation avoids choosing alpha (the learning rate) and iterating, so it seems like the better approach. However, since it involves inverting the matrix XᵀX, the computation grows roughly cubically as the number of features (n) goes up. A rule of thumb from Prof. Ng's experience: for n <= 10,000 the normal equation can handle it, but for n greater than that, up to millions, gradient descent is the better choice.