The previous discussion covered prediction problems where the target takes continuous values. What if the value you are trying to predict is discrete instead, such as category A or B, or simply binary: positive or negative, yes or no?

These are classification problems, and the tool is no longer a linear, multivariate, or polynomial regression model but a logistic regression model. The logistic function, also called the sigmoid function, is g(z) = 1 / (1 + e^(-z)); as illustrated below, it squeezes every real input into the interval between 0 and 1. So the trick is to wrap the usual linear function theta^T x in g, giving h_theta(x) = g(theta^T x), so that the output has the sigmoid/logistic shape and can be read as the probability that y = 1.
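To make the squeezing behaviour concrete, here is a minimal Python sketch of g(z) = 1 / (1 + e^(-z)) and of the hypothesis h_theta(x) = g(theta^T x). The function names `sigmoid` and `hypothesis` and the sample inputs are my own choices for illustration, not something taken from the course material.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    # Branch on the sign of z so math.exp never overflows for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x is assumed to start with the bias term x0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# Very negative inputs land near 0, very positive inputs near 1, and 0 maps to 0.5.
for z in (-10, -1, 0, 1, 10):
    print(z, round(sigmoid(z), 4))
```

Whatever theta^T x evaluates to, the output stays between 0 and 1, which is what lets it be read as a probability.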

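Working through a few examples (not reproduced here, as they are straightforward), we get to the concept of the “decision boundary” in a classification problem: the set of points where h_theta(x) = 0.5, i.e. theta^T x = 0, which separates the region predicted as y = 1 from the region predicted as y = 0.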
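As a small illustration of the decision boundary, the sketch below predicts y = 1 whenever theta^T x >= 0, which is the same as h_theta(x) >= 0.5. The parameter vector here is hypothetical, chosen only so that the boundary is the line x1 + x2 = 3; `predict` is my own helper name.

```python
def predict(theta, x):
    """Return 1 when theta^T x >= 0 (equivalently h_theta(x) >= 0.5), else 0."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

# Hypothetical parameters: -3 + x1 + x2 = 0 is the decision boundary.
theta = [-3.0, 1.0, 1.0]

# Each input starts with the bias term x0 = 1.
print(predict(theta, [1.0, 1.0, 1.0]))  # x1 + x2 = 2, on the y = 0 side
print(predict(theta, [1.0, 2.0, 2.0]))  # x1 + x2 = 4, on the y = 1 side
```

Note that the boundary is a property of theta, not of the training data; the data only determines which theta we end up with.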

Then, to turn this concept into rigorous mathematical computation, a cost function for logistic (sigmoid) regression needs to be built.

If we borrow the squared-error cost directly from linear regression, as shown below, the nonlinearity of the logistic function makes J(theta) non-convex (left), which is not ideal because gradient descent can get stuck in a local minimum. What we want is a convex curve like the one on the right.

A solution is as follows. It matches the intuition perfectly; however, I don’t know the mathematical derivation yet.

Further statistical derivation (via maximum likelihood estimation) leads to
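Here is a minimal sketch of the cost function, assuming it takes the standard cross-entropy form: the per-example cost is -log(h_theta(x)) when y = 1 and -log(1 - h_theta(x)) when y = 0, which combine into J(theta) = -(1/m) * sum(y * log(h_theta(x)) + (1 - y) * log(1 - h_theta(x))). The function names, the clamping constant `eps`, and the toy data are my own assumptions.

```python
import math

def sigmoid(z):
    """Overflow-safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h))."""
    m = len(y)
    eps = 1e-12  # assumption: clamp h away from 0 and 1 so log() stays finite
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        h = min(max(h, eps), 1.0 - eps)
        # Only one of the two terms is non-zero, depending on the label.
        total += yi * math.log(h) + (1 - yi) * math.log(1.0 - h)
    return -total / m

# Toy data (made up): first column is the bias term x0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

print(cost([0.0, 0.0], X, y))   # uninformative theta: every h is 0.5
print(cost([-4.0, 2.0], X, y))  # theta that separates the two classes
```

With theta = 0 every prediction is 0.5, so the cost is exactly log 2; a theta that puts each example on the correct side of the boundary gives a lower cost, and a confident wrong prediction is penalized heavily because -log(h) grows without bound as h approaches 0.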

Once the cost function is built, the next step is the same as in linear regression: apply gradient descent to find the theta vector that minimizes the cost.

One can see that the update rule looks exactly the same as in linear regression, but be aware that h_theta(x) is theta transposed times the x vector in linear regression, while in logistic regression it is the logistic transformation of that same product.
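The update rule can be sketched as follows, assuming the usual batch form theta_j := theta_j - alpha * (1/m) * sum((h_theta(x_i) - y_i) * x_ij). The only line that differs from linear regression is the one that computes h through the sigmoid. The learning rate, iteration count, and toy data are illustrative assumptions, not tuned values.

```python
import math

def sigmoid(z):
    """Overflow-safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def gradient_descent(X, y, alpha=0.5, iterations=2000):
    """Batch gradient descent for logistic regression, starting from theta = 0."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        # h_theta(x_i) - y_i for every example, all computed with the current theta.
        errors = [
            sigmoid(sum(t * v for t, v in zip(theta, xi))) - yi
            for xi, yi in zip(X, y)
        ]
        # Simultaneous update of every theta_j.
        theta = [
            theta[j] - alpha / m * sum(e * xi[j] for e, xi in zip(errors, X))
            for j in range(n)
        ]
    return theta

# Toy data (made up): bias term x0 = 1, then one feature.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
y = [0, 0, 0, 1, 1, 1]

theta = gradient_descent(X, y)
predictions = [1 if sum(t * v for t, v in zip(theta, xi)) >= 0 else 0 for xi in X]
print(theta, predictions)
```

Replacing the `sigmoid(...)` call with the bare dot product would turn this back into gradient descent for linear regression, which is exactly the point made above.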

In terms of optimization, beyond gradient descent there are more advanced algorithms such as conjugate gradient, BFGS, and L-BFGS. The advantages are that you don’t have to pick the learning rate alpha manually and that they are often faster than gradient descent; the disadvantage is that they are more complex.
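In practice you rarely implement these optimizers yourself. As a hedged sketch, assuming NumPy and SciPy are installed, `scipy.optimize.minimize` can run BFGS on the logistic cost if you hand it a function that returns both the cost and its gradient. The helper name `cost_and_grad` and the toy data are my own; the data is deliberately not perfectly separable so that the minimum is finite.

```python
import numpy as np
from scipy.optimize import minimize

def cost_and_grad(theta, X, y):
    """Return (J(theta), gradient) for logistic regression."""
    h = 1.0 / (1.0 + np.exp(-X @ theta))
    h = np.clip(h, 1e-12, 1 - 1e-12)  # assumption: clamp so log() stays finite
    J = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    grad = X.T @ (h - y) / y.size
    return J, grad

# Toy data (made up): bias column, then one feature; the classes overlap slightly.
X = np.array([[1, 0.5], [1, 1.0], [1, 2.0], [1, 2.5], [1, 3.0], [1, 4.0]])
y = np.array([0, 0, 1, 0, 1, 1])

# jac=True tells SciPy that the function returns (cost, gradient) together.
result = minimize(cost_and_grad, x0=np.zeros(2), args=(X, y), jac=True, method="BFGS")
print(result.x, result.fun)
```

There is no alpha anywhere in this call, since the optimizer chooses its own step sizes. Passing `method="CG"` or `method="L-BFGS-B"` instead selects SciPy's conjugate gradient or limited-memory BFGS variants.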

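In the same vein, we can apply the same approach to multi-class classification by using the one-versus-all (one-versus-rest) method: train one binary logistic classifier per class and, for a new input, pick the class whose classifier outputs the highest probability.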
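A minimal sketch of one-versus-all, assuming plain batch gradient descent for each binary classifier. The helper names, hyperparameters, class labels, and toy points are all made up for illustration.

```python
import math

def sigmoid(z):
    """Overflow-safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def dot(theta, x):
    return sum(t * v for t, v in zip(theta, x))

def train_binary(X, y, alpha=0.1, iterations=5000):
    """Batch gradient descent for a single binary logistic classifier."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        errors = [sigmoid(dot(theta, xi)) - yi for xi, yi in zip(X, y)]
        theta = [
            theta[j] - alpha / m * sum(e * xi[j] for e, xi in zip(errors, X))
            for j in range(n)
        ]
    return theta

def train_one_vs_all(X, labels):
    """Train one classifier per class: that class is 1, every other class is 0."""
    classifiers = {}
    for c in sorted(set(labels)):
        y_c = [1 if lab == c else 0 for lab in labels]
        classifiers[c] = train_binary(X, y_c)
    return classifiers

def predict_one_vs_all(classifiers, x):
    """Pick the class whose classifier reports the highest probability."""
    return max(classifiers, key=lambda c: sigmoid(dot(classifiers[c], x)))

# Toy data (made up): bias term x0 = 1, then two features; three clusters.
X = [
    [1.0, 0.0, 0.0], [1.0, 0.5, 0.5],   # class "A"
    [1.0, 4.0, 0.0], [1.0, 4.5, 0.5],   # class "B"
    [1.0, 2.0, 4.0], [1.0, 2.5, 4.5],   # class "C"
]
labels = ["A", "A", "B", "B", "C", "C"]

classifiers = train_one_vs_all(X, labels)
print([predict_one_vs_all(classifiers, xi) for xi in X])
```

Each classifier only answers "this class or not", so with K classes you end up training K separate theta vectors.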