The previous discussion was all about supervised learning, where the algorithm is trained on labeled pairs (x, y) of inputs and known outputs. In the real world, however, intelligent creatures such as human beings can often spot patterns the first time they see data, without any known answers. The machine learning counterpart of this is unsupervised learning.

Clustering falls under the unsupervised learning area. Within clustering, K-means is a common technique.

How does K-means work? Randomly initialize K cluster centroids, then repeat two steps: assign every data point to its nearest centroid, forming K clusters, and then move each centroid to the mean of the data points assigned to its cluster. It is an iterative process that stops once the assignments no longer change. The objective K-means optimizes is the distortion: the average squared distance between each data point and the centroid of the cluster it is assigned to.
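The two alternating steps can be sketched in a few lines of NumPy. This is only a minimal illustration, not the course's reference implementation: the names `kmeans` and `nearest_centroid`, the `seed` and `n_iters` parameters, and the choice to leave an empty cluster's centroid where it is are all assumptions made for the example.

```python
import numpy as np

def nearest_centroid(X, centroids):
    """Return each point's nearest-centroid index and its squared distance."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), d2.min(axis=1)

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means on an (m, n) array; returns (centroids, labels, cost)."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids as k distinct, randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment step: attach every point to its nearest centroid.
        labels = nearest_centroid(X, centroids)[0]
        # Update step: move each centroid to the mean of its points
        # (assumption: a centroid with no points stays where it is).
        new_centroids = centroids.copy()
        for j in range(k):
            if np.any(labels == j):
                new_centroids[j] = X[labels == j].mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so assignments are stable
        centroids = new_centroids
    # Cost (distortion): mean squared distance to the assigned centroid.
    labels, d2 = nearest_centroid(X, centroids)
    return centroids, labels, d2.mean()
```

Calling `kmeans(X, k=2)` on an array with two well-separated groups of points should return one centroid near the middle of each group, together with the cluster label of every point and the final cost.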

Then there is the question of how to initialize the centroids; the usual practice is to pick K data points at random. However, K-means will sometimes get stuck in a local optimum that is not the ideal solution, so a remedy is to run it from multiple random initializations and keep the run with the lowest cost. As for choosing K itself, according to Professor Ng, hand-picking is commonly the better option.
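The restart idea can be sketched as follows. Again this is a hedged illustration rather than a definitive implementation: `kmeans_once`, `kmeans_restarts`, and the default of 20 restarts are names and values chosen for the example, and each run simply uses a fixed number of iterations instead of a convergence check.

```python
import numpy as np

def kmeans_once(X, k, rng, n_iters=100):
    """One K-means run from a random start; returns (centroids, labels, cost)."""
    # Start from k distinct data points picked at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # skip clusters that received no points
                centroids[j] = X[labels == j].mean(axis=0)
    # Final assignment and cost for the centroids we ended up with.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.argmin(axis=1), d2.min(axis=1).mean()

def kmeans_restarts(X, k, n_restarts=20, seed=0):
    """Repeat K-means from different random starts; keep the lowest-cost run."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[2])
```

Because only the cheapest run is kept, the returned cost can never be worse than that of any single run in the batch, which is the whole point of restarting.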

When dealing with a gigantic amount of unlabeled data (unsupervised learning), we sometimes really need to compress the data, either by directly collapsing redundant dimensions into fewer ones (length in inches and length in meters carry the same information, so one of them can be dropped) or with a more sophisticated approach: PCA (Principal Component Analysis).

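PCA runs after mean normalization (and feature scaling, if needed): compute the covariance matrix of the data, take its top k eigenvectors (in practice via singular value decomposition), and project the data onto them. The number of components k is also chosen statistically, typically as the smallest k that retains a target share of the variance, such as 99%.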
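A minimal NumPy sketch of this procedure is below. The function name `pca`, the `variance_to_keep` parameter, and its 0.99 default are assumptions made for the example, and the sketch only does mean normalization, so it assumes the features are already on comparable scales.

```python
import numpy as np

def pca(X, variance_to_keep=0.99):
    """Project (m, n) data onto the fewest components that keep the
    requested share of variance; returns (Z, U_reduce, mu, k)."""
    mu = X.mean(axis=0)
    Xc = X - mu                        # mean normalization
    sigma = Xc.T @ Xc / len(X)         # (n, n) covariance matrix
    # For a covariance matrix, the columns of U are the principal
    # directions and S holds their variances in descending order.
    U, S, _ = np.linalg.svd(sigma)
    retained = np.cumsum(S) / np.sum(S)
    # Smallest k whose cumulative variance share reaches the target.
    k = int(np.searchsorted(retained, variance_to_keep)) + 1
    k = min(k, X.shape[1])             # guard against rounding at the end
    U_reduce = U[:, :k]
    Z = Xc @ U_reduce                  # (m, k) compressed representation
    return Z, U_reduce, mu, k
```

An approximate reconstruction of the original data is `Z @ U_reduce.T + mu`; the closer `variance_to_keep` is to 1, the smaller the projection error.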

As for applications, PCA is not limited to unsupervised learning. It can also be applied to large supervised learning problems, where training on the compressed features runs much faster.

It’s worth noting that PCA is not meant to address “over-fitting”, which is supposed to be dealt with using regularization. The reason lies deep in the foundations of the two mathematical mechanisms. PCA is a purely distance-based (projection error) optimization that never looks at the labels y, hence it may discard valuable features.

However, people tend to apply PCA early on, as it seems to improve efficiency and cut the workload tremendously.