Manifold Learning and Principal Component Analysis (PCA) in Preprocessing Data Before Building a Classification Pipeline

The "curse of dimensionality" is prevalent across the whole spectrum of data analysis. As machine learning takes on complex tasks such as handwritten-digit and human-face recognition, the sheer amount of data and the high number of dimensions become major hurdles to running these otherwise powerful methods. An effective way to reduce dimensionality is therefore indispensable before machine learning algorithms can proceed.
Manifold learning and Principal Component Analysis (PCA) are two families of methods used extensively for dimensionality reduction.

PCA is the better known of the two and is relatively easy to understand. Its simplest form can be illustrated with a picture resembling linear regression, where two-dimensional data points are reduced to one dimension (the x-axis) while still matching the y-axis values reasonably well. The handwritten-digits dataset shows the power of PCA on high-dimensional data: it contains 1,797 samples with 64 features, an 8-by-8 grid of pixels whose numeric values represent the brightness of each dot. A minimal loading sketch is shown below.
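A minimal sketch of loading this data, assuming scikit-learn's bundled digits dataset (load_digits):

    from sklearn.datasets import load_digits

    digits = load_digits()
    print(digits.data.shape)   # (1797, 64): 1797 samples, 8x8 = 64 pixel brightness values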
Applying pca = PCA(2) projects the data from 64 to 2 dimensions, and projected = pca.fit_transform(digits.data) computes the projection. Plotting the result as a two-dimensional scatter shows that, even without labels, the 10 digits cluster reasonably well along the two components identified by PCA. The PCA components can be understood as the most salient features of the data.
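A hedged sketch of that projection (matplotlib is assumed here only for the scatter plot, and coloring by digits.target is an illustrative choice, not part of the original text):

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    digits = load_digits()
    pca = PCA(2)                                 # project from 64 to 2 dimensions
    projected = pca.fit_transform(digits.data)   # shape (1797, 2)

    # color each point by its true digit label to make the rough clustering visible
    plt.scatter(projected[:, 0], projected[:, 1], c=digits.target, cmap='tab10', s=5)
    plt.xlabel('component 1')
    plt.ylabel('component 2')
    plt.colorbar(label='digit')
    plt.show()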

Figure: two-component PCA projection of the digits data.
The next vital step is to determine how many components are needed to preserve the data faithfully while still reducing the dimensionality to a useful degree. This is determined by looking at the cumulative explained variance ratio as a function of the number of components. From the chart below, about 10 components are needed to retain roughly 75% of the original variance.
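One way to produce such a chart, sketched here with the same digits data and a full PCA fit (keeping all 64 components is an assumption consistent with the curve described above):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    digits = load_digits()
    pca = PCA().fit(digits.data)   # keep all 64 components

    # cumulative explained variance as a function of the number of components
    plt.plot(np.cumsum(pca.explained_variance_ratio_))
    plt.xlabel('number of components')
    plt.ylabel('cumulative explained variance')
    plt.show()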

Figure: cumulative explained variance vs. number of PCA components.

PCA's main weakness is that it tends to be highly affected by outliers in the data. For this reason, many robust variants of PCA have been developed, several of which are available in the scikit-learn package. In addition, PCA does not handle nonlinear relationships in the data well; for such data, some of the manifold methods are useful.
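As one hedged illustration, SparsePCA in sklearn.decomposition is an example of a PCA variant that follows the same fit/transform interface; it is chosen here only for illustration (it adds sparsity rather than outlier robustness) and is not named in the original post:

    from sklearn.datasets import load_digits
    from sklearn.decomposition import SparsePCA

    digits = load_digits()
    spca = SparsePCA(n_components=2, random_state=0)   # a PCA variant with sparse components
    projected = spca.fit_transform(digits.data)
    print(projected.shape)   # (1797, 2)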

There are a number of manifold methods, including multidimensional scaling (MDS), locally linear embedding (LLE), and isometric mapping (Isomap). MDS can recover the underlying structure of data even after it has been rotated, translated, or scaled into a higher-dimensional space. Where MDS breaks down is when the embedding is nonlinear, and that is where locally linear embedding (LLE) comes in. According to the author, Jake VanderPlas (a short sketch of these estimators follows the list below):

  • For high-dimensional data from real-world sources, LLE often produces poor results, and isometric mapping (Isomap) seems to generally lead to more meaningful embeddings. This is implemented in sklearn.manifold.Isomap.
  • For data that is highly clustered, t-distributed stochastic neighbor embedding (t-SNE) seems to work very well, though it can be very slow compared to other methods. This is implemented in sklearn.manifold.TSNE.
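A short sketch of the manifold estimators named above, applied to the same digits data; the specific parameter values and the 500-sample subset are illustrative assumptions, chosen because MDS and t-SNE can be slow on the full dataset:

    from sklearn.datasets import load_digits
    from sklearn.manifold import MDS, LocallyLinearEmbedding, Isomap, TSNE

    digits = load_digits()
    X = digits.data[:500]   # a subset keeps MDS and t-SNE reasonably fast

    mds = MDS(n_components=2, random_state=0)                      # preserves pairwise distances
    lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)   # preserves local neighborhood geometry
    iso = Isomap(n_components=2)                                   # geodesic distances on a neighborhood graph
    tsne = TSNE(n_components=2, init='pca', random_state=0)        # works well for highly clustered data

    for name, model in [('MDS', mds), ('LLE', lle), ('Isomap', iso), ('t-SNE', tsne)]:
        projected = model.fit_transform(X)
        print(name, projected.shape)   # each method gives a (500, 2) embedding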

Referenced from Jake VanderPlas's Python Data Science Handbook: https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html
