Machine learning is a scientific discipline that explores the construction and study of algorithms that can learn from data. Such algorithms operate by building a model from inputs and using it to make predictions or decisions, rather than following only explicitly programmed instructions. It is often conflated with data mining, although data mining focuses more on exploratory data analysis.
Using the famous iris dataset as an example: before the data is fed into a machine learning algorithm, it ought to be organized as a table with n_samples rows and n_features columns, where the columns denote features of the iris such as Sepal.Length, Sepal.Width, Petal.Length and Petal.Width. Preparing the data this way is therefore important, and it is termed feature engineering. It covers categorical features, text features, image features, derived features and missing-data imputation, all of which are detailed here: https://jakevdp.github.io/PythonDataScienceHandbook/05.04-feature-engineering.html
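As a minimal sketch of that layout, assuming scikit-learn is installed, the bundled copy of the iris data can be loaded and inspected like this. Note that scikit-learn names the columns "sepal length (cm)" and so on, rather than the R-style Sepal.Length used above.

```python
from sklearn.datasets import load_iris

# Load the copy of the iris dataset that ships with scikit-learn.
iris = load_iris()

X = iris.data    # feature matrix: one row per flower, one column per feature
y = iris.target  # label vector: one species code per flower

n_samples, n_features = X.shape
print("n_samples:", n_samples)
print("n_features:", n_features)
print("feature names:", iris.feature_names)
print("species:", list(iris.target_names))
```

The two-dimensional X and one-dimensional y are the shapes that scikit-learn estimators expect in their fit method.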
The next big step is to choose the right algorithm, or estimator, for the task. Jake VanderPlas lists the following in his Python Data Science Handbook:
Naive Bayes Theory
Support Vector Machines
- One good application of SVM is face recognition.
Decision Trees and Random Forests
Principal Component Analysis
Gaussian Mixture Models
Kernel Density Estimation
- Application: A Face Detection Pipeline
To choose the right estimator, it’s essential that we understand the logic behind each. For example, Naive Bayes is, in essence, about finding the probability of a label given some observed features, which we can write as P(L | features). To understand it, let’s use a practical example: say that on a campus 60% of the students are boys, who all wear pants, and the remaining 40% are girls, each of whom has a 50% chance of wearing pants. If you pick anybody at random, what’s the chance that he or she is wearing pants?
This is easy and straightforward. Let U be the total number of people on campus. The number of people wearing pants is U * P(Boy) * P(Pants|Boy) + U * P(Girl) * P(Pants|Girl), so the probability is P(Pants) = P(Boy) * P(Pants|Boy) + P(Girl) * P(Pants|Girl) = 0.6 * 1 + 0.4 * 0.5 = 0.8.
The question can be altered to “among the whole group of people wearing pants, what’s the chance that a given person is a girl?”
This looks daunting at first glance, but given the answer to the first question it is easy: put the number of girls wearing pants, U * P(Girl) * P(Pants|Girl), in the numerator and divide it by the total number of people wearing pants: U * P(Girl) * P(Pants|Girl) / [U * P(Boy) * P(Pants|Boy) + U * P(Girl) * P(Pants|Girl)]. The U cancels, so P(Girl|Pants) = P(Girl) * P(Pants|Girl) / [P(Boy) * P(Pants|Boy) + P(Girl) * P(Pants|Girl)] = 0.2 / 0.8 = 0.25.
Generalizing, with A as the observed evidence (wearing pants) and B as the event of interest (being a girl):
P(B|A) = P(A|B) * P(B) / [P(A|B) * P(B) + P(A|~B) * P(~B) ]
Note that P(B|A) is a conditional probability: the probability of B given that A has occurred. P(A|B) is the reverse, the probability of A given B.
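The campus example can be checked numerically with a few lines of plain Python. This is only a sketch of the arithmetic; the variable names are made up for illustration.

```python
# Priors: the share of boys and girls on campus.
p_boy = 0.6
p_girl = 0.4

# Likelihoods: the chance of wearing pants within each group.
p_pants_given_boy = 1.0
p_pants_given_girl = 0.5

# Total probability of seeing someone in pants (the denominator in Bayes' rule).
p_pants = p_boy * p_pants_given_boy + p_girl * p_pants_given_girl

# Bayes' rule: P(Girl|Pants) = P(Pants|Girl) * P(Girl) / P(Pants).
p_girl_given_pants = p_pants_given_girl * p_girl / p_pants

print(f"P(Pants)      = {p_pants:.2f}")             # 0.80
print(f"P(Girl|Pants) = {p_girl_given_pants:.2f}")  # 0.25
```

So although girls make up 40% of the campus, they make up only 25% of the people wearing pants.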
Peter Norvig wrote a book on AI; an example in this book shows how to apply Bayes to correct spelling mistakes in a search term as we type it. In the notation above, the quantity of interest is
P(B|A) = P(the word she meant | the misspelled word she actually entered)
As for the algorithms available in practice, there are Gaussian Naive Bayes, which assumes that the data for each label is drawn from a simple Gaussian distribution, and Multinomial Naive Bayes. The multinomial distribution describes the probability of observing counts among a number of categories, and thus multinomial naive Bayes is most appropriate for features that represent counts or count rates. One of its applications is text classification.
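Both variants are available in scikit-learn. The sketch below fits GaussianNB to the iris measurements and MultinomialNB to word counts from a tiny made-up corpus; the four sentences, their spam/ham labels and the 30% test split are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Gaussian Naive Bayes: continuous features, one Gaussian per feature and label.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
gnb = GaussianNB().fit(X_train, y_train)
gnb_accuracy = gnb.score(X_test, y_test)
print("GaussianNB accuracy on held-out iris data:", round(gnb_accuracy, 3))

# Multinomial Naive Bayes: count features, here word counts from a toy corpus.
texts = [
    "cheap pills buy now",
    "buy cheap watches",
    "meeting at noon tomorrow",
    "lunch meeting with team",
]
labels = ["spam", "spam", "ham", "ham"]
text_model = make_pipeline(CountVectorizer(), MultinomialNB())
text_model.fit(texts, labels)
predictions = list(text_model.predict(["cheap pills", "team meeting"]))
print("MultinomialNB predictions:", predictions)
```

CountVectorizer turns each sentence into a vector of word counts, which is exactly the kind of feature the multinomial model is designed for.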
Just as Naive Bayes is a good baseline for classification, linear regression models are a good baseline for regression tasks. It’s worth noting that simple linear regression is not the only option: there are more flexible variants such as polynomial and Gaussian basis function regression, and regularized ones such as ridge and lasso regression.
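A rough sketch of these variants in scikit-learn follows. The noisy sine data, the polynomial degree of 7 and the alpha values are arbitrary assumptions. Gaussian basis functions are left out because scikit-learn has no built-in transformer for them.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (an assumption for illustration): a noisy sine wave.
rng = np.random.RandomState(0)
x = 10 * rng.rand(50)
y = np.sin(x) + 0.1 * rng.randn(50)
X = x[:, np.newaxis]  # scikit-learn expects a 2-D feature matrix


def poly_model(regressor, degree=7):
    """Expand x into polynomial features, rescale them, then fit a linear model."""
    return make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        StandardScaler(),
        regressor,
    )


models = {
    "linear": LinearRegression(),
    "polynomial": poly_model(LinearRegression()),
    "polynomial + ridge": poly_model(Ridge(alpha=0.1)),
    "polynomial + lasso": poly_model(Lasso(alpha=0.01, max_iter=100000)),
}

# R^2 on the training data; a straight line cannot follow a sine wave.
scores = {name: model.fit(X, y).score(X, y) for name, model in models.items()}
for name, score in scores.items():
    print(f"{name:20s} R^2 = {score:.3f}")
```

Ridge and lasso fit the same polynomial features but penalize large coefficients, which trades a little training accuracy for a smoother, less overfit curve.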
Support vector machines (SVMs) take a different approach from Naive Bayes: rather than modeling each class, they attempt to draw a line (or, in higher dimensions, a hyperplane) between two categories of data, choosing the one that maximizes the margin, that is, the width of the band around the line up to the nearest points. In combination with kernels, SVMs can be used for data that is not linearly separable. The facial recognition problem can be tackled with an SVM by first extracting features from the face pixel values using principal component analysis.
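To illustrate the effect of kernels, here is a small sketch that compares a linear SVM with an RBF-kernel SVM on two concentric rings of points, which no straight line can separate. The dataset and the parameter values are assumptions chosen for the demonstration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: a classic example of data that is not linearly separable.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A linear kernel looks for a straight maximum-margin boundary.
linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)

# The RBF kernel implicitly maps the points to a space where a margin exists.
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

linear_acc = linear_svm.score(X_test, y_test)
rbf_acc = rbf_svm.score(X_test, y_test)
print("linear kernel accuracy:", round(linear_acc, 3))
print("RBF kernel accuracy:   ", round(rbf_acc, 3))
print("support vectors used by the RBF model:", len(rbf_svm.support_vectors_))
```

Only the support vectors, the points nearest the boundary, determine where the margin lies; the remaining training points could be moved without changing the fit.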
Random forests are built on decision trees, and the concept of a decision tree is about as intuitive as it gets: it is like the branching of a tree, where each node splits on a yes-no answer to a question. Instead of relying on a single decision tree, we can fit many trees, each on a random subset of the training points, and combine their predictions; this is a random forest. In scikit-learn, RandomForestClassifier does this job quickly once we have selected the number of estimators. Random forests are well suited to digit recognition and to stock categorization based on a large array of factors.
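As a sketch of the digit recognition use case, the following fits a RandomForestClassifier to scikit-learn's bundled handwritten digits. The choice of 100 estimators and the default train/test split are assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 8x8 images of handwritten digits, flattened to 64 pixel features each.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0
)

# n_estimators is the number of randomized decision trees in the forest.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

accuracy = forest.score(X_test, y_test)
print("number of trees:", len(forest.estimators_))
print("held-out accuracy:", round(accuracy, 3))
```

Each tree sees a different bootstrap sample of the training points, and the forest predicts by aggregating the trees, which reduces the overfitting a single deep tree is prone to.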
K-Means is one popular clustering method; others include Fuzzy C-means (an overlapping clustering algorithm), hierarchical clustering, and Mixture of Gaussians (a probabilistic clustering algorithm). The K is a predetermined number of clusters to find within an unlabeled multidimensional dataset. The intuition behind the algorithm is: first, the “cluster center” is the arithmetic mean of all the points belonging to the cluster; second, each point is closer to its own cluster center than to other cluster centers. People often confuse K-NN (k-nearest neighbors) with K-Means because both have a K in their names and both rely on distances between points, so it’s worth distinguishing the two. K-NN is a classification (supervised learning) algorithm: it labels a new point by the majority class among its K nearest labeled neighbors, which yields decision boundaries for each class, as illustrated in the figure below.
K-Means is a clustering (unsupervised learning) algorithm. For example, handwritten digits data (PCA reduced) can be grouped into 10 clusters, one around each centroid, as in the figure below. K-Means is also used for color compression.
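The difference between the two is easy to see in code. In this sketch, which assumes the scikit-learn digits data reduced to two PCA components, K-Means never sees the labels, while K-NN needs them for training. The choice of five neighbors is arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()

# Reduce the 64 pixel features to 2 principal components.
X_2d = PCA(n_components=2, random_state=0).fit_transform(digits.data)

# K-Means (unsupervised): only the points are used, never digits.target.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_2d)
print("centroid array shape:", kmeans.cluster_centers_.shape)

# K-NN (supervised): the labels are required to fit the classifier.
X_train, X_test, y_train, y_test = train_test_split(
    X_2d, digits.target, random_state=0
)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
knn_accuracy = knn.score(X_test, y_test)
print("K-NN held-out accuracy on 2-D data:", round(knn_accuracy, 3))
```

The cluster ids that K-Means returns are arbitrary integers, so cluster 3 does not correspond to the digit 3 unless the clusters are matched to labels afterwards.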
To sum up, here are some good visuals:
Jake’s website: https://jakevdp.github.io/PythonDataScienceHandbook/index.html
Scikit-learn Python package official website: http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html
Andrew Ng’s ML lecture series
ARPM ML applications in finance: https://www.arpm.co/certificate/body-of-knowledge/data-science-for-finance/
And here is Kaggle’s survey on how frequently these algorithms are used: