I stumbled upon a training course given by a self-styled expert to train algorithm workers, claiming they would be paid extraordinary salaries once qualified. So I jotted down the content and filled in what I know accordingly.
1. Language features – versatile, easy to pick up
Introduction to the Anaconda environment – It is an open-source, easy-to-install, high-performance Python and R distribution, with the conda package and environment manager and a collection of 1,000+ open-source packages with free community support. (from the Anaconda distribution documentation)
NumPy, Pandas, Matplotlib, etc.
1. Algorithm analysis and an introduction to Big O – Big O notation describes the performance or complexity of an algorithm, specifically its worst-case behavior.
2. Big O examples
3. Big O in Python data structures
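As a quick illustration of Big O in Python's built-in data structures (the sizes and data here are my own illustrative choices, not from the course): a membership test scans a list in O(n), but hashes into a set in O(1) on average.

```python
import timeit

# Build a list and a set with the same contents.
n = 100_000
as_list, as_set = list(range(n)), set(range(n))

# Look for the worst-case element (the last one) in each container.
t_list = timeit.timeit(lambda: n - 1 in as_list, number=100)  # O(n) linear scan
t_set = timeit.timeit(lambda: n - 1 in as_set, number=100)    # O(1) hash lookup
print(t_list > t_set)  # the linear scan is typically far slower
```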
1. Greedy algorithms: principles and an example – in contrast to the brute-force approach to this assignment-optimization problem, the greedy approach sorts the list and then pairs the longest task with the shortest remaining one.
A = [6, 3, 2, 7, 5, 5]
A = sorted(A)
print(A)
for i in range(len(A) // 2):
    print(A[i], A[~i])
2. Recursion and traversal – a recursive function calls itself on a smaller input until it reaches a base case.
5. Hash functions: principles and hash-table applications – hash tables and hash functions
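A minimal sketch of recursive traversal (the nested-list example is my own, not from the course): the function descends into each sublist until it hits a plain element.

```python
def flatten(items):
    """Recursively traverse a nested list and collect its leaves in order."""
    result = []
    for item in items:
        if isinstance(item, list):
            result.extend(flatten(item))  # recursive case: descend into the sublist
        else:
            result.append(item)           # base case: a plain element
    return result

print(flatten([1, [2, [3, 4]], 5]))  # → [1, 2, 3, 4, 5]
```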
A hash table (hash map) stores key-value pairs; a hash function transforms each key into a small index number, and this process is called hashing.
address = key mod n
When two keys hash to the same address (a collision), a simple open-addressing scheme probes for a free slot by adding a fixed offset, e.g. a "plus 3" rehash: address = (address + 3) mod n.
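A minimal sketch of the plus-3 rehash (the table size and keys are illustrative assumptions; note that for every slot to be reachable, the step 3 and the table size must be coprime):

```python
def plus3_insert(table, key):
    """Insert key into an open-addressed table using the 'plus 3' rehash."""
    n = len(table)
    address = key % n                  # address = key mod n
    while table[address] is not None:  # collision: rehash by adding 3
        address = (address + 3) % n
    table[address] = key
    return address

table = [None] * 7
print(plus3_insert(table, 7))   # → 0
print(plus3_insert(table, 14))  # → 3 (collides at slot 0, rehashes to 3)
```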
2. The Lagrange method: a portfolio-management case study – maximization of a function subject to a constraint is common in economic settings. That's where Lagrange multipliers come into play. Refer to this article for an application.
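A tiny worked example of my own (not the portfolio case from the course): maximize f(x, y) = x·y subject to x + y = 10. Setting the gradient of the Lagrangian xy − λ(x + y − 10) to zero gives y = λ and x = λ, hence x = y = 5 and f = 25, which a brute-force check along the constraint confirms:

```python
def f(x, y):
    return x * y

# Search along the constraint x + y = 10 and keep the best (value, x) pair.
best = max((f(x, 10 - x), x) for x in [i / 100 for i in range(1001)])
print(best)  # → (25.0, 5.0), matching the Lagrange-multiplier solution x = y = 5
```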
1. The idea behind the EM algorithm: K-means and related methods – K-means clustering is an unsupervised machine-learning algorithm. It starts with randomly selected centroids, then performs iterative calculations to optimize the centroid positions.
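The two iterated steps can be sketched in one dimension (the data and starting centroids below are illustrative assumptions, and k is fixed at 2 for brevity):

```python
def kmeans_1d(data, centroids, iters=10):
    """Minimal 1-D K-means: alternate assignment and update steps."""
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for x in data:
            idx = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[idx].append(x)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

print(kmeans_1d([1, 2, 3, 10, 11, 12], [0.0, 5.0]))  # → [2.0, 11.0]
```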
2. Tree-based algorithms and impurity calculation: entropy and the Gini index – both measure how mixed the class labels within a node are.
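Both impurity measures are one-liners over the class probabilities in a node (the probability vectors below are illustrative):

```python
from math import log2

def entropy(p):
    """Shannon entropy of a class-probability list, in bits."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def gini(p):
    """Gini impurity: probability of misclassifying a randomly drawn sample."""
    return 1 - sum(pi * pi for pi in p)

print(entropy([0.5, 0.5]))  # → 1.0 (maximally impure for two classes)
print(gini([0.5, 0.5]))     # → 0.5
```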
3. Ensemble principles: Boosting, Bagging, Stacking –
Bagging, boosting, and stacking are the three main terms describing the ensemble (combination) of several models into one more effective model:
1. Bagging is shorthand for the combination of bootstrapping and aggregating. Bootstrapping resamples data from the training set with replacement, producing sets with the same cardinality as the original; aggregating the models trained on them helps decrease the variance of the classifier and reduce overfitting, so the ensemble should be less overfitted than any single individual model.
2. Boosting adds models to the overall ensemble sequentially, each new model trained to correct the errors of the models before it.
3. Stacking trains a new model on the combined predictions of two (or more) previous models. The predictions from the models are used as inputs for each sequential layer and combined to form a new set of predictions.
Citing from the article by Scott Fortmann-Roe, “Bagging and other resampling techniques can be used to reduce the variance in model predictions. In bagging (Bootstrap Aggregating), numerous replicates of the original data set are created using random selection with replacement. Each derivative data set is then used to construct a new model and the models are gathered together into an ensemble. To make a prediction, all of the models in the ensemble are polled and their results are averaged.
One powerful modeling algorithm that makes good use of bagging is Random Forests. Random Forests works by training numerous decision trees each based on a different resampling of the original training data. In Random Forests the bias of the full model is equivalent to the bias of a single decision tree (which itself has high variance). By creating many of these trees, in effect a “forest”, and then averaging them the variance of the final model can be greatly reduced over that of a single tree. In practice the only limitation on the size of the forest is computing time as an infinite number of trees could be trained without ever increasing bias and with a continual (if asymptotically declining) decrease in the variance.”
GBDT, RandomForest – GBM and RF are both ensemble learning methods, and they differ in the way the trees are built. GBMs build trees one at a time and are more sensitive to overfitting when the data is noisy; RFs train each tree independently, using a random sample, and are less likely to overfit the training data.
4. Support vector machines – SVM
Loss functions: from Cross Entropy to Hinge – cross entropy measures the average number of bits needed to identify an event drawn from the dataset.
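Both losses for a single example fit in a few lines (the labels and scores are illustrative; note the label conventions differ: {0, 1} for cross entropy, {−1, +1} for hinge, per the SVM convention):

```python
from math import log

def cross_entropy(y_true, p):
    """Binary cross-entropy for a true label in {0, 1} and predicted probability p."""
    return -(y_true * log(p) + (1 - y_true) * log(1 - p))

def hinge(y_true, score):
    """Hinge loss for a true label in {-1, +1} and a raw classifier score."""
    return max(0.0, 1.0 - y_true * score)

print(round(cross_entropy(1, 0.9), 3))  # → 0.105 (small loss: confident, correct)
print(hinge(+1, 2.0))                   # → 0.0 (correct side of the margin)
print(hinge(+1, 0.5))                   # → 0.5 (correct, but inside the margin)
```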
4. GARCH models: principles and Python implementation – an ARCH model is appropriate when the error variance in a time series follows an autoregressive (AR) model; if an autoregressive moving average (ARMA) model is assumed for the error variance, the model is a generalized autoregressive conditional heteroskedasticity (GARCH) model.
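A minimal simulation of the GARCH(1,1) variance recursion (the parameter values are illustrative assumptions; in practice such models are usually fitted with the `arch` package rather than written by hand):

```python
import random

random.seed(42)
omega, alpha, beta = 0.1, 0.1, 0.8  # require alpha + beta < 1 for stationarity

var = omega / (1 - alpha - beta)    # start at the long-run variance (= 1.0 here)
returns, variances = [], []
for _ in range(1000):
    eps = random.gauss(0, 1) * var ** 0.5        # shock scaled by current volatility
    var = omega + alpha * eps ** 2 + beta * var  # GARCH(1,1) variance recursion
    returns.append(eps)
    variances.append(var)

avg_var = sum(variances) / len(variances)
print(round(avg_var, 2))  # near the long-run variance omega / (1 - alpha - beta)
```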
5. ARIMA stock-price prediction – ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. It is a class of models that captures a suite of different standard temporal structures in time-series data.
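Two of the three ARIMA pieces can be sketched by hand: the "I" (differencing) and the "AR" (an autoregression on the differenced series). The price series below is an illustrative assumption, and a real forecast would use statsmodels' ARIMA class rather than this hand-rolled AR(1) fit:

```python
prices = [100, 102, 101, 105, 107, 106, 110, 113, 112, 116]

# "I": first-difference the series (integration order d = 1) to make it stationary.
diffs = [b - a for a, b in zip(prices, prices[1:])]

# "AR": fit an AR(1) coefficient phi by least squares on lagged pairs.
x, y = diffs[:-1], diffs[1:]
phi = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# One-step-ahead forecast: predict the next difference, then undo the differencing.
next_diff = phi * diffs[-1]
forecast = prices[-1] + next_diff
print(round(forecast, 2))
```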