What An Algorithm Worker Needs To Grasp

I stumbled upon a training course given by a so-called expert to train algorithm workers, which claims that graduates will command an extraordinary salary once qualified. So I jotted down the outline and filled in what I know accordingly.

Module 1: Python Programming Fundamentals (2 class hours)

I. Python Basics

1. Language features – versatile, easy to pick up

2. Programming environment

Introduction to the Anaconda environment – It is an open-source, easy-to-install, high-performance Python and R distribution, with the conda package and environment manager and a collection of 1,000+ open-source packages with free community support. (from the Anaconda distribution documentation)

3. Syntax basics

Variables and identifiers

4. Common functions

5. Statement structures

Sequence, loops, conditionals, and recursion

6. Introduction to common libraries

NumPy, Pandas, Matplotlib, etc.

7. Object-oriented methods

Case study

II. Algorithm Analysis and Big O

1. Introduction to algorithm analysis and Big O – Big O notation describes the performance or complexity of an algorithm, typically in terms of its worst-case behavior. (A small sketch follows this section.)

2. Big O examples

3. Big O of Python data structures
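For the last point, a minimal sketch, assuming CPython's built-in containers: a membership test on a list is O(n), while on a set it is O(1) on average, which is easy to confirm with timeit.

    import timeit

    n = 100_000
    data_list = list(range(n))
    data_set = set(data_list)

    # Membership test: O(n) linear scan of the list vs. O(1) average-case hash lookup.
    t_list = timeit.timeit(lambda: n - 1 in data_list, number=1_000)
    t_set = timeit.timeit(lambda: n - 1 in data_set, number=1_000)
    print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")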

Module 2: Common Libraries and Their Applications (2 class hours)

I. Applications of the NumPy library

1. Features

2. Functions and methods

II. Applications of the Pandas library

1. Introduction to time series handling

2. DataFrame and Series

3. Common methods and functions

4. Database-like queries

III. Applications of visualization libraries

1. Purpose and construction of common plots

Scatter plots, pie charts, histograms, Q-Q plots, heat maps, etc.

2. Using Matplotlib, Seaborn, and Pandas plotting

3. Object properties

Module 3: Common Data Structures (2.5 class hours)

I. Arrays

Introduction to array sequences

Dynamic arrays and low-level arrays

Common interview questions

II. Stacks, Queues, and Deques

Introduction

Python implementations

Common examples

III. Linked Lists

Singly and doubly linked lists

Common problems

IV. Trees

Representations of tree structures

Tree traversal

Binary search trees

Common applications

V. Graphs

Introduction to graphs

Adjacency matrices and adjacency lists (see the sketch after this list)

Common applications
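A minimal sketch of the adjacency-list representation, using a made-up unweighted, undirected example graph, together with a breadth-first traversal over it:

    from collections import deque

    # Adjacency list: each vertex maps to the list of its neighbors.
    graph = {
        'A': ['B', 'C'],
        'B': ['A', 'D'],
        'C': ['A', 'D'],
        'D': ['B', 'C'],
    }

    def bfs(graph, start):
        """Return the vertices in breadth-first order starting from `start`."""
        visited, order = {start}, []
        queue = deque([start])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbor in graph[node]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    queue.append(neighbor)
        return order

    print(bfs(graph, 'A'))   # ['A', 'B', 'C', 'D']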

Module 4: Python Implementations of Classic Algorithms (2.5 class hours)

1. Greedy algorithms: principles and an example – in contrast to the brute-force approach to assignment-style optimization problems, a greedy approach sorts the list and pairs the shortest remaining task with the longest one:

    A = [6, 3, 2, 7, 5, 5]
    A = sorted(A)              # sort ascending
    print(A)                   # [2, 3, 5, 5, 6, 7]
    for i in range(len(A) // 2):
        print(A[i], A[~i])     # pair the i-th smallest with the i-th largest

2. Recursion and traversal

Principles of recursion

Sequence traversal and binary search

Depth-first and breadth-first traversal

Common use cases

3. Common sorting algorithms: principles and examples

4. Introduction to dynamic programming: principles and example use cases

5. Hash functions: principles and applications of hash tables

A hash table (or hash map) stores key-value pairs; each key is transformed into a small index number, and this transformation is the hashing algorithm (or hash function), e.g. address = key mod n.

Collision resolution:
- linear probing
- plus-3 rehash
- quadratic probing
- double hashing
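A minimal sketch of a fixed-size hash table that uses the modulo hash above and linear probing for collisions (the table size and keys are made up for illustration):

    class LinearProbingTable:
        """Toy hash table: address = key mod n, with linear probing on collisions."""

        def __init__(self, n=7):
            self.n = n
            self.slots = [None] * n            # each slot holds a (key, value) pair

        def put(self, key, value):
            index = key % self.n
            # Linear probing: step forward until an empty or matching slot is found.
            for step in range(self.n):
                probe = (index + step) % self.n
                if self.slots[probe] is None or self.slots[probe][0] == key:
                    self.slots[probe] = (key, value)
                    return
            raise RuntimeError("hash table is full")

        def get(self, key):
            index = key % self.n
            for step in range(self.n):
                probe = (index + step) % self.n
                if self.slots[probe] is None:
                    raise KeyError(key)
                if self.slots[probe][0] == key:
                    return self.slots[probe][1]
            raise KeyError(key)

    table = LinearProbingTable()
    table.put(10, "a")      # 10 % 7 == 3
    table.put(17, "b")      # 17 % 7 == 3 -> collision, probes to slot 4
    print(table.get(17))    # b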

Module 5: Machine Learning Algorithms and Their Python Applications (4 class hours)

I. Overview of Machine Learning Algorithms and Mathematical Foundations

1. Probability and statistics basics

2. Bayes' theorem

3. The maximum likelihood principle

4. Overview of the machine learning "arsenal"

II. Algorithms for Optimization Problems

1. Prediction models and least squares: (multiple) linear regression

2. The Lagrange method, with a portfolio management case study – maximizing a function subject to a constraint is common in economics, which is where Lagrange multipliers come into play. Refer to this article for its application. (A small worked example follows this list.)

3. Newton's method, steepest descent, and their variants
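Following up on the Lagrange multiplier item above, a minimal sketch with a made-up constrained problem (maximize f(x, y) = xy subject to x + y = 10), rather than the article's portfolio example:

    import sympy as sp

    x, y, lam = sp.symbols('x y lam', real=True)
    f = x * y                    # objective to maximize
    g = x + y - 10               # constraint g(x, y) = 0
    L = f - lam * g              # Lagrangian

    # Stationary points: set the gradient of L to zero together with the constraint.
    solution = sp.solve([sp.diff(L, x), sp.diff(L, y), g], [x, y, lam], dict=True)
    print(solution)              # x = y = lam = 5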

III. Logistic Regression and Key Machine Learning Concepts

1. Principles of logistic regression

2. Loss functions

3. Bias and variance

4. Underfitting and overfitting

5. Evaluation metrics and methods

6. Case study

IV. Ideas Behind Classic Machine Learning Algorithms

1. The EM idea: K-means and related algorithms – K-means clustering is an unsupervised machine learning algorithm. It starts with randomly selected centroids, then performs iterative calculations to optimize the positions of the centroids.
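A minimal sketch of that assign-then-update loop in plain NumPy on made-up two-dimensional data (in practice one would reach for sklearn.cluster.KMeans):

    import numpy as np

    rng = np.random.default_rng(0)
    # Two made-up blobs of points around (0, 0) and (5, 5).
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

    k = 2
    # Initial centroids; K-means normally picks them at random,
    # here one point from each blob is used for a stable demo.
    centroids = np.array([X[0], X[-1]])
    for _ in range(10):
        # Assignment step: each point goes to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

    print(centroids)   # roughly [0, 0] and [5, 5]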

2. Tree-based algorithms: impurity measures, i.e. entropy and the Gini index

Ensemble principles: Boosting, Bagging, Stacking –

Bagging, boosting, and stacking are the three main terms describing the ensembling (combination) of several models into one more effective model:
1. Bagging is shorthand for the combination of bootstrapping and aggregating. Bootstrapping helps decrease the variance of the classifier and reduce overfitting by resampling data from the training set with the same cardinality as the original set. The resulting ensemble should be less overfitted than a single individual model.
2. Boosting adds additional models to the overall ensemble sequentially, each new model focusing on the errors of the models before it.
3. In stacking, a new model is trained on the combined predictions of two (or more) previous models. The predictions of the models in each layer are used as inputs to the next and are combined to form a new set of predictions.

Citing from the essay by Scott Fortmann-Roe, “Bagging and other resampling techniques can be used to reduce the variance in model predictions. In bagging (Bootstrap Aggregating), numerous replicates of the original data set are created using random selection with replacement. Each derivative data set is then used to construct a new model and the models are gathered together into an ensemble. To make a prediction, all of the models in the ensemble are polled and their results are averaged.

One powerful modeling algorithm that makes good use of bagging is Random Forests. Random Forests works by training numerous decision trees each based on a different resampling of the original training data. In Random Forests the bias of the full model is equivalent to the bias of a single decision tree (which itself has high variance). By creating many of these trees, in effect a “forest”, and then averaging them the variance of the final model can be greatly reduced over that of a single tree. In practice the only limitation on the size of the forest is computing time as an infinite number of trees could be trained without ever increasing bias and with a continual (if asymptotically declining) decrease in the variance.”

GBDT, Random Forest – GBMs and RFs are both ensemble learning methods, and they differ in how the trees are built. GBMs build trees one at a time and are more sensitive to overfitting when the data is noisy; RFs train each tree independently on a random sample and are less likely to overfit the training data. (A small comparison sketch follows this item.)

Algorithm tuning

Case studies
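A minimal comparison sketch on a synthetic dataset, assuming scikit-learn is available (the dataset and hyperparameters are made up for illustration):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Synthetic binary classification problem.
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                               random_state=0)

    rf = RandomForestClassifier(n_estimators=200, random_state=0)       # independent trees on bootstrap samples
    gbm = GradientBoostingClassifier(n_estimators=200, random_state=0)  # trees built sequentially

    for name, model in [("Random Forest", rf), ("GBM", gbm)]:
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")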

3. Clustering and dimensionality reduction: PCA, SVD, t-SNE

4. Support vector machines (SVM)

Margin and geometric margin

The dual optimization problem

The kernel trick

Loss functions: from cross entropy to hinge loss – cross entropy measures the average number of bits needed to identify an event drawn from the dataset. (See the sketch after this list.)

Application cases
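A minimal sketch comparing the two losses for a single binary example, assuming labels y in {-1, +1} and a raw model score s (one common convention, not necessarily the course's):

    import numpy as np

    def hinge_loss(y, s):
        """Hinge loss max(0, 1 - y*s), as used by SVMs (y in {-1, +1})."""
        return np.maximum(0.0, 1.0 - y * s)

    def cross_entropy_loss(y, s):
        """Logistic (binary cross-entropy) loss log(1 + exp(-y*s))."""
        return np.log1p(np.exp(-y * s))

    scores = np.linspace(-2, 2, 5)        # raw scores for a positive example (y = +1)
    print(hinge_loss(1, scores))          # [3. 2. 1. 0. 0.]
    print(np.round(cross_entropy_loss(1, scores), 3))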

5. Feature engineering and practical techniques

How to use the scikit-learn library

Feature engineering basics

Common feature engineering algorithms

K-fold cross-validation

Data cleaning and imputation

Outlier detection

Module 6: Common Algorithms for Time Series Analysis (3 class hours)

1. Signal decomposition and time-frequency analysis

2. Filtering and reconstruction

3. ARIMA models

Introduction to model order selection

Python implementation

4. GARCH models: principles and Python implementation – an ARCH model is appropriate when the error variance in a time series follows an autoregressive (AR) model; if an autoregressive moving average (ARMA) model is assumed for the error variance, the model is a generalized autoregressive conditional heteroskedasticity (GARCH) model.
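A minimal sketch that simulates a GARCH(1,1) process directly from its defining recursion, with made-up parameters (a library such as arch or statsmodels would be used for actual fitting):

    import numpy as np

    rng = np.random.default_rng(0)
    omega, alpha, beta = 0.1, 0.1, 0.8       # made-up GARCH(1,1) parameters
    n = 1000

    eps = np.zeros(n)                        # simulated returns
    sigma2 = np.zeros(n)                     # conditional variances
    sigma2[0] = omega / (1 - alpha - beta)   # start from the unconditional variance
    for t in range(1, n):
        # GARCH(1,1): sigma^2_t = omega + alpha * eps^2_{t-1} + beta * sigma^2_{t-1}
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
        eps[t] = np.sqrt(sigma2[t]) * rng.standard_normal()

    print(eps.var(), sigma2.mean())          # both close to omega / (1 - alpha - beta) = 1.0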

5. Stochastic processes: theory, random sampling, and Monte Carlo methods

6. Case studies

ARIMA stock price prediction – ARIMA is an acronym for AutoRegressive Integrated Moving Average. It is a class of model that captures a suite of different standard temporal structures in time series data. (A small sketch follows this list.)

Multi-order exponential filtering

Signal decomposition and reconstruction
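A minimal sketch of fitting an ARIMA model with statsmodels, using simulated data rather than real stock prices and an arbitrary (1, 1, 1) order (in practice the order would be chosen from ACF/PACF plots or information criteria):

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    # Simulated "price" series: a random walk with drift.
    rng = np.random.default_rng(0)
    prices = 100 + np.cumsum(0.1 + rng.standard_normal(500))

    model = ARIMA(prices, order=(1, 1, 1))   # AR(1), first differencing, MA(1)
    result = model.fit()
    print(result.summary())
    print(result.forecast(steps=5))          # the next five predicted values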

Module 7: Deep Learning Practice and Advanced Topics (3 class hours)

I. Neural Network Principles

1. Activation functions

2. Gradient descent

3. Forward propagation and backpropagation
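A minimal sketch of forward and backward propagation for a single sigmoid neuron trained by gradient descent on a made-up toy dataset (real networks stack many such layers and rely on an autodiff framework):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Toy data: learn the logical AND of two binary inputs.
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = np.array([0., 0., 0., 1.])

    rng = np.random.default_rng(0)
    w, b = rng.standard_normal(2), 0.0
    lr = 0.5

    for _ in range(2000):
        # Forward pass: linear score followed by the sigmoid activation.
        y_hat = sigmoid(X @ w + b)
        # Backward pass: for sigmoid + binary cross-entropy, dLoss/dz = y_hat - y.
        grad_z = y_hat - y
        w -= lr * (X.T @ grad_z) / len(X)
        b -= lr * grad_z.mean()

    print(np.round(sigmoid(X @ w + b), 2))   # moves toward [0, 0, 0, 1]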

II. Implementing Neural Networks

1. Basics of the TensorFlow, Keras, and Theano libraries

2. Hands-on study of the low-level code

III. Key Problems and Optimizations

1. Dropout

2. Batch normalization

3. Activation function optimization

4. Architecture optimization
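A minimal Keras sketch showing where Dropout and BatchNormalization layers typically sit in a small fully connected network (the layer sizes and rates are made-up defaults, not the course's recipe):

    import tensorflow as tf
    from tensorflow.keras import layers

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),       # 20 made-up input features
        layers.Dense(64, activation="relu"),
        layers.BatchNormalization(),       # normalize activations to stabilize training
        layers.Dropout(0.5),               # randomly zero out units to reduce overfitting
        layers.Dense(64, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.summary()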

IV. Introduction to Convolutional Neural Networks

1. Image filtering and features

2. Calculating output feature map sizes (see the sketch after this list)

3. Hyperparameter tuning
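The usual formula for the spatial output size of a convolution is floor((W - K + 2P) / S) + 1, where W is the input size, K the kernel size, P the padding, and S the stride. A minimal sketch with made-up layer settings:

    def conv_output_size(w, k, p, s):
        """Spatial output size of a convolution: floor((W - K + 2P) / S) + 1."""
        return (w - k + 2 * p) // s + 1

    print(conv_output_size(32, 3, 1, 1))   # 32: "same"-style padding keeps the size
    print(conv_output_size(32, 5, 0, 1))   # 28
    print(conv_output_size(28, 2, 0, 2))   # 14: a typical 2x2, stride-2 pooling step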

V. Classic Deep Learning Models and Research Progress

1. Recurrent neural networks (RNN)

2. LSTM, GRU, etc.

3. New techniques and learning methods

Module 8: Interview Questions for Algorithm Positions (2 class hours)

1. Statistics and probability questions

2. Brain teasers

3. Database SQL questions

4. Overview of classic algorithm questions

5. Machine learning algorithm questions

6. Practical algorithm design questions
