Machine Learning Cheat Sheet

2021-01-14 C2 Memo
Basics In Machine Learning

Generalization Ability

A model’s ability to generalize to new data.

Overfitting vs. Underfitting

Overfitting: the model fits the training data too closely (including its noise) and generalizes poorly. Underfitting: the model is too simple to capture the underlying pattern, so both training and test error stay high.

Loss Function

Typical choices are: the 0-1 loss, the squared loss, the absolute loss, and the log loss.
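For reference, the standard forms of these losses, for a prediction $f(x)$ and label $y$:

$$L_{0\text{-}1}(y,f(x))=\mathbb{1}[y\ne f(x)],\quad L_2(y,f(x))=(y-f(x))^2,\quad L_1(y,f(x))=|y-f(x)|,\quad L_{\log}(y,f(x))=-\log P(y\mid x)$$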

Generalization Error

Generalization error: the expected loss over the true data distribution, $R(f)=\mathbb{E}_{(x,y)\sim P}[L(y,f(x))]$.

Test/Training error: the average loss over the test/training set, e.g. $\hat R_{\mathrm{train}}(f)=\frac{1}{N}\sum_{i=1}^{N}L(y_i,f(x_i))$; the test error is the empirical estimate of the generalization error.

Learning Objective

Minimize the training error as a tractable proxy for the generalization error. In the meantime, watch out for overfitting!

Model Evaluation

Accuracy, Precision and Recall
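In terms of the confusion-matrix counts (TP, FP, TN, FN):

$$\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN},\qquad \text{Precision}=\frac{TP}{TP+FP},\qquad \text{Recall}=\frac{TP}{TP+FN}$$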

Bias-Variance Decomposition

The expected error between a prediction and the true value at a given data point can be decomposed into bias, variance and noise, the three terms on the RHS of the equation below. Bias is the expected deviation of the prediction from the true value; variance measures how much the trained model fluctuates around the expected model. Increasing the amount of data is the only cure.
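A standard form of the decomposition (for squared loss, with true function $f$, learned predictor $\hat f$, and noise variance $\sigma^2$):

$$\mathbb{E}\big[(y-\hat f(x))^2\big]=\underbrace{\big(\mathbb{E}[\hat f(x)]-f(x)\big)^2}_{\text{bias}^2}+\underbrace{\mathbb{E}\big[(\hat f(x)-\mathbb{E}[\hat f(x)])^2\big]}_{\text{variance}}+\underbrace{\sigma^2}_{\text{noise}}$$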

Regularization

Adding a regularization term, adjusted by a coefficient (e.g. λ), to the loss function penalizes model complexity.

L2 regularization (Ridge): adds the penalty $\lambda\lVert w\rVert_2^2$; shrinks weights towards zero.

L1 regularization (Lasso): adds the penalty $\lambda\lVert w\rVert_1$; encourages sparse weights.
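For example, for linear regression with squared loss, the ridge objective takes the standard form:

$$J(w)=\frac{1}{N}\sum_{i=1}^{N}\big(y_i-w^{\top}x_i\big)^2+\lambda\lVert w\rVert_2^2$$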

Validation Strategy

Leave-one-out cross validation

Linear Models

Simple Linear Regression

Multivariate Linear Regression

Logistic Regression

Logistic Function (sigmoid): $\sigma(z)=\dfrac{1}{1+e^{-z}}$

Binary classification using the logistic function: model $P(y=1\mid x)=\sigma(w^{\top}x+b)$ and predict class 1 when this probability exceeds 0.5.

Training the logistic regression

Use MLE to derive the likelihood function, then maximize it (equivalently, minimize the negative log-likelihood).
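Under the standard Bernoulli model, the log-likelihood to maximize is:

$$\ell(w)=\sum_{i=1}^{N}\Big[y_i\log\sigma(w^{\top}x_i)+(1-y_i)\log\big(1-\sigma(w^{\top}x_i)\big)\Big]$$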

Gradient descent method

Batch gradient descent: each update uses the gradient computed on the whole training set.

Stochastic gradient descent: each update uses the gradient of a single (randomly chosen) sample.
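A minimal sketch (assumed code, not from the original notes) of the two variants for logistic regression, using NumPy only:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def batch_gd(X, y, lr=0.1, epochs=200):
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            grad = X.T @ (sigmoid(X @ w) - y) / len(y)   # gradient over the whole training set
            w -= lr * grad
        return w

    def sgd(X, y, lr=0.1, epochs=200, seed=0):
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for i in rng.permutation(len(y)):            # one sample per update
                w -= lr * (sigmoid(X[i] @ w) - y[i]) * X[i]
        return w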

Choice of learning rate

Monitor how the loss changes against the number of iterations: a rate that is too large makes the loss diverge or oscillate, a rate that is too small makes convergence slow.

TPR, FPR and ROC
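The standard definitions; the ROC curve plots TPR against FPR as the classification threshold varies:

$$TPR=\frac{TP}{TP+FN},\qquad FPR=\frac{FP}{FP+TN}$$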

Regularization of logistic regression

Support Vector Machine

Margin

Maximum Margin Classifier

The Lagrange method

Soft Margin SVM

The soft-margin objective, subject to relaxed constraints:
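The standard primal form (slack variables $\xi_i$, trade-off parameter $C$):

$$\min_{w,b,\xi}\ \frac{1}{2}\lVert w\rVert^2+C\sum_{i=1}^{N}\xi_i\quad\text{s.t.}\quad y_i(w^{\top}x_i+b)\ge 1-\xi_i,\ \ \xi_i\ge 0$$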

Loss function of SVM

SVM hinge loss: $\ell(y,f(x))=\max\big(0,\,1-y\,f(x)\big)$
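Written as an unconstrained objective (equivalent to the soft-margin problem up to the choice of constants):

$$\min_{w,b}\ \sum_{i=1}^{N}\max\big(0,\,1-y_i(w^{\top}x_i+b)\big)+\lambda\lVert w\rVert^2$$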

Kernel Function

Replace the inner product $x_i^{\top}x_j$ with a kernel $\kappa(x_i,x_j)=\phi(x_i)^{\top}\phi(x_j)$, so the feature mapping $\phi$ never needs to be computed explicitly.

Kernel functions:
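Commonly used kernels include the linear kernel $\kappa(x,z)=x^{\top}z$, the polynomial kernel $\kappa(x,z)=(x^{\top}z+c)^d$, and the Gaussian (RBF) kernel $\kappa(x,z)=\exp\!\big(-\lVert x-z\rVert^2/2\sigma^2\big)$.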

Neural Networks

Perceptron Model

A single-layer perceptron with a sign activation function: $\hat y=\mathrm{sign}(w^{\top}x+b)$.

Updating rule: for a misclassified sample $(x_i,y_i)$, $w\leftarrow w+\eta\,y_i x_i$ and $b\leftarrow b+\eta\,y_i$.
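A minimal sketch (assumed code, not from the original notes) of this update rule, with labels in {-1, +1}:

    import numpy as np

    def train_perceptron(X, y, lr=1.0, epochs=100):
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(epochs):
            errors = 0
            for xi, yi in zip(X, y):
                if yi * (xi @ w + b) <= 0:   # misclassified sample
                    w += lr * yi * xi        # w <- w + eta * y_i * x_i
                    b += lr * yi             # b <- b + eta * y_i
                    errors += 1
            if errors == 0:                  # converged (data linearly separable)
                break
        return w, b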

Not effective for data that is not linearly separable. Solution: hidden layers.

Multi-layer Neural Networks

Classification or regression; one or more hidden layers; the modeled relationship is not necessarily linear.

The squared error loss for one training sample $(x,y)$: $E=\frac{1}{2}\lVert\hat y-y\rVert^2$.

For an accumulated data set, sum (or average) the per-sample losses.

Updating rule: gradient descent on every weight, $w\leftarrow w-\eta\,\partial E/\partial w$.

For the output-layer weights, apply the chain rule to the loss through the output activation.

For the hidden-layer weights, apply the chain rule again, propagating the error backwards through the hidden units.

The Back-Propagation algorithm
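A minimal sketch (assumed code, not from the original notes) of one forward and backward pass for a single-hidden-layer network with sigmoid activations and squared error loss:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_step(x, y, W1, b1, W2, b2, lr=0.1):
        # forward pass
        h = sigmoid(W1 @ x + b1)                      # hidden activations
        y_hat = sigmoid(W2 @ h + b2)                  # output activations
        # backward pass (chain rule)
        delta2 = (y_hat - y) * y_hat * (1 - y_hat)    # output-layer error
        delta1 = (W2.T @ delta2) * h * (1 - h)        # hidden-layer error
        # gradient-descent updates
        W2 -= lr * np.outer(delta2, h)
        b2 -= lr * delta2
        W1 -= lr * np.outer(delta1, x)
        b1 -= lr * delta1
        return W1, b1, W2, b2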

Activation Functions

Sigmoid function: differentiable, not zero-centered, vanishing gradients.

Tanh activation function: differentiable, zero-centered, vanishing gradients.

ReLU: no vanishing gradients in the positive region, computationally efficient, but kills gradients when the input is negative.

Get stuck at local minima.

Prone to overfitting.

Early stopping

Dropout regularization.

Deep Learning

Convolution Layer


The filter is the target of training, like the weights of a fully connected layer.

Advantages of convolution: local connectivity and weight sharing greatly reduce the number of parameters.

Pooling Layer

Decreases the feature map size while keeping the important information.

max pooling (often better than average pooling)

Stride

Step width of "scanning".

Padding

A method to control the shape of the feature map after convolution, e.g. when the input size is not an integral multiple of the stride, or when the output should keep the same shape as the input.
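With input width $W$, filter size $F$, zero-padding $P$ and stride $S$, the output width follows the standard formula:

$$O=\frac{W-F+2P}{S}+1$$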

Hyperparameters in CNNs

Filter size F (but the filter itself consists of learned parameters!)

The amount of zero-padding P

Flatten

The flattened feature map is connected to fully connected layers.

A fully connected layer is the same as in a regular neural network. The last FC layer is called the output layer and uses an activation function such as Softmax.

Softmax

Used in multiclass classification: assigns a probability to each class in a multi-class problem. The logistic function is a special case (two classes).
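The standard form, for logits $z_1,\dots,z_K$:

$$\mathrm{softmax}(z)_k=\frac{e^{z_k}}{\sum_{j=1}^{K}e^{z_j}}$$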

Loss function in Logistic Regression & Neural Networks with Softmax
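The cross-entropy loss over one-hot labels $y_{ik}$ and predicted probabilities $\hat p_{ik}$:

$$L=-\sum_{i=1}^{N}\sum_{k=1}^{K}y_{ik}\log\hat p_{ik}$$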

To interpret the loss function: only the probability assigned to the true class enters the sum, so minimizing the loss pushes that probability towards 1.

Decision Tree

Segments the attribute space into a number of simple regions and uses the labels of the regions for prediction. A region's label can be, e.g., the average, the maximum, ...

Separate the space by layers of "if".

can handle mixed variables

implicitly perform feature selection

Entropy
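The standard definition, with $p_k$ the fraction of samples taking the $k$-th value:

$$H(S)=-\sum_{k=1}^{n}p_k\log_2 p_k$$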

n is the number of possible values.

Information Gain
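The standard ID3 form, with $S_v$ the subset of $S$ where attribute $a$ takes value $v$:

$$\mathrm{Gain}(S,a)=H(S)-\sum_{v\in\mathrm{Values}(a)}\frac{|S_v|}{|S|}\,H(S_v)$$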


S is the dataset and a is the attribute.

Information Gain Ratio (Normalized Information Gain)

The more possible values an attribute has, the larger its information gain tends to be.

The gain ratio hedges against this bias towards attributes that produce many splits.
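A common definition (C4.5 style), normalizing by the split information $IV(a)$:

$$\mathrm{GainRatio}(S,a)=\frac{\mathrm{Gain}(S,a)}{IV(a)},\qquad IV(a)=-\sum_{v\in\mathrm{Values}(a)}\frac{|S_v|}{|S|}\log_2\frac{|S_v|}{|S|}$$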

Gini Gain

ID3 Algorithm

Start from the root node with all the data.
Calculate the information gain for every attribute.
Select the attribute with the highest information gain.
Split the node's data with respect to this attribute.

Continuous Attributes in ID3

Choose the split threshold that maximizes the information gain.

Overfitting

Constraints

Minimum leaf size: stop splitting S if the number of samples falls below a fixed threshold.
Maximum depth: stop splitting S if the number of splits reaches a threshold.
Maximum number of nodes: stop if the tree's node count reaches a threshold.

Pruning

Generalization ability may increase after some "pruning"

Pre-pruning: stop growing a branch if the split makes little difference.
Post-pruning: start from the bottom of the tree and examine each non-leaf subtree; replace the subtree with a leaf if that makes little difference.

Regression Tree

Split the input space into regions by choosing a splitting attribute $j$ and a threshold $t$.

Greedy Recursive: Choose the attribute and threshold to minimize the loss function on that level.

$$R_1(j,t)=\{x\mid x_j<t\},\qquad R_2(j,t)=\{x\mid x_j\ge t\}$$

$$\min_{j,t}\Big[\sum_{i:\,x_i\in R_1(j,t)}(y_i-\hat c_1)^2+\sum_{i:\,x_i\in R_2(j,t)}(y_i-\hat c_2)^2\Big]$$

Pruning the Regression Tree

Use cost-complexity pruning.

How to choose the complexity parameter: e.g., by cross-validation.

Ensemble Learning

Collective wisdom is greater than the smartest in the crowd!

Bagging

Train several different models separately, then have all of the models vote on the output for each test example.

Create random subsets of the training data using bootstrap sampling.
Train a base learner on each bootstrap sample separately.
Aggregate the outputs of all base learners:
• Majority voting for classification
• Averaging for regression

Recall the bias-variance decomposition: bagging mainly reduces variance.

It works well for low-bias, high-variance learners (e.g., decision trees).
It may not benefit high-bias, low-variance learners much (e.g., simple linear models).

Out-of-bag estimation

Each bootstrap sample leaves out about 37% of the training data (since $(1-1/n)^n\to e^{-1}\approx 0.368$); these out-of-bag samples can be used to estimate the generalization error.

Random Forests

The problem of bagging: the models trained from bootstrap samples are probably positively correlated

To reduce correlation: random feature selection!

Each tree is learned from a bootstrap sample.
• To grow a random-forest tree h_t, repeat the following steps:
• Randomly select m variables from the p (> m) variables.
• Find the best split based on the m variables and split the node.
Practically, m is often set to about √p for classification (or p/3 for regression).

Random forests = Decision tree + Bagging + Random feature selection
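A minimal usage sketch (assuming scikit-learn is available; not part of the original notes) that combines bootstrap sampling, random feature selection and out-of-bag estimation:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    # 100 trees; each split considers about sqrt(p) randomly chosen features
    clf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                 bootstrap=True, oob_score=True, random_state=0)
    clf.fit(X_tr, y_tr)
    print(clf.oob_score_, clf.score(X_te, y_te))   # out-of-bag and test accuracy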

Boosting

Problem: how can a new learner take a lesson from the previous learners? And why not rely more on the better-performing learners?

AdaBoost

Weight Update Method

To see the mechanism of the weight update: a sample's weight is increased if the current learner classifies it wrongly, and decreased if it is classified correctly. The update can also be written in exponential form (see below).

After the adjustment, the correctly and wrongly classified samples each account for half of the total weight.
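For reference, the standard AdaBoost quantities: the weighted error $\epsilon_t$, the learner weight $\alpha_t$, and the sample-weight update with normalizer $Z_t$ (the weight is multiplied by $e^{\alpha_t}$ if wrong, $e^{-\alpha_t}$ if right):

$$\alpha_t=\frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t},\qquad w_{t+1,i}=\frac{w_{t,i}}{Z_t}\exp\big(-\alpha_t\,y_i\,h_t(x_i)\big)$$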


Naïve Bayes

Loss Function

Formulation of the Naïve Bayes Classifier

Naïve assumption: attributes are conditionally independent given the class.

MLE Formulation to Estimate the Probabilities
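For reference, under the naïve assumption the classifier takes the standard form (the class prior $P(c)$ and the conditionals $P(x_j\mid c)$ are the components estimated by MLE):

$$\hat y=\arg\max_{c}\ P(c)\prod_{j=1}^{d}P(x_j\mid c)$$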

Estimate the components of the classifier from the sample data!

Laplacian Correction
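A standard form of the correction, with $N_c$ samples of class $c$ and $S_j$ possible values of attribute $j$:

$$\hat P(x_j=v\mid c)=\frac{\#\{x_j=v,\ y=c\}+1}{N_c+S_j}$$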

Used in case some component has an observed count of 0, which would cause a too extreme (overfitted) estimate. The added constant could be something other than 1.

Continuous Features

Assume the feature follows a Gaussian distribution.


Bayesian Network

Conditional Independence

The plain chain-rule factorization gets too complicated as the number of variables grows.

A graph G(V, E) consists of a set of nodes V and a set of edges E.
A path is a series of edges leading from one node to another.
A cycle is a series of nodes such that we can get back to where we started.

For a directed graph:

The parents of a node are the set of all nodes that feed into it.
Ancestors: a node's parents, their parents, and so on.
Directed acyclic graph (DAG): a directed graph with no directed cycles.

Bayesian Network Representation

Conditional distributions for each node, given its parents.

Bayesian network = Graph + Local Conditional Probabilities
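The joint distribution factorizes according to the graph:

$$P(X_1,\dots,X_n)=\prod_{i=1}^{n}P\big(X_i\mid\mathrm{Parents}(X_i)\big)$$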

A Bayesian network assumes that a node is independent of its non-descendant nodes given its parent nodes.

D-separation


X is d-separated from Y by Z if there is no active path between X and Y when Z is observed.
X and Y are independent given Z if X is d-separated from Y given Z.

Clustering

K-Means

Minimize the SSE!
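The objective, summing squared distances of points to their cluster centroid $\mu_k$:

$$SSE=\sum_{k=1}^{K}\sum_{x\in C_k}\lVert x-\mu_k\rVert^2$$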

Choice of K
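One common criterion is the silhouette coefficient, where a(i) is the mean distance from point i to the other points in its own cluster and b(i) is the mean distance to the points in the nearest other cluster:

$$s(i)=\frac{b(i)-a(i)}{\max\{a(i),\,b(i)\}}$$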


If b(i) is large and a(i) is small, then s(i) → 1; otherwise s(i) → -1.

How to Initialize Wisely

This makes it more likely to choose centroids in different clusters.

Hierarchical Clustering

Agglomerative: merge clusters successively ("bottom-up")
Divisive: divide clusters successively ("top-down")

Agglomerative Clustering

The idea of Agglomerative Clustering is to merge similar clusters and incrementally build larger clusters out of smaller clusters.

How to define CLOSE: e.g., single linkage (minimum pairwise distance), complete linkage (maximum pairwise distance), or average linkage (mean pairwise distance).

Gaussian mixture model
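The standard mixture density, with mixing weights $\pi_k$ that sum to 1:

$$p(x)=\sum_{k=1}^{K}\pi_k\,\mathcal{N}\big(x\mid\mu_k,\Sigma_k\big)$$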

Mixture of Gaussians

Dimensionality Reduction

Principal Component Analysis

The variance of the projected data is maximized: pick a line along which the data is spread out the most.
Equivalently, the projection error is minimized: the linear projection should minimize the average projection cost, computed as the mean squared distance between the data points and their projections.

PCA Algorithm


For a dataset with N samples and d features:

First, center the dataset around 0 (subtract each feature's mean so every feature has mean 0).

Compute the covariance matrix of the centered data.

Compute the eigenvectors/eigenvalues of the covariance matrix (eigen decomposition)

This gives a matrix of eigenvectors, which are the principal component vectors.

Then rank the eigenvectors by their eigenvalues, highest to lowest, and keep the top k.

By choosing the top k eigenvalues (and their eigenvectors), we realize the dimensionality reduction.
The new dimensions are orthogonal, thus the transformed features have 0 covariance.

Reconstruction from compressed representation
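A minimal NumPy sketch of the algorithm above, including the reconstruction step (assumed code, not from the original notes):

    import numpy as np

    def pca(X, k):
        mu = X.mean(axis=0)                      # feature means
        Xc = X - mu                              # center the data around 0
        cov = np.cov(Xc, rowvar=False)           # covariance matrix (d x d)
        eigvals, eigvecs = np.linalg.eigh(cov)   # eigen decomposition (symmetric matrix)
        order = np.argsort(eigvals)[::-1]        # rank by eigenvalue, highest to lowest
        W = eigvecs[:, order[:k]]                # top-k principal component vectors (d x k)
        Z = Xc @ W                               # compressed representation (N x k)
        X_rec = Z @ W.T + mu                     # reconstruction from the compressed representation
        return Z, X_rec, W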

Choice of K

Choose the smallest K that retains a desired fraction of the total variance (e.g., 99%), measured by the ratio of the sum of the top-K eigenvalues to the sum of all eigenvalues.
