Machine Learning Cheat Sheet
Basics In Machine Learning
Generalization Ability
A model’s ability to generalize to new data.
Overfitting vs. Underfitting
Overfitting: the model fits noise in the training data and generalizes poorly; underfitting: the model is too simple to capture the underlying pattern.
Loss Function
Typical choices are:
• Squared error loss: $(y - f(x))^2$
• 0-1 loss: $\mathbb{1}(y \ne f(x))$
Generalization Error
Generalization error: the expected loss on new data drawn from the underlying distribution, $E_{\text{gen}}(f) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[L(y, f(x))]$.
Test/Training error: the average loss over the test/training set, $\frac{1}{m}\sum_{i=1}^{m} L(y_i, f(x_i))$.
Learning Objective
Minimize the training (empirical) error, $\min_f \frac{1}{m}\sum_{i=1}^{m} L(y_i, f(x_i))$. In the meantime, watch out for overfitting!
Model Evaluation
Accuracy, Precision and Recall
$\mathrm{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}$, $\mathrm{Precision} = \frac{TP}{TP+FP}$, $\mathrm{Recall} = \frac{TP}{TP+FN}$.
Bias-variance decomposition
$\mathbb{E}[(y - \hat f(x))^2] = \mathrm{Bias}[\hat f(x)]^2 + \mathrm{Var}[\hat f(x)] + \sigma^2$
The interpretation of this equation: the expected error between the prediction and the true value of a data point decomposes into bias, variance and noise, the three terms on the RHS. Bias is the expected deviation from the true value; variance measures how much the trained model differs from the expected model. Increasing the amount of data is the only cure.
Regularization
Adding a regularization term weighted by the coefficient $\lambda$.
L2 regularization (Ridge): $J(w) = \sum_i (y_i - w^T x_i)^2 + \lambda \|w\|_2^2$
L1 regularization (Lasso): $J(w) = \sum_i (y_i - w^T x_i)^2 + \lambda \|w\|_1$
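As a minimal sketch of how the two penalties behave (assuming scikit-learn; the alpha values and the toy data are illustrative, with alpha playing the role of $\lambda$):

```python
# Minimal sketch of L2/L1-regularized linear regression (assumes scikit-learn).
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # alpha is the regularization coefficient (lambda)
lasso = Lasso(alpha=0.1).fit(X, y)

print(ridge.coef_)  # L2 shrinks coefficients smoothly
print(lasso.coef_)  # L1 drives some coefficients exactly to zero
```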
Validation Strategy
Leave-one-out cross validation
Linear Models
Simple Linear Regression
Multivariate Linear Regression
Logistic Regression
Logistic Function (sigmoid): $\sigma(z) = \frac{1}{1+e^{-z}}$
Binary classification using the logistic function: $P(y=1\mid x) = \sigma(w^T x) = \frac{1}{1+e^{-w^T x}}$; predict $y=1$ when this probability exceeds 0.5.
Training the logistic regression
Use MLE to derive the likelihood function: $L(w) = \prod_{i=1}^{m} \sigma(w^T x_i)^{y_i}\,(1-\sigma(w^T x_i))^{1-y_i}$, and minimize the negative log-likelihood $J(w) = -\sum_{i}\big[y_i\log\sigma(w^T x_i) + (1-y_i)\log(1-\sigma(w^T x_i))\big]$.
$J(w)$ is a convex function, so we can apply the gradient descent method.
Gradient descent method
Batch gradient descent: $w \leftarrow w - \eta \sum_{i=1}^{m} (\sigma(w^T x_i) - y_i)\,x_i$ (uses all samples for each update).
Stochastic gradient descent: $w \leftarrow w - \eta\,(\sigma(w^T x_i) - y_i)\,x_i$ (one randomly chosen sample per update).
Choice of learning rate
Plot the loss against the number of iterations to diagnose the learning rate: too large and the loss diverges or oscillates, too small and convergence is slow.
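A minimal sketch contrasting the two update schemes for logistic regression (NumPy only; the learning rate eta and the iteration counts are illustrative choices):

```python
# Batch vs. stochastic gradient descent for logistic regression (NumPy only).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gd(X, y, eta=0.1, n_iters=500):
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y)   # gradient over the whole batch
        w -= eta * grad / len(y)
    return w

def sgd(X, y, eta=0.1, n_epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):   # one sample per update
            w -= eta * (sigmoid(X[i] @ w) - y[i]) * X[i]
    return w
```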
TPR, FPR and ROC
$\mathrm{TPR} = \frac{TP}{TP+FN}$, $\mathrm{FPR} = \frac{FP}{FP+TN}$; the ROC curve plots TPR against FPR as the decision threshold varies.
Regularization of logistic regression
Support Vector Machine
Margin
Maximum margin classifier
The Lagrange method
Soft Margin SVM
Soft margin
$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_i \xi_i$, subject to: $y_i(w^T x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$ for all $i$.
Loss function of SVM
SVM hinge loss: $\ell(y, f(x)) = \max(0,\ 1 - y\,f(x))$, so training minimizes $\sum_i \max(0,\ 1 - y_i(w^T x_i + b)) + \lambda\|w\|^2$.
Kernel Function
Replace the inner product $x_i^T x_j$ by $\kappa(x_i, x_j) = \phi(x_i)^T\phi(x_j)$, which implicitly maps the data into a higher-dimensional feature space.
Kernel functions: linear $x_i^T x_j$; polynomial $(x_i^T x_j + c)^d$; Gaussian (RBF) $\exp\!\big(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\big)$.
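A minimal sketch of a soft-margin SVM with an RBF kernel (assuming scikit-learn; the dataset, C and gamma are illustrative, not tuned):

```python
# Soft-margin SVM with an RBF kernel on a non-linearly-separable toy dataset.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma=1.0)  # C controls the softness of the margin
clf.fit(X, y)
print(clf.score(X, y))
```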
Neural Networks
Perceptron Model
A single-layer perceptron with a sign/step activation function: $\hat y = \mathrm{sign}(w^T x + b)$.
Updating rules: for each misclassified sample, $w \leftarrow w + \eta\, y_i x_i$ and $b \leftarrow b + \eta\, y_i$.
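A minimal sketch of the perceptron update rule above (NumPy only; labels are assumed to be +1/-1, and eta and the epoch count are illustrative):

```python
# Perceptron training loop: update only on misclassified points.
import numpy as np

def perceptron(X, y, eta=1.0, n_epochs=20):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified point
                w += eta * yi * xi
                b += eta * yi
    return w, b
```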
Not effective for data that is not linearly separable. Solution: hidden layers.
Multi-layer Neural Networks
• Classification or regression
• One or more hidden layers
• Not necessarily a linear relationship
The squared error loss for one training example: $E_i = \frac{1}{2}\sum_j (\hat y_j^{(i)} - y_j^{(i)})^2$.
For an accumulated dataset of $m$ examples: $E = \frac{1}{m}\sum_{i=1}^{m} E_i$.
Updating rules: $w \leftarrow w - \eta\,\frac{\partial E}{\partial w}$, $b \leftarrow b - \eta\,\frac{\partial E}{\partial b}$.
For a weight $w_{hj}$ connecting hidden node $h$ to output node $j$:
According to the chain rule, $\frac{\partial E}{\partial w_{hj}} = \frac{\partial E}{\partial \hat y_j}\cdot\frac{\partial \hat y_j}{\partial z_j}\cdot\frac{\partial z_j}{\partial w_{hj}}$, where $z_j$ is the weighted input to output node $j$.
For a weight in an earlier (hidden) layer:
By the chain rule, the same product extends backwards through every node that the weight influences, which is why the error is propagated layer by layer from the output back towards the input.
$\delta_j = \frac{\partial E}{\partial z_j}$ can be interpreted as the error at node $j$.
The Back-Propagation algorithm
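A minimal sketch of one back-propagation step for a one-hidden-layer network with sigmoid activations and squared-error loss (NumPy only; the weight shapes and eta are illustrative assumptions):

```python
# One step of back-propagation: forward pass, per-layer errors (deltas), updates.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, W2, eta=0.1):
    # forward pass
    h = sigmoid(W1 @ x)          # hidden activations
    y_hat = sigmoid(W2 @ h)      # output activations
    # backward pass: delta = dE/dz at each layer
    delta_out = (y_hat - y) * y_hat * (1 - y_hat)   # error at the output nodes
    delta_hid = (W2.T @ delta_out) * h * (1 - h)    # error propagated to hidden nodes
    # gradient-descent updates
    W2 -= eta * np.outer(delta_out, h)
    W1 -= eta * np.outer(delta_hid, x)
    return W1, W2
```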
Activation functions
• Sigmoid: differentiable, not zero-centered, suffers from vanishing gradients
• Tanh: differentiable, zero-centered, suffers from vanishing gradients
• ReLU: no vanishing gradients in the positive region, computationally efficient, but kills gradients when the input is negative
Issues of neural networks
Get stuck at local minima.
Prone to overfitting.
Early stopping
Dropout regularization.
Deep Learning
Convolution Layer
The filter (kernel) weights are what is being trained, just like the weights in a regular neural network.
Advantage of convolution
Local connectivity and weight sharing greatly reduce the number of parameters compared with a fully connected layer.
Pooling Layer
Decreases the feature map size while keeping the important information.
Max pooling (often better than average pooling)
Stride
The step width of the filter as it scans across the input.
Padding
A method to control the shape of the feature map after convolution, e.g., when the input size is not an integral multiple of the stride, or when the output should keep the same spatial size as the input ("same" padding).
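A small helper (hypothetical, not from any library) that applies the standard output-size formula $O = \frac{W - F + 2P}{S} + 1$ to show how filter size, padding and stride interact:

```python
# Spatial output size of a convolution/pooling layer: (W - F + 2P) / S + 1.
def conv_output_size(W, F, P=0, S=1):
    return (W - F + 2 * P) // S + 1

print(conv_output_size(W=32, F=5, P=2, S=1))  # 32 -> "same" padding keeps the size
print(conv_output_size(W=32, F=2, P=0, S=2))  # 16 -> 2x2 max pooling with stride 2
```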
Hyperparameters in CNNs
• Filter size F (but the filter weights themselves are learned parameters!)
• The amount of zero-padding P
Flatten
The flattened feature map is then connected to fully connected layers.
A fully connected layer is the same as in a regular neural network. The last FC layer is called the output layer and uses an activation function such as Softmax.
Softmax
Used in multiclass classification: assigns a probability to each class, $\mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_k e^{z_k}}$. The logistic function is the special case with two classes.
Loss function in Logistic Regression & Neural Networks with Softmax
To interpret the loss function, $\prod_i \hat p_{i,y_i}$ is the cumulative likelihood of the observed labels. Our goal is to maximize it, and consequently minimize its negative log, $-\sum_i \log \hat p_{i,y_i}$ -- the loss function (cross-entropy)!
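A minimal sketch of softmax plus the cross-entropy loss (NumPy only; the logits and labels are made up for illustration):

```python
# Softmax over class scores, then the negative log-likelihood of the true labels.
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, y):
    # negative log-likelihood of the true class labels y (integer-encoded)
    return -np.log(probs[np.arange(len(y)), y]).mean()

logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
y = np.array([0, 2])
print(cross_entropy(softmax(logits), y))
```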
Decision Tree
Segments the attribute space into a number of simple regions and uses the label of each region for prediction. The region label can be, e.g., the average (regression) or the majority class (classification).
Separates the space by layers of "if" conditions.
• Can handle mixed (numerical and categorical) variables
• Implicitly performs feature selection
Entropy
$H(X) = -\sum_{k=1}^{n} p_k \log_2 p_k$, with $n$ the number of possible values.
For a decision-tree split, if the entropy after the split is still large, the split makes very little difference!
Information Gain
$\mathrm{Gain}(S, a) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)$, with $S$ the dataset and $a$ the attribute.
Information Gain Ratio (Normalized Information Gain)
The larger the number of possible values of an attribute, the larger its information gain tends to be.
The gain ratio hedges against this "more splits, more information gain" bias by dividing the gain by the attribute's intrinsic value: $\mathrm{GainRatio}(S,a) = \frac{\mathrm{Gain}(S,a)}{IV(a)}$, where $IV(a) = -\sum_v \frac{|S_v|}{|S|}\log_2\frac{|S_v|}{|S|}$.
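A minimal sketch of entropy and information gain for one categorical attribute (NumPy only; the toy labels and attribute values are made up for illustration):

```python
# Entropy of a label vector and the information gain of splitting on one attribute.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(labels, attribute_values):
    gain = entropy(labels)
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

y = np.array(["yes", "yes", "no", "no", "yes", "no"])
a = np.array(["sunny", "sunny", "rain", "rain", "overcast", "sunny"])
print(information_gain(y, a))
```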
Gini Gain
The reduction in the Gini index $\mathrm{Gini}(S) = 1 - \sum_k p_k^2$ achieved by a split (used by CART).
ID3 Algorithm
• Start from the root node with all data
• Calculate the information gain for every attribute
• Select the one with the highest information gain
• Split the data of the node with respect to this attribute
Continuous Attribute in ID3
Choose the split threshold that maximizes the information gain.
Overfitting
Constraints
• Minimum leaf size: stop splitting S if the number of samples falls below a fixed threshold
• Maximum depth: stop splitting S if the number of splits reaches the threshold
• Maximum number of nodes: stop if the number of nodes in the tree reaches the threshold
Pruning
Generalization ability may increase after some "pruning"
• Pre-pruning: stop growing a branch if the split makes little difference
• Post-pruning: start from the bottom of the tree and examine each non-leaf subtree; replace the subtree with a leaf if doing so makes little difference
Regression Tree
Split the attribute space into regions $R_1,\dots,R_M$ and predict $f(x) = \sum_m \hat c_m\, \mathbb{1}(x\in R_m)$.
$\hat c_m$ is determined by the average of the $y_i$ that fall in $R_m$.
Greedy recursion: choose the splitting attribute $j$ and threshold $t$ that minimize the loss function at that level.
$$R_1(j,t)=\{x \mid x_j < t\} \quad\text{and}\quad R_2(j,t)=\{x \mid x_j \ge t\}$$
$$\min_{j,t}\Big[\sum_{i:\,x_i\in R_1(j,t)}(y_i-\hat c_1)^2+\sum_{i:\,x_i\in R_2(j,t)}(y_i-\hat c_2)^2\Big]$$
Pruning the regression tree
Use cost-complexity pruning: $C_\alpha(T) = \sum_{m}\sum_{x_i\in R_m}(y_i-\hat c_m)^2 + \alpha|T|$.
How to choose $\alpha$? Cross validation!
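A minimal sketch of the greedy split search for a regression tree, assuming an exhaustive scan over observed values of each attribute (NumPy only):

```python
# Find the attribute j and threshold t that minimize the two-region squared error.
import numpy as np

def best_split(X, y):
    best = (None, None, np.inf)            # (attribute j, threshold t, loss)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] < t], y[X[:, j] >= t]
            if len(left) == 0 or len(right) == 0:
                continue
            loss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if loss < best[2]:
                best = (j, t, loss)
    return best
```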
Ensemble Learning
Collective wisdom is greater than the smartest in the crowd!
Bagging
Train several different models separately, then have all of the models vote on the output for each test example.
• Create random subsets of the training data using bootstrap sampling
• Train a base learner on each bootstrap sample separately
• Combine the outputs of all base learners
  • Majority voting for classification
  • Averaging for regression
Recall that for $n$ i.i.d. estimators with variance $\sigma^2$, the variance of their average is $\sigma^2/n$; with pairwise correlation $\rho$ it is $\rho\sigma^2 + \frac{1-\rho}{n}\sigma^2$.
• It works well for low-bias, high-variance learners (e.g., decision trees)
• It may not benefit high-bias, low-variance learners (e.g., simple linear models) much
Out-of-bag estimation
Each bootstrap sample leaves out about 37% of the data ($(1-\frac{1}{n})^n \approx e^{-1}$), which can be used as a built-in validation set.
Random Forests
The problem of bagging: the models trained from bootstrap samples are probably positively correlated
To reduce correlation: random feature selection!
• Each tree is learned from a bootstrap sample
• To grow a random-forest tree $h_t$, repeat the following steps:
  • Randomly select $m$ variables from the $p$ ($> m$) variables
  • Find the best split based on the $m$ variables and split the node
Practically, $m \approx \sqrt{p}$ may be a good choice!
Random forests = Decision tree + Bagging + Random feature selection
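A minimal sketch using scikit-learn's RandomForestClassifier (the parameter values are illustrative; max_features="sqrt" corresponds to the $m \approx \sqrt{p}$ heuristic above, and oob_score=True enables out-of-bag estimation):

```python
# Random forest = bagged decision trees + random feature selection at each split.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(
    n_estimators=100,      # number of bootstrap-trained trees
    max_features="sqrt",   # random feature selection at each split
    oob_score=True,        # out-of-bag estimate of generalization accuracy
    random_state=0,
).fit(X, y)
print(rf.oob_score_)
```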
Boosting
Problem: how can later learners take a lesson from the mistakes of earlier ones? And why not rely more on the better-performing learners?
AdaBoost
Weight Update Method
To see the mechanism of the weight update, $D_{t+1}(i) = \frac{D_t(i)\exp(-\alpha_t\, y_i\, h_t(x_i))}{Z_t}$ with $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$:
• If wrong ($h_t(x_i) \ne y_i$): the weight is multiplied by $e^{\alpha_t} > 1$.
• If right ($h_t(x_i) = y_i$): the weight is multiplied by $e^{-\alpha_t} < 1$.
For $\epsilon_t < \frac{1}{2}$ we have $\alpha_t > 0$, so misclassified samples receive more attention in the next round.
The update in another form: $D_{t+1}(i) = \frac{D_t(i)}{2\epsilon_t}$ if wrong, and $D_{t+1}(i) = \frac{D_t(i)}{2(1-\epsilon_t)}$ if right.
After the adjustment, the total weight of the correctly classified and of the misclassified samples are each still $\frac{1}{2}$.
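A minimal sketch of the AdaBoost weight update with decision stumps (NumPy plus scikit-learn's DecisionTreeClassifier as the weak learner; labels are assumed to be +1/-1 and the number of rounds is illustrative):

```python
# AdaBoost training loop: fit a stump on the weighted data, then reweight samples.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=20):
    """y must contain labels +1 / -1."""
    n = len(y)
    D = np.full(n, 1.0 / n)                 # start with uniform weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = stump.predict(X)
        eps = np.clip(D[pred != y].sum(), 1e-10, 1 - 1e-10)  # weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)
        D *= np.exp(-alpha * y * pred)       # up-weight wrong, down-weight right
        D /= D.sum()                         # renormalize (the Z_t term)
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas
```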
Naïve Bayes
Loss Function
Formulation of Naïve Bayes Classifier
$h(x) = \arg\max_{c} P(c)\prod_{j} P(x_j \mid c)$
Naïve assumption: attributes are conditionally independent given the class.
MLE Formulation to Estimate
Draw the components of the classifier ($P(c)$ and $P(x_j \mid c)$) from the sample data!
Laplacian Correction
In case some component has a count of 0, causing a too extreme (overfitted) estimate. The added constant could be something other than 1.
Continuous Features
Assume the feature follows a Gaussian distribution.
Bayesian Network
Conditional Independence
The chain-rule factorization $P(X_1,\dots,X_n) = \prod_i P(X_i \mid X_1,\dots,X_{i-1})$ gets too complicated as $n$ gets bigger.
Conditional independence suggests that given $Z$, $X$ and $Y$ are independent ($X \perp Y \mid Z$).
One example: $P(X_n \mid X_1,\dots,X_{n-1}) = P(X_n \mid X_{n-1})$, which means only the last attribute matters for predicting the focal attribute.
Graphical Terminology
• A graph G(V, E) consists of a set of nodes V and a set of edges E.
• A path is a series of edges that leads from one node to another.
• A cycle is a series of nodes such that we can get back to where we started.
For the directed graph
• Parents of a node: the set of all nodes that feed into it
• Ancestors: a node's parents, their parents, and so on
• Directed acyclic graph (DAG): a directed graph with no directed cycles
Bayesian Network Representation
Conditional distributions for each node given its parents.
Bayesian network = Graph + Local Conditional Probabilities
A Bayesian network assumes that a node is independent of its non-descendants given its parent nodes.
D-separation
$X$ is d-separated from $Y$ by $Z$ if there is no active path between $X$ and $Y$ when $Z$ is observed.
$X$ and $Y$ are independent given $Z$ if $X$ is d-separated from $Y$ given $Z$.
Clustering
K-Means
Minimize the SSE! $\mathrm{SSE} = \sum_{k=1}^{K}\sum_{x_i\in C_k}\|x_i - \mu_k\|^2$.
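A minimal sketch of Lloyd's algorithm for K-Means (NumPy only; K, the number of iterations and the plain random initialization are illustrative, and the sketch assumes no cluster becomes empty):

```python
# K-Means: alternate between assigning points to the nearest centroid and
# recomputing each centroid as the mean of its assigned points.
import numpy as np

def kmeans(X, k=3, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # assignment step: nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centroid becomes the mean of its assigned points
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids
```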
Choice of K
Silhouette coefficient: $s(i) = \frac{b(i)-a(i)}{\max(a(i),\,b(i))}$, where $a(i)$ is the mean distance to points in the same cluster and $b(i)$ is the mean distance to points in the nearest other cluster. If b(i) is large and a(i) is small, then s(i) -> 1; otherwise s(i) -> -1.
How to Initialize Wisely
Pick initial centroids that are more likely to lie in different clusters (e.g., prefer points far away from the centroids already chosen).
Hierarchical Clustering
• Agglomerative: merge clusters successively ("bottom-up")
• Divisive: divide clusters successively ("top-down")
Agglomerative Clustering
The idea of Agglomerative Clustering is to merge similar clusters and incrementally build larger clusters out of smaller clusters.
How to define CLOSE between two clusters:
• Single linkage: minimum distance between points in the two clusters
• Complete linkage: maximum distance between points in the two clusters
• Average linkage: average distance over all pairs of points across the two clusters
Gaussian Mixture Model
Mixture of Gaussians
Dimensionality Reduction
Principal Component Analysis
• The variance of the projected data is maximized: pick a line along which the data is spread out the most
• Equivalently, the projection error is minimized: the linear projection should minimize the average projection cost, computed as the mean squared distance between the data points and their projections
PCA Algorithm
For a dataset $\{x_1,\dots,x_m\}$ where $x_i \in \mathbb{R}^d$:
First, center the dataset around 0, so that each feature has mean 0.
Compute the covariance matrix $\Sigma = \frac{1}{m}\sum_{i=1}^{m} x_i x_i^T$ (on the centered data).
Compute the eigenvectors/eigenvalues of the covariance matrix (eigen decomposition)
$\Sigma = U\Lambda U^T$: $U$ is a matrix of eigenvectors, which are the principal component vectors; $\Lambda$ is a diagonal matrix of eigenvalues, where $\lambda_i$ is the eigenvalue associated with the $i$-th eigenvector.
Then rank the eigenvectors by their eigenvalues, highest to lowest: the eigenvector with the largest (smallest) eigenvalue captures the most (least) variation.
• By keeping only the top $k$ eigenvectors, we achieve dimensionality reduction
• The new dimensions are orthogonal, so the transformed features have 0 covariance
Reconstruction from the Compressed Representation
Choice of K
Keep the smallest $k$ whose eigenvalues retain most of the total variance (e.g., 95-99%).
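A minimal sketch of PCA via eigen-decomposition of the covariance matrix, including reconstruction from the compressed representation (NumPy only; k is chosen by the caller):

```python
# PCA: center, eigen-decompose the covariance, keep the top-k components,
# project, and reconstruct back in the original space.
import numpy as np

def pca(X, k):
    X_centered = X - X.mean(axis=0)                 # each feature now has mean 0
    cov = np.cov(X_centered, rowvar=False)          # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigen decomposition (symmetric)
    order = np.argsort(eigvals)[::-1]               # rank by eigenvalue, highest first
    components = eigvecs[:, order[:k]]              # top-k principal component vectors
    Z = X_centered @ components                     # compressed representation
    X_reconstructed = Z @ components.T + X.mean(axis=0)
    return Z, X_reconstructed, eigvals[order]
```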