Machine Learning Cheat Sheet
Basics In Machine Learning
Generalization Ability
A model’s ability to generalize to new data.
Overfitting vs. Underfitting
Overfitting: the model fits noise in the training data and generalizes poorly; underfitting: the model is too simple to capture the underlying pattern.
Loss Function
Typical choices are:
• Squared error loss: $(y - f(x))^2$
• 0-1 loss: $\mathbb{1}(y \ne f(x))$
Generalization Error
Generalization error: the expected loss on new data drawn from the underlying distribution, $E_{\text{gen}}(f) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[L(y, f(x))]$.
Test/Training error: the average loss over the test/training set, $\frac{1}{m}\sum_{i=1}^{m} L(y_i, f(x_i))$.
Learning Objective
Minimize the training (empirical) error, $\min_f \frac{1}{m}\sum_{i=1}^{m} L(y_i, f(x_i))$. In the meantime, watch out for overfitting!
Model Evaluation
Accuracy, Precision and Recall
$\mathrm{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}$, $\mathrm{Precision} = \frac{TP}{TP+FP}$, $\mathrm{Recall} = \frac{TP}{TP+FN}$.
Bias-variance decomposition
$\mathbb{E}[(y - \hat f(x))^2] = \mathrm{Bias}[\hat f(x)]^2 + \mathrm{Var}[\hat f(x)] + \sigma^2$
The interpretation of this equation: the expected error between the prediction and the true value of a data point decomposes into bias, variance and noise, the three terms on the RHS. Bias is the expected deviation from the true value; variance measures how much the trained model differs from the expected model. Increasing the amount of data is the only cure.
Regularization
Adding a regularization term weighted by the coefficient $\lambda$.
L2 regularization (Ridge): $J(w) = \sum_i (y_i - w^T x_i)^2 + \lambda \|w\|_2^2$
L1 regularization (Lasso): $J(w) = \sum_i (y_i - w^T x_i)^2 + \lambda \|w\|_1$
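As a minimal sketch of how the two penalties behave (assuming scikit-learn; the alpha values and the toy data are illustrative, with alpha playing the role of $\lambda$):

```python
# Minimal sketch of L2/L1-regularized linear regression (assumes scikit-learn).
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # alpha is the regularization coefficient (lambda)
lasso = Lasso(alpha=0.1).fit(X, y)

print(ridge.coef_)  # L2 shrinks coefficients smoothly
print(lasso.coef_)  # L1 drives some coefficients exactly to zero
```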
Validation Strategy
Leave-one-out cross validation
Linear Models
Simple Linear Regression
Multivariate Linear Regression
Logistic Regression
Logistic Function (sigmoid): $\sigma(z) = \frac{1}{1+e^{-z}}$
Binary classification using the logistic function: $P(y=1\mid x) = \sigma(w^T x) = \frac{1}{1+e^{-w^T x}}$; predict $y=1$ when this probability exceeds 0.5.
Training the logistic regression
Use MLE to derive the likelihood function: $L(w) = \prod_{i=1}^{m} \sigma(w^T x_i)^{y_i}\,(1-\sigma(w^T x_i))^{1-y_i}$, and minimize the negative log-likelihood $J(w) = -\sum_{i}\big[y_i\log\sigma(w^T x_i) + (1-y_i)\log(1-\sigma(w^T x_i))\big]$.
$J(w)$ is a convex function, so we can apply the gradient descent method.
Gradient descent method
Batch gradient descent: $w \leftarrow w - \eta \sum_{i=1}^{m} (\sigma(w^T x_i) - y_i)\,x_i$ (uses all samples for each update).
Stochastic gradient descent: $w \leftarrow w - \eta\,(\sigma(w^T x_i) - y_i)\,x_i$ (one randomly chosen sample per update).
Choice of learning rate
Plot the loss against the number of iterations to diagnose the learning rate: too large and the loss diverges or oscillates, too small and convergence is slow.
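A minimal sketch contrasting the two update schemes for logistic regression (NumPy only; the learning rate eta and the iteration counts are illustrative choices):

```python
# Batch vs. stochastic gradient descent for logistic regression (NumPy only).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gd(X, y, eta=0.1, n_iters=500):
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y)   # gradient over the whole batch
        w -= eta * grad / len(y)
    return w

def sgd(X, y, eta=0.1, n_epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):   # one sample per update
            w -= eta * (sigmoid(X[i] @ w) - y[i]) * X[i]
    return w
```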
TPR, FPR and ROC
$\mathrm{TPR} = \frac{TP}{TP+FN}$, $\mathrm{FPR} = \frac{FP}{FP+TN}$; the ROC curve plots TPR against FPR as the decision threshold varies.
Regularization of logistic regression
Support Vector Machine
Margin
Maximum margin classifier
The Lagrange method
Soft Margin SVM
Soft margin
$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_i \xi_i$, subject to: $y_i(w^T x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$ for all $i$.
Loss function of SVM
SVM hinge loss: $\ell(y, f(x)) = \max(0,\ 1 - y\,f(x))$, so training minimizes $\sum_i \max(0,\ 1 - y_i(w^T x_i + b)) + \lambda\|w\|^2$.
Kernel Function
Replace the inner product $x_i^T x_j$ by $\kappa(x_i, x_j) = \phi(x_i)^T\phi(x_j)$, which implicitly maps the data into a higher-dimensional feature space.
Kernel functions: linear $x_i^T x_j$; polynomial $(x_i^T x_j + c)^d$; Gaussian (RBF) $\exp\!\big(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\big)$.
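A minimal sketch of a soft-margin SVM with an RBF kernel (assuming scikit-learn; the dataset, C and gamma are illustrative, not tuned):

```python
# Soft-margin SVM with an RBF kernel on a non-linearly-separable toy dataset.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma=1.0)  # C controls the softness of the margin
clf.fit(X, y)
print(clf.score(X, y))
```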
Neural Networks
Perceptron Model
A single-layer perceptron with a sign/step activation function: $\hat y = \mathrm{sign}(w^T x + b)$.
Updating rules: for each misclassified sample, $w \leftarrow w + \eta\, y_i x_i$ and $b \leftarrow b + \eta\, y_i$.
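A minimal sketch of the perceptron update rule above (NumPy only; labels are assumed to be +1/-1, and eta and the epoch count are illustrative):

```python
# Perceptron training loop: update only on misclassified points.
import numpy as np

def perceptron(X, y, eta=1.0, n_epochs=20):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified point
                w += eta * yi * xi
                b += eta * yi
    return w, b
```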
Not effective for data that is not linearly separable. Solution: hidden layers.
Multi-layer Neural Networks
• Classification or regression
• One or more hidden layers
• Not necessarily a linear relationship
The squared error loss for one training example: $E_i = \frac{1}{2}\sum_j (\hat y_j^{(i)} - y_j^{(i)})^2$.
For an accumulated dataset of $m$ examples: $E = \frac{1}{m}\sum_{i=1}^{m} E_i$.
Updating rules: $w \leftarrow w - \eta\,\frac{\partial E}{\partial w}$, $b \leftarrow b - \eta\,\frac{\partial E}{\partial b}$.
For a weight $w_{hj}$ connecting hidden node $h$ to output node $j$:
According to the chain rule, $\frac{\partial E}{\partial w_{hj}} = \frac{\partial E}{\partial \hat y_j}\cdot\frac{\partial \hat y_j}{\partial z_j}\cdot\frac{\partial z_j}{\partial w_{hj}}$, where $z_j$ is the weighted input to output node $j$.
For a weight in an earlier (hidden) layer:
By the chain rule, the same product extends backwards through every node that the weight influences, which is why the error is propagated layer by layer from the output back towards the input.
$\delta_j = \frac{\partial E}{\partial z_j}$ can be interpreted as the error at node $j$.
The Back-Propagation algorithm
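A minimal sketch of one back-propagation step for a one-hidden-layer network with sigmoid activations and squared-error loss (NumPy only; the weight shapes and eta are illustrative assumptions):

```python
# One step of back-propagation: forward pass, per-layer errors (deltas), updates.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, W2, eta=0.1):
    # forward pass
    h = sigmoid(W1 @ x)          # hidden activations
    y_hat = sigmoid(W2 @ h)      # output activations
    # backward pass: delta = dE/dz at each layer
    delta_out = (y_hat - y) * y_hat * (1 - y_hat)   # error at the output nodes
    delta_hid = (W2.T @ delta_out) * h * (1 - h)    # error propagated to hidden nodes
    # gradient-descent updates
    W2 -= eta * np.outer(delta_out, h)
    W1 -= eta * np.outer(delta_hid, x)
    return W1, W2
```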
Activation functions
• Sigmoid: differentiable, not zero-centered, suffers from vanishing gradients
• Tanh: differentiable, zero-centered, suffers from vanishing gradients
• ReLU: no vanishing gradients in the positive region, computationally efficient, but kills gradients when the input is negative
Issues of neural networks
Get stuck at local minima.
Prone to overfitting.
Early stopping
Dropout regularization.
Deep Learning
Convolution Layer
The filter (kernel) weights are what is being trained, just like the weights in a regular neural network.
Advantage of convolution
Local connectivity and weight sharing greatly reduce the number of parameters compared with a fully connected layer.
Pooling Layer
Decreases the feature map size while keeping the important information.
Max pooling (often better than average pooling)
Stride
The step width of the filter as it scans across the input.
Padding
A method to control the shape of the feature map after convolution, e.g., when the input size is not an integral multiple of the stride, or when the output should keep the same spatial size as the input ("same" padding).
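A small helper (hypothetical, not from any library) that applies the standard output-size formula $O = \frac{W - F + 2P}{S} + 1$ to show how filter size, padding and stride interact:

```python
# Spatial output size of a convolution/pooling layer: (W - F + 2P) / S + 1.
def conv_output_size(W, F, P=0, S=1):
    return (W - F + 2 * P) // S + 1

print(conv_output_size(W=32, F=5, P=2, S=1))  # 32 -> "same" padding keeps the size
print(conv_output_size(W=32, F=2, P=0, S=2))  # 16 -> 2x2 max pooling with stride 2
```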
Hyperparameters in CNNs
• Filter size F (but the filter weights themselves are learned parameters!)
• The amount of zero-padding P
Flatten
The flattened feature map is then connected to fully connected layers.
A fully connected layer is the same as in a regular neural network. The last FC layer is called the output layer and uses an activation function such as Softmax.
Softmax
Used in multiclass classification: assigns a probability to each class, $\mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_k e^{z_k}}$. The logistic function is the special case with two classes.
Loss function in Logistic Regression & Neural Networks with Softmax
To interpret the loss function, $\prod_i \hat p_{i,y_i}$ is the cumulative likelihood of the observed labels. Our goal is to maximize it, and consequently minimize its negative log, $-\sum_i \log \hat p_{i,y_i}$ -- the loss function (cross-entropy)!
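A minimal sketch of softmax plus the cross-entropy loss (NumPy only; the logits and labels are made up for illustration):

```python
# Softmax over class scores, then the negative log-likelihood of the true labels.
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, y):
    # negative log-likelihood of the true class labels y (integer-encoded)
    return -np.log(probs[np.arange(len(y)), y]).mean()

logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
y = np.array([0, 2])
print(cross_entropy(softmax(logits), y))
```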
Decision Tree
Segments the attribute space into a number of simple regions and uses the label of each region for prediction. The region label can be, e.g., the average (regression) or the majority class (classification).
Separates the space by layers of "if" conditions.
• Can handle mixed (numerical and categorical) variables
• Implicitly performs feature selection
Entropy
$H(X) = -\sum_{k=1}^{n} p_k \log_2 p_k$, with $n$ the number of possible values.
For a decision-tree split, if the entropy after the split is still large, the split makes very little difference!
Information Gain
$\mathrm{Gain}(S, a) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)$, with $S$ the dataset and $a$ the attribute.
Information Gain Ratio (Normalized Information Gain)
The larger the number of possible values of an attribute, the larger its information gain tends to be.
The gain ratio hedges against this "more splits, more information gain" bias by dividing the gain by the attribute's intrinsic value: $\mathrm{GainRatio}(S,a) = \frac{\mathrm{Gain}(S,a)}{IV(a)}$, where $IV(a) = -\sum_v \frac{|S_v|}{|S|}\log_2\frac{|S_v|}{|S|}$.
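A minimal sketch of entropy and information gain for one categorical attribute (NumPy only; the toy labels and attribute values are made up for illustration):

```python
# Entropy of a label vector and the information gain of splitting on one attribute.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(labels, attribute_values):
    gain = entropy(labels)
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

y = np.array(["yes", "yes", "no", "no", "yes", "no"])
a = np.array(["sunny", "sunny", "rain", "rain", "overcast", "sunny"])
print(information_gain(y, a))
```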
Gini Gain
The reduction in the Gini index $\mathrm{Gini}(S) = 1 - \sum_k p_k^2$ achieved by a split (used by CART).
ID3 Algorithm
• Start from the root node with all data
• Calculate the information gain for every attribute
• Select the one with the highest information gain
• Split the data of the node with respect to this attribute
Continuous Attribute in ID3
Choose the split threshold that maximizes the information gain.
Overfitting
Constraints
• Minimum leaf size: stop splitting S if the number of samples falls below a fixed threshold
• Maximum depth: stop splitting S if the number of splits reaches the threshold
• Maximum number of nodes: stop if the number of nodes in the tree reaches the threshold
Pruning
Generalization ability may increase after some "pruning"
• Pre-pruning: stop growing a branch if the split makes little difference
• Post-pruning: start from the bottom of the tree and examine each non-leaf subtree; replace the subtree with a leaf if doing so makes little difference
Regression Tree
Split the attribute space into regions $R_1,\dots,R_M$ and predict $f(x) = \sum_m \hat c_m\, \mathbb{1}(x\in R_m)$.
$\hat c_m$ is determined by the average of the $y_i$ that fall in $R_m$.
Greedy recursion: choose the splitting attribute $j$ and threshold $t$ that minimize the loss function at that level.
$$R_1(j,t)=\{x \mid x_j < t\} \quad\text{and}\quad R_2(j,t)=\{x \mid x_j \ge t\}$$
$$\min_{j,t}\Big[\sum_{i:\,x_i\in R_1(j,t)}(y_i-\hat c_1)^2+\sum_{i:\,x_i\in R_2(j,t)}(y_i-\hat c_2)^2\Big]$$
Pruning the regression tree
Use cost-complexity pruning: $C_\alpha(T) = \sum_{m}\sum_{x_i\in R_m}(y_i-\hat c_m)^2 + \alpha|T|$.
How to choose $\alpha$? Cross validation!
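A minimal sketch of the greedy split search for a regression tree, assuming an exhaustive scan over observed values of each attribute (NumPy only):

```python
# Find the attribute j and threshold t that minimize the two-region squared error.
import numpy as np

def best_split(X, y):
    best = (None, None, np.inf)            # (attribute j, threshold t, loss)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] < t], y[X[:, j] >= t]
            if len(left) == 0 or len(right) == 0:
                continue
            loss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if loss < best[2]:
                best = (j, t, loss)
    return best
```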
Ensemble Learning
Collective wisdom is greater than the smartest in the crowd!
Bagging
Train several different models separately, then have all of the models vote on the output for each test example.
• Create random subsets of the training data using bootstrap sampling
• Train a base learner on each bootstrap sample separately
• Combine the outputs of all base learners
  • Majority voting for classification
  • Averaging for regression
Recall that for $n$ i.i.d. estimators with variance $\sigma^2$, the variance of their average is $\sigma^2/n$; with pairwise correlation $\rho$ it is $\rho\sigma^2 + \frac{1-\rho}{n}\sigma^2$.
• It works well for low-bias, high-variance learners (e.g., decision trees)
• It may not benefit high-bias, low-variance learners (e.g., simple linear models) much
Out-of-bag estimation
Each bootstrap sample leaves out about 37% of the data ($(1-\frac{1}{n})^n \approx e^{-1}$), which can be used as a built-in validation set.
Random Forests
The problem of bagging: the models trained from bootstrap samples are probably positively correlated
To reduce correlation: random feature selection!
• Each tree is learned from a bootstrap sample
• To grow a random-forest tree $h_t$, repeat the following steps:
  • Randomly select $m$ variables from the $p$ ($> m$) variables
  • Find the best split based on the $m$ variables and split the node
Practically, $m \approx \sqrt{p}$ may be a good choice!
Random forests = Decision tree + Bagging + Random feature selection
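A minimal sketch using scikit-learn's RandomForestClassifier (the parameter values are illustrative; max_features="sqrt" corresponds to the $m \approx \sqrt{p}$ heuristic above, and oob_score=True enables out-of-bag estimation):

```python
# Random forest = bagged decision trees + random feature selection at each split.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(
    n_estimators=100,      # number of bootstrap-trained trees
    max_features="sqrt",   # random feature selection at each split
    oob_score=True,        # out-of-bag estimate of generalization accuracy
    random_state=0,
).fit(X, y)
print(rf.oob_score_)
```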
Boosting
Problem: how can later learners take a lesson from the mistakes of earlier ones? And why not rely more on the better-performing learners?
AdaBoost
Weight Update Method
To see the mechanism of the weight update, $D_{t+1}(i) = \frac{D_t(i)\exp(-\alpha_t\, y_i\, h_t(x_i))}{Z_t}$ with $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$:
• If wrong ($h_t(x_i) \ne y_i$): the weight is multiplied by $e^{\alpha_t} > 1$.
• If right ($h_t(x_i) = y_i$): the weight is multiplied by $e^{-\alpha_t} < 1$.
For $\epsilon_t < \frac{1}{2}$ we have $\alpha_t > 0$, so misclassified samples receive more attention in the next round.
The update in another form: $D_{t+1}(i) = \frac{D_t(i)}{2\epsilon_t}$ if wrong, and $D_{t+1}(i) = \frac{D_t(i)}{2(1-\epsilon_t)}$ if right.
After the adjustment, the total weight of the correctly classified and of the misclassified samples are each still $\frac{1}{2}$.
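A minimal sketch of the AdaBoost weight update with decision stumps (NumPy plus scikit-learn's DecisionTreeClassifier as the weak learner; labels are assumed to be +1/-1 and the number of rounds is illustrative):

```python
# AdaBoost training loop: fit a stump on the weighted data, then reweight samples.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=20):
    """y must contain labels +1 / -1."""
    n = len(y)
    D = np.full(n, 1.0 / n)                 # start with uniform weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = stump.predict(X)
        eps = np.clip(D[pred != y].sum(), 1e-10, 1 - 1e-10)  # weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)
        D *= np.exp(-alpha * y * pred)       # up-weight wrong, down-weight right
        D /= D.sum()                         # renormalize (the Z_t term)
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas
```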
Naïve Bayes
Loss Function
Formulation of Naïve Bayes Classifier
$h(x) = \arg\max_{c} P(c)\prod_{j} P(x_j \mid c)$
Naïve assumption: attributes are conditionally independent given the class.
MLE Formulation to Estimate
Draw the components of the classifier ($P(c)$ and $P(x_j \mid c)$) from the sample data!
Laplacian Correction
In case some component has a count of 0, causing a too extreme (overfitted) estimate. The added constant could be something other than 1.
Continuous Features
Assume the feature follows a Gaussian distribution.
Bayesian Network
Conditional Independence
The chain-rule factorization $P(X_1,\dots,X_n) = \prod_i P(X_i \mid X_1,\dots,X_{i-1})$ gets too complicated as $n$ gets bigger.
Conditional independence suggests that given $Z$, $X$ and $Y$ are independent ($X \perp Y \mid Z$).
One example: $P(X_n \mid X_1,\dots,X_{n-1}) = P(X_n \mid X_{n-1})$, which means only the last attribute matters for predicting the focal attribute.
Graphical Terminology
• A graph G(V, E) consists of a set of nodes V and a set of edges E.
• A path is a series of edges that leads from one node to another.
• A cycle is a series of nodes such that we can get back to where we started.
For the directed graph
• Parents of a node: the set of all nodes that feed into it
• Ancestors: a node's parents, their parents, and so on
• Directed acyclic graph (DAG): a directed graph with no directed cycles
Bayesian Network Representation
Conditional distributions for each node given its parents.
Bayesian network = Graph + Local Conditional Probabilities
A Bayesian network assumes that a node is independent of its non-descendants given its parent nodes.
D-separation
$X$ is d-separated from $Y$ by $Z$ if there is no active path between $X$ and $Y$ when $Z$ is observed.
$X$ and $Y$ are independent given $Z$ if $X$ is d-separated from $Y$ given $Z$.
Clustering
K-Means
Minimize the SSE! $\mathrm{SSE} = \sum_{k=1}^{K}\sum_{x_i\in C_k}\|x_i - \mu_k\|^2$.
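A minimal sketch of Lloyd's algorithm for K-Means (NumPy only; K, the number of iterations and the plain random initialization are illustrative, and the sketch assumes no cluster becomes empty):

```python
# K-Means: alternate between assigning points to the nearest centroid and
# recomputing each centroid as the mean of its assigned points.
import numpy as np

def kmeans(X, k=3, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # assignment step: nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centroid becomes the mean of its assigned points
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids
```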
Choice of K
Silhouette coefficient: $s(i) = \frac{b(i)-a(i)}{\max(a(i),\,b(i))}$, where $a(i)$ is the mean distance to points in the same cluster and $b(i)$ is the mean distance to points in the nearest other cluster. If b(i) is large and a(i) is small, then s(i) -> 1; otherwise s(i) -> -1.
How to Initialize Wisely
Pick initial centroids that are more likely to lie in different clusters (e.g., prefer points far away from the centroids already chosen).
Hierarchical Clustering
• Agglomerative: merge clusters successively ("bottom-up")
• Divisive: divide clusters successively ("top-down")
Agglomerative Clustering
The idea of Agglomerative Clustering is to merge similar clusters and incrementally build larger clusters out of smaller clusters.
How to define CLOSE between two clusters:
• Single linkage: minimum distance between points in the two clusters
• Complete linkage: maximum distance between points in the two clusters
• Average linkage: average distance over all pairs of points across the two clusters
Gaussian Mixture Model
Mixture of Gaussians
Dimensionality Reduction
Principal Component Analysis
• The variance of the projected data is maximized: pick a line along which the data is spread out the most
• Equivalently, the projection error is minimized: the linear projection should minimize the average projection cost, computed as the mean squared distance between the data points and their projections
PCA Algorithm
For a dataset $\{x_1,\dots,x_m\}$ where $x_i \in \mathbb{R}^d$:
First, center the dataset around 0, so that each feature has mean 0.
Compute the covariance matrix $\Sigma = \frac{1}{m}\sum_{i=1}^{m} x_i x_i^T$ (on the centered data).
Compute the eigenvectors/eigenvalues of the covariance matrix (eigen decomposition)
$\Sigma = U\Lambda U^T$: $U$ is a matrix of eigenvectors, which are the principal component vectors; $\Lambda$ is a diagonal matrix of eigenvalues, where $\lambda_i$ is the eigenvalue associated with the $i$-th eigenvector.
Then rank the eigenvectors by their eigenvalues, highest to lowest: the eigenvector with the largest (smallest) eigenvalue captures the most (least) variation.
• By keeping only the top $k$ eigenvectors, we achieve dimensionality reduction
• The new dimensions are orthogonal, so the transformed features have 0 covariance
Reconstruction from the Compressed Representation
Choice of K
Keep the smallest $k$ whose eigenvalues retain most of the total variance (e.g., 95-99%).
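A minimal sketch of PCA via eigen-decomposition of the covariance matrix, including reconstruction from the compressed representation (NumPy only; k is chosen by the caller):

```python
# PCA: center, eigen-decompose the covariance, keep the top-k components,
# project, and reconstruct back in the original space.
import numpy as np

def pca(X, k):
    X_centered = X - X.mean(axis=0)                 # each feature now has mean 0
    cov = np.cov(X_centered, rowvar=False)          # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigen decomposition (symmetric)
    order = np.argsort(eigvals)[::-1]               # rank by eigenvalue, highest first
    components = eigvecs[:, order[:k]]              # top-k principal component vectors
    Z = X_centered @ components                     # compressed representation
    X_reconstructed = Z @ components.T + X.mean(axis=0)
    return Z, X_reconstructed, eigvals[order]
```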