Machine Learning
Machine Learning is the art of programming computers so they can learn from data.
In practice, ML is mostly done in Python, and tasks usually involve:
- Scikit-Learn, which has a large collection of canned algorithms (start here),
- TensorFlow for deep learning.
ML algorithms can be grouped into these categories:
The data may arrive 2 different ways.
- In Batch Learning, the system is trained using all available data.
- In Online Learning, the system is trained incrementally by feeding it data in mini-batches.
The algorithm may generalize in 2 different ways.
- In Instance-Based Learning, the system learns examples by rote, then generalizes to new cases using a similarity measure.
- In Model-Based Learning, a model is built from a set of examples, then the model is used to make predictions.
Model training can go wrong in several ways; see ML Challenges.
Feature engineering involves feature selection, feature extraction and feature creation.
To train the model, data is divided into ML Data Sets.
As a general rule, the project will follow a pretty standard ML Workflow.
A binary classifier distinguishes between 2 classes.
cross_val_score()
splits the dataset into K-folds, then evaluates predictions made on each using a model trained on the remaining folds.
cross_val_predict()
gets the actual predictions of the K-folds.
confusion_matrix()
creates a matrix of true and false predictions, with true values on the diagonal.
precision = TP / (TP + FP)
recall (a.k.a sensitivity, true positive rate) = TP / (TP + FN)
F₁ = TP / (TP + (FN + FP)/2)
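A minimal sketch of computing these with Scikit-Learn, assuming hypothetical X_train/y_train arrays and a binary classifier sgd_clf:
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
# accuracy on each of 3 folds
cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")
# out-of-fold predictions for every training instance
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train, cv=3)
confusion_matrix(y_train, y_train_pred)  # rows = actual class, columns = predicted class
precision_score(y_train, y_train_pred)   # TP / (TP + FP)
recall_score(y_train, y_train_pred)      # TP / (TP + FN)
f1_score(y_train, y_train_pred)          # harmonic mean of precision and recall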
precision_recall_curve()
computes precision and recall for all possible thresholds.
Another way to select a good precision/recall tradeoff is to plot precision vs. recall.
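A sketch of both plots, assuming the same hypothetical sgd_clf and using decision scores rather than class predictions:
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve
y_scores = cross_val_predict(sgd_clf, X_train, y_train, cv=3, method="decision_function")
precisions, recalls, thresholds = precision_recall_curve(y_train, y_scores)
# precision and recall vs. threshold (there is one more precision/recall value than thresholds)
plt.plot(thresholds, precisions[:-1], label="precision")
plt.plot(thresholds, recalls[:-1], label="recall")
plt.xlabel("threshold"); plt.legend(); plt.show()
# precision vs. recall
plt.plot(recalls, precisions)
plt.xlabel("recall"); plt.ylabel("precision"); plt.show()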
ROC = receiver operating characteristic
TPR = true positive rate (= recall)
TNR = true negative rate (= specificity)
FPR, FNR = false positive rate, false negative rate
An ROC curve plots sensitivity (recall) vs. 1 - specificity.
roc_curve()
computes the ROC curve.
A good ROC AUC (area under curve) has a value close to 1, whereas a random classifier has a value of 0.5.
roc_auc_score()
computes the ROC AUC.
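A sketch, reusing the hypothetical y_scores from above:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(y_train, y_scores)
plt.plot(fpr, tpr)               # ROC curve
plt.plot([0, 1], [0, 1], 'k--')  # diagonal = random classifier
plt.xlabel("false positive rate"); plt.ylabel("true positive rate (recall)"); plt.show()
roc_auc_score(y_train, y_scores) # close to 1 is good, 0.5 is random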
OvO = one vs. one
OvA = one vs. all, one vs. rest
Multilabel classification outputs multiple binary labels.
Multioutput classification outputs multiple multiclass labels.
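A sketch of multilabel classification with k-nearest neighbors, assuming MNIST-style integer targets in y_train; the two derived labels are just illustrative:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
y_large = (y_train >= 7)        # first binary label: "large digit"
y_odd = (y_train % 2 == 1)      # second binary label: "odd digit"
y_multilabel = np.c_[y_large, y_odd]
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)
knn_clf.predict(X_train[:1])    # one row with two binary labels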
Linear regression model prediction
$$\hat{y}=\theta_0+\theta_1x_1+\theta_2x_2+\cdots +\theta_nx_n$$
Vectorized form
$$\hat{y}=h_\theta(\textbf{x})=\theta^T\cdot \textbf{x}$$
Cost function of the linear regression model
$$MSE(\textbf{X},h_\theta)=\frac{1}{m}\sum_{i=1}^{m}\left(\theta^T\cdot \textbf{x}^{(i)}-y^{(i)}\right)^2$$
Normal equation
$$\hat{\theta}=\left(\textbf{X}^T\cdot\textbf{X}\right)^{-1}\cdot\textbf{X}^T\cdot\textbf{y}$$
import numpy as np
X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 to each of the 100 instances
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
MSE for a linear regression model is a convex function, meaning that a line segment connecting any 2 points on the curve never crosses the curve. This implies there are no local minima, just one global minimum.
Preprocess the data with Scikit-Learn’s StandardScaler.
Batch Gradient Descent
Partial derivatives of the cost function
$$\frac{\delta}{\delta\theta_j}MSE(\theta)=\frac{2}{m}\sum_{i=1}^{m}\left(\theta^T\cdot\textbf{x}^{(i)}-y^{(i)}\right)x_j^{(i)}$$
Gradient vector of the cost function
$$\nabla_\theta MSE(\theta)=\begin{pmatrix} \frac{\delta}{\delta\theta_0}MSE(\theta) \\ \frac{\delta}{\delta\theta_1}MSE(\theta) \\ \vdots \\ \frac{\delta}{\delta\theta_n}MSE(\theta) \end{pmatrix} =\frac{2}{m}\textbf{X}^T\cdot(\textbf{X}\cdot\theta-\textbf{y})$$
Gradient descent step
$$\theta^{(\text{next step})}=\theta-\eta\nabla_\theta MSE(\theta)$$
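A minimal NumPy sketch of batch gradient descent using these formulas, assuming X_b already has the bias column of 1s prepended (as in the normal-equation example above):
import numpy as np
eta = 0.1                                  # learning rate
n_iterations = 1000
m = len(X_b)
theta = np.random.randn(X_b.shape[1], 1)   # random initialization
for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)  # gradient vector of MSE
    theta = theta - eta * gradients                  # gradient descent step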
Stochastic Gradient Descent
from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(max_iter=50, penalty=None, eta0=0.1)  # 50 epochs of SGD
sgd_reg.fit(X, y.ravel())
Regularized Linear Models
- Ridge regression adds a regularization term to the cost function, forcing the learning algorithm to keep the weights as small as possible.
$$J(\theta)=MSE(\theta)+\alpha\frac{1}{2}\sum_{i=1}^{n}\theta_i^2$$ $$\hat{\theta}=\left(\textbf{X}^T\cdot\textbf{X}+\alpha\textbf{A}\right)^{-1}\cdot\textbf{X}^T\cdot \textbf{y}$$
ridge_reg = Ridge(alpha=1, solver="cholesky")
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])
- Lasso regression tends to completely eliminate the weights of the least important features.
$$J(\theta)=MSE(\theta)+\alpha\sum_{i=1}^{n}\left|\theta_i\right|$$
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
lasso_reg.predict([[1.5]])
- Elastic Net is a combination of the two.
$$J(\theta)=MSE(\theta)+r\alpha\sum_{i=1}^{n}\left|\theta_i\right|+\frac{1-r}{2}\alpha\sum_{i=1}^{n}\theta_i^2$$
from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
elastic_net.predict([[1.5]])
- Early stopping halts training as soon as the validation error reaches its minimum (see the sketch after this list).
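A minimal sketch of early stopping with SGDRegressor, assuming hypothetical X_train/y_train/X_val/y_val splits; warm_start=True makes each fit() call continue from the previous weights instead of restarting:
from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005)
minimum_val_error = float("inf")
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train, y_train.ravel())            # runs one more epoch
    val_error = mean_squared_error(y_val, sgd_reg.predict(X_val))
    if val_error < minimum_val_error:
        minimum_val_error = val_error
        best_model = deepcopy(sgd_reg)               # keep the best model seen so far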
Logistic regression
estimates the probability that an instance belongs to a particular class.
$$\hat{p}=h_\theta(\textbf{x})=\sigma\left(\theta^T\cdot\textbf{x}\right)$$
The logistic regression loss function is convex
$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\text{log}\left(\hat{p}^{(i)}\right)+\left(1-y^{(i)}\right)\text{log}\left(1-\hat{p}^{(i)}\right)\right]$$
and its derivative is
$$\frac{\delta}{\delta\theta_j}J(\theta)=\frac{1}{m}\sum_{i=1}^{m}\left(\sigma\left(\theta^T\cdot\textbf{x}^{(i)}\right)-y^{(i)}\right)x_j^{(i)}$$
To train a logistic regression model
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X, y)
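Usage sketch; the instance below is a hypothetical single-feature value (e.g. petal width):
log_reg.predict_proba([[1.7]])  # class probabilities for one instance
log_reg.predict([[1.7]])        # predicted class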
Softmax Regression / Multinomial Logistic Regression
Softmax score for class k
$$s_k(\textbf{x})=\left(\theta^{(k)}\right)^T\cdot\textbf{x}$$
Softmax function
$$\hat{p}_k=\sigma(\textbf{s}(\textbf{x}))_k=\frac{exp(s_k(\textbf{x}))}{\sum_{j=1}^{K}exp(s_j(\textbf{x}))}$$
Softmax prediction
$$\hat{y}=\underset{k}{\text{argmax }}\sigma(\textbf{s}(\textbf{x}))_k=\underset{k}{\text{argmax }}s_k(\textbf{x})=\underset{k}{\text{argmax }}\left(\left(\theta^{(k)}\right)^T\cdot\textbf{x}\right)$$
Cross entropy cost function
$$J(\Theta)=-\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}y_k^{(i)}\text{log}\left(\hat{p}_k^{(i)}\right)$$
Cross entropy gradient vector
$$\nabla_{\theta^{(k)}}J(\Theta)=\frac{1}{m}\sum_{i=1}^{m}\left(\hat{p}_k^{(i)}-y_k^{(i)}\right)\textbf{x}^{(i)}$$
softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10)
softmax_reg.fit(X, y)
softmax_reg.predict([[5, 2]])
softmax_reg.predict_proba([[5, 2]])
Support Vector Machines
Support vectors are data instances on the “fog lines” of the decision boundary “street”.
Soft margin classification allows some margin violations, controlled by the C parameter.
iris = datasets.load_iris()
X = iris["data"][:, (2, 3)] # petal length, petal width
y = (iris["target"] == 2).astype(np.float64) # Iris-Virginica
svm_clf = Pipeline([
("scaler", StandardScaler()),
("linear_svc", LinearSVC(C=1, loss="hinge", random_state=42)),
])
svm_clf.fit(X, y)
Nonlinear SVM Classification
polynomial_svm_clf = Pipeline([
("poly_features", PolynomialFeatures(degree=3)),
("scaler", StandardScaler()),
("svm_clf", LinearSVC(C=10, loss="hinge", random_state=42))
])
polynomial_svm_clf.fit(X, y)
The SVC class implements the kernel trick, which gets the effect of high-degree polynomial features without actually having to add them, so it stays efficient.
poly_kernel_svm_clf = Pipeline([
("scaler", StandardScaler()),
("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5))
])
poly_kernel_svm_clf.fit(X, y)
Gaussian radial basis function (RBF)
$$\phi_\gamma(\textbf{x},l)=\text{exp}\left(-\gamma\left\|\textbf{x}-l\right\|^2\right)$$
Increasing gamma
makes the bell-shape curve narrower.
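The similarity feature itself is simple to compute directly; a NumPy sketch with a hypothetical instance x and landmark l:
import numpy as np
def rbf(x, l, gamma):
    # Gaussian RBF: similarity decays with squared distance from the landmark
    return np.exp(-gamma * np.sum((x - l) ** 2))
rbf(np.array([-1.0]), np.array([1.0]), gamma=0.3)  # exp(-0.3 * 4) ≈ 0.30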
rbf_kernel_svm_clf = Pipeline([
("scaler", StandardScaler()),
("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001))
])
rbf_kernel_svm_clf.fit(X, y)
SVM Regression
SVMs also support linear and nonlinear regression.
svm_reg = LinearSVR(epsilon=1.5, random_state=42)
svm_reg.fit(X, y)
svm_poly_reg = SVR(kernel="poly", degree=2, C=100, epsilon=0.1, gamma="auto")
svm_poly_reg.fit(X, y)
Decision Trees
iris = load_iris()
X = iris.data[:, 2:] # petal length and width
y = iris.target
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)
A decision tree is a white box model that can be visualized.
from sklearn.tree import export_graphviz
export_graphviz(
tree_clf,
out_file=image_path("iris_tree.dot"),
feature_names=iris.feature_names[2:],
class_names=iris.target_names,
rounded=True,
filled=True
)
Decision tree prediction
tree_clf.predict_proba([[5, 1.5]])
tree_clf.predict([[5, 1.5]])
Regularization hyperparameters
max_depth
min_samples_split
min_samples_leaf
min_weight_fraction_leaf
max_leaf_nodes
max_features
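A sketch of applying one of these in practice (the training set and the chosen value are hypothetical):
from sklearn.tree import DecisionTreeClassifier
deep_tree_clf = DecisionTreeClassifier(random_state=42)                     # unrestricted, likely to overfit
reg_tree_clf = DecisionTreeClassifier(min_samples_leaf=4, random_state=42)  # regularized
deep_tree_clf.fit(X_train, y_train)
reg_tree_clf.fit(X_train, y_train)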
Regression Trees
tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X, y)
Weaknesses of decision trees
- Sensitive to orientation of training data. PCA can help.
- Sensitive to small variations of training data.
Ensemble Methods
An ensemble can be a strong learner even if each classifier is a weak learner, if there are enough of them and they are diverse enough.
A hard voting classifier predicts the class that gets the most votes.
A soft voting classifier makes predictions based on the averages of class probabilities.
# set voting='soft' for soft voting
log_clf = LogisticRegression(solver="liblinear", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)
svm_clf = SVC(gamma="auto", random_state=42)
voting_clf = VotingClassifier(
estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
voting='hard')
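A sketch of fitting the ensemble and comparing it with its members, assuming hypothetical X_train/y_train/X_test/y_test splits:
from sklearn.metrics import accuracy_score
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))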
Bagging (bootstrap aggregating) is training classifiers on different subsets of data with replacement.
Pasting is the same thing without replacement.
# set bootstrap=False for pasting
# set oob_score=True for out-of-bag evaluation
# set max_features and bootstrap_features to sample random subspaces
bag_clf = BaggingClassifier(
DecisionTreeClassifier(random_state=42), n_estimators=500,
max_samples=100, bootstrap=True, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
Random patches sample both training instances and features.
A random forest is an ensemble of decision trees.
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rnd_clf.fit(iris["data"], iris["target"])
The ExtraTreesClassifier (extra = extremely randomized) uses random thresholds for each feature and has higher bias & lower variance than the RandomForestClassifier.
The RandomForestClassifier has a feature_importances_ attribute, which is handy for selecting features.
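A sketch of inspecting them for the iris forest trained above:
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)  # importances sum to 1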
Boosting
Boosting (hypothesis boosting) combines weak learners into a strong learner by training learners sequentially.
AdaBoost (adaptive boosting) trains each new learner while giving more weight to the training instances that its predecessor underfitted.
ada_clf = AdaBoostClassifier(
DecisionTreeClassifier(max_depth=1), n_estimators=200,
algorithm="SAMME.R", learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)
Gradient boosting tries to fit the new predictor to the residual errors made by the previous predictor.
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0, random_state=42)
gbrt.fit(X, y)
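To make "fit to the residual errors" concrete, a sketch that builds the same kind of ensemble by hand from three regression trees (X_new stands for hypothetical new instances):
from sklearn.tree import DecisionTreeRegressor
tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X, y)
y2 = y - tree_reg1.predict(X)              # residuals of the first tree
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg2.fit(X, y2)
y3 = y2 - tree_reg2.predict(X)             # residuals of the second tree
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg3.fit(X, y3)
# the ensemble predicts by summing the trees' predictions
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))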
Stacking (stacked generalization)
Stacking trains a model to aggregate the ensemble's predictions into a final prediction, instead of using hard or soft voting.
The final predictor is called a blender.
The blender is typically trained on a hold-out data set.
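Scikit-Learn 0.22+ provides a StackingClassifier that trains the blender on out-of-fold predictions of the base estimators; a sketch, assuming hypothetical X_train/y_train:
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
stacking_clf = StackingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('rf', RandomForestClassifier(random_state=42)),
                ('svc', SVC(random_state=42))],
    final_estimator=LogisticRegression())  # the blender
stacking_clf.fit(X_train, y_train)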
Dimensionality Reduction
Principal Component Analysis (PCA) identifies the hyperplane that lies closest to the data, then projects the data onto it.
pca = PCA(n_components = 2)
X2D = pca.fit_transform(X)
The principal components can then be accessed via the components_ attribute. Also interesting is the explained_variance_ratio_ attribute.
To find the minimum components to preserve a given variance, use
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)
Reconstruct the original set with
X_recovered = pca.inverse_transform(X_reduced)
PCA loads the entire dataset into memory. For large datasets, use Incremental PCA.
n_batches = 100
inc_pca = IncrementalPCA(n_components=154)
for X_batch in np.array_split(X_train, n_batches):
inc_pca.partial_fit(X_batch)
X_reduced = inc_pca.transform(X_train)
When d is much smaller than n, randomized PCA can give much faster results.
rnd_pca = PCA(n_components=154, svd_solver="randomized", random_state=42)
X_reduced = rnd_pca.fit_transform(X_train)
Kernel PCA
rbf_pca = KernelPCA(n_components = 2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)
Other
- Locally Linear Embedding (LLE) is a manifold learning technique that looks at how each instance relates to its neighbors, then tries to preserve those relationships in lower dimensions (see the sketch after this list).
- Multidimensional Scaling (MDS) attempts to preserve the distances between instances
- Isomap creates a graph between instances, then reduces dimensionality while trying to preserve geodesic distances between instances
- t-Distributed Stochastic Neighbor Embedding (t-SNE) reduces dimensionality while keeping similar instances together and dissimilar instances apart. Used mostly for visualization.
- Linear discriminant analysis (LDA) is a classifier that learns the most discriminative axes between the classes.
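A sketch of LLE (first bullet above), assuming a hypothetical swiss-roll-style X; the n_neighbors value is illustrative:
from sklearn.manifold import LocallyLinearEmbedding
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_reduced = lle.fit_transform(X)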