Machine Learning

Links:

  • Sample code

Data sets:

  • UC Irvine Machine Learning Repository
  • Kaggle
  • Open Data on AWS
  • http://dataportals.org/
  • OPENDATAMONITOR
  • Quandl
  • Wikipedia's list of datasets
  • Quora
  • Datasets' Discord server

In supervised learning, the training data is labelled.

Typical supervised learning tasks include classification and predicting a target numeric value (regression).

Some of the most important supervised learning algorithms include the following (a short example follows the list):

  • k-nearest neighbors
  • linear regression
  • logistic regression
  • support vector machines
  • decision trees and random forests
  • neural networks
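
A minimal sketch of the supervised workflow, assuming scikit-learn is available (the iris dataset, k-nearest neighbors, and k = 5 are arbitrary choices for illustration):

```python
# Supervised learning: fit a classifier to labelled examples, score on held-out data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                # features and labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = KNeighborsClassifier(n_neighbors=5)        # k is a tunable hyperparameter
clf.fit(X_train, y_train)                        # learn from labelled examples
print(clf.score(X_test, y_test))                 # accuracy on unseen cases
```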

In unsupervised learning, the training data is unlabelled.

Typical unsupervised learning tasks include:

  • clustering
  • visualization
  • dimensionality reduction / feature extraction
  • anomaly detection
  • association rule learning

Some of the most important unsupervised learning algorithms include the following (a short example follows the list):

  • clustering
    – k-means
    – hierarchical cluster analysis (HCA)
    – expectation maximization
  • visualization and dimensionality reduction
    – principal component analysis (PCA)
    – kernel PCA
    – locally linear embedding (LLE)
    – t-distributed stochastic neighbor embedding (t-SNE)
  • association rule learning
    – Apriori
    – Eclat
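
A minimal sketch of unsupervised learning, assuming scikit-learn (the synthetic blobs and k = 3 are arbitrary choices for illustration):

```python
# Unsupervised learning: k-means groups unlabelled data into clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # labels discarded

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)                   # cluster index per sample
print(kmeans.cluster_centers_)                   # learned centroids
```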

In semisupervised learning, some of the training data is labelled.

In reinforcement learning, an agent observes the environment, performs actions, and receives rewards (or penalties) in return; over time it learns a policy that maximizes reward.
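
A toy sketch of that loop, tabular Q-learning on a made-up two-state environment (the environment, rewards, and hyperparameters are all invented for illustration):

```python
import random

# Toy reinforcement learning: tabular Q-learning on a hypothetical 2-state world.
n_states, n_actions = 2, 2
Q = [[0.0] * n_actions for _ in range(n_states)]  # action-value table
alpha, gamma, epsilon = 0.1, 0.9, 0.1             # learning rate, discount, exploration

def step(state, action):
    """Invented environment: action 1 in state 0 yields a reward."""
    reward = 1.0 if (state == 0 and action == 1) else 0.0
    return (state + 1) % n_states, reward         # deterministic transition

state = 0
for _ in range(1000):
    # epsilon-greedy: usually exploit the best known action, sometimes explore
    if random.random() < epsilon:
        action = random.randrange(n_actions)
    else:
        action = max(range(n_actions), key=lambda a: Q[state][a])
    next_state, reward = step(state, action)
    # Move Q toward the observed reward plus the discounted best future value.
    Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
    state = next_state

print(Q)  # the agent learns that action 1 pays off in state 0
```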

In batch learning, the system is trained on all available data at once, typically offline; to learn from new data it must be retrained from scratch.

In online learning, the system is trained incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches.
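
A minimal online-learning sketch, assuming scikit-learn (the synthetic data and mini-batch size of 100 are arbitrary):

```python
# Online learning: SGDRegressor updated incrementally with partial_fit.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(42)
X = rng.randn(1000, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(1000)

model = SGDRegressor(random_state=42)
for start in range(0, len(X), 100):              # feed mini-batches of 100
    sl = slice(start, start + 100)
    model.partial_fit(X[sl], y[sl])              # incremental update, no full retrain

print(model.coef_)                               # approaches [1.0, -2.0, 0.5]
```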

In instance-based learning, the system learns examples by heart, then generalizes to new cases using a similarity measure.

In model-based learning, a model is built from a set of examples, then the model is used to make predictions.
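
A sketch contrasting the two approaches on the same tiny dataset, assuming scikit-learn (the data values are made up):

```python
# Instance-based vs model-based learning on the same 1-D data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 1.9, 3.1, 4.0, 5.1])

knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)  # memorizes the examples
lin = LinearRegression().fit(X, y)                  # fits slope and intercept

X_new = [[2.5]]
print(knn.predict(X_new))   # instance-based: average of the 2 nearest targets
print(lin.predict(X_new))   # model-based: slope * 2.5 + intercept
```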

Small data sets often suffer from sampling noise, and even very large data sets can suffer from sampling bias if the sampling method is flawed.

Feature engineering involves feature selection, feature extraction and feature creation.

To fix overfitting, you can:

  • simplify the model
  • gather more training data
  • fix data errors and remove outliers

Regularization constrains a model to make it simpler and reduce the risk of overfitting.
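
A minimal regularization sketch, assuming scikit-learn (the synthetic data and alpha = 10 are arbitrary; Ridge is one of several regularized linear models):

```python
# Regularization: Ridge shrinks a linear model's weights toward zero.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(42)
X = rng.randn(20, 10)                        # few samples, many features
y = X[:, 0] + 0.1 * rng.randn(20)            # only the first feature matters

lin = LinearRegression().fit(X, y)           # unconstrained: free to overfit
ridge = Ridge(alpha=10.0).fit(X, y)          # larger alpha => stronger constraint

print(np.abs(lin.coef_).sum())               # larger total weight
print(np.abs(ridge.coef_).sum())             # weights shrunk toward zero
```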

Underfitting occurs when your model is too simple to learn the underlying structure of the data.

To fix underfitting, you can:

  • select a more powerful model
  • perform feature engineering
  • reduce regularization

To see how well a model will generalize to new cases, the data is split into a training set and a test set.

It is common to use 80% of the data for training and hold out 20% for testing.
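
A minimal split sketch, assuming scikit-learn (the placeholder arrays are arbitrary):

```python
# Hold out 20% of the data for testing; fix the seed for reproducibility.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)            # placeholder features
y = np.arange(50)                            # placeholder targets

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))             # 40 10
```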

In k-fold cross-validation, the training set is split into k complementary subsets; each model is trained on k − 1 of them and validated on the remaining one, rotating so that every subset serves as the validation set once.
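
A minimal cross-validation sketch, assuming scikit-learn (the dataset, model, and k = 5 folds are arbitrary choices):

```python
# 5-fold cross-validation: five scores, one per held-out fold.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)
print(scores.mean(), scores.std())           # average skill and its variability
```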

The No Free Lunch theorem states that no model is a priori guaranteed to work best on a given dataset; the only way to know for sure is to evaluate them all. In practice, you make reasonable assumptions about the data and evaluate only a few models.

  1. Look at the big picture.
     • Frame the problem:
       – How will the model benefit the company?
       – What current solutions exist, if any?
       – Supervised, unsupervised, or reinforcement learning?
       – Classification, regression, or something else?
       – Batch learning or online learning?
     • Select a performance measure (RMSE, the ℓ2 norm; MAE, the ℓ1 norm; or something else?).
     • Check any assumptions.
  2. Get the data.
     • Familiarize yourself with it (pandas).
     • Plot histograms (matplotlib).
     • Create a test set.
  3. Discover and visualize the data to gain insights.
     • Scatter plots (pandas).
     • Correlation plots.
  4. Prepare the data (see the pipeline sketch after this checklist).
     • Clean the data:
       – delete rows with nulls (dropna)
       – delete attributes with nulls (drop)
       – replace nulls with a default value (fillna or an imputer)
     • Convert text and categorical attributes to numbers (factorize() or one-hot encoding).
     • Scale features (min-max scaling or standardization).
  5. Select a model and train it.
     • Cross-validation (cross_val_score).
  6. Fine-tune your model.
     • GridSearchCV
     • RandomizedSearchCV
  7. Present your solution.
  8. Launch, monitor, and maintain your system.
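
A condensed sketch of steps 4–6 as a single scikit-learn pipeline (the toy DataFrame, column names, model choice, and grid values are all invented for illustration):

```python
# Steps 4-6 condensed: impute, scale, train, and fine-tune in one pipeline.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric dataset with some missing values.
df = pd.DataFrame({"size":  [50, 60, np.nan, 80, 90, 100],
                   "rooms": [2, 2, 3, np.nan, 4, 4],
                   "price": [100, 120, 140, 160, 180, 200]})
X, y = df[["size", "rooms"]], df["price"]

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # replace nulls (step 4)
    ("scale",  StandardScaler()),                   # standardization (step 4)
    ("model",  RandomForestRegressor(random_state=42)),
])

# Step 6: grid search with cross-validation (cv=2 only because the data is tiny).
grid = GridSearchCV(pipeline,
                    {"model__n_estimators": [10, 50]},
                    cv=2, scoring="neg_root_mean_squared_error")
grid.fit(X, y)
print(grid.best_params_, -grid.best_score_)         # best setting and its RMSE
```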