Data Science

Classification -> categorical outcomes
Regression -> numerical outcomes

  • Supervised learning
    • Classification
    • Regression
  • Unsupervised learning
    • Clustering
    • Dimensionality reduction
  • Reinforcement learning

Absolute Trick (to move a line towards a point): given y = w₁x + w₂, a point (p, q), and learning rate α:

y = (w₁ + pα)x + (w₂ + α)

(add when the point is above the line, subtract when it is below)

Square Trick: y = (w₁ + p(q - q′)α)x + (w₂ + (q - q′)α), where q′ = w₁p + w₂ is the line's prediction at p.
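
A minimal sketch of one square-trick update (the function name and toy values are illustrative):

# Square trick: nudge y = w1*x + w2 towards the point (p, q)
def square_trick(w1, w2, p, q, alpha=0.01):
    q_pred = w1 * p + w2            # q', the line's prediction at x = p
    w1 += p * (q - q_pred) * alpha
    w2 += (q - q_pred) * alpha
    return w1, w2

# e.g. square_trick(2, 3, p=5, q=15) nudges the line towards (5, 15)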

Mean Absolute Error:

$$Error = \frac{1}{m}\sum_{i=1}^m{|y - \hat y|}$$

Mean Squared Error:

$$Error = \frac{1}{2m}\sum_{i=1}^m{(y - \hat y)^2}$$

# Example: calculate MSE explicitly to update line (don't actually use this)
import numpy as np

def MSEStep(X, y, W, b, learn_rate=0.005):
    """One gradient descent step on W and b using the squared error."""
    y_pred = np.matmul(X, W) + b                  # predictions for every row of X
    error = y - y_pred                            # residuals
    W_new = W + learn_rate * np.matmul(error, X)  # gradient step for the weights
    b_new = b + learn_rate * error.sum()          # gradient step for the intercept
    return W_new, b_new

L1 regularization adds the absolute values of the coefficients to the error, penalizing complexity. L2 regularization adds the squares of the coefficients.

λ is the coefficient used to tune L1 & L2 regularization.

L1 regularization

  • Computationally inefficient (unless data is sparse)
  • Better for sparse outputs
  • Feature selection (drives less relevant columns to 0)

L2 regularization

  • Computationally efficient
  • Better for non-sparse outputs
  • No feature selection

Standardizing is completed by taking each value of your column, subtracting the mean of the column, and then dividing by the standard deviation of the column. Normalizing scales data between 0 and 1.

When Should I Use Feature Scaling?

  • When your algorithm uses a distance based metric to predict.
  • When you incorporate regularization.
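
A minimal sketch with scikit-learn's scalers (X is a toy feature matrix):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])   # toy feature matrix

X_std = StandardScaler().fit_transform(X)    # standardizing: mean 0, standard deviation 1
X_norm = MinMaxScaler().fit_transform(X)     # normalizing (max-min scaling): values between 0 and 1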

Linear boundary: w₁x₁ + w₂x₂ + b = 0, i.e. Wx + b = 0, where W = (w₁, w₂) and x = (x₁, x₂).

Perceptron:

$$\hat y = \begin{cases} 1 & \text{if } Wx + b \ge 0 \\ 0 & \text{if } Wx + b < 0 \end{cases}$$
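
A minimal numpy sketch of the perceptron prediction rule (weights and points are arbitrary toy values):

import numpy as np

def perceptron_predict(X, W, b):
    # 1 where Wx + b >= 0, else 0, for each row of X
    return (np.matmul(X, W) + b >= 0).astype(int)

W = np.array([1.0, -2.0])                  # (w1, w2)
b = 0.5
X = np.array([[3.0, 1.0], [0.0, 2.0]])     # two points
print(perceptron_predict(X, W, b))         # -> [1 0]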

Entropy in a set for 2 classes and multi-class:

$$entropy = -\frac{m}{m+n}\log_2\left(\frac{m}{m+n}\right)-\frac{n}{m+n}\log_2\left(\frac{n}{m+n}\right)$$

$$entropy = -\sum_{i=1}^n p_i\log_2(p_i)$$

As entropy increases, knowledge decreases, and vice versa.

Information gain is a change in entropy.

When you split a dataset, the information gain is the difference between the entropy of the parent and the average entropy of the children.

$$InformationGain = Entropy(Parent) - \left(\frac{m}{m+n}Entropy(Child_1) + \frac{n}{m+n}Entropy(Child_2)\right)$$

The decision tree algorithm looks at the possible splits that each column gives, calculates the information gain, and picks the largest one.
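
A minimal numpy sketch of these calculations (the helper names are mine, not from any library):

import numpy as np

def entropy(labels):
    # multi-class entropy of a list/array of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, child1, child2):
    # Entropy(parent) minus the size-weighted average entropy of the children
    m, n = len(child1), len(child2)
    return entropy(parent) - (m / (m + n) * entropy(child1) + n / (m + n) * entropy(child2))

parent = ['a', 'a', 'a', 'b', 'b', 'b']
print(information_gain(parent, ['a', 'a', 'a'], ['b', 'b', 'b']))   # perfect split -> 1.0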

Common hyperparameters for decision trees:

  • Maximum depth
  • Minimum number of samples per leaf
  • Minimum number of samples per split
  • Maximum number of features
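
In scikit-learn these map onto DecisionTreeClassifier arguments, for example (values are arbitrary):

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    max_depth=5,           # maximum depth
    min_samples_leaf=10,   # minimum number of samples per leaf
    min_samples_split=20,  # minimum number of samples per split
    max_features=None      # maximum number of features considered per split
)
# model.fit(X_train, y_train); model.predict(X_test)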

Naïve Bayes assumes that conditions are independent.

P(spam|‘easy’,‘money’) ∝ P(’easy’|spam)⋅P(‘money’|spam)⋅P(spam)
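
A toy numeric sketch of that proportionality (all probabilities below are made-up illustrative values):

# Hypothetical conditional probabilities and prior
p_easy_given_spam = 0.5    # P('easy'|spam)
p_money_given_spam = 0.4   # P('money'|spam)
p_spam = 0.3               # P(spam)

# Unnormalized posterior; compute the same product for 'ham' and divide by the sum to normalize
posterior_spam = p_easy_given_spam * p_money_given_spam * p_spam
print(posterior_spam)      # 0.06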

Error in Support Vector Machines: ERROR = C⋅CLASSIFICATION_ERROR + MARGIN_ERROR

$$Margin = \frac{2}{|W|} \qquad Error = |W|^2$$

SVMs can have linear, polynomial or radial basis function (RBF) kernels.

In an RBF kernel, a large γ is similar to having a large value of C: the algorithm will attempt to classify every point correctly, at the risk of overfitting.
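
A minimal scikit-learn sketch of the three kernels (hyperparameter values are arbitrary):

from sklearn.svm import SVC

linear_svm = SVC(kernel='linear', C=1.0)
poly_svm = SVC(kernel='poly', degree=3, C=1.0)
rbf_svm = SVC(kernel='rbf', C=10.0, gamma=0.1)   # large C / large gamma -> tighter fit, more risk of overfitting
# rbf_svm.fit(X_train, y_train); rbf_svm.predict(X_test)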

By combining algorithms, we can often build models that perform better by meeting in the middle in terms of bias and variance.

The introduction of randomness combats the tendency of these algorithms to overfit. There are two main ways that randomness is introduced:

  • Bootstrap the data - that is, sample the data with replacement and fit your algorithm to the sampled data.
  • Subset the features - in each split of a decision tree, or with each algorithm used in an ensemble, only a subset of the total possible features is used.

A random forest builds multiple decision trees from multiple subsets of features, and then takes a vote.

Bagging fits a model to each of several random samples of the data, and then combines their predictions by voting.

AdaBoost fits a model, increases the weights of the misclassified points, fits a new model on the reweighted data, and so on; the models are then combined, weighted by their accuracy.
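
Minimal scikit-learn sketches of these ensembles (hyperparameters are arbitrary; each defaults to decision trees as the weak learner):

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier

bagging = BaggingClassifier(n_estimators=100)                            # one tree per bootstrapped sample
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt')   # random feature subset at each split
boost = AdaBoostClassifier(n_estimators=100)                             # sequentially reweighted weak learners
# each supports .fit(X_train, y_train) and .predict(X_test)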

Bias and Variance:

  • High Bias, Low Variance models tend to underfit data, as they are not flexible. Linear models fall into this category of models.
  • High Variance, Low Bias models tend to overfit data, as they are too flexible. Decision trees fall into this category of models.

Ensemble Models: There were two randomization techniques you saw to combat overfitting:

  • Bootstrap the data - that is, sample the data with replacement and fit your algorithm to the sampled data.
  • Subset the features - in each split of a decision tree, or with each algorithm used in an ensemble, only a subset of the total possible features is used.

Thou shalt never use your testing data for training.

Confusion Matrix: Type 1 and Type 2 Errors

  • Type 1 Error (error of the first kind, or False Positive): in the medical example, this is when we misdiagnose a healthy patient as sick.
  • Type 2 Error (error of the second kind, or False Negative): in the medical example, this is when we misdiagnose a sick patient as healthy.

Precision: TP / (TP + FP) - Think: murder trial (false positives are costly)

Recall: TP / (TP + FN) - Think: parachute manufacturer (false negatives are costly)

$$HarmonicMean = \frac{2xy}{x+y} \qquad F_1 = 2\cdot\frac{Precision\times Recall}{Precision+Recall}$$

$$F_\beta = (1+\beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}} = \frac{\text{Precision} \cdot \text{Recall}}{\frac{\beta^2}{1+\beta^2}\text{Precision} + \frac{1}{1+\beta^2}\text{Recall}}$$

The area under the Receiver Operating Characteristic (ROC) Curve is 1 for a perfect split, and .5 for a random split.

Classification metrics:

  • Accuracy
  • Precision
  • Recall
  • Fβ Score
  • ROC Curve & AUC
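
A minimal scikit-learn sketch of these metrics (the labels and scores below are placeholders):

from sklearn.metrics import accuracy_score, precision_score, recall_score, fbeta_score, roc_auc_score

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]                 # hypothetical predictions
y_scores = [0.1, 0.6, 0.8, 0.9, 0.4]     # hypothetical predicted probabilities

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(fbeta_score(y_true, y_pred, beta=0.5))   # beta < 1 favours precision, beta > 1 favours recall
print(roc_auc_score(y_true, y_scores))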

Regression metrics:

  • Mean absolute error
  • MSE
  • R² score = 1 - (MSE of the model) / (MSE of a horizontal line at the mean)
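
And a matching sketch for the regression metrics (placeholder arrays):

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 8.0]   # hypothetical predictions

print(mean_absolute_error(y_true, y_pred))
print(mean_squared_error(y_true, y_pred))
print(r2_score(y_true, y_pred))   # 1 - MSE(model) / MSE(horizontal line at the mean)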

Underfitting = high bias Overfitting = high variance

Model Complexity Graph

  • Underfitting gives a high error on both training and cross-validation sets
  • Overfitting gives a low error on the training set and high error on the cross-validation set

K-fold cross validation (see scikit-learn)
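
A minimal scikit-learn sketch (the model choice and X, y are placeholders):

from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

kf = KFold(n_splits=4, shuffle=True, random_state=0)    # shuffle guards against ordered data
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=kf)
print(scores.mean())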

  • A model with high bias converges to a high error as the number of training points increases
  • A model that’s just right converges to a low error
  • A model with high variance does not converge: the gap between training and cross-validation error stays large

Some Supervised Learning Models available in scikit-learn

  • Gaussian Naive Bayes (GaussianNB)
  • Decision Trees
  • Ensemble Methods (Bagging, AdaBoost, Random Forest, Gradient Boosting)
  • K-Nearest Neighbors (KNeighbors)
  • Stochastic Gradient Descent Classifier (SGDC)
  • Support Vector Machines (SVM)
  • Logistic Regression

Backpropagation:

$$\delta^h_j = \sum_k W_{jk}\,\delta^o_k\, f'(h_j)$$

$$\Delta w_{ij} = \eta\, \delta^h_j x_i$$
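
A minimal numpy sketch of one such update for a single hidden layer and one sigmoid output (shapes and names are my own, not the course code):

import numpy as np

def sigmoid(h):
    return 1 / (1 + np.exp(-h))

# Toy shapes: 3 inputs -> 2 hidden units -> 1 output
x = np.array([0.5, 0.1, -0.2])
y = 0.6
eta = 0.5                                    # learning rate
W_in = np.random.normal(size=(3, 2))         # input -> hidden weights
W_out = np.random.normal(size=2)             # hidden -> output weights

# Forward pass
h = np.matmul(x, W_in)                       # hidden layer input h_j
a = sigmoid(h)                               # hidden layer activation, so f'(h_j) = a * (1 - a)
y_hat = sigmoid(np.matmul(a, W_out))         # network output

# Backward pass
delta_o = (y - y_hat) * y_hat * (1 - y_hat)  # output error term
delta_h = W_out * delta_o * a * (1 - a)      # delta^h_j = W_jk * delta^o_k * f'(h_j)
W_out += eta * delta_o * a                   # hidden -> output update
W_in += eta * delta_h * x[:, None]           # Delta w_ij = eta * delta^h_j * x_i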

2 popular methods for unsupervised machine learning:

  • Clustering
  • Dimensionality Reduction

In the K-means algorithm ‘k’ represents the number of clusters you have in your dataset.

When you have no idea how many clusters exist in your dataset, a common strategy for determining k is the elbow method. In the elbow method, you create a plot of the number of clusters (on the x-axis) vs. the average distance from each point to its cluster center (on the y-axis). This plot is called a scree plot.

To use KMeans, you need to follow three steps:

  1. Instantiate your model.
  2. Fit your model to the data.
  3. Predict the labels for the data.
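
A minimal scikit-learn sketch of those three steps (X is a placeholder feature matrix; k=4 is arbitrary):

from sklearn.cluster import KMeans

model = KMeans(n_clusters=4, n_init=10, random_state=0)   # 1. instantiate
model.fit(X)                                              # 2. fit to the data
labels = model.predict(X)                                 # 3. predict the labels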

The starting points of the centroids can actually make a difference as to the final results you obtain from the k-means algorithm.

The best set of clusters is then the clustering that creates the smallest average distance from each point to its corresponding centroid.

Feature scaling - there are two ways that are most common:

  • Normalizing or Max-Min Scaling - this type of scaling moves variables between 0 and 1.
  • Standardizing or Z-Score Scaling - this type of scaling creates variables with a mean of 0 and standard deviation of 1.

I. Clustering

  • Visual Inspection of your data.
  • Pre-conceived ideas of the number of clusters.
  • The elbow method, which compares the average distance of each point to the cluster center for different numbers of centers.

II. K-Means You saw the k-means algorithm for clustering data, which has 3 steps:

  1. Randomly place k centroids amongst your data. Then repeat the following two steps until convergence (the centroids don’t change):
  2. Look at the distance from each centroid to each point. Assign each point to the closest centroid.
  3. Move each centroid to the center of the points assigned to it.
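
A compact numpy sketch of those steps (names are mine; it ignores the empty-cluster edge case, so prefer sklearn's KMeans in practice):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]          # 1. random initial centroids
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                                 # 2. assign points to the closest centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                     # converged: centroids didn't change
            break
        centroids = new_centroids                                     # 3. move centroids to the centers
    return labels, centroids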

III. Concerns with K-Means

  1. Concern: The random placement of the centroids may lead to non-optimal solutions. Solution: Run the algorithm multiple times and choose the centroids that create the smallest average distance of the points to the centroids.
  2. Concern: Depending on the scale of the features, you may end up with different groupings of your points. Solution: Scale the features using Standardizing, which will create features with mean 0 and standard deviation 1 before running the k-means algorithm.

Hierarchical Clustering

  • The first step is to assume that each point is already a cluster

  • the next step is to calculate the distances between each point and every other point

    • choose the smallest distance between two clusters
    • group those two points into a cluster
  • Single link looks at the closest two points in the two clusters (not in sklearn)

  • Complete link looks at the farthest two points in the two clusters

  • Average link looks at the distance between every point and every other point in the other cluster and averages them

  • Ward’s method merges the two clusters that minimize the increase in variance within the clusters
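
A minimal sketch of the scikit-learn and scipy interfaces (X is a placeholder; cluster counts are arbitrary):

from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage, dendrogram

complete = AgglomerativeClustering(n_clusters=3, linkage='complete').fit_predict(X)
average = AgglomerativeClustering(n_clusters=3, linkage='average').fit_predict(X)
ward = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X)

dendrogram(linkage(X, method='ward'))   # scipy draws the hierarchy as a dendrogram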

Advantages of HC

  • hierarchical representations are informative and they provide us with an additional ability to visualize the clustering structure of the datasets
  • especially useful when the data contains real hierarchical relationships inside of it

Disadvantages of HC

  • sensitive to noise and outliers so you’re going to have to clean up the data set from any noise and outliers beforehand
  • O(n²)

Density-based Clustering, DBSCAN (Density-based Spatial Clustering of Applications with Noise)

Advantages of DBSCAN

  • don’t need to specify the number of clusters
  • flexibility in the shapes and sizes of clusters it’s able to find
  • robust in that it’s able to deal with noise and outliers in the data

Disadvantages of DBSCAN

  • not guaranteed to return the same clustering
  • it has difficulties finding clusters of varying densities
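
A minimal scikit-learn sketch (eps and min_samples are arbitrary; X is a placeholder):

from sklearn.cluster import DBSCAN

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)   # label -1 marks noise points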

Gaussian Mixture Model (GMM)

Expectation-Maximization for Gaussian Mixtures:

  1. Initialize k Gaussian distributions
  2. Soft-cluster data - “expectation”
  3. Re-estimate the Gaussians - “maximization”
  4. Evaluate the log-likelihood to check for convergence
  5. Repeat from step 2 until converged
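
A minimal scikit-learn sketch (X is a placeholder; the number of components is arbitrary):

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)   # runs EM under the hood
hard_labels = gmm.predict(X)           # most likely cluster per point
soft_labels = gmm.predict_proba(X)     # soft-clustering: membership probabilities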

Advantages of GMM

  • Soft-clustering (sample membership of multiple clusters)
  • Cluster shape flexibility

Disadvantages of GMM

  • Sensitive to initialization values
  • Possible to converge to a local optimum
  • Slow convergence rate

Principal Component Analysis (PCA)

I. The amount of variance explained by each component. This is called an eigenvalue.

II. The principal components themselves; each component is a vector of weights. In this case, the principal components help us understand which pixels of the image are most helpful in identifying the difference between digits. Principal components are also known as eigenvectors.

Dimensionality Reduction and Latent Features

  • Principal Component Analysis is a technique that is used to reduce the dimensionality of your dataset. The reduced features are called principal components, which can be thought of as latent features. These principal components are simply a linear combination of the original features in your dataset.
  • these components have two major properties:
    • They aim to capture the most amount of variability in the original dataset.
    • They are orthogonal to (independent of) one another.

Interpreting Results

  • The variance (eigenvalue) explained by each component. Visualize this with scree plots to understand how many components you might keep based on how much information was being retained.
  • The components (eigenvectors) give us an idea of which original features were most related to why a component was able to explain certain aspects about the original datasets.
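
A minimal scikit-learn sketch (X is a placeholder; the number of components is arbitrary):

from sklearn.decomposition import PCA

pca = PCA(n_components=10).fit(X)
print(pca.explained_variance_ratio_)   # variance explained per component (plot this for a scree plot)
print(pca.components_)                 # the components: one weight vector per component
X_reduced = pca.transform(X)           # project the data onto the principal components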

Random Projection

Independent Component Analysis (ICA)

Scikit Learn uses FastICA
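
Minimal scikit-learn sketches of both techniques (X is a placeholder; component counts are arbitrary):

from sklearn.random_projection import SparseRandomProjection
from sklearn.decomposition import FastICA

X_proj = SparseRandomProjection(n_components=100).fit_transform(X)    # random projection
sources = FastICA(n_components=3, random_state=0).fit_transform(X)    # estimated independent components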

Three strategies for working with missing values include:

  • We can remove (or “drop”) the rows or columns holding the missing values.
  • We can impute the missing values.
  • We can build models that work around them, and only use the information provided.
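
A minimal pandas sketch of the first two strategies (df and its columns are placeholders):

df_dropped_rows = df.dropna()                       # drop rows holding missing values
df_dropped_cols = df.dropna(axis=1)                 # drop columns holding missing values
df_imputed = df.fillna(df.mean(numeric_only=True))  # impute numeric columns with their means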

The CRISP-DM Process (Cross-Industry Standard Process for Data Mining)

  1. Business Understanding
  2. Data Understanding
  3. Prepare Data
  4. Data Modeling
  5. Evaluate the Results
  6. Deploy

GitHub README should have:

  1. Installation instructions
  2. Project Motivation
  3. File Descriptions
  4. How to Interact with your project
  5. Licensing, Authors, Acknowledgements, etc.

Questions to Ask Yourself When Conducting a Code Review: Is the code clean and modular?

  • Can I understand the code easily?
  • Does it use meaningful names and whitespace?
  • Is there duplicated code?
  • Can you provide another layer of abstraction?
  • Is each function and module necessary?
  • Is each function or module too long?

Is the code efficient?

  • Are there loops or other steps we can vectorize?
  • Can we use better data structures to optimize any steps?
  • Can we shorten the number of calculations needed for any steps?
  • Can we use generators or multiprocessing to optimize any steps?

Is documentation effective?

  • Are in-line comments concise and meaningful?
  • Is there complex code that’s missing documentation?
  • Do functions use effective docstrings?
  • Is the necessary project documentation provided?

Is the code well tested?

  • Does the code have high test coverage?
  • Do tests check for interesting cases?
  • Are the tests readable?
  • Can the tests be made more efficient?

Is the logging effective?

  • Are log messages clear, concise, and professional?
  • Do they include all relevant and useful information?
  • Do they use the appropriate logging level?

The 3 stages of an NLP pipeline are: Text Processing > Feature Extraction > Modeling.

  • Text Processing: Take raw input text, clean it, normalize it, and convert it into a form that is suitable for feature extraction.
  • Feature Extraction: Extract and produce feature representations that are appropriate for the type of NLP task you are trying to accomplish and the type of model you are planning to use.
  • Modeling: Design a statistical or machine learning model, fit its parameters to training data, use an optimization procedure, and then use it to make predictions about unseen data.

Text data from different sources is prepared with the following text processing steps:

  1. Cleaning to remove irrelevant items, such as HTML tags

  2. Normalizing by converting to all lowercase and removing punctuation

  3. Splitting text into words or tokens

  4. Removing words that are too common, also known as stop words

  5. Identifying different parts of speech and named entities

    • Part of Speech (POS) and Named Entity Recognition (NER)
    • NLTK
  6. Converting words into their dictionary forms, using lemmatization and then stemming
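
A minimal NLTK sketch of steps 2-6 (the sentence is arbitrary; the referenced NLTK corpora must be downloaded first):

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer

# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')

text = "The first time you see The Second Renaissance it may look boring."
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())                       # normalize: lowercase, strip punctuation
tokens = nltk.word_tokenize(text)                                       # split into tokens
tokens = [t for t in tokens if t not in stopwords.words("english")]     # remove stop words
tokens = [WordNetLemmatizer().lemmatize(t) for t in tokens]             # lemmatize
tokens = [PorterStemmer().stem(t) for t in tokens]                      # then stem
print(tokens)
# nltk.pos_tag / nltk.ne_chunk cover POS tagging and NER (they need extra downloads)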

Bag of Words

  • collect all of the unique words in your corpus into a vocabulary
  • create a Document-Term Matrix, where each row is a sentence (a document) and each column is a word
  • each entry in the matrix is a word frequency
  • compare sentences in the matrix using cosine similarity

$$\cos(\theta) = \frac{a \cdot b}{|a| \cdot |b|}$$

TF-IDF (term frequency * inverse document frequency)
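
A minimal scikit-learn sketch of both representations plus cosine similarity (the corpus is a toy example):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["the first document", "the second document", "something else entirely"]

bow = CountVectorizer().fit_transform(corpus)      # document-term matrix of word counts
tfidf = TfidfVectorizer().fit_transform(corpus)    # term frequency * inverse document frequency

print(cosine_similarity(tfidf))                    # pairwise similarity between documents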

Word Embeddings

  • Word2Vec
  • GloVe
  • t-SNE

If you have a lot of control over features, then you have an experiment. If you have no control over the features, then you have an observational study. If you have some control, then you have a quasi-experiment.

In a between-subjects experiment, each unit only participates in, or sees, one of the conditions being used in the experiment. If an individual completes all conditions, rather than just one, this is known as a within-subjects design.

Randomization still plays a part in a within-subjects design: the order in which individuals complete the conditions should be randomized.

In a simple random sample, each individual in the population has an equal chance of being selected. In a stratified random sample, we need to first divide the entire population into disjoint groups, or strata. Then, from each group, you take a simple random sample.
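
A minimal pandas sketch of both sampling schemes (df and the 'stratum' column are placeholders; GroupBy.sample needs pandas >= 1.1):

simple = df.sample(frac=0.1, random_state=0)                                            # simple random sample
stratified = df.groupby('stratum', group_keys=False).sample(frac=0.1, random_state=0)   # simple random sample within each stratum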

Evaluation metrics are the metrics by which we compare groups. Invariant metrics are metrics that we hope will not be different between groups.

If we aren’t able to control all features or there is a lack of equivalence between groups, then we may be susceptible to confounding variables.

Construct validity is tied to how well one’s goals are aligned to the evaluation metrics used to evaluate it. Internal validity refers to the degree to which a causal relationship can be derived from an experiment’s results. External validity is concerned with the ability of an experimental outcome to be generalized to a broader population.

Biases in experiments are systematic effects that interfere with the interpretation of experimental results, mostly in terms of internal validity.

  • Sampling biases are those that cause our observations to not be representative of the population.
    • Studies that use surveys to collect data often have to deal with the self-selection bias.
    • Survivor bias is one where losses or dropout of observed units is not accounted for in an analysis.
  • A novelty effect is one that causes observers to change their behavior simply because they’re seeing something new.
  • Order bias may occur when the order in which conditions are completed could have an effect on participant responses.
    • A primacy effect is one that affects early conditions, perhaps biasing them to be recalled better or to serve as anchor values for later conditions.
    • A recency effect is one that affects later conditions, perhaps causing bias due to being fresher in memory or task fatigue.
  • Experimenter bias is where the presence or knowledge of the experimenter can affect participants’ behaviors or performance.
    • The double-blind design hides condition information from both the administrator and participant in order to have a strong rein on experimenter-based biases.

# Find p-value (two-sided test) - analytic approach
# Assumes `data` is a DataFrame with a 'condition' column (0 = control, 1 = experiment)
import numpy as np
from scipy import stats

n_obs = data.shape[0]                             # total number of observations
n_control = data.groupby('condition').size()[0]   # size of the control group
p = 0.5                                           # null hypothesis: 50/50 assignment
sd = np.sqrt(p * (1-p) * n_obs)                   # binomial standard deviation
z = ((n_control + 0.5) - p * n_obs) / sd          # z-score with continuity correction
print(z)
print(2 * stats.norm.cdf(-abs(z)))                # two-sided p-value (valid for either sign of z)

# Find p-value (two-sided test) - simulation approach
n_obs = data.shape[0]
n_control = data.groupby('condition').size()[0]    # assumes the control group is the smaller one
p = 0.5
n_trials = 200_000
samples = np.random.binomial(n_obs, p, n_trials)   # simulate control-group sizes under the null
print(np.logical_or(samples <= n_control, samples >= (n_obs - n_control)).mean())   # two-tailed proportion

# Perform hypothesis test (one-sided test) - analytic approach
# Assumes `data` also has a binary 'click' column
p_click = data.groupby('condition')['click'].mean()   # click-through rate per condition
n_control = data.groupby('condition').size()[0]
n_exper = data.groupby('condition').size()[1]
p_null = data['click'].mean()                          # pooled click rate under the null
se_p = np.sqrt(p_null * (1-p_null) * (1/n_control + 1/n_exper))   # standard error of the difference
z = (p_click[1] - p_click[0]) / se_p                   # z-score for the observed difference
print(z)
print(1 - stats.norm.cdf(z))                           # one-sided (upper-tail) p-value

# Perform hypothesis test (one-sided test) - simulation approach
n_control = data.groupby('condition').size()[0]
n_exper = data.groupby('condition').size()[1]
p_null = data['click'].mean()
n_trials = 200_000
ctrl_clicks = np.random.binomial(n_control, p_null, n_trials)   # simulate control clicks under the null
exp_clicks = np.random.binomial(n_exper, p_null, n_trials)      # simulate experiment clicks under the null
samples = exp_clicks / n_exper - ctrl_clicks / n_control        # simulated differences in click rate
print((samples >= (p_click[1] - p_click[0])).mean())            # proportion at least as extreme as observed

SMART Experiment Design

  • Specific: Make sure the goals of your experiment are specific.
  • Measurable: Outcomes must be measurable using objective metrics
  • Achievable: The steps taken for the experiment and the goals must be realistic.
  • Relevant: The experiment needs to have purpose behind it.
  • Timely: Results must be obtainable in a reasonable time frame.
