Data Science
Links:
A Kaggle Master Explains Gradient Boosting
Density-Based Clustering Validation
Speaker Verification Using Adapted Gaussian Mixture Models
Independent component analysis: algorithms and applications
Nonparametric discovery of human routines from sensor data
Random projection in dimensionality reduction: Applications to image and text data
Dynamic Principal Component Analysis in Multivariate Time-Series Segmentation
SVMs - Stanford's CS229 Lecture notes
Independent Component Analysis of Electroencephalographic Data
Adaptive background mixture models for real-time tracking
A tutorial on Principal Components Analysis
Application of the Gaussian mixture model in pulsar astronomy
Robust PCA for Anomaly Detection in Cyber Networks
Technical Notes On Using Data Science & Artificial Intelligence
A Short Introduction to Boosting
Anomaly detection in temperature data using dbscan algorithm
TDD is Essential for Good Data Science: Here's Why
Yes, you should understand backprop
Traffic Classification Using Clustering Algorithms
Random Projections for k-means Clustering
Applying Independent Component Analysis to Factor Model in Finance
Experiments with a New Boosting Algorithm
Faces recognition example using eigenfaces and SVMs
https://sebastianraschka.com/Articles/2015_pca_in_3_steps.html
Visualizing K-Means Clustering
Four Ways Data Science Goes Wrong and How Test-Driven Data Analysis Can Help
What is an intuitive explanation of Gradient Boosting?
A lecture from Stanford's CS231n course
Eigenvectors and eigenvalues (YouTube)
Classification → categorical outcomes; Regression → numerical outcomes
 Supervised learning
 Classification
 Regression
 Unsupervised learning
 Clustering
 Reinforcement learning
Absolute Trick (to move a line towards a point): given y = w₁x + w₂, point (p, q), and learning rate α: y = (w₁ + pα)x + (w₂ + α)
Square Trick: y = (w₁ + p(q - q′)α)x + (w₂ + (q - q′)α)
Mean Absolute Error:
$$Error = \frac{1}{m}\sum_{i=1}^m |y - \hat y|$$
Mean Squared Error:
$$Error = \frac{1}{2m}\sum_{i=1}^m (y - \hat y)^2$$
# Example: calculate the MSE gradient explicitly to update the line (illustrative; don't actually use this)
import numpy as np

def MSEStep(X, y, W, b, learn_rate=0.005):
    # one gradient-descent step on the mean squared error
    y_pred = np.matmul(X, W) + b
    error = y - y_pred
    W_new = W + learn_rate * np.matmul(error, X)
    b_new = b + learn_rate * error.sum()
    return W_new, b_new
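A minimal sketch of running such a step function in a gradient-descent loop. The synthetic data, learning rate, and iteration count here are made up for illustration, and the step logic is repeated so the snippet runs standalone:

```python
import numpy as np

def mse_step(X, y, W, b, learn_rate=0.005):
    # one gradient step on MSE, matching the update rule above
    y_pred = np.matmul(X, W) + b
    error = y - y_pred
    return W + learn_rate * np.matmul(error, X), b + learn_rate * error.sum()

# synthetic data: y = 3x + 1 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 3 * X[:, 0] + 1 + rng.normal(0, 0.1, size=100)

W, b = np.zeros(1), 0.0
for _ in range(500):
    W, b = mse_step(X, y, W, b, learn_rate=0.01)
print(W, b)  # should approach [3.] and 1.
```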
L1 regularization adds the absolute values of the coefficients to the error, penalizing complexity. L2 regularization adds the squares of the coefficients.
λ is the coefficient used to tune L1 & L2 regularization.
L1 regularization
 Computationally inefficient (unless data is sparse)
 Better for sparse outputs
 Feature selection (drives less relevant columns to 0)
L2 regularization
 Computationally efficient
 Better for non-sparse outputs
 No feature selection
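A quick numpy sketch of how the two penalties are added to the error; the λ value and coefficients here are made up for illustration:

```python
import numpy as np

def regularized_error(base_error, coeffs, lam, kind="l2"):
    # L1 adds the absolute values of the coefficients; L2 adds their squares
    coeffs = np.asarray(coeffs, dtype=float)
    if kind == "l1":
        penalty = lam * np.abs(coeffs).sum()
    else:
        penalty = lam * np.square(coeffs).sum()
    return base_error + penalty

w = [2.0, -3.0, 0.5]
print(regularized_error(1.0, w, lam=0.1, kind="l1"))  # 1.0 + 0.1 * 5.5 = 1.55
print(regularized_error(1.0, w, lam=0.1, kind="l2"))  # 1.0 + 0.1 * 13.25 = 2.325
```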
Standardizing is completed by taking each value of your column, subtracting the mean of the column, and then dividing by the standard deviation of the column. Normalizing scales data between 0 and 1.
When Should I Use Feature Scaling?
 When your algorithm uses a distance-based metric to predict.
 When you incorporate regularization.
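The two scaling recipes described above can be sketched directly in numpy (the example column is invented):

```python
import numpy as np

def standardize(col):
    # subtract the mean of the column, divide by its standard deviation
    col = np.asarray(col, dtype=float)
    return (col - col.mean()) / col.std()

def min_max_normalize(col):
    # rescale the column to the [0, 1] interval
    col = np.asarray(col, dtype=float)
    return (col - col.min()) / (col.max() - col.min())

x = np.array([10.0, 20.0, 30.0, 40.0])
print(standardize(x))        # mean 0, standard deviation 1
print(min_max_normalize(x))  # values between 0 and 1
```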
Linear boundary: w₁x₁ + w₂x₂ + b = 0, i.e. Wx + b = 0, where W = (w₁, w₂) and x = (x₁, x₂)
Perceptron: ŷ = 1 if Wx + b ≥ 0, ŷ = 0 if Wx + b < 0
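The perceptron rule above is a one-liner in numpy; the weights, bias, and points below are a hypothetical example:

```python
import numpy as np

def perceptron_predict(X, W, b):
    # y_hat = 1 if Wx + b >= 0 else 0, vectorized over the rows of X
    return (np.matmul(X, W) + b >= 0).astype(int)

# hypothetical boundary x1 + x2 - 1 = 0
W = np.array([1.0, 1.0])
b = -1.0
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]])
print(perceptron_predict(X, W, b))  # [0 1 1]
```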
Entropy in a set for 2 classes and multiclass:
$$entropy = -\frac{m}{m+n}\log_2\left(\frac{m}{m+n}\right) - \frac{n}{m+n}\log_2\left(\frac{n}{m+n}\right)$$ $$entropy = -\sum_{i=1}^n p_i\log_2(p_i)$$
As entropy increases, knowledge decreases, and vice versa.
Information gain is a change in entropy.
When you split a dataset, the information gain is the difference between the entropy of the parent and the average entropy of the children.
$$InformationGain = Entropy(Parent) - \left(\frac{m}{m+n}Entropy(Child_1) + \frac{n}{m+n}Entropy(Child_2)\right)$$
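The entropy and information-gain formulas translate to a few lines of numpy; the toy labels below are illustrative:

```python
import numpy as np

def entropy(labels):
    # multiclass entropy: -sum p_i * log2(p_i)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, child1, child2):
    # parent entropy minus the weighted average entropy of the children
    n = len(parent)
    return entropy(parent) - (len(child1) / n * entropy(child1)
                              + len(child2) / n * entropy(child2))

parent = [0, 0, 1, 1]
print(entropy(parent))                           # 1.0 for a 50/50 split
print(information_gain(parent, [0, 0], [1, 1]))  # 1.0: a perfect split
```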
The decision tree algorithm looks at the possible splits that each column gives, calculates the information gain, and picks the largest one.
Common hyperparameters for decision trees:
 Maximum depth
 Minimum number of samples per leaf
 Minimum number of samples per split
 Maximum number of features
Naïve Bayes assumes that conditions are independent.
P(spam | 'easy', 'money') ∝ P('easy' | spam)⋅P('money' | spam)⋅P(spam)
Error in Support Vector Machines: ERROR = C⋅CLASSIFICATION_ERROR + MARGIN_ERROR
$$Margin = \frac{2}{\lVert W \rVert} \qquad Error = \lVert W \rVert^2$$
SVMs can have linear, polynomial or radial basis function (RBF) kernels.
In an RBF kernel, a large γ is similar to having a large value of C; that is, your algorithm will attempt to classify every point correctly.
By combining algorithms, we can often build models that perform better by meeting in the middle in terms of bias and variance.
The introduction of randomness combats the tendency of these algorithms to overfit. There are two main ways that randomness is introduced:
 Bootstrap the data - that is, sample the data with replacement and fit your algorithm to the sampled data.
 Subset the features - in each split of a decision tree, or for each algorithm used in an ensemble, only a subset of the total possible features is used.
A random forest builds multiple decision trees from multiple subsets of features, and then takes a vote.
Bagging takes random samples of the data, fits a learner to each sample, and then combines their predictions by voting.
AdaBoost fits a model, adds weight to the misclassified points, fits a new model on the reweighted data, and so on; it then combines the models.
Bias and Variance:
 High Bias, Low Variance models tend to underfit data, as they are not flexible. Linear models fall into this category of models.
 High Variance, Low Bias models tend to overfit data, as they are too flexible. Decision trees fall into this category of models.
Thou shalt never use your testing data for training.
Confusion Matrix: Type 1 and Type 2 Errors
 Type 1 Error (error of the first kind, or False Positive): in the medical example, this is when we misdiagnose a healthy patient as sick.
 Type 2 Error (error of the second kind, or False Negative): in the medical example, this is when we misdiagnose a sick patient as healthy.
Precision: TP / (TP + FP) - think: murder trial
Recall: TP / (TP + FN) - think: parachute manufacturer
$$HarmonicMean = \frac{2xy}{x+y} \qquad F_1 = 2\cdot\frac{Precision\times Recall}{Precision+Recall}$$ $$F_\beta = (1+\beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}} = \frac{\text{Precision} \cdot \text{Recall}}{\frac{\beta^2}{1+\beta^2}\text{Precision} + \frac{1}{1+\beta^2}\text{Recall}}$$
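These metrics are simple enough to compute by hand; the confusion-matrix counts below are invented for illustration:

```python
def precision(tp, fp):
    # of everything predicted positive, how much really was
    return tp / (tp + fp)

def recall(tp, fn):
    # of everything actually positive, how much we caught
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    # beta = 1 gives F1, the harmonic mean; larger beta weights recall more
    return (1 + beta**2) * p * r / (beta**2 * p + r)

p, r = precision(tp=8, fp=2), recall(tp=8, fn=8)  # 0.8 and 0.5
print(f_beta(p, r))            # F1
print(f_beta(p, r, beta=2.0))  # F2 leans towards recall
```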
The area under the Receiver Operating Characteristic (ROC) curve is 1 for a perfect split and 0.5 for a random split.
Classification metrics:
 Accuracy
 Precision
 Recall
 Fβ Score
 ROC Curve & AUC
Regression metrics:
 Mean absolute error
 MSE
 r² score = 1 - MSE(model) / MSE(horizontal line)
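The r² definition above, with the horizontal line taken at the mean of the targets, can be sketched as follows (the target values are made up):

```python
import numpy as np

def r2_score(y_true, y_pred):
    # 1 - MSE(model) / MSE(horizontal line at the mean)
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mse_model = np.mean((y_true - y_pred) ** 2)
    mse_baseline = np.mean((y_true - y_true.mean()) ** 2)
    return 1 - mse_model / mse_baseline

y = [1.0, 2.0, 3.0, 4.0]
print(r2_score(y, y))                     # 1.0: perfect predictions
print(r2_score(y, [2.5, 2.5, 2.5, 2.5]))  # 0.0: no better than the mean
```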
Underfitting = high bias Overfitting = high variance
Model Complexity Graph
 Underfitting gives a high error on both training and cross-validation sets
 Overfitting gives a low error on the training set and a high error on the cross-validation set
K-fold cross-validation (see scikit-learn)
 A model with high bias converges to a high error as the number of training points increases
 A model that's just right converges to a low error
 A model with high variance does not converge
Some Supervised Learning Models available in scikit-learn
 Gaussian Naive Bayes (GaussianNB)
 Decision Trees
 Ensemble Methods (Bagging, AdaBoost, Random Forest, Gradient Boosting)
 K-Nearest Neighbors (KNeighbors)
 Stochastic Gradient Descent Classifier (SGDC)
 Support Vector Machines (SVM)
 Logistic Regression
Backpropagation:
$$\delta^h_j = f'(h_j)\sum_k{W_{jk}\delta^o_k}$$ $$\Delta w_{ij} = \eta\, \delta^h_j x_i$$
2 popular methods for unsupervised machine learning:
 Clustering
 Dimensionality Reduction
In the k-means algorithm, 'k' represents the number of clusters you have in your dataset.
When you have no idea how many clusters exist in your dataset, a common strategy for determining k is the elbow method. In the elbow method, you create a plot of the number of clusters (on the x-axis) vs. the average distance from each point to its cluster center (on the y-axis). This plot is called a scree plot.
To use KMeans, you need to follow three steps:
 Instantiate your model.
 Fit your model to the data.
 Predict the labels for the data.
The starting points of the centroids can actually make a difference in the final results you obtain from the k-means algorithm.
The best set of clusters is then the clustering that creates the smallest average distance from each point to its corresponding centroid.
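A bare-bones numpy sketch of the assign/update loop, not a replacement for scikit-learn's KMeans; the two blobs and the choice of starting centroids are invented for illustration:

```python
import numpy as np

def kmeans(X, centroids, n_iter=10):
    # alternate: assign each point to its nearest centroid, then
    # move each centroid to the mean of its assigned points
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, centroids

# two obvious blobs around (0, 0) and (10, 10)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(10, 0.5, (20, 2))])
labels, centers = kmeans(X, X[[0, -1]].copy())
print(centers)  # near (0, 0) and (10, 10)
```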
Feature scaling  there are two ways that are most common:
 Normalizing or Min-Max Scaling - this type of scaling moves variables between 0 and 1.
 Standardizing or Z-Score Scaling - this type of scaling creates variables with a mean of 0 and standard deviation of 1.
I. Clustering
 Visual Inspection of your data.
 Preconceived ideas of the number of clusters.
 The elbow method, which compares the average distance of each point to the cluster center for different numbers of centers.
II. K-Means
You saw the k-means algorithm for clustering data, which has 3 steps:
1. Randomly place k centroids amongst your data. Then repeat the following two steps until convergence (the centroids don't change):
2. Look at the distance from each centroid to each point. Assign each point to the closest centroid.
3. Move the centroid to the center of the points assigned to it.
III. Concerns with K-Means
 Concern: The random placement of the centroids may lead to non-optimal solutions. Solution: Run the algorithm multiple times and choose the centroids that create the smallest average distance of the points to the centroids.
 Concern: Depending on the scale of the features, you may end up with different groupings of your points. Solution: Scale the features using standardizing, which will create features with mean 0 and standard deviation 1, before running the k-means algorithm.
Hierarchical Clustering
 The first step is to assume that each point is already a cluster.
 The next step is to calculate the distances between each point and every other point, then:
  choose the smallest distance between two clusters
  group those two points into a cluster
 Single link looks at the closest two points in the two clusters (not in sklearn).
 Complete link looks at the farthest two points in the two clusters.
 Average link looks at the distance between every point and every other point in the other cluster and averages them.
 Ward's method merges the two clusters that yield the smallest increase in variance.
Advantages of HC
 hierarchical representations are informative and they provide us with an additional ability to visualize the clustering structure of the datasets
 especially useful when the data contains real hierarchical relationships inside of it
Disadvantages of HC
 sensitive to noise and outliers, so you have to clean the dataset of noise and outliers beforehand
 O(n²) computational complexity
Density-Based Clustering: DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Advantages of DBSCAN
 don't need to specify the number of clusters
 flexibility in the shapes and sizes of clusters it's able to find
 robust in that it's able to deal with noise and outliers in the data
Disadvantages of DBSCAN
 not guaranteed to return the same clustering on every run
 it has difficulties finding clusters of varying densities
Gaussian Mixture Model (GMM)
ExpectationMaximization for Gaussian Mixtures:
 Initialize k Gaussian distributions
 Soft-cluster the data - “expectation”
 Re-estimate the Gaussians - “maximization”
 Evaluate the loglikelihood to check for convergence
 Repeat from step 2 until converged
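The EM loop above can be sketched compactly for a 1-D, two-component mixture; the data and initial guesses are invented, and a real implementation would also track the log-likelihood to decide convergence:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_gmm_1d(x, mu, sigma, weight, n_iter=50):
    # two-component mixture; mu, sigma, weight are length-2 arrays
    for _ in range(n_iter):
        # E-step: soft-cluster each point (responsibilities)
        resp = np.stack([weight[k] * gaussian_pdf(x, mu[k], sigma[k])
                         for k in range(2)])
        resp /= resp.sum(axis=0)
        # M-step: re-estimate each Gaussian from the soft assignments
        for k in range(2):
            nk = resp[k].sum()
            mu[k] = (resp[k] * x).sum() / nk
            sigma[k] = np.sqrt((resp[k] * (x - mu[k]) ** 2).sum() / nk)
            weight[k] = nk / len(x)
    return mu, sigma, weight

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-5, 1, 300), rng.normal(5, 1, 300)])
mu, sigma, w = em_gmm_1d(x, np.array([-1.0, 1.0]),
                         np.array([2.0, 2.0]), np.array([0.5, 0.5]))
print(mu)  # near -5 and 5
```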
Advantages of GMM
 Soft-clustering (sample membership of multiple clusters)
 Cluster shape flexibility
Disadvantages of GMM
 Sensitive to initialization values
 Possible to converge to a local optimum
 Slow convergence rate
Principal Component Analysis (PCA)
I. The amount of variance explained by each component. This is called an eigenvalue.
II. The principal components themselves; each component is a vector of weights. In this case, the principal components help us understand which pixels of the image are most helpful in identifying the difference between digits. Principal components are also known as eigenvectors.
Dimensionality Reduction and Latent Features
 Principal Component Analysis is a technique that is used to reduce the dimensionality of your dataset. The reduced features are called principal components, which can be thought of as latent features. These principal components are simply a linear combination of the original features in your dataset.
 These components have two major properties:
  They aim to capture the most amount of variability in the original dataset.
  They are orthogonal to (independent of) one another.
Interpreting Results
 The variance (eigenvalue) explained by each component. Visualize this with scree plots to understand how many components you might keep based on how much information was being retained.
 The components (eigenvectors) give us an idea of which original features are most related to each component, and thus to the aspects of the original dataset that the component explains.
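Both quantities fall out of an eigendecomposition of the covariance matrix; this is a small numpy sketch on invented data, whereas in practice you would use scikit-learn's PCA:

```python
import numpy as np

def pca(X, n_components):
    # center the data, then eigendecompose the covariance matrix
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # returned in ascending order
    order = np.argsort(eigvals)[::-1]       # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained_ratio = eigvals / eigvals.sum()
    return Xc @ eigvecs[:, :n_components], explained_ratio

# hypothetical data stretched along one direction
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])
_, ratio = pca(X, 1)
print(ratio)  # the first component explains almost all the variance
```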
Random Projection
Independent Component Analysis (ICA)
scikit-learn uses FastICA
Three strategies for working with missing values include:
 We can remove (or “drop”) the rows or columns holding the missing values.
 We can impute the missing values.
 We can build models that work around them, and only use the information provided.
The CRISP-DM Process (Cross-Industry Standard Process for Data Mining)
 Business Understanding
 Data Understanding
 Prepare Data
 Data Modeling
 Evaluate the Results
 Deploy
GitHub README should have:
 Installation instructions
 Project Motivation
 File Descriptions
 How to Interact with your project
 Licensing, Authors, Acknowledgements, etc.
Questions to Ask Yourself When Conducting a Code Review: Is the code clean and modular?
 Can I understand the code easily?
 Does it use meaningful names and whitespace?
 Is there duplicated code?
 Can you provide another layer of abstraction?
 Is each function and module necessary?
 Is each function or module too long?
Is the code efficient?
 Are there loops or other steps we can vectorize?
 Can we use better data structures to optimize any steps?
 Can we shorten the number of calculations needed for any steps?
 Can we use generators or multiprocessing to optimize any steps?
Is documentation effective?
 Are inline comments concise and meaningful?
 Is there complex code that's missing documentation?
 Do functions use effective docstrings?
 Is the necessary project documentation provided?
Is the code well tested?
 Does the code have high test coverage?
 Do tests check for interesting cases?
 Are the tests readable?
 Can the tests be made more efficient?
Is the logging effective?
 Are log messages clear, concise, and professional?
 Do they include all relevant and useful information?
 Do they use the appropriate logging level?
The 3 stages of an NLP pipeline are: Text Processing > Feature Extraction > Modeling.
 Text Processing: Take raw input text, clean it, normalize it, and convert it into a form that is suitable for feature extraction.
 Feature Extraction: Extract and produce feature representations that are appropriate for the type of NLP task you are trying to accomplish and the type of model you are planning to use.
 Modeling: Design a statistical or machine learning model, fit its parameters to training data, use an optimization procedure, and then use it to make predictions about unseen data.
Text data from different sources is prepared with the following text processing steps:
 Cleaning to remove irrelevant items, such as HTML tags
 Normalizing by converting to all lowercase and removing punctuation
 Splitting text into words or tokens
 Removing words that are too common, also known as stop words
 Identifying different parts of speech and named entities
  Part of Speech (POS) tagging and Named Entity Recognition (NER)
  NLTK
 Converting words into their dictionary forms, using lemmatization and then stemming
Bag of Words
 collect all of the unique words in your corpus into a vocabulary
 create a Document-Term Matrix, where each row is a sentence and each column is a word
 each entry in the matrix is a word frequency
 compare sentences in the matrix using cosine similarity
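The four steps above can be sketched in plain numpy; the three toy documents are invented for illustration:

```python
import numpy as np

def bag_of_words(docs):
    # vocabulary = all unique words; rows = documents, columns = word counts
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = np.zeros((len(docs), len(vocab)))
    for r, d in enumerate(docs):
        for w in d.lower().split():
            matrix[r, index[w]] += 1
    return matrix, vocab

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

docs = ["the cat sat", "the cat sat on the mat", "dogs bark"]
M, vocab = bag_of_words(docs)
print(cosine_similarity(M[0], M[1]))  # high: shared words
print(cosine_similarity(M[0], M[2]))  # 0.0: no words in common
```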
$$\cos(\theta) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$
TF-IDF (term frequency × inverse document frequency)
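One common TF-IDF variant sketched in numpy; exact formulas vary by library (scikit-learn's TfidfVectorizer smooths the idf, for instance), and this version uses raw counts and log(N/df) on an invented count matrix:

```python
import numpy as np

def tfidf(count_matrix):
    # tf: raw term counts per document
    # idf: log(N / number of documents containing the term)
    counts = np.asarray(count_matrix, dtype=float)
    n_docs = counts.shape[0]
    df = (counts > 0).sum(axis=0)
    idf = np.log(n_docs / df)
    return counts * idf

counts = np.array([[2, 1, 0],
                   [1, 0, 1],
                   [1, 0, 0]])
print(tfidf(counts))  # the first term appears in every document, so its idf is 0
```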
Word Embeddings
 Word2Vec
 GloVe
 t-SNE
If you have a lot of control over the features, then you have an experiment. If you have no control over the features, then you have an observational study. If you have some control, then you have a quasi-experiment.
In a between-subjects experiment, each unit only participates in, or sees, one of the conditions being used in the experiment. If an individual completes all conditions, rather than just one, this is known as a within-subjects design.
Randomization still has a part in the within-subjects design, in the order in which individuals complete conditions.
In a simple random sample, each individual in the population has an equal chance of being selected. In a stratified random sample, we need to first divide the entire population into disjoint groups, or strata. Then, from each group, you take a simple random sample.
Evaluation metrics are the metrics by which we compare groups. Invariant metrics are metrics that we hope will not be different between groups.
If we aren't able to control all features or there is a lack of equivalence between groups, then we may be susceptible to confounding variables.
Construct validity is tied to how well one's goals are aligned to the evaluation metrics used to evaluate it. Internal validity refers to the degree to which a causal relationship can be derived from an experiment's results. External validity is concerned with the ability of an experimental outcome to be generalized to a broader population.
Biases in experiments are systematic effects that interfere with the interpretation of experimental results, mostly in terms of internal validity.
 Sampling biases are those that cause our observations to not be representative of the population.
 Studies that use surveys to collect data often have to deal with the selfselection bias.
 Survivor bias is one where losses or dropout of observed units is not accounted for in an analysis.
 A novelty effect is one that causes observers to change their behavior simply because they're seeing something new.
 Order bias may occur when the order in which conditions are completed could have an effect on participant responses.
 A primacy effect is one that affects early conditions, perhaps biasing them to be recalled better or to serve as anchor values for later conditions.
 A recency effect is one that affects later conditions, perhaps causing bias due to being fresher in memory or task fatigue.
 Experimenter bias is where the presence or knowledge of the experimenter can affect participants’ behaviors or performance.
 The doubleblind design hides condition information from both the administrator and participant in order to have a strong rein on experimenterbased biases.
# Find p-value (two-sided test) - analytic approach
# assumes `data` is a DataFrame with a 'condition' column (0 = control, 1 = experiment)
import numpy as np
from scipy import stats

n_obs = data.shape[0]
n_control = data.groupby('condition').size()[0]
p = 0.5
sd = np.sqrt(p * (1 - p) * n_obs)
z = ((n_control + 0.5) - p * n_obs) / sd  # +0.5 is a continuity correction
print(z)
print(2 * stats.norm.cdf(z))  # doubles the lower-tail probability (z < 0)
# Find p-value (two-sided test) - simulation approach
n_obs = data.shape[0]
n_control = data.groupby('condition').size()[0]
p = 0.5
n_trials = 200_000
samples = np.random.binomial(n_obs, p, n_trials)
print(np.logical_or(samples <= n_control, samples >= (n_obs - n_control)).mean())
# Perform hypothesis test (one-sided test) - analytic approach
# assumes `data` also has a binary 'click' column
p_click = data.groupby('condition').mean()['click']
n_control = data.groupby('condition').size()[0]
n_exper = data.groupby('condition').size()[1]
p_null = data['click'].mean()
se_p = np.sqrt(p_null * (1 - p_null) * (1 / n_control + 1 / n_exper))
z = (p_click[1] - p_click[0]) / se_p
print(z)
print(1 - stats.norm.cdf(z))
# Perform hypothesis test (one-sided test) - simulation approach
n_control = data.groupby('condition').size()[0]
n_exper = data.groupby('condition').size()[1]
p_null = data['click'].mean()
n_trials = 200_000
ctrl_clicks = np.random.binomial(n_control, p_null, n_trials)
exp_clicks = np.random.binomial(n_exper, p_null, n_trials)
samples = exp_clicks / n_exper - ctrl_clicks / n_control
print((samples >= (p_click[1] - p_click[0])).mean())
SMART Experiment Design
 Specific: Make sure the goals of your experiment are specific.
 Measurable: Outcomes must be measurable using objective metrics.
 Achievable: The steps taken for the experiment and the goals must be realistic.
 Relevant: The experiment needs to have purpose behind it.
 Timely: Results must be obtainable in a reasonable time frame.