# Data Science

## Ascertainment Bias

Ascertainment bias is a systematic difference in how individuals are identified for a study, or in the data collected about them. It distorts the measured frequency of a phenomenon relative to its true frequency in the population. “When the chance of a person being sampled, or feature being observed, depends on some background…

## Bootstrapping

Bootstrapping is a popular resampling method that repeatedly draws samples, with replacement, from the observed data. It assigns measures of accuracy to sample estimates and allows the estimation of the sampling distribution of nearly any statistic. “A way of generating confidence intervals and the distribution of test statistics through sampling the observed data rather than through assuming…
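To make the resampling idea concrete, here is a minimal sketch of a percentile bootstrap confidence interval for the mean, using only the Python standard library. The helper name `bootstrap_ci`, the sample values, and the choice of 2000 resamples are illustrative assumptions, not something prescribed by the text above.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    # Draw n_resamples samples of size n WITH replacement and record the statistic.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    # Take the alpha/2 and 1 - alpha/2 percentiles of the bootstrap distribution.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Made-up observations, for illustration only.
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 3.1]
low, high = bootstrap_ci(sample)
print(f"95% CI for the mean: [{low:.2f}, {high:.2f}]")
```

Passing a different `stat` (for example `statistics.median`) reuses the same procedure, which is what makes the method applicable to nearly any statistic.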

## Confusion Matrix

What is a confusion matrix? The confusion matrix (or “error matrix”) is a table that describes the performance of a classification model by comparing its predictions to a data set whose true values are known. In a binary classification task, the confusion matrix is a…
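As a rough illustration of comparing predictions with known true values, the sketch below counts the four cells of a binary confusion matrix. It assumes labels are encoded as 1 (positive) and 0 (negative); the function name and the example labels are made up.

```python
def binary_confusion_matrix(y_true, y_pred):
    """Count TP, FP, FN, TN for labels encoded as 1 (positive) / 0 (negative)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            counts["TP"] += 1  # predicted positive, actually positive
        elif pred == 1 and truth == 0:
            counts["FP"] += 1  # predicted positive, actually negative
        elif pred == 0 and truth == 1:
            counts["FN"] += 1  # predicted negative, actually positive
        else:
            counts["TN"] += 1  # predicted negative, actually negative
    return counts


# Made-up labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
cm = binary_confusion_matrix(y_true, y_pred)
print(cm)  # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
```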

## Data Leakage

What is data leakage? Within the field of machine learning, data leakage is a term used to describe how data from outside the training data set is used to create the model. This is a problem because, within machine learning, our goal is to develop a model that is good…

## Data Shift

What is data shift? Data shift (also called dataset shift, model drift, or data drift) is a change over time in the input data your model receives, relative to the data it was trained on. It is one of the most common reasons for degrading model accuracy. That’s why…

## Linear Regression

What is linear regression? Linear regression is a linear approach to modeling the relationship between a dependent variable and one or more explanatory (independent) variables. We can distinguish between simple linear regression, which has one explanatory variable, and multiple linear regression, which has multiple explanatory variables. In…
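For the simple case with one explanatory variable, a least-squares fit can be sketched in a few lines. This assumes ordinary least squares as the fitting method, which the excerpt does not state explicitly; the function name and data are illustrative.

```python
def fit_simple_linear_regression(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept (one explanatory variable)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = sum of co-deviations of x and y / sum of squared deviations of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data lying exactly on the line y = 2x + 1.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
slope, intercept = fit_simple_linear_regression(x, y)
print(slope, intercept)  # 2.0 1.0
```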

## Performance Metrics in Machine Learning

Performance metrics tell you something about the performance of a machine learning model. Each metric has a specific focus. Because of the nature of the confusion matrix, many metrics have a close sibling. Equally confusing is that many performance metrics have multiple synonyms, depending on the context. Given…

## Performance Metrics: Accuracy

What is accuracy? Accuracy is a performance metric that tells you the fraction of predictions that were correct, without distinguishing between positive and negative predictions. Accuracy can be a very misleading metric when the data set is imbalanced (when the prevalence is either very high or very…
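The sketch below shows why accuracy can mislead on an imbalanced data set. It computes accuracy from confusion-matrix counts for a hypothetical model that always predicts the negative class; the counts (10 positives, 990 negatives) are invented for illustration.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical data set: 10 positives, 990 negatives.
# A model that always predicts "negative" finds no positive case at all ...
always_negative = accuracy(tp=0, fp=0, fn=10, tn=990)
print(always_negative)  # 0.99, even though every positive case was missed
```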

## Performance Metrics: Balanced Accuracy

What is Balanced Accuracy? Balanced Accuracy is a performance metric to evaluate a binary classifier. Why not use regular accuracy? Balanced accuracy is a better instrument for assessing models that are trained on data with a very imbalanced target variable, that is, with very high or very low prevalence. This will result in…
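A minimal sketch, assuming the common definition of balanced accuracy as the mean of the true positive rate and the true negative rate. The counts are invented and describe a hypothetical model that always predicts the negative class on a data set with 10 positives and 990 negatives.

```python
def balanced_accuracy(tp, fp, fn, tn):
    """Mean of the true positive rate (TPR) and the true negative rate (TNR)."""
    tpr = tp / (tp + fn)  # share of actual positives that were found
    tnr = tn / (tn + fp)  # share of actual negatives that were found
    return (tpr + tnr) / 2


# Plain accuracy for these counts would be 0.99; balanced accuracy exposes
# that the positive class is never detected.
score = balanced_accuracy(tp=0, fp=0, fn=10, tn=990)
print(score)  # 0.5
```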

## Performance Metrics: Classification Success Index

What is the Classification Success Index? The Classification Success Index (CSI) is a (fairly uncommon) measure for evaluating classifiers. The CSI focuses exclusively on the positive class. It is calculated as follows: CSI = 1 - ((1-PPV) + (1-TPR)) = PPV + TPR - 1. The terms (1-PPV) and (1-TPR) correspond to the proportions of type I and type II errors. The measure…
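A small sketch of the calculation, assuming the usual definition CSI = PPV + TPR - 1, which is equivalent to 1 minus the sum of (1-PPV) and (1-TPR). The function name and counts are illustrative. Note that true negatives do not appear anywhere, which reflects the exclusive focus on the positive class.

```python
def classification_success_index(tp, fp, fn):
    """CSI = PPV + TPR - 1, assuming the usual definition; true negatives are not used."""
    ppv = tp / (tp + fp)  # positive predictive value (precision)
    tpr = tp / (tp + fn)  # true positive rate (recall)
    return ppv + tpr - 1


# Made-up counts: 3 true positives, 1 false positive, 1 false negative.
csi = classification_success_index(tp=3, fp=1, fn=1)
print(csi)  # 0.5
```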