Data Science

Ascertainment Bias

by roelpi
August 22, 2020

Ascertainment bias is a systematic difference in how individuals are identified for a study, or in how their data are collected. It distorts the measured frequency of a phenomenon relative to its true frequency in the population. “When the chance of a person being sampled, or feature being observed, depends on some background…

Bootstrapping

by roelpi
August 22, 2020 (updated December 12, 2020)

Bootstrapping is a popular resampling method that draws repeated samples from the observed data with replacement. It is used to assign measures of accuracy to sample estimates, and it allows the sampling distribution of nearly any statistic to be estimated. “A way of generating confidence intervals and the distribution of test statistics through sampling the observed data rather than through assuming…
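To make the idea concrete, here is a minimal sketch of a percentile bootstrap confidence interval for the mean, using only the Python standard library. The function name `bootstrap_ci`, the sample values, and the defaults (2000 resamples, a fixed seed) are assumptions chosen for illustration, not part of the original post.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` (illustrative sketch)."""
    rng = random.Random(seed)
    n = len(data)
    # Each resample draws n observations from the data WITH replacement.
    estimates = sorted(
        stat(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    # Take the empirical alpha/2 and 1 - alpha/2 quantiles of the estimates.
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical sample, for illustration only.
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 3.1]
low, high = bootstrap_ci(sample)
print(f"95% bootstrap CI for the mean: ({low:.2f}, {high:.2f})")
```

Swapping `statistics.mean` for `statistics.median` (or any other function of a list) gives an interval for that statistic instead, which is the main appeal of the method.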

Confusion Matrix

by roelpi
December 14, 2020

What is a confusion matrix? The confusion matrix (or “error matrix”) is a table that describes the performance of a classification model by comparing its predictions to a data set whose true values are known. In a binary classification task, the confusion matrix is a…
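For the binary case, the four cells of the table can be counted directly. The sketch below assumes labels are encoded as 1 (positive) and 0 (negative); the function name and the example labels are made up for illustration.

```python
def confusion_matrix(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            # Predicted positive: correct if the truth is positive too.
            counts["TP" if truth == 1 else "FP"] += 1
        else:
            # Predicted negative: a miss if the truth was actually positive.
            counts["FN" if truth == 1 else "TN"] += 1
    return counts

# Hypothetical true labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
cm = confusion_matrix(y_true, y_pred)
print(cm)  # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
```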

Data Leakage

by roelpi
September 27, 2020

What is data leakage? In machine learning, data leakage describes the situation where data from outside the training data set is used to create the model. This is a problem because our goal is to develop a model that is good…
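One common way this happens is in preprocessing. The sketch below shows a single, assumed scenario (feature scaling), with made-up numbers and hypothetical helper names: computing the scaling statistics on all the data lets the held-out set influence the features, while fitting them on the training split alone does not.

```python
import statistics

def fit_scaler(values):
    """Learn the mean and standard deviation from the given values only."""
    return statistics.mean(values), statistics.stdev(values)

def transform(values, mean, std):
    """Standardize values with previously learned statistics."""
    return [(v - mean) / std for v in values]

# Hypothetical feature values.
train = [10.0, 12.0, 11.0, 13.0]
test = [30.0, 32.0]  # held-out data that must not shape the model

# Leaky: the statistics are computed on train + test together.
leaky_mean, leaky_std = fit_scaler(train + test)

# Safe: the statistics come from the training split and are reused for test.
mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)

print(f"leaky mean: {leaky_mean}, train-only mean: {mean}")
```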

Data Shift

by roelpi
November 23, 2020 (updated October 4, 2022)

What is Data Shift? Data shift (also called dataset shift, model drift, or data drift) is a change, usually over time, in the input data your model receives relative to the data it was trained on. It is one of the most common reasons for degrading model accuracy. That’s why…

Linear Regression

by roelpi
August 20, 2020

What is linear regression? Linear regression is a linear approach to modeling the relationship between a dependent variable and one or more explanatory (independent) variables. We can distinguish between simple linear regression, which has one explanatory variable, and multiple linear regression, which has multiple explanatory variables. In…
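As a small illustration of the simple case (one explanatory variable), the ordinary least squares slope and intercept can be computed by hand. This is a sketch with a hypothetical function name and toy data that lie exactly on a line, not a replacement for a statistics library.

```python
def fit_simple_linear_regression(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data generated from y = 2x + 1, so the fit should recover those values.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]
slope, intercept = fit_simple_linear_regression(x, y)
print(slope, intercept)  # 2.0 1.0
```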

Performance Metrics in Machine Learning

by roelpi
December 14, 2020 (updated December 28, 2020)

Performance metrics tell you how well a machine learning model performs. Each metric has a specific focus. Because of the nature of the confusion matrix, many metrics have a close sibling. Equally confusing is that many performance metrics have multiple synonyms, depending on the context. Given…

Performance Metrics: Accuracy

by roelpi
December 17, 2020 (updated December 30, 2020)

What is accuracy? Accuracy is a performance metric that tells you the fraction of predictions that were correct, without distinguishing between positive and negative predictions. Accuracy can be a very misleading metric when the data set is imbalanced (when the prevalence is either very high or very…
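A quick sketch shows why accuracy can mislead. The data here are invented: 5 positives among 100 cases, scored against a trivial "model" that always predicts the negative class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical imbalanced data: 5% prevalence.
y_true = [1] * 5 + [0] * 95
# A useless classifier that never predicts the positive class.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))  # 0.95
```

The classifier never finds a single positive case, yet it scores 95% accuracy simply because negatives dominate the data.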

Performance Metrics: Balanced Accuracy

by roelpi
December 18, 2020 (updated January 2, 2021)

What is Balanced Accuracy? Balanced Accuracy is a performance metric for evaluating a binary classifier. Why not use regular accuracy? Balanced accuracy is a better instrument for assessing models that are trained on data with a very imbalanced target variable, i.e. very high or very low prevalence. This will result in…
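Balanced accuracy is the mean of the true positive rate and the true negative rate. The sketch below uses invented data (5 positives among 100 cases, and a classifier that always predicts negative) and assumes both classes occur in the true labels, since otherwise one of the rates is undefined.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (TPR) and specificity (TNR) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    tpr = tp / (tp + fn)  # share of actual positives that were found
    tnr = tn / (tn + fp)  # share of actual negatives that were found
    return (tpr + tnr) / 2

# Hypothetical imbalanced data: 5% prevalence, always-negative predictions.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100

print(balanced_accuracy(y_true, y_pred))  # 0.5
```

On these data regular accuracy would be 0.95, while balanced accuracy is 0.5, the same score a coin flip would earn.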

Performance Metrics: Classification Success Index

by roelpi
December 28, 2020

What is the Classification Success Index? The Classification Success Index (CSI) is a (fairly uncommon) measure for evaluating classifiers. The CSI focuses exclusively on the positive class. It is calculated as follows: The terms (1-PPV) and (1-TPR) correspond to the proportions of type I and type II errors. The measure…
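As a rough sketch, assume the definition CSI = 1 - ((1 - PPV) + (1 - TPR)) = PPV + TPR - 1. This formula is an assumption here, chosen because it matches the two error terms named above; the function name and the counts are hypothetical.

```python
def classification_success_index(tp, fp, fn):
    """CSI under the ASSUMED definition PPV + TPR - 1 (positive class only)."""
    ppv = tp / (tp + fp)  # precision: share of positive predictions that are right
    tpr = tp / (tp + fn)  # recall: share of actual positives that are found
    return ppv + tpr - 1

# Hypothetical counts: 3 true positives, 1 false positive, 1 false negative.
csi = classification_success_index(tp=3, fp=1, fn=1)
print(csi)  # 0.5
```

Note that true negatives never enter the calculation, which is what focusing exclusively on the positive class means in practice.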
