The many ways to calculate the mode in Python

I was curious how many ways there are to calculate the mode of a 1-D NumPy array in Python. Quite a lot, apparently. All roads lead to Rome, but some will take you there faster. That goes for many things in computer science, including this seemingly trivial question.

Let’s load the packages I will be using throughout this blog post, and create a simple array with dummy data where the mode is clearly 1.

import statistics
from collections import Counter

import numpy as np
from scipy import stats

a = np.array([1,2,3,4,1,3,1,5,1,6])
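The timings in this post come from IPython’s %timeit magic, which only works in IPython or a notebook. Outside of those, a rough equivalent can be sketched with the standard library’s timeit module. The helper name time_call and its defaults are my own assumptions, and it reports the best run rather than the mean that %timeit prints, so the numbers will not match exactly.

```python
import timeit

import numpy as np

a = np.array([1, 2, 3, 4, 1, 3, 1, 5, 1, 6])


def time_call(func, repeat=5, number=10_000):
    """Hypothetical helper: best average time per call, in microseconds."""
    # timeit.repeat returns the total seconds for `number` calls, once per repeat.
    totals = timeit.repeat(func, repeat=repeat, number=number)
    return min(totals) / number * 1e6


# Time any zero-argument callable, here a throwaway np.unique call.
elapsed_us = time_call(lambda: np.unique(a, return_counts=True), number=1_000)
print(f"{elapsed_us:.1f} µs per call")
```

Wrapping each candidate in a lambda keeps the helper independent of which mode function is being measured.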

SciPy’s mode()

What triggered this quest to find the fastest mode function was that SciPy seemed to be extremely slow at it. The mean run time was 112 µs for this array, but it took more than a couple of seconds on my real data set.

%timeit stats.mode(a).mode[0]
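One caveat about the line above: newer SciPy releases changed the default of keepdims, so for a 1-D input .mode is a scalar rather than a length-1 array, and .mode[0] can raise an IndexError there. A minimal sketch that should work on both older and newer versions, assuming only that the result exposes a .mode attribute (the wrapper name scipy_mode is hypothetical):

```python
import numpy as np
from scipy import stats

a = np.array([1, 2, 3, 4, 1, 3, 1, 5, 1, 6])


def scipy_mode(x):
    """Hypothetical wrapper: return the mode of a 1-D array as a NumPy scalar."""
    result = stats.mode(x)
    # Depending on the SciPy version, result.mode is a length-1 array or a scalar;
    # np.atleast_1d normalises both cases to a 1-D array.
    return np.atleast_1d(result.mode)[0]


print(scipy_mode(a))
```

The extra np.atleast_1d call adds a little overhead, so timings of this wrapper are not directly comparable to the 112 µs above.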

SciPy’s find_repeats()

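Okay, 112 µs. Can we go faster? Sure we can! SciPy also has a find_repeats function, which finds the values that occur more than once and returns them in sorted order together with their counts. This is way faster: 22 µs. That’s surprising, because find_repeats casts the values to float, so I manually cast the result back to int. Note that [0][0] picks the smallest repeated value, which is the mode here only because 1 happens to be both the smallest and the most common repeat; in general, index the values with the argmax of the counts.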

%timeit int(stats.find_repeats(a)[0][0])
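To see why taking the first repeated value is not the same as taking the mode, here is a small sketch on a made-up array. It emulates the output described above (sorted repeated values plus their counts) with np.unique, so that it does not depend on SciPy; the variable names are my own.

```python
import numpy as np

# Made-up counterexample: 2 occurs twice, 7 occurs three times.
b = np.array([7, 2, 7, 2, 7, 9])

# Emulate find_repeats' output: sorted values that occur more than once, plus counts.
values, counts = np.unique(b, return_counts=True)
repeated = counts > 1
rep_values, rep_counts = values[repeated], counts[repeated]

# What the [0][0] shortcut picks: the smallest repeated value.
first_repeat = int(rep_values[0])
# The actual mode: the repeated value with the highest count.
most_common = int(rep_values[rep_counts.argmax()])

print(first_repeat, most_common)  # 2 7
```

If no value repeats at all, rep_counts is empty and argmax raises a ValueError, so real code would need to handle that case.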

NumPy’s unique() & argmax()

I created a lambda function that takes the unique values of an array and their respective counts, as returned by np.unique() with return_counts=True. It takes the argmax() of the counts and uses that index to look up the corresponding value. Surprisingly, this takes only 18 µs.

mymode = lambda x : x[0][x[1].argmax()]
%timeit mymode(np.unique(a, return_counts=True))
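The same idea can be written as a regular function, which also makes the tie-breaking behaviour easier to document. This is just a sketch; the name unique_mode is my own.

```python
import numpy as np


def unique_mode(x):
    """Return the most frequent value of a 1-D array (smallest value wins ties)."""
    values, counts = np.unique(x, return_counts=True)
    # np.unique sorts the values and argmax returns the first maximum,
    # so among equally common values the smallest one is returned.
    return values[counts.argmax()]


a = np.array([1, 2, 3, 4, 1, 3, 1, 5, 1, 6])
print(unique_mode(a))                       # 1
print(unique_mode(np.array([4, 4, 2, 2])))  # 2 (tie between 2 and 4)
```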

Statistics’ mode()

We can also try the “boring” statistics module from the standard library, which has a mode() function. Surprisingly, it’s more than twice as fast as the previous solution: 8 µs.

%timeit statistics.mode(a)

Watch out: on Python versions before 3.8, you’ll run into an error if there is no unique mode, e.g. when two values are equally common.

StatisticsError: no unique mode; found 2 equally common values
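A quick way to check how your interpreter handles ties, sketched on made-up data. From Python 3.8 onwards, statistics.mode() returns the first mode it encounters instead of raising, and statistics.multimode() returns all of them.

```python
import statistics
import sys

data = [1, 1, 2, 2, 3]  # made-up data with two equally common values

try:
    # Python 3.8+: returns the first mode encountered.
    # Older versions: raises StatisticsError for multimodal data.
    print(statistics.mode(data))
except statistics.StatisticsError as exc:
    print(f"no unique mode: {exc}")

# statistics.multimode (added in Python 3.8) returns every mode as a list,
# in the order the values were first encountered.
if sys.version_info >= (3, 8):
    modes = statistics.multimode(data)
    print(modes)
```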

Counter()

Here is a cool solution that uses Counter() from collections. It’s a dictionary subclass in which elements are stored as dictionary keys and their counts as dictionary values. That’s handy, because those counts are one step closer to the mode.

This solution is ridiculously faster than SciPy’s mode(): just north of 6 µs. That’s more than 17 times faster than what we started with.

%timeit Counter(a).most_common()[0][0]
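A slightly more explicit variant, as a sketch: most_common() without an argument sorts every distinct value, while most_common(1) only asks for the top entry. Converting the array with tolist() first is my own addition; it makes the keys plain Python ints rather than NumPy scalars.

```python
from collections import Counter

import numpy as np

a = np.array([1, 2, 3, 4, 1, 3, 1, 5, 1, 6])

# tolist() converts the NumPy scalars to plain Python ints before counting.
counts = Counter(a.tolist())

# most_common(1) returns a one-element list of (value, count) pairs.
value, count = counts.most_common(1)[0]
print(value, count)  # 1 4
```

I have not timed this variant, so the 6 µs above applies only to the original one-liner.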

There we go. From now on, we don’t use SciPy’s mode() for a 1-D array. Another problem solved!

