Recently I had to read in a folder full of large Excel (.xlsx) files. As usual, I used the xlsx package. However, reading in the largest files produced an error: “Error in .jcall(“RJavaTools”, “Ljava/lang/Object;”, “invokeMethod”, cl, : java.lang.OutOfMemoryError: Java heap space”. I don’t know what cause… Read More »Error in .jcall(“RJavaTools”…) when importing large xlsx files in R
It happened too many times before I finally wrote this blog post. Often, when I read Excel (xls or xlsx) files into R, I encounter a strange phenomenon where dates are converted to a five-digit integer. Here is how to fix it. For example, 02/01/2017 (dd/MM/yyyy) would be converted to… Read More »Date conversion from Excel to R
Something that took me a while to do properly in ggplot2 is adding a percent sign as a suffix to the tick labels, controlling the number of decimals, and still being able to set the axis limits. I’ll show an example using the iris data set. Let’s… Read More »Add percentages to your axes in R’s ggplot2 (and set the limits)
In this blog post, I explain why a certain error arises in BigQuery and how you can get rid of it. Although I abandoned the comma join syntax a while ago, I still use it within the context of arrays in Google BigQuery (all the cool kids… Read More »Solve ‘RIGHT JOIN must be parenthesized when following a comma join’ in BigQuery
‘Outnumbered’ is a book that I have been expecting for the past couple of years. Its premise: this whole algorithm, data science and AI revolution that people talk about at cocktail parties seems amazing and overwhelming, but under the hood it is rather ‘meh’. The author, David Sumpter, is a… Read More »“Outnumbered”: are algorithms just ‘meh’?
In the book Outnumbered, author David Sumpter goes looking for the limits of the algorithm hype. In the chapter Impossibly unbiased, he describes how algorithms can make mistakes. His search led him to the justice system in the US. In 2016, for example, ProPublica published an article that struck a nerve among data… Read More »Over racistische machines (On racist machines)
When you are clustering, what you are actually trying to do is find groups of objects such that the objects within a group are similar to one another and different from the objects in other groups. In other words, you want to minimize the intra-cluster distance and maximize the inter-cluster distance. Clustering algorithms… Read More »Optimizing the number of clusters using Tibshirani’s gap statistic
Historically, browsers have had a great deal of control over the online experience of end users. Since their genesis, several different browsers have competed for dominant market share. With the arrival of Intelligent Tracking Prevention (ITP), it seems that big tech is now using browser standards to target each other’s… Read More »An almost complete overview of Apple WebKit’s Intelligent Tracking Prevention
Another management book? Not just a management book, but the story of Satya Nadella. Who? The son of a Marxist economist and a drama professor, but mostly known as Microsoft’s less famous, less rich, but current CEO. A brief review. Nadella’s book ‘Hit Refresh’ is a triptych. The first part… Read More »“Hit Refresh”: Or how one man’s personal life helped Microsoft rediscover its soul
Although every statistics book will tell you not to go looking for statistical significance, sadly that is still what happens in many analyses and in much scientific research. What is often forgotten is checking for statistical power. Here’s a refresher, and how to do it in R. Remember statistical significance? A finding is statistically… Read More »Statistical power, it matters. Even in R.