In this blog post, I elaborate on setting axis limits in a plot generated by ggplot2. There are two ways: one where you pretend the data outside the limits doesn’t exist (using lims, xlim or ylim), and one where you respect that the data outside the limits exists (using coord_cartesian).
The documentation for the lims, xlim and ylim functions states the following about values outside the limits:
This is a shortcut for supplying the limits argument to the individual scales. Note that, by default, any values outside the limits will be replaced with NA.
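To make the "replaced with NA" part concrete, here is a minimal sketch using censor() from the scales package, which the ggplot2 documentation names as the default out-of-bounds handler for continuous scales. The vector x and the limits are made-up illustration values, not data from this post.

```r
library(scales)

# Hypothetical values, two of which fall outside the limits c(-2, 7)
x <- c(-5, 0, 3, 7, 9.5)

# censor() turns out-of-range values into NA instead of clipping them;
# values exactly on a limit are kept
censored <- censor(x, range = c(-2, 7))
print(censored)
```

Layers that cannot work with NA then drop those rows, which is why setting limits on a scale changes what a statistic such as a smoother is computed on.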
And this is what the documentation says about coord_cartesian:
Setting limits on the coordinate system will zoom the plot (like you’re looking at it with a magnifying glass), and will not change the underlying data like setting limits on a scale will.
Hadley Wickham, one of the most important figures in the R community, wrote about it in his book:
Here’s an example. First, we create some dummy data, an X and a Y that are closely correlated. We also add some outliers to the data. Lastly, we plot it, without setting any limits on the axes.
library(ggplot2)
library(data.table)
set.seed(10)
normal_data_x <- rnorm(100, 3, 2)
normal_data_y <- normal_data_x + runif(100, -2, 2)
outliers_x <- runif(25, 8, 10)
outliers_y <- outliers_x ^ runif(25, 1, 2)
d <- data.table(x = c(normal_data_x, outliers_x), y = c(normal_data_y, outliers_y))
ggplot(d, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = 'lm')
This is what the data looks like: two strongly correlated series where X is smaller than 7, with the outliers on the right. I also added a linear smoother to demonstrate my point later on. What we see:
- All the data is visible, even the outliers.
- The smoother is based on all the data, even the outliers.
We can limit our X and Y axes using the xlim and ylim functions as follows.
ggplot(d, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  xlim(-2, 7) + ylim(-1, 12)
We now observe:
- The outliers are no longer visible.
- The smoother is based on the data without the outliers.
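In effect, this amounts to fitting the regression only on the rows that fall inside both limits. The sketch below mimics that with plain lm() on a data.frame copy of the dummy data. The names inside, fit_all and fit_inside are mine, and this is an approximation of what geom_smooth ends up doing, not its actual code.

```r
# Same dummy data as above, as a plain data.frame
set.seed(10)
normal_data_x <- rnorm(100, 3, 2)
normal_data_y <- normal_data_x + runif(100, -2, 2)
outliers_x <- runif(25, 8, 10)
outliers_y <- outliers_x ^ runif(25, 1, 2)
d <- data.frame(x = c(normal_data_x, outliers_x),
                y = c(normal_data_y, outliers_y))

# Rows that survive xlim(-2, 7) + ylim(-1, 12): both x and y must be in range
inside <- d$x >= -2 & d$x <= 7 & d$y >= -1 & d$y <= 12

fit_all <- lm(y ~ x, data = d)               # every row, outliers included
fit_inside <- lm(y ~ x, data = d[inside, ])  # only rows within the limits

# How many rows were dropped, and how the slope changes
print(sum(!inside))
print(c(all = coef(fit_all)[["x"]], inside = coef(fit_inside)[["x"]]))
```

Note that the filter is not limited to the outliers: any regular point that happens to fall outside the limits is dropped as well.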
Finally, we can limit our X and Y axes using the coord_cartesian function.
ggplot(d, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  coord_cartesian(xlim = c(-2, 7), ylim = c(-1, 12))
As you can see, now:
- Once again, we no longer observe our outliers.
- However, we respect that outliers exist and the smoother is based on all the data.
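If you don't want to trust your eyes, you can check the difference numerically. The sketch below pulls the fitted line out of both plots with layer_data() and recovers its slope. The helper smooth_slope and the object names are hypothetical, and I pass formula = y ~ x explicitly only to silence the message geom_smooth would otherwise print.

```r
library(ggplot2)

# Same dummy data as above, as a plain data.frame
set.seed(10)
normal_data_x <- rnorm(100, 3, 2)
normal_data_y <- normal_data_x + runif(100, -2, 2)
outliers_x <- runif(25, 8, 10)
outliers_y <- outliers_x ^ runif(25, 1, 2)
d <- data.frame(x = c(normal_data_x, outliers_x),
                y = c(normal_data_y, outliers_y))

base_plot <- ggplot(d, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

p_scale <- base_plot + xlim(-2, 7) + ylim(-1, 12)
p_coord <- base_plot + coord_cartesian(xlim = c(-2, 7), ylim = c(-1, 12))

# Layer 2 is geom_smooth(); its data holds the points of the fitted line.
# A line through those points has the slope of the smoother in the plot.
smooth_slope <- function(p) {
  fitted_line <- suppressWarnings(layer_data(p, 2))  # hide "removed rows" warnings
  coef(lm(y ~ x, data = fitted_line))[["x"]]
}

slope_scale <- smooth_slope(p_scale)  # limits set on the scales
slope_coord <- smooth_slope(p_coord)  # limits set on the coordinate system
slope_full <- coef(lm(y ~ x, data = d))[["x"]]  # plain fit on all rows

print(c(scale = slope_scale, coord = slope_coord, full = slope_full))
```

The coord_cartesian slope should match the plain lm() fit on all rows, while the xlim/ylim slope should be flatter because the outliers never reach the smoother.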
By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction to R, and less to the data science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!
Great success!