In R, it often happens that I need to calculate the share of each column by row. In this very simple example I would like to update the following table:
apples | oranges | bananas | pineapples |
2 | 4 | 6 | 3 |
1 | 0 | 9 | 2 |
5 | 1 | 2 | 3 |
and I would like it to become:
apples | oranges | bananas | pineapples |
0.13 | 0.27 | 0.4 | 0.2 |
0.08 | 0 | 0.75 | 0.17 |
0.45 | 0.09 | 0.18 | 0.27 |
As you can see, every cell now contains the share its absolute value accounts for in the row. Using data.table, there is an easy way to do this.
# Row totals, computed once up front
rsums <- rowSums(fruit)
# Divide every column by the row totals
fruit <- fruit[,lapply(.SD,function(x) {x / rsums})]
rm(rsums)
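If you want something you can paste straight into a session, here is a minimal sketch that builds the example table first. It assumes data.table is installed and that every column in fruit is numeric; the name shares is only used for illustration, so the original counts stay around for comparison.

```r
library(data.table)

# The example table from above, typed in by hand
fruit <- data.table(
  apples     = c(2, 1, 5),
  oranges    = c(4, 0, 1),
  bananas    = c(6, 9, 2),
  pineapples = c(3, 2, 3)
)

# Row totals, computed once before anything is overwritten
rsums <- rowSums(fruit)

# Divide every column by the row totals (illustrative name: shares)
shares <- fruit[, lapply(.SD, function(x) x / rsums)]

# Sanity check: every row of shares should add up to 1
stopifnot(all(abs(rowSums(shares) - 1) < 1e-9))

print(shares)
```

Keep in mind that a row summing to zero would give NaN here, so you may want to handle that case separately.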
In the data.table package, .SD stands for “subset of the data”. By doing a lapply over .SD without specifying .SDcols, we apply the function to all the columns. However, if you only want to apply the function to apples and oranges, you can pass those columns via .SDcols. Note that rsums is still the total over all columns, so the shares remain relative to the whole row:
rsums <- rowSums(fruit)
# .SDcols is case-sensitive (lowercase "cols")
fruit <- fruit[,lapply(.SD,function(x) {x / rsums}), .SDcols = c('apples','oranges')]
rm(rsums)
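One thing to be aware of: a lapply over .SD with .SDcols returns only the selected columns. If you would rather keep the other columns as they are, one option is to update by reference with :=. This is a sketch under the same assumptions as before; cols is a name I made up for the illustration.

```r
library(data.table)

# The example table again, so this snippet runs on its own
fruit <- data.table(
  apples     = c(2, 1, 5),
  oranges    = c(4, 0, 1),
  bananas    = c(6, 9, 2),
  pineapples = c(3, 2, 3)
)

cols  <- c("apples", "oranges")  # hypothetical helper: columns to convert
rsums <- rowSums(fruit)          # totals over all four columns

# Overwrite only the selected columns in place;
# bananas and pineapples keep their original counts
fruit[, (cols) := lapply(.SD, function(x) x / rsums), .SDcols = cols]

print(fruit)
```

Because := modifies fruit in place, there is no need to assign the result back to fruit.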
By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which I consider a must-have data science reference. If you simply need an introduction to R, and less to the data science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!
Good luck!