Thursday, May 21, 2015

You are never too young (or too old) to be thinking about data visualization!

Over the last couple of weeks, I have been working with my son on his first science fair project. It has been lots of fun working with him to develop a question and a prediction, and to design a meaningful experiment. Collecting the data was also great - as you can (hopefully) see in the pictures below, we were investigating how rubber balls bounced at different temperatures. When it came time to "write" up his results, we decided that the best approach would be to plot all his data (using stickers to represent the height of the first bounce of balls dropped from a height of 2m). While that may seem obvious, many scientists have a great aversion to plotting raw data. A "professional" version of this experiment would be far more likely to present its results using bar plots of mean values (with either SE, SD or 95% CI error bars). This is very unfortunate, as bar plots forsake a great deal of useful visual information about the distribution of the data. By coincidence, this was also (one of) the take-home message(s) of a recent paper:

Weissgerber, T. L., Milic, N. M., Winham, S. J., & Garovic, V. D. (2015). Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biology: e1002128.

that we had chosen to read in this week's Long Lab journal club. Overall, I thought the authors did a commendable job, and it is evident from the paper's metadata that their message is reaching a large audience. While I am in favour of anything that turns the tide against bar plots, I do wish they had given boxplots as much publicity as the univariate scatterplots that were heavily featured in the manuscript. I suspect this is because the literature they were surveying (physiology) tended to have small sample sizes. According to the authors, "the minimum and maximum sample sizes for any group show in a figure ... were 4... and 10 respectively". These results are presented in panel C of supplemental figure S2.*

I have nothing against univariate scatterplots. In fact, for small sample sizes (say <30 elements/group), directly plotting the data reveals a great deal about its distribution. However, after a certain point the usefulness of this approach starts to wane, as more and more points overlap. In such cases, a box-plot is a more desirable solution. Not only is it just as aesthetic, but it also clearly conveys meaningful visual information to the reader about the centrality, the skew and the distribution of the data. *I suspect that is why, when Weissgerber et al. presented the data from the hundreds of figures they surveyed, they did so using a box-plot.
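
To make the point about bar plots a little more concrete, here is a minimal Python sketch comparing the numbers a bar plot would show (mean and SE) with the numbers a box-plot would show (minimum, quartiles, maximum). The two samples are made up purely for illustration; they are not the science fair data, nor anything from the paper. The function names `bar_summary` and `box_summary` are likewise just placeholders.

```python
import math
import statistics

# Hypothetical, made-up samples (NOT real data): both have a mean of 5,
# but one is roughly symmetric and the other is right-skewed.
symmetric = [2, 4, 4, 5, 5, 5, 6, 6, 8]
skewed = [3, 3, 3, 3, 4, 4, 5, 8, 12]


def bar_summary(values):
    """What a typical bar plot encodes: the mean and its standard error."""
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean, se


def box_summary(values):
    """What a box-plot encodes: min, lower quartile, median, upper quartile, max."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


for name, sample in [("symmetric", symmetric), ("skewed", skewed)]:
    mean, se = bar_summary(sample)
    print(f"{name:>9}  bar: mean={mean:.2f} SE={se:.2f}  box: {box_summary(sample)}")
```

Both bars would be drawn at exactly the same height, whereas the five box-plot numbers show that in the skewed sample the median sits below the mean and the upper tail is much longer than the lower one. That is the kind of information about skew and distribution a reader cannot recover from a bar.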

"let me tell you about the wonders of data visualization"

