Do you speak data?
November 26, 2015
Here at Panaseer Labs, we spend a lot of time working out how best to communicate what data is telling us. We are big advocates of data-driven decision making, but what does this actually entail? Although data itself can’t lie, the way it is gathered, analysed or presented can mislead. This could be intentional (the old “50% of statistics are made up on the spot” scenario) or because the audience misinterpret the findings (which is not their fault!). Data scientists must remember that data is not a lingua franca, and we must think carefully about who will be consuming our analysis. Do they really know the difference between the median, mode and mean? More importantly, is that distinction relevant to the message we wish to convey?
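To see why the choice of average matters, here is a minimal sketch using Python's standard `statistics` module. The dataset (`days_to_patch`) is invented purely for illustration and is not taken from any real analysis; it simply contains one extreme value.

```python
from statistics import mean, median, mode

# Hypothetical data, invented for illustration only:
# days taken to patch ten vulnerabilities, with one extreme outlier.
days_to_patch = [2, 3, 3, 3, 4, 5, 6, 7, 9, 58]

summary = {
    "mean": mean(days_to_patch),      # pulled upwards by the outlier
    "median": median(days_to_patch),  # middle of the sorted values
    "mode": mode(days_to_patch),      # most frequently occurring value
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

For this made-up dataset the mean is 10, the median is 4.5 and the mode is 3. All three are legitimate "averages", yet each would leave an audience with a different impression of typical patching time.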
Even if the data analysis is audience-appropriate and a fair representation of the true picture, we still have to consider how our brains process the information as it is presented to us. Much research has been undertaken into human decision-making, and it turns out we are not as logical as we may like to believe. Surprisingly, factors such as the phrasing of information (e.g. quoting the success rate rather than the failure rate) can affect our decisions (“Thinking, Fast and Slow”, Daniel Kahneman). Even the way our visual perception works can lead to the success or failure of a visualisation. We’ve all been fooled by optical illusions, which work by exploiting the way our brains interpret what we see. Take a look at the image above: did you realise all the Panaseer logos are the same colour?
In plots we use “visual encoding” to represent data, e.g. height could be mapped to position along the x-axis and name could be mapped to colour. However, not all visual encodings are created equal. Our perception of quantitative data (like height) is most accurate when it is encoded as position and least accurate when it is encoded as colour. The figure below, taken from “Automating the Design of Graphical Presentations of Relational Information”, Jock Mackinlay 1986, shows the ranking for quantitative perceptual tasks.
Nominal data (like people’s names), however, can be perceived accurately when encoded as either position or colour. Looking at the featured image on the left-hand side of this page, the two paragraphs of text are identical in content. However, colouring key letters within the second paragraph draws the message out with more impact.
Another factor to consider is providing context for data. How do we know if a number is high or low? For example, in order to decide whether a person is “short” or “tall”, we also need to know their age (a height that we may consider “tall” for a toddler would be evaluated differently for an adult). Our decision is then based on our pre-existing contextual knowledge of the average height of people of this age. There is a balance to achieve, however, particularly for large, complex datasets. We’ve all seen presentations crowded with tables and pie charts that leave us overwhelmed rather than well-informed. Good visualisations are easy to interpret quickly, which means including as much relevant information as possible and displaying it efficiently. This concept is nicely defined as maximising the data-ink ratio, which is 1 minus the fraction of the visualisation that can be removed without loss of information (“The Visual Display of Quantitative Information”, Edward Tufte).
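Written out as a formula, Tufte's data-ink ratio looks like this:

```latex
\[
\text{data-ink ratio}
  = \frac{\text{data-ink}}{\text{total ink used to print the graphic}}
  = 1 - \text{proportion of the graphic that can be erased without loss of data-information}
\]
```

As a purely hypothetical illustration, if 30% of a chart's ink (heavy gridlines, decorative shading and so on) could be erased without losing any information, its data-ink ratio would be 1 − 0.3 = 0.7.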
We should not encourage people to make decisions based on data if the message presented could be ambiguous, or hard to understand, for any of the reasons discussed above. In this scenario the wrong decision could be taken for the right reasons, undermining people’s faith in the process. If we want people to move from over-reliance on expert opinion to trusting in their data, we need to achieve a few broad goals:
- The data analysis should be relevant. The audience are provided with an appropriate answer to the question that prompted the analysis in the first place. If you can’t think of the question your analysis is addressing, then why are you doing it?
- The data analysis must be reasonably transparent. The audience can appreciate the context and/or limits of the analysis.
- The visualisations must be easy to consume and interpret correctly. The hard work is done by the data scientists, not the audience.
In future posts we will discuss in more detail how we approach these issues in the context of Security Intelligence at Panaseer. In the meantime, why not share with us any examples of visualisations you particularly like or dislike, either in the comments or via @panaseer_team (extra points for cyber-related ones)? Don’t forget to tell us why!