d3.js: Data Driven Visualisation Part I

Overview

In the interest of full disclosure, what follows is based on a training session given by Tobias Burri.

Before diving into the topic of Data Visualisation, it's worth asking why. What is it about the topic that makes it worthy of discussion? Fundamentally, this boils down to two points:

  1. Data is often rife with complexity
  2. Observations derived from data may also be difficult to convey

It has been argued that well-encoded visualisations lead to more accurate judgements and a clearer perception of patterns in data. Finding a suitable visual encoding to comprehend and convey meaning from data is therefore central to the discovery process.

The graphical form that involves elementary perceptual tasks that lead to more accurate judgement than another graphical form (with the same quantitative information) will result in a better organization and increase the chances of a correct perception of patterns and behavior.
— Cleveland & McGill (1984)

d3.js (Data-Driven Documents), which is the tool we will employ here, is merely a low-level mechanism we will use to understand, encode and convey meaning from our data.

Visual Encoding Techniques

A good representation is often one that is both simple and aligned with 'human perception'. 

An example that meets these criteria is shown below. It is a diagram by Florence Nightingale showing the causes of mortality in the Army, grouped by cause and month over a two-year period.

Upon inspection, the message Florence was trying to convey is immediately apparent. There is no 'chart-junk' (coined by Edward R. Tufte) masking the data's message, the graph type is familiar, and the visualisation is easy to interpret through colour & size comparisons which naturally lend themselves to human interpretation. The result of all this ensures that the data is correctly interpreted and results are clearly demonstrated.

Causes of mortality in the army in the East by Florence Nightingale.

The choice of encoding will be affected by quirks of the dataset and its intended message. Cleveland & McGill (1984) proposed guidelines for choosing an encoding based on the function of the display: perhaps you are trying to convey positions, lengths, angles, areas, shading, curves or volumes. They took an in-depth look at how well the human brain can make comparisons and detect differences between encodings; the key idea is to choose an encoding which suits the dataset and its intended use, and which is easy for the human brain to interpret.

As a final counter-example, consider the table below. If I asked you which of Sweden or Spain had the higher fertility rate over a given period, would the answer be easy to derive?

Perhaps it is not us at fault, but rather the choice of encoding. Is the following representation easier to reason about? The point is that there is no one-size-fits-all solution. You must consider the nature of the data and the message you are trying to convey in order to choose an effective encoding.

Abstraction vs. Control

In the case of visualisation, there is often a trade-off between abstraction and control. For example, Tableau sits on the abstract end of the spectrum; it provides many features out of the box with a degree of customisation. d3 sits on the other end of the spectrum and makes direct use of low-level web technologies such as SVG and Canvas to exert maximum control with (some) abstraction. Again, there is no one-size-fits-all solution; each approach has merits and pitfalls.

Elements of d3.js

Below is an example of a simple SVG element to render a rectangle in the browser:

<html>
<svg width='500' height='500'>
<rect y='10' x='10' height='100px' width='100px' fill='red'></rect>
</svg>
</html>

Result:

And the equivalent using d3.js...

Right away, we notice the declarative approach of d3; we express what we want to select & manipulate rather than how to do it.

var svg = d3.select('body') // Select body element of HTML
            .append('svg') // Append SVG element to HTML body
            .attr('height', 500) // Set attributes of SVG element
            .attr('width', 500);
var rect = svg.append('rect') // Add rect element to SVG element
              .attr('height', 100)
              .attr('width', 100)
              .attr('fill', 'red')
              .attr('x', 10)
              .attr('y', 10);

The following sections will tackle the elements which form the core of d3.

Selections

A selection is simply an array of elements pulled from an HTML document. Selections allow you to pick out and manipulate groups of elements of interest in a declarative fashion. The key difference with respect to traditional approaches is that we don't have to express convoluted logic to loop over elements of the DOM and interact with them one by one. Instead, we can use selections to modify groups of DOM elements without writing explicit loops or recursion.

We have already seen one example where we selected the body element and appended an SVG element containing a rectangle. Once we have the desired selection in hand, we can apply various operators to manipulate its DOM elements. The main ones to be aware of are append, remove, attr and data, for adding, removing, modifying and data binding respectively.

Selections can also be performed on: class (".x"), unique identifier ("#y"), attribute ("[color=red]"), or by containment ("parent child").

Joins

This is perhaps the most fundamental part of d3 and underpins the title 'Data-Driven Documents'. The key point is that every element in d3 (e.g. an SVG rectangle) can be bound to a corresponding data item.

Suppose we have an HTML document with two rect elements as per the below:

<svg width='500' height='60'>
<rect y='10' x='10' height='50px' width='50px' fill='red'></rect>
<rect y='10' x='100' height='50px' width='50px' fill='red'></rect>
</svg>

What would the following d3 do...?

var array = [30, 40, 20];
var rect = svg.selectAll('rect')
    .data(array) // Bind array to rect selection
    .attr('fill', 'blue');

The call to data(array) here 'binds' the array to the selection 'rect'. This means that each element of the array (30, 40, 20) is paired, in order, with a corresponding rect in the selection. Since the document contains only two rectangles, the result of this example is that 30 is bound to the first rectangle, 40 to the second, and their fill attribute is changed to blue; 20 is left without a rectangle.

A natural question is what happens when the array length does not equal the number of rect elements in the document? In fact this example has surplus elements in the array which are not bound to an object... Perhaps we want to treat these elements differently. In order to distinguish between the two we can use the enter() operator:

var array = [30, 40, 20];
var rect = svg.selectAll('rect')
    .data(array)
    .enter() // Reference surplus array-empty pairs
    .append('rect')
    .attr('fill', 'blue')
    .attr('width', ...)
    .attr('height', ...)
    .attr('x', ...)
    .attr('y', ...);

Now the behaviour has changed. The enter() selection will select the surplus pairs (20 in this case). The result of this is that new rectangles are appended for each surplus element in the array with the fill colour blue. 

Conversely, when there are more HTML elements than data elements, we can use the exit() selection. For example, if we have two rectangles and only one data element we can use the exit() selection to reference the surplus rectangle:

var array = [30];
var rect = svg.selectAll('rect')
        .data(array)
        .exit() // Reference second rectangle 
        .remove() // Remove it

One last important note is that binding occurs by index by default - this means there is no object constancy. If items are removed from the array, whichever datum now sits at a given index is bound to the DOM element at that index, regardless of which datum that element previously represented. To alleviate this issue, you can pass a second argument to the data function, a key function, which defines how data is matched to elements (for example by a unique identifier).
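
The difference between index and key matching can be sketched in plain JavaScript. This is a rough model under stated assumptions rather than d3's internals: joinByKey is a hypothetical helper and the id field on each record is made up for illustration. In d3 itself the key function would be passed as the second argument, as in .data(array, function(d) { return d.id; }).

```javascript
// Sketch of key-based matching; joinByKey is a hypothetical helper,
// not d3's internal implementation.
// `bound` is the data currently attached to elements, in element order.
function joinByKey(bound, data, key) {
  var incoming = Object.create(null);
  data.forEach(function (d) { incoming[key(d)] = d; });
  var existing = Object.create(null);
  bound.forEach(function (d) { existing[key(d)] = true; });

  return {
    // Elements whose key is still present receive the matching new datum.
    update: bound
      .filter(function (d) { return key(d) in incoming; })
      .map(function (d) { return incoming[key(d)]; }),
    // New keys need new elements.
    enter: data.filter(function (d) { return !(key(d) in existing); }),
    // Keys that disappeared leave their elements surplus.
    exit: bound.filter(function (d) { return !(key(d) in incoming); })
  };
}

// Illustrative records; the `id` field is an assumption.
var before = [{ id: 'a', value: 30 }, { id: 'b', value: 40 }];
var after = [{ id: 'b', value: 45 }, { id: 'c', value: 20 }];
var keyed = joinByKey(before, after, function (d) { return d.id; });

console.log(keyed.update); // [ { id: 'b', value: 45 } ]
console.log(keyed.enter);  // [ { id: 'c', value: 20 } ]
console.log(keyed.exit);   // [ { id: 'a', value: 30 } ]
```

With index matching, the element that used to show 'a' would simply have been handed the new first datum ('b'); with a key, the element for 'b' keeps representing 'b'.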

Scales & Axes

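d3 supports the use of scales to project a domain onto a range. Note that the d3.scale.* and d3.svg.axis names used in this post belong to the d3 v3 API; v4 and later renamed them (for example to d3.scaleLinear and d3.axisBottom). For example: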

var array = [1000, 5000, 10000];
var c = d3.scale.linear() // Create linear scale 
                 .domain([1000,10000]) // Set source domain
                 .range(['green', 'red']); // Set projection range

// Set colour to element d of array and project using c.
svg.selectAll('rect').data(array).enter()
  .append('rect') // enter() only yields placeholders; append creates the rects
  .attr('fill', function(d) { return c(d); });

Here the source array is scaled (linearly) from the domain of 1000-10000 to the range of green to red. What's special here is that d3 is intelligent enough to interpolate a range expressed as colours. The result is that our rect elements' fill colour can be mapped to a range of colours depending on the dataset, without us spelling out explicit RGB values.
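
For intuition, a linear scale is just a linear interpolation between the two ends of the range. The sketch below is a plain-JavaScript approximation, not d3's code: linearScale and linearColourScale are hypothetical helpers, and the colours are written out as RGB triples by hand (CSS 'green' is rgb(0,128,0) and 'red' is rgb(255,0,0)). As far as I'm aware d3 also interpolates colours channel by channel in RGB space by default, but treat the exact output format here as an assumption.

```javascript
// Minimal linear scale: maps [d0, d1] onto [r0, r1] (a sketch, not d3's code).
function linearScale(domain, range) {
  return function (value) {
    var t = (value - domain[0]) / (domain[1] - domain[0]);
    return range[0] + t * (range[1] - range[0]);
  };
}

// Colour variant: interpolate each RGB channel separately.
function linearColourScale(domain, fromRgb, toRgb) {
  var channels = [0, 1, 2].map(function (i) {
    return linearScale(domain, [fromRgb[i], toRgb[i]]);
  });
  return function (value) {
    var rgb = channels.map(function (scale) {
      return Math.round(scale(value));
    });
    return 'rgb(' + rgb.join(',') + ')';
  };
}

// Same domain as the example above; colours given as RGB triples.
var colour = linearColourScale([1000, 10000], [0, 128, 0], [255, 0, 0]);
console.log(colour(1000));  // rgb(0,128,0)
console.log(colour(5000));  // rgb(113,71,0)
console.log(colour(10000)); // rgb(255,0,0)
```

A value of 5000 sits four ninths of the way along the domain, so its colour sits four ninths of the way from green to red.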

It's all well and good having selections, joins and scaling at hand - but how can we generate axes for our charts? A simple example is shown below:

// Assumes a scale 'x' is previously defined
var axis = d3.svg.axis().scale(x).orient('bottom'); 

// Append a new element, translate it and add the axis
svg.append('g')
        .attr('transform', 'translate(0,50)')
        .call(axis);

To avoid part of the axes or data rendering off-screen, it is conventional to append a g element offset by a margin and to draw everything inside it. The full example in the next section follows this margin convention.
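
The arithmetic behind the margin convention is small enough to sketch in plain JavaScript. The helper name innerBox and the numbers are assumptions for illustration: the outer size is what the svg element gets, the inner size is what the scales should use for their ranges, and the transform string is what the g element receives.

```javascript
// Margin convention arithmetic; innerBox is a hypothetical helper and
// the numbers below are illustrative assumptions.
function innerBox(outerWidth, outerHeight, margin) {
  return {
    // Drawing area left over once the margins are subtracted.
    width: outerWidth - margin.left - margin.right,
    height: outerHeight - margin.top - margin.bottom,
    // Shifts the inner g element so that (0,0) is the top-left of the drawing area.
    transform: 'translate(' + margin.left + ',' + margin.top + ')'
  };
}

var box = innerBox(500, 100, { top: 10, right: 20, bottom: 30, left: 40 });
console.log(box.width, box.height, box.transform); // 440 60 translate(40,10)
```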

External Data - A real world example

To do anything interesting with d3, we will likely need to load external data such as a CSV file. The browser fetches the file over HTTP, so we need a simple HTTP server to expose it for d3 to access; Python's built-in SimpleHTTPServer module is sufficient for this purpose.

To tie together the material thus far, we will work through a real-world example using data from the Titanic. If you want to follow along, you will need to make an account on Kaggle, download the dataset and place it in a new directory before proceeding. You will also need an installation of Python handy to run the HTTP server.

Before running the example, ensure you have launched an HTTP server in the new directory:

python -m SimpleHTTPServer 8888    # Python 2
python3 -m http.server 8888        # Python 3 equivalent

Source code to load & visualise the dataset:

// Load titanic csv from HTTP server
d3.csv("titanic.csv", function(data) {

// Set margins as per convention
var margin = {top: 30, right: 30, bottom: 20, left: 50};
var height = 700 - margin.top  - margin.bottom;
var width = 1200  - margin.left - margin.right;
  
var svg = d3.select('body')
.append('svg') // Append svg element to body
.attr('width', width + margin.left + margin.right)
.attr('height', height + margin.top + margin.bottom)
.append('g') // Append g element to svg using margin convention
.attr('transform', 'translate('+ margin.left +',' + margin.top +')');

// Setup linear scale factor
var x = d3.scale.linear()
    .domain([0,80])
    .range([0,width])

// Setup log scaling on y-axis
var y = d3.scale.log()
    .domain([6,500])
    .range([height, 0])

// Setup ordinal scaling for colours
var c = d3.scale.ordinal()
    .domain([1,2,3])
    .range(['#ff4d4d','#00cc66','orange'])                             

// Build axes
var axisX = d3.svg.axis().scale(x).orient('bottom')
var axisY = d3.svg.axis().scale(y).orient('left')

svg.selectAll('circle')
    .data(data) // Bind data to circle elements
    .enter() // Reference unpaired array elements
    .append('g') // Add g element
    .append('circle') // Add circle to g

    // Set attributes from data using scales
    .attr('cx', function(d) {return  x(d.Age)}) 
    .attr('cy', function(d) {return y(d.Fare)})
    .attr('r', function(d) {return 3 + 2*d.Survived})
    .attr('fill', function(d) {return c(d.Pclass)})

// Append x-axis
svg.append('g')
    .attr('class', 'axis')
    .attr('transform', 'translate( 0,' + height +')')
    .call(axisX)

// Append y-axis
svg.append('g')
    .attr('class', 'axis')
    .call(axisY)
});

The example above will plot the following:

Age is represented on the x-axis and fare price on the y-axis. Note that these figures have been scaled linearly on the x-axis, logarithmically on the y-axis and categorically for the colours. The colours represent the class of passenger (red = 1st, green = 2nd and orange = 3rd). Finally, the larger circles indicate that a passenger survived whilst smaller circles denote a casualty.
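
The log and ordinal scales can be modelled in a few lines of plain JavaScript. This is a sketch of the underlying idea only: logScale and ordinalScale are hypothetical helpers rather than d3's implementation, and they reuse the domain, range and colours from the example above (a height of 650 comes from 700 minus the top and bottom margins).

```javascript
// Sketch of a log scale: interpolate in log space rather than linear space.
function logScale(domain, range) {
  var lo = Math.log(domain[0]);
  var hi = Math.log(domain[1]);
  return function (value) {
    var t = (Math.log(value) - lo) / (hi - lo);
    return range[0] + t * (range[1] - range[0]);
  };
}

// Sketch of an ordinal scale: a plain lookup from category to output value.
// Unknown categories return undefined here, which is a simplification.
function ordinalScale(domain, range) {
  return function (value) {
    var i = domain.indexOf(value);
    return i === -1 ? undefined : range[i % range.length];
  };
}

var yPos = logScale([6, 500], [650, 0]);
console.log(yPos(6));   // 650 (bottom of the chart)
console.log(yPos(500)); // 0 (top of the chart)

var classColour = ordinalScale([1, 2, 3], ['#ff4d4d', '#00cc66', 'orange']);
console.log(classColour(1)); // #ff4d4d
console.log(classColour(3)); // orange
```

One consequence of working in log space is that a value of 0 has no finite position, so any zero fares would need to be filtered out or clamped before plotting.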

How difficult is it for our brain to decipher insights using this encoding? Intuitively, 1st class passengers are seen to pay a higher fare whilst 2nd and 3rd class passengers are primarily in the lower region of the graph. What's more interesting is that we can see a potential pattern emerging which appears to show 1st class passengers correlating with a higher survival rate; there appear to be fewer small red circles in the upper region of the graph. Of course this is merely an observation and does not take into account the ratio of 1st to 2nd and 3rd class passengers, but it does serve as a good example of encoding data in a way that is easy to interpret.
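
One way to check the visual impression against the class ratios is to compute the survival rate per class directly. The sketch below assumes rows shaped like the parsed CSV, with Pclass and Survived columns holding strings; survivalRateByClass is a hypothetical helper and the sample rows are made up purely for illustration, not taken from the real dataset.

```javascript
// Survival rate per passenger class, from rows shaped like the parsed CSV
// (CSV parsing yields strings, hence the Number() conversion).
function survivalRateByClass(rows) {
  var totals = {};
  rows.forEach(function (row) {
    var cls = row.Pclass;
    if (!totals[cls]) {
      totals[cls] = { passengers: 0, survivors: 0 };
    }
    totals[cls].passengers += 1;
    totals[cls].survivors += Number(row.Survived);
  });

  var rates = {};
  Object.keys(totals).forEach(function (cls) {
    rates[cls] = totals[cls].survivors / totals[cls].passengers;
  });
  return rates;
}

// Made-up rows for illustration only; not taken from the real dataset.
var sampleRows = [
  { Pclass: '1', Survived: '1' },
  { Pclass: '1', Survived: '1' },
  { Pclass: '1', Survived: '0' },
  { Pclass: '3', Survived: '0' },
  { Pclass: '3', Survived: '1' },
  { Pclass: '3', Survived: '0' },
  { Pclass: '3', Survived: '0' }
];

var sampleRates = survivalRateByClass(sampleRows);
console.log(sampleRates['1']); // two of three survived
console.log(sampleRates['3']); // 0.25
```

Inside the d3.csv callback the same function could be called on the loaded data to obtain rates that account for the different class sizes.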

Transitions & Event Handling

Not only do we want to represent visualisations in a declarative and data-driven manner, but we'd also like to make them dynamic. For example, we may want to trigger an animation when a user clicks a button. Transitions enable us to do this and can largely be thought of as selections over time.

A simple transition is demonstrated below:

...And the code needed to perform the transition:

<div id="viz2">
    <svg height="300" width="800">
  <circle cx="400" cy="50" r="40" stroke="black" stroke-width="3" fill="blue" />
  <circle cx="300" cy="50" r="40" stroke="black" stroke-width="3" fill="blue" />
    </svg>
</div>
<script>
var viz = d3.select("#viz2").select("svg");
repeat();

function repeat() {
var circle = viz.selectAll('circle') // Select the circles
    .transition() // Start a transition
    .duration(1500) // Set duration for the transition
    .attr('cy', 250) // Set end attributes
    .attr('fill','red')
    .each('end', function(){ // Function to call at end of transition
          viz.selectAll('circle') 
            .transition()
            .duration(1500)
            .attr('cy', 50)
            .attr('fill','blue')
            .each('end', repeat);
    });
}
</script>
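
Under the hood, a transition repeatedly computes an in-between value for each attribute as time passes. The plain-JavaScript sketch below models that for the cy attribute in the example above, assuming simple linear easing; valueAt is a hypothetical helper, and d3 applies its own easing function by default, so the real intermediate values will differ.

```javascript
// Value of an attribute part-way through a transition.
// Linear easing is assumed here for simplicity.
function valueAt(start, end, duration, elapsed) {
  // Clamp progress to the range [0, 1].
  var t = Math.max(0, Math.min(1, elapsed / duration));
  return start + t * (end - start);
}

// cy moving from 50 to 250 over 1500 ms, sampled at three moments.
[0, 750, 1500].forEach(function (ms) {
  console.log(ms, valueAt(50, 250, 1500, ms));
});
// 0 50
// 750 150
// 1500 250
```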

If you want to react to user events such as mouse clicks you can make use of the .on() operator. Examples of events that are supported include:

  • 'click' 
  • 'dblclick' 
  • 'mouseover' 
  • 'mousemove'
  • 'mouseout'
  • 'dragstart' 
  • 'drag'
  • 'dragend'
  • 'zoom'
  • ... and other JavaScript event types

Conclusion

I hope you found this post informative. To recap, we've touched on what makes a good visualisation, techniques to effectively encode data and on the basics of the d3.js library. We looked at selections, joins, scales, axes, loading data, transitions & events with examples of how to apply them in practice.

If you're interested in hearing more about data visualisation and other topics in the data analytics space, stay tuned.

 

References

  • Cleveland, William S. & McGill, Robert. "Graphical perception: Theory, experimentation, and application to the development of graphical methods." Journal of the American Statistical Association, Vol. 79, 1984.