Multivariate Data

Before presenting and discussing different visualization options, we first explain what exactly we consider as multivariate data. We also introduce a specific dataset that we use as our running example throughout this tutorial.

Multivariate data, as focused in this tutorial, concerns any data that can be represented as:

Since all items of one dataset are described along the same variables, it is important that the items belong to the same class of objects. Otherwise, it would be hard to apply the same variables. For instance, mixing persons with countries in one dataset, it would be hard to find any attributes that apply to both classes of items alike.

Aside from identifiers and names, we can discern between two main data types for variables:

Hence, these two types encode quantitative and qualitative information. While numeric variables can be sorted in increasing or decreasing order, categorical variables often cannot. Please note that sometimes categorical variables are encoded as numbers (e.g., a numeric country code) — analyzing them as numeric variables would result in invalid results.

Our Sample Dataset

We have selected a dataset that is easy to understand, yet of high relevance to many disciplines. It lists the countries of the world as data items and characterizes them along social and economic properties in 2010 as variables. The data is based on free material from GAPMINDER.ORG, CC-BY LICENSE.

The table below shows the first six rows of the dataset. You can see that it contains mostly numeric variables, but also a column with an identifier (the country name) and one with a categorical variable (the region).

The dataset contains the following variables:

You can also observe some missing values (nullnot available). It is a common problem in all real-world data that, due to different reasons, there are gaps and quality issues in the data. We intentionally kept the countries with such missing data to literally draw a more realistic picture of the data. Please note, however, that the visualizations in the following sections will silently omit data points with missing values for the displayed variable.