Visualizing Eurostat data

[2]:
import numpy as np
import pandas as pd

import geopandas as gpd

import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import cm

import geoplot as gplt
import geoplot.crs as gcrs

from tqdm.auto import tqdm

from rwrap import eurostat
[3]:
sns.set_context('talk')

Retrieving geographical information

The NUTS classification subdivides each member state into regions at three hierarchical levels, NUTS 1, 2 and 3, from larger to smaller areas (https://ec.europa.eu/eurostat/web/nuts/background). Eurostat provides this geographical data in various formats (https://ec.europa.eu/eurostat/web/gisco/geodata/reference-data/administrative-units-statistical-units, look for “NUTS”).
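
The NUTS codes themselves nest hierarchically, appending one character per level (an illustrative snippet, not an executed cell; the region assignments follow the data shown below):

# each NUTS level appends one character to the parent code
nuts_hierarchy = {
    0: 'DE',    # country: Germany
    1: 'DE2',   # NUTS 1: Bayern
    2: 'DE24',  # NUTS 2: Oberfranken (appears in df_geo below)
}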

The filename of each dataset follows a specific format: <theme>_<spatialtype>_<resolution>_<year>_<projection>_<subset>.<extension> (check https://ec.europa.eu/eurostat/cache/GISCO/distribution/v2/nuts/nuts-2016-files.html for more information).
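
As a quick illustration, such a filename can be assembled from its components (a minimal sketch, not part of the notebook's pipeline; the component values are examples):

# illustrative components following the convention above
components = {
    'theme': 'NUTS',        # administrative units
    'spatialtype': 'RG',    # regions (polygons)
    'resolution': '60M',    # 1:60 million map scale
    'year': '2016',         # NUTS revision year
    'projection': '4326',   # EPSG code of the coordinate reference system
    'subset': 'LEVL_2',     # only NUTS level 2
}
filename = '_'.join(components.values()) + '.geojson'
# -> 'NUTS_RG_60M_2016_4326_LEVL_2.geojson'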

To make things easier, we use the eurostat R package (accessed from Python via rwrap) to retrieve this data automatically.

[5]:
df_geo = eurostat.get_eurostat_geospatial(
    output_class='sf', resolution=60, nuts_level=2, year=2016
)
df_geo.head()
[5]:
id CNTR_CODE NUTS_NAME LEVL_CODE FID NUTS_ID geometry geo
0 DE24 DE Oberfranken 2 DE24 DE24 POLYGON ((11.48157 50.43162, 11.48290 50.40161... DE24
1 NO03 NO Sør-Østlandet 2 NO03 NO03 POLYGON ((10.60072 60.13161, 10.57604 60.11572... NO03
2 HU11 HU Budapest 2 HU11 HU11 POLYGON ((18.93274 47.57117, 19.02713 47.58717... HU11
3 HU12 HU Pest 2 HU12 HU12 POLYGON ((18.96591 47.02896, 18.89943 47.07825... HU12
4 HU21 HU Közép-Dunántúl 2 HU21 HU21 POLYGON ((18.68843 47.57707, 18.70448 47.52134... HU21

Example

To get a feeling for the data, we can plot the geographical composition of an arbitrary country. (The aspect ratio of 1.4 used here and below roughly compensates for the distortion that arises when plotting unprojected longitude/latitude coordinates at central European latitudes.)

[7]:
ax = gplt.polyplot(df_geo[df_geo['CNTR_CODE'] == 'DE'])
ax.set_aspect(1.4)
../_images/mining_visualization_Eurostat_Visualizations_11_0.png

Retrieving statistical data

Eurostat is a great source for data related to Europe. An overview of the main tables can be found on https://ec.europa.eu/eurostat/web/regions/data/main-tables.

Select datasets

[8]:
eurostat.search_eurostat('age').head()
[8]:
title code type last update of data last table structure change data start data end values
1 Manure storage facilities by NUTS 3 regions aei_fm_ms dataset 05.02.2019 27.02.2020 2000 2010 None
2 Population on 1 January by age, sex and NUTS 2... demo_r_d2jan dataset 13.10.2020 16.06.2020 1990 2019 None
3 Population on 1 January by age group, sex and ... demo_r_pjangroup dataset 18.06.2020 14.05.2020 1990 2019 None
4 Population on 1 January by age group, sex and ... demo_r_pjangrp3 dataset 13.10.2020 14.05.2020 2014 2019 None
5 Population on 1 January by broad age group, se... demo_r_pjanaggr3 dataset 16.06.2020 16.06.2020 1990 2019 None
[9]:
dataset_list = [
    'tgs00101',  # life expectancy
    'tgs00026',  # disposable income
    'tgs00036',  # primary income
    'tgs00010',  # unemployment rate
    'tgs00112',  # touristic bed number
]

Download and aggregate data

To simplify the subsequent analysis, we download all datasets at once and store them in a single dataframe. Additionally, we store the associated metadata in another dataframe.

[11]:
df_list = []
meta_list = []

for dataset in tqdm(dataset_list):
    # get dataset description
    dataset_name = eurostat.label_eurostat_tables(dataset)[0]
    print(f'Parsing {dataset} ({dataset_name})')

    # retrieve data
    df_data = eurostat.get_eurostat(dataset, time_format='raw')
    df_data = eurostat.label_eurostat(df_data, code=['geo'], fix_duplicated=True)

    # index rows
    common_columns = ['geo_code', 'geo']
    df_data['idx'] = df_data[common_columns].apply(
        lambda row: '_'.join(row.values.astype(str)), axis=1
    )

    # identify meta information
    info_columns = list(
        set(df_data.columns)
        - set(common_columns)
        - {'idx', 'time', 'values', 'info', 'info_short'}
    )

    meta_cols = [
        ';'.join(
            f'{k}={row._asdict()[k]}'
            for k in sorted(row._asdict().keys())
            if k != 'Index'
        )
        for row in df_data[info_columns].itertuples()
    ]
    meta_idx, meta_label = pd.factorize(meta_cols)

    df_data['meta'] = [
        f'{dataset}_{time}_{x}' for x, time in zip(meta_idx, df_data['time'])
    ]

    # pivot data
    df_piv = df_data.pivot(index='idx', columns='meta', values='values')

    # store data
    df_list.append(df_piv)

    meta_list.extend(
        [
            {
                'dataset': dataset,
                'name': dataset_name,
                'meta_idx': idx,
                'meta_label': label,
            }
            for idx, label in zip(np.unique(meta_idx), meta_label)
        ]
    )
Parsing tgs00101 (Life expectancy at birth by sex and NUTS 2 region)
Parsing tgs00026 (Disposable income of private households by NUTS 2 regions)
Parsing tgs00036 (Primary income of private households by NUTS 2 regions)
Parsing tgs00010 (Unemployment rate by NUTS 2 regions)
Parsing tgs00112 (Number of establishments and bed-places by NUTS 2 regions)

Each row corresponds to a geographic region, and each column to one statistic at a specific point in time.
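
The column naming scheme produced by the loop above is <dataset>_<time>_<meta_idx>; a name can be decoded accordingly (an illustrative snippet, not an executed cell):

# e.g. 'tgs00101_2007_2': dataset tgs00101 (life expectancy),
# year 2007, meta entry 2 ('sex=Total', see df_meta below)
dataset, time, meta_idx = 'tgs00101_2007_2'.split('_')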

[13]:
df_all = pd.concat(df_list, axis=1)
df_all.head()
[13]:
meta tgs00101_2007_0 tgs00101_2007_1 tgs00101_2007_2 tgs00101_2008_0 tgs00101_2008_1 tgs00101_2008_2 tgs00101_2009_0 tgs00101_2009_1 tgs00101_2009_2 tgs00101_2010_0 ... tgs00112_2015_0 tgs00112_2015_1 tgs00112_2016_0 tgs00112_2016_1 tgs00112_2017_0 tgs00112_2017_1 tgs00112_2018_0 tgs00112_2018_1 tgs00112_2019_0 tgs00112_2019_1
idx
AT11_Burgenland (AT) 83.3 76.2 79.7 83.4 76.7 80.0 84.2 77.5 80.9 83.9 ... 28414.0 436.0 28252.0 427.0 28293.0 427.0 27551.0 418.0 27404.0 412.0
AT12_Niederösterreich 82.7 77.0 79.9 82.8 77.1 80.0 82.7 77.4 80.1 83.3 ... 68399.0 1459.0 68714.0 1429.0 67465.0 1434.0 67868.0 1435.0 68246.0 1447.0
AT13_Wien 82.0 76.7 79.5 82.4 77.0 79.9 82.1 76.5 79.4 82.2 ... 73901.0 604.0 74771.0 1072.0 75462.0 1212.0 78672.0 1735.0 79245.0 2070.0
AT21_Kärnten 83.6 77.8 80.8 84.1 77.7 81.0 83.7 77.7 80.8 84.0 ... 142508.0 2765.0 141191.0 2695.0 140473.0 2717.0 140448.0 2720.0 140165.0 2716.0
AT22_Steiermark 83.6 77.6 80.7 83.7 77.5 80.6 83.5 77.6 80.6 83.9 ... 112499.0 2432.0 116262.0 2471.0 118592.0 2471.0 125850.0 2481.0 128815.0 2521.0

5 rows × 120 columns

Extra information for each column is stored in an additional dataframe.

[14]:
df_meta = pd.DataFrame(meta_list)
df_meta.head()
[14]:
dataset name meta_idx meta_label
0 tgs00101 Life expectancy at birth by sex and NUTS 2 region 0 age=Less than 1 year;sex=Females;unit=Year
1 tgs00101 Life expectancy at birth by sex and NUTS 2 region 1 age=Less than 1 year;sex=Males;unit=Year
2 tgs00101 Life expectancy at birth by sex and NUTS 2 region 2 age=Less than 1 year;sex=Total;unit=Year
3 tgs00026 Disposable income of private households by NUT... 0 direct=Balance;na_item=Disposable income, net;...
4 tgs00036 Primary income of private households by NUTS 2... 0 direct=Balance;na_item=Balance of primary inco...
[15]:
def get_meta(col_name, extended=False):
    """Resolve a column name of the form <dataset>_<time>_<meta_idx> to its description."""
    dataset, time, idx = col_name.split('_')
    res = df_meta[(df_meta['dataset'] == dataset) & (df_meta['meta_idx'] == int(idx))]
    assert res.shape[0] == 1
    res = res.iloc[0]

    if extended:
        return f'{res["name"]}\n({res["meta_label"]})'
    else:
        return res["name"]
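
For example (illustrative call, based on the metadata shown above):

get_meta('tgs00101_2016_2')
# -> 'Life expectancy at birth by sex and NUTS 2 region'
get_meta('tgs00101_2016_2', extended=True)
# -> 'Life expectancy at birth by sex and NUTS 2 region\n(age=Less than 1 year;sex=Total;unit=Year)'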

Overview

To get a rough overview of the data, we can plot the clustered correlation matrix.

[16]:
df_corr = df_all.corr()
[17]:
# keep only rows/columns of the correlation matrix without missing values
valid_columns = sorted(
    set(df_corr.dropna(axis=0).index) & set(df_corr.dropna(axis=1).columns)
)
df_corr = df_corr.loc[valid_columns, valid_columns]
[18]:
# assign unique color to each data source
cluster_names = df_all.columns.str.split('_').str[0].unique()
palette = cm.rainbow(np.linspace(0, 1, len(cluster_names))).tolist()
cluster_map = dict(zip(cluster_names, palette))
cluster_colors = pd.DataFrame(
    [(col, cluster_map[col.split('_')[0]]) for col in df_all.columns]
).set_index(0)[1]
[19]:
g = sns.clustermap(
    df_corr,
    xticklabels=False,
    yticklabels=False,
    cmap='vlag_r',
    row_colors=cluster_colors,
    col_colors=cluster_colors,
)
../_images/mining_visualization_Eurostat_Visualizations_30_0.png
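
The colored side bars mark the source dataset of each row and column, making it easy to see to what extent the clusters align with the individual datasets.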

Data preparation

To combine the geographical and statistical data, we merge both dataframes…

[20]:
df_all['geo_code'] = [idx.split('_')[0] for idx in df_all.index]
[21]:
df_merged = df_geo.merge(df_all, left_on='NUTS_ID', right_on='geo_code')
df_merged.head()
[21]:
id CNTR_CODE NUTS_NAME LEVL_CODE FID NUTS_ID geometry geo tgs00101_2007_0 tgs00101_2007_1 ... tgs00112_2015_1 tgs00112_2016_0 tgs00112_2016_1 tgs00112_2017_0 tgs00112_2017_1 tgs00112_2018_0 tgs00112_2018_1 tgs00112_2019_0 tgs00112_2019_1 geo_code
0 DE24 DE Oberfranken 2 DE24 DE24 POLYGON ((11.48157 50.43162, 11.48290 50.40161... DE24 82.2 76.7 ... 890.0 40555.0 876.0 40804.0 869.0 40049.0 863.0 41574.0 865.0 DE24
1 NO03 NO Sør-Østlandet 2 NO03 NO03 POLYGON ((10.60072 60.13161, 10.57604 60.11572... NO03 82.3 77.7 ... 464.0 143032.0 455.0 142724.0 421.0 151654.0 413.0 NaN NaN NO03
2 HU11 HU Budapest 2 HU11 HU11 POLYGON ((18.93274 47.57117, 19.02713 47.58717... HU11 78.7 72.4 ... 342.0 57210.0 359.0 NaN NaN 56213.0 375.0 58711.0 379.0 HU11
3 HU12 HU Pest 2 HU12 HU12 POLYGON ((18.96591 47.02896, 18.89943 47.07825... HU12 77.9 69.8 ... 229.0 16348.0 232.0 NaN NaN 17816.0 248.0 16927.0 240.0 HU12
4 HU21 HU Közép-Dunántúl 2 HU21 HU21 POLYGON ((18.68843 47.57707, 18.70448 47.52134... HU21 77.8 69.4 ... 610.0 63641.0 625.0 65846.0 651.0 70627.0 685.0 69919.0 692.0 HU21

5 rows × 129 columns

…and dissolve it to the country level, which yields a resolution better suited for a Europe-wide visualization (note that aggfunc='mean' takes the unweighted mean over each country's NUTS 2 regions).

[22]:
%%time
df_diss = df_merged.dissolve(by='CNTR_CODE', aggfunc='mean')
CPU times: user 612 ms, sys: 17.3 ms, total: 630 ms
Wall time: 663 ms
[23]:
df_diss.head()
[23]:
geometry LEVL_CODE tgs00101_2007_0 tgs00101_2007_1 tgs00101_2007_2 tgs00101_2008_0 tgs00101_2008_1 tgs00101_2008_2 tgs00101_2009_0 tgs00101_2009_1 ... tgs00112_2015_0 tgs00112_2015_1 tgs00112_2016_0 tgs00112_2016_1 tgs00112_2017_0 tgs00112_2017_1 tgs00112_2018_0 tgs00112_2018_1 tgs00112_2019_0 tgs00112_2019_1
CNTR_CODE
AT POLYGON ((15.99624 46.83540, 15.98885 46.82563... 2 83.366667 77.611111 80.566667 83.622222 77.977778 80.866667 83.600000 77.844444 ... 110385.000000 2257.222222 111271.333333 2291.000000 112577.666667 2320.555556 116181.888889 2388.222222 115356.444444 2439.000000
BE POLYGON ((6.02490 50.18278, 5.97197 50.17189, ... 2 82.581818 76.909091 79.790909 82.536364 76.627273 79.600000 82.672727 77.190909 ... 33499.272727 726.363636 33497.090909 746.363636 33982.000000 779.090909 35438.727273 837.363636 35962.272727 877.363636
BG POLYGON ((26.56154 41.92627, 26.58121 41.90292... 2 76.466667 69.400000 72.833333 76.850000 69.683333 73.133333 77.183333 70.000000 ... 53744.166667 533.666667 54710.666667 555.166667 58120.666667 557.666667 55932.833333 576.333333 56917.666667 610.666667
CH POLYGON ((9.15938 46.16960, 9.12055 46.13610, ... 2 84.557143 79.614286 82.185714 84.785714 79.942857 82.471429 84.742857 80.014286 ... NaN NaN 97329.571429 5902.714286 NaN NaN 94951.857143 5579.571429 NaN NaN
CY POLYGON ((33.75237 34.97711, 33.70164 34.97605... 2 82.100000 77.600000 79.800000 82.900000 78.200000 80.600000 83.500000 78.500000 ... 85414.000000 788.000000 84239.000000 785.000000 85965.000000 796.000000 87240.000000 802.000000 90188.000000 816.000000

5 rows × 122 columns

Visualization

We visualize the data using choropleth maps, in which each geographical region is colored according to a statistic of interest.

In the following, we will use a single randomly sampled entry from each downloaded dataset.
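
Because the entries are sampled at random, the selection changes between runs; for a reproducible choice, the sampling could be seeded (a sketch, assuming pandas >= 1.1, where GroupBy.sample is available):

df_meta.groupby('dataset').sample(n=1, random_state=42)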

[24]:
time = 2016
hue_selection = [
    f'{row.dataset}_{time}_{row.meta_idx}'
    for row in df_meta.groupby('dataset').apply(lambda x: x.sample(n=1)).itertuples()
]
hue_selection
[24]:
['tgs00010_2016_0',
 'tgs00026_2016_0',
 'tgs00036_2016_0',
 'tgs00101_2016_0',
 'tgs00112_2016_0']

Single country

We can get a precise overview of individual regions within a single country.

[25]:
size = int(np.ceil(np.sqrt(len(hue_selection))))  # size of plot grid

fig, axes = plt.subplots(nrows=size, ncols=size, figsize=(20, 20))
for ax in axes.ravel():
    ax.axis('off')

# restrict to the regions of a single country
sub = df_merged[df_merged['CNTR_CODE'] == 'DE']

for hue, ax in zip(hue_selection, axes.ravel()):
    gplt.choropleth(sub, hue=hue, legend=True, ax=ax)
    ax.set_aspect(1.4)
    ax.set_title(get_meta(hue), fontsize=10)
../_images/mining_visualization_Eurostat_Visualizations_40_0.png

Europe

To get a more global picture, we can also plot the aggregated data per country.

[26]:
for hue in hue_selection:
    ax = gplt.choropleth(
        df_diss,
        hue=hue,
        legend=True,  # k=5,
        figsize=(16, 12),
        # (min_longitude, min_latitude, max_longitude, max_latitude)
        extent=(-25, 30, 45, 75),
    )
    ax.set_aspect(1.4)
    ax.set_title(get_meta(hue))
../_images/mining_visualization_Eurostat_Visualizations_42_0.png
../_images/mining_visualization_Eurostat_Visualizations_42_1.png
../_images/mining_visualization_Eurostat_Visualizations_42_2.png
../_images/mining_visualization_Eurostat_Visualizations_42_3.png
../_images/mining_visualization_Eurostat_Visualizations_42_4.png