📈Guide to Using the Estonian Statistical Office's Database via R

Mar 6, 2024·
Nicolas Reigl, PhD
Nicolas Reigl, PhD
· 12 min read

Disclaimer

This guide is an independent work and is not endorsed by Statistics Estonia. The content presented here reflects the author’s personal experience and interpretation of accessing Estonian statistical data using R and the pxweb package.

It does not represent the official opinions, documentation, or guidelines of Statistics Estonia. While Statistics Estonia previously provided documentation on using pxweb, those materials seem to have been since removed. The methods described in this guide are based on publicly available information and the general functionality of the pxweb R package.

For official guidance on Statistics Estonia’s data access, please refer to their website: https://andmed.stat.ee.


Introduction

This guide provides an overview of how to access the Estonian Statistical Office database using R.

To interact with the database, I recommend the pxweb R package.

PXWEB is an API system developed by Statistics Sweden and adopted by various national statistical institutions (NSIs) to provide structured access to public statistics. This API allows users to download and use statistical data directly from government agencies without needing to navigate a web browser.

The pxweb R package enables seamless integration with any PXWEB API, allowing users to query, retrieve, and reference statistical data efficiently.


Setup

The pxweb package is the only required package for downloading data.
The additional packages listed below are optional and used for visualization and additional functionality.

# Load packages
library(pxweb)
library(tidyverse)
library(quarto)
library(gt)
library(plotly)

Data Access

The pxweb package enables access to data in various ways, two of which I will discuss in detail1:

  1. Interactive Data Access:
    Use the pxweb_interactive() command to make selections in the database based on existing subdivisions. You’ll be prompted to choose the broadest groups of topics, then clarifications, and finally the method of presenting the results.

  2. Direct Query:
    Utilize the pxweb_get() function to directly query a table from the database. You can specify query parameters such as for example, NACE categories, stock or flow values or the time period for the data.

1 Interactive Data Access

Using the pxweb_interactive() command, it is possible to make selections in the database according to the existing subdivisions. First, you are asked to select the broadest categories of topics (see the following console output screenshot), then the subcategories and specific variable filters, and finally the method of presenting the results.

For Statistics Estonia, the top-level database address is:

Then start the interactive data explorer

pxweb_interactive("https://andmed.stat.ee/api/v1/en/stat")
============================================================================================================================
 R PXWEB: Content of 'andmed.stat.ee'
          at '/api/v1/en/stat'
============================================================================================================================
 [ 1 ] : Environment
 [ 2 ] : Economy
 [ 3 ] : Population
 [ 4 ] : Social life
 [ 5 ] : Multidomain statistics
 [ 6 ] : Population and Housing Census
 [ 7 ] : Discontinued datasets
============================================================================================================================
Enter your choice:
('esc' = Quit, 'b' = Back, 'i' = Show id)
1:

Users can navigate the menu to select their desired options from the available choices. Upon selecting the relevant data, the interactive data explorer provides the functionality to transform the interactive query into executable code. This code can be conveniently stored in an R file, facilitating reproducibility, automatic updates, and report generation.

============================================================================================================================
 R PXWEB: Content of 'andmed.stat.ee'
          at '/api/v1/en/stat/majandus/transport/maanteetransport/TS50.PX'
   TABLE: TS50: NATIONAL AND INTERNATIONAL ROAD FREIGHT TRANSPORT
VARIABLE: Aasta, Veo liik, [[NĂ€itaja]]
============================================================================================================================
 [1 ] : Freight, thousand tonnes
 [2 ] : Standard error of freight, thousand tonnes
 [3 ] : Relative standard error of freight, %
 [4 ] : Freight turnover, thousand tonne-km

 [6 ] : Relative standard error of freight turnover, %
 [7 ] : Run of loaded vehicles, thousand km
 [8 ] : Standard error of run of loaded vehicles, thousand km
 [9 ] : Relative standard error of run of loaded vehicles, %
============================================================================================================================
Enter one or more choices:
Separate multiple choices by ',' and intervals of choices by ':'
('esc' = Quit, 'b' = Back, '*' = Select all, 'a' = Show all, 'i' = Show id)
1: *

Do you want to print code to query and download data?
Enter your choice:
('esc' = Quit, 'y' = Yes, 'n' = No)
1: y
Do you want to print query in json format (otherwise query is printed as an R list)?
Enter your choice:
('esc' = Quit, 'y' = Yes, 'n' = No)
1: n
Do you want to download the data?
Enter your choice:
('esc' = Quit, 'y' = Yes, 'n' = No)
1: y
Do you want to return a the data as a data.frame?
Enter your choice:
('esc' = Quit, 'y' = Yes, 'n' = No)
1: y
Do you want to print citation for the data?
Enter your choice:
('esc' = Quit, 'y' = Yes, 'n' = No)
1: n

After selecting the desired options, the generated code snippet is printed in the console.

# PXWEB query
pxweb_query_list <-
  list("Aasta"=c("2010","2011","2012","2013","2014","2015","2016","2017","2018","2019","2020","2021","2022"),
       "Veo liik"=c("1"),
       "NĂ€itaja"=c("1","2","3","4","5","6","7","8","9"))

# Download data
px_data <-
  pxweb_get(url = "https://andmed.stat.ee/api/v1/en/stat/majandus/transport/maanteetransport/TS50.PX",
            query = pxweb_query_list)

# Convert to data.frame
px_data_frame <- as.data.frame(px_data, column.name.type = "text", variable.value.type = "text")

From here on, we can work with the dataset in R.

2 Direct Query

If you already know the table ID, you can query the data directly. In this example, we retrieve quarterly data on service exports and imports by economic activity from the VKT20 table, which is part of the foreign trade section under the economy category.2

If pxweb_get() is used without a query, it returns either:

  • A PXWEB LEVELS object (if the URL points to a dataset directory).
  • A PXWEB METADATA object (if the URL points to a specific table).

Here’s an example of retrieving a PXWEB LEVELS object:

# Get PXWEB levels
pxlist1 <- pxweb_get(url = "https://andmed.stat.ee/api/v1/en/statsql/VKT20")
pxlist1
PXWEB METADATA
Service value, million euros by flow, economic activity and quarter
variables:
 [[1]] FLOW: flow
 [[2]] NACE_R2: economic activity
 [[3]] ContentsCode: indicator
 [[4]] TIME: quarter

This table follows a long data format and consists of four key dimensions:

  1. FLOW – Indicates the type of transaction: Balance, Imports, or Exports.
  2. NACE_R2 – Specifies the economic activity category.
  3. ContentsCode – Represents the values in million euros, structured in a long format, stacked by the other three dimensions.
  4. TIME – Contains date values as character strings.

To inspect the values of a specific field variable, we can check the “FLOW” column:

pxlist1$variables[[1]]
$code
[1] "FLOW"

$text
[1] "flow"

$values
[1] "EXP" "IMP" "BAL"

$valueTexts
[1] "Exports" "Imports" "Balance"

$elimination
[1] FALSE

$time
[1] FALSE

Creating Data Queries

To download data, we need both the table URL and a query specifying the required data fields. If no query is supplied, the API returns only metadata about the dataset.3

How to Identify Query Variables

The required list variables can be found in two ways:

  1. Programmatically – The pxlist1 object contains metadata outlining the required fields: Flow, NACE_R2, ContentsCode, and Time.
  2. Via the API – Click “API query for this table” on the Statistics Estonia data portal to view field arguments.

Handling Missing Field Variables

Sometimes, not all variables are listed on the Statistics Estonia website, especially for high-dimensional datasets in long format. If a required variable is missing, the query may fail. However, the error message typically indicates the missing element, allowing users to adjust the query accordingly.

Using the Wildcard Operator

To retrieve all available values for a given field, use the wildcard operator ("*") instead of specifying each value manually.


Example Query

In this example, we retrieve:

  • Exports and Imports: (“FLOW” = c(“EXP”, “IMP”))
    • All economic categories: ("NACE_R2" = "*")
    • All available content types: ("ContentsCode" = "*") (This field refers to units such as “Service value, million euros,” stored in long format.)
    • All available time periods: ("TIME" = "*")
px_query_list1 <- list(
  "FLOW" = c("EXP", "IMP"),
  "NACE_R2" = "*",
  "ContentsCode" = "*",
  "TIME" = "*"
)

We can now pass the query list to the function call:

px_VKT20 <-
  pxweb_get(url = "https://andmed.stat.ee/api/v1/en/statsql/VKT20",
            query = px_query_list1)

The function returns a pxweb_data object, which stores the downloaded data in a list format.


Structure and Properties of the pxweb_data Object

Before using the data, we can inspect the structure of the downloaded object.

The retrieved data is stored as a nested list within the pxweb_data object.

class(px_VKT20)
[1] "pxweb_data" "list"

Calling the object provides a minimal summary of the dimensions of the object.

px_VKT20
PXWEB DATA
With 4 variables and 10656 observations.

Potential comments which are stored as metadata in the list can be extracted.

px_data_comments <- pxweb_data_comments(px_VKT20)
px_data_comments_df <- as.data.frame(px_data_comments)

We can extract, store and cite the data-query information.

pcite <- pxweb_cite(px_VKT20)
Statistics Estonia (2025). "Service value, million euros by flow,
economic activity and quarter." [Data accessed 2025-03-07
12:36:27.231629 using pxweb R package 0.17.0],
<https://andmed.stat.ee/api/v1/en/statsql/VKT20>.

A BibTeX entry for LaTeX users is

  @Misc{,
    title = {Service value, million euros by flow, economic activity and quarter},
    author = {{Statistics Estonia}},
    organization = {Statistics Estonia},
    address = {Tallinn, Estonia},
    year = {2025},
    url = {https://andmed.stat.ee/api/v1/en/statsql/VKT20},
    note = {[Data accessed 2025-03-07 12:36:27.231629 using pxweb R package 0.17.0]},
  }
Kindly cite the pxweb R package as follows:

  Mans Magnusson, Markus Kainu, Janne Huovari, and Leo Lahti
  (rOpenGov).  pxweb: R tools for PXWEB API.  URL:
  http://github.com/ropengov/pxweb

A BibTeX entry for LaTeX users is

  @Misc{,
    title = {pxweb: R tools for PX-WEB API},
    author = {Mans Magnusson and Markus Kainu and Janne Huovari and Leo Lahti},
    year = {2019},
  }

Working with the Downloaded Data

The downloaded PXWEB data can be converted into either a data frame or a character matrix:

  • A character matrix retains the raw data as text.
  • A data frame returns a structured tidy format4, where missing values (e.g., "..") are automatically converted to NA.

Using the arguments variable.value.type and column.name.type, we can specify whether we want coded values or descriptive text for column names and data values.

Let’s extract the data, setting all columns as text, and reformat them if necessary later.

VKT20.df <- as.data.frame(px_VKT20, column.name.type = "text", variable.value.type = "text")

Alternatively, we can extract column labels and values as coded values instead of text.
This approach is often preferred during data cleaning.

VKT20_code.df <- as.data.frame(px_VKT20, column.name.type = "code", variable.value.type = "code")

For example, if working with NACE codes, we can merge the NACE column from the coded dataset with the text-based dataset.
Since both data frames originate from the same object and have identical dimensions, this process is straightforward.

VKT20.df$nace <- VKT20_code.df$NACE_R2

Once the data is in a dataframe format, we can analyze it using our preferred R workflow.

# Create a temp dataframe with a subset of sectors, and reformat the date column
temp <- VKT20.df |>
  mutate(period = yq(quarter)) |>
    filter(str_detect(`economic activity`, "Total|Manufacturing|Transportation|Information and communication"))

# Alternative: filter by code

temp <- VKT20.df |>
  mutate(period = yq(quarter)) |>
    filter(nace %in% c("TOTAL", "C", "H", "J"))

Summary statistics:

summary_stats <- temp %>%
  group_by(`economic activity`, flow) %>%
  summarise(
    Mean = mean(`Service value, million euros`, na.rm = TRUE),
    Median = median(`Service value, million euros`, na.rm = TRUE),
    SD = sd(`Service value, million euros`, na.rm = TRUE),

    Min = min(`Service value, million euros`, na.rm = TRUE),
    Max = max(`Service value, million euros`, na.rm = TRUE),
    N = n()
  )
# Create gt table
# Reorder the rows to have "Total - all activities" first
summary_stats_reordered <- summary_stats %>%
  arrange(`economic activity` != "Total - all activities")

# Create gt table with formatted numerical columns
summary_table <- summary_stats_reordered %>%
  gt() %>%
  fmt_number(columns = where(is.numeric),
             decimals = 2) |>
  tab_header(title = "") |>   opt_stylize(style = 1, color = 'gray')

summary_table

Summary Statistics of Imports and Exports of Services by Economic Activity

MeanMedianSDMinMaxN
Total - all activities
Exports1,889.961,582.30689.861,074.403,267.0048.00
Imports1,446.901,160.15589.03797.102,535.9048.00
Information and communication
Exports412.38247.70333.80100.101,122.9048.00
Imports251.68123.10227.8762.90766.2048.00
Manufacturing
Exports141.16138.4537.6378.60229.8048.00
Imports79.3675.7518.6550.70118.5048.00
Transportation and storage
Exports480.50429.15110.51326.20699.5048.00
Imports241.71205.5588.63144.50404.3048.00

Extract the URL for documenting the data source in the plot:

source_url <- px_VKT20$url

Plot the data

# Calculate Year-on-Year growth rates
temp_g <- temp |>
  group_by(flow, `economic activity`) %>%
  mutate(
    growth_rate = (
      `Service value, million euros` - lag(`Service value, million euros`, 4)
    ) / lag(`Service value, million euros`, 4) * 100
  )

temp_g$`economic activity` <- factor(temp_g$`economic activity`,
                                     levels = c("Total - all activities",
                                                "Manufacturing",
                                                "Transportation and storage",
                                                "Information and communication"))

p1 <- ggplot(data = temp_g, aes(x = period, y = growth_rate, color = `economic activity`)) +
  geom_line(linewidth = 1.5) +
  facet_wrap(~ flow, nrow = 2) +
  labs(
    x = "",
    y = "Percent",
    color = "Economic Activity",
    caption = paste("Source:", source_url)
  ) +
  ggtitle("Year-on-Year Growth Rates of Imports and Exports of Services by Economic Activity") +
  theme_minimal(base_size = 20) +
  theme(
    legend.position = "bottom",
    legend.direction = "vertical",
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    legend.key.width = unit(1, "cm"),
    legend.spacing.y = unit(0.4, 'cm'),
    axis.text = element_text(size = 14),
    axis.title = element_text(size = 14),
    plot.caption = element_text(size = 10, hjust = 0.5)
  )

p1
Growth Rates of Imports and Exports of Services
sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS 15.3.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Tallinn
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] plotly_4.10.4   gt_0.11.1       quarto_1.4.4    lubridate_1.9.4
 [5] forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4     purrr_1.0.4
 [9] readr_2.1.5     tidyr_1.3.1     tibble_3.2.1    ggplot2_3.5.1
[13] tidyverse_2.0.0 pxweb_0.17.0

loaded via a namespace (and not attached):
 [1] sass_0.4.9        generics_0.1.3    xml2_1.3.6        stringi_1.8.4
 [5] hms_1.1.3         digest_0.6.37     magrittr_2.0.3    evaluate_1.0.3
 [9] grid_4.4.0        timechange_0.3.0  fastmap_1.2.0     jsonlite_1.9.0
[13] processx_3.8.6    backports_1.5.0   ps_1.9.0          httr_1.4.7
[17] viridisLite_0.4.2 scales_1.3.0      lazyeval_0.2.2    cli_3.6.4
[21] rlang_1.1.5       munsell_0.5.1     withr_3.0.2       yaml_2.3.10
[25] tools_4.4.0       tzdb_0.4.0        checkmate_2.3.2   colorspace_2.1-1
[29] curl_6.2.1        vctrs_0.6.5       R6_2.6.1          lifecycle_1.0.4
[33] htmlwidgets_1.6.4 pkgconfig_2.0.3   pillar_1.10.1     later_1.4.1
[37] gtable_0.3.6      glue_1.8.0        data.table_1.17.0 Rcpp_1.0.14
[41] xfun_0.51         tidyselect_1.2.1  rstudioapi_0.17.1 knitr_1.49
[45] htmltools_0.5.8.1 rmarkdown_2.29    compiler_4.4.0

  1. Additional options include for example specifying the query in JSON and then importing the JSON file into R. ↩︎

  2. Like other data providers, Statistics Estonia periodically updates table names and structures, which may affect existing code. For example, a recent revision changed annual and quarterly labor market tables. These changes are usually well-documented. ↩︎

  3. The list variables are simply the table columns that store dataset values. ↩︎

  4. The tidy format refers to a data structure where each variable is a column, each observation is a row, and each value is a cell. See Hadley Wickham’s Tidy Data principles for more details. ↩︎