Accessing Norwegian official statistical tables using the JSON API from Statistics Norway with R
JSON (JavaScript Object Notation) has become a de facto standard for data exchange, but its structuring may in some cases be difficult to handle. Therefore, a "light" version especially suited for statistical data, called JSON-stat, has become quite popular for many applications. Among others, Statistics Norway chose this as the primary delivery format for data from Statistikkbanken, the "bank of statistical data", also using JSON as the query format.
Basically, a JSON query is HTTP POSTed to the API, using the table number as identification, and a JSON-stat formatted response is returned. A GET request to the same address provides a version of the metadata for the table. The data may then be converted to R data frames etc.
With the necessary R libraries installed and loaded, all operations can quite easily be performed from the command line in R, and by the standard user interface for Statistikkbanken, one may obtain the JSON query needed. For routine use though, it will in most cases be far easier to use a standardized set of functions for the basic data traffic, and one such small and simple set is presented here.
Basically, just two functions are normally used: getMetaData(tableId) for fetching table descriptions, and getJSONData(tableId,query,naming) for submitting a query and converting the response to an R data frame. A third function, getRawJSONData, gives access to the raw JSON-stat text returned.
options(encoding="UTF-8")
library(httr)
# rjstat is used for converting SSB JSON-stat -> data frame
library(rjstat)
# jsonlite is used mainly for converting metadata
library(jsonlite)
# source("ssb-json-functions.R")

# getMetaData - fetches the metadata for the table with number tableId,
# as JSON-formatted text
getMetaData <- function(tableId) {
  getUrl <- paste("http://data.ssb.no/api/v0/no/table/", tableId, sep = "")
  d.tmp <- GET(getUrl)
  content(d.tmp, "text")
}

# getRawJSONData - fetches SB-data by POST request
#   tableId:   number of the SB-table
#   queryData: JSON-formatted query
getRawJSONData <- function(tableId, queryData) {
  # POST query request
  d.tmp <- POST(paste("http://data.ssb.no/api/v0/no/table/", tableId, sep = ""),
                body = queryData, encode = "json", verbose())
  # Return the contents of d.tmp as JSON-formatted text
  content(d.tmp, "text")
}

# getJSONData - fetches SB-data by POST request.
# Returns a data frame for further processing.
getJSONData <- function(tableId, queryData, naming = "id") {
  # Parse the JSON-stat response
  sbtable <- fromJSONstat(getRawJSONData(tableId, queryData), naming = naming)
  # Only the first dataset in sbtable is used and returned
  ds <- sbtable[[1]]
  ds
}
If no handling of metadata is to be performed, the two packages httr and rjstat will suffice, but for more flexible use of data and metadata jsonlite comes in handy. I have chosen to leave the verbose setting in the HTTP POST request "on", as it provides a bit more insight into what is happening.
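As an illustration of the jsonlite part, the metadata text from getMetaData can be parsed into an R list. A minimal sketch, assuming the functions above are loaded and the API is reachable; the field names follow the general metadata layout of the API:

```r
# Parse the metadata for table 07902 into an R list
meta <- fromJSON(getMetaData("07902"))
meta$title                     # table title
meta$variables$code            # variable codes: Kjonn, AlderX, ContentsCode, Tid
meta$variables$values[[1]]     # valid value codes for the first variable
meta$variables$valueTexts[[1]] # and their human-readable labels
```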
Please note that at Statistics Norway, a new package, ApiData, by researcher Øyvind Langsrud, with a single user interface function ApiData, is now (March 2017) in internal use for these purposes. By the time you read this, it may be publicly available, and its use will be demonstrated in subsequent posts.
A typical JSON query and its execution
We will use a rather large table as an example, 07902, mortality data 1966-2017 (as per March 2018). The data are given by gender (variable Kjonn), age (AlderX), time (Tid) and four different content types (ContentsCode).
I have wrapped the query in an R function rather than putting it into an R variable, as I have found it slightly easier to maintain and develop that way. YMMV. This query retrieves the whole table, using the selection type "selection": {"filter": "all", "values": ["*"]} for every variable.
The query has a query part and a response part; the latter is rather trivial, "response": {"format": "json-stat"}, and common to all JSON-stat queries on the site.
To retrieve the whole table, it is not necessary to specify all variables, but I find complete specifications safer and, over time, simpler to work with - and they are easier to narrow down.
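For example, narrowing down to selected years amounts to replacing the "all" selection for a variable with an explicit item list (the year codes below are examples; check them against the metadata):

```
{
  "code": "Tid",
  "selection": {
    "filter": "item",
    "values": ["2016", "2017"]
  }
}
```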
getQueryData07902 <- function() {
  '{
    "query": [
      {
        "code": "Kjonn",
        "selection": { "filter": "all", "values": ["*"] }
      }, {
        "code": "AlderX",
        "selection": { "filter": "all", "values": ["*"] }
      }, {
        "code": "ContentsCode",
        "selection": { "filter": "all", "values": ["*"] }
      }, {
        "code": "Tid",
        "selection": { "filter": "all", "values": ["*"] }
      }
    ],
    "response": { "format": "json-stat" }
  }'
}
For repeated execution, the POST request may be wrapped into a function, returning a quite long R data frame:
getAllMortalityData07902 <- function(){
getJSONData("07902",getQueryData07902())
}
> t07902 <- getAllMortalityData07902()
> length(t07902[,1])
[1] 66768
Before we can use this, we must filter and transform it. This can be done in a lot of ways, and many tools are available. We shall start out quite simply, but first we save the data frame, to become independent of the connection. We also load it again, to make sure the saving works; load() reads the object back from file under its original name, here t07902. The first few lines look as follows, with the actual data in the value variable:
> t07902[1:5,]
Kjonn AlderX ContentsCode Tid value
1 0 000 LevendePerTusen 1966 1e+05
2 0 000 LevendePerTusen 1967 1e+05
3 0 000 LevendePerTusen 1968 1e+05
4 0 000 LevendePerTusen 1969 1e+05
5 0 000 LevendePerTusen 1970 1e+05
Because all the transformations necessary for later use can be gathered in a single R function, there is generally not much to gain by transforming before saving, unless the data are to be used in only one way.
> save(t07902,file="../data/t07902-2018.RData")
> load("../data/t07902-2018.RData")
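As an aside, saveRDS/readRDS is an alternative to save/load that does not tie the object to its original name; a small sketch (the file path is just an example):

```r
# saveRDS stores a single object; readRDS returns it, so the
# caller chooses the name on reload
saveRDS(t07902, file = "../data/t07902-2018.rds")
mortality <- readRDS("../data/t07902-2018.rds")
```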
Simple transformations
To give a fairly simple example of one way of working from scratch, we will reconstruct a version of the 2017 mortality table from our data frame. We first filter out the 2017 data, then pick the four variables according to their ContentsCode values. Next we prepare a data frame with the necessary background data, and merge this with the variables to create a mortality table data frame. We might also omit Tid here, as it is trivially 2017; for identification when handling several periods at once, keeping it can however be a precautionary measure.
One basic way, step by step
In the data frame, the data are sorted by sex and age, but it might be a bit risky to rely solely on this implicit order instead of filtering data by values.
We delete columns by setting them to NULL: b2017$value <- NULL ; b2017$ContentsCode <- NULL ;
> t2017 <- t07902[t07902$Tid==2017,]
> table(t2017$ContentsCode)
Dode Dodssannsynlighet ForvGjenLevetid LevendePerTusen
321 321 321 321
>
> lx <- t2017[t2017$ContentsCode=="LevendePerTusen",5]
> ex <- t2017[t2017$ContentsCode=="ForvGjenLevetid",5]
> qx <- t2017[t2017$ContentsCode=="Dodssannsynlighet",5]
> dx <- t2017[t2017$ContentsCode=="Dode",5]
> t2017d <- t2017[t2017$ContentsCode=="Dode",]
> b2017 <- t2017d ;
> b2017$value <- NULL ; b2017$ContentsCode <- NULL ;
> dt2017 <- data.frame(b2017,lx,qx,dx,ex)
> dt2017[1:5,]
Kjonn AlderX Tid lx qx dx ex
104 0 000 2017 100000 2.236 224 82.63
312 0 001 2017 99776 0.150 15 81.81
520 0 002 2017 99761 0.165 16 80.82
728 0 003 2017 99745 0.115 11 79.84
936 0 004 2017 99733 0.113 11 78.85
>
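Since the column-wise combination above relies on the four subsets sharing the same row order, a cheap safeguard is to check that the background columns agree across subsets; a minimal sketch using the objects from the steps above:

```r
# Verify that the "Dode" and "LevendePerTusen" subsets are aligned
# row by row before trusting the combined data frame
tL <- t2017[t2017$ContentsCode == "LevendePerTusen", ]
stopifnot(all(t2017d$Kjonn == tL$Kjonn),
          all(t2017d$AlderX == tL$AlderX))
```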
We can make a simple plot, for example of life expectancy, to check the transformations and look at the data.
> dt2017All <- dt2017[dt2017$Kjonn==0,]
> dt2017M <- dt2017[dt2017$Kjonn==1,]
> dt2017F <- dt2017[dt2017$Kjonn==2,]
> plot(as.numeric(dt2017All$AlderX),dt2017All$ex)
> points(as.numeric(dt2017All$AlderX),dt2017M$ex,type="l",col=4)
> points(as.numeric(dt2017All$AlderX),dt2017F$ex,type="l",col=2)
Making a function of it
This exercise would hardly be something you want to repeat many times to get mortality tables for several years, or when new data become available. But by making a function of it, with year and data frame as parameters, handling becomes fairly efficient: provided the data structure does not change, updating or creating analyses for a new year can be done quickly, with little thought about the data formats and transformations involved.
Generally, there is little reason to work with an awkward organization of R data if the problem at hand is not a one-off case. In most cases, transformations are relatively easy to set up and apply. For example, we do not need to decide whether to organize the data by years or by variables; we can do both.
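As a sketch of the by-variable organization, a small hypothetical helper (not part of the original code) can pull one ContentsCode across all years into an age-by-year matrix, assuming the t07902 data frame from above:

```r
# Life expectancy (ForvGjenLevetid) for all persons (Kjonn == 0),
# one column per year, rows in age order
exByYear <- function(df = t07902, code = "ForvGjenLevetid", kjonn = 0) {
  sub <- df[df$ContentsCode == code & df$Kjonn == kjonn, ]
  m <- do.call(cbind, split(sub$value, sub$Tid))
  rownames(m) <- sub$AlderX[sub$Tid == sub$Tid[1]]
  m
}
```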
mortTableYear <- function(year, df = t07902) {
  tYr <- df[df$Tid == year, ]
  tYrD <- tYr[tYr$ContentsCode == "Dode", ]
  # Background columns: drop value, ContentsCode and Tid
  bYr <- tYrD
  bYr$value <- NULL ; bYr$ContentsCode <- NULL ; bYr$Tid <- NULL
  cCodes <- c("Dode", "Dodssannsynlighet", "ForvGjenLevetid", "LevendePerTusen")
  mData <- matrix(0, nrow = nrow(bYr), ncol = 4)
  for (i in 1:4) mData[, i] <- tYr[tYr$ContentsCode == cCodes[i], 5]
  dfYr <- data.frame(bYr, mData)
  names(dfYr) <- c("Kjonn", "AlderX", "dx", "qx", "ex", "lx")
  dfYr
}
> m2016 <- mortTableYear(2016)
> m2016[1:5,]
Kjonn AlderX dx qx ex lx
103 0 000 216 2.158 82.42 100000
311 0 001 18 0.183 81.59 99784
519 0 002 13 0.132 80.61 99766
727 0 003 5 0.049 79.62 99753
935 0 004 5 0.048 78.62 99748
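With the function in place, tables for several years can be collected in one go; a sketch, assuming t07902 is loaded:

```r
# A named list of mortality tables, one per year
years <- 2014:2017
mTabs <- lapply(years, mortTableYear)
names(mTabs) <- years
mTabs[["2016"]][1:5, ]   # same table as m2016 above
```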
Here is a function version of the above plot, complete with a parameter for png-saving. The function-within-a-function pattern is efficient for factoring out common features of function calls, and for defining and handling a local context. Local data in an outer function are "global" in the inner functions, in a kind of Poor Man's Object Orientation.
testPlotEX <- function(df, savePng = 0) {
  plotter <- function(y, newP = 0, col = 1) {
    # dfA is defined in the outer function, "global" here
    if (newP == 1) plot(as.numeric(dfA$AlderX), y, type = "l",
                        xlab = "Age", ylab = "Expected years left", col = col)
    else points(as.numeric(dfA$AlderX), y, type = "l", col = col)
  }
  # Open either a png device or a screen window, not both
  if (savePng > 0) png(file = "life_expect_0.png") else X11()
  dfA <- df[df$Kjonn == 0, ] ; dfM <- df[df$Kjonn == 1, ] ; dfF <- df[df$Kjonn == 2, ]
  plotter(dfA$ex, newP = 1) ; plotter(dfM$ex, col = 4) ; plotter(dfF$ex, col = 2)
  legend(65, 75, col = c(1, 2, 4), lty = 1, legend = c("All", "Women", "Men"))
  if (savePng > 0) dev.off()
}
>
> testPlotEX(dt2017,savePng=0)
> testPlotEX(dt2017,savePng=1)
>
NB! This is work in progress :-)