Wednesday, 21 March 2018

More JSON from Statistics Norway: Metadata, query creation and execution

We have presented the basics of fetching data with the JSON-based API of Statistics Norway's data bank, and transforming them. The next steps are to look a bit more closely into metadata, and how to use more tools for transforming the fetched data. We will not repeat much from last post, but the full source code is available from Github.

Minimalist metadata handling

Together with the three functions presented in the last post, the following 5 functions constitute a bare-bones and simple suite for handling metadata and data in the API. The first, getValuesAndLabels(tableId) returns the metadata for a given table as an R data frame, augmented with a column for variable selection information.

When this information is filled in, the data frame can be handed to createQueryDF(metaDF) for creation of a JSON query. A query data frame is generated and undergoes a little bit of regexp patching in queryFromDF(df), to produce a runnable query.

The function createQueryFromDF(markedValueLabelsDF) carries out the two mentioned steps, producing a query from a marked-up data frame. A frequently used application of this is the function createQueryForAllVars(tableId), which returns a query for all the variables in the table.

Thus, the function calls

> t07902 <- getJSONData("07902",createQueryForAllVars("07902"))
> t04681 <- getJSONData("04681",createQueryForAllVars("04681"))
> t07902 <- getJSONData("03013",createQueryForAllVars("03013"))
will fetch all of tables 07902,04681 and 03013, respectively. This way, we don't really need to handle the metadata in advance if we are content with getting, possibly, haystacks of data.

options(encoding="UTF-8")
# source("ssb_mortality_table_testing.R")
library(httr)
# rjstat is used for converting SSB JSON -> Data frame
library(rjstat)
# jsonlite is used mainly for converting metadata 
library(jsonlite)

# From JSON metadata table structure w/no subtable converted to dataframe by jsonlite
# Adds column "Slct" for selection marking, Parameter tableId: Number of SB-table

getValuesAndLabels <- function(tableId) {
    mDF <- fromJSON(getMetaData(tableId))
    varNms <- mDF[[2]][[1]]  ; varLbls <- mDF[[2]][[1]] ; varNmb <- length(varNms) ;
    valAndLbl <- list()
    for (i in 1:varNmb) {
        xdfi <- data.frame(mDF[[2]][3][[1]][i],  mDF[[2]][4][[1]][i],0)
        names(xdfi) <- c(varNms[i],paste(varNms[i],"Label",sep=""),"Slct")
        valAndLbl[[varNms[i]]] <- xdfi
    }
    valAndLbl
}

#  Creates search data frame from marked-up meta data frame metaDF 

createQueryDF <- function(metaDF){
    srchDF <- fromJSON(getQueryData99999())
    srchDFq  <-  srchDF$query
    nQueryRows <- length(srchDFq$code) ;
    nVar <- length(metaDF) ; nSrchVar <- 0 ;
    
    for (i in 1:nVar){
        nValues <-  length(metaDF[[i]][,3]) ;
        allSlct <- metaDF[[i]][1,3] ; nmbSlct <-  sum(metaDF[[i]][,3]) ;
        topSlct <- metaDF[[i]][nValues,3]
        if (allSlct==10 || nmbSlct>0) {   # Variable included in search
            nSrchVar <-  nSrchVar + 1 ;
            varNm <- names(metaDF[[i]])[1] ;
            srchDFq$code[nSrchVar] <- varNm ;
        }
        if (allSlct==10) {  # All
            srchDFq$selection[nSrchVar,1] <- "all" ;
            srchDFq$selection[nSrchVar,2] <- "*" ;
        }
        else if (nmbSlct>0&topSlct<2) { # Some values, put them in a list
            mDi <- metaDF[[i]] ; mDiSlct <- mDi[mDi[,3]==1,1] ;
            mDiSlct <- paste("\"",mDiSlct,"\"",sep="")
            srchDFq$selection[nSrchVar,1] <- "item" ;
            srchDFq$selection[nSrchVar,2] <- paste(mDiSlct,collapse=",") ;
         }
        else if (topSlct>1) { # Newest topSlct values
            srchDFq$selection[nSrchVar,1] <- "top" ;
            srchDFq$selection[nSrchVar,2] <- as.character(topSlct) ;
        }
    }
    srchDF$query <- srchDFq[1:nSrchVar,]
    # print(length(srchDF$query$code))
    fjL <- list(format="json-stat")
    srchDF$response<-fjL
    srchDF
}

# Patch the generated query by removing the last [] pair
# This way, we can create a correct query from a data frame (really list/frame hybrid) we generate

queryFromDF <- function(df){
    jfJ0 <- toJSON(df)
    jfJ0 <- gsub("\\\\","",jfJ0)
    jfJ0 <- gsub("\\[\"\"","\\[\"",jfJ0)
    jfJ0 <- gsub("\"\"\\]","\"\\]",jfJ0)
    jfJ0 <- gsub("\\]([^]]*)$","\\1",jfJ0)
    jfJ0 <- gsub("\\[([^[]*)$","\\1",jfJ0)
    jfJ0
}    

#  createQueryFromDF - Creates  json search query from marked-up metadata 
#  Parameter: markedValueLabelsDF - list of data frames marked for selection

createQueryFromDF <- function(markedValueLabelsDF){
   sDF <- createQueryDF(markedValueLabelsDF)
   queryFromDF(sDF)
}

# createQueryForAllVars - Creates a query for an entire table 

createQueryForAllVars <- function(tableId)  {
    vlDF <- getValuesAndLabels(tableId) 
    nVar <- length(vlDF) ; 
    for (i in 1:nVar) vlDF[[i]][1,3] <- 10 ;
    createQueryFromDF(vlDF)
}

Example of use on the mortality table

Extracting one of the four mortality table variables (qx,dx,lx,ex) for all available years can be done by setting up a query limited only on ContentsCode - which is what the function pickMortalityVars(mdVL,contCode) does. It is called by the function getMortalityData(contCode), which returns a data frame with values for all years for one variable, e.g. life expectancy (lx).

The next step is to transform the data frame to a suitable format. A table with N+2 colummns; Sex, Age and data for N years, is handy, but there is quite a bit of looping needed if we develop the transformation from scratch. This is a case for the reshape package, the functions melt and cast can quite easily be employed here. They are used in transformToColumnsData(eData), where the variables are "melted" by sex, age and time, and "cast" according to sex+age~time. After some adaptation with colummns, the result is exactly what we want.

pickMortalityVars <- function(mdVL,contCode=3){
    mdVL$Kjonn[1,3] <- 10 ; # All
    mdVL$AlderX[1,3] <- 10 ; # All
    mdVL$ContentsCode[contCode,3] <- 1 ; 
    mdVL$Tid[1,3] <- 10 ; # All
    mdVL
}

transformToColumnsData <- function(eData){
    eData$expLT <- eData$value # To avoid default naming collision on "value"
    eData$value <- NULL ;  # Drop value column
    meltExpectation <- melt(eData,id=c("Kjonn","AlderX","Tid")) ;
    meltExpectation$expLT <- NULL ;  # Drop trivial column
    colExp <- cast(meltExpectation,Kjonn+AlderX~Tid)
    colExp 
}

getMortalityData <- function(contCode){
    metaData <- getValuesAndLabels("07902") ;     
    mdVL <- pickMortalityVars(metaData,contCode=contCode) ;
    eQuery <- createSearchFromDF(mdVL) ;
    eData <- getJSONData("07902",eQuery)
    eData$ContentsCode <- NULL # Drop trivial column
    eData
}

getMortalityDataTable <- function(contCode){
    eData <- getMortalityData(contCode)
    transformToColumnsData(eData)
}

getExpectedLifeTimeTable <- function(){
    getMortalityDataTable(3)
}

> ex19662017 <- getExpectedLifeTimeTable()  
> ex19662017[1:5,45:54]
    2008  2009  2010  2011  2012  2013  2014  2015  2016  2017
1  80.68 80.86 81.04 81.25 81.45 81.66 82.09 82.28 82.42 82.63
2  79.90 80.12 80.26 80.44 80.65 80.86 81.29 81.47 81.59 81.81
3  78.92 79.13 79.29 79.46 79.67 79.88 80.30 80.49 80.61 80.82
4  77.94 78.15 78.29 78.46 78.68 78.88 79.31 79.50 79.62 79.84
5  76.95 77.16 77.30 77.47 77.69 77.89 78.32 78.51 78.62 78.85


Our API access functions are intentionally simplistic. While we might be able to drop the (few) transformation steps for the query with more "rich" API access functions, the main parts of the work are defining the query correctly and transforming the result to a suitable form. Which I find is almost always the harder part when fetching data with this API.

An application example: Plotting life expectancy 1966-2017

OK, but isn't full access to all the data overkill? It may be if lots of the data are of little relevance to the point to be made. In this case, the 50+ years' perspective we get by using all the data can provide new information if we are used to shorter perspective. In particular, we can get more information on what changes and what does not, and the nature of the changes.

We plot curves of the life expectancy at 67, 75, 80, 85, 90 and 95 years for all, women and men. We add the regression lines, which fit quite well for most of the curves - the really notable exceptions being men 67 and 75 years, who had almost constant life expectancy before 1980, but have catched up during the last decades. We save the regression coefficients.

mkplot0 <- function(createPng=0) {

    pointsAndRegr <- function(eL,yrs=years,ages=c(67,eAges),col=1,newPlot=0,main="All") {
       rCoef <- matrix(0,nrow=length(ages),ncol=2) ; 
       for (i in 1:length(ages)) {   
        tS <-unlist(ages[i]+eL[(ages[i]+1),3:54]) ;
        rL  <- lm(tS~yrs) ;  regrL <- rL$coef ;       
        if (ages[i]==67)  plot(yrs,tS,type="l",ylim=c(75,100),lty=1,lwd=2,xlab="Year",ylab="Life expect",main=main,col=col)  
        else  points(yrs,tS,type="l",lty=2,lwd=2,col=col) ;
        abline(regrL[1],regrL[2],lty=4,col=col )
        rCoef[i,] <- regrL ;
       }
      rCoef  
    }

    computeCrossings <- function(rCoef) {
        nCat <- length(rCoef[,2]) ;
        for (i in 1:(nCat-1)) {
            xi <- (rCoef[i,1]-rCoef[i+1,1])/(rCoef[i+1,2]-rCoef[i,2])
            yi <- rCoef[i,1] + rCoef[i,2]*xi
            print(paste("xi: ",xi,"  yi: ",yi)) ;
        }
    }
    
    pointsAndCurves <- function(rCoef,ages=c(67,eAges),col=1,newPlot=0,main="Life expectancy increases") {
        if (newPlot==1) {
            plot(ages,100*rCoef[,2],ylim=c(0,10),xlab="Age",ylab="E(Increase 100 yrs)",main=main,col=col) ;
            points(ages,100*rCoef[,2],type="l",lty=4,col=col) ;
        }
        else  {
            points(ages,100*rCoef[,2],col=col) ;
            points(ages,100*rCoef[,2],type="l",lty=4,col=col) ;  
        }
    }

    relativePensions <- function(eL,yrs=years,d0=eLA[68,42],col=1,newPlot=0,main="Pension level with constant savings"  )  {
        if (newPlot==1) {
            plot(yrs,d0/eL[68,3:54],type="l",ylim=c(0.8,1.4),lty=1,lwd=2,xlab="Year",ylab="Relative pension level",main=main,col=col)
            abline(1,0)
        }
        else  points(yrs,d0/eL[68,3:54],type="l",lty=1,col=col,lwd=2)
    }
    
    X11(width=4,height=12)  ;
    if (createPng>0) png(file="life_expect_1.png",width=480,height=1200) ;
        
    par(mfrow=c(3,1)) ;  years <- 1966:2017 ;  eAges <- c(75,80,85,90,95) ;
  # Plot life  expectations for all   
    eLA <- eLTTable[eLTTable[,1]==0,] ;  rCoefA <- pointsAndRegr(eLA) ;
  # For men   
    eLM <- eLTTable[eLTTable[,1]==1,] ;    rCoefM <- pointsAndRegr(eLM,col=4,main="Men") ;
  # For women    
    eLF <- eLTTable[eLTTable[,1]==2,] ;     rCoefF <- pointsAndRegr(eLF,col=2,main="Women") ;
 
    if (createPng>0)  dev.off() ;
    else  X11(width=10,height=6) ;
    if (createPng>0) png(file="life_expect_2.png",width=800,height=480) ;
    par(mfrow=c(1,2)) ; 
  # 100 yr increases life expectancy  
    pointsAndCurves(rCoefA,col=1,newPlot=1) ;  pointsAndCurves(rCoefM,col=4) ;  pointsAndCurves(rCoefF,col=2) ;
  # Relative pensions 1966-2017
    relativePensions(eLA,col=1,newPlot=1) ; relativePensions(eLM,col=4) ; relativePensions(eLF,col=2) ;
    computeCrossings(rCoefA) ;
    if (createPng>0)  dev.off() ;
}


> mkplot0()
[1] "xi:  2088.83246231796   yi:  92.5295089875549"
[1] "xi:  2102.95157141019   yi:  93.5460896638841"
[1] "xi:  2134.54157204277   yi:  95.1915583982768"
[1] "xi:  2198.87945837088   yi:  97.2971644475882"
[1] "xi:  2321.47806129824   yi:  99.334490928148"
>

Trendline crossing: Inconsistency of extrapolations

Some reflection over the trendlines in fig 1 uncovers a basic problem with linear extrapolations. Because the trendline at 67 is clearly steeper than at 75, some time in the "future" those lines will cross, and from then, the expected life span for 67-year persons will be higher than for 75-year olds! That time is in 2088, at 92.5 years, as shown by the output of the computeCrossings(rCoefA) function. Crossings for higher ages will occur later, but eventually, all curves will cross.

The "optimistic" interpretation of this is that there is a future life expectancy increase to happen at high ages which does not manifest today, and there is every reason to expect that will occur. But we must expect the size of this increase to be quite modest (yearly life expectancy increase at 90 is <2/100 yrs, and fairly constant) and not at all to be able to fully counteract the trendline crossing problem. Therefore, the only sensible interpretation is that the rates of life expectancy increases at all but the highest ages probably will decrease over time. The question is not if, but when.

Fig 1. Life expectations 1966-2017. Curves for all, men (blue) and women (red), bottom to top, 67, 75, 80, 85, 90, 95 years of age, with regression lines.

Fig 2.
A: Increases in expected life span per 100 years by age 67-95, based on 1966-2017 regression lines. Curves for all, men (blue) and women (red). The average curve gets ever close to the female with age, because mostly women survive to high ages.
B: Relative levels of pension, based on expected life years at normal retirement age of 67, 1966-2017 and constant amount of funds. (The principle in the current Norwegian system.) Normalized relative to average expected life span at 67 in 2005. (When the system was designed.) Curves for all, men (blue) and women (red). The actual pensions are gender-neutral in Norway, so this illustrates the basic redistribution effect of that principle.

No comments:

Post a Comment