More JSON from Statistics Norway: Metadata, query creation and execution
In the last post we presented the basics of fetching data from the JSON-based API of Statistics Norway's data bank, and of transforming the results. The next steps are to look a bit more closely at metadata, and at some more tools for transforming the fetched data. We will not repeat much from the last post, but the full source code is available on Github.
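The three functions relied on from the last post are getMetaData(tableId), getQueryData99999() (which supplies a query template) and getJSONData(tableId,query). For readers who skipped that post, here is a minimal sketch of the first and the last, assuming the public SSB table endpoint; the originals in the Github source may differ in the details:
library(httr) ; library(rjstat)
# Sketch only - assumes the endpoint http://data.ssb.no/api/v0/no/table/
ssbTableUrl <- function(tableId) paste("http://data.ssb.no/api/v0/no/table/",tableId,sep="")
getMetaData <- function(tableId) {
  # GET the table metadata as a JSON text string
  content(GET(ssbTableUrl(tableId)), as="text")
}
getJSONData <- function(tableId,query) {
  # POST the JSON query and convert the json-stat response to a data frame
  resp <- POST(ssbTableUrl(tableId), body=query)
  fromJSONstat(content(resp, as="text"), naming="id")[[1]]
}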
Minimalist metadata handling
Together with the three functions presented in the last post, the following five functions constitute a bare-bones suite for handling metadata and data in the API. The first, getValuesAndLabels(tableId), returns the metadata for a given table as a list of R data frames, one per variable, each augmented with a column for variable selection information. When this information is filled in, the list can be handed to createQueryDF(metaDF) for creation of a JSON query. The generated query data frame then undergoes a little regexp patching in queryFromDF(df) to produce a runnable query. The function createQueryFromDF(markedValueLabelsDF) carries out these two steps, producing a query from a marked-up metadata frame. A frequently used special case is createQueryForAllVars(tableId), which returns a query for all the variables in the table.
Thus, the function calls
> t07902 <- getJSONData("07902",createQueryForAllVars("07902"))
> t04681 <- getJSONData("04681",createQueryForAllVars("04681"))
> t03013 <- getJSONData("03013",createQueryForAllVars("03013"))
will fetch all of tables 07902, 04681 and 03013, respectively. This way, we don't really need to handle the metadata in advance if we are content with getting, possibly, haystacks of data.
options(encoding="UTF-8")
# source("ssb_mortality_table_testing.R")
library(httr)
# rjstat is used for converting SSB JSON -> Data frame
library(rjstat)
# jsonlite is used mainly for converting metadata
library(jsonlite)
# reshape provides melt and cast, used below for reshaping the fetched data
library(reshape)
# Converts the JSON metadata of a table (with no subtable) to a list of data frames,
# one per variable, using jsonlite. Each frame holds the variable's values and value
# labels, plus a column "Slct" for selection marking.
# Parameter tableId: number of the SSB table
getValuesAndLabels <- function(tableId) {
  mDF <- fromJSON(getMetaData(tableId))
  varNms <- mDF[[2]][[1]] ; varNmb <- length(varNms)
  valAndLbl <- list()
  for (i in 1:varNmb) {
    xdfi <- data.frame(mDF[[2]][3][[1]][i], mDF[[2]][4][[1]][i], 0)
    names(xdfi) <- c(varNms[i], paste(varNms[i],"Label",sep=""), "Slct")
    valAndLbl[[varNms[i]]] <- xdfi
  }
  valAndLbl
}
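A quick way to inspect what getValuesAndLabels returns, using the mortality table 07902 as an example (the variable names are the ones that appear in pickMortalityVars further down; the exact printout is not reproduced here):
m07902 <- getValuesAndLabels("07902")
names(m07902)       # expected: "Kjonn" "AlderX" "ContentsCode" "Tid"
head(m07902$AlderX) # values, value labels and the Slct column, initially all 0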
# Creates a query data frame from the marked-up metadata frame metaDF
createQueryDF <- function(metaDF){
  srchDF <- fromJSON(getQueryData99999())   # template query structure (helper from the last post)
  srchDFq <- srchDF$query
  nVar <- length(metaDF) ; nSrchVar <- 0
  for (i in 1:nVar){
    nValues <- length(metaDF[[i]][,3])
    allSlct <- metaDF[[i]][1,3] ; nmbSlct <- sum(metaDF[[i]][,3])
    topSlct <- metaDF[[i]][nValues,3]
    if (allSlct==10 || nmbSlct>0) { # Variable included in the query
      nSrchVar <- nSrchVar + 1
      varNm <- names(metaDF[[i]])[1]
      srchDFq$code[nSrchVar] <- varNm
    }
    if (allSlct==10) { # All values
      srchDFq$selection[nSrchVar,1] <- "all"
      srchDFq$selection[nSrchVar,2] <- "*"
    }
    else if (nmbSlct>0 && topSlct<2) { # Some values, put them in a list
      mDi <- metaDF[[i]] ; mDiSlct <- mDi[mDi[,3]==1,1]
      mDiSlct <- paste("\"",mDiSlct,"\"",sep="")
      srchDFq$selection[nSrchVar,1] <- "item"
      srchDFq$selection[nSrchVar,2] <- paste(mDiSlct,collapse=",")
    }
    else if (topSlct>1) { # Newest topSlct values
      srchDFq$selection[nSrchVar,1] <- "top"
      srchDFq$selection[nSrchVar,2] <- as.character(topSlct)
    }
  }
  srchDF$query <- srchDFq[1:nSrchVar,]
  srchDF$response <- list(format="json-stat")
  srchDF
}
# Converts the query data frame (really a list/data frame hybrid) to JSON and patches
# the result with regexps, removing escape characters and the last [] pair, so that a
# correct, runnable query comes out.
queryFromDF <- function(df){
  jfJ0 <- toJSON(df)
  jfJ0 <- gsub("\\\\","",jfJ0)            # remove backslash escapes
  jfJ0 <- gsub("\\[\"\"","\\[\"",jfJ0)    # fix doubled quotes after [
  jfJ0 <- gsub("\"\"\\]","\"\\]",jfJ0)    # fix doubled quotes before ]
  jfJ0 <- gsub("\\]([^]]*)$","\\1",jfJ0)  # drop the last ]
  jfJ0 <- gsub("\\[([^[]*)$","\\1",jfJ0)  # drop the last [
  jfJ0
}
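For orientation, the patched string should end up on the standard PX-Web query format accepted by the SSB API. Roughly, a query with one "all values" variable and one "newest five" variable is expected to look like this (an illustration, not output copied from the API):
{"query":[{"code":"Kjonn","selection":{"filter":"all","values":["*"]}},
          {"code":"Tid","selection":{"filter":"top","values":["5"]}}],
 "response":{"format":"json-stat"}}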
# createQueryFromDF - creates a JSON query from marked-up metadata
# Parameter: markedValueLabelsDF - list of data frames marked for selection
createQueryFromDF <- function(markedValueLabelsDF){
  sDF <- createQueryDF(markedValueLabelsDF)
  queryFromDF(sDF)
}
# createQueryForAllVars - creates a query for an entire table
createQueryForAllVars <- function(tableId) {
  vlDF <- getValuesAndLabels(tableId)
  nVar <- length(vlDF)
  for (i in 1:nVar) vlDF[[i]][1,3] <- 10   # mark "all values" for every variable
  createQueryFromDF(vlDF)
}
Example of use on the mortality table
Extracting one of the four mortality table variables (qx, dx, lx, ex) for all available years can be done by setting up a query limited only on ContentsCode - which is what the function pickMortalityVars(mdVL,contCode) does. It is called by the function getMortalityData(contCode), which returns a data frame with values for all years for one variable, e.g. life expectancy (ex).
The next step is to transform the data frame to a suitable format. A table with N+2 columns - sex, age and data for N years - is handy, but quite a bit of looping is needed if we develop the transformation from scratch. This is a case for the reshape package; its functions melt and cast can quite easily be employed here. They are used in transformToColumnsData(eData), where the variables are "melted" by sex, age and time, and "cast" according to sex+age~time. After some adaptation of the columns, the result is exactly what we want.
Our API access functions are intentionally simplistic. With "richer" access functions we might be able to drop the (few) transformation steps around the query, but the main parts of the work are defining the query correctly and transforming the result to a suitable form - which, in my experience, is almost always the harder part of fetching data with this API.
pickMortalityVars <- function(mdVL,contCode=3){
  mdVL$Kjonn[1,3] <- 10              # All sexes
  mdVL$AlderX[1,3] <- 10             # All ages
  mdVL$ContentsCode[contCode,3] <- 1 # Only the chosen content variable
  mdVL$Tid[1,3] <- 10                # All years
  mdVL
}
transformToColumnsData <- function(eData){
  eData$expLT <- eData$value    # To avoid default naming collision on "value"
  eData$value <- NULL           # Drop value column
  meltExpectation <- melt(eData,id=c("Kjonn","AlderX","Tid"))
  meltExpectation$expLT <- NULL # Drop trivial column
  colExp <- cast(meltExpectation,Kjonn+AlderX~Tid)
  colExp
}
getMortalityData <- function(contCode){
  metaData <- getValuesAndLabels("07902")
  mdVL <- pickMortalityVars(metaData,contCode=contCode)
  eQuery <- createQueryFromDF(mdVL)
  eData <- getJSONData("07902",eQuery)
  eData$ContentsCode <- NULL    # Drop trivial column
  eData
}
getMortalityDataTable <- function(contCode){
  eData <- getMortalityData(contCode)
  transformToColumnsData(eData)
}
getExpectedLifeTimeTable <- function(){
  getMortalityDataTable(3)      # ContentsCode no. 3: expected remaining lifetime
}
> ex19662017 <- getExpectedLifeTimeTable()
> ex19662017[1:5,45:54]
2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
1 80.68 80.86 81.04 81.25 81.45 81.66 82.09 82.28 82.42 82.63
2 79.90 80.12 80.26 80.44 80.65 80.86 81.29 81.47 81.59 81.81
3 78.92 79.13 79.29 79.46 79.67 79.88 80.30 80.49 80.61 80.82
4 77.94 78.15 78.29 78.46 78.68 78.88 79.31 79.50 79.62 79.84
5 76.95 77.16 77.30 77.47 77.69 77.89 78.32 78.51 78.62 78.85
An application example: Plotting life expectancy 1966-2017
OK, but isn't full access to all the data overkill? It may be, if much of the data is of little relevance to the point to be made. In this case, the 50+ year perspective we get by using all the data can provide new information if we are used to a shorter perspective. In particular, we can learn more about what changes and what does not, and about the nature of the changes.
We plot curves of the life expectancy at 67, 75, 80, 85, 90 and 95 years for all, women and men. We add the regression lines, which fit quite well for most of the curves - the really notable exceptions being men at 67 and 75 years, who had almost constant life expectancy before 1980 but have caught up during the last decades. We save the regression coefficients.
# mkplot0 uses eLTTable, the expected lifetime table computed above
eLTTable <- getExpectedLifeTimeTable()
mkplot0 <- function(createPng=0) {
  # Plot total expected life span (age + remaining expectancy) with regression lines
  pointsAndRegr <- function(eL,yrs=years,ages=c(67,eAges),col=1,newPlot=0,main="All") {
    rCoef <- matrix(0,nrow=length(ages),ncol=2)
    for (i in 1:length(ages)) {
      tS <- unlist(ages[i]+eL[(ages[i]+1),3:54])
      rL <- lm(tS~yrs) ; regrL <- rL$coef
      if (ages[i]==67) plot(yrs,tS,type="l",ylim=c(75,100),lty=1,lwd=2,xlab="Year",ylab="Life expect",main=main,col=col)
      else points(yrs,tS,type="l",lty=2,lwd=2,col=col)
      abline(regrL[1],regrL[2],lty=4,col=col)
      rCoef[i,] <- regrL
    }
    rCoef
  }
  # Print the years where consecutive regression lines cross
  computeCrossings <- function(rCoef) {
    nCat <- length(rCoef[,2])
    for (i in 1:(nCat-1)) {
      xi <- (rCoef[i,1]-rCoef[i+1,1])/(rCoef[i+1,2]-rCoef[i,2])
      yi <- rCoef[i,1] + rCoef[i,2]*xi
      print(paste("xi: ",xi," yi: ",yi))
    }
  }
  # Plot the regression slopes (increase per 100 years) by age
  pointsAndCurves <- function(rCoef,ages=c(67,eAges),col=1,newPlot=0,main="Life expectancy increases") {
    if (newPlot==1) {
      plot(ages,100*rCoef[,2],ylim=c(0,10),xlab="Age",ylab="E(Increase 100 yrs)",main=main,col=col)
      points(ages,100*rCoef[,2],type="l",lty=4,col=col)
    }
    else {
      points(ages,100*rCoef[,2],col=col)
      points(ages,100*rCoef[,2],type="l",lty=4,col=col)
    }
  }
  # Pension level with constant savings, relative to expected remaining years at 67 in 2005
  relativePensions <- function(eL,yrs=years,d0=eLA[68,42],col=1,newPlot=0,main="Pension level with constant savings") {
    if (newPlot==1) {
      plot(yrs,d0/eL[68,3:54],type="l",ylim=c(0.8,1.4),lty=1,lwd=2,xlab="Year",ylab="Relative pension level",main=main,col=col)
      abline(1,0)
    }
    else points(yrs,d0/eL[68,3:54],type="l",lty=1,col=col,lwd=2)
  }
  X11(width=4,height=12)
  if (createPng>0) png(file="life_expect_1.png",width=480,height=1200)
  par(mfrow=c(3,1)) ; years <- 1966:2017 ; eAges <- c(75,80,85,90,95)
  # Plot life expectations for all
  eLA <- eLTTable[eLTTable[,1]==0,] ; rCoefA <- pointsAndRegr(eLA)
  # For men
  eLM <- eLTTable[eLTTable[,1]==1,] ; rCoefM <- pointsAndRegr(eLM,col=4,main="Men")
  # For women
  eLF <- eLTTable[eLTTable[,1]==2,] ; rCoefF <- pointsAndRegr(eLF,col=2,main="Women")
  if (createPng>0) dev.off()
  else X11(width=10,height=6)
  if (createPng>0) png(file="life_expect_2.png",width=800,height=480)
  par(mfrow=c(1,2))
  # 100 yr increases in life expectancy
  pointsAndCurves(rCoefA,col=1,newPlot=1) ; pointsAndCurves(rCoefM,col=4) ; pointsAndCurves(rCoefF,col=2)
  # Relative pensions 1966-2017
  relativePensions(eLA,col=1,newPlot=1) ; relativePensions(eLM,col=4) ; relativePensions(eLF,col=2)
  computeCrossings(rCoefA)
  if (createPng>0) dev.off()
}
> mkplot0()
[1] "xi: 2088.83246231796 yi: 92.5295089875549"
[1] "xi: 2102.95157141019 yi: 93.5460896638841"
[1] "xi: 2134.54157204277 yi: 95.1915583982768"
[1] "xi: 2198.87945837088 yi: 97.2971644475882"
[1] "xi: 2321.47806129824 yi: 99.334490928148"
>
Trendline crossing: Inconsistency of extrapolations
Some reflection over the trendlines in Fig 1 uncovers a basic problem with linear extrapolations. Because the trendline at 67 is clearly steeper than at 75, some time in the "future" those lines will cross, and from then on the expected life span for 67-year-olds would be higher than for 75-year-olds! That time is 2088, at 92.5 years, as shown by the output of the computeCrossings(rCoefA) function (two trendlines a1 + b1*t and a2 + b2*t cross at t = (a1-a2)/(b2-b1)). Crossings for higher ages will occur later, but eventually, all curves will cross.
The "optimistic" interpretation of this is that there is a future life expectancy increase to happen at high ages which does not manifest today, and there is every reason to expect that will occur. But we must expect the size of this increase to be quite modest (yearly life expectancy increase at 90 is <2/100 yrs, and fairly constant) and not at all to be able to fully counteract the trendline crossing problem. Therefore, the only sensible interpretation is that the rates of life expectancy increases at all but the highest ages probably will decrease over time. The question is not if, but when.
Fig 1. Life expectancies 1966-2017. Curves for all, men (blue) and women (red); bottom to top: 67, 75, 80, 85, 90, 95 years of age, with regression lines.
Fig 2. A: Increases in expected life span per 100 years by age 67-95, based on the 1966-2017 regression lines. Curves for all, men (blue) and women (red). The average curve gets ever closer to the female curve with age, because mostly women survive to high ages. B: Relative pension levels, based on expected life years at the normal retirement age of 67, 1966-2017, and a constant amount of funds (the principle in the current Norwegian system). Normalized relative to the average expected life span at 67 in 2005 (when the system was designed). Curves for all, men (blue) and women (red). The actual pensions are gender-neutral in Norway, so this illustrates the basic redistribution effect of that principle.