Wednesday, 21 March 2018

More JSON from Statistics Norway: Metadata, query creation and execution

We have presented the basics of fetching data with the JSON-based API of Statistics Norway's data bank, and of transforming the results. The next step is to look a bit more closely at metadata, and at more tools for transforming the fetched data. We will not repeat much from the last post, but the full source code is available on Github.

Minimalist metadata handling

Together with the three functions presented in the last post, the following five functions constitute a bare-bones suite for handling metadata and data in the API. The first, getValuesAndLabels(tableId), returns the metadata for a given table as a list of R data frames, one per variable, each augmented with a column ("Slct") for selection marking.

When this selection information is filled in, the marked-up list can be handed to createQueryDF(metaDF) to create a JSON query: a query data frame is generated, and it then undergoes a little regexp patching in queryFromDF(df) to produce a runnable query.

The function createQueryFromDF(markedValueLabelsDF) carries out both of these steps, producing a query from a marked-up metadata list. A frequently used special case is createQueryForAllVars(tableId), which returns a query for all the variables in a table.
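To make the marking convention concrete, here is a small hypothetical walk-through for table 07902 (used later in this post). The rules are read directly from createQueryDF: a 10 in the first row of Slct selects all values of a variable, a 1 marks an individual value, and a number greater than 1 in the last row asks for that many of the newest periods.

md <- getValuesAndLabels("07902")   # one data frame per variable, each with a Slct column
md$Kjonn[1,3] <- 10                 # 10 in the first row: all values of Kjonn
md$AlderX[1,3] <- 10                # all ages
md$ContentsCode[3,3] <- 1           # 1 marks a single value (the third contents code, life expectancy)
md$Tid[nrow(md$Tid),3] <- 5         # >1 in the last row: the 5 newest years
qry <- createQueryFromDF(md)        # JSON query string, ready for getJSONData("07902",qry)

For whole tables, createQueryForAllVars(tableId) does all of this in one call.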

Thus, the function calls

> t07902 <- getJSONData("07902",createQueryForAllVars("07902"))
> t04681 <- getJSONData("04681",createQueryForAllVars("04681"))
> t07902 <- getJSONData("03013",createQueryForAllVars("03013"))
will fetch the complete tables 07902, 04681 and 03013, respectively. This way, we do not really need to handle the metadata in advance, if we are content with possibly getting haystacks of data.

options(encoding="UTF-8")
# source("ssb_mortality_table_testing.R")
library(httr)
# rjstat is used for converting SSB JSON -> Data frame
library(rjstat)
# jsonlite is used mainly for converting metadata 
library(jsonlite)

# From JSON metadata table structure w/no subtable converted to dataframe by jsonlite
# Adds column "Slct" for selection marking, Parameter tableId: Number of SB-table

getValuesAndLabels <- function(tableId) {
    mDF <- fromJSON(getMetaData(tableId))
    varNms <- mDF[[2]][[1]] ; varLbls <- mDF[[2]][[2]] ; varNmb <- length(varNms) ;
    valAndLbl <- list()
    for (i in 1:varNmb) {
        xdfi <- data.frame(mDF[[2]][3][[1]][i],  mDF[[2]][4][[1]][i],0)
        names(xdfi) <- c(varNms[i],paste(varNms[i],"Label",sep=""),"Slct")
        valAndLbl[[varNms[i]]] <- xdfi
    }
    valAndLbl
}

#  Creates search data frame from marked-up meta data frame metaDF 

createQueryDF <- function(metaDF){
    srchDF <- fromJSON(getQueryData99999())  # getQueryData99999() (in the full source on Github) returns a template query used as a skeleton
    srchDFq  <-  srchDF$query
    nQueryRows <- length(srchDFq$code) ;
    nVar <- length(metaDF) ; nSrchVar <- 0 ;
    
    for (i in 1:nVar){
        nValues <-  length(metaDF[[i]][,3]) ;
        allSlct <- metaDF[[i]][1,3] ; nmbSlct <-  sum(metaDF[[i]][,3]) ;
        topSlct <- metaDF[[i]][nValues,3]
        if (allSlct==10 || nmbSlct>0) {   # Variable included in search
            nSrchVar <-  nSrchVar + 1 ;
            varNm <- names(metaDF[[i]])[1] ;
            srchDFq$code[nSrchVar] <- varNm ;
        }
        if (allSlct==10) {  # All
            srchDFq$selection[nSrchVar,1] <- "all" ;
            srchDFq$selection[nSrchVar,2] <- "*" ;
        }
        else if (nmbSlct>0 && topSlct<2) { # Some values, put them in a list
            mDi <- metaDF[[i]] ; mDiSlct <- mDi[mDi[,3]==1,1] ;
            mDiSlct <- paste("\"",mDiSlct,"\"",sep="")
            srchDFq$selection[nSrchVar,1] <- "item" ;
            srchDFq$selection[nSrchVar,2] <- paste(mDiSlct,collapse=",") ;
         }
        else if (topSlct>1) { # Newest topSlct values
            srchDFq$selection[nSrchVar,1] <- "top" ;
            srchDFq$selection[nSrchVar,2] <- as.character(topSlct) ;
        }
    }
    srchDF$query <- srchDFq[1:nSrchVar,]
    # print(length(srchDF$query$code))
    fjL <- list(format="json-stat")
    srchDF$response<-fjL
    srchDF
}

# Patch the generated query by removing the last [] pair
# This way, we can create a correct query from a data frame (really list/frame hybrid) we generate

queryFromDF <- function(df){
    jfJ0 <- toJSON(df)
    jfJ0 <- gsub("\\\\","",jfJ0)
    jfJ0 <- gsub("\\[\"\"","\\[\"",jfJ0)
    jfJ0 <- gsub("\"\"\\]","\"\\]",jfJ0)
    jfJ0 <- gsub("\\]([^]]*)$","\\1",jfJ0)
    jfJ0 <- gsub("\\[([^[]*)$","\\1",jfJ0)
    jfJ0
}    

#  createQueryFromDF - Creates  json search query from marked-up metadata 
#  Parameter: markedValueLabelsDF - list of data frames marked for selection

createQueryFromDF <- function(markedValueLabelsDF){
   sDF <- createQueryDF(markedValueLabelsDF)
   queryFromDF(sDF)
}

# createQueryForAllVars - Creates a query for an entire table 

createQueryForAllVars <- function(tableId)  {
    vlDF <- getValuesAndLabels(tableId) 
    nVar <- length(vlDF) ; 
    for (i in 1:nVar) vlDF[[i]][1,3] <- 10 ;
    createQueryFromDF(vlDF)
}

Example of use on the mortality table

Extracting one of the four mortality table variables (qx, dx, lx, ex) for all available years can be done by setting up a query restricted only on ContentsCode - which is what the function pickMortalityVars(mdVL,contCode) does. It is called by the function getMortalityData(contCode), which returns a data frame with all years' values for one variable, e.g. expected remaining lifetime (ex).

The next step is to transform the data frame to a suitable format. A table with N+2 columns - Sex, Age and data for N years - is handy, but quite a bit of looping would be needed to develop the transformation from scratch. This is a case for the reshape package: the functions melt and cast can quite easily be employed here. They are used in transformToColumnsData(eData), where the variables are "melted" by sex, age and time, and "cast" according to sex+age~time. After some adaptation of the columns, the result is exactly what we want.

pickMortalityVars <- function(mdVL,contCode=3){
    mdVL$Kjonn[1,3] <- 10 ; # All
    mdVL$AlderX[1,3] <- 10 ; # All
    mdVL$ContentsCode[contCode,3] <- 1 ; 
    mdVL$Tid[1,3] <- 10 ; # All
    mdVL
}

library(reshape)  # melt() and cast() come from the reshape package

transformToColumnsData <- function(eData){
    eData$expLT <- eData$value # To avoid default naming collision on "value"
    eData$value <- NULL ;  # Drop value column
    meltExpectation <- melt(eData,id=c("Kjonn","AlderX","Tid")) ;
    meltExpectation$expLT <- NULL ;  # Drop trivial column
    colExp <- cast(meltExpectation,Kjonn+AlderX~Tid)
    colExp 
}

getMortalityData <- function(contCode){
    metaData <- getValuesAndLabels("07902") ;     
    mdVL <- pickMortalityVars(metaData,contCode=contCode) ;
    eQuery <- createQueryFromDF(mdVL) ;
    eData <- getJSONData("07902",eQuery)
    eData$ContentsCode <- NULL # Drop trivial column
    eData
}

getMortalityDataTable <- function(contCode){
    eData <- getMortalityData(contCode)
    transformToColumnsData(eData)
}

getExpectedLifeTimeTable <- function(){
    getMortalityDataTable(3)
}

> ex19662017 <- getExpectedLifeTimeTable()  
> ex19662017[1:5,45:54]
    2008  2009  2010  2011  2012  2013  2014  2015  2016  2017
1  80.68 80.86 81.04 81.25 81.45 81.66 82.09 82.28 82.42 82.63
2  79.90 80.12 80.26 80.44 80.65 80.86 81.29 81.47 81.59 81.81
3  78.92 79.13 79.29 79.46 79.67 79.88 80.30 80.49 80.61 80.82
4  77.94 78.15 78.29 78.46 78.68 78.88 79.31 79.50 79.62 79.84
5  76.95 77.16 77.30 77.47 77.69 77.89 78.32 78.51 78.62 78.85


Our API access functions are intentionally simplistic. While we might be able to drop the (few) query transformation steps with "richer" API access functions, the main parts of the work are defining the query correctly and transforming the result to a suitable form - which, I find, is almost always the harder part of fetching data with this API.

An application example: Plotting life expectancy 1966-2017

OK, but isn't full access to all the data overkill? It may be, if much of the data has little relevance to the point to be made. In this case, the 50+ year perspective we get by using all the data can provide new information if we are used to a shorter perspective. In particular, we get more information on what changes and what does not, and on the nature of the changes.

We plot curves of the life expectancy at 67, 75, 80, 85, 90 and 95 years for all, women and men. We add the regression lines, which fit quite well for most of the curves - the really notable exceptions being men at 67 and 75 years, who had an almost constant life expectancy before 1980 but have caught up during the last decades. We save the regression coefficients.

mkplot0 <- function(createPng=0) {

    pointsAndRegr <- function(eL,yrs=years,ages=c(67,eAges),col=1,newPlot=0,main="All") {
       rCoef <- matrix(0,nrow=length(ages),ncol=2) ; 
       for (i in 1:length(ages)) {   
        tS <-unlist(ages[i]+eL[(ages[i]+1),3:54]) ;
        rL  <- lm(tS~yrs) ;  regrL <- rL$coef ;       
        if (ages[i]==67)  plot(yrs,tS,type="l",ylim=c(75,100),lty=1,lwd=2,xlab="Year",ylab="Life expect",main=main,col=col)  
        else  points(yrs,tS,type="l",lty=2,lwd=2,col=col) ;
        abline(regrL[1],regrL[2],lty=4,col=col )
        rCoef[i,] <- regrL ;
       }
      rCoef  
    }

    computeCrossings <- function(rCoef) {
        nCat <- length(rCoef[,2]) ;
        for (i in 1:(nCat-1)) {
            xi <- (rCoef[i,1]-rCoef[i+1,1])/(rCoef[i+1,2]-rCoef[i,2])
            yi <- rCoef[i,1] + rCoef[i,2]*xi
            print(paste("xi: ",xi,"  yi: ",yi)) ;
        }
    }
    
    pointsAndCurves <- function(rCoef,ages=c(67,eAges),col=1,newPlot=0,main="Life expectancy increases") {
        if (newPlot==1) {
            plot(ages,100*rCoef[,2],ylim=c(0,10),xlab="Age",ylab="E(Increase 100 yrs)",main=main,col=col) ;
            points(ages,100*rCoef[,2],type="l",lty=4,col=col) ;
        }
        else  {
            points(ages,100*rCoef[,2],col=col) ;
            points(ages,100*rCoef[,2],type="l",lty=4,col=col) ;  
        }
    }

    relativePensions <- function(eL,yrs=years,d0=eLA[68,42],col=1,newPlot=0,main="Pension level with constant savings"  )  {
        if (newPlot==1) {
            plot(yrs,d0/eL[68,3:54],type="l",ylim=c(0.8,1.4),lty=1,lwd=2,xlab="Year",ylab="Relative pension level",main=main,col=col)
            abline(1,0)
        }
        else  points(yrs,d0/eL[68,3:54],type="l",lty=1,col=col,lwd=2)
    }
    
    X11(width=4,height=12)  ;
    if (createPng>0) png(file="life_expect_1.png",width=480,height=1200) ;
        
    par(mfrow=c(3,1)) ;  years <- 1966:2017 ;  eAges <- c(75,80,85,90,95) ;
  # Plot life  expectations for all   
    eLA <- eLTTable[eLTTable[,1]==0,] ;  rCoefA <- pointsAndRegr(eLA) ;
  # For men   
    eLM <- eLTTable[eLTTable[,1]==1,] ;    rCoefM <- pointsAndRegr(eLM,col=4,main="Men") ;
  # For women    
    eLF <- eLTTable[eLTTable[,1]==2,] ;     rCoefF <- pointsAndRegr(eLF,col=2,main="Women") ;
 
    if (createPng>0)  dev.off() ;
    else  X11(width=10,height=6) ;
    if (createPng>0) png(file="life_expect_2.png",width=800,height=480) ;
    par(mfrow=c(1,2)) ; 
  # 100 yr increases life expectancy  
    pointsAndCurves(rCoefA,col=1,newPlot=1) ;  pointsAndCurves(rCoefM,col=4) ;  pointsAndCurves(rCoefF,col=2) ;
  # Relative pensions 1966-2017
    relativePensions(eLA,col=1,newPlot=1) ; relativePensions(eLM,col=4) ; relativePensions(eLF,col=2) ;
    computeCrossings(rCoefA) ;
    if (createPng>0)  dev.off() ;
}


> mkplot0()
[1] "xi:  2088.83246231796   yi:  92.5295089875549"
[1] "xi:  2102.95157141019   yi:  93.5460896638841"
[1] "xi:  2134.54157204277   yi:  95.1915583982768"
[1] "xi:  2198.87945837088   yi:  97.2971644475882"
[1] "xi:  2321.47806129824   yi:  99.334490928148"
>

Trendline crossing: Inconsistency of extrapolations

Some reflection over the trendlines in fig 1 uncovers a basic problem with linear extrapolations. Because the trendline at 67 is clearly steeper than the one at 75, at some time in the "future" those lines will cross, and from then on the expected life span for 67-year-olds would be higher than for 75-year-olds! That time is 2088, at 92.5 years, as shown by the output of the computeCrossings(rCoefA) function. Crossings for higher ages occur later, but eventually all the curves cross.
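For reference, the crossing point printed by computeCrossings is just the intersection of two fitted lines a1 + b1*t and a2 + b2*t; a minimal stand-alone version of that calculation (a hypothetical helper, not part of the code above) is:

lineCrossing <- function(a1,b1,a2,b2) {
    t <- (a1-a2)/(b2-b1)          # solve a1 + b1*t == a2 + b2*t for t
    c(year=t, lifeExpectancy=a1 + b1*t)
}

Plugging in the fitted coefficients for ages 67 and 75 reproduces the first line of the output above (year about 2088.8, at about 92.5 years).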

The "optimistic" interpretation of this is that there is a future life expectancy increase to happen at high ages which does not manifest today, and there is every reason to expect that will occur. But we must expect the size of this increase to be quite modest (yearly life expectancy increase at 90 is <2/100 yrs, and fairly constant) and not at all to be able to fully counteract the trendline crossing problem. Therefore, the only sensible interpretation is that the rates of life expectancy increases at all but the highest ages probably will decrease over time. The question is not if, but when.

Fig 1. Life expectancy 1966-2017. Curves for all, men (blue) and women (red); bottom to top: 67, 75, 80, 85, 90 and 95 years of age, with regression lines.

Fig 2.
A: Increases in expected life span per 100 years by age 67-95, based on the 1966-2017 regression lines. Curves for all, men (blue) and women (red). The average curve gets ever closer to the female curve with age, because it is mostly women who survive to the highest ages.
B: Relative levels of pension, based on expected life years at the normal retirement age of 67, 1966-2017, and a constant amount of funds (the principle of the current Norwegian system). Normalized relative to the average expected life span at 67 in 2005, when the system was designed. Curves for all, men (blue) and women (red). Actual pensions are gender-neutral in Norway, so this illustrates the basic redistribution effect of that principle.

Tuesday, 20 March 2018

Accessing Norwegian official statistical tables using the JSON API from Statistics Norway with R

JSON (JavaScript Object Notation) has become a de facto standard for data exchange, but its structuring may in some cases be difficult to handle. Therefore a "light" version especially suited for statistical data, called JSON-stat, has become quite popular for many applications. Among others, Statistics Norway chose it as the primary delivery format for data from Statistikkbanken, the "bank of statistical data", also using JSON as the query format.

Basically, a JSON query is HTTP POSTed to the API, using the table number as identification, and a JSON-stat formatted response is returned. A GET request to the same address provides a version of the metadata for the table. The data may then be converted to R data frames etc.

With the necessary R libraries installed and loaded, all operations can quite easily be performed from the R command line, and the standard user interface for Statistikkbanken can provide the JSON query needed. For routine use, though, it is in most cases far easier to use a standardized set of functions for the basic data traffic, and one such small and simple set is presented here.

Basically, just two functions are normally used: getMetaData(tableId) for fetching table descriptions, and getJSONData(tableId,query,naming) for submitting a query and converting the response to an R data frame. A third function is provided for access to the raw data returned.

options(encoding="UTF-8")
library(httr)
# rjstat is used for converting SSB JSON -> Data frame
library(rjstat)
# jsonlite is used mainly for converting metadata 
library(jsonlite)
#source("ssb-json-functions.R")

# Fetches the meta data for a table, with number tableId, as a JSON structure
getMetaData <- function(tableId) {
  getUrl <- paste("http://data.ssb.no/api/v0/no/table/",tableId,sep="")
  d.tmp<-GET(getUrl)
  content(d.tmp,"text")
}

# getRawJSONData - Fetches SB-data by POST-request  tableId: Number of SB-table queryData: JSON-formatted query 
getRawJSONData <- function(tableId,queryData) {
#  POST query request
   d.tmp <- POST(paste("http://data.ssb.no/api/v0/no/table/",tableId,sep=""), body = queryData, encode = "json", verbose())
# Returns contents of d.tmp as JSON-formatted text 
   content(d.tmp, "text")
}

# getJSONData - Fetches SB-data by POST-request. Returns a data table for further processing
getJSONData <- function(tableId,queryData,naming="id") {
# Fetches the content processed by fromJSONstat
   sbtable <- fromJSONstat(getRawJSONData(tableId,queryData),naming=naming)
# Only dataset is used from sbtable and returned
   ds <- sbtable[[1]]
   ds
}

If no handling of metadata is to be performed, the two packages httr and rjstat will suffice, but for more flexible use of data and metadata jsonlite comes in handy. I have chosen to leave the verbose setting in the HTTP POST request "on", as it provides a bit more insight into what is happening.

Please note that at Statistics Norway, a new package ApiData by researcher Øyvind Langsrud, with a single user interface function ApiData, is now (March 2018) in internal use for these purposes. By the time you read this, it may be publicly available, and its use will be demonstrated in subsequent posts.

A typical JSON query and its execution

We will use a rather large table as an example: 07902, mortality data 1966-2017 (as of March 2018). The data are given by gender (variable Kjonn), age (AlderX), time (Tid) and four different contents types (ContentsCode).

I have wrapped the query in an R function rather than putting it into an R variable, as I have found it slightly easier to maintain and develop that way. YMMV. This query will retrieve the whole table, using the selection type "selection": {"filter": "all","values":["*"] }. The query has a query part and a response part - the latter is rather trivial, "response": {"format": "json-stat"}, and common to all json-stat queries on the site. To retrieve the whole table it is not necessary to specify all the variables, but I find complete specifications safer and, over time, simpler to work with - and they are easier to narrow down.

getQueryData07902 <- function() {
'{
  "query": [
    {
      "code": "Kjonn",
      "selection": { "filter": "all", "values": ["*"] }
    },
    {
      "code": "AlderX",
      "selection": { "filter": "all", "values": ["*"] }
    },
    {
      "code": "ContentsCode",
      "selection": { "filter": "all", "values": ["*"] }
    },
    {
      "code": "Tid",
      "selection": { "filter": "all", "values": ["*"] }
    }
  ],
  "response": {
    "format": "json-stat"
  }
}'
}
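The query above asks for everything, but the API also accepts narrower selections: an explicit list of values ("item") or the newest periods ("top") - the same forms the query-building functions in the newer post above generate. A hypothetical, untested variant limited to life expectancy for the five newest years might look like this:

getQueryData07902Top5 <- function() {
'{
  "query": [
    { "code": "Kjonn",        "selection": { "filter": "all",  "values": ["*"] } },
    { "code": "AlderX",       "selection": { "filter": "all",  "values": ["*"] } },
    { "code": "ContentsCode", "selection": { "filter": "item", "values": ["ForvGjenLevetid"] } },
    { "code": "Tid",          "selection": { "filter": "top",  "values": ["5"] } }
  ],
  "response": { "format": "json-stat" }
}'
}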

For repeated execution, the POST request may be wrapped into a function, returning a quite long R data frame:

getAllMortalityData07902 <- function(){
    getJSONData("07902",getQueryData07902())
}

> t07902 <- getAllMortalityData07902()
> length(t07902[,1])
[1] 66768

The first few lines look as follows, with the actual data in the value variable:

> t07902[1:5,]
   Kjonn AlderX    ContentsCode  Tid value
1      0    000 LevendePerTusen 1966 1e+05
2      0    000 LevendePerTusen 1967 1e+05
3      0    000 LevendePerTusen 1968 1e+05
4      0    000 LevendePerTusen 1969 1e+05
5      0    000 LevendePerTusen 1970 1e+05

Before we can use this, we must filter and transform it. This can be done in many ways, and many tools are available. We shall start out quite simply, but first we save the data frame, to become independent of the connection. We also load it again, to make sure the saving works; it is read from file into an object with the original name, here t07902.

> save(t07902,file="../data/t07902-2018.RData")
> load("../data/t07902-2018.RData")

Because all the transformations necessary for later use can be gathered in a single R function, there is generally not much to gain by transforming before saving - unless the data are to be used in only one way.

Simple transformations

To give a fairly simple example of working from scratch, we will reconstruct a version of the 2017 mortality table from our data frame. We first filter out the 2017 data, then pick the four variables according to their ContentsCode values. Next we prepare a data frame with the necessary background data, and merge it with the variables to create a mortality table data frame. We might also omit Tid here, as it is trivially 2017, but keeping it is a precautionary measure for identification when handling several periods at once.

One basic way, step by step

In the data frame, the data are sorted by sex and age, but it might be a bit risky to rely solely on this implicit order instead of filtering data by values.

We delete columns by setting them to NULL: b2017$value <- NULL ; b2017$ContentsCode <- NULL ;

> t2017 <-  t07902[t07902$Tid==2017,]
> table(t2017$ContentsCode)

             Dode Dodssannsynlighet   ForvGjenLevetid   LevendePerTusen 
              321               321               321               321 
> 
> lx <-  t2017[t2017$ContentsCode=="LevendePerTusen",5]
> ex <-  t2017[t2017$ContentsCode=="ForvGjenLevetid",5]
> qx <-  t2017[t2017$ContentsCode=="Dodssannsynlighet",5]
> dx <-  t2017[t2017$ContentsCode=="Dode",5]
> t2017d <-  t2017[t2017$ContentsCode=="Dode",]
> b2017 <- t2017d ;
> b2017$value <- NULL ; b2017$ContentsCode <- NULL ;
> dt2017 <- data.frame(b2017,lx,qx,dx,ex)

> dt2017[1:5,]
    Kjonn AlderX  Tid     lx    qx  dx    ex
104     0    000 2017 100000 2.236 224 82.63
312     0    001 2017  99776 0.150  15 81.81
520     0    002 2017  99761 0.165  16 80.82
728     0    003 2017  99745 0.115  11 79.84
936     0    004 2017  99733 0.113  11 78.85

>

We can make a simple plot, for example of life expectancy, to check the transformations and look at the data.

> dt2017All <- dt2017[dt2017$Kjonn==0,]
> dt2017M <- dt2017[dt2017$Kjonn==1,]
> dt2017F <- dt2017[dt2017$Kjonn==2,]
> plot(as.numeric(dt2017All$AlderX),dt2017All$ex)
> points(as.numeric(dt2017All$AlderX),dt2017M$ex,type="l",col=4)
> points(as.numeric(dt2017All$AlderX),dt2017F$ex,type="l",col=2)

Making a function of it

This exercise is hardly something you want to repeat many times to get mortality tables for several years, or whenever new data become available. But by making a function of it, with year and data frame as parameters, handling becomes fairly efficient: provided the data structure does not change, updating or creating analyses for a new year can be done quickly, and with little thought about the data formats and transformations involved.

mortTableYear <- function(year,df=t07902) {
    tYr <-  df[df$Tid==year,] ; tYrD <-  tYr[tYr$ContentsCode=="Dode",] ;
    bYr <- tYrD ; bYr$value <- NULL ; bYr$ContentsCode <- NULL ; bYr$Tid <- NULL ;
    cCodes <- c("Dode","Dodssannsynlighet","ForvGjenLevetid","LevendePerTusen") 
    mData <- matrix(0,nrow=length(bYr[,1]),ncol=4)
    for (i in 1:4) mData[,i] <- tYr[tYr$ContentsCode==cCodes[i],5] 
    dfYr <- data.frame(bYr,mData)
    names(dfYr) = c("Kjonn","AlderX","dx","qx","ex","lx") ;
    dfYr 
}

> m2016 <- mortTableYear(2016)
> m2016[1:5,]
     Kjonn AlderX  dx    qx    ex     lx
103      0    000 216 2.158 82.42 100000
311      0    001  18 0.183 81.59  99784
519      0    002  13 0.132 80.61  99766
727      0    003   5 0.049 79.62  99753
935      0    004   5 0.048 78.62  99748


Generally, there is little reason to work with an awkward organization of the R data if the problem at hand is not a one-off case. In most cases, transformations are relatively easy to set up and apply. For example, we don't need to decide whether to organize the data by years or by variables - we can do both, as sketched below.
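As a small illustration of the "by variable" organization, a hypothetical helper (not part of the code above) could pull one ContentsCode across all years straight from t07902:

# One variable (here ForvGjenLevetid, i.e. expected remaining lifetime) for all years and ages
exAllYears <- function(df=t07902) {
    df[df$ContentsCode=="ForvGjenLevetid", c("Kjonn","AlderX","Tid","value")]
}

Reshaping the result into a sex/age by year table is exactly what the newer post above does with the reshape package.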

Here is a function version of the above plot, complete with a parameter for png saving. The function-within-a-function pattern is efficient for factoring out common features of function calls, and for defining and handling a local context: local data in the outer function are "global" in the inner functions - a kind of Poor Man's Object Orientation.

testPlotEX <- function(df,savePng=0) {
    plotter <- function(y,newP=0,col=1) {
        if (newP==1) plot(as.numeric(dfA$AlderX),y,type="l",xlab="Age",ylab="Expected years left",col=col)
        else points(as.numeric(dfA$AlderX),y,type="l",col=col)
        legend(65,75,col=c(1,2,4),lty=c(1,1,1),legend=c("All","Women","Men"))
    }

    X11()
    if (savePng>0) png(file="life_expect_0.png") ;
    dfA <- df[df$Kjonn==0,] ;  dfM <- df[df$Kjonn==1,] ; dfF <- df[df$Kjonn==2,] ;
    plotter(dfA$ex,newP=1) ;  plotter(dfM$ex,col=4) ;  plotter(dfF$ex,col=2) ;
    if (savePng>0) dev.off() ;
}

>
> testPlotEX(dt2017,savePng=0) 
> testPlotEX(dt2017,savePng=1) 
>

NB! This is work in progress :-)

Monday, 19 March 2018

Prime sieving: Eratosthenes as an R-user


Prime sieving is an ancient and (still) fairly efficient way of finding primes. In addition, it fits well as part of benchmark suites, and different implementations may provide interesting data for performance comparisons.

This applies in particular to R, where the "don't loop if you can avoid it" principle could be challenged in cases where a large number of vector operations is the alternative to looping, and one ends up with a fair number of loops anyway.

The sieve of Eratosthenes (276-194/195 BC) provides a prime (no pun intended) example of this.

The sieving process: Looping

Given a number N, the "classical" version of the sieve marks off, in an array, the multiples of each prime less than or equal to sqrt(N). The lowest unmarked number is the next prime, and when this has been done for all candidates, we know the remaining unmarked numbers are primes. As a small optimization, for a prime p we may start the marking at p*p, since the lower multiples of p have already been marked off by smaller prime factors.

A simple and straightforward implementation is the function sieveLoop:

sieveLoop <- function(MAXN=1000){
  numbers <- seq(1:MAXN) ; 
  maxSmallFactor <- round(sqrt(MAXN)) ;
  currPrime <- 2 ;   numbers[1] <- 0 ;
  while(currPrime <= maxSmallFactor) {
      checkNum <- currPrime*currPrime ;
      while(checkNum<=MAXN){
          numbers[checkNum] <- 0 ;
          checkNum <- checkNum + currPrime ;
      }
      currPrime <- currPrime + 1 ;
      while (numbers[currPrime]==0)  currPrime <- currPrime + 1 ;
  }
  numbers[numbers!=0]
}

In this case, the only place where vector operations are used non-trivially is the return statement, where the non-zero entries are returned in an array, but as that occurs just once per sieving process, the impact on the run-time behaviour should be quite small.

Sieving using vector operations

The function sieveVector uses vector operations instead of loops for the basic sieving process: all numbers divisible by the current prime are removed, and the first number in the resulting array is used as the next prime. That number is then stored in an array of found primes. This way there is no need for a final sieving either, as the numbers remaining once we have sieved up to sqrt(N) must all be primes. Thus, we may simply concatenate our array of found primes up to sqrt(N) with the remaining numbers > sqrt(N) to get the result.

sieveVector <- function(MAXN=1000){
    numbers <- 3:MAXN
    maxSmallFactor <- round(sqrt(MAXN)) ;
    currPrime <- 2 ; primes <- c(2) 
    while(currPrime <= maxSmallFactor) {
        numbers <- numbers[numbers %% currPrime!=0]
        currPrime <- numbers[1] ;  
        primes <- c(primes,currPrime)
    }
    # numbers[1] is the prime last appended to primes, so drop it to avoid a duplicate
    primes <- c(primes,numbers[-1])
    primes
}

This "R-centric" approach seems to be much better, and in particular for smaller N indeed it is. For N<1000000 it is about 10x faster than the loop approach, and when setting up a comparison I expected the vector method to be uniformly the unchallenged winner, no matter what.

Testing

I set up a function testSieve, which uses R's simple system timer to run the same sieving with both methods and returns N and the total times (in seconds). This function is called by another, comparePrimeSieves, which runs it for powers of 2 times a starting size (default 10000) and returns the full series of results as rows of a matrix.

testSieve <- function(MAXN) {
    print(t1 <- system.time(p1 <- sieveLoop(MAXN)))
    print(t2 <- system.time(p2 <- sieveVector(MAXN)))
    c(MAXN,t1[3],t2[3])
}

comparePrimeSieves <- function(firstN=10000,nSteps=10,stepSize=2){
    currSize <- firstN ;
    resM <- matrix(0,nrow=nSteps,ncol=3,byrow=T)
    for (i in 1:nSteps) {
        t1 <- testSieve(currSize) ;
        resM[i,] <- c(currSize,t1[2],t1[3])    
        currSize <- currSize * stepSize ;
        if (currSize>500000) print(paste("CurrSize: ",currSize)) ;
    }
    resM
}

Running this comparison for 10000*(2..2¹⁴) = 20 000..163 840 000, we should get a good impression of the performance.
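The comparison and a log-log plot of the kind shown in fig 1 can be produced along these lines (a sketch, not the exact code behind the figure; the pmax guard is only there because elapsed times for the smallest N may be reported as 0):

resM <- comparePrimeSieves(firstN=20000, nSteps=14)   # 20 000 .. 163 840 000
lN  <- log10(resM[,1])
lTl <- log10(pmax(resM[,2],0.001))   # loop times (seconds)
lTv <- log10(pmax(resM[,3],0.001))   # vector times (seconds)
plot(lN, lTl, col=2, xlab="log10(N)", ylab="log10(T)", ylim=range(c(lTl,lTv)))
points(lN, lTv, col=3)
fitLoop <- lm(lTl ~ lN) ; fitVec <- lm(lTv ~ lN)
abline(fitLoop, col=2) ; abline(fitVec, col=3)
coef(fitVec)[2]   # the slope estimates the exponent of N for the vector version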

Performance

Fig 1. Running times for loop (red) and vector (green) versions of the sieve, with regression lines. T is in microseconds in the regression equations.

The results are presented in a log-log plot in figure 1, and I was in for a surprise. Both methods produced linear plots, but while the looping method was approximately linear in N, the vector method followed N^1.28 (an exponent close to the cube root of 2). For the highest N tested the linear fit was rather bad, so while the vector method was still faster there, the difference had shrunk to 389 vs 310 seconds.

A variant where the vector of primes was allocated up front, instead of being extended with each prime (the primes <- c(primes,currPrime) step), ran about 5% faster for N=25000000, but otherwise showed qualitatively the same behaviour.
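A sketch of such a preallocation variant (not necessarily the exact code that was timed):

sieveVectorPrealloc <- function(MAXN=1000){
    numbers <- 2:MAXN
    maxSmallFactor <- round(sqrt(MAXN))
    primes <- numeric(maxSmallFactor)   # certainly enough room for the primes <= sqrt(MAXN)
    nP <- 0 ; currPrime <- 2
    while (currPrime <= maxSmallFactor) {
        nP <- nP + 1 ; primes[nP] <- currPrime   # store instead of extending with c()
        numbers <- numbers[numbers %% currPrime!=0]
        currPrime <- numbers[1]
    }
    c(primes[1:nP], numbers)
}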

Apart from the obvious confirmation of the effectiveness of the vector approach, my two takeaways from this experiment are that full-scale testing is necessary to choose safely between functionally equivalent approaches, and that an adequate general algorithm does not necessarily have to be morphed into an R-adapted incarnation to be useful in R: depending on the purpose, the performance of the "foreign" looping version could be adequate for occasional use.

Prime number theorem

The occurrence of primes is highly irregular. Can we still say something about how many primes to expect in a given number range? It is standard notation to write π(N) for the number of primes ≤ N, so given an interval (N₁,N₂) we are looking for an estimate of π(N₂) - π(N₁).

Fig 2. N/π(N) vs log(N)

The striking regularity in fig 2 was discovered early by mathematicians: N/π(N) follows log(N) closely, with the relation already apparent for N < 5000. For a long time the prime number theorem remained a conjecture, until it was first proved (independently) by Jacques Hadamard and Charles Jean de la Vallée-Poussin in 1896. They used fairly heavy mathematical machinery, but in 1949 the first "elementary" proofs were published, by Atle Selberg and Paul Erdős. The shortest and simplest proof to date is considered to be Donald J. Newman's from 1980; it is "non-elementary" only in the sense that it uses Cauchy's integral theorem from complex analysis.
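With the sieve above, this is easy to check directly, since π(N) is just the length of the returned vector. A small sketch:

# Compare N/pi(N) with log(N) for a few values of N
for (N in c(1e4,1e5,1e6)) {
    piN <- length(sieveVector(N))
    cat(sprintf("N = %8.0f   N/pi(N) = %6.2f   log(N) = %6.2f\n", N, N/piN, log(N)))
}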

There are several heuristic motivations for the theorem. One of the more thorough is presented in a blog entry by Terry Tao, where he introduces, among other number-theoretic concepts, the von Mangoldt function and Möbius inversion. There is no room to go into details here, but the reader is encouraged to do some digging into these areas of number theory - I find it very rewarding.

R as a tool for experimentation

I had not expected R to be this efficient at sieving millions of primes with simple implementations and no attempt at adaptation or optimization. So it seems to be an even better tool for experimentation than I had thought. The important quality here is being able to focus on the problem, not on implementation and optimization details.