Friday 6 April 2018

Gompertz overhauled: Adaptation of a 200-year-old distribution to the modern style of dying

If he could have looked 200 years into the future, Benjamin Gompertz (1779-1865) would surely have been pleased to see how well his "law of mortality" holds up under conditions vastly different from those under which it was formulated. In particular, it nowadays outperforms the more modern and versatile Weibull distribution; Gompertz is F_G(x) = 1 - exp(-(b/a)(e^(ax) - 1)) vs Weibull F_W(x) = 1 - exp(-(x/a)^k). I imagine Gompertz, having applied his standard set of tools, could have been spot on in his criticism of the Weibull as a mortality law:
"Look here. All the mortality data I have looked into approximately follow a geometric progression of hazard rates,   hn = h0αn     [1] But the Weibull has something in between an arithmetic and a geometric progression. Counting from age x0 we get:   hn = h0(1+n/x0)k-1.     In general, that is a worse fit to the data, and the lower h0 and the higher n are, the less adept it gets when it comes to capturing the more challenging features of human mortality."

But low h_0 and high n are the most important characteristics of contemporary mortality in developed countries. We can illustrate this by applying maximum likelihood estimation (MLE) of the Weibull, Gompertz, Gompertz-Makeham and "slightly generalized Gompertz" distributions to what may, model-wise, be one of the most challenging mortality datasets: the 2017 table for Norwegian women. There are quite a few "generalized Gompertz" distributions available, but I haven't seen one labeled "slightly generalized Gompertz", so I use that for the distribution F_G3(x) = 1 - exp(-(b/a)(e^(a x^k) - 1)).
Some purists may frown upon this use of "synthetic" data, but I think that criticism overshoots a bit here: the main bulk of deaths occurs between ages 75 and 95, and the mortality changes over short time spans are not big enough to make the table approach invalid for model evaluation and estimation.

MLE fitting using optimx and nlminb

The good ol' nlm function of base R doesn't quite cut it here, so I employed the optimx package [2], where the nlminb function proved to be the best choice for this task. I tried to help it by following the book and furnishing gradient and Hessian functions, but it didn't quite work out, and I deemed the basic results obtained good enough for the present purpose. Beware that with several parameters, quite different combinations may give quite similar end results. If that is considered a bug, it is a bug of the model.
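
For reference, here is a hedged sketch of the simplest case: the negative log-likelihood of plain Gompertz on a frequency table (x = ages, dx = deaths per age, the convention used further below), and an optimx call restricted to nlminb. The starting values are illustrative, and x and dx are assumed to already exist in the workspace.

library(optimx)
minusLogLikGomp <- function(para) {
    a <- para[1] ; b <- para[2] ; n <- sum(dx)
    # sum over the table of log f(x) = log(b) + a*x - (b/a)*(exp(a*x) - 1)
    -( n*log(b) + n*b/a + a*sum(dx*x) - (b/a)*sum(dx*exp(a*x)) )
}
g2x <- optimx(c(0.1,1e-5),fn=minusLogLikGomp,method="nlminb",lower=c(0,0))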

I think fig 1 quite clearly brings out the strengths and weaknesses of the distributions employed. First, we may say that the "re-enacted", principle-based criticism of the Weibull distribution attributed to Gompertz above is completely vindicated: Weibull isn't able to adapt to the sharply focused contemporary mortality, strongly overestimates deaths from 60 to 80, and furnishes us with a small, non-existent army of centenarians. Standard Gompertz is far better, but fails to capture the deviations from a linear log-hazard rate in the 70-90 range mentioned in [1].

Fig 1. Norwegian female mortality 2017 and four approximations: Gompertz, Gompertz-Makeham, Gompertz with exponent parameter exp(a x^k) and Weibull. Maximum likelihood approximation using the R function nlminb on the log-likelihoods of the respective distribution families.

Gompertz-Makeham (GM) and "slightly generalized Gompertz" (F_G3) both add a parameter: GM includes a constant component in the hazard rate, log(h(x)) = log(b e^(ax) + λ), while F_G3 models a frequently occurring kind of non-linearity in the log-hazard rate: log(h(x)) = log(b) + log(k) + a x^k + (k-1) log(x). While they approximate marginally better than plain Gompertz, they both fail to capture the deviations from linearity adequately: they overestimate mortality at ages 70-80, underestimate it around 90, and don't fully capture the sharp peak around 90.

As a mathematician, Gompertz probably appreciated the intricacies involved in focused mortality patterns, and I'm not sure he would rate the 2018 performance of his method so badly. But we can be quite sure that as an expert actuary, he wouldn't feel content with 15-20% error in predicting deaths in an important age group. So he would surely agree that his method needed an overhaul, while probably insisting that the overhaul be local. For the overall performance of his method is still quite good.

There is clearly not just one way to do this overhaul. But FG3 has told us that the "generalization" approach may not be worthwhile. Maybe we need something "special" rather than "general"? The good performance of standard Gompertz at ages 95-100 may give us some hints: Don't introduce something that deviates a lot from the log-hazard line there.

Overhaul requirements

The overhauled method must be adaptive, so we need parameters. They should come in addition to (a,b):

  • As few additional parameters as possible; we accept some hard-coding in the result
  • Localized: the effect of the modification should be minimal outside of the age span ca 70-95
  • Perturbation with strong damping; an exponential may be used for that
  • The deviations to model suggest an anti-symmetric perturbation, like x^3 + x
  • Localization may be done by translation x -> x-h
  • We would like the perturbation function l(x) to have a continuous derivative
  • We would like to have log(h(x)) ≈ log(b) + a(x+l(x)) + δ(x) with δ(x) "small"
Are those requirements hard to meet, while still providing something that works well enough?

The requirements indicate that we could probably make good use of at least 3 parameters (c,d,h): the perturbation could be dimensioned by c, its exponential damping regulated by d and the localization (around 85-90 years of age) by h. Giving us something like:
l(x) = c((x-h)^3 + A(x-h))/(1 + exp(d |x-h|))
The parameter A will be tuned "manually", and then hard-coded into the model.
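
To get a feel for the perturbation before estimating anything, here is a small sketch (not part of the original analysis) that defines l(x) with A set to 6, the value hard-coded later, and plots it for trial values of (c,d,h) in the relevant range:

lpert <- function(x,c,d,h,A=6) c*((x-h)^3 + A*(x-h))/(1 + exp(d*abs(x-h)))
curve(lpert(x,c=0.04,d=0.35,h=92.5),from=60,to=110,xlab="Age x",ylab="l(x)")
abline(h=0,lty=3)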

Maxima: Visiting a friend of R

With the requirements, and a basic idea for a solution, it may be time to turn to one of R's good, but maybe underrated and underexploited, friends: the computer algebra system (CAS) of ancient Lisp lineage, Maxima [3]. I use it with the wxmaxima user interface. It's not that Maxima is indispensable as a tool, but it is very good for structured approaches and quality control.

Defining l(x) in Maxima, we can try it out with different values of the parameters, and we can differentiate it (although it may be best to carry that out manually, because of the potential discontinuity of l'(x) at x=h). More importantly, we can get assistance with the pdf and the log-likelihood. We can see that l(x) does in fact have a continuous derivative at x=h, but with a small cusp. The higher-order discontinuities have no impact on our mortality modeling.
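
A quick numerical cross-check outside Maxima (a sketch, reusing the lpert function from the snippet above): one-sided difference quotients on both sides of x = h should approach the same value, c·A/2 = 0.12 for these trial parameters.

eps <- 10^-(2:6)
rightSlope <- (lpert(92.5 + eps,0.04,0.35,92.5) - lpert(92.5,0.04,0.35,92.5))/eps
leftSlope  <- (lpert(92.5,0.04,0.35,92.5) - lpert(92.5 - eps,0.04,0.35,92.5))/eps
cbind(eps,rightSlope,leftSlope)     # both slope columns approach c*A/2 = 0.12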

As an example, wxmaxima can export the log-likelihood for
F_GO(x) = 1 - exp(-(b/a)(e^(a(x+l(x))) - 1))
as "raw" MathML [4]. Saved as a function, the expression can be edited and used in the R code; cleaned up, the per-observation log-likelihood reads
log f(x) = log(b) + log(1 + l'(x)) + a(x + l(x)) + b/a - (b/a) e^(a(x+l(x)))
Note that Maxima's straight use of d/dx |x| = x/|x| is not a good way to implement that functionality in R.

Fig 2. Norwegian female mortality 2017 approximated by overhauled Gompertz. Basic Gompertz shown for reference. Maximum likelihood approximation using R function nlminb on the log-likelihoods of the respective distribution families.
Estimated parameters: (a,b,c,d,h)=(0.115,3.913839e-06,0.03849455,0.3461503,92.5)

Running optimx with different initial conditions will invariably lead to different parameter estimates, many of which may meet the convergence criteria, and this may feel a bit frustrating. Furthermore, carefully maximizing the log-likelihood may not return the best parameter set for your purposes; some limited deviation from maximality may be warranted. But these issues must be considered necessary consequences of the parametrization.

One reason the plainer Newton-based routine nlm is less useful in such situations is that the parameters may get very close to 0, and several functions may break on negative input, so we may need to limit the search - something nlminb and a few other easily available routines allow for.

Fig 2 shows that overhauling Gompertz by adding 3 parameters for modeling local non-linearity in the log-hazard function may indeed be enough to obtain good fits. Most of the overestimation at 60-80 is gone, and the mortality peak at 90 is well captured. The "costs" in this concrete case are somewhat larger deviations at the oldest ages, and in addition, small irregularities are "neglected". Different sets of parameters yield slightly different results in this respect.

Similar results are obtained by adding the l(x) modification to Gompertz-Makeham, but I don't show them here. With more parameters, we also potentially get more indeterminacy and more work choosing between essentially equivalent parameter sets. And as we can see from the plots, at ages 25-50, where the fits really are improved by the Makeham parameter λ, the absolute deviation between estimated and observed deaths is rather small.

The full specification of the overhauled Gompertz approximation for 2017 Norwegian female mortality thus becomes:
F_GO(x) = 1 - exp(-(3.9e-6/0.115)(e^(0.115(x+l(x))) - 1))
l(x) = 0.04((x-92.5)^3 + 6(x-92.5))/(1 + exp(0.35 |x-92.5|))

Fig 3. Norwegian male (blue) and female (red) mortality 1967-2017 approximated by overhauled Gompertz. Basic Gompertz shown for reference.

A 50-year perspective

Fig 3 illustrates how the overhauled version can also be applied to older mortality data, but 50 years ago, the fit of the basic Gompertz distribution might be deemed adequate both for women and men. The estimated value of the c parameter was much smaller 50 years ago, but the 3 extra parameters might contribute to a better fit also then, so it is not necessarily very efficient simply to set c=0 for older data. For older data, there may also be more to be gained by including the Makeham λ parameter. It should be noted, though, that Gompertz-Makeham can in no way account for the largest difference between today and 50 years ago: infant mortality, which has decreased 6-8-fold. In 1967, about 2% of newborns died before age 11; today, only about 2% are expected to have died by age 45, and about 97.5% are expected to reach 50.

The peak in deaths has become higher and narrower over time, and, in particular for women, less regular. About 36% of all female deaths now occur between 85 and 92, and fig 3 indicates a tendency for the men's mortality curves to resemble those of women 10-20 years earlier, just a little bit lower and translated a few years to the left. The coefficient plots shown in [1] may be interpreted as an indication of some degree of convergence between male and female mortality, but what kind of evidence would such an extrapolation be based on? In Norway, cardiovascular disease has over the years become a less prominent cause of death, while cancer mortality has increased. But what might the impact of that be when about 50% of male deaths and 62% of female deaths occur at age 85 or older?

One might try to assess the development in mortality by looking at plots of the coefficients of overhauled Gompertz, but the optimization procedure used here is not stable enough to produce estimates where the parameters, without further processing or quality control, can be organized into time series in any meaningful way. That is partly a result of the higher degree of randomness introduced when irregularities are modeled more precisely. This also applies to derived indicators like the mode, which with overhauled Gompertz does not have the same predictable relation to the whole distribution as it has with basic Gompertz. The parameters (a,b) of the basic linear log-hazard should in principle be recognizable in the fitted model, but the 5 parameters (a,b,c,d,h) are dependent, and different combinations may produce quite similar model fits.

The basic R machinery in use

When defining new distributions, functions for the pdf f(x,para) and the negative log-likelihood minusLogLikf(para) are in many cases enough. For simplicity, I don't try to supply data to the likelihood function as parameter(s); when it is to be used on several data sets, I rather embed it in a wrapper function that takes the data as parameter(s) and prepares it for the optimization procedures. In this case, data is provided as a frequency table, with x holding the case values and dx the number of cases (sum(dx)=100000).
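
A hedged sketch of that wrapper idea (the column names and the use of the global workspace via <<- are my assumptions, not code from the post): the wrapper takes the frequency table, publishes x and dx where the likelihood below expects to find them, and runs optimx.

library(optimx)
fitGompAugm5p <- function(ageTable,startPar=c(0.1,1e-5,0.03,0.3,90)) {
    x  <<- ageTable$x        # ages (assumed column name)
    dx <<- ageTable$dx       # number of deaths at each age (assumed column name)
    optimx(startPar,fn=minusLogLikGompAugm5p,
           method=c("Nelder-Mead","BFGS","nlm","L-BFGS-B","nlminb"),lower=c(0,0,0,0,0))
}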

When parameter constraints are provided for optimx (here lower=c(0,0,0,0,0)), several entries in the method list may be excluded. I tend to keep a long list intact, as it may happen that the constraints can be dropped when we have zoomed in on the correct parameter neighborhoods.



# Density of the overhauled Gompertz: x may be a vector of ages, para = (a,b,c,d,h).
# The perturbation constant A = 3*xepsi is hard-coded to 6.
fGompAugm5p  <- function(x,para) {
    a <- para[1] ; b <- para[2] ;   c <- para[3] ; d <- para[4] ; h <- para[5] ; 
    xepsi <- 2 ; l <- 1 ;
    diffact <- ifelse(x < h,-1,1) ;                                # -1 left of h, +1 otherwise: d/dx |x-h|
    hxfunc <-c*((x-h)^3 + 3*xepsi*(x-h))/(l + exp(d*abs(x-h))) ;   # the perturbation l(x)
    ldxfunc <- 1 + 3*c*((x-h)^2 + 1*xepsi)/(l + exp(d*abs(x-h))) + -d*c*diffact*((x-h)^3 
              + 3*xepsi*(x-h))*exp(d*abs(x-h))/(l + exp(d*abs(x-h)))^2 ;   # 1 + l'(x)
    b*ldxfunc*exp(a*(x + hxfunc))*exp(-b/a*exp(a*(x+hxfunc)) + b/a)
 }

# Negative log-likelihood of the overhauled Gompertz on a frequency table; the vectors
# x (ages) and dx (number of deaths at each age) are taken from the global environment.
minusLogLikGompAugm5p  <- function(para) {
    a <- para[1] ; b <- para[2] ;   c <- para[3] ; d <- para[4] ; h <- para[5] ;
    n <- sum(dx) ; xepsi <- 2 ; l <- 1 ;

    hxfunc <-c*((x-h)^3 + 3*xepsi*(x-h)) /(l+ exp(d*abs(x-h)))     # the perturbation l(x)
    diffact <- ifelse(x < h,-1,1)
    ldxfunc <- 1 +  3*l*c*((x-h)^2 + 1*xepsi)/ (l + exp(d*abs(x-h))) -d*c*diffact*((x-h)^3 
               + 3*xepsi*(x-h))*exp(d*abs(x-h))/(l + exp(d*abs(x-h)))^2 ;   # 1 + l'(x)
    logdxsum <- sum(dx*log(ldxfunc)) ;      
    xsum <- sum(dx*(x + hxfunc)) ; expsum <- sum(dx*exp(a*(x + hxfunc))) ;

 - ( n*log(b) + n*b/a + logdxsum  + a*xsum  - b/a*expsum  )
 }

# M2017 - dx vector contains mortality data for males 2017
p5 <- c(0.115,3.9e-06,0.04,0.35,92.5);
g5x<-optimx(p5,fn=minusLogLikGompAugm5p,
            method=c("Nelder-Mead","BFGS","nlm","L-BFGS-B","nlminb"),lower=c(0,0,0,0,0)) ;


References
1. "Gompertz rules" posting on this blog 20180402
2. Interview with John Nash about optimx
3. Background on Maxima
4. MathML introduction

Monday 2 April 2018

Gompertz rules: A 50-year perspective on "contemporary" mortality

There are many ways to describe human mortality, and the "best" way will often depend on the purpose. Mortality is influenced by (among others) genetic, environmental, social and behavioral factors. With so many different factors at play, a single functional form can hardly be expected to be optimal for all modeling of life times. But some persistent patterns do emerge, and this is neatly demonstrated by recent Norwegian mortality data confirming the 1810-1825 observations of Benjamin Gompertz (1779-1865). Drawing on several different data sources, Gompertz found that the logarithm of the hazard rate for the elderly seemed to always be approximately a linear function of age, and we see the same basic pattern today: in this respect, Gompertz rules.

The hazard rate of a distribution with cdf F and pdf f is defined as h(x) = f(x)/(1-F(x)) = -d/dx log(1-F(x)). It has an immediate interpretation and application: with N(x) individuals at age x and hazard rate h(x), the probability of any one of them dying within one year is approximately h(x), and we can expect approximately N(x)h(x) of them to die in that period. Given an exact h(x), F(x) is determined up to a constant (fully, if we require F(0)=0): F(x) = 1 - exp(-∫_0^x h(t) dt). In most cases, though, the information we obtain about the hazard rate may be consistent with different types of distributions. The exponential hazard rate seen in human mortality is a kind of limiting case in this respect.
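
A small worked example of that interpretation (illustrative numbers, not from the post): with 20000 persons alive at age 85 and h(85) = 0.06, we expect roughly 1200 deaths during the following year.

N85 <- 20000 ; h85 <- 0.06     # illustrative values
N85 * h85                      # approximately expected number of deaths: 1200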

Gompertz published a comprehensive report on this in 1825, but did not dwell much on systematic deviations from the linear relationship at younger ages. In 1860, William Makeham (1826-1891) improved the fit for younger ages by extending the Gompertz model with a constant added to the hazard rate. The resulting distribution is still widely applied as the Gompertz-Makeham law of mortality. It tends to fit fairly well from 15-20 to about 100 years of age.

In 1860, the high overall mortality made the Makeham term highly significant, but most developed countries today have a very low mortality at younger ages, which makes it less practically important. For example, in Norway today only 0.5-1% of females die between ages 20 and 40, and about 98% of female newborns are expected to reach the age of 50.

Gompertz' observation: Mortality follows a geometric progression

Gompertz discovered that whatever adult age he started counting from, mortality n years later, as measured by the hazard rate h_n, could be approximately expressed as a geometric progression from the initial value h_0:
      h_n = h_0 α^n
Taking logarithms, we get
      log(h_n) = log(h_0) + n log(α)
I.e. a linear function of time (age) n, and whenever this is approximately correct, the Gompertz distribution will be a (more or less accurate) model for the life times. The Gompertz cumulative distribution function is fairly simple; here is the parametrization used in R's flexsurv package:
      F(x) = 1 - exp(-(b/a)(e^(ax) - 1))
The hazard rate becomes h(x) = f(x)/(1-F(x)) = b e^(ax). The corresponding Gompertz-Makeham rate is h(x) = b e^(ax) + λ.
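
A small consistency check (not from the original post, with illustrative parameter values) that the flexsurv parametrization quoted above really has hazard b e^(ax): compute f/(1-F) numerically and compare with the closed form.

library(flexsurv)
a <- 0.10 ; b <- 2e-5 ; xx <- c(50,70,90)
hNumeric <- dgompertz(xx,shape=a,rate=b)/(1 - pgompertz(xx,shape=a,rate=b))
hClosed  <- b*exp(a*xx)
cbind(xx,hNumeric,hClosed)     # the two hazard columns agree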

An interpretation of a is that mortality doubles every log(2)/a ≈ 0.69/a years. E.g. with a=0.10, mortality doubles about every 7 years. This period varies widely, but 5-10 years is a normal range. When overall mortality decreases, the doubling time often decreases, i.e. a increases.

The linear log-hazard function may seem unreasonable

The Gompertz rule makes very strong assertions (linearity over an extremely wide range of mortality rates) and has an obvious way of breaking down: at some age x_max, we will have a*x_max + log(b) > 0. That means that for P(death) in the next period δx = 1:   P(death) ≈ b exp(a*x_max)*δx > 1.   So death in the next period is more than guaranteed :-) Therefore, we must expect that from some high age on, the rate will be constant or even fall a bit. But what is the practical relevance of that?
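
A quick worked example with illustrative parameter values: the age where the Gompertz hazard formally passes 1 is x_max = -log(b)/a.

a <- 0.10 ; b <- 2e-5     # illustrative values
-log(b)/a                 # x_max, about 108 years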

I was more concerned about the shortcomings before looking at the Norwegian data. It turns out that the deviations from linear log-mortality have not increased over time, in spite of a tremendous increase in the number of people reaching very old age. Rather, the practically important deviations mostly occur between 70 and 95; at around 100, Norwegian mortality follows standard Gompertz quite closely. It also turns out that the regression line in question tends to "rotate": basic mortality has been steadily decreasing, while the age-dependent part has tended to increase lately, both for men and women. Mortality at 95-100 is rather stable, in spite of the huge increase in the number of persons reaching that age. Also, the gender differences in those parameters were much larger 50 years ago.

Used on the Norwegian data, the linear log-hazard function is reasonable. It is not in urgent need of any of the generalizations that have been proposed, and even the Makeham correction term is of only marginal practical significance. But it does not predict the number of deaths at 70-90 in recent years very precisely: for ages ca 70-80 it overestimates, and for ca 83-93 it underestimates mortality. A modification which operates "locally" and addresses this shortcoming might be of practical significance.

Plotting Gompertz mortality

Rather than estimating the Gompertz parameters (a,b) by maximum likelihood estimation (MLE), we have focused on the age range 50-90, approximating the log-mortality function by linear regression. In this, we follow (loosely) the method presented in demography course material by Germán Rodríguez of Princeton.

We construct the age vs deaths (x,dx) data from the lx column of the mortality table. Parameters are estimated, and plots of data and residuals are presented for selected years: 1967, 1997, 2007 and 2017.

The time development of the two parameters is also plotted, separately and against each other. An "educated guess" of future values may be made from these plots: first, extrapolate the log(b) value for the year in question, and then get the a value from the a vs log(b) linear regression for the later years. Drawing from the "guessed" Gompertz distribution may provide some basic information on future mortality, but the model fit for the age distribution of deaths will probably not be very good for ca 70-90 years of age.

Because we focus on the age interval where >90% of deaths occur (ca 65-105), we don't consider the Makeham adaptation here - it is mostly influential below 50 years of age, and its inclusion does not help much with the systematic deviations we observe. Nor do we explore the most immediate generalizations of the Gompertz distribution, e.g. x -> x^k in the exponent, F(x) = 1 - exp(-(b/a)(e^(a x^k) - 1)), or Perks-type distributions with hazard rate h(x) = (b exp(ax) + c)/(d exp(ax) + 1).

Some R-code

We prepare a table with age and sex as background variables, and columns with lx for the years 1966-2017. All years are processed, and plots are produced for selected years and time evolution of the parameters.

paramEstimYear <- function(dF,mkPlot=0,yr=1967,regrStart=50) {
   prepareHazard <- function(hdF,plotcol=4,yr,regrStart=50,plotNum=1) {
        hdF$lx <- ifelse(hdF$lx>0,hdF$lx,2) 
        lxm <- mutate(hdF, lx = lx/100000, Hx = -log(lx)) ;
        lxhm <- data.frame(hx = diff(lxm$Hx), x = midpoints(lxm$age))
        lfm <- lm(log(hx) ~ I(x - regrStart), data = filter(lxhm, (x >=regrStart)&(x<=90)))
        lxhmf <- filter(lxhm, (x >= regrStart)&(x<=90)) %>%  mutate(hf = exp(fitted(lfm)))
      
        if (mkPlot==1) {
           if (plotNum==1) mainTitle <- paste("Year: ",yr) else mainTitle <- " " ;
           plot(lxhm$x,log(lxhm$hx),ylim=c(-11,0),col=plotcol,xlab="Age x",ylab="log(h(x))",main=mainTitle)
           abline(lfm$coef[1]-regrStart*lfm$coef[2],lfm$coef[2],col=plotcol,lty=4)
           plot((regrStart+1):90,lfm$resid,ylim=c(-0.7,0.7),col=plotcol,
           xlab="Age x",ylab="Residuals of log(h(x))")
        }
        list(lxhm=lxhm,lfcoef=lfm$coef,lfres=lfm$resid)  
    }
    
    yrM <- data.frame(dF$age,dF$maleYear) ;  yrF <- data.frame(dF$age,dF$femYear) ;
    names(yrM) <- c("age","lx") ;  names(yrF) <- c("age","lx") ;
    yrHM <- prepareHazard(yrM,plotcol=4,yr,regrStart,plotNum=1) ;
    yrHF <- prepareHazard(yrF,plotcol=2,yr,regrStart,plotNum=2) ;
    list(yrHM=yrHM,yrHF=yrHF)
}


paramEstimSeriesOfYears <- function(sT,years=1967:2017,mkPlot=0,regrStart=50) {
    resM <- matrix(0,nrow=length(years),ncol=5) ; resM[,1] <- years ;
    for (yr in 1:length(years)) {
        yearData <- prepareYear(sT,as.character(yr+years[1]-1)) ;
        pE <- paramEstimYear(yearData,regrStart=regrStart) ;
        resM[yr,2] <- pE$yrHM$lfcoef[1] ;  resM[yr,3] <- pE$yrHM$lfcoef[2] ;
        resM[yr,4] <- pE$yrHF$lfcoef[1] ;  resM[yr,5] <- pE$yrHF$lfcoef[2] ;
    }
    resM
}

plotMortalityParameters <- function(resM,regrStart=50,lastPerStart=2007,savePng=0) {

    yrs <- resM[,1] ; a0M <- resM[,2] - resM[,3]*regrStart ;  a0F <- resM[,4] - resM[,5]*regrStart ;
    aFitM <- lm(a0M~yrs) ;    bFitM <- lm(resM[,3]~yrs) ;
    aFitF <- lm(a0F~yrs) ;    bFitF <- lm(resM[,5]~yrs)
    aCoefM <- aFitM$coef ; bCoefM <- bFitM$coef ;
    aCoefF <- aFitF$coef ; bCoefF <- bFitF$coef ;

    aFitM25<- lm(a0M[yrs>lastPerStart]~resM[yrs>lastPerStart,1]) ;    
    bFitM25 <- lm(resM[yrs>lastPerStart,3]~resM[yrs>lastPerStart,1]) ;
    aFitF25 <- lm(a0F[yrs>lastPerStart]~resM[yrs>lastPerStart,1]) ;    
    bFitF25 <- lm(resM[yrs>lastPerStart,5]~resM[yrs>lastPerStart,1])
    aCoefM25 <- aFitM25$coef ; bCoefM25 <- bFitM25$coef ;
    aCoefF25 <- aFitF25$coef ; bCoefF25 <- bFitF25$coef ;    

    abM25<- lm(resM[yrs>lastPerStart,3]~a0M[yrs>lastPerStart]) ; 
    abF25<- lm(resM[yrs>lastPerStart,5]~a0F[yrs>lastPerStart]) ;  
    abCoefM25 <- abM25$coef ; abCoefF25 <- abF25$coef ;
    print(paste(abCoefM25, abCoefF25))
   
    if (savePng>0) png(filename='gomp_f1.png',width=1000,height=600)
    else  X11(width=12,height=6)
    par(mfrow=c(1,3))
    plot(yrs,a0M,col=4,ylim=c(-12.5,-9.5),xlab="Year",ylab="Basic mortality",main="Parameter a in y=a+bx")
    points(yrs,a0F,col=2)
    abline(aCoefM[1],aCoefM[2],col=4)
    abline(aCoefF[1],aCoefF[2],col=2)
    abline(aCoefM25[1],aCoefM25[2],col=4,lty=3)
    abline(aCoefF25[1],aCoefF25[2],col=2,lty=3)  
    legend("topright",pch=c(1,1),col=c(4,2),legend=c("Males","Females"))   # placed inside the plotting region
    
    plot(resM[,1],resM[,3],col=4,ylim=c(0.09,0.12),xlab="Year",ylab="Aging mortality component",
       main="Parameter b in y=a+bx")
    points(resM[,1],resM[,5],col=2)
    abline(bCoefM[1],bCoefM[2],col=4)
    abline(bCoefF[1],bCoefF[2],col=2)
    abline(bCoefM25[1],bCoefM25[2],col=4,lty=3)
    abline(bCoefF25[1],bCoefF25[2],col=2,lty=3)

    plot(a0M,resM[,3],col=4,xlim=c(-12.5,-9.5),ylim=c(0.09,0.12),xlab="a-parameter",
         ylab="b-parameter",main="b vs a parameters in y=a+bx")
    points(a0F,resM[,5],col=2)
    abline(abCoefM25[1],abCoefM25[2],lty=4,col=4)
    abline(abCoefF25[1],abCoefF25[2],lty=4,col=2)
   
    if (savePng>0) dev.off()
    else dev.copy2eps(device=x11,file='gomp_f1.eps') ;
 
    c(aCoefM25[1],aCoefM25[2],aCoefF25[1],aCoefF25[2],bCoefM25[1],bCoefM25[2],bCoefF25[1],bCoefF25[2])
}

processingGompertz <- function(mkPlotSlct=1,regrStart=50,savePng=0) {
    sT <- getSurvivalTable()
    resM <- paramEstimSeriesOfYears(sT,years=1967:2017,mkPlot=0,regrStart=regrStart) 
    regrC <- plotMortalityParameters(resM,savePng=savePng)

    if (mkPlotSlct==1) {
        if (savePng>0) png(filename='gomp_f2.png',width=1000,height=1200)
        else X11(height=12,width=12) ;
        par(mfrow=c(4,4)) ;
        y1967 <- prepareYear(sT,"1967");  pE <- paramEstimYear(y1967,mkPlot=1,yr=1967,regrStart=regrStart);
        y1997 <- prepareYear(sT,"1997");  pE <- paramEstimYear(y1997,mkPlot=1,yr=1997,regrStart=regrStart);
        y2007 <- prepareYear(sT,"2007");  pE <- paramEstimYear(y2007,mkPlot=1,yr=2007,regrStart=regrStart);
        y2017 <- prepareYear(sT,"2017");  pE <- paramEstimYear(y2017,mkPlot=1,yr=2017,regrStart=regrStart);
        
        if (savePng>0) dev.off()         
        else dev.copy2eps(device=x11,file='gomp_f2.eps') ;
    }
    list(regrC=regrC,resM=resM)
}

Git repository for this blog: https://github.com/tyder-no/r-and-friends

Mortality: Basic patterns and some systematic deviations

The basic qualitative pattern of the log-hazard rate plots has hardly changed over 50 years, but there are a number of features to note:

  • During the last 30 years, there has been a fairly stable deviation from linearity, as illustrated by the residuals for ages 70-90, which in most cases show an increasing linear tendency.
  • Mortality at 93-100 years is "back on track", in spite of these years not being included in the regression fit.
  • There is a roughly linear "Makeham-type" deviation ca 25-50, horizontal for men, but not for women.
  • During the last 30 years, strong deviations from linear log-mortality at 50+ only occur above 100 years of age.

Fig 1. Log mortality plots for 1967, 1997, 2007 and 2017. Males (blue) and females (red), residual plots for the regression fit, which is performed for the interval (50,90).

Time evolution of the hazard rate parameters

A common feature is that the "basic mortality", as described by the intercept of the linear fit to the log hazard rate (log(b) in the Gompertz parametrization), has been declining for both men and women, at a much faster rate for men - but from a much higher level. After 2010, this rate has continued to fall, but now men's rates are similar to women's - whose reduction may in turn have accelerated over the last 10 years. But the absolute levels are low.

50 years ago, women "aged" quite a bit faster than men, as indicated by the slope of the linear fit to the log hazard rate (the Gompertz a parameter). While it decreased for some decades for women, it now seems to increase for both sexes, and the coefficients are similar.

Fig 2. Panels 1 and 2: Gompertz mortality parameters for males and females 1967-2017, with regression lines for the whole period and for 2007-2017. Note that while these regression fits for men are similar for the two periods, for women they deviate strongly. Panel 3 shows the two parameters against each other, with regression lines for 2007-2017; changes in the parameters have been strongly correlated in recent years. Note that the intercept, labelled "a" in the plots (y=a+bx), is log(b) from the distribution function, while the slope, labelled "b", is the distribution's a parameter.

Wednesday 21 March 2018

More JSON from Statistics Norway: Metadata, query creation and execution

We have presented the basics of fetching data with the JSON-based API of Statistics Norway's data bank, and transforming them. The next steps are to look a bit more closely into metadata, and how to use more tools for transforming the fetched data. We will not repeat much from last post, but the full source code is available from Github.

Minimalist metadata handling

Together with the three functions presented in the last post, the following 5 functions constitute a bare-bones and simple suite for handling metadata and data in the API. The first, getValuesAndLabels(tableId), returns the metadata for a given table as a list of R data frames (one per variable), each augmented with a column for variable selection information.

When this information is filled in, the data frame can be handed to createQueryDF(metaDF) for creation of a JSON query. A query data frame is generated and undergoes a little bit of regexp patching in queryFromDF(df), to produce a runnable query.

The function createQueryFromDF(markedValueLabelsDF) carries out the two mentioned steps, producing a query from a marked-up data frame. A frequently used application of this is the function createQueryForAllVars(tableId), which returns a query for all the variables in the table.

Thus, the function calls

> t07902 <- getJSONData("07902",createQueryForAllVars("07902"))
> t04681 <- getJSONData("04681",createQueryForAllVars("04681"))
> t03013 <- getJSONData("03013",createQueryForAllVars("03013"))
will fetch all of tables 07902, 04681 and 03013, respectively. This way, we don't really need to handle the metadata in advance if we are content with getting, possibly, haystacks of data.

options(encoding="UTF-8")
# source("ssb_mortality_table_testing.R")
library(httr)
# rjstat is used for converting SSB JSON -> Data frame
library(rjstat)
# jsonlite is used mainly for converting metadata 
library(jsonlite)

# From JSON metadata table structure w/no subtable converted to dataframe by jsonlite
# Adds column "Slct" for selection marking, Parameter tableId: Number of SB-table

getValuesAndLabels <- function(tableId) {
    mDF <- fromJSON(getMetaData(tableId))
    varNms <- mDF[[2]][[1]]  ; varLbls <- mDF[[2]][[1]] ; varNmb <- length(varNms) ;
    valAndLbl <- list()
    for (i in 1:varNmb) {
        xdfi <- data.frame(mDF[[2]][3][[1]][i],  mDF[[2]][4][[1]][i],0)
        names(xdfi) <- c(varNms[i],paste(varNms[i],"Label",sep=""),"Slct")
        valAndLbl[[varNms[i]]] <- xdfi
    }
    valAndLbl
}

#  Creates search data frame from marked-up meta data frame metaDF 

createQueryDF <- function(metaDF){
    srchDF <- fromJSON(getQueryData99999())
    srchDFq  <-  srchDF$query
    nQueryRows <- length(srchDFq$code) ;
    nVar <- length(metaDF) ; nSrchVar <- 0 ;
    
    for (i in 1:nVar){
        nValues <-  length(metaDF[[i]][,3]) ;
        allSlct <- metaDF[[i]][1,3] ; nmbSlct <-  sum(metaDF[[i]][,3]) ;
        topSlct <- metaDF[[i]][nValues,3]
        if (allSlct==10 || nmbSlct>0) {   # Variable included in search
            nSrchVar <-  nSrchVar + 1 ;
            varNm <- names(metaDF[[i]])[1] ;
            srchDFq$code[nSrchVar] <- varNm ;
        }
        if (allSlct==10) {  # All
            srchDFq$selection[nSrchVar,1] <- "all" ;
            srchDFq$selection[nSrchVar,2] <- "*" ;
        }
        else if (nmbSlct>0&topSlct<2) { # Some values, put them in a list
            mDi <- metaDF[[i]] ; mDiSlct <- mDi[mDi[,3]==1,1] ;
            mDiSlct <- paste("\"",mDiSlct,"\"",sep="")
            srchDFq$selection[nSrchVar,1] <- "item" ;
            srchDFq$selection[nSrchVar,2] <- paste(mDiSlct,collapse=",") ;
         }
        else if (topSlct>1) { # Newest topSlct values
            srchDFq$selection[nSrchVar,1] <- "top" ;
            srchDFq$selection[nSrchVar,2] <- as.character(topSlct) ;
        }
    }
    srchDF$query <- srchDFq[1:nSrchVar,]
    # print(length(srchDF$query$code))
    fjL <- list(format="json-stat")
    srchDF$response<-fjL
    srchDF
}

# Patch the generated query by removing the last [] pair
# This way, we can create a correct query from a data frame (really list/frame hybrid) we generate

queryFromDF <- function(df){
    jfJ0 <- toJSON(df)
    jfJ0 <- gsub("\\\\","",jfJ0)
    jfJ0 <- gsub("\\[\"\"","\\[\"",jfJ0)
    jfJ0 <- gsub("\"\"\\]","\"\\]",jfJ0)
    jfJ0 <- gsub("\\]([^]]*)$","\\1",jfJ0)
    jfJ0 <- gsub("\\[([^[]*)$","\\1",jfJ0)
    jfJ0
}    

#  createQueryFromDF - Creates  json search query from marked-up metadata 
#  Parameter: markedValueLabelsDF - list of data frames marked for selection

createQueryFromDF <- function(markedValueLabelsDF){
   sDF <- createQueryDF(markedValueLabelsDF)
   queryFromDF(sDF)
}

# createQueryForAllVars - Creates a query for an entire table 

createQueryForAllVars <- function(tableId)  {
    vlDF <- getValuesAndLabels(tableId) 
    nVar <- length(vlDF) ; 
    for (i in 1:nVar) vlDF[[i]][1,3] <- 10 ;
    createQueryFromDF(vlDF)
}

Example of use on the mortality table

Extracting one of the four mortality table variables (qx,dx,lx,ex) for all available years can be done by setting up a query limited only on ContentsCode - which is what the function pickMortalityVars(mdVL,contCode) does. It is called by the function getMortalityData(contCode), which returns a data frame with values for all years for one variable, e.g. life expectancy (lx).

The next step is to transform the data frame to a suitable format. A table with N+2 columns, Sex, Age and data for N years, is handy, but there is quite a bit of looping needed if we develop the transformation from scratch. This is a case for the reshape package; the functions melt and cast can quite easily be employed here. They are used in transformToColumnsData(eData), where the variables are "melted" by sex, age and time, and "cast" according to sex+age~time. After some adaptation of the columns, the result is exactly what we want.

pickMortalityVars <- function(mdVL,contCode=3){
    mdVL$Kjonn[1,3] <- 10 ; # All
    mdVL$AlderX[1,3] <- 10 ; # All
    mdVL$ContentsCode[contCode,3] <- 1 ; 
    mdVL$Tid[1,3] <- 10 ; # All
    mdVL
}

transformToColumnsData <- function(eData){
    eData$expLT <- eData$value # To avoid default naming collision on "value"
    eData$value <- NULL ;  # Drop value column
    meltExpectation <- melt(eData,id=c("Kjonn","AlderX","Tid")) ;
    meltExpectation$expLT <- NULL ;  # Drop trivial column
    colExp <- cast(meltExpectation,Kjonn+AlderX~Tid)
    colExp 
}

getMortalityData <- function(contCode){
    metaData <- getValuesAndLabels("07902") ;     
    mdVL <- pickMortalityVars(metaData,contCode=contCode) ;
    eQuery <- createQueryFromDF(mdVL) ;
    eData <- getJSONData("07902",eQuery)
    eData$ContentsCode <- NULL # Drop trivial column
    eData
}

getMortalityDataTable <- function(contCode){
    eData <- getMortalityData(contCode)
    transformToColumnsData(eData)
}

getExpectedLifeTimeTable <- function(){
    getMortalityDataTable(3)
}

> ex19662017 <- getExpectedLifeTimeTable()  
> ex19662017[1:5,45:54]
    2008  2009  2010  2011  2012  2013  2014  2015  2016  2017
1  80.68 80.86 81.04 81.25 81.45 81.66 82.09 82.28 82.42 82.63
2  79.90 80.12 80.26 80.44 80.65 80.86 81.29 81.47 81.59 81.81
3  78.92 79.13 79.29 79.46 79.67 79.88 80.30 80.49 80.61 80.82
4  77.94 78.15 78.29 78.46 78.68 78.88 79.31 79.50 79.62 79.84
5  76.95 77.16 77.30 77.47 77.69 77.89 78.32 78.51 78.62 78.85


Our API access functions are intentionally simplistic. While we might be able to drop the (few) transformation steps for the query with "richer" API access functions, the main parts of the work are defining the query correctly and transforming the result to a suitable form - which I find is almost always the harder part when fetching data with this API.

An application example: Plotting life expectancy 1966-2017

OK, but isn't full access to all the data overkill? It may be, if much of the data is of little relevance to the point to be made. In this case, the 50+ year perspective we get by using all the data can provide new information if we are used to a shorter perspective. In particular, we can get more information on what changes and what does not, and on the nature of the changes.

We plot curves of the life expectancy at 67, 75, 80, 85, 90 and 95 years for all, women and men. We add the regression lines, which fit quite well for most of the curves - the really notable exceptions being men at 67 and 75 years, who had almost constant life expectancy before 1980, but have caught up during the last decades. We save the regression coefficients.

mkplot0 <- function(createPng=0) {

    pointsAndRegr <- function(eL,yrs=years,ages=c(67,eAges),col=1,newPlot=0,main="All") {
       rCoef <- matrix(0,nrow=length(ages),ncol=2) ; 
       for (i in 1:length(ages)) {   
        tS <-unlist(ages[i]+eL[(ages[i]+1),3:54]) ;
        rL  <- lm(tS~yrs) ;  regrL <- rL$coef ;       
        if (ages[i]==67)  plot(yrs,tS,type="l",ylim=c(75,100),lty=1,lwd=2,xlab="Year",ylab="Life expect",main=main,col=col)  
        else  points(yrs,tS,type="l",lty=2,lwd=2,col=col) ;
        abline(regrL[1],regrL[2],lty=4,col=col )
        rCoef[i,] <- regrL ;
       }
      rCoef  
    }

    computeCrossings <- function(rCoef) {
        nCat <- length(rCoef[,2]) ;
        for (i in 1:(nCat-1)) {
            xi <- (rCoef[i,1]-rCoef[i+1,1])/(rCoef[i+1,2]-rCoef[i,2])
            yi <- rCoef[i,1] + rCoef[i,2]*xi
            print(paste("xi: ",xi,"  yi: ",yi)) ;
        }
    }
    
    pointsAndCurves <- function(rCoef,ages=c(67,eAges),col=1,newPlot=0,main="Life expectancy increases") {
        if (newPlot==1) {
            plot(ages,100*rCoef[,2],ylim=c(0,10),xlab="Age",ylab="E(Increase 100 yrs)",main=main,col=col) ;
            points(ages,100*rCoef[,2],type="l",lty=4,col=col) ;
        }
        else  {
            points(ages,100*rCoef[,2],col=col) ;
            points(ages,100*rCoef[,2],type="l",lty=4,col=col) ;  
        }
    }

    relativePensions <- function(eL,yrs=years,d0=eLA[68,42],col=1,newPlot=0,main="Pension level with constant savings"  )  {
        if (newPlot==1) {
            plot(yrs,d0/eL[68,3:54],type="l",ylim=c(0.8,1.4),lty=1,lwd=2,xlab="Year",ylab="Relative pension level",main=main,col=col)
            abline(1,0)
        }
        else  points(yrs,d0/eL[68,3:54],type="l",lty=1,col=col,lwd=2)
    }
    
    X11(width=4,height=12)  ;
    if (createPng>0) png(file="life_expect_1.png",width=480,height=1200) ;
        
    par(mfrow=c(3,1)) ;  years <- 1966:2017 ;  eAges <- c(75,80,85,90,95) ;
  # Plot life  expectations for all   
    eLA <- eLTTable[eLTTable[,1]==0,] ;  rCoefA <- pointsAndRegr(eLA) ;
  # For men   
    eLM <- eLTTable[eLTTable[,1]==1,] ;    rCoefM <- pointsAndRegr(eLM,col=4,main="Men") ;
  # For women    
    eLF <- eLTTable[eLTTable[,1]==2,] ;     rCoefF <- pointsAndRegr(eLF,col=2,main="Women") ;
 
    if (createPng>0)  dev.off() ;
    else  X11(width=10,height=6) ;
    if (createPng>0) png(file="life_expect_2.png",width=800,height=480) ;
    par(mfrow=c(1,2)) ; 
  # 100 yr increases life expectancy  
    pointsAndCurves(rCoefA,col=1,newPlot=1) ;  pointsAndCurves(rCoefM,col=4) ;  pointsAndCurves(rCoefF,col=2) ;
  # Relative pensions 1966-2017
    relativePensions(eLA,col=1,newPlot=1) ; relativePensions(eLM,col=4) ; relativePensions(eLF,col=2) ;
    computeCrossings(rCoefA) ;
    if (createPng>0)  dev.off() ;
}


> mkplot0()
[1] "xi:  2088.83246231796   yi:  92.5295089875549"
[1] "xi:  2102.95157141019   yi:  93.5460896638841"
[1] "xi:  2134.54157204277   yi:  95.1915583982768"
[1] "xi:  2198.87945837088   yi:  97.2971644475882"
[1] "xi:  2321.47806129824   yi:  99.334490928148"
>

Trendline crossing: Inconsistency of extrapolations

Some reflection on the trendlines in fig 1 uncovers a basic problem with linear extrapolations. Because the trendline at 67 is clearly steeper than at 75, some time in the "future" those lines will cross, and from then on, the expected life span for 67-year-olds will be higher than for 75-year-olds! That time is in 2088, at 92.5 years, as shown by the output of the computeCrossings(rCoefA) function. Crossings for higher ages will occur later, but eventually, all curves will cross.

The "optimistic" interpretation of this is that there is a future life expectancy increase to happen at high ages which does not manifest today, and there is every reason to expect that will occur. But we must expect the size of this increase to be quite modest (yearly life expectancy increase at 90 is <2/100 yrs, and fairly constant) and not at all to be able to fully counteract the trendline crossing problem. Therefore, the only sensible interpretation is that the rates of life expectancy increases at all but the highest ages probably will decrease over time. The question is not if, but when.

Fig 1. Life expectations 1966-2017. Curves for all, men (blue) and women (red), bottom to top, 67, 75, 80, 85, 90, 95 years of age, with regression lines.

Fig 2.
A: Increases in expected life span per 100 years by age 67-95, based on the 1966-2017 regression lines. Curves for all, men (blue) and women (red). The average curve gets ever closer to the female curve with age, because mostly women survive to the highest ages.
B: Relative levels of pension, based on expected remaining life years at the normal retirement age of 67, 1966-2017, and a constant amount of funds (the principle in the current Norwegian system), normalized relative to the average expected life span at 67 in 2005 (when the system was designed). Curves for all, men (blue) and women (red). The actual pensions are gender-neutral in Norway, so this illustrates the basic redistribution effect of that principle.

Tuesday 20 March 2018

Accessing Norwegian official statistical tables using the JSON API from Statistics Norway with R

JSON (JavaScript Object Notation) has become a de facto standard for data exchange, but the structuring may in some cases be difficult to handle. Therefore, a "light" version especially suited for statistical data, called JSON-stat, has become quite popular for many applications. Among others, Statistics Norway chose this as the primary delivery format for data from Statistikkbanken, the "bank of statistical data", also using JSON as the query format.

Basically, a JSON query is HTTP POSTed to the API, using the table number as identification, and a JSON-stat formatted response is returned. A GET request to the same address provides a version of the metadata for the table. The data may then be converted to R data frames etc.

With the necessary R libraries installed and loaded, all operations can quite easily be performed from the command line in R, and via the standard user interface for Statistikkbanken one may obtain the JSON query needed. For routine use, though, it will in most cases be far easier to use a standardized set of functions for the basic data traffic, and one such small and simple set is presented here.

Basically, just two functions are normally used: getMetaData(tableId) for fetching table descriptions, and getJSONData(tableId,query,naming) for submitting a query and converting the response to an R data frame. A third function is provided for access to the raw data returned.

options(encoding="UTF-8")
library(httr)
# rjstat is used for converting SSB JSON -> Data frame
library(rjstat)
# jsonlite is used mainly for converting metadata 
library(jsonlite)
#source("ssb-json-functions.R")

# Fetches the meta data for a table, with number tableId, as a JSON structure
getMetaData <- function(tableId) {
  getUrl <- paste("http://data.ssb.no/api/v0/no/table/",tableId,sep="")
  d.tmp<-GET(getUrl)
  content(d.tmp,"text")
}

# getRawJSONData - Fetches SB-data by POST-request  tableId: Number of SB-table queryData: JSON-formatted query 
getRawJSONData <- function(tableId,queryData) {
#  POST query request
   d.tmp <- POST(paste("http://data.ssb.no/api/v0/no/table/",tableId,sep=""), body = queryData, encode = "json", verbose())
# Returns contents of d.tmp as JSON-formatted text 
   content(d.tmp, "text")
}

# getJSONData - Fetches SB-data by POST-request. Returns a data table for further processing
getJSONData <- function(tableId,queryData,naming="id") {
# Fetches the content processed by fromJSONstat
   sbtable <- fromJSONstat(getRawJSONData(tableId,queryData),naming=naming)
# Only dataset is used from sbtable and returned
   ds <- sbtable[[1]]
   ds
}

If no handling of metadata is to be performed, the two packages httr and rjstat will suffice, but for more flexible use of data and metadata jsonlite comes in handy. I have chosen to leave the verbose setting in the HTTP POST request "on", as it provides a bit more insight into what is happening.

Please note that at Statistics Norway, a new package ApiData by researcher Øyvind Langsrud, with a single user interface function ApiData, is now (March 2017) in internal use for these purposes. By the time you read this, it may be publicly available, and its use will be demonstrated in subsequent posts.

A typical JSON query and its execution

We will use a rather large table as an example, 07902, mortality data 1966-2017 (as per March 2018). The data are given by gender (variable Kjonn), age (AlderX), time (Tid) and four different contents types (ContentsCode).

I have wrapped the query in an R function rather than putting it into an R variable, as I have found it slightly easier to maintain and develop that way. YMMV. This query will retrieve the whole table, using the selection type "selection": {"filter": "all","values":["*"] }. The query has a query part and a response part - the latter, "response": {"format": "json-stat"}, is rather trivial and common to all json-stat queries on the site. To retrieve the whole table, it is not necessary to specify all variables, but I find complete specifications safer and over time simpler to work with - and they are easier to narrow down.

getQueryData07902 <- function() {
'{
  "query": [
  {
    "code":"Kjonn",
     "selection": {
    "filter": "all",
    "values":["*"] }
  },{
    "code":"AlderX",
     "selection": {
    "filter": "all",
    "values":["*"] }
  },{
    "code":"ContentsCode",
     "selection": {
    "filter": "all",
    "values":["*"] }
  },{
    "code":"Tid",
    "selection": {
    "filter": "all",
    "values":["*"] }
   }   ],
  "response": {
  "format": "json-stat"
  } 
}'
}

For repeated execution, the POST request may be wrapped into a function, returning a quite long R data frame:

getAllMortalityData07902 <- function(){
    getJSONData("07902",getQueryData07902())
}

> t07902 <- getAllMortalityData07902()
> length(t07902[,1])
[1] 66768

The first few lines look as follows, with the actual data in the value variable:

> t07902[1:5,]
   Kjonn AlderX    ContentsCode  Tid value
1      0    000 LevendePerTusen 1966 1e+05
2      0    000 LevendePerTusen 1967 1e+05
3      0    000 LevendePerTusen 1968 1e+05
4      0    000 LevendePerTusen 1969 1e+05
5      0    000 LevendePerTusen 1970 1e+05
Before we can use this, we must filter and transform it. This can be done in a lot of ways, and many tools are available. We shall start out quite simply, but first we save the data frame, to become independent of the connection. We also load it again, to make sure the saving works. It will be read from file into an object with the original name, here t07902.

> save(t07902,file="../data/t07902-2018.RData")
> load("../data/t07902-2018.RData")
Because all the transformations necessary for later use can be gathered in a single R function, there is, generally, not much to gain by transforming before saving. Unless the data are to be used in only one way.

Simple transformations

To give a fairly simple example of working from scratch, we will reconstruct a version of the 2017 mortality table from our data frame. We first filter out the 2017 data, then we pick the 4 different variables according to their ContentsCode values. Then we prepare a data frame with the necessary background data, and we merge this with the variables to create a mortality table data frame. We might also omit Tid here, as it is trivially 2017; for identification when we are handling several periods at once, keeping it could however be a precautionary measure.

One basic way, step by step

In the data frame, the data are sorted by sex and age, but it might be a bit risky to rely solely on this implicit order instead of filtering data by values.

We delete columns by setting them to NULL: b2017$value <- NULL ; b2017$ContentsCode <- NULL ;

> t2017 <-  t07902[t07902$Tid==2017,]
> table(t2017$ContentsCode)

             Dode Dodssannsynlighet   ForvGjenLevetid   LevendePerTusen 
              321               321               321               321 
> 
> lx <-  t2017[t2017$ContentsCode=="LevendePerTusen",5]
> ex <-  t2017[t2017$ContentsCode=="ForvGjenLevetid",5]
> qx <-  t2017[t2017$ContentsCode=="Dodssannsynlighet",5]
> dx <-  t2017[t2017$ContentsCode=="Dode",5]
> t2017d <-  t2017[t2017$ContentsCode=="Dode",]
> b2017 <- t2017d ;
> b2017$value <- NULL ; b2017$ContentsCode <- NULL ;
> dt2017 <- data.frame(b2017,lx,qx,dx,ex)

> dt2017[1:5,]
    Kjonn AlderX  Tid     lx    qx  dx    ex
104     0    000 2017 100000 2.236 224 82.63
312     0    001 2017  99776 0.150  15 81.81
520     0    002 2017  99761 0.165  16 80.82
728     0    003 2017  99745 0.115  11 79.84
936     0    004 2017  99733 0.113  11 78.85

>
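
As noted above, relying on the implicit row order is a bit risky. A hedged sketch of a merge-based alternative for two of the variables (the remaining two follow the same pattern):

lx2017 <- t2017[t2017$ContentsCode=="LevendePerTusen",c("Kjonn","AlderX","value")]
dx2017 <- t2017[t2017$ContentsCode=="Dode",c("Kjonn","AlderX","value")]
names(lx2017)[3] <- "lx" ; names(dx2017)[3] <- "dx"
dt2017m <- merge(lx2017,dx2017,by=c("Kjonn","AlderX"))    # joins on the key columns, not on row order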

We can make a simple plot, for example of life expectancy, to check the transformations and look at the data.

> dt2017All <- dt2017[dt2017$Kjonn==0,]
> dt2017M <- dt2017[dt2017$Kjonn==1,]
> dt2017F <- dt2017[dt2017$Kjonn==2,]
> plot(as.numeric(dt2017All$AlderX),dt2017All$ex)
> points(as.numeric(dt2017All$AlderX),dt2017M$ex,type="l",col=4)
> points(as.numeric(dt2017All$AlderX),dt2017F$ex,type="l",col=2)

Making a function of it

This exercise would hardly be something you want to repeat many times to get mortality tables for several years, or when new data become available. But by making a function of it, with year and data frame as parameters, handling becomes fairly efficient. Provided the data structure does not change, updating or creating analyses for a new year may then be done quickly, with little thought about the data formats and transformations involved.

mortTableYear <- function(year,df=t07902) {
    tYr <-  df[df$Tid==year,] ; tYrD <-  tYr[tYr$ContentsCode=="Dode",] ;
    bYr <- tYrD ; bYr$value <- NULL ; bYr$ContentsCode <- NULL ; bYr$Tid <- NULL ;
    cCodes <- c("Dode","Dodssannsynlighet","ForvGjenLevetid","LevendePerTusen") 
    mData <- matrix(0,nrow=length(bYr[,1]),ncol=4)
    for (i in 1:4) mData[,i] <- tYr[tYr$ContentsCode==cCodes[i],5] 
    dfYr <- data.frame(bYr,mData)
    names(dfYr) = c("Kjonn","AlderX","dx","qx","ex","lx") ;
    dfYr 
}

> m2016 <- mortTableYear(2016)
> m2016[1:5,]
     Kjonn AlderX  dx    qx    ex     lx
103      0    000 216 2.158 82.42 100000
311      0    001  18 0.183 81.59  99784
519      0    002  13 0.132 80.61  99766
727      0    003   5 0.049 79.62  99753
935      0    004   5 0.048 78.62  99748


Generally, there is little reason to work with an awkward organization of R data if the problem at hand is not a one-off case. In most cases, transformations are relatively easy to set up and apply. For example, we don't need to decide whether we want to organize the data by years or by variables - we can do both, as the small example below shows.
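
For instance, the same data frame can just as well be sliced by variable and age across all years; a small sketch (the zero-padded age code "090" is an assumption, while Kjonn 2 denotes women as in the plots above):

d90F <- t07902[t07902$ContentsCode=="Dode" & t07902$AlderX=="090" & t07902$Kjonn==2,c("Tid","value")]
plot(as.numeric(d90F$Tid),d90F$value,type="l",xlab="Year",ylab="Female deaths at age 90 (per 100000 born)")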

Here is a function version of the above plot, complete with parameter for png-saving. The function-within-a-function pattern is efficient for factoring out common features of function calls, and defining and handling a local context. Local data in an outer function are "global" in the inner functions, in a kind of Poor Man's Object Orientation.

testPlotEX <- function(df,savePng=0) {
    plotter <- function(y,newP=0,col=1) {
        if (newP==1) plot(as.numeric(dfA$AlderX),y,type="l",xlab="Age",ylab="Expected years left",col=col)
        else points(as.numeric(dfA$AlderX),y,type="l",col=col)
        legend(65,75,col=c(1,2,4),lty=c(1,1,1),legend=c("All","Women","Men"))
    }

    X11()
    if (savePng>0) png(file="life_expect_0.png") ;
    dfA <- df[df$Kjonn==0,] ;  dfM <- df[df$Kjonn==1,] ; dfF <- df[df$Kjonn==2,] ;
    plotter(dfA$ex,newP=1) ;  plotter(dfM$ex,col=4) ;  plotter(dfF$ex,col=2) ;
    if (savePng>0) dev.off() ;
}

>
> testPlotEX(dt2017,savePng=0) 
> testPlotEX(dt2017,savePng=1) 
>

NB! This is work in progress :-)

Monday 19 March 2018

Prime sieving: Eratosthenes as an R-user


Prime sieving is an ancient and (still) fairly efficient way of finding primes. In addition, it may fit well as a part of benchmark suites, and different implementations may provide interesting data for performance comparisons.

This applies in particular to R, where the "don't loop if you can avoid it" principle could possibly be challenged in cases where a large number of vector operations is the alternative to looping, and one ends up with a fair number of loops anyway.

The sieve of Eratosthenes (276-194/195 BC) provides a prime (no pun intended) example of this.

The sieving process: Looping

Given a number N, the "classical" version of the sieve marks off the multiples of each prime number less than or equal to sqrt(N) in an array. The lowest unmarked number is the next prime, and when we have done this for all the candidates, we know the remaining unmarked numbers are primes. As a small optimization, for a prime p we may start the marking at p^2, since the lower multiples of p have already been marked off by smaller prime factors.

A simple and straightforward implementation is the function sieveLoop:

sieveLoop <- function(MAXN=1000){
  numbers <- seq(1:MAXN) ; 
  maxSmallFactor <- round(sqrt(MAXN)) ;
  currPrime <- 2 ;   numbers[1] <- 0 ;
  while(currPrime <= maxSmallFactor) {
      checkNum <- currPrime*currPrime ;
      while(checkNum<=MAXN){
          numbers[checkNum] <- 0 ;
          checkNum <- checkNum + currPrime ;
      }
      currPrime <- currPrime + 1 ;
      while (numbers[currPrime]==0)  currPrime <- currPrime + 1 ;
  }
  numbers[numbers!=0]
}

In this case, the only place where vector operations are used non-trivially is the return statement, where the non-zero entries are returned in an array, but as that occurs just once per sieving process, the impact on the run-time behaviour should be quite small.

Sieving using vector operations

The function sieveVector uses vector operations instead of loops for the basic sieving process, removing all numbers congruent to zero modulo the current prime, and using the first number in the resulting array as the next prime. That number is then stored in an array of found primes. This way, there is no need for a final sieving either, as the numbers remaining when we have sieved up to sqrt(N) must be primes. Thus, we may just concatenate our array of found primes up to sqrt(N) with the remaining numbers > sqrt(N) to get the result.

sieveVector <- function(MAXN=1000){
    numbers <- 3:MAXN
    maxSmallFactor <- round(sqrt(MAXN)) ;
    currPrime <- 2 ; primes <- c(2) 
    while(currPrime <= maxSmallFactor) {
        numbers <- numbers[numbers %% currPrime!=0]   # drop the current prime and all its multiples
        currPrime <- numbers[1] ;                     # the smallest survivor is the next prime
        primes <- c(primes,currPrime)
    }
    # the last found prime (> sqrt(MAXN)) is still the first element of numbers,
    # so drop it from primes before concatenating to avoid listing it twice
    primes <- c(primes[-length(primes)],numbers)
    primes
}

This "R-centric" approach seems to be much better, and in particular for smaller N indeed it is. For N<1000000 it is about 10x faster than the loop approach, and when setting up a comparison I expected the vector method to be uniformly the unchallenged winner, no matter what.

Testing

I set up a function testSieve, using R's simple system timer, doing the same sieving with both methods and returning N and the total times (in seconds). This function was then called by another, comparePrimeSieves, which calls it for powers of 2 times a starting size (default 10000) and returns the full series of results as rows in a matrix.

testSieve <- function(MAXN) {
    print(t1 <- system.time(p1 <- sieveLoop(MAXN)))
    print(t2 <- system.time(p1 <- sieveVector(MAXN)))
    c(MAXN,t1[3],t2[3])
}

comparePrimeSieves <- function(firstN=10000,nSteps=10,stepSize=2){
    currSize <- firstN ;
    resM <- matrix(0,nrow=nSteps,ncol=3,byrow=T)
    for (i in 1:nSteps) {
        t1 <- testSieve(currSize) ;
        resM[i,] <- c(currSize,t1[2],t1[3])    
        currSize <- currSize * stepSize ;
        if (currSize>500000) print(paste("CurrSize: ",currSize)) ;
    }
    resM
}

Running this comparison for 10000·(2..2^14) = 20 000..163 840 000, we should get a good impression of the performance.
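
The matrix returned by comparePrimeSieves can then be turned into the log-log plot and regression fits shown in fig 1 below. A hedged sketch (not the plotting code actually used):

plotSieveTimes <- function(resM) {
    # resM: column 1 = N, column 2 = loop time, column 3 = vector time (in seconds)
    # (timings of exactly 0 at very small N would break the logarithms)
    plot(log10(resM[,1]),log10(resM[,2]),col=2,xlab="log10(N)",ylab="log10(T)")
    points(log10(resM[,1]),log10(resM[,3]),col=3)
    fitLoop <- lm(log10(resM[,2]) ~ log10(resM[,1]))
    fitVec  <- lm(log10(resM[,3]) ~ log10(resM[,1]))
    abline(fitLoop,col=2) ; abline(fitVec,col=3)
    c(loopExponent=unname(coef(fitLoop)[2]),vectorExponent=unname(coef(fitVec)[2]))  # slopes estimate k in T ~ N^k
}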

Performance

Fig 1. Running times for loop (red) and vector (green) versions of the sieve, with regression lines. T in microseconds in the regression equations.

The results are presented in a log-log plot in figure 1. I was in for a surprise. Both methods produced linear plots, but while the looping method was approximately linear in N, the vector method followed N^1.28 (an exponent close to the cube root of 2). And for the highest N tested, the linear fit was rather bad. So while the vector method was still faster there, the difference had shrunk to 389 vs 310 seconds.

A variant where the vector of primes was allocated up front instead of being extended with each prime (the primes <- c(primes,currPrime) step) ran about 5% faster for N=25000000, but otherwise showed qualitatively the same behavior.
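
For reference, a sketch of what such a preallocated variant might look like (my reconstruction, not the code that was actually benchmarked):

sieveVectorPrealloc <- function(MAXN=1000){
    numbers <- 3:MAXN
    maxSmallFactor <- round(sqrt(MAXN)) ;
    primes <- integer(maxSmallFactor) ;        # room enough for the primes <= sqrt(MAXN) plus one
    primes[1] <- 2 ; nPrimes <- 1 ; currPrime <- 2 ;
    while(currPrime <= maxSmallFactor) {
        numbers <- numbers[numbers %% currPrime!=0]
        currPrime <- numbers[1] ;
        nPrimes <- nPrimes + 1 ; primes[nPrimes] <- currPrime ;
    }
    # as in sieveVector, the last stored prime is also numbers[1], so drop it before concatenating
    c(primes[1:(nPrimes-1)],numbers)
}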

Except for the obvious confirmation of the effectiveness of the vector approach, my two takeaways from this experiment are that full-scale checking is necessary to safely choose between functionally equivalent approaches, and that an adequate general algorithm doesn't necessarily have to be morphed into an R-adapted incarnation to be useful in R: depending on the purpose, for occasional use, the performance of the "foreign" looping version could in this case be adequate.

Prime number theorem

The occurrence of primes is highly irregular. Can we still say something about how many primes are to be expected in a certain number range? It is standard notation to denote the number of primes <= N by π(N), so given the interval (N_1, N_2), we are looking for an estimate of π(N_2) - π(N_1).

Fig 2. N/π(N) vs log(N)
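
A minimal sketch (not the post's code) of how such a plot can be produced with sieveVector from above: compute π(N) for a grid of N and plot N/π(N) against log(N).

primeCountRatio <- function(MAXN=1000000,nPoints=50) {
    primes <- sieveVector(MAXN)
    Ns <- round(seq(5000,MAXN,length.out=nPoints))
    piN <- sapply(Ns,function(n) sum(primes<=n))      # pi(N) by counting the sieved primes
    plot(log(Ns),Ns/piN,xlab="log(N)",ylab="N/pi(N)")
    abline(0,1,lty=4)                                 # reference line N/pi(N) = log(N)
    invisible(data.frame(N=Ns,piN=piN))
}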

The striking regularity in fig 2 was soon discovered by mathematicians: N/π(N) follows log(N) closely, with the relation already apparent for N<5000. For a long time, the prime number theorem was a conjecture, until it was first established (independently) by Jacques Hadamard and Charles Jean de la Vallée-Poussin in 1896. They used fairly heavy mathematical machinery, but in 1949 the first "elementary" proofs were published, by Atle Selberg and Paul Erdős. The shortest and simplest proof to date is considered to be Donald J. Newman's from 1980. It is "non-elementary" only in the sense that it uses Cauchy's integral theorem from complex analysis.

There are several heuristic motivations for the theorem. One of the more thorough is presented in a blog entry by Terry Tao, where he introduces, among other number-theoretic concepts, the von Mangoldt function and Möbius inversion. There is no room for going into any details here, but the reader is encouraged to try some digging into these areas of number theory. I think it is very rewarding.

R as a tool for experimentation

I had not expected R to be as efficient at sieving millions of primes, using simple implementations with no attempts at adaptations or optimizations. So it seems to be an even better tool for experimentation than I had thought. The important quality to look for here is to be able to focus on the problems, not implementation and optimization details.