Gompertz rules: A 50-year perspective on "contemporary" mortality

There are many ways to describe human mortality, the "best" way will often depend on the purpose. It is influenced by (among others) genetical, environmental, social and behavioral factors. With so many different factors influencing it, one single functional form may hardly be expected optimal for all modeling of life times. But some persistent patterns may emerge, and this is neatly demonstrated by recent Norwegian mortality data confirming the 1810-1825 observations of Benjamin Gompertz (1779-1865). Drawing on several different data sources, Gompertz found that the logarithm of the hazard rate for the elderly seemed to always be approximately a linear function of age, and we see the same basic pattern today: In this respect, Gompertz rules.

The hazard rate of a distribution with cdf F and pdf f, is defined as h(x) = f(x)/(1-F(x)) = d/dx log(1-F(x)). It has an immediate interpretation and application: With N(x) individuals at age x, and hazard rate h(x), the probability of one dying within one year is h(x), and we can expect approximately N(x)h(x) to die in that period. Given an exact h(x), F(x) is determined modulo eventual constants, but in most cases, the information we obtain about the hazard rate may be consistent with different types of distributions. The exponential hazard rate seen in human mortality is kind of limit case in this respect.

Gompertz published a comprehensive report on this in 1825, but did not dwell much on systematic deviations from the linear relationship at younger ages. In 1860, William Makeham (1826-1891) improved on the fit for younger ages by extending the Gompertz model with a constant added to the hazard rate. The resulting distribution is still widely applied as the Gompertz-Makeham law of mortality. It tends to fit fairly well from 15-20 to about 100 years if age.

In 1860, the high overall mortality made the Makeham term highly significant, but most developed countries today have a very low mortality at younger ages, which makes it less practically important. For example, in Norway today only 0.5-1% of females die between ages 20 and 40, and about 98% of female newborns are expected to reach the age of 50.

Gompertz' observation: Mortality follows a geometric progression

Gompertz discovered that whatever adult age he started counting from, mortality n years after, as measured by hazard rate h_n, could be approximately expressed as a geometrical progression of the initial value h₀:
h_n = h₀αⁿ
Taking logarithms, we get
log(h_n) = log(h₀) + n log(α)
I.e. a linear function of time (age) n, and whenever this is approximately correct, the Gompertz distribution will be a (more or less accurate) model for the life times. The Gompertz cumulative distribution function is fairly simple, here is the parametrization used in R's flexsurv package:
F(x) = 1 - exp(-b/a (e^ax-1))
The hazard rate becomes h(x) = f(x)/(1-F(x)) = b e^ax. The corresponding Gompertz-Makeham rate is h(x) = b e^ax + λ

An interpretation of a, is that the mortality doubles in (log(2)≈0.69)/a years. E.g. with a=0.10, mortality doubles every 7 years. This period varies widely, but 5-10 years is a normal range. When overall mortality decreases, the doubling time often decreases, i.e. a increases.

The linear log-hazard function may seem unreasonable

The Gompertz rule has very strong assertions (linearity over an extremely wide range of mortality rates) and an obvious way it breaks down: At some age x_max, we will have a*x_max + log(b) > 0 . That means for P(death) in the next period δx =1 : P(death) ≈ b exp(a*x_max)*δx > 1 So death in the next period is more than guaranteed :-) Therefore, we must expect that from some high age on, this rate will be constant or even fall a bit. But what is the practical relevance of that?

I was more concerned about the shortcomings before looking at the Norwegian data. It turns out that the deviations from linear log-mortality has not increased over time, in spite of a tremendous increase in people reaching very old age. Rather, the practically important deviations mostly occur between 70 and 95, at around 100, Norwegian mortality follows standard Gompertz quite closely. It also turns out the the regression line in question tends to "rotate": Basic mortality have been steadily decreasing, while the age-dependent part has tended to increase lately, both for men and women. The mortality at 95-100 is rather stable, in spite of the huge increase in the number of persons reaching that age. Also, the gender differences in those parameters were much larger 50 years ago.

Used on the Norwegian data, the linear log-hazard function is reasonable. It is not in urgent need for any of the generalizations that have been proposed, and even the Makeham correction term is of only marginal practical significance. But it does not predict the number of deaths 70-90 in recent years very precisely. For ages ca 70-80, it overestimates, and for ca 83-93, it underestimates mortality. A modification which operates "locally" and addresses this shortcoming might be of practical significance.

Plotting Gompertz mortality

Rather than estimating the Gompertz parameters (a,b) by maximum likelihood estimation (MLE), we have focused on the age range 50-90, approximating the log-mortality function by linear regression. In this, we follow (loosely) the method presented in demography course material by Germán Rodríguez of Princeton.

We construct the age vs deaths (x,dx) data from the lx column of the mortality table. Parameters are estimated, and plots of data and residues presented for selected years: 1967,1997,2007 and 2017.

The time development of the two parameters are also plotted, separately and vs each other. An "educated guess" of future values may be made from these plots: First, extrapolating the log(b)-value for the actual year, and then getting the a-value from the a vs log(b) linear regression from latter years. Drawing from the "guessed" Gompertz distribution may provide some basic information on future mortality, but the model fit for age distribution of deaths will probably not be very good for ca 70-90 years of age.

Because we focus on the age interval where >90% of deaths occur (ca 65-105), we don't consider the Makeham adaptation here - it is mostly influential below 50 years of age, and its inclusion does not help that much for the systematic deviations we observe. Nor do we explore the most immediate generalizations of the Gompertz distribution, e.g. x -> x^k in the exponent: F(x) = 1 - exp(-b/a (e^{ax^k}-1)). Or Perks type distributions with hazard rate h(x) = (b exp(ax) + c)/(d exp(ax) + 1) .

Some R-code

We prepare a table with age and sex as background variables, and columns with lx for the years 1966-2017. All years are processed, and plots are produced for selected years and time evolution of the parameters.

paramEstimYear <- function(dF,mkPlot=0,yr=1967,regrStart=50) {
   prepareHazard <- function(hdF,plotcol=4,yr,regrStart=50,plotNum=1) {
        hdF$lx <- ifelse(hdF$lx>0,hdF$lx,2) 
        lxm <- mutate(hdF, lx = lx/100000, Hx = -log(lx)) ;
        lxhm <- data.frame(hx = diff(lxm$Hx), x = midpoints(lxm$age))
        lfm <- lm(log(hx) ~ I(x - regrStart), data = filter(lxhm, (x >=regrStart)&(x<=90)))
        lxhmf <- filter(lxhm, (x >= regrStart)&(x<=90)) %>%  mutate(hf = exp(fitted(lfm)))
      
        if (mkPlot==1) {
           if (plotNum==1) mainTitle <- paste("Year: ",yr) else mainTitle <- " " ;
           plot(lxhm$x,log(lxhm$hx),ylim=c(-11,0),col=plotcol,xlab="Age x",ylab="log(h(x)",main=mainTitle)
           abline(lfm$coef[1]-regrStart*lfm$coef[2],lfm$coef[2],col=plotcol,lty=4)
           plot((regrStart+1):90,lfm$resid,ylim=c(-0.7,0.7),col=plotcol,
           xlab="Age x",ylab="Residuals of log(h(x)")
        }
        list(lxhm=lxhm,lfcoef=lfm$coef,lfres=lfm$resid)  
    }
    
    yrM <- data.frame(dF$age,dF$maleYear) ;  yrF <- data.frame(dF$age,dF$femYear) ;
    names(yrM) <- c("age","lx") ;  names(yrF) <- c("age","lx") ;
    yrHM <- prepareHazard(yrM,plotcol=4,yr,regrStart,plotNum=1) ;
    yrHF <- prepareHazard(yrF,plotcol=2,yr,regrStart,plotNum=2) ;
    list(yrHM=yrHM,yrHF=yrHF)
}


paramEstimSeriesOfYears <- function(sT,years=1967:2017,mkPlot=0,regrStart=50) {
    resM <- matrix(0,nrow=length(years),ncol=5) ; resM[,1] <- years ;
    for (yr in 1:length(years)) {
        yearData <- prepareYear(sT,as.character(yr+years[1]-1)) ;
        pE <- paramEstimYear(yearData,regrStart=regrStart) ;
        resM[yr,2] <- pE$yrHM$lfcoef[1] ;  resM[yr,3] <- pE$yrHM$lfcoef[2] ;
        resM[yr,4] <- pE$yrHF$lfcoef[1] ;  resM[yr,5] <- pE$yrHF$lfcoef[2] ;
    }
    resM
}

plotMortalityParameters <- function(resM,regrStart=50,lastPerStart=2007,savePng=0) {

    yrs <- resM[,1] ; a0M <- resM[,2] - resM[,3]*regrStart ;  a0F <- resM[,4] - resM[,5]*regrStart ;
    aFitM <- lm(a0M~yrs) ;    bFitM <- lm(resM[,3]~yrs) ;
    aFitF <- lm(a0F~yrs) ;    bFitF <- lm(resM[,5]~yrs)
    aCoefM <- aFitM$coef ; bCoefM <- bFitM$coef ;
    aCoefF <- aFitF$coef ; bCoefF <- bFitF$coef ;

    aFitM25<- lm(a0M[yrs>lastPerStart]~resM[yrs>lastPerStart,1]) ;    
    bFitM25 <- lm(resM[yrs>lastPerStart,3]~resM[yrs>lastPerStart,1]) ;
    aFitF25 <- lm(a0F[yrs>lastPerStart]~resM[yrs>lastPerStart,1]) ;    
    bFitF25 <- lm(resM[yrs>lastPerStart,5]~resM[yrs>lastPerStart,1])
    aCoefM25 <- aFitM25$coef ; bCoefM25 <- bFitM25$coef ;
    aCoefF25 <- aFitF25$coef ; bCoefF25 <- bFitF25$coef ;    

    abM25<- lm(resM[yrs>lastPerStart,3]~a0M[yrs>lastPerStart]) ; 
    abF25<- lm(resM[yrs>lastPerStart,5]~a0F[yrs>lastPerStart]) ;  
    abCoefM25 <- abM25$coef ; abCoefF25 <- abF25$coef ;
    print(paste(abCoefM25, abCoefF25))
   
    if (savePng>0) png(filename='gomp_f1.png',width=1000,height=600)
    else  X11(width=12,height=6)
    par(mfrow=c(1,3))
    plot(yrs,a0M,col=4,ylim=c(-12.5,-9.5),xlab="Year",ylab="Basic mortality",main="Parameter a in y=a+bx")
    points(yrs,a0F,col=2)
    abline(aCoefM[1],aCoefM[2],col=4)
    abline(aCoefF[1],aCoefF[2],col=2)
    abline(aCoefM25[1],aCoefM25[2],col=4,lty=3)
    abline(aCoefF25[1],aCoefF25[2],col=2,lty=3)  
    legend(1990,-4.75,pch=c(1,1),col=c(4,2),legend=c("Males","Females"))
    
    plot(resM[,1],resM[,3],col=4,ylim=c(0.09,0.12),xlab="Year",ylab="Aging mortality component",
       main="Parameter b in y=a+bx")
    points(resM[,1],resM[,5],col=2)
    abline(bCoefM[1],bCoefM[2],col=4)
    abline(bCoefF[1],bCoefF[2],col=2)
    abline(bCoefM25[1],bCoefM25[2],col=4,lty=3)
    abline(bCoefF25[1],bCoefF25[2],col=2,lty=3)

    plot(a0M,resM[,3],col=4,xlim=c(-12.5,-9.5),ylim=c(0.09,0.12),xlab="a-parameter",
         ylab="b-parameter",main="b vs a parameters in y=a+bx")
    points(a0F,resM[,5],col=2)
    abline(abCoefM25[1],abCoefM25[2],lty=4,col=4)
    abline(abCoefF25[1],abCoefF25[2],lty=4,col=2)
   
    if (savePng>0) dev.off()
    else dev.copy2eps(device=x11,file='gomp_f1.eps') ;
 
    c(aCoefM25[1],aCoefM25[2],aCoefF25[1],aCoefF25[2],bCoefM25[1],bCoefM25[2],bCoefF25[1],bCoefF25[2])
}

processingGompertz <- function(mkPlotSlct=1,regrStart=50,savePng=0) {
    sT <- getSurvivalTable()
    resM <- paramEstimSeriesOfYears(sT,years=1967:2017,mkPlot=0,regrStart=regrStart) 
    regrC <- plotMortalityParameters(resM,savePng=savePng)

    if (mkPlotSlct==1) {
        if (savePng>0) png(filename='gomp_f2.png',width=1000,height=1200)
        else X11(height=12,width=12) ;
        par(mfrow=c(4,4)) ;
        y1967 <- prepareYear(sT,"1967");  pE <- paramEstimYear(y1967,mkPlot=1,yr=1967,regrStart=regrStart);
        y1997 <- prepareYear(sT,"1997");  pE <- paramEstimYear(y1997,mkPlot=1,yr=1997,regrStart=regrStart);
        y2007 <- prepareYear(sT,"2007");  pE <- paramEstimYear(y2007,mkPlot=1,yr=2007,regrStart=regrStart);
        y2017 <- prepareYear(sT,"2017");  pE <- paramEstimYear(y2017,mkPlot=1,yr=2017,regrStart=regrStart);
        
        if (savePng>0) dev.off()         
        else dev.copy2eps(device=x11,file='gomp_f2.eps') ;
    }
    list(regrC=regrC,resM=resM)
}

Git repository for this blog: https://github.com/tyder-no/r-and-friends

Mortality: Basic patterns and some systematic deviations

The basic qualitative pattern of the log-hazard rate plots has hardly changed over 50 years, but there are a number of features to note:

During the last 30 years, there has been a fairly stable deviation from linearity, as illustrated by the residuals for 70-90 years in most cases having an increasing linear tendency.
Mortality at 93-100 years is "back on track", in spite of these years not being included in the regression fit.
There is a roughly linear "Makeham-type" deviation ca 25-50, horizontal for men, but not for women.
The last 30 years, strong deviations from linear log-mortality at 50+ only occurs above 100 years of age.

Fig 1. Log mortality plots for 1967, 1997, 2007 and 2017. Males (blue) and females (red), residual plots for the regression fit, which is performed for the interval (50,90).

Time evolution of the hazard rate parameters

A common feature is that the "basic mortality", as described by the b-parameter in the linear fit to the log(hazard rate), has been declining for both men and women, with a much faster rate for men - but from a much higher level. After 2010, this rate has continued to fall, but now men's rates are similar to women's - whose reduction may in turn have accelerated the last 10 years. But the absolute levels are low.

50 years ago, women "aged" quite a bit faster than men, as indicated by the a-parameter in the linear fit to the log(hazard rate). While it decreased for some decades for women, it now seems to increase for both sexes, and coefficients are similar.

Fig 2. Panels 1 and 2: Gompertz mortality parameters a and b for males and females 1967-2017, with regression lines for the whole period and for 2007-2017. Note than while these regression fits for men are similar for the two periods, for women they deviate strongly. Panel 3 shows a vs b with regression lines 2007-2017. Changes in the parameters have been strongly correlated in the recent years. Note that the "b" parameter in the plot is actually log(b) from the distribution function, while the "a" parameter is the same.

R and friends - Computational strolls

Monday, 2 April 2018