Monday, 19 March 2018

Prime sieving: Eratosthenes as an R-user


Prime sieving is an ancient and still fairly efficient way of finding primes. In addition, it fits well into benchmark suites, and different implementations may provide interesting data for performance comparisons.

This applies to R in particular, where the "don't loop if you can avoid it" principle could be challenged in cases where a large number of vector operations is the alternative to looping, and one ends up with a fair number of loops anyway.

The sieve of Eratosthenes (c. 276-195/194 BC) provides a prime (no pun intended) example of this.

The sieving process: Looping

Given a number N, the "classical" version of the sieve marks off the multiples of each prime number less than or equal to sqrt(N) in an array. The lowest unmarked number is the next prime, and when we have done this for all candidates, we know the remaining unmarked numbers are primes. As a small optimization, for a prime p we may start the marking at p², since the lower multiples of p have already been marked off as multiples of smaller primes (for p = 5, say, the multiples 10, 15 and 20 were already marked off as multiples of 2 and 3).

A simple and straightforward implementation is the function sieveLoop:

sieveLoop <- function(MAXN=1000){
  numbers <- 1:MAXN
  maxSmallFactor <- floor(sqrt(MAXN))
  currPrime <- 2
  numbers[1] <- 0L                      # 1 is not a prime
  while(currPrime <= maxSmallFactor) {
      # Mark off the multiples of currPrime, starting at its square:
      # smaller multiples were already marked off by smaller primes
      checkNum <- currPrime*currPrime
      while(checkNum <= MAXN){
          numbers[checkNum] <- 0L
          checkNum <- checkNum + currPrime
      }
      # Advance to the next unmarked number, i.e. the next prime
      currPrime <- currPrime + 1
      while (numbers[currPrime] == 0) currPrime <- currPrime + 1
  }
  numbers[numbers != 0]                 # the unmarked numbers are the primes
}
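
A quick sanity check (my own illustration, not part of the original timing runs):

sieveLoop(30)
#  [1]  2  3  5  7 11 13 17 19 23 29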

In this case, the only place where vector operations are used non-trivially is the return statement, where the non-zero entries are returned as a vector. As that happens just once per sieving process, the impact on the run-time behaviour should be quite small.

Sieving using vector operations

The function sieveVector uses vector operations instead of loops for the basic sieving process: the first remaining number is always the next prime, so it is appended to a vector of found primes, and then all numbers divisible by it are removed in a single vector operation. This way, there is no need for a final sieving pass either, as the numbers remaining once we have sieved up to sqrt(N) must all be prime. Thus, we may just concatenate the vector of found primes up to sqrt(N) with the remaining numbers > sqrt(N) to get the result.

sieveVector <- function(MAXN=1000){
    numbers <- 2:MAXN
    maxSmallFactor <- floor(sqrt(MAXN))
    primes <- integer(0)
    while(numbers[1] <= maxSmallFactor) {
        # The first remaining number is always the next prime; record it,
        # then remove it and all its multiples in one vectorized step
        currPrime <- numbers[1]
        primes <- c(primes, currPrime)
        numbers <- numbers[numbers %% currPrime != 0]
    }
    # Everything left is > sqrt(MAXN) and unmarked, hence prime
    c(primes, numbers)
}
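
As a quick consistency check (my addition, not from the original post), the two implementations can be compared directly:

# Both sieves should produce exactly the same primes
stopifnot(identical(sieveLoop(100000), sieveVector(100000)))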

This "R-centric" approach seems to be much better, and in particular for smaller N indeed it is. For N<1000000 it is about 10x faster than the loop approach, and when setting up a comparison I expected the vector method to be uniformly the unchallenged winner, no matter what.

Testing

I set up a function testSieve which uses R's system.time to do the same sieving with both methods, returning N and the two total (elapsed) times in seconds. This function was then called by another, comparePrimeSieves, which invokes it for a starting size (default 10000) multiplied by successive powers of 2, and returns the full series of results as rows of a matrix.

testSieve <- function(MAXN) {
    print(t1 <- system.time(p1 <- sieveLoop(MAXN)))
    print(t2 <- system.time(p2 <- sieveVector(MAXN)))
    # Return N and the elapsed times of the two methods
    c(MAXN, t1[3], t2[3])
}

comparePrimeSieves <- function(firstN=10000,nSteps=10,stepSize=2){
    currSize <- firstN
    resM <- matrix(0, nrow=nSteps, ncol=3)
    for (i in 1:nSteps) {
        res <- testSieve(currSize)
        resM[i,] <- res                  # N, loop time, vector time
        currSize <- currSize * stepSize
        # Progress indicator for the long-running larger sizes
        if (currSize > 500000) print(paste("CurrSize: ", currSize))
    }
    resM
}

Running this comparison for 10000*(2¹..2¹⁴) = 20000..163 840 000, we should get a good impression of the performance.
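
The exact call is not shown here; a run along these lines might look like this (arguments reconstructed from the range above):

# 14 doublings: from 20000 up to 20000 * 2^13 = 163 840 000
resM <- comparePrimeSieves(firstN = 20000, nSteps = 14)
colnames(resM) <- c("N", "loopTime", "vectorTime")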

Performance

Fig 1. Running times for loop (red) and vector (green)
versions of the sieve, with regression lines.
T in microseconds in regression equations

The results are presented in a log-log plot in figure 1, and I was in for a surprise. Both methods produced close-to-linear log-log plots, but while the looping method was approximately linear in N, the vector method followed N^1.28 (an exponent close to the cube root of 2). And for the highest N tested, the linear fit was rather bad. So while the vector method was still faster there, the difference had shrunk to 389 vs. 310 seconds.
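
The exponents can be estimated directly from the timing matrix with a log-log regression; a minimal sketch, assuming the resM matrix from above and strictly positive timings:

# Fit T ~ c*N^k on the log scale; the slope estimates the exponent k
fitLoop   <- lm(log(resM[, 2]) ~ log(resM[, 1]))
fitVector <- lm(log(resM[, 3]) ~ log(resM[, 1]))
coef(fitLoop)[2]     # close to 1 for the looping version
coef(fitVector)[2]   # close to 1.28 for the vector version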

A variant where the vector of primes was preallocated instead of being extended with each new prime (the primes <- c(primes, currPrime) step) ran about 5% faster for N = 25000000, but otherwise showed qualitatively the same behaviour.
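
The code for that variant is not shown in the post; a sketch of how such a preallocation could look:

sieveVectorPrealloc <- function(MAXN=1000){
    numbers <- 2:MAXN
    maxSmallFactor <- floor(sqrt(MAXN))
    # pi(x) < x, so maxSmallFactor slots always suffice for the
    # primes <= sqrt(MAXN); the vector never needs to grow
    primes <- integer(maxSmallFactor)
    nPrimes <- 0
    while(numbers[1] <= maxSmallFactor) {
        nPrimes <- nPrimes + 1
        primes[nPrimes] <- numbers[1]
        numbers <- numbers[numbers %% numbers[1] != 0]
    }
    c(primes[seq_len(nPrimes)], numbers)
}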

Apart from the obvious confirmation of the effectiveness of the vector approach, my two takeaways from this experiment are that full-scale testing is necessary to choose safely between functionally equivalent approaches, and that an adequate general algorithm doesn't necessarily have to be morphed into an R-adapted incarnation to be useful in R: depending on the purpose, for occasional use, the performance of the "foreign" looping version could well be adequate.

Prime number theorem

The occurrence of primes is highly irregular. Can we still say something about how many primes to expect in a given range? It is standard notation to write π(N) for the number of primes <= N, so given an interval (N₁, N₂), we are looking for an estimate of π(N₂) - π(N₁).

Fig 2. N/π(N) vs log(N)

The striking regularity in fig 2 was noticed early by mathematicians: N/π(N) follows log(N) closely, with the relation already apparent for N < 5000. For a long time, the prime number theorem remained a conjecture, until it was established (independently) by Jacques Hadamard and Charles Jean de la Vallée Poussin in 1896. They used fairly heavy mathematical machinery, but in 1949 the first "elementary" proofs were published, by Atle Selberg and Paul Erdős. The shortest and simplest proof to date is considered to be Donald J. Newman's from 1980; it is "non-elementary" only in the sense that it uses Cauchy's integral theorem from complex analysis.
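
With the sieve at hand, the regularity of fig 2 is easy to inspect; a minimal sketch (my illustration, not code from the original post):

# Compare N/pi(N) with log(N); the two columns approach each other
Ns  <- 10^(3:7)
piN <- sapply(Ns, function(N) length(sieveVector(N)))
cbind(N = Ns, "N/pi(N)" = Ns/piN, "log(N)" = log(Ns))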

There are several heuristic motivations for the theorem. One of the more thorough is presented in a blog entry by Terry Tao, where he introduces, among other number-theoretic concepts, the von Mangoldt function and Möbius inversion. There is no room for going into details here, but the reader is encouraged to dig into these areas of number theory. I think it is very rewarding.

R as a tool for experimentation

I had not expected R to be this efficient at sieving millions of primes using simple implementations, with no attempt at adaptation or optimization. So it seems to be an even better tool for experimentation than I had thought. The important quality here is being able to focus on the problem itself, not on implementation and optimization details.
