The Art of Extremely Inefficient Self-Benchmarking

On one cold night not too long ago, as I pondered my upcoming PC build (a build that I’ve been ~~dreaming~~ planning to do for the past 2 years, and has yet to happen, and inevitably wont for the foreseeable future) I wondered whether I could create my own benchmark. And to keep it simple, one targeted primarily for CPU usage.

I thought of a few ways to test my aging CPU’s little remaining mettle (for reference it’s a Core 2 Duo E6600. And yes, I know… it sucks). But without getting too much memory or HDD I/O activity involved, I eventually decided on the ever reliable prime numbers. “Easy enough” I thought to myself, shouldn’t be more than a few lines of code.

And surprisingly, for once, I was right.

In a around five minutes I was able to write a simple program to calculate all prime number below a given point, in this case 1000.

#include <stdio.h>
#include <time.h>

void main() 
{
    clock_t start, end;
    double runTime;
    start = clock();
    int i, num = 1, primes = 0;

    while (num <= 1000) 
    { 
        i = 2; 
        while (i <= num) 
        { 
            if(num % i == 0)
                break;
            ++i; 
        }
        if (i == num)
            primes++;

        ++num;
    }

    end = clock();
    runTime = (end - start) / (double) CLOCKS_PER_SEC;
    printf("This machine calculated all %d prime numbers under 1000 in %g seconds\n", primes, runTime);
}

On my PC it took around 8 minutes to crunch all prime numbers under 1000, with CPU usage bouncing between 80-100% according to Windows task manager. Now I know the algorithm I conjured isn’t exactly perfect… Ok, I know – it’s terrible. Considering PrimeGen is able to generate primes up to 1000000000 in just 8 seconds on a Pentium II-350, my implementation is certainly a bad example.

But that is ultimately besides the point. It didn’t surprise me that my implementation would be worse than what I’m sure many, many other programmers out there have written. I’m a firm believer of not re-inventing the wheel. An excellent book for those interested on design matters and complex software design, which is titled, rather succinctly “Design Patterns”describes reinventing the wheel as an actual anti-pattern for software architecture.

So why did I seemingly do just that? Jeff Atwood’s blog post on this same topic sums up exactly why quite nicely. And it’s a great read for those tackling the age-old wisdom of the wheel business. Here’s the overall takeaway:

Indeed. If anything, “Don’t Reinvent The Wheel” should be used as a call to arms for deeply educating yourself about all the existing solutions – not as a bludgeoning tool to undermine those who legitimately want to build something better or improve on what’s already out there.
(…)
So, no, you shouldn’t reinvent the wheel. Unless you plan on learning more about wheels, that is.

-Jeff Atwood

Back to the code however, the near-100% CPU usage reported by task manager is actually not entirely true. Considering the E6600 is a dual core processor, a single-threaded program should not consume 100% of toal CPU time, 50% at most. The likely result of course was my code being executed and run at 100% on one physical CPU thread, and the OS scheduler pushed the remaining background process onto the second physical CPU thread.

To confirm this, I re-compiled and ran the code on my quad core Xeon machine at work, and lo and behold it was at 25% total CPU usage, calculating all prime numbers up to 1000 in around 2 seconds.

At that moment a comical light bulb lit – how could I parallelise this program to use all available cores? Admittedly I was a little disappointed with my E6600 benchmark results, and I knew it’s time hadn’t passed just yet. So I began work on creating a multi-threaded version of my program. Question was, how do I go about this?

Prime95, probably the most popular prime-number based benchmarking utility, took the approach of creating individual processes that are executed on individual cores. After the processes have finished execution, a main process collects and outputs the results. This certainly works but the approach seemed a little code heavy for the little time I had; and to be perfectly honest, I was feeling pretty lazy that night.

I had previously used boost::thread, std::thread and other POSIX based implementations (PThread) before, but considering Windows does not natively support the POSIX-standard, I was interested in other means of parallelization. I then stumbled across this, and was swiftly introduced to OpenMP.

OpenMP is probably as simple as it’s going to get as far as multi-threaded API’s go, it works on lots of operating systems and is pretty easy to learn. Here’s the resulting source code of my Extremely Inefficient Multi-Threaded Prime Number Benchmark using an OpenMP implementation:

#include <stdio.h>
#include <time.h>
#include <omp.h>

int main() 
{
    double start, end;
    double runTime;
    start = omp_get_wtime();
    int num = 1, primes = 0;
    int limit = 1000000;

#pragma omp parallel for schedule(dynamic) reduction(+ : primes)
    for (num = 1; num <= limit; num++) 
    { 
        int i = 2; 
        while(i <= num) 
        { 
            if(num % i == 0)
                break;
            i++; 
        }
        if(i == num)
            primes++;
//      printf("%d prime numbers calculated\n",primes);
    }

    end = omp_get_wtime();
    runTime = end - start;
    printf("This machine calculated all %d prime numbers under %d in %g seconds\n",primes,limit,runTime);

    return 0;

There you have it, I ended up running the program much faster on my E6600, computing primes up to 1000 at around 5 minutes, nearly half the time it took for the single-threaded counterpart.

So it looks like my E6600 isn’t entirely obsolete after all. It still has that modern parallel-thing in her somewhere – I’m sure she’ll stay kicking for another few years… right?

UPDATE: 12/05/2014

Since writing this post it is with the upmost joy and glee sadness I have to announce the replacement of my Intel Core 2 E660 and aging Nvidia GTX 570 with an AMF FX-8320 (an 8-cored CPU) and a AMD 7990 (yes, that monstrosity of a card, with dual GPU’s on a single PCB). More paralleling madness is sure to come.

The Art of Extremely Inefficient Self-Benchmarking

Leave a Reply Cancel reply