Spark and HyperLogLog: distributed cardinality estimation

It’s been a few weeks since I started playing with Apache Spark, and as I get more serious with it, I will probably use it in my daily work too. In the meantime, I have decided to write down a few examples while I learn the basics (and the Scala syntax along the way).

The first example is a simple cardinality estimator using HyperLogLog. It is a straightforward application of the combineByKey() function of a PairRDD in Spark: combineByKey takes three functions, one to create a combiner from the first value seen for a key, one to fold further values into an existing combiner, and one to merge combiners coming from different partitions.

  // pairs is assumed to be an RDD[(String, Long)]; HyperLogLog is stream-lib's,
  // constructed here with log2m = 16 (a 16-bit register index)
  val estimates = pairs.combineByKey(
    (v: Long) => { val hll = new HyperLogLog(16); hll.offer(v); hll },
    (acc: HyperLogLog, v: Long) => { acc.offer(v); acc },
    (acc1: HyperLogLog, acc2: HyperLogLog) => acc1.merge(acc2).asInstanceOf[HyperLogLog]
  )
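
The code above leans on a library, so here is a minimal, self-contained sketch of the data structure itself, just to show why the merge step in combineByKey is safe: a HyperLogLog is an array of registers, updating is a per-register max, and merging two sketches is also a per-register max, which makes the operation associative and commutative. The class name, hash choice, and lack of bias correction here are all my own illustration, not the library's implementation.

```scala
import scala.util.hashing.MurmurHash3

// Toy HyperLogLog: p index bits select one of m = 2^p registers; each
// register keeps the highest "rank" (position of the first 1-bit in the
// remaining hash bits) seen so far.
class SimpleHLL(p: Int) extends Serializable {
  val m: Int = 1 << p
  val registers: Array[Byte] = new Array[Byte](m)

  def offer(v: Any): this.type = {
    val h = MurmurHash3.stringHash(v.toString)
    val idx = h >>> (32 - p)  // top p bits select a register
    val rest = h << p         // remaining 32 - p bits
    val rank = math.min(Integer.numberOfLeadingZeros(rest) + 1, 32 - p)
    if (rank > registers(idx)) registers(idx) = rank.toByte
    this
  }

  // Register-wise max: merging sketches of two datasets yields the same
  // registers as sketching their union, so order does not matter.
  def merge(other: SimpleHLL): this.type = {
    var i = 0
    while (i < m) {
      if (other.registers(i) > registers(i)) registers(i) = other.registers(i)
      i += 1
    }
    this
  }

  def cardinality(): Double = {
    val alpha = 0.7213 / (1.0 + 1.079 / m)
    val raw = alpha * m * m / registers.map(r => math.pow(2.0, -r)).sum
    val zeros = registers.count(_ == 0)
    // small-range correction: fall back to linear counting
    if (raw <= 2.5 * m && zeros > 0) m * math.log(m.toDouble / zeros) else raw
  }
}
```

With this in hand, the combineByKey call is just: create a sketch from the first value, offer() each further value, and merge() across partitions.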

I wrote a simple Python script to generate some sample data and ran the code using spark-submit. For a 450 MB file, it takes about 16 seconds on my MacBook Pro.
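
The generator script itself is not shown here; as a rough stand-in (the key range, value range, and the tab-separated "key value" format are my assumptions, not the original script), something like this produces a comparable dataset:

```scala
import java.io.PrintWriter
import scala.util.Random

// Hypothetical data generator: one "key<TAB>value" line per record, with
// enough distinct values per key to make the estimate worth computing.
object GenData {
  def lines(n: Int, seed: Long): Iterator[String] = {
    val rnd = new Random(seed)
    Iterator.fill(n)(s"${rnd.nextInt(20)}\t${rnd.nextInt(500000)}")
  }

  def main(args: Array[String]): Unit = {
    val out = new PrintWriter("dataset.txt")
    try lines(1000000, 42L).foreach(out.println)
    finally out.close()
  }
}
```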

Davides-MacBook-Pro:TestSpark davideanastasia$ time ./scripts/ ./data/dataset.txt ./output-7
Spark assembly has been built with Hive, including Datanucleus jars on classpath
2015-03-29 17:12:23.713 java[17560:1903] Unable to load realm info from SCDynamicStore
15/03/29 17:12:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
real	0m16.247s
user	0m20.922s
sys	0m0.984s

On the other hand, if I perform the same operation using awk, it takes 37 seconds (and that is for ONE key, not all of them!)

Davides-MacBook-Pro:TestSpark davideanastasia$ time ./scripts/ ./data/dataset.txt 10 

real	0m37.445s
user	0m36.937s
sys	0m0.358s

Of course, HyperLogLog is only an estimator, and in my case I have sized it with a 16-bit register index. Tuning the precision to the dataset at hand makes the estimation more accurate, at the cost of memory.
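
For reference, the relative standard error of a HyperLogLog with m = 2^p registers is approximately 1.04 / sqrt(m), so the precision/memory trade-off can be sized up front (assuming the constructor argument above is p, i.e. stream-lib's log2m):

```scala
// Relative standard error of a HyperLogLog with m = 2^p registers
def stdError(p: Int): Double = 1.04 / math.sqrt((1L << p).toDouble)

// p = 16 means 2^16 registers and roughly 0.41% relative error
(4 to 16 by 4).foreach(p => println(f"p=$p%2d  error = ${stdError(p) * 100}%.2f%%"))
```

Each extra index bit doubles the register count (and the memory per key) while shrinking the error by a factor of sqrt(2).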

All the code is available at:
I hope I can add more examples soon.
