Many times I get data from someone who wants the unique count of something in a bunch of text files. I wanted to test how fast python dictionaries are, so I wrote some code. If the code runs in under 5 min I'm pretty happy; when it starts exceeding that I look for ways to optimize. How far can dictionaries scale? I'm running this on an Intel(R) Core(TM) i7-4750HQ, 16GB of RAM, a Samsung 840 SSD, and python 2.7.6.
Let's run it:
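Roughly, a minimal sketch of the test, assuming lowercase letters for the random characters (the names here are placeholders, not the exact original code):

```python
import random
import string
import time

def run_test(n):
    # Generate n random 8 character strings and count them in a plain dict.
    counts = {}
    start = time.time()
    for _ in range(n):
        key = ''.join(random.choice(string.ascii_lowercase) for _ in range(8))
        if key in counts:
            counts[key] += 1
        else:
            counts[key] = 1
    # "Variance" here is the ratio of unique keys to total entries:
    # 1.0 means every key was unique, smaller values mean duplicates showed up.
    print len(counts), len(counts) / float(n), time.time() - start

run_test(100)
```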
So what is this doing? I'm looping from 0 to 100 and generating a random string of 8 characters. Then I take that 8 character random string and put it into a dictionary as the key; if it is already there I increment the value. I then print out the length of the dictionary, the variance of the dictionary, and the time it took to go through that many items. Your speed will vary depending on your hardware. So 100 is super fast. Let's try 10,000.
Still damn fast. Let's scale further to 10,000,000:
Interesting, the variance is no longer 1.0, which means we started seeing duplicates with 8 character random strings. Let's try 100 million:
So ok, 100 million goes over the 5 min threshold: 7.5 minutes total. Now let's play with the variance. The reason I want to adjust variance is that you are taking a giant list and finding unique values. Chances are your data has a variance a lot closer to .005 than .99.
.99 basically means almost every value is unique. It's a useless test if all your data is unique. I will pad the string so it always has 8 characters but now decrease the variance. At this point you should be thinking map reduce. I'm still going to test python and see what it can do. Python update:
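A minimal sketch of the update, assuming the key is padded out with a fixed character:

```python
import random
import string
import time

def run_test(n, variance=4, width=8):
    # Only `variance` characters are random; the key is padded out to a
    # constant width of 8 characters, so duplicates show up much sooner.
    counts = {}
    start = time.time()
    for _ in range(n):
        key = ''.join(random.choice(string.ascii_lowercase) for _ in range(variance))
        key = key.ljust(width, 'x')  # pad to 8 characters
        if key in counts:
            counts[key] += 1
        else:
            counts[key] = 1
    print len(counts), len(counts) / float(n), time.time() - start

run_test(100000000)  # 100 million entries, 4 characters of variance
```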
Let's test 100 million with 4 characters of variance vs. 8 to produce more duplicates.
Right under our 5 min threshold for 100 million checks, with a lot better variance. So we can basically scale to 100 million and still be inside the 5 min threshold. Let's try to speed it up with pypy.
Let's try pypy with 100 million and 4 variance.
Sweet, we get almost a 50% speedup with pypy. Let's try 2x now, 200 million with pypy:
Still under 5 min, which is good. So I can pretty much handle any data set under 200 million records in 5 min. Let's do some math and see how close to 5 min we can get.
You take (200 million * 300) / 287 = X
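As a quick sanity check in Python, assuming the 287 is the number of seconds the 200 million pypy run took:

```python
# Linear extrapolation: how many entries fit in a 300 second (5 minute) budget
# if 200 million entries took 287 seconds.
print "%.2f" % (200000000 * 300 / 287.0)   # 209059233.45
```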
Which works out to 209,059,233.45, rounded to two decimal places. Let's try it and see how close we are with the whole integer, 209,059,233 entries.
Wow, pretty damn close to 300 seconds, or 5 min. Now let's say we extend our 5 min threshold to 30 min. Let's do some more math and see what we can scale to:
Using the same formula with 1800 seconds, it says we can scale to 1,254,355,400, which is 1.25 billion records.
Wow pretty close to 1800 seconds or 30 min. Linear scaling :)
You never want a single sample, so let's run everything again:
100 variance 8
10,000 variance 8
10,000,000 variance 8
100 million variance 8
Now I’m going to turn down variance.
100 million 4 variance
Now we start using pypy
100 million 4 variance
200 million 4 variance with pypy
As close to the 5 min mark as we can calculate, using 209,059,233 entries and variance 4.
Now we want to test 30 min with 1.25 billion records variance 4
For the hell of it I tested regular python with the giant test. It just killed itself. Haha.
Chances are you don't have my exact machine, considering it is a custom-built laptop. Also, you should be testing on production servers vs. local laptops. Sometimes local is faster than production servers due to cost. I'm going to test on Ubuntu 14.04 64 bit, an r3.xlarge EC2 instance with 4 vCPUs, 30.5 GB of RAM, and an 80 GB SSD.
100 and 8 variance.
10,000 and 8 variance
10,000,000 and 8 variance
100,000,000 and 8 variance
100,000,000 and 4 variance
100,000,000 and 4 variance with pypy
200,000,000 and 4 variance with pypy
The 5 min test at 209,059,233 entries and variance 4.
1.25 billion with 4 variance and pypy
What about jython with the 5 min test of 209,059,233 and variance 4? Um, ok….
I tried giving the JVM 4GB and 20GB of RAM. It didn't help. The max I saw was about 12GB of RAM before the JVM just crashed.
Then I came across:
With it you don't have to use an if/else statement, but I'm not getting any better performance with pypy.
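A minimal sketch with collections.defaultdict, assuming that is the class in question (collections.Counter would work the same way):

```python
from collections import defaultdict
import random
import string
import time

def run_test(n, variance=4, width=8):
    counts = defaultdict(int)  # missing keys default to 0, so no if/else needed
    start = time.time()
    for _ in range(n):
        key = ''.join(random.choice(string.ascii_lowercase) for _ in range(variance))
        counts[key.ljust(width, 'x')] += 1
    print len(counts), len(counts) / float(n), time.time() - start
```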
Let's make sure the regular python version isn't faster with this collections module.
There are a bunch of sites telling me to use xrange vs. range, but whatever pypy is doing under the hood already makes it as fast as possible with range. Here are the results from the 5 min test with xrange vs. range. No performance gains.
There is a branch of pypy that has the GIL removed: python-stm.
I compiled python-stm from source:
Now the results:
It seems to be a little bit faster. I'm not sure whether I'm using python-stm correctly, but I have to read more about it. I don't see all the cores firing on my laptop, so maybe my code is still not atomic or I'm doing something wrong. I also trust the regular pypy and python versions more; this is more of an experimental version that might eventually replace regular pypy.
Conclusion:
Dict{} are pretty damn fast, and even faster with pypy. I don't face many data problems in the billions or even millions; I'm sure people who deal with math, scientific, and financial data do all the time. I was thinking of testing python 3.x to see if the speeds are any better. What do you use for your fastest key:value system? If you made it this far, thanks for reading.