Sunday, January 31, 2010

python try catch cost vs hasattr (overhead = only 2X)

To support different versions of code, you might end up using a try/except block. But in the worst case, what is the overhead of raising an exception every time?
In some cases you can use hasattr to avoid raising an exception, but your code will be less elegant and/or less readable.
To make a rational decision, let's compare the time cost. I was expecting a major cost for raising an exception, but it ends up being only 2 times more costly than calling hasattr.
You can see the simple code I used to compare them here. From now on, I will be less scared about performance when I use try/except.
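
A minimal sketch of such a comparison, assuming the worst case where the attribute is always missing. The Legacy class and the new_method name are hypothetical, and the measured ratio will vary with the interpreter and the machine, so treat the numbers as indicative only.

```python
import timeit


class Legacy(object):
    """Hypothetical stand-in for an older code version that lacks new_method."""
    pass


def with_try(obj):
    # Worst case for try/except: the exception is raised on every call.
    try:
        return obj.new_method
    except AttributeError:
        return None


def with_hasattr(obj):
    # Check first, so no exception reaches our code.
    if hasattr(obj, "new_method"):
        return obj.new_method
    return None


obj = Legacy()
n = 100000
t_try = timeit.timeit(lambda: with_try(obj), number=n)
t_has = timeit.timeit(lambda: with_hasattr(obj), number=n)
print("try/except: %.4fs  hasattr: %.4fs  ratio: %.2f" % (t_try, t_has, t_try / t_has))
```

When the attribute is usually present, the try/except version avoids the extra lookup done by hasattr, so the worst case measured here is not the whole story.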

Friday, January 22, 2010

The power of python within Tomcat for powerful webapps (jython2.5.1)

Many companies are looking to speed up their development process. Unfortunately, they are often restricted to well-established web server frameworks like Tomcat and, as in this example, they feel limited to Java.
Jython comes to the rescue. Jython 2.5.1 was released on September 26th, 2009, and can be used to create servlets that run inside the Apache Tomcat application server. Jython lets you write Python that runs 100% on the Java VM, use many pure Python libraries, and use all Java packages with Python syntax.
If you try it, you might have issues like:
  • How can I create a simple jython2.5.1 servlet? (not the deprecated jython2.2.1, which doesn't allow you to use most pure Python libs)
  • Why do I get an ImportError when I use a standard Python package? How can I fix it?
  • Where should I put jython2.5.1.jar?
  • Where should I put my python code?
  • Where can I get a basic example that is working?
  • Where should I put my pure python libraries?
So I have decided to package a simple example (download: HelloWorld.war).
Here are the steps you should follow:
  1. get tomcat http://tomcat.apache.org/ (I used 5.5.28)
  2. get jython2.5.1 http://sourceforge.net/projects/jython/files/jython/jython_installer-2.5.1.jar
  3. cp -r ~/jython2.5.1/Lib tomcat5.5.28/share/lib (required to use the standard Python libs)
  4. cp ~/jython2.5.1/jython.jar tomcat5.5.28/share/
  5. download example: wget http://mlboost.svn.sourceforge.net/viewvc/mlboost/jython/HelloWorld/HelloWorld.war
  6. cp HelloWorld.war tomcat/webapps
  7. download java jdk
  8. export JAVA_HOME=sun-jdk-1.6.0_02
  9. tomcat5.5.28/bin/startup.sh
  10. try it: http://localhost:8080/HelloWorld/HelloWorld.py
If you want the flexibility of Python but you are stuck with Java within Tomcat, Jython has become a real alternative since the release of Jython 2.5.1.

PS: a war file is a zip file. You can unzip it in tomcat/webapps for testing, so you don't need to rezip it and restart the server after each change. When you are done, simply run jar cvf HelloWorld.war * in the tomcat/webapps folder and ship that single file to the client's Tomcat server (make sure Jython is installed there). If you want to add pure Python libraries, you can simply add them to your war file and it will work.
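
Since a war file is a zip archive, the packaging step can also be sketched in Python with the standard zipfile module. This is only an illustration, not the method used for HelloWorld.war: the build_war helper and the paths in the usage comment are hypothetical, and unlike jar this sketch does not generate a META-INF/MANIFEST.MF entry.

```python
import os
import zipfile


def build_war(app_dir, war_path):
    """Zip the contents of app_dir into war_path.

    war_path should live outside app_dir, otherwise the archive
    would try to include itself.
    """
    with zipfile.ZipFile(war_path, "w", zipfile.ZIP_DEFLATED) as war:
        for root, _dirs, files in os.walk(app_dir):
            for name in files:
                full = os.path.join(root, name)
                # Store paths relative to app_dir so that WEB-INF/
                # ends up at the root of the archive.
                war.write(full, os.path.relpath(full, app_dir))
    return war_path


# Hypothetical usage:
# build_war("tomcat/webapps/HelloWorld", "HelloWorld.war")
```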


Here is the time comparison of the same service:
  • python: wsgi httpserver
  • jython: wsgi httpserver
  • tomcat: java servlet jython2.5.1
It is interesting to see that the Tomcat servlet provides better performance.

Friday, January 8, 2010

matplotlib & python for powerful data visualization

Here is an example of data that isn't obvious to analyze:
How much does each party gain or lose when power is measured by percentage of seats rather than by proportional representation? Percentage of seats is what legislative assemblies usually use. It is the process used in Canadian and Québec elections.

Powerful visualization lets you see the effect easily. Python & matplotlib is an amazing combination for this. It took me 20 minutes to visualize the effect in the federal and Quebec elections of 2008.
Upper graph (seats vs votes): shows how far the seat percentage departs from the proportional vote percentage. As an example, the Liberals gain ~11% and the ADQ loses ~11%.
Lower graph (lost seats vs votes): the real impact on a party is the ratio of this loss to its real vote proportion. In this example, it is a gain of ~25% for each Liberal vote (11/(66/125)) and a loss of 66% for the ADQ and ~88% for QS.
Basically:
  • In the Canadian election, PC & BQ gain power, but BQ gains far more in proportion, and the Greens lose everything
  • In the Quebec election, QS & ADQ lose a lot of power and PQ and LIB gain it, which might explain why they aren't talking about changing the election formula
  • matplotlib and Python is an amazing combination for automating data visualization
To get the code, run:
svn co https://mlboost.svn.sourceforge.net/svnroot/mlboost/elections
python elections/seats_vs_prop.py
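
The arithmetic behind the two graphs can be sketched as follows. This is not the code from seats_vs_prop.py: the seat_vs_vote function is hypothetical, the 66 of 125 seats comes from the post, and the 42% vote share is an assumed round figure consistent with the ~11 point gain quoted above. The relative gain is computed against the vote share, as the description of the lower graph states.

```python
def seat_vs_vote(seats, total_seats, vote_pct):
    """Return (seat_pct, gain_points, relative_gain_pct) for one party."""
    seat_pct = 100.0 * seats / total_seats
    # Upper graph: seat share minus vote share, in percentage points.
    gain = seat_pct - vote_pct
    # Lower graph: that gain or loss relative to the party's vote share.
    relative = 100.0 * gain / vote_pct
    return seat_pct, gain, relative


# Assumed inputs: 66 of 125 seats, about 42% of the votes.
seat_pct, gain, relative = seat_vs_vote(66, 125, 42.0)
print("seats %.1f%%, gain %.1f points, relative gain %.1f%%"
      % (seat_pct, gain, relative))
```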

Related videos (YouTube): Gerrymandering Explained; Gerrymandering - another reason why rep democracy is fundamentally corrupt

Sunday, January 3, 2010

gmail, a powerful target marketing tool

When I got my gmail account many years ago, I wondered why Google was spending (or wasting) so many resources to provide a free email service like Hotmail, Yahoo and so many others. Were they making enough money on gmail advertising? Was it worth the investment?
At one point I thought their primary end goal was to launch a corporate email portal, so that companies wouldn't need to hire highly paid sysadmins to support mail servers, and at the same time to give employees worldwide something way better than the Outlook/Exchange server that pollutes our lives. They are doing that already, but I think their real goal was target marketing, though not the traditional kind.
What better source than a user's emails to understand their profile and do target marketing? They get the highest quality information from your emails. Yes, your emails.
In my experience with more traditional target marketing for Bell and at Microcell-Lab, practitioners have little information about users and derive new information from it to try to generate better predictions. As an example, they use your postal code to estimate your family income, and use that estimate to better predict whether you will buy X or Y.
Traditional target marketing practitioners use a lift approach to find the top N most probable buyers for a given product or service, then try to reach those people with a promotion, an email, etc. With gmail, it is way simpler: you use user profile information like the words in emails (by the way, they are parsing your emails; take a close look and you will see), and use a prediction engine to advertise what you are most likely to like or buy, shown to you directly because you are using their mail service.
Gmail is an amazing target marketing tool because it gets profile information directly from the user, can make much better predictions than traditional target marketing techniques, has direct access to the customer, and scales well as users are added. A prediction is only as good as its data. What's the point of improving an algorithm if you can get better, higher quality data or, as Google does, both? Larry Page and Sergey Brin are just visionary target marketers!

Friday, January 1, 2010

Jython, pyPdf, reportlab experimentation & patches proposals

I am currently experimenting with Jython to manipulate PDF files. I encountered several problems and I want to share some of the solutions (time to give back).

Intro: only pure Python code and libraries work on both Python and Jython. Python packages that depend on C aren't compatible. Jython provides Python syntax on top of the Java VM. One great thing is that you can use Java classes within Python. Jython 2.5.1 was released last September.

1) manual pdf text modification
I thought it would be simple to modify a PDF template to change some text, but I was wrong. Even if you are able to re-encode the new text and change the length, you will hit walls. It is more complex than that (xref etc.). Most PDF libs provide encoding helper functions, but you will have a hard time finding decoding ones, for ASCII85 as an example. After some time, I decided to try to make reportlab work with Jython.
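
For the ASCII85 decoding step alone, here is a minimal sketch that assumes a recent CPython 3, where the standard base64 module provides a85encode and a85decode. These functions are not available in Jython 2.5, so this is not what was used at the time, and the decode_ascii85 wrapper is a hypothetical name. Decoding a stream is only one piece of the problem; it does nothing about the xref table.

```python
import base64


def decode_ascii85(stream_bytes):
    """Decode Adobe-style ASCII85 data framed with <~ and ~>."""
    # adobe=True tells the decoder to handle the Adobe framing markers.
    return base64.a85decode(stream_bytes, adobe=True)


# Round trip on a small sample, using the matching standard encoder.
encoded = base64.a85encode(b"Hello PDF", adobe=True)
decoded = decode_ascii85(encoded)
print(encoded, decoded)
```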

2) reportlab import error with jython
I tried to use reportlab, a powerful lib for creating PDFs, but importing reportlab.pdfgen generated this error: java.lang.ClassFormatError: java.lang.ClassFormatError: Invalid method Code length 66566 in class file reportlab/pdfbase/_fontdata$py. According to this thread on MarkMail, there was a simple solution, but the patch wasn't working. You can find the working patch that I created here; I have proposed it to the reportlab team.

3) Saving pdf to memory instead of files
To do in-memory PDF manipulations, I used pyPdf, the pure Python lib from Mathieu Fenniak. I tried to save a canvas in memory and couldn't figure out why it wasn't working. Basically, I was doing outputStream.writelines(c._doc.GetPDFData(c)). Unfortunately, it doesn't work if you don't call c.showPage() first. I also realized that I could create a canvas directly with a StringIO as the filename argument, because pdfdoc.PDFDocument.SaveToFile(...) checks whether fname has a write function. I have proposed a canvas API improvement, which you can find here, to make it friendlier to use.

4) Simple comparison python/jython
I was wondering how much slower Jython is compared to Python. As you can see, it is slower, and it degrades as some parameters grow (e.g. the number of pages). In this example, it also takes 4 to 6 times more memory.

5) Jython out of memory
If you get:
OutOfMemoryError: java.lang.OutOfMemoryError: GC overhead limit exceeded
use the -J-Xmx1024m jython option to give the Java VM a larger memory heap.

6) Threading optimization
Jython doesn't suffer from the GIL problem. Watch the video "Mindblowing Python GIL" for more information about it. Basically, Jython can do real multi-threading. In my context, I could easily parallelize part of my code, so I tried it using the ThreadPool of Christopher Arndt. Unfortunately, I still haven't been able to make it faster. pyPdf hasn't been designed to be used in a real threading environment (PdfFileReader can't be shared between threads), which introduces limitations.
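
One way to work around an object that can't be shared between threads is to give each worker thread its own instance. The sketch below illustrates that idea under stated assumptions: it uses concurrent.futures.ThreadPoolExecutor from the modern CPython standard library rather than Christopher Arndt's ThreadPool, and FakeReader is a hypothetical stand-in, not pyPdf's PdfFileReader. On CPython the GIL still limits the speedup for CPU-bound work.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeReader(object):
    """Hypothetical stand-in for a reader object that is not thread-safe."""

    def __init__(self, path):
        self.path = path

    def page_text(self, page_number):
        return "%s:page-%d" % (self.path, page_number)


_local = threading.local()


def get_reader(path):
    # Each thread lazily creates and reuses its own readers; none are shared.
    readers = getattr(_local, "readers", None)
    if readers is None:
        readers = _local.readers = {}
    if path not in readers:
        readers[path] = FakeReader(path)
    return readers[path]


def process_page(job):
    path, page_number = job
    return get_reader(path).page_text(page_number)


jobs = [("template.pdf", n) for n in range(8)]
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() returns results in the same order as the submitted jobs.
    results = list(pool.map(process_page, jobs))
print(results)
```

The cost of this design is that every thread pays for opening and parsing the file once, which can cancel out the gain when the documents are small.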

7) pyPdf profiling
pyprof2calltree is an amazing tool for profiling, as you can see in the figure. Guessing what needs optimization is a path we should never take, because most of the time we are wrong and waste our time on uncritical optimizations that make the code unreadable. I saw a great presentation from Mike C. Fletcher on Python profilers at PyCon 2009. I might try to optimize the readObject function of pyPdf.
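
A minimal sketch of collecting profile data with the standard cProfile module, which produces the kind of stats file that tools such as pyprof2calltree take as input. The slow_parse function is a hypothetical stand-in for the code under test, not pyPdf's readObject, and the output location is an arbitrary choice.

```python
import cProfile
import os
import pstats
import tempfile


def slow_parse(n):
    """Hypothetical stand-in for the code being profiled."""
    total = 0
    for i in range(n):
        total += len(str(i))
    return total


profiler = cProfile.Profile()
profiler.enable()
result = slow_parse(100000)
profiler.disable()

# Save the raw stats so they can be inspected or converted later.
stats_path = os.path.join(tempfile.gettempdir(), "profile.out")
profiler.dump_stats(stats_path)

# Print the five most expensive entries by cumulative time.
pstats.Stats(stats_path).sort_stats("cumulative").print_stats(5)
```

If pyprof2calltree is installed, the saved file can then be converted for a call-tree viewer; to my knowledge the command is pyprof2calltree -i profile.out -k, but check the tool's help for the exact flags.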

It is amazing to see the tremendous effort people are putting in to make Python syntax available on each platform (Java -> Jython; .NET -> IronPython, etc.). It is a sign of how great Python's syntax is.