Friday, January 1, 2010

Jython, pyPdf, reportlab experimentation & patch proposals

I am currently experimenting with Jython to manipulate PDF files. I ran into several problems and want to share some of the solutions (time to give back).

Intro: only pure Python code and libraries work on both Python and Jython. Python packages that depend on C extensions aren't compatible. Jython runs Python syntax on top of the Java VM. One great thing is that you can use Java classes from within Python. Jython 2.5.1 was released last September.

1) manual pdf text modification
I thought it would be simple to modify a PDF template to change some text, but I was wrong. Even if you manage to re-encode the new text and update the length, you will hit walls. It is more complex than that (the xref table stores byte offsets that shift as soon as the content changes, etc.). Most PDF libraries provide encoding helper functions, but you will have a hard time finding decoding ones, ASCII85 for example. After some time, I decided to try to make reportlab work with Jython.
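Since a decoder was the missing piece, here is a minimal sketch of one for Adobe-style ASCII85, the encoding behind PDF's ASCII85Decode filter. The function names are made up for illustration, and the code assumes Python 3 (CPython); it is not taken from pyPdf or reportlab.

```python
import struct


def ascii85_decode(data):
    """Decode Adobe-style ASCII85 text into bytes (illustrative helper)."""
    if isinstance(data, bytes):
        data = data.decode("ascii")
    # Drop the optional "<~" prefix and cut at the "~>" end-of-data marker.
    if data.startswith("<~"):
        data = data[2:]
    end = data.find("~>")
    if end != -1:
        data = data[:end]

    out = bytearray()
    group = []
    for ch in data:
        if ch.isspace():
            continue  # whitespace inside the stream is ignored
        if ch == "z" and not group:
            out.extend(b"\x00\x00\x00\x00")  # "z" abbreviates four zero bytes
            continue
        if not "!" <= ch <= "u":
            raise ValueError("invalid ASCII85 character: %r" % ch)
        group.append(ord(ch) - 33)
        if len(group) == 5:
            out.extend(_decode_group(group))
            group = []

    if group:
        if len(group) == 1:
            raise ValueError("truncated ASCII85 data")
        # Pad a short final group with "u" (digit 84), then drop as many
        # bytes as padding characters were added.
        missing = 5 - len(group)
        out.extend(_decode_group(group + [84] * missing)[:-missing])
    return bytes(out)


def _decode_group(digits):
    # Five base-85 digits encode one big-endian 32-bit integer.
    value = 0
    for digit in digits:
        value = value * 85 + digit
    if value > 0xFFFFFFFF:
        raise ValueError("ASCII85 group out of range")
    return struct.pack(">I", value)
```

For instance, ascii85_decode("87cURDZ~>") gives b"Hello". On Python 3 the standard library's base64.a85decode(data, adobe=True) does the same job; the hand-written version only shows what happens under the hood.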

2) reportlab import error with jython
I tried to use reportlab, a powerful library for creating PDFs, but importing reportlab.pdfgen raised this error: java.lang.ClassFormatError: java.lang.ClassFormatError: Invalid method Code length 66566 in class file reportlab/pdfbase/_fontdata$py. The JVM limits the bytecode of a single method to less than 64 KB, and the compiled module exceeds that limit. According to this thread on MarkMail, there was a simple solution, but the patch wasn't working. You can find the working patch that I created and proposed to the reportlab team here.

3) Saving pdf to memory instead of files
To do in-memory PDF manipulation, I used pyPdf, the pure Python library from Mathieu Fenniak. I tried to save a canvas in memory and couldn't figure out why it wasn't working. Basically, I was doing outputStream.writelines(c._doc.GetPDFData(c)). Unfortunately, that doesn't work unless you call c.showPage() first. I also realized that I could create a canvas directly with a StringIO object as the filename argument, because pdfdoc.PDFDocument.SaveToFile(...) checks whether fname has a write function. I have proposed a canvas API improvement, which you can find here, to make it friendlier to use.
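The "does it have a write function" check is plain duck typing, and a small sketch may make it concrete. The save_to helper below is hypothetical and is not reportlab's code; it only mimics the behaviour described above. It assumes Python 3, where io.BytesIO plays the role StringIO played for binary data in Python 2.

```python
import io


def save_to(target, payload):
    """Write payload (bytes) to target: a path or a file-like object.

    Hypothetical helper that mimics the duck-typing check described in the
    text; it is not reportlab's implementation.
    """
    if hasattr(target, "write"):
        # The caller passed an open stream (e.g. an in-memory buffer):
        # write into it and leave closing to the caller.
        target.write(payload)
    else:
        # Otherwise treat target as a filesystem path.
        with open(target, "wb") as handle:
            handle.write(payload)


buffer = io.BytesIO()
save_to(buffer, b"%PDF-1.4\n% placeholder bytes, not a valid PDF\n")
pdf_bytes = buffer.getvalue()  # the whole document, still in memory
```

The same function accepts a filename, so callers that write to disk need no changes, while in-memory callers can hand the bytes straight to another library.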

4) Simple comparison python/jython
I was wondering how much slower Jython is compared to Python. As you can see, it is slower, and the gap grows with some parameter sizes (e.g. the number of pages). In this example, it also takes 4 to 6 times more memory.

5) Jython out of memory
If you get:
OutOfMemoryError: java.lang.OutOfMemoryError: GC overhead limit exceeded
use the -J-Xmx1024m Jython option to allow a larger maximum heap size for the Java VM.

6) Threading optimization
Jython doesn't suffer from the GIL problem. Watch the video "Mindblowing Python GIL" for more information about it. Basically, Jython can do real multi-threading. In my context, I could easily parallelize part of my code, so I tried it using the thread pool by Christopher Arndt. Unfortunately, I still haven't been able to make it faster. pyPdf wasn't designed to be used in a real threading environment (a PdfFileReader can't be shared between threads), which introduces limitations.
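One common way around an object that can't be shared is to give each worker thread its own instance. The sketch below shows that pattern under several assumptions: it uses Python 3's concurrent.futures instead of Christopher Arndt's thread pool, and FakeReader is a made-up stand-in for a reader such as PdfFileReader, so nothing here is pyPdf's API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeReader(object):
    """Hypothetical stand-in for a reader object that must not be shared."""

    def __init__(self, source):
        self.source = source
        self.owner = threading.current_thread().name

    def page_text(self, number):
        # Fail loudly if a thread other than the creator uses this instance.
        assert threading.current_thread().name == self.owner
        return "%s:page-%d" % (self.source, number)


_local = threading.local()


def get_reader(source):
    # Lazily create one reader per worker thread and cache it in
    # thread-local storage, so no instance ever crosses threads.
    reader = getattr(_local, "reader", None)
    if reader is None or reader.source != source:
        reader = _local.reader = FakeReader(source)
    return reader


def process_page(job):
    source, number = job
    return get_reader(source).page_text(number)


def process_pages(source, page_count, workers=4):
    jobs = [(source, n) for n in range(page_count)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the order the jobs were submitted.
        return list(pool.map(process_page, jobs))


results = process_pages("template.pdf", 8)
```

The price is one reader (and one parse of the file) per thread, which may be part of why threading does not automatically pay off. On CPython the GIL would also limit any gain for CPU-bound work.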

7) pyPdf profiling
pyprof2calltree is an amazing tool for profiling, as you can see in the figure. Guessing what needs optimization is a path we should never take, because most of the time we are wrong and waste our time on uncritical optimizations that make the code unreadable. I saw a great presentation from Mike C. Fletcher on Python profilers at PyCon 2009. I might try to optimize the readObject function of pyPdf.
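To measure instead of guess, the standard library profiler is a reasonable starting point. This is a minimal sketch assuming CPython with cProfile available; parse_objects is a hypothetical workload standing in for the real PDF parsing, and profile_to_file is an illustrative helper name.

```python
import cProfile
import os
import pstats
import tempfile


def parse_objects(count):
    """Hypothetical workload standing in for the PDF parsing under study."""
    return [str(i).encode("ascii").strip() for i in range(count)]


def profile_to_file(func, *args):
    """Run func under cProfile and return the path of the saved stats file."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    handle, path = tempfile.mkstemp(suffix=".prof")
    os.close(handle)  # only the path is needed; dump_stats reopens the file
    profiler.dump_stats(path)
    return path


stats_path = profile_to_file(parse_objects, 10000)
# Print the five most expensive entries, sorted by cumulative time.
pstats.Stats(stats_path).sort_stats("cumulative").print_stats(5)
```

The saved stats file is the kind of input pyprof2calltree converts for a call-tree viewer, so the same run can feed both the text report and the graphical view.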

It is amazing to see the tremendous effort people put into making Python syntax available on every platform (Java -> Jython, .NET -> IronPython, etc.). It is a sign of how great Python's syntax is.

2 comments:

  1. With Jython you can also look at PDF libraries written for Java, for example http://pdfbox.apache.org/ is popular. Jython has very good integration with Java.

  2. Hi, I saw this post. What was the solution to the imports failing on Tomcat? Thanks.
