Wednesday, October 13, 2010

simple multivariate classifier example using python & numpy

I was wondering how long it would take to write a multivariate classifier in Python.
With Python and NumPy, not long. We simply need to compute the covariance matrix, its determinant, and its inverse. Even if the covariance matrix is singular, which means it cannot be inverted, you can easily compute the Moore-Penrose pseudo-inverse (i.e., numpy.linalg.pinv).
As expected, assuming too much about the data leads to poor classification.
You can find a simple Python program of 75 lines here.
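To make the ingredients above concrete, here is a minimal sketch of a Gaussian classifier built only from the covariance matrix, its (log-)determinant and its pseudo-inverse. It is not the 75-line program linked above: the class name `GaussianClassifier`, its `fit`/`predict` methods, and the choice to drop the determinant term when a covariance is singular are assumptions made for this illustration.

```python
import numpy as np


class GaussianClassifier:
    """One multivariate Gaussian per class (hypothetical helper for illustration)."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.params_ = {}
        for c in self.classes_:
            Xc = X[y == c]
            mean = Xc.mean(axis=0)
            cov = np.atleast_2d(np.cov(Xc, rowvar=False))
            # The pseudo-inverse also works when the covariance matrix is singular
            inv = np.linalg.pinv(cov)
            # slogdet returns the sign and the log of the determinant
            sign, logdet = np.linalg.slogdet(cov)
            if sign <= 0:
                # Assumption: ignore the determinant term for a singular covariance
                logdet = 0.0
            log_prior = np.log(len(Xc) / float(len(X)))
            self.params_[c] = (mean, inv, logdet, log_prior)
        return self

    def predict(self, X):
        scores = []
        for c in self.classes_:
            mean, inv, logdet, log_prior = self.params_[c]
            d = X - mean
            # Squared Mahalanobis distance of every row to the class mean
            maha = np.einsum("ij,jk,ik->i", d, inv, d)
            scores.append(-0.5 * (maha + logdet) + log_prior)
        # Pick, for each sample, the class with the highest log-score
        return self.classes_[np.argmax(np.array(scores), axis=0)]
```

Usage is `GaussianClassifier().fit(X, y).predict(X_new)`, with `X` holding one sample per row. Modelling each class as a single Gaussian is exactly the kind of strong assumption that can hurt accuracy on real data.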

Sunday, October 10, 2010

Dimensionality reduction; a simple PCA example using python




Dimensionality reduction is a powerful approach to reduce input size, cut training time, and visualize data.
As an example, you can use PCA (Principal Component Analysis), ICA (Independent Component Analysis), or LLE (Locally Linear Embedding) to see class grouping. You can try it on your data easily with Python in a couple of lines.
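import mdp
import pylab
# ds is assumed to hold the dataset; ds.data is a 2D array with one sample per row
pca = mdp.pca(ds.data)
pylab.title("PCA")
# plot the first two principal components against each other
pylab.plot(pca[:,0], pca[:,1], '.')
pylab.show()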
The figure shows PCA dimensionality reduction applied to a digit dataset. You can find the source code here to see how to do a PCA, ICA or LLE using Python. Unfortunately, ICA doesn't work on our dataset because it doesn't converge.
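If MDP is not installed, the PCA projection itself can be sketched with NumPy alone. This is an illustration using the SVD, not the code behind the figure; the function name `pca_project` and its `n_components` parameter are made up for the example.

```python
import numpy as np


def pca_project(data, n_components=2):
    """Project samples (one per row) onto their first principal components."""
    # PCA works on centered data
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered.dot(vt[:n_components].T)
```

The two returned columns can be plotted against each other exactly like `pca[:,0]` and `pca[:,1]` above.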

Saturday, October 9, 2010

PDF watermarking service using pdfrw on google appengine

If you are looking to watermark a PDF, you can use this simple App Engine service:
This service uses pdfrw (a PDF file manipulation library written by Patrick Maupin) and reportlab. pdfrw is much faster than pypdf for watermarking.

Tuesday, September 21, 2010

Summary of machine learning libs available in python

Here is a summary of machine learning libraries available in Python (inspired by the "Similar or Related Projects" page of PyMVPA, the lisa mailing list, and personal notes).
  • pybrain: PyBrain is short for Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library. In fact, we came up with the name first and later reverse-engineered this quite descriptive "Backronym". See features. Key feature: recurrent networks (RNN), including Long Short-Term Memory (LSTM) architectures
  • mlpy: Machine Learning PYthon (mlpy) is a high-performance Python library for predictive modeling. mlpy makes extensive use of NumPy to provide fast N-dimensional array manipulation and easy integration of C code. The GNU Scientific Library (GSL) is also required. It provides high level procedures that support, with few lines of code, the design of rich Data Analysis Protocols (DAPs) for preprocessing, clustering, predictive classification, regression and feature selection. Methods are available for feature weighting and ranking, data resampling, error evaluation and experiment landscaping. Key feature: feature selection
  • scikits.learn: scikits.learn is a Python module integrating classic machine learning algorithms in the tightly-knit world of scientific Python packages (numpy, scipy, matplotlib). Key distinct features: lasso, nearest neighbor, isomap, various metrics, mean shift, cross validation, LDA, HMMs
  • opencv (machine learning): Normal Bayes Classifier, K Nearest Neighbors, SVM, Decision Trees, Boosting, Random Trees, Expectation-Maximization, Neural Networks
  • Shogun: A Large Scale Machine Learning Toolbox. Comprehensive machine learning toolbox with bindings to various programming languages. PyMVPA can optionally use implementations of Support Vector Machines from Shogun. Focused on large-scale kernel learning (mostly SVMs). It wraps other libraries such as libsvm (well established) and others that reach state-of-the-art performance or are good for extremely large datasets.
  • PyMVPA (Multivariate Pattern Analysis in Python): PyMVPA is a Python module intended to ease pattern classification analyses of large datasets. In the neuroimaging context, such analysis techniques are also known as decoding or MVPA analysis.
  • pylearn (built on top of Theano), V2 under construction. New version of plearn (C++).
  • Theano: (deep learning) Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.
  • jml: Jeremy's Machine Learning library (C++), includes a Python interface: basic classifiers (perceptrons, decision trees, etc.) plus ensemble methods (boosting, bagging). Very highly optimized to work with thousands of features and millions of examples. GPGPU support under development. Code derived from this library is extensively used in a commercial computational linguistics application, so it has been put through its paces.
  • 3dsvm: AFNI plugin to apply support vector machine classifiers to fMRI data.
  • Elefant: Efficient Learning, Large-scale Inference, and Optimization Toolkit. Multi-purpose open source library for machine learning.
  • MDP: Python data processing framework. MDP provides various algorithms. PyMVPA makes use of MDP’s PCA and ICA implementations. Interesting features: ICA, LLE
  • MVPA Toolbox: Matlab-based toolbox to facilitate multi-voxel pattern analysis of fMRI neuroimaging data.
  • NiPy: Project with growing functionality to analyze brain imaging data. NiPy is heavily connected to SciPy and lots of functionality developed within NiPy becomes part of SciPy.
  • OpenMEEG: Software package for low-frequency bio-electromagnetism including the EEG/MEG forward and inverse problems. OpenMEEG includes Python bindings.
  • Orange: Powerful general-purpose data mining software. Orange also has Python bindings.
  • PyMGH/PyFSIO: Python IO library for FreeSurfer’s .mgh data format.
  • PyML: PyML is an interactive object oriented framework for machine learning written in Python. PyML focuses on SVMs and other kernel methods.
  • PyNIfTI: Read and write NIfTI images from within Python. PyMVPA uses PyNIfTI to access MRI datasets.
  • milk: SVMs with arbitrary Python types for kernel arguments. Pythonic interface to libSVM. Stepwise Discriminant Analysis for feature selection. K-means clustering. Models can be pickled and unpickled.
  • mlboost: Machine Learning Boost Library (Python; includes flayers wrapper); minimal version of the sourceforge mlboost project. Specialized in feature extraction and visualization.
to watch:

Wednesday, August 25, 2010

How to add activities to redmine timesheet plugin?

I am using Redmine, a great project management web application, and added the littlestream timesheet plugin.
If you want to add timesheet activities, you can't do it from the UI. You need to add them directly to the MySQL database like this:
insert into enumerations values (10, 'Experimental Development', 3, 0, 'TimeEntryActivity', 1, NULL, NULL);
insert into enumerations values (11, 'Scientific Research', 4, 0, 'TimeEntryActivity', 1, NULL, NULL);
Don't forget to set the id and the position properly. You can check the existing rows and the column order with:
select * from enumerations;
| id | name | position | is_default | type| active | project_id | parent_id |

Wednesday, July 21, 2010

patching class function in python

Today we had to patch a class method in production. Monkey patching can become tricky if references are kept in several places, like pointers in C and C++.
Here is a simple example of how to make sure all references will use the new definition.
class Foo:
    def f(self):
        print "default f"

def newf(self):
    print "newf"

# Python 2: swap the code object in place so every existing reference to Foo.f runs newf
Foo.f.im_func.func_code = newf.func_code
This is another reason why interpreted languages like Python are so powerful.
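For readers on Python 3, where `im_func` and `func_code` no longer exist, the same trick can be sketched with the `__code__` attribute. The variable names `bound` and `raw` are only there to show that references captured before the patch pick up the new behaviour.

```python
class Foo:
    def f(self):
        return "default f"


def newf(self):
    return "newf"


foo = Foo()
bound = foo.f   # bound method captured before patching
raw = Foo.f     # plain function reference captured before patching

# Replace the code object in place (Python 3 spelling of func_code)
Foo.f.__code__ = newf.__code__

print(bound(), raw(foo), Foo().f())  # prints: newf newf newf
```

By contrast, a plain `Foo.f = newf` rebinds the class attribute only; `bound` and `raw` would still point at the old function object and keep returning "default f".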

Tuesday, June 1, 2010

Before filing Software Patents, wait for Bilski case resolution

The big upcoming US case regarding software patents is In re Bilski. The Supreme Court has been sitting on it for months now. "Bilski" is an ongoing set of patent cases that could change the patentability of software in the USA. The Bilski patent itself is a business method patent, not a software patent.

The 2008 ruling of the Court of Appeals for the Federal Circuit (CAFC) was broad enough to reject Bilski's patent and a certain category of software patents.

The Supreme Court agreed to review the CAFC's ruling (as Bilski v. Kappos), and the judges raised the issue of software during the hearing.

The Supreme Court's ruling could greatly change the patentability of software patents, business method patents, and the middle ground of e-commerce patents.

If you are looking for US patent news, I recommend patentlyo.

Tuesday, May 18, 2010

Easy Presentation Slides with Latex-Beamer

If you are looking for an elegant way to create slides like the ones I have presented at montreal-python, I recommend latex-beamer.
Here is an example in 6 simple steps:
  1. mkdir trial; cd trial
  2. sudo aptitude install latex-beamer
  3. wget http://mlboost.svn.sourceforge.net/viewvc/mlboost/doc/ml-python-mtl-april2009.tgz
  4. tar -zxvf ml-python-mtl-april2009.tgz
  5. pdflatex ml-python-mtl-april2009.tex
  6. xpdf ml-python-mtl-april2009.pdf
If you don't remember LaTeX syntax, you can use:
Once you start using LaTeX, you never want to go back. Content and presentation should be well separated.

Saturday, May 15, 2010

Future of Quebec Software Engineer

Currently, new software engineers have an easy life. With a shortage of resources, you don't have to be good to find a job. When employers are desperately looking for people, they take what they can get and don't complain too much about it (in this context, they feel lucky). As an example, ETS software engineering students can choose between 17 internship offers. Will the situation change in the future?
  1. For a long time, the Canadian dollar was low, which made Quebec and Canadian software engineers pretty cheap. Unfortunately, this is changing and is already no longer true: they are starting to be expensive. Many software jobs in Québec are tied to US companies with an office in Canada, and many others rely on US clients.
  2. Easy tax credits for experimental development and scientific research will be harder to get, since the Canadian government is back in deficit and Quebec is facing an aging population and a shrinking number of taxpayers. For more info, check this post.
  3. Outsourcing is a cheaper and more accessible alternative. As an example, as more Indian engineers become fluent in English, they will become serious competition. Many companies are starting to use cheap remote resources or outsourcing services like rentacoder to lower their costs.
  4. Due to globalization and the nearby US market, many companies will prefer an average engineer with perfectly fluent English over a great engineer lacking English skills.
  5. Due to globalization and the nearby US market, many companies will promote an average engineer with perfectly fluent English over a great engineer lacking English skills.
Unlike doctors, who have a powerful professional college, software engineers don't have reserved acts that reduce competition and provide job security; they don't have the luxury of being uncompetitive.

What will future Francophone engineers need to succeed and be more competitive in this changing landscape?
  • They need to be fully bilingual (btw, many companies don't consider Francophone universities because they know that most of their software engineers aren't bilingual)
  • They need management skills. The ability to manage technical people is rare and critical to the success of projects; companies are more aware of it and are desperately looking for it. Increasing the project success rate is critical.
  • They need to be leaders and decision makers, not only a workforce.
  • They need writing and communication skills. Technical people who neglect this will pay the price in the long run. Without those skills, it is hard to move up to the upper levels of a company.
  • They need conflict resolution and negotiation skills. Engineers forget that their job is on average 20% technical and 80% HR related. Resolving human-related problems is an important part of a software engineer's work.
But most important, companies need decision makers who see the big picture, not only a disposable, easily replaceable workforce.
Outsourcing is at the door, and software engineers have to better understand the idea of comparative advantage because their cost advantage is slipping away.
The jobs of overly expensive, not perfectly bilingual engineers who lack management and communication skills will face increasing outsourcing pressure.

Friday, May 14, 2010

Disruptive business model: Lew Cirne, serial entrepreneur on the future of Enterprise Software etc.

I really like the point of view of New Relic founder Lew Cirne on the evolution of business models with the web. Basically, massive sales overhead is no longer required, which drastically reduces software cost and frees resources to build better software. Most of the time, software satisfaction is low because money is sucked up by so many overhead layers instead of being used to make better software. You should read his posts:

Software transition: recipe for a disaster

Everyone knows that software has a life cycle and dies at some point. If your software is part of your day-to-day operations, you can't afford to fail in managing its transition. If you want to reduce the probability of success of the transition, here are some rules to follow:
  1. Don't plan transition
  2. Don't communicate transition plan if you have one
  3. Promote non-skilled people to decision positions
  4. Kill initiative/ignore message (ex: kill messenger and think the problem is gone)
  5. Avoid long term planning (ex: use argument like business moving too fast)
  6. Don't appoint a technical team lead
  7. Don't do follow-up of team requests
  8. Don't involve HR to smooth transition
  9. Don't hire pro-actively
  10. Don't prepare a plan B
  11. Don't involve and/or update your team on the transition plan
  12. Don't give overall project responsibility to anyone
  13. Stay in a reactive mode
  14. Don't communicate info in daily meeting
  15. Don't ensure you have a good pulse of your team
  16. Give too much power to people who don't understand software process
  17. Like Greenspan, dream that the magic hand will fix everything magically (people might not compensate for bad decisions forever)
  18. Expect people to stay
  19. Don't distinguish between maintenance & development costs
  20. Let non technical people take technical decisions
  21. Ignore problems
  22. Don't recognize people's work during a crisis
  23. Allow managers to not be able to evaluate technical people
  24. Consider Indian outsourcing can solve everything
  25. Think people are easily replaceable
  26. Don't talk about career evolution to your crew
If you are lucky, you will have to replace your team one by one, but they could all resign in a short time and put your business in a major crisis. Don't forget, there is no free lunch.

Monday, May 10, 2010

Tax credits (RS&DE): the new reality = good news for startups

Last month, I went to the CEIM (centre d'entreprises et d'innovation de Montréal) to see a presentation on the new reality of the tax credit program in Canada starting in 2009. The RS&DE tax credit for scientific research & experimental development has changed drastically. Initiated in 1986, the program's claims reached 1.5 billion in 2000 and ~4 billion in 2008 at the federal level. Due to the increase in claims (20K claims in 2008), non-standard claims, and an unwritten need to reduce tax credits, an important change happened. Companies can no longer send non-standard, endless novels (romans-fleuves) to lost government officials. Corporations now need to explain in 1400 words the scientific advancements, the problems/technical uncertainties/risks, and the technological and scientific content. With 15% more staff and a 1400-word limit per claim, many more people can validate your claims. Knowing we are back in deficit, there will be pressure to reduce accepted claims. It will be much harder for cheaters to try to lose the reader because endless novels are no longer possible. Btw, CEIM offers interesting services.
It will be a harder sell for law firms specialized in R&D tax credit paperwork, who can't use their government buddies as easily as before to ensure their clients get their tax credits. This is good news for startups and bad news for those who are making things up, because it means less useless work for us and more time and money to create real wealth.

Saturday, May 8, 2010

VC funding & StartupCamp 6

Montréal StartupCamp 6 was quite interesting. As raised during the unconference discussion, French-speaking technology entrepreneurs seem to prefer bootstrapping their companies to looking for VC funding (ex: FeelingSoftware/Carré Technologies/DokDok/Vizmatic/QMining/Aptats/Nexyka etc.). Keynote speaker Dave McClure's presentation on how to create a good pitch to investors was very good:
  • Start with problem not solutions
  • Look for high reaction signal (good and bad)
  • Stop adding features
  • Focus on customer reactions, as close to real time as possible
  • Volume->Cost->Conversion | acquisition/activation/retention/referral/revenu...
  • etc.
VCs are basically risk-averse followers, like sheep, and should be treated accordingly.
But what's the point of getting VC funding for software startups? They don't need much more than computers & time. With VC funding you could get 6-12 months of runway, most of which will be reimbursed by R&D tax credits. Basically, they give you a cash advance in exchange for an important share of your company, at a ridiculously low risk. The VCs' main arguments are:
  • It will allow you to be first to market. That's bullshit; everyone knows timing is what matters most.
  • 10% of 15 million is better than 100% of 1 million... but you will still do most of the work, with more stress from investors, and you might end up with 10% of 2 million. Most entrepreneurs start their company to take control, not to go back to slavery mode.
Yes, it could allow you to build your product faster to meet a customer's needs, but better still, they could make client connections, which have real value. I think Steve Dekorte points out exactly what I profoundly dislike about VCs and haven't presented clearly in a previous post: VCs, like professional CEOs, are short-term people, as opposed to founders. With a different cost function, you will automatically get clashes, which lead to frustration and politics... basically inefficiency due to the lack of a common vision.

Andy Nulman's keynote presentation was interesting from the perspective of the importance of a partner and the need to adapt, but the mercantile conclusion was a pathetic Anglo-Saxon point of view: you could make more cash by doing the chicken dance than by doing something interesting.

Phil Telio's announcement about Notman, the new house dedicated to startups, was great for Montréal. I will definitely apply much of what I learned there. As an example, if you are a founder and will move to the CEO position, you had better start delegating what you are best at, because it will allow you to improve others' skills and will make supervising others very efficient.

Tuesday, March 16, 2010

Ubuntu remix & Asus Eee PC 1005PE 10''/250G 14hr review

Ubuntu remix is awesome. I am simply amazed. I installed it from a USB key on my new ASUS Eee PC 1005PE and everything works perfectly, yes everything... with a small fix for the wireless interruptions:
  • Wireless (see link to fix flakiness + interruptions)
  • Bluetooth
  • Sound
  • flash (ex on youtube)
  • hibernate
  • second monitor shortcut key
  • screen brightness shortcut key
  • ubuntu one (storage)
  • video
I don't need to carry my AC adapter anymore. The battery doesn't last 14 hrs but easily 10 hrs, more than enough for an entire day. The performance of the Atom N450 processor (1.66 GHz, 512 KB cache) is acceptable. If I compare heavy CPU neural network training with numpy, it takes a bit less than 5 times as long as on my AMD Athlon 64 (2.8 GHz, 512 KB cache). If I normalize the comparison by CPU frequency (5 x 1.66 / 2.8), it is only about 3 times slower, which is acceptable. I don't need Windows anymore on my laptop; my life will be much more pleasant.
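The frequency normalization above is simple arithmetic; as a small sketch (the helper name `normalized_slowdown` is made up, and 5 is the upper bound quoted in the text, not a new measurement):

```python
def normalized_slowdown(raw_slowdown, slow_ghz, fast_ghz):
    """Slowdown per clock cycle: scale the wall-clock ratio by the frequency ratio."""
    return raw_slowdown * slow_ghz / fast_ghz


# Atom at 1.66 GHz vs Athlon 64 at 2.8 GHz, about 5 times slower in wall-clock time
print(round(normalized_slowdown(5.0, 1.66, 2.8), 2))  # prints: 2.96
```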

Buy US made by Google.com: search protectionism?

At the confoo conference, I attended the "Comment faire une bonne campagne Search + Social ?" session. I nearly fell off my chair when I learned that google.com doesn't return the same results in the United States as in all other countries.
  • Everywhere except US: google.com = search around the world
  • US only: google.com = search in the US
Yes, if you search on google.com anywhere except in the US, you search the same worldwide index. If you are in the US, the search is limited to the United States only. Isn't that protectionism?
In many fields, it is great that it ranks the closest businesses higher (ex: closest pizza place), but it should not filter out non-US competitors when the service doesn't need to be close by (ex: consulting services).
Knowing that the United States is the biggest promoter of the free market/open market etc., I am totally amazed to learn of such a rule. Do as I say, not as I do. As an example, if you don't have an address in the States and/or host your web site there, SEO (search engine optimization) is useless for getting new customers from the States (an insignificant market ;)); they won't find you if they are using the biggest search engine... Google.
Does the Obama administration really need a Buy American plan?
The first rule of free trade is unrestricted access to information (see information economics: Joseph E. Stiglitz).
You can try it:
As you will see, results aren't the same.

Monday, March 15, 2010

Opportunities ahead to reduce the impact of doctor shortage and money constraints: Machine Learning

The "Collège des médecins du Québec" is hammering us with advertisements pushing the Québec government to find new funding to pay our doctors and specialists more. In industry, we make more money by being more efficient, not by increasing prices. Asking for more is the easy solution when others are paying. The "Collège des médecins du Québec" should promote efficiency to reduce costs, deliver better service, and generate more margin for nurses and doctors.
As an example, radiologists earn on average around 700K/year and ophthalmologists 600K/year. Yes, our general practitioners could earn more; their average is around 150K/year. As a heavy taxpayer, I suggest using the optician approach from BC: replace highly paid optometrists with machines to do the exams. Basically, automate what needs to be automated and use doctors efficiently, which could reduce or eliminate the shortage.
When people go too far with their salary expectations, it is time to bring them back to earth. No one accepts fee increases for poorer service.
The public system and doctors' studies are funded by our taxes, and health system spending represents close to 50% of Québec's spending; money doesn't grow on trees. Yes, doctors earn less than they would in the States, but the US doesn't have a public system; only wealthy people have access to care there, and that can't happen in Canada because the system is publicly funded.
I have tried to help happy clinic by contacting doctors to offer a smart waiting-time system so people wait less, but most of them didn't care much: "we are busy, people have to wait, it's a natural filter." They can make us wait for hours even with appointments and treat us like shit because the supply of service is low. When most of us are waiting, we aren't earning money to pay them. Shame on you. Everyone is losing at this game.
Doctors are getting greedy; they are starting to see themselves as untouchable and are forgetting who pays their salaries.
If well packaged, machine learning can be used by anyone to do high-level screening and provide valuable information. In 2010, doctors remain one of the only professions that doesn't use machines much to become more efficient, and they are fighting to stay the bottleneck. With this crisis ahead, it might be a good opportunity for the machine learning community to get into this shielded area, to the benefit of everyone, doctors included.

Friday, March 12, 2010

Why you Shouldn't Accept VC money earlier than you thought: Story about Venture Capital and an Outdated Decision Maker revised

I met Gary Haran at confoo and stumbled on one of his posts:
It might be advantageous to accept VC money earlier than you thought.
I had self-censored one of my posts about VC, but it needs to be republished to ensure entrepreneurs understand the possible deep consequences of such a decision. The post "Anatomy of a failed software project" triggered my decision to un-censor it: we don't talk enough about bad experiences, so as not to offend people or be seen as too negative, but it's reality and we should have realistic expectations. With Amazon EC2, Google App Engine, Rackspace hosting and so on, it is becoming far less costly to bring new technologies to market. Most software companies don't need big funding anymore and have just one more reason to avoid VCs.
I strongly recommend bootstrapping your company with consulting and/or R&D credits, INRS programs etc.

So here is the revised version of the post I published on Apr 29, 2009 12:32 AM and later removed.

Once my boss, who was one of the founders of the company I was working for, told me: avoid VC as much as you can; it should stay your last resort. I only realized the deep importance of this after some American shark VCs took control of our company and enforced their ultra-capitalist short-term vision. GM is a good example of short-term vision, but that's another interesting story.

Venture capital represents high-risk investment, and its only goal is to get the highest return on investment in the shortest time period. That's fair; you simply have to know the rules.

Entrepreneurs build and invest; VCs plunder everything they can. At that time, they replaced the top management with their Californian superstars or ... leftover turnips (I am exaggerating a little). One of their last superstars while I was there, whom I will refer to as "Le Chasseur", came to take the highest position on the technical side of the business. At that time, I was working on new technology, and this old cowboy grandpa came to tell me that we weren't doing real research. According to him, real research was what he was doing 25 years ago with his wires and transistors. I simply told him that our group of 3 (i.e., including the manager) wasn't claiming to do research, simply applied research, and definitely not pure research like his thousands of coworkers were doing in the old monopolistic US company he was referring to. This guy had made a lot of money in the good years of the Internet and was annoying me with his stories; I should have told him to go tell them to his grandchildren, simply retire, and let us work. His salary alone could have let us build a much better team to continue innovating in technology, which was our core business. Without any support from him, and even with him working against us, we still managed to release a completely new technology (i.e., prototype + knowledge transfer to production + production support) that is bringing quite a lot of income to the company. Charlie, the most important point is to bring technology to the market; nothing else matters... I succeeded even though he desperately tried to make us fail to justify moving the advanced work to California.

Those clowns (I know, I am overgeneralizing) were just sucking up all the financial resources of our company. Being still a shareholder, I was relieved to hear that "Le Chasseur" was finally fired in a reorganization. If you need a reorganization to do cleanup in a private company, your level of political games must be quite high... and politics leads to corruption and mediocrity in such an organization. We could have become a great company, but it might remain an average one. I don't think I deserve that after such investments, but money leads. According to old colleagues, the latest reorganization was efficient.

I think the concentration of smart people is higher in CA than in Montreal, but we definitely landed on some bad apples.

Key decisions can't rest in incompetent hands; sooner or later the guillotine will come. If you have made money and reached your level of incompetence, please let others take the lead. If you haven't made money, just reorient your career. We don't care about your ego, the fact that you are the initial founder, etc.; what is important is the company's success, and everyone will benefit from it.

You might have better stories and/or points of view about VC, but what I saw wasn't great:
  • They stop/restrain investment in innovation
  • They drop in some incompetent clowns (overgeneralized to make the point)
  • They suck up all the money to pay their clowns
  • They make your shares less valuable
  • They drown massive amounts of money in building products that have no real value and in non-core technology
  • They create a real wall between decision makers and the workforce
  • They invest a lot in fake partnerships
  • They love yes-men
  • They feel way smarter (especially CA vs Mtl)
American VCs are like the IMF for Argentina in the 2001 crisis or for Korea in the Asian financial crisis, or Goldman Sachs for Greece's bankruptcy: you are going to be fucked.
I hope that I will never in my life work again for a company that gets taken over by VCs. If you need VC, you might already be badly organized to need money from them so desperately. I would definitely like to hear good stories about VCs where they achieve their goal... at least.

Yes, I had a painful indirect experience with VC and an outdated decision maker, but I knew that startup environments aren't easy. I learned the rules and the game better by playing it. I hope they will find a way to make money, because success breeds success and Montreal deserves more of it; there is so much talent here. I have put the emphasis on one bad decision maker, but the CEO and the CFO were interesting characters... but all it takes is one bad apple dropped in by your VC to fuck up your company, and they will get most of the benefits.

As mentioned on Samuel Bouchard's blog, we need far more private funding alternatives, but we aren't there yet.

Sunday, January 31, 2010

python try catch cost vs hasattr (overhead = only 2X)

To support different versions of code, you might end up using try/except code blocks. But, in the worst case, what is the overhead of throwing an exception every time?
In some cases you may be able to use hasattr to avoid throwing an exception, but your code will be less elegant and/or readable.
In order to make a rational decision, let's compare the time cost. I was expecting a major cost for throwing an exception, but it ends up being only 2 times more costly than doing a hasattr.
You can see the simple code I used to compare them here. From now on, I will be less scared about performance when I am using try/except.
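A comparison of this kind can be sketched with the standard timeit module. This is not the linked script: the class `Old`, the attribute name `value` and the two helper functions are stand-ins invented for the example, and the ratio you measure will depend on your interpreter and machine.

```python
import timeit


class Old:
    """Stand-in for an object from an older code version: no `value` attribute."""


def with_try(obj):
    # Worst case: the attribute is missing, so the exception is raised every time
    try:
        return obj.value
    except AttributeError:
        return None


def with_hasattr(obj):
    if hasattr(obj, "value"):
        return obj.value
    return None


def compare(n=200000):
    """Return (seconds for try/except, seconds for hasattr) over n calls each."""
    obj = Old()
    t_try = timeit.timeit(lambda: with_try(obj), number=n)
    t_has = timeit.timeit(lambda: with_hasattr(obj), number=n)
    return t_try, t_has


if __name__ == "__main__":
    t_try, t_has = compare()
    print("try/except: %.3fs  hasattr: %.3fs  ratio: %.1fx" % (t_try, t_has, t_try / t_has))
```

Both helpers return None for an object without the attribute, so the timing difference comes only from how the missing attribute is detected.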

Friday, January 22, 2010

The power of python within Tomcat for powerful webapps (jython2.5.1)

Many companies are looking to speed up their development process. Unfortunately, they are restricted to well-established web server frameworks like Tomcat and, in this example, they feel limited to Java.
Jython comes to the rescue. Jython 2.5.1 was released on September 26th, 2009, and can be used to create servlets that run inside the Apache Tomcat application server. Jython lets you write Python that runs on the Java VM (100%), use many pure Python libraries, and use all Java packages with Python syntax.
If you try it, you might have issues like:
  • How can I create a simple jython2.5.1 servlet? (not the deprecated jython2.2.1, which doesn't allow you to use most pure python libs)
  • Why do I get ImportError when I use standard python packages? How can I fix it?
  • Where should I put jython2.5.1.jar?
  • Where should I put my python code?
  • Where can I get a basic example that is working?
  • Where should I put my pure python libraries?
So I have decided to package a simple example (download: HelloWorld.war).
Here are the steps you should follow:
  1. get tomcat http://tomcat.apache.org/ (I used 5.5.28)
  2. get jython2.5.1 http://sourceforge.net/projects/jython/files/jython/jython_installer-2.5.1.jar
  3. cp -r ~/jython2.5.1/Lib tomcat5.5.28/share/lib (required to use std python libs)
  4. cp ~/jython2.5.1/jython.jar tomcat5.5.28/share/
  5. download example: wget http://mlboost.svn.sourceforge.net/viewvc/mlboost/jython/HelloWorld/HelloWorld.war
  6. cp HelloWorld.war tomcat/webapps
  7. download java jdk
  8. export JAVA_HOME=sun-jdk-1.6.0_02
  9. tomcat5.5.28/bin/startup.sh
  10. try it: http://localhost:8080/HelloWorld/HelloWorld.py
If you want the flexibility of python but are stuck with java within tomcat, jython has become a real alternative since the release of jython2.5.1.

PS: a war file is just a zip file. You can unzip it in tomcat/webapps for testing, so you don't need to rezip it and restart the server after each change. When you are done, simply run jar cvf HelloWorld.war * in the tomcat/webapps folder and ship that single file to the client's tomcat server (make sure jython is installed there). If you want to add pure python libraries, simply add them to your war file and they will work.


Here is the time comparison of the same service:
  • python: wsgi httpserver
  • jython: wsgi httpserver
  • tomcat: java servlet jython2.5.1
It is interesting to see that the tomcat servlet provides better performance.
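The benchmarked service itself isn't shown in this post. As a rough idea of what the "wsgi httpserver" variant could look like, here is a minimal sketch using the standard wsgiref module (Python 3 syntax). The hello_app and serve names and port 8080 are assumptions for illustration, not the code that was measured.

```python
from wsgiref.simple_server import make_server


def hello_app(environ, start_response):
    """Minimal WSGI application returning a fixed plain-text body."""
    body = b"Hello World"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def serve(port=8080):
    """Serve hello_app until interrupted (blocks the calling thread)."""
    httpd = make_server("", port, hello_app)
    httpd.serve_forever()


# serve()  # uncomment to listen on http://localhost:8080/
```

Because a WSGI application is just a callable, the same hello_app can be called directly in a test without starting a server.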

Friday, January 8, 2010

matplotlib & python for powerful data visualization

Here is an example of data that isn't obvious to analyze:
How much does each party gain or lose when power is allocated by percentage of seats rather than by proportional representation? Percentage of seats is what usually counts in legislative assemblies; it is the process used in Canadian and Québec elections.

Powerful visualization lets you see the effect easily, and python & matplotlib are an amazing combination for it. It took me 20 minutes to visualize the effect in the 2008 federal and Québec elections.
The upper graph (seats vs votes) shows what each party gains or loses relative to its share of the vote under a seat-based approach. As an example, the Liberals gain ~11 points and the ADQ loses ~11 points.
The lower graph (lost seats vs votes) shows the real impact on a party: the ratio of this gain or loss to its actual share of the vote. In this example, it is a gain of ~25% for each Liberal vote (~11 points gained on ~42% of the vote), a loss of ~66% for the ADQ and ~88% for QS.
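The ratio behind the lower graph can be sketched in a few lines. This is a worked example, not the seats_vs_prop.py script: seat_gain is a hypothetical helper, and the Liberal figures (about 42.08% of the vote, 66 of 125 seats) are approximate 2008 Québec numbers quoted from memory, so treat them as assumptions.

```python
def seat_gain(vote_pct, seats, total_seats):
    """Return (gain in percentage points, gain relative to the vote share)."""
    seat_pct = 100.0 * seats / total_seats
    gain_points = seat_pct - vote_pct        # upper graph: seats % minus votes %
    relative_gain = gain_points / vote_pct   # lower graph: gain per vote received
    return gain_points, relative_gain


# Assumed figures for the Liberals in the 2008 Quebec election.
points, relative = seat_gain(vote_pct=42.08, seats=66, total_seats=125)
print("gain: %.1f points, %.0f%% per vote" % (points, 100 * relative))
```

With these inputs the gain is about 11 points, or roughly 25% relative to the votes received; a party whose seat share is below its vote share gets negative values from the same function.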
Basically:
  • In the Canadian election, the PC & BQ gain power (the BQ far more in proportion) and the Greens lose everything
  • In the Quebec election, QS & the ADQ lose a lot of power and the PQ and LIB gain it, which might explain why they aren't talking about changing the election formula
  • matplotlib and python are an amazing combination to automate data visualization
To get the code, run:
svn co https://mlboost.svn.sourceforge.net/svnroot/mlboost/elections
python elections/seats_vs_prop.py

Gerrymandering Explained (youtube: Gerrymandering - another reason why rep democracy is fundamentally corrupt)

Sunday, January 3, 2010

gmail, a powerful target marketing tool

When I got my gmail account many years ago, I wondered why google was spending (or wasting) so many resources to provide a free email service like hotmail, yahoo and so many others. Were they making enough money on gmail advertisement? Was it worth the investment?
I thought at one point that their primary end goal was to launch a corporate email portal, so companies wouldn't need to hire highly paid sys admins to provide mail server support, and at the same time to help employees worldwide get something far better than the outlook/exchange server that pollutes our lives. They are doing that already, but I think their real goal was target marketing, just not the traditional kind.
What could be better than a user's emails to understand his profile and do target marketing? They get the highest quality info from your emails, yes, your emails.
In my experience with more traditional target marketing for Bell and at Microcell-Lab, practitioners have little info about users and derive new information from it to try to generate better predictions. As an example, they use your postal code to estimate your family revenue etc., and use that information to better predict whether you will buy X or Y.
Traditional target marketing practitioners use a lift approach to get the top N most probable buyers for a given product or service and then approach those people with promotions, emails, etc. With gmail, it is much simpler: you use user profile info like email words (btw, they are parsing your emails; take a close look and you will see), use a prediction engine to pick what you are most likely to like or buy, and show it to you directly because you are using their mail service.
Gmail is an amazing target marketing tool because it gets profile info directly from the user, can make far better predictions than traditional target marketing techniques, has direct access to the customer, and scales well as more users join. Your predictions are only as good as your data. What's the point of improving the algorithm if you can get better, higher quality data or, as google does, both? Larry Page and Sergey Brin are just visionary target marketers!

Friday, January 1, 2010

Jython, pyPdf, reportlab experimentation & patches proposals

I am currently experimenting with jython to manipulate pdf files. I encountered several problems and I want to share some of the solutions (time to give back).

Intro: only pure python code and libraries work on both python and jython. Python packages that rely on C code aren't compatible. Jython provides python syntax on top of the java VM, and one great thing is that you can use java classes from python. Jython2.5.1 was released last September.

1) manual pdf text modification
I thought it would be simple to modify a pdf template to change some text, but I was wrong. Even if you are able to re-encode the new text and update the length, you will hit walls. It is more complex than that (xref etc.). Most pdf libs provide encoding helper functions, but you will have a hard time finding decoding ones, ascii85 for example. After some time, I decided to try to make reportlab work with jython.
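For readers on a recent CPython, the standard library now covers both directions: base64.a85encode and base64.a85decode (Python 3.4+) support Adobe-style framing. A minimal round-trip sketch follows. This was not an option with jython2.5.1, and the sample bytes are only an illustration, not an excerpt from a real pdf stream.

```python
import base64

# Illustrative payload; a real pdf stream would hold page content operators.
original = b"BT /F1 12 Tf (Hello) Tj ET"

# adobe=True wraps the output in the <~ ... ~> markers used by Adobe tools.
encoded = base64.a85encode(original, adobe=True)

# Decoding with adobe=True expects the trailing ~> marker and strips the framing.
decoded = base64.a85decode(encoded, adobe=True)

print(encoded)
print(decoded == original)
```

Decoding the stream is only the first step: as noted above, changing the text also means keeping lengths and the xref table consistent.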

2) reportlab import error with jython
I tried to use reportlab, a powerful lib to create PDFs, but importing reportlab.pdfgen raised this error: java.lang.ClassFormatError: java.lang.ClassFormatError: Invalid method Code length 66566 in class file reportlab/pdfbase/_fontdata$py. According to this thread on markmail, there was a simple solution, but the patch wasn't working. You can find the working patch that I created and proposed to the reportlab team here.

3) Saving pdf to memory instead of files
To do in-memory pdf manipulations, I used the pure python pyPdf lib from Mathieu Fenniak. I tried to save a canvas in memory and couldn't figure out why it wasn't working. Basically, I was doing outputStream.writelines(c._doc.GetPDFData(c)). Unfortunately, it doesn't work if you don't call c.showPage() first. I also realized that I could create a canvas directly with a StringIO object as the filename argument, because pdfdoc.PDFDocument.SaveToFile(...) checks whether fname has a write method. I have proposed a canvas api improvement, which you can find here, to make it friendlier to use.

4) Simple comparison python/jython
I was wondering how much slower jython is compared to python. As you can see, it is slower, and the gap grows with the size of some parameters (ex: n pages). In this example, it also takes 4 to 6 times more memory.

5) Jython out of memory
If you get:
OutOfMemoryError: java.lang.OutOfMemoryError: GC overhead limit exceeded
use the -J-Xmx1024m jython option to give the java VM a larger heap.

6) Threading optimization
Jython doesn't suffer from the GIL problem; look at the video "Mindblowing Python GIL" for more information about it. Basically, jython can do real multi-threading. In my context, I could easily parallelize part of my code, so I tried it using the threadpool module by Christopher Arndt. Unfortunately, I still haven't been able to make it faster. pyPdf hasn't been designed to be used in a real threading environment (a PdfFileReader can't be shared between threads), which introduces limitations.
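The "one reader per task" constraint can be sketched with the standard concurrent.futures module (Python 3; the experiment above used Christopher Arndt's threadpool under jython instead). FakeReader, count_pages and count_all are hypothetical names, and FakeReader merely stands in for an object that must not be shared between threads.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeReader(object):
    """Hypothetical stand-in for a reader that must not be shared between threads."""

    def __init__(self, data):
        self.data = data

    def page_count(self):
        # Pretend every 10 bytes is one page.
        return len(self.data) // 10


def count_pages(data):
    # Each task builds its own reader instead of sharing one across threads.
    reader = FakeReader(data)
    return reader.page_count()


def count_all(documents, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input documents.
        return list(pool.map(count_pages, documents))


documents = [b"x" * 30, b"y" * 50, b"z" * 20]
print(count_all(documents))
```

Under CPython the GIL still serializes pure python work, so this only illustrates the structure; any speedup would depend on running it on an interpreter with real multi-threading and on how much per-task setup (such as re-reading the file) the separate readers cost.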

7) pyPDF profiling
pyprof2calltree is an amazing tool for profiling, as you can see in the figure. Guessing what needs optimization is a path we should never take, because most of the time we are wrong and waste our time on uncritical optimizations that make code unreadable. I saw a great presentation from Mike C. Fletcher on python profilers at pycon 2009. I might try to optimize the readObject function of pyPdf.
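Measuring instead of guessing can start with the standard cProfile and pstats modules, whose dump files are the kind of input pyprof2calltree converts. A minimal sketch, where slow_sum is a hypothetical workload standing in for the code under investigation and profile_to_text is an illustrative helper:

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Hypothetical workload standing in for the code under investigation."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_to_text(func, *args):
    """Run func under the profiler and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()

    # Format the stats sorted by cumulative time, keeping the top 5 entries.
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(5)
    # stats.dump_stats("run.prof")  # optional: save the raw stats for a viewer
    return result, out.getvalue()


result, report = profile_to_text(slow_sum, 100000)
print(report)
```

The text report is enough to spot the dominant function; the dumped file is what you would hand to a graphical viewer.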

It is amazing to see the tremendous effort people put into making python syntax available on every platform (java->jython; .Net->ironpython etc.). It is a sign of python's great syntax.