Sunday, December 13, 2009

Business Contract 2.0 - Found a template

Since I am flirting with the idea of doing consulting, I was looking for advice about common mistakes and better ways to evaluate client contracts. Contracts can put you under obligations you haven't thought of, make you slip your deliveries, force you to do more work than planned and, worst of all, jeopardize your relationship with clients. The root cause is most likely unclear expectations on both sides. With spaghetti clauses that only lawyers can understand, it is easy to fall down this path, and everyone loses at this game.

RIM (Régionale des ingénieurs de Montréal / Ordre des ingénieurs du Québec) was organizing a training workshop called "Business Contracts 2.0". The presentation by Gilles Thibault of Edilex was extremely interesting.

Currently, there is no established standard for creating contracts. Basically, there are as many forms as there are lawyers. Most of the time, lawyers are the only ones comfortable with contracts because they wrote them, in their own legal jargon. Unfortunately, this doesn't help the people who actually use them. Most of the time, clients and contractors have a hard time understanding them, but it is a lucrative process for lawyers billing high hourly rates. According to Gilles, lawyers will disappear if they can't provide better services for contracts (preaching for his own business).

Edilex proposes a contract template to ensure nothing is missing and to enforce a structure that sets clear expectations between the parties. A contract needs a table of contents; it's like a plan.

Proposed Template blocks:

  1. Identification & location (name + birth->juridic name + rights applicable)
  2. Party identification (Physic/moral-Society/union/coop ->representative liquidator Trustee/power delegation etc. )
  3. Preamble (context used for fall-back defect clauses)
  4. Lexicon (clarification/disambiguation & shorter sentences)
  5. Object (simple/multiple;utility conditions/redaction)
  6. Cost (adjustment/payment method/warranty/phase delivery/penalty late payment etc.)
  7. Attestations party A (not obligation/warranty-improve trust-don't want to fight about obligation and duty to disclose information)
  8. Attestations party B
  9. Reciprocal obligations
  10. Obligations party A (Align with Business process/order of execution)
  11. Obligations party B
  12. Special provisions (orphan/specific/bi-directional)
  13. General provisions
  14. End of contract (resolution/termination)
  15. Start of the contract
  16. Duration
  17. Scope
  18. Annexes
So basically, why a template?
  • Clearly defines what is included and what isn't (you can't remove sections)
  • Provides a table of contents (no need to reread the whole contract)
  • Helps find holes, unclear points and possible points of conflict
  • Dramatically reduces room for judge interpretation during a conflict
  • Can help you decide not to get involved in the project (risk/client honesty)
  • Provides a structured, uniform frame
  • Enforces clarity

I will try to apply some of those ideas; I now feel much less scared and better equipped to sign new contracts. I really like this approach.

Wednesday, November 18, 2009

Leaky assumption and Gradient Descent- part 2/3

Last February, I posted the first part of this post.
Basically, I was claiming that "uncorrelated inputs" is a leaky abstraction and the root cause of back-propagation's poor results when training huge or deep neural networks. In my view, this simplification is fine while the number of parameters remains small. My hypothesis was that optimization problems grow with the number of parameters, which implies an implicit limit on the usage of this abstraction and explains those poor results.
During my research between 2001 and 2003, I focused on finding a way to train a neural network faster, as presented on the left side of the figure. Unfortunately, I didn't find that revolutionary algorithm but simply documented various effects of optimization problems and ways to reduce or eliminate them, with experimental results.
In 2008, I went to see Nicolas Leroux's PhD defense and, close to the end, he brought back the idea that optimization problems could be the problem, without presenting solutions, which revived my research interest.
It reminded me of a last crazy experiment I did in 2003: I found an algorithm with the characteristics of the right-side figure, but it didn't hold much of my attention at the time. Reducing optimization problems doesn't necessarily imply a faster drop in classification error over time, but it should per iteration (i.e., per epoch).
For a year, I tried to reproduce that experiment in my free time. PA came to the rescue to discuss the underlying assumptions, brainstorm and help reproduce the experiment within flayers. Flayers wasn't suiting our needs anymore; the process of getting back into the details of the implementation was dramatically reducing our experimentation throughput. We finally decided to drop flayers and rewrite it in Python (optbprop) to ensure better collaboration and much faster experimentation. We had to make several optimizations to make Python's speed acceptable, but it was still slower than flayers (~10x; see post). Even if it was slower, ultra-fast experimentation became possible and research speed increased dramatically as we tried to recreate the experiment. The complexity resided in the order of the parameter optimization: what was the right recipe?
In June 2009, some time after ICML, Jeremy joined as the third collaborator and I finally reproduced it; the results were even better. I used an output max-sensitivity ordering followed by a max hidden-sensitivity backprop strategy.
Unfortunately, it was too good to be true: Jeremy found a critical problem in the solution, which led to extremely poor generalization. At that point, motivation was too low, so we decided to stop our research. I was disappointed, but also truly relieved; I could finally move on to something else.
Without this collaboration, I couldn't have reached that point alone. I am still not 100% convinced that this research path is dead, but its value is back way below my motivation threshold.
What could explain such bad results? We were expecting lightning learning speed, or at worst small improvements compared to standard stochastic backprop. Our hypothesis is that uncorrelated stochastic learning breaks an implicit normalization process required for generalization. Basically, for each example, we update the parameters to predict its class correctly, but the update is too violent and unlearns previous examples much faster. We lose the higher-level goal, which is generalization. Normalization could be re-integrated with batch learning, but we haven't experimented with it.
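The unlearning effect described above can be illustrated with a toy model (a minimal sketch in plain Python, not the optbprop code; the single-weight model, the two conflicting examples and the learning rates are all illustrative):

```python
# Two conflicting training examples for a single weight w of a linear
# model y = w*x: example A (x=1) and example B (x=-1) both want y=1,
# so their gradients on w point in opposite directions.
X = [1.0, -1.0]
Y = [1.0, 1.0]

def loss(w, x, y):
    return 0.5 * (w * x - y) ** 2

def grad(w, x, y):
    return (w * x - y) * x

# Aggressive per-example updates: a learning rate of 1.0 fully corrects
# the current example, completely unlearning the previous one.
w = 2.0
for _ in range(10):
    for x, y in zip(X, Y):
        w -= 1.0 * grad(w, x, y)
per_example_losses = [loss(w, x, y) for x, y in zip(X, Y)]  # A ends up badly wrong

# Batch update: averaging the gradients acts as a crude normalization
# across examples and converges to the best compromise, w = 0.
w_batch = 2.0
for _ in range(100):
    g = sum(grad(w_batch, x, y) for x, y in zip(X, Y)) / len(X)
    w_batch -= 0.5 * g
batch_losses = [loss(w_batch, x, y) for x, y in zip(X, Y)]

print(per_example_losses, batch_losses)
```

With a learning rate that fully corrects each example, the weight just oscillates between the two examples' solutions; averaging the gradients instead converges to the compromise both examples can live with.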
If it is true that "when you're ready to quit, you're closer than you think", I might write part 3 of this post, but it will take some time. There is an interplay between machine learning and optimization, and people tend to forget it.

Monday, November 16, 2009

Toward better Web Monitoring Solutions

Confoo, a web technology conference, will take place in Montreal in March 2010. If my proposal is accepted, I will present "Toward better Web Monitoring Solutions". Here is a summary:

Web applications are slowly becoming the new standard. No more installation nor upgrades, they are accessible from any internet connected device. While being the Holy Grail to users, web applications can be a nightmare to engineers, as ensuring quality of service becomes harder.

In fact, web applications create a high level of testing complexity, bringing new challenges to quality and availability of service.

As more and more businesses rely on web applications, techniques such as real-time web monitoring, incident detection and root cause analysis have become critical.

We will present these new problems in detail, followed by a short history of techniques used to measure and estimate the quality of web-based applications. We will review the most popular monitoring technologies, pointing out their advantages and shortcomings.

This presentation will be done in collaboration with Sebastien Pierre.

Knowledge Workers - Talent is not patient, and it is not faithful

Knowledge workers are as impatient as great programmers. They are hard to replace and train, yet some corporations are not proactive about it. Maybe it isn't costly enough?
Of course, HR, managers and VPs aren't the ones struggling to compensate for critical components when some of them leave; the underlying teams are. Those decision makers have to remember that there is no free lunch, and that mismanagement of knowledge workers has consequences.
Better management of knowledge workers leads to much more productive teams and low turnover, but the opposite could cause your decline. What's so hard about creating a win/win approach instead of the lose/lose approach that so many companies seem to fall into? Maybe a generation clash? Or simply missing competencies.

Wednesday, September 2, 2009

What's the relationship between Machine Learning and Data-Mining

Machine Learning and Data Mining are closely related, but the link isn't clear to most people. I'll try to clarify it in this short post.

Let's start with definitions:
  • Data-Mining (DM) is the process of extracting patterns from data. The main goal is to understand relationships, validate models or identify unexpected relationships.
  • Machine Learning (ML) algorithms allow computers to learn from data. The learning process consists of extracting patterns, but the end goal is to use that knowledge to make predictions on new data.
In both ML and DM, we start by extracting patterns. In DM, the process ends there: we look at the patterns. In ML, we reuse the learned patterns to make predictions.

One important difference in pattern extraction is that machine learning algorithms don't need an understandable representation of the patterns, but data miners do. For example, it is hard to understand exactly what a neural network has learned, but decision trees are easy to understand and compare. On the other hand, comprehensible patterns allow machine learning practitioners to identify data problems and, by fixing them, improve the prediction accuracy of their model.

So basically, the data-mined patterns learned by any machine learning algorithm can be used to make predictions on new data.

Some people might simply say that they are the same; the only difference is how you use the learned patterns: to understand or to predict.
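To make the distinction concrete, here is a minimal, hand-rolled sketch (a one-rule "decision stump"; the data and names are illustrative):

```python
def learn_stump(xs, ys):
    """Pattern extraction (shared by ML and DM): pick the threshold on a
    single feature that best separates the two classes."""
    best = None
    for t in sorted(xs):
        errors = sum((x > t) != y for x, y in zip(xs, ys))
        if best is None or errors < best[1]:
            best = (t, errors)
    return best[0]

xs = [0, 1, 2, 3, 4, 5]
ys = [False, False, False, True, True, True]

threshold = learn_stump(xs, ys)

# Data mining: look at the pattern itself to understand the data.
print(f"pattern: class is positive when x > {threshold}")

# Machine learning: reuse the same pattern to predict on new data.
predict = lambda x: x > threshold
print([predict(1.5), predict(4.5)])  # -> [False, True]
```

The same learned pattern serves both purposes: reading it is data mining, applying it to new inputs is machine learning.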

Unsupervised learning can be considered data mining because it doesn't involve prediction. To understand the differences between discovered clusters, we can simply run supervised learning on a dataset tagged with the discovered cluster labels.

Sunday, July 5, 2009

digipy 0.1.1 - Hand Digit Real Time Demo is available

At Montreal-Python 6, I presented a real-time hand digit recognition demo.
This demo allows you to do real-time digit recognition from your digital camera. It lets you load any trained neural network and apply the same feature extraction in real time. The demo allows you to train, extract features, use trained neural networks inside the real-time demo, visualize features in 2D along with their frequency distributions, and get feature discriminant weights.

The packaging 0.1.1 of the demo is now available on pypi:
(unfortunately, some dependency packages aren't supported by easy_install, so you have to do 4 steps instead of 1)
  • install opencv (sudo aptitude install python2.5-opencv)
  • install PyQt (sudo aptitude install pyqt4-dev-tools)*
  • install matplotlib (sudo aptitude install python2.5-matplotlib)*
  • sudo easy_install digipy

* unfortunately, this package isn't supported by easy_install

Here is the noise robustness comparison of the trained neural network on raw pixels vs. extracted features (digit surface + image convolution with the trained digit means (0-9)):

If you aren't convinced by now that feature extraction is absolutely required, I have failed.
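The two feature families mentioned above can be sketched as follows (a minimal sketch with made-up data; digipy's actual feature extraction may differ in detail):

```python
# Sketch of the two feature families: "digit surface" (inked area) and
# match scores against the per-class mean digit images (0-9).
import numpy as np

def extract_features(img, class_means):
    """img: 2D array of pixel intensities in [0, 1];
    class_means: list of 10 mean digit images, one per class."""
    surface = float((img > 0.5).sum())             # inked-pixel count
    # correlation of the image with each class mean ("convolution" features)
    matches = [float((img * m).sum()) for m in class_means]
    return [surface] + matches                      # 1 + 10 = 11 features

rng = np.random.RandomState(0)
means = [rng.rand(8, 8) for _ in range(10)]         # illustrative stand-in data
img = rng.rand(8, 8)
feats = extract_features(img, means)
print(len(feats))  # -> 11
```

Projecting a raw 64-pixel image down to 11 class-aware features is what makes the classifier robust to pixel noise in the comparison above.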

Once installed, you will get access to those command line tools:

  1. digipy: Real-time hand digit recognition demo application (ex: digipy --test)
  2. digipy-features2D: demo of 2D feature visualization to see possible clusters
  3. digipy-train: demo training of a Neural Network using mlboost
  4. digipy-compare: compare noise effect on test error on raw inputs and feature extracted datasets
  5. digipy-freq-analysis: demo feature analysis (frequency distributions)
  6. digipy-extract-features: demo features extraction
  7. digipy-see-data: show dataset train and test samples

If you have any trouble using it, just let me know (fpieraut at gmail).
Source code is available here

(note: digipy now uses the mlboost.nn module for its NeuralNetwork instead of the mlboost.flayers swig wrapper)

Wednesday, June 17, 2009

ICML highlights summary

Let me try to summarize in a few sentences what I learned:
  • Language acquisition: children lose their capacity to distinguish some phonemes to reduce the scope of choices when learning the language of their environment. Acquiring phoneme categories from a bottom-up approach (signal processing + unsupervised clustering) isn't sufficient; lexical minimal pairs (e.g., n-grams) seem to be required to ensure learning.
  • Trying to learn the best kernel while restricting the optimization to a convex problem seems to be a dead end. It might be time to change paradigm or move to the non-convex dark side.
  • Boosting is too sensitive to noise, but a robust framework was presented by Yoav Freund.
  • Deep architectures seem to be the next big thing. Regularization, auto-encoders and RBMs can be used to pre-train networks from unlabeled data. Temporal coherence (similarity of consecutive frames in video) can be used as an unsupervised regularization technique in the embedding space. Unsupervised pre-training is a regularization technique that enforces better clustering. The more unlabeled examples are used, the better the generalization.
  • Training from IID samples isn't optimal; curriculum learning (i.e., increasing example complexity) seems to smooth the cost function and leads to faster training and better generalization.
  • GPUs are the way to go to make ML algorithms scalable.
  • Feature hashing is an efficient strategy for dimensionality reduction and can be used to train classifiers.
  • Sparse transformations simplify the optimization process (i.e., the same idea as the kernel trick in SVMs). PCA does the opposite.
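The feature hashing idea can be sketched in a few lines (a minimal sketch; the hash function, the sign trick and the dimension are illustrative, not a specific library's implementation):

```python
import hashlib

def hash_features(tokens, dim=16):
    """Map a bag of tokens into a fixed-size vector by hashing each token
    to an index (and a sign, to reduce collision bias)."""
    vec = [0.0] * dim
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        index = h % dim
        sign = 1.0 if (h // dim) % 2 == 0 else -1.0
        vec[index] += sign
    return vec

# Two documents of any vocabulary size land in the same 16-dim space,
# so a linear classifier can be trained without building a dictionary.
v1 = hash_features("the quick brown fox".split())
v2 = hash_features("the lazy dog".split())
print(len(v1), len(v2))  # -> 16 16
```

No dictionary is stored, and the dimensionality is fixed up front, which is exactly what makes the trick attractive for large-scale multitask learning.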
Research in neural networks had stalled because we couldn't estimate error bounds due to their non-convex cost function; there was no theoretical framework left. SVMs came to the rescue by providing 3 major benefits: a convex cost function, a better generalization process (margin maximization) and less parameter tuning.
Unfortunately, they don't scale well, the kernel is hard or impossible to choose well enough to reach an optimal solution, and they don't allow deep architectures. For the same capacity, a shallow architecture needs more neurons than a deep one, and large shallow architectures are much more prone to numerical issues.
Deep architectures came back with convolutional deep neural networks applied to object recognition, and then Hinton proposed a breakthrough: a generative approach to initialize the parameters.
Unsupervised learning leads neural networks to a much better initialization state, and its regularization provides better generalization. But even with better initialization, we still aren't able to explore the function space any better, which, in my opinion, leaves the question open: is this an optimization problem? The local minima we observe might be an illusion created by gradient cancellation between opposite gradients, an optimization problem induced by the leaky assumption of uncorrelated features, which leads people to optimize all parameters at the same time. ICML was inspiring.
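The pre-training idea above can be sketched with a tiny linear autoencoder (a toy sketch on random data; real deep pre-training uses RBMs or stacked non-linear auto-encoders):

```python
# Learn to reconstruct unlabeled data through a bottleneck, then reuse
# the encoder weights W to initialize a classifier's first layer instead
# of random weights ("better initialization state").
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(200, 8)                       # "unlabeled" data

W = rng.randn(8, 4) * 0.1                  # encoder: 8 inputs -> 4 features
err0 = np.mean((X.dot(W).dot(W.T) - X) ** 2)   # reconstruction error before

for _ in range(500):
    H = X.dot(W)                           # encode
    E = H.dot(W.T) - X                     # reconstruction error (tied weights)
    grad = X.T.dot(E).dot(W) + E.T.dot(X).dot(W)  # d/dW of 0.5*||E||^2
    W -= 0.001 * grad / len(X)

err1 = np.mean((X.dot(W).dot(W.T) - X) ** 2)   # reconstruction error after
print(err1 < err0)
```

After this unsupervised phase, W encodes the dominant structure of the data, so supervised fine-tuning starts from a far better point than a random initialization.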

Wednesday, June 10, 2009

International Conference on Machine Learning (ICML2009)

The 26th International Conference on Machine Learning (ICML 2009) will take place in Montreal next week (14-18, 2009).

The 3 invited speakers are quite interesting. I look forward to hearing them:
  • Emmanuel Dupoux, from Ecole Normale Superieure on: How do infants bootstrap into spoken language?
  • Yoav Freund, University of California on Drifting games, boosting and online learning?
  • Corinna Cortes, from Google on can learning kernels help performance?

I have to decide which tutorials I will attend this Sunday:

  • T6 Machine Learning in IR: Recent Successes and New Opportunities [tutorial webpage] Paul Bennett, Misha Bilenko, and Kevyn Collins-Thompson
  • T8 Large Social and Information Networks: Opportunities for ML [tutorial webpage] Jure Leskovec
  • T9 Structured Prediction for Natural Language Processing [tutorial webpage] Noah Smith

Here are some interesting papers:

  • Curriculum Learning [Full paper]
  • Deep Learning from Temporal Coherence in Video [Full paper]
  • Good Learners for Evil Teachers [Full paper]
  • Using Fast Weights to Improve Persistent Contrastive Divergence [Full paper]
  • Online Dictionary Learning for Sparse Coding [Full paper]
  • A Novel Lexicalized HMM-based Learning Framework for Web Opinion Mining [Full paper]
  • A Scalable Framework for Discovering Coherent Co-clusters in Noisy Data [Full paper]
  • Bayesian Clustering for Email Campaign Detection [Full paper]
  • Feature Hashing for Large Scale Multitask Learning [Full paper]
  • Grammatical Inference as a Principal Component Analysis Problem [Full paper]
  • Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations[Full paper]

I have to decide between those workshops on Thursday:

I look forward to meeting old colleagues, friends and new researchers. Next week will be awesome.

Friday, May 29, 2009

Is Python really slow? A practical comparison with C++

The common perception is that Python's implementation is slow, but you can often write fast Python if you know how to profile your code effectively.

I tried it. I compared a highly CPU-intensive algorithm: the training of a simple one-hidden-layer neural network. To do so, I used my old C++ neural network library (flayers) and a Python implementation with NumPy. I wrote a simple neural net in Python and optimized all loops with NumPy, as suggested in a profiling presentation I saw at PyCon 2009.

I compared the training time of a simple fully connected neural network with 100 hidden neurons for 10 iterations on the letters dataset (cost function = mean squared error).
Here is the time to do 10 iterations with flayers (C++):
./fexp / -h 100 -l 0.01 --oh -e 10
Optimization: Standard
Creating Connector [16|100] [inputs | hiddens]
Creating Connector [100|26] [hiddens | outputs]

real 0m11.187s
user 0m10.837s
sys 0m0.012s
Here is the time to do 10 iterations on the full letters dataset with pure Python:

time ./ -e 10 --h 100 -f letters.dat -n
Creation of an NN <16:100:26>
real 85m48.646s
user 85m9.163s
sys 0m1.632s
Here is the time to do 10 iterations on the full letters dataset with Python and NumPy:

time ./ -e 10 --h 100 -f letters.dat
Creation of an NN <16:100:26>
real 1m37.066s
user 1m36.026s
sys 0m0.100s

So if you do the math:
  • The NumPy implementation is over 50 times faster than the basic Python implementation (85m49s vs 1m37s).
  • My C++ implementation is close to 10 times faster than my simple Python NumPy implementation (11.2s vs 97s).
The NumPy implementation is definitely worth it: it reduces the code and has a huge performance impact. C++ might be required for extreme performance, but the trade-off in code complexity and development time may not be worth it. Now that I have the choice, I will still use my C++ lib.
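To see where the speed difference comes from, here is the kind of inner loop the NumPy rewrite replaces (a minimal sketch; the layer sizes are illustrative, not the letters-dataset setup):

```python
# The matrix-vector product at the heart of a fully connected layer,
# written once with Python loops and once with a single NumPy call.
import time
import numpy as np

rng = np.random.RandomState(0)
W = rng.rand(100, 16)                      # 16 inputs -> 100 hidden neurons
x = rng.rand(16)

def forward_loops(W, x):
    out = [0.0] * W.shape[0]
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            out[i] += W[i, j] * x[j]      # one bytecode-interpreted op per element
    return np.array(out)

def forward_numpy(W, x):
    return W.dot(x)                        # one optimized call instead of two loops

t0 = time.time(); [forward_loops(W, x) for _ in range(200)]; t_loop = time.time() - t0
t0 = time.time(); [forward_numpy(W, x) for _ in range(200)]; t_np = time.time() - t0
print(f"loops: {t_loop:.3f}s  numpy: {t_np:.3f}s")
```

Both versions compute the same hidden activations; the entire speedup comes from moving the per-element work out of the interpreter.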

Sunday, May 10, 2009

Short Essay: Engineering vs Scientist

The difference between engineers and scientists is obscure to many people, me included at the beginning.
During a chat with a research professor in 2001 about the competence war between the engineering and science departments of the University of Montreal, I got an initial hint: he told me that part of the tension between the computer science department and software engineering was about placement rates, which were much higher for engineers than for computer scientists. I should have asked for an explanation. What is this war about? The computer science department requires engineers to take some normalization courses even if they have great grades, and vice versa.

In my short career, I have observed a major difference that I will try to expose with examples.
After my engineering degree, I did most of my graduate courses with scientists. In my first final exam, after some discussions with classmates, I realized I was the only one who had used an estimation to get past a calculation bottleneck in a problem. Most of the others lost more than half an hour solving it exactly and couldn't finish the exam.
Several times, I had to step into projects to set a schedule in order to force people to cut some corners. Perfection is an endless path. If you have to fill a vase with rocks, sand and water, you will most likely start with the rocks, then the sand, then the water, since the effort is inversely proportional to the size, no? It might be more interesting to optimize cool things and advanced features, but most likely they aren't part of the basic blocks required to move your project to the next step.

What is the point of having half of a perfect solution if you could have delivered a complete imperfect one? I am not saying that all scientists' need for perfection makes them lose the big picture, or that engineers aren't perfectionists, but simply that the engineer's bias toward system vision helps them gauge better when it is time to aim for perfection. Time kills projects, and over-perfection can drag out too much of it... but perfection should remain the goal.
Yes, I am generalizing and simplifying, but you get the big picture. Define your priorities and assign resources accordingly. So, am I an engineer or a scientist, or a little of both? Since my manager told me I was more of a scientist, I should be a little of both...

Monday, April 20, 2009

Montreal Python6 & Peer Programming

The Montreal Python 6 demo packaging and major UI improvements are the result of peer programming with apt-get install ygingras. In this world where time is the biggest constraint, I could not have finalized some of the demo improvements and the packaging without help. Bitbucket, the Mercurial project host, has been very useful for storing the demo parts (i.e., flayers, mlboost and digipy). I don't often work offline, but being able to commit in the plane or the bus, exchange bundles and create branches easily has been very useful.
You can play with the demo, just type:
easy_install digipy (on your favorite linux distribution)
Simply type digipy and Tab to see all the digipy programs used in the live demo (i.e., use --help to get more details). I have been impressed by Python's powerful distutils-related tools for creating packages (sdist, register and upload to push code to the Python Package Index: just GREAT!).
virtualenv, Ian Bicking's tool for creating isolated Python environments, has been really useful to test the packages.
Peer programming is the type of approach that can bring your team's productivity from the sum of individual productivities to their product, or at best to something exponential in the team member count. Peer programming allowed me to quickly get past many extremely time-consuming technical showstoppers that were burning my scarce available time.

Sunday, March 29, 2009

Highlights Pycon Chicago 2009

My short trip to PyCon 2009 just confirmed my choice. This community is alive and has powerful momentum. Guido, the father of Python, mentioned that the 3 most important things are: community, community and community. What he started 19 years ago seems to be taking over.
The simple Python syntax leads everyone to converge to it: PyPy, Jython, CPython and IronPython. Alex Martelli's talk on abstraction as leverage was great: abstraction is inevitable; try to understand at least the 2 layers below and create hooks instead of hacks! (I am already applying it for the Montreal-Python demo.)
The frameworks for web development are getting quite amazing (Django, Whirlwind, Pylons, etc.).
Ned Batchelder's talk "A Whirlwind Excursion through C Extensions" was a great quick start on how to create a C extension; yes, Python is slow and optimization is sometimes inevitable.
The panel on object-relational mappers and the talk on "Drop ACID and think about data" were quite interesting. The keynote presentation by the reddit founders is a good example that Python provides amazing tools to spin out web application companies.
The concept of evening lightning talks and open discussions on topics of interest showed the dynamism of the Python community. I went to an open discussion on parallel computing, and people are moving to Python. Now that the multiprocessing lib has been integrated into 2.6, power computing with Python (Twisted, Thrift, PyMPI, NumPy, SciPy) will expand.
I am glad that Yannick convinced me to attend. I think I will attend the tutorials next time, and some sprints, to ensure maximum knowledge assimilation.
During this small trip, I tried the eee mini laptop but brought it back; you should buy the HE model: the right Shift key is in the right place and the battery lasts longer. PyCon was inspiring!

Monday, March 23, 2009

Research, information management, contrastive divergence, no free lunch theorem, non-parametric...all related

During my master's studies, I took a course on research methodology that introduced me to an interesting concept of information management. With our limited brain capacity, the more you read papers about others' ideas, the less space is available for your own, and the more you reinforce others' assumptions, visions and models.

In practice, great researchers are aware of this consequence of the no-free-lunch theorem and try to keep a good balance between reading papers and exploring their own research. By applying the contrastive divergence concept to your approach, you can gauge your distance from the trend and estimate the impact of a possible discovery.
The machine learning research community, like most communities, tends to recruit top-grade students who are used to following exactly their teachers' line of thought. This long training process is, in my opinion, extremely damaging to the training of research capacity (i.e., a suboptimal cost function). This explains why most master's students are cheap research labor: they can only experiment with others' ideas, making minor contributions.
Top researchers let their students follow their own lines of thought or, if the students have no specific ideas, suggest some. I wouldn't have done a research master's without this freedom. Thanks, Yoshua!
So, if you want the most impact on your community, limit the number of papers you read, make up your own ideas, and play with your concepts to train your intuitions about the unknown guiding rules you are looking for.

You might say: what are your contributions? I haven't heard about them. My contribution is that I have built experimental evidence of fundamental optimization problems in back-propagation and the skeleton of high-level explanations. Usually, this type of result isn't published until a solution to the problem is found, which, unfortunately, I haven't reached; it is coming slowly. It is a long process, and I have learned to be patient.

So, if you want the most impact on your community, limit the number of papers you read to ensure you don't constrain yourself to others' models. Why use a parametric model that limits your solution space?

You can trust the collective research discovery process that ensures the evolution of humankind, because someone will eventually find it; or you can use it to increase the likelihood that you will make an important discovery (i.e., use it as a contrastive divergence cost function). If everyone applied this strategy, I am pretty sure we would evolve faster. To move to this step, we would need to encourage the publication of failed strategies to ensure others don't waste time reproducing the same ideas, but that could be elaborated in another post, one that involves a society's evolution.

Traders know this simple strategy: buy low, sell high, don't follow the trend, take risks.

Sunday, March 15, 2009

Pycon 2009 Chicago

It is about time to see the real face of this Python community. I will attend PyCon 2009.
In 3 days, I expect to learn more than I have in the last 6 months and to meet passionate people. Here are some of the talks I will attend:
Designing Applications with Non-Relational Databases (#16)
How Python is Developed (#116)
Twisted, AMQP and Thrift: Bridging messaging and RPC for building scalable distributed applications (#40)
Introduction to Multiprocessing in Python (#6)
The State of the Python Community: Leading the Python tribe (#118)
Google App Engine: How to survive in Google's Ecosystem (#53)
A Whirlwind Excursion through Writing a C Extension (#68)
Abstraction as Leverage (#110)
A winning combination: Plone as a CMS, your favorite Python web framework as a frontend (#100)

Greedy agile, waterfall and local minima

Everyone in the agile world can't contain their words about waterfall's inadaptation to real-world software projects. I have to admit, I am a fan of XP, Scrum and most agile approaches in general, but I feel that people are losing the big picture provided by the waterfall framework. Agile is a kind of greedy approach that leads you to a local minimum where you get stuck too often. I would like to see someone evaluate the value and velocity-cost trade-off of short-term decisions. Those decisions are often pushed through agile methodology without questioning. Currently, at Pivotal Payments, we are stuck in a huge local minimum created by such an approach, and it will take an enormous effort to get out of it. The velocity created by a design choice taken a long time ago was amazing at first, but it is slowing us down so much now. A major refactoring is required, and we will stop development for some iterations. At least agile adapts and gives you the illusion of the optimal path... but everyone should know that the greedy approach isn't the optimal one, and that looking at your feet won't help much in reaching your destination.
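The local-minimum intuition can be made concrete with a tiny sketch (the cost function and step size are illustrative): greedy descent on a cost with two valleys settles into whichever valley it starts in, even when a better one exists.

```python
# A cost with two valleys: the left one (near x = -1) is deeper than the
# right one (near x = +1). Greedy descent only looks one step ahead.
def f(x):
    return (x**2 - 1)**2 + 0.3 * x

def greedy_descent(x, step=0.01, iters=10_000):
    for _ in range(iters):
        left, right = f(x - step), f(x + step)
        if left < f(x) and left <= right:
            x -= step
        elif right < f(x):
            x += step
        else:
            break            # no neighbor is better: stuck in a local minimum
    return x

x_stuck = greedy_descent(1.5)    # starts in the right (worse) valley, stays there
x_best = greedy_descent(-1.5)    # starts in the left (better) valley
print(round(x_stuck, 2), round(x_best, 2))
```

Every step looked like an improvement, yet the first run never reaches the better valley; it would have to accept a temporary cost increase to cross over, which is exactly the kind of planned "stop and refactor" a purely greedy process never schedules.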

Wednesday, March 4, 2009

Montreal Python 6: 2009-04-14; Machine Learning empowered by Python

It is now official: I will do the next Montreal-Python presentation. I will be back from holiday on the 13th; I hope I won't have flight problems.

Our main presenter will be Francis Piéraut on Machine Learning empowered by Python as announced during the flash introduction in Montreal-Python 5.

Machine Learning is a subfield of AI that considers learning patterns from existing data. Related applications are increasing in many fields where adaptive systems are needed, like fraud detection, face recognition, recommendation systems, disambiguation systems, insurance risk estimation, web traffic filtering, voice recognition, and many others.

The first part of this presentation will cover the basics of machine learning; in the second part, we will dive into a real example and see the complete process of using machine learning to create a real-time digit recognition system using Mlboost, a python library. The practical approach should allow the audience to assimilate the most important concepts of machine learning and the critical need for data preprocessing.

After a software engineering degree, Francis Piéraut did a research master's in machine learning at LISA. During his research work, he developed flayers, a powerful C++ neural network library. At the beginning of his career, he spent several years in Montreal startups applying machine learning and statistical-AI-related solutions. In 2005, he released the first version of MLboost, a Python library that allows him to speed up his machine learning projects by simplifying data preprocessing, feature selection and data visualization.

Essay on adaptation, leaky cost functions and online learning... a society analogy

I stumbled on Paul Graham's article on cities and ambition, and it made me think of writing this post on adaptation, leaky cost functions, online learning and an analogy to the sinking French Quebec society.

To bridge the first three concepts, I will use an analogy with Quebec society.

In order to learn, we need adaptive systems such as neural networks. In online learning, the adaptation capacity should stay constant over time. Local minima can screw you up, but let's ignore that for the time being.
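The constant-adaptation point can be sketched in a few lines (my own illustration, not from the post): an online least-mean-squares learner with a constant learning rate keeps tracking a target even when the target drifts partway through the data stream.

```python
import numpy as np

def online_lms(stream, lr=0.1):
    # Least-mean-squares: one small update per sample, constant learning
    # rate, so the weight keeps tracking a target that drifts over time.
    w = 0.0
    for x, y in stream:
        err = y - w * x
        w += lr * err * x
    return w

rng = np.random.default_rng(0)
xs = rng.uniform(0.5, 1.5, size=2000)
# The true slope drifts from 1.0 to 3.0 halfway through the stream.
ys = [x * (1.0 if i < 1000 else 3.0) for i, x in enumerate(xs)]
w = online_lms(zip(xs, ys))
print(w)   # ends up tracking the new slope
```

Had the learning rate decayed to zero over time, the learner would have frozen near the old slope: constant adaptation capacity is what lets it follow the drift.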

Quebec society analogy intro

In order to understand my analogy with the society I live in, I want to share some of my reflections on the puzzle of understanding Quebec society. I was born in France and migrated to Quebec at 9 years old. Over the last 5 years, I have tried to elucidate my profound incomprehension of the deep ambitions of French Quebec society, if it has any ;). It seems to me that a leaky cost function assumption leads it to stagnation, coming assimilation and slow extinction.

To understand my point of view, we need to elaborate on key concepts: adaptation, equality versus inequality, and access to education.

Adaptation pros and cons, local minima and ambition

Adaptation is one of the greatest abilities of humankind but also one of the worst. Adaptation to mediocrity can be a survival strategy to get through hard times, but getting used to it reflects truly low ambitions or an incapacity to do online learning. The Quebec nation seems to have this disease.
  • Quebec people accept staying in a destructive mode since 1982: a constitutional status quo that leads to political instability, economic stagnation and reduced political power, having excluded themselves from Canadian power through the Bloc Québécois for too long.
  • Quebec people accept the status of a sub-nation (a nation inside Canada).
  • Quebec people accept mediocre governments, mediocre public transport systems, a way too expensive and inefficient health system, the highest taxes in North America, etc.
Quebec people seem to have no real will for improvement; getting by is enough (you should see "Le confort et l'indifférence"). From a machine learning point of view, is that a local minimum? What is the problem with Quebec society's cost function? Is it only the lack of ambition? Knowing that close to half of the population voted for separation from Canada in the 1995 referendum, it might shadow something more deeply rooted in French culture.

Cost function assumption: equality versus inequality

From an anthropological point of view, the French nuclear family leads to a conception of a world of equality (see Emmanuel Todd). To simplify: everyone should have the same chances, the same access to education, the same health services and so on.
The Anglo-Saxon culture leads to the conception of an unequal world. The inequality conception leads people to work harder, knowing there is no lower boundary and they can sink deeper if they are too lazy.
Knowing we are born unequal, the Anglo-Saxon conception seems better adapted to the reality of humankind. On the other hand, equality leads to a rise in the education level of the society independently of the economy, which has lots of pros and cons.
Why is equality a weak assumption? Equality can stand in rich societies because they can afford it. Unfortunately, Quebec society is getting poorer, and its population is disadvantaged by its illusion of equality, which leaks from everywhere (i.e.: the health system, education, public daycare, etc.)

Missing link: Education and production of wealth

Quebec has the most affordable access to education in North America, and few take advantage of it. Everyone knows that the more educated a society is, the more productive and healthy it will be, and the more attainable the utopia of an equal world becomes. By using education to become more productive, a society creates wealth and can afford utopias like the equality concept. French society seems to miss this key point.

French Quebec society is dying and the Anglo-Saxon supremacy should take over

The leaky equality concept and the low ambition of Quebec society seem to lead this society to an online learning incapacity: an incapacity to adapt further. This incapacity leads to its extinction through growing assimilation into the Anglo-Saxon supremacy of its global cost function model. An inequality-based cost function seems better adapted for a society that wants to stay alive.

Wrap up (it is time to conclude)
A leaky cost function assumption can lead online learning to an adaptation incapacity, like being stuck in a local minimum: a slow death, as in the current folklorization process of the French Quebec nation. It is simply evolution, a Darwinian consequence; those who can't adapt simply die. Facing reality makes life easier.
The most important thing is that the cost function should reflect your goals. If you have a supervisor, try to get a good estimation of his cost function, because it will simplify your ascension everywhere.

Quebec French culture creates a huge retention for me to stay in Montreal, but I wish Montreal had a better drive for machine learning and startups, as you find in California. Montreal is simply under-exploited. Don't take my words for granted; this is an essay. Make your own judgement with your own eyes and exploration.

Tuesday, February 24, 2009

Leaky assumption and Gradient Descent

Theoretical models are based on strong assumptions, like software layers (i.e.: leaky abstractions). Layers are created to simplify complex systems and allow work specialization, as defined in Marx's approach in Capital...and to accommodate limited human brain capacity.
Experienced practitioners know that their value resides in a deep comprehension of the weak assumptions, because otherwise the market will just hire new graduates.

The standard backpropagation gradient descent algorithm assumes that inputs are independent, so we can optimize them independently of each other. This assumption, or in my view a leaky abstraction, allows you to optimize all parameters at the same time, which simplifies the life of software engineers and researchers because parameters are theoretically uncorrelated. In mathematical terms, we are assuming that the Hessian matrix has values only on its diagonal.
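A small numerical sketch of my own makes the cost of this assumption visible: on a quadratic cost whose Hessian has large off-diagonal terms (correlated parameters), plain gradient descent, which uses no curvature information, crawls along a narrow valley, while a single Newton step using the full Hessian lands on the minimum at once.

```python
import numpy as np

# Quadratic cost 0.5 * w^T H w whose Hessian is NOT diagonal: the two
# parameters are strongly correlated, so treating them as independent
# is a leaky assumption.
H = np.array([[2.0, 1.8],
              [1.8, 2.0]])

def grad(w):
    return H @ w

w0 = np.array([1.0, -1.0])

# Plain gradient descent uses no curvature information (it behaves as if
# the Hessian were a multiple of the identity).
w_gd = w0.copy()
for _ in range(100):
    w_gd = w_gd - 0.1 * grad(w_gd)

# A Newton step uses the full Hessian and lands on the minimum at once.
w_newton = w0 - np.linalg.solve(H, grad(w0))

print(np.linalg.norm(w_gd), np.linalg.norm(w_newton))
```

The starting point lies along the low-curvature eigendirection, so even after 100 gradient steps the first method is still far from the optimum at the origin; the full-Hessian step is exact.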

My master's thesis, done under Yoshua Bengio's supervision, mostly focused on understanding the training inefficiency of huge neural networks. At that time, our goal was to train neural network language models. According to my understanding and the experimental evidence that I documented, the problem is basically an optimization problem. The uncorrelated-parameters simplification doesn't stand when the number of parameters explodes.
Unfortunately, I failed to find a solution to this problem, but the new trend in reaction to Hinton's breakthrough in 2006 will revive, and is already reviving, research on this topic.

In my literature review, I found that several researchers identified some of the reasons that can explain this inefficiency. In my view, they are direct and indirect consequences of the optimization problem introduced by the leaky abstraction of uncorrelated inputs. Those reasons are the moving target problem and the attenuation and dilution of the error signal as it propagates backward through the layers of the network. In my master's thesis, we present other reasons that can explain this behavior: the opposite gradients problem, the non-existence of a specialization mechanism and the symmetry problem.

I will treat those concepts in a future post. The inspiration for this post was made possible by a brainless Hollywood movie that freed up valuable brain cycles. There is always a good side to the story.

Newton's laws don't stand in Einstein's theory, just like uncorrelated inputs don't stand in huge neural networks. Always remember your leaky assumptions/abstractions.

Monday, February 9, 2009

cygwin or mingw to compile C++ swig projects on windows?

The answer is definitely MinGW. 4 simple steps:
  1. download swig
  2. download mingw
  3. add python, swig and mingw to the PATH environment variable
  4. run: python setup.py build_ext --inplace --compiler=mingw32
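For reference, the build step above assumes a distutils setup.py for the SWIG extension; a minimal sketch might look like the following (module and file names here are hypothetical, not the actual flayers ones):

```python
# Minimal setup.py sketch for a SWIG C++ extension (hypothetical names).
from distutils.core import setup, Extension

ext = Extension(
    '_example',                      # SWIG convention: leading underscore
    sources=['example.i', 'example.cpp'],
    swig_opts=['-c++'],              # tell SWIG to generate C++ wrappers
)

setup(name='example', ext_modules=[ext])

# Build in place with the MinGW compiler:
#   python setup.py build_ext --inplace --compiler=mingw32
```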
I was able to compile it with Cygwin too, but I had to download a shitload of stuff, and I never figured out how to use Python Windows packages and Cygwin Python packages all together, which was a show stopper for me.

Why was I looking to compile my C++ project on windows?

I want to use my C++ machine learning lib on real-time video, and the only package that worked for grabbing images in Python was VideoCapture, but it is only supported on Windows. So now I have to compile my machine learning lib on Windows...

If you want to compile a SWIG C++ project, you need a compiler. If you don't have Visual C++, you are stuck with this error:
error: Python was built with Visual Studio 2003;
extensions must be built with a compiler than can generate compatible binaries.
Visual Studio 2003 was not found on this system. If you have Cygwin installed, you can try compiling with MingW32, by passing "-c mingw32" to setup.py.

At this point, you can try to download a free Visual C++ compiler or try mingw32 with Cygwin. I tried a free Visual Studio 2003 package but I couldn't make it work. Then I tried Cygwin with gcc, and it compiled, but I couldn't use the compiled package outside Cygwin. So I tried mingw32. The -c option doesn't work, and "python setup.py build --compiler=mingw32" doesn't let you use your package inside Python (i.e.: can't import _flayers in my context). Finally, I tried "python setup.py build_ext --inplace --compiler=mingw32" and it worked.

After a chat with Simon and Tristan, who are doing video work on Linux at the SAT, I discovered that they were grabbing video on Linux. They referred me to OpenCV, which works perfectly. In my experimentation with OpenCV, I realized that Pygame is a million times faster than matplotlib at displaying video.
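The only real glue between the two is the frame layout: OpenCV hands you (height, width, 3) arrays in BGR order, while Pygame's surfarray expects (width, height, 3) in RGB. Here is a numpy-only sketch of that conversion; the capture and display calls are left as comments because they assume a camera and a window (API names from cv2-era OpenCV and Pygame, so treat them as assumptions).

```python
import numpy as np

def frame_to_pygame(frame_bgr):
    # OpenCV frames are (height, width, 3) in BGR order; pygame's
    # surfarray expects (width, height, 3) in RGB order.
    rgb = frame_bgr[:, :, ::-1]          # BGR -> RGB
    return np.transpose(rgb, (1, 0, 2))  # (h, w, 3) -> (w, h, 3)

# In the real loop (assumed API, not runnable without a camera):
#   cap = cv2.VideoCapture(0)
#   ok, frame = cap.read()
#   surface = pygame.surfarray.make_surface(frame_to_pygame(frame))
#   screen.blit(surface, (0, 0)); pygame.display.flip()

frame = np.zeros((2, 3, 3), dtype=np.uint8)
frame[0, 0] = [255, 0, 0]   # a pure blue pixel in BGR order
out = frame_to_pygame(frame)
print(out.shape, out[0, 0])
```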

SWIG, MinGW and OpenCV make Python so convenient!

Friday, February 6, 2009

Startups and faster Learning

Everyone who is familiar with gradient descent and Machine Learning knows that learning is proportional to the error. The bigger the errors, the bigger the possibility of learning something. In my view, startups, by nature, are the ideal environment to put yourself in high-gradient learning situations. They force you to try many more things, take more risks and innovate to survive, unless you want to end up with no choice but to look for another job and/or penny stock options. Another interesting thing is that the true nature of people appears and masks fall rapidly. Friendships are tested to their limits, and it allows you to filter out short-term life partners.
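The "learning is proportional to the error" claim is literally the delta rule for a linear unit: the weight update is learning_rate * error * input, so a bigger error produces a bigger step. A minimal sketch of my own:

```python
def delta_update(w, x, target, lr=0.1):
    # Delta rule for a single linear unit: the step scales with the error.
    error = target - w * x
    return w + lr * error * x, error

w_small, e_small = delta_update(w=1.0, x=1.0, target=1.1)   # small error
w_big, e_big = delta_update(w=1.0, x=1.0, target=5.0)       # big error
print(abs(w_small - 1.0), abs(w_big - 1.0))
```

Same weight, same input: the example with the larger error moves the weight forty times further in one step.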

Don't forget that a high gradient puts high constraints on your body, and frustrations have to be released somehow. In my case, cycling has been my stress-balancing vehicle; you have to find yours. Resigning is too often chosen as the last resort. Keep in mind that everyone is replaceable, even in startups. If you really want to get rich, you should use the Californian technique: accumulate stock in multiple startups. Don't forget that most of them (~90-95%) fail badly, so increase your chances by switching before reaching burnout, and build your contact network along the way. Right after graduation is definitely the best time to try this fast-growing experience, while you can take it. On my side, I tried it for 5-6 years; you have to know your limits. Successful startups have a workaholic CEO, so it will increase the pressure and the gradient. If yours isn't one, find another one; it should be the 19th mistake of startups (cf. "The 18 Mistakes That Kill Startups").

If you get the chance to access a higher decision-making position, you learn that there are irrational error cost functions and that you might be optimizing the wrong error criteria. For those who aren't following me, I recommend the French movie 99 Francs; pay close attention to the executive meeting.

I applied the same principle with my son this winter: he doesn't want to put on his gloves, but he learns/changes his mind faster at -10 degrees Celsius.

Sunday, February 1, 2009

MLboost 0.3 has been released

I am pleased to announce the release of MLboost 0.3. MLboost 0.3 will be used for the next Montreal-Python presentation ("Machine Learning empowered by Python"), which will be announced in the coming weeks. The most important new feature is the integration of pyflayers, a simplified Python SWIG interface to flayers, my C++ neural network library. I am preparing a live real-time demo of a machine learning application; it should be interesting for the audience, and it is good motivation to improve my package. I have played with beamer, a LaTeX package for creating presentation slides, and I have been particularly pleased; I do recommend it. On another note, it is time to prepare the next winter camping trip and to continue eating great food.

Monday, January 5, 2009

The No Asshole Rule Fraka6 review

During the Christmas time off, I finished "The No Asshole Rule" by Robert I. Sutton. I stumbled on this book while I was looking for "The Art of War" as a gift. At first, I thought it was a mistake and that it would end up being a light or low-content read, but it was extremely interesting. The book covers:
  • How to identify a real asshole
  • The damage of keeping Assholes
  • How to implement the No Asshole Rule
  • How to stop your inner jerk from getting out (yes, it gets out sometimes)
  • How to survive when assholes reign
  • Yes I am a powerful Asshole (I was checking if you were really reading;)
Having spent 5 years in startup environments, I should have read this book earlier, because high-stress environments lead people to act like assholes, and applying some of these ideas can help you survive this jungle. Out of 4 startups, I dealt with 3 places where asshole behavior was destroying my productivity. Only one place had a certified asshole (not a temporary one; no, it isn't Eric). What I like about highly educated environments is that people are more likely to stand up and make changes happen. I saw one of my supervisors get fired after I stood up, and another get fired on the side after I left, which is overall very encouraging. What I learned from this book is that I once waited way too long before changing jobs, because I didn't realize soon enough that I was simply wasting my time. There are times when fighting is useless; it simply drains you down and sometimes leads you to act as an asshole yourself just to keep breathing (don't forget: small wins first). Good faith is sometimes useless; as Robert puts it: hope for the best, expect the worst, and change the way you see things (reframing). Reframing refers to developing indifference and emotional detachment. As highlighted, people tend to recruit highly passionate people, but it isn't such a good idea, because it leads them to irrational decisions that can have critical impacts.

The book doesn't cover one point that highly technical people forget. For most companies, the market drives the show, not the technology, so you have less power than you think. This market focus leads to more politics, and assholes are good at this game. Politics is often used to shadow increasing incompetence by people who don't focus on what they should be doing. You have to better understand political games, but in my view, the more politics, the more incompetence per square foot you should expect.

This reminds me of the Joel on Software book: most of each year's top companies don't last; only a few survive in the top n after many years. I am pretty sure those companies apply the No Asshole Rule, because if they don't, in the long term they get back what they deserve: they become infected by reproducing assholes who blame each other by the time all the top people leave.

People forget that power passes through empowering others. This power isn't a sum; it can be exponential with teams that have great synergy. Open source is a great example.

This book is highly recommended. I will suggest it to my manager, one of the few managers who understands what a manager should be doing.

But what is an asshole? The standard definition is something like: an insulting term of address for people who are stupid, irritating or ridiculous. According to the book, assholes follow 2 rules:
  1. After spending time with them, you feel de-energized, humiliated, etc.
  2. Assholes are more polite with more powerful people and are worse with lower-level people.
Don't forget: be slow to brand someone a certified asshole. To earn the certified brand, you simply have to be consistent.