Wednesday, November 18, 2009

Leaky assumption and Gradient Descent - part 2/3


Last February, I posted the first part of this post.
Basically, I claimed that "uncorrelated inputs" was a leaky abstraction and the root cause of back-propagation's poor results when training huge or deep neural networks. In my view, this simplification was fine as long as the number of parameters remained small. My hypothesis was that optimization problems grow with the number of parameters, which implies an implicit limit on the usage of this abstraction and explains those poor results.
During my research between 2001 and 2003, I focused on finding a way to train a neural network faster, as presented on the left side of the figure. Unfortunately, I didn't find that revolutionary algorithm, but simply documented various effects of optimization problems and ways to reduce or eliminate them, with experimental results.
In 2008, I attended Nicolas Leroux's PhD defense, and near the end he brought up that optimization problems could be the problem, without presenting solutions, which revived my research interest.
It reminded me of a last crazy experiment I had done in 2003: I had found an algorithm with the characteristics of the right side of the figure, but it didn't hold much of my attention at the time. Reducing optimization problems doesn't necessarily imply a faster drop in classification error over time, but it should per iteration (i.e., per epoch).
For a year, I tried to reproduce that experiment in my free time. PA came to the rescue to discuss the underlying assumptions, brainstorm, and help reproduce the experiment within flayers. Flayers wasn't suiting our needs anymore; the process of getting back into the details of the implementation was dramatically reducing our experimentation throughput. We finally decided to drop flayers and rewrite it in Python (optbprop) to ensure better collaboration and much faster experimentation. We had to make several optimizations to bring Python's speed to an acceptable level, but it was still slower than flayers (~10x; see post). Even though it was slower, ultra-fast experimentation became possible and research speed increased exponentially as we tried to recreate the experiment. The complexity resided in the order of the parameter optimization: what was the right recipe?
In June 2009, some time after ICML, Jeremy joined as the third collaborator and I finally reproduced it; the results were even better. I used an output max-sensitivity ordering followed by a max hidden-sensitivity backprop strategy.
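To give a rough idea of what such an ordering looks like, here is a minimal Python sketch on a one-hidden-layer network. It is my own illustration under stated assumptions (tanh hidden units, linear outputs, squared error, hypothetical helper names), not the actual optbprop recipe: output units are corrected in order of decreasing error magnitude, then hidden units in order of decreasing back-propagated sensitivity.

```python
import numpy as np

# Hypothetical sketch: one hidden layer, tanh hidden units, linear outputs,
# squared error. Instead of updating every weight at once, updates are
# ordered by unit "sensitivity" (magnitude of the back-propagated error).
rng = np.random.RandomState(0)
n_in, n_hid, n_out = 4, 8, 3
W1 = rng.randn(n_hid, n_in) * 0.1   # input  -> hidden weights
W2 = rng.randn(n_out, n_hid) * 0.1  # hidden -> output weights
lr = 0.05

def forward(x):
    h = np.tanh(W1 @ x)              # hidden activations
    y = W2 @ h                       # linear outputs
    return h, y

def ordered_update(x, target):
    """One stochastic update: most sensitive output unit first,
    then hidden units ranked by their back-propagated error."""
    h, y = forward(x)
    out_err = y - target                         # output sensitivities
    # 1) output max-sensitivity ordering: largest |error| first
    for o in np.argsort(-np.abs(out_err)):
        W2[o] -= lr * out_err[o] * h             # update that output unit
    # 2) max hidden-sensitivity backprop, through the updated output weights
    hid_err = (W2.T @ out_err) * (1.0 - h ** 2)  # tanh derivative
    for j in np.argsort(-np.abs(hid_err)):
        W1[j] -= lr * hid_err[j] * x             # update that hidden unit

# toy usage: one example, one ordered update
x = rng.randn(n_in)
target = np.array([1.0, 0.0, 0.0])
ordered_update(x, target)
```

The point of the sketch is only the ordering of the per-unit updates; the exact recipe we converged on lived in optbprop.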
Unfortunately, it was too good to be true: Jeremy found a critical problem in the solution that led to extremely poor generalization. At that point, motivation was too low, so we decided to stop our research. I was disappointed, but it was a true relief; I could finally move on to something else.
Without this collaboration, I couldn't have reached that point alone. I am still not 100% convinced that this research path is dead, but my interest is back to a value well below the minimum motivation threshold.
What could explain such bad results? We were expecting lightning learning speed, or at worst small improvements compared to standard stochastic backprop. Our hypothesis is that uncorrelated stochastic learning breaks an implicit normalization process required for generalization. Basically, for each example we update the parameters to predict its class correctly, but the update is too violent and unlearns previous examples much faster. Unfortunately, we lose the higher-level goal, which is generalization. Normalization could be reintegrated with batch learning, but we haven't experimented with it.
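To make the "too violent" intuition concrete, here is a small illustration of my own (an assumed linear model with squared error, nothing from optbprop): a per-example step is driven by a single example's error, while a batch step averages the pull of many examples, which is the kind of implicit normalization we suspect was being broken.

```python
import numpy as np

# Illustration only: compare a per-example step with a batch-averaged step.
rng = np.random.RandomState(1)
n_feat, n_examples = 5, 32
X = rng.randn(n_examples, n_feat)
t = rng.randn(n_examples)
w = np.zeros(n_feat)
lr = 0.1

def stochastic_step(w, x, target, lr):
    """One per-example step: only this example's error drives the update."""
    err = w @ x - target
    return w - lr * err * x

def batch_step(w, X, t, lr):
    """One batch step: errors are averaged over all examples, so no single
    example can yank the weights too far (the implicit normalization)."""
    err = X @ w - t              # per-example errors
    grad = X.T @ err / len(t)    # averaged gradient
    return w - lr * grad

w_sgd = stochastic_step(w, X[0], t[0], lr)
w_batch = batch_step(w, X, t, lr)
print("per-example step size:   ", np.linalg.norm(w_sgd - w))
print("batch-averaged step size:", np.linalg.norm(w_batch - w))
```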
If it is true that "when you're ready to quit, you're closer than you think," I might write part 3 of this post, but it will take some time. There is an interplay between machine learning and optimization, but people tend to forget it.

Monday, November 16, 2009

Toward better Web Monitoring Solutions

Confoo, a web technology conference, will take place in Montreal in March 2010. If my proposal is accepted, I will present "Toward better Web Monitoring Solutions". Here is a summary:

Web applications are slowly becoming the new standard. With no more installations or upgrades, they are accessible from any internet-connected device. While a Holy Grail to users, web applications can be a nightmare for engineers, as ensuring quality of service becomes harder.

In fact, web applications create a high level of testing complexity, bringing new challenges to quality and availability of service.

As more and more businesses rely on web applications, techniques such as real-time web monitoring, incident detection and root cause analysis have become critical.

We will present these new problems in detail, followed by a short history of techniques used to measure and estimate the quality of web-based applications. We will review the most popular monitoring technologies, pointing out their advantages and shortcomings.

This presentation will be done in collaboration with Sebastien Pierre.

Knowledge Workers - Talent is not patient, and it is not faithful

Knowledge workers are as impatient as great programmers. They are hard to replace and to train, and still some corporations are not proactive about it. Maybe it is not costly enough yet?
Of course, HR, managers, and VPs aren't the ones struggling to compensate for critical components when some of them leave; the underlying teams are. Those decision makers have to remember that there is no free lunch and that mismanagement of knowledge workers has consequences.
Better management of knowledge workers leads to much more productive teams and low turnover, while the opposite could cause your decline. What's so hard about creating a win/win approach instead of the lose/lose approach that so many companies seem to fall into? Maybe a generational clash, or simply missing competencies.