Wednesday, November 18, 2009

Leaky assumption and Gradient Descent- part 2/3

Last February, I post the first part of this post.
Basically, I was pretending that “uncorrelated inputs” was a leaky abstraction and was the root cause of neural networks back-propagation poor results for training huge neural networks or deep networks. According to me, this simplification was fine while the number of parameters remains small. My hypothesis was that optimization problems are growing with the number of parameters which imply an implicit limit of the usage of this abstraction and explain those poor results.
During my research between 2001 and 2003, I was focusing on finding a way to train a neural network faster as presented on the left side of the figure. Unfortunately, I didn’t found that revolutionary algorithm but simply documented various effects of optimizations problems and ways to reduce or eliminate them with experimental results.
In 2008, I went to see Nicolas Leroux Phd defense and close to the end, he brought back that optimization problems could be the problem without presenting solutions which revive my research interest.
It reminded me a last crazy experimentation I have done in 2003, I found an algorithm that had the characteristic of the right side figure but it did not kept much of my attention at that time. Reducing optimization doesn’t imply necessarily faster classification error in time but should do it per iteration (i.e.: epoch).
During a year, I tried to reproduce that experiment in my free time. PA came to the rescue to discuss the underlying assumptions, to brainstorm and help reproducing the experimentation within flayers. Flayers wasn't suiting our needs anymore, the process to get back into the detailed of the implementation was reducing dramatically our experimentation throughput. We finally decided to drop flayers and rewrite it in python (optbprop) to ensure better collaboration and way faster experimentation. We had to make several optimization to make python speed acceptable but it was still slower then flayers (~10; see post). Even if it was slower, ultra fast experimentation become possible and research speed increased exponentially to try to recreate the experimentation. The complexity was residing into the order of the parameters optimization, what was the right recipe??
In June 2009, some times after ICML, Jeremy joined as the third collaborator and I finally reproduced it, results were even better. I have used an output max sensitivity ordering followed by a max hidden sensitivity backprop strategy.
Unfortunately, it was too good to be true and Jeremy found a critical problem in the solution which was leading to extremely poor generalization. At that point, motivation was too low so we’ve decided to stop our research. I was disappointed but a true relief, I could finally move to something else.
Without this collaboration team, I couldn’t have reached that point alone. I am still not 100% convinced that this research path is dead but it is back to a value that is way lower the motivation minimum threshold.
What could explain such bad results? We were expecting lightning learning speed and at worst little improvements compare to standard stochastic backprop. Our explanation hypothesis is that uncorrelated stochastic learning breaks implicit normalization process required for generalization. Basically, for each example, we update parameters to predict its class correctly but its too violent and unlearn previous examples way faster. Unfortunately, we loose the higher level goal which is generalization. Normalization could be re-integrate with bach learning but we haven't experiment it.
If this is true: “When you're ready to quit, you're closer than you think”, I might write part 3 of this post but It will take some time. There is an interplay between machine learning and optimization but people tend to forget it.

1 comment: