Tuesday, February 24, 2009

Leaky Assumptions and Gradient Descent

Theoretical models are built on strong assumptions, much like software layers (i.e., leaky abstractions). Layers are created to simplify complex systems, to allow specialization of work as described in Marx's analysis of capital... and to accommodate our limited human brain capacity.
Experienced practitioners know that their value resides in a deep comprehension of those weak assumptions; otherwise the market would simply hire fresh graduates instead.

The standard backpropagation gradient descent algorithm assumes that inputs are independent, so that we can optimize them independently of each other. This assumption, or as I see it a leaky abstraction, lets you optimize all parameters at the same time, which simplifies the life of software engineers and researchers because the parameters are treated as theoretically uncorrelated. In mathematical terms, we are assuming that the Hessian matrix has non-zero values only on its diagonal.
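To make the diagonal-Hessian picture concrete, here is a minimal sketch on a hypothetical two-parameter quadratic loss (not from the thesis). Plain gradient descent moves each parameter using only its own partial derivative, as if the off-diagonal curvature did not exist, while a Newton step that uses the full Hessian accounts for the correlation between the parameters.

    import numpy as np

    # Toy quadratic loss L(w) = 0.5 * w^T A w whose Hessian A has strong
    # off-diagonal terms, i.e. the two parameters are highly correlated.
    A = np.array([[2.0, 1.8],
                  [1.8, 2.0]])

    def loss(w):
        return 0.5 * w @ A @ w

    def grad(w):
        return A @ w

    # Plain gradient descent: each parameter is moved using only its own
    # partial derivative, exactly the "diagonal Hessian" assumption.
    w = np.array([1.0, -1.0])
    for _ in range(50):
        w = w - 0.1 * grad(w)
    print("gradient descent after 50 steps:", loss(w))

    # One Newton step uses the full Hessian (off-diagonal curvature included)
    # and lands on the minimum of this quadratic immediately.
    w = np.array([1.0, -1.0])
    w = w - np.linalg.solve(A, grad(w))
    print("one full Newton step:           ", loss(w))

On this toy example the diagonal assumption still converges, but it crawls along the ill-conditioned direction that the correlated parameters create; the Newton step, which sees that correlation, finishes in one move.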

My master's thesis, done under Yoshua Bengio's supervision, focused mostly on understanding why training huge neural networks is so inefficient. At that time, our goal was to train neural network language models. According to my understanding and the experimental evidence I documented, the problem is basically an optimization problem. The uncorrelated-parameters simplification does not hold when the number of parameters explodes.
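A back-of-the-envelope count (with a made-up parameter count, purely for scale) shows why the simplification is so tempting: the pairwise interactions a full Hessian would have to track grow quadratically with the number of parameters, while the diagonal grows only linearly.

    # Hypothetical network size, for illustration only.
    n_params = 10_000_000

    diagonal_terms = n_params                             # what the assumption keeps
    off_diagonal_pairs = n_params * (n_params - 1) // 2   # what it throws away

    print(f"diagonal terms kept:        {diagonal_terms:.2e}")
    print(f"off-diagonal pairs ignored: {off_diagonal_pairs:.2e}")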
Unfortunately, I failed to find a solution to this problem, but the new trend sparked by Hinton's breakthrough in 2006 will revive, and is already reviving, research on this topic.

In my literature review, I found that several researchers had identified some of the reasons that can explain this inefficiency. In my view, they are direct and indirect consequences of the optimization problem introduced by the leaky abstraction of uncorrelated inputs. Those reasons are the moving target problem and the attenuation and dilution of the error signal as it propagates backward through the layers of the network. In my master's thesis, I present other reasons that can explain this behavior: the opposite gradients problem, the absence of a specialization mechanism, and the symmetry problem.
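The attenuation of the error signal is easy to reproduce numerically. Here is a minimal sketch (random weights, sigmoid units, sizes picked arbitrarily, not the setup from the thesis) that backpropagates a unit error through a deep stack of layers and prints how quickly its norm shrinks.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    n_layers, width = 20, 50
    weights = [rng.normal(0.0, 1.0 / np.sqrt(width), (width, width))
               for _ in range(n_layers)]

    # Forward pass: keep every layer's activation for the backward pass.
    activations = [rng.normal(size=width)]
    for W in weights:
        activations.append(sigmoid(W @ activations[-1]))

    # Backward pass: start with a unit error on the output and watch the
    # signal get attenuated as it crosses each sigmoid layer.
    grad_a = np.ones(width)
    for l in range(n_layers - 1, -1, -1):
        a = activations[l + 1]               # output of layer l
        grad_z = grad_a * a * (1.0 - a)      # through the sigmoid derivative
        grad_a = weights[l].T @ grad_z       # through the weight matrix
        print(f"layer {l:2d}: error signal norm = {np.linalg.norm(grad_a):.3e}")

With sigmoid units the derivative is at most 0.25, so each layer can only shrink the signal, which is one face of the attenuation and dilution problem mentioned above.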

I will treat these concepts in a future post. The inspiration for this post was made possible by a brainless Hollywood movie that allowed me to free up some valuable brain cycles. There is always a good side to the story.

Newton's laws do not hold in Einstein's theory, just as the uncorrelated-inputs assumption does not hold in huge neural networks. Always remember your leaky assumptions/abstractions.
