- Language acquisition: children lose their capacity to distinguish some phonemes in order to reduce the space of choices when learning the language of their environment. Acquiring phoneme categories through a purely bottom-up approach (signal processing + unsupervised clustering) isn't sufficient; lexical minimal pairs (e.g., n-grams) seem to be required to ensure the learning.
- Trying to learn the best kernel that restricts the optimisation to a convex problem seems to be a dead end. It might be time to change paradigms or move to the non-convex dark side.
- Boosting is too sensitive to noise, but a robust framework has been presented by Yoav Freund.
- Deep architectures seem to be the next big thing. Regularisation, autoencoders and RBMs can be used to pre-train networks from unlabeled data. Temporal coherence (the similarity of consecutive frames in a video) can be used as an unsupervised regularisation technique in the embedding space (a sketch follows this list). Unsupervised training acts as a regularisation technique that enforces better clustering: the more unlabeled examples are used, the better the generalisation.
- Training from IID samples isn't optimal; curriculum learning (i.e., gradually increasing example complexity) seems to smooth the cost function, leading to faster training and better generalisation (see the curriculum sketch below).
- GPUs are the way to go to make ML algorithms scalable.
- Feature hashing is an efficient strategy for dimensionality reduction and can be used to train classifiers (see the hashing sketch below).
- Sparse transformations simplify the optimisation process (the same idea is used by the kernel trick in SVMs); PCA does the opposite (see the last sketch below).
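
To make the temporal-coherence idea concrete, here is a minimal sketch in Python/NumPy. It uses a linear map as a stand-in for the embedding layer of a deep network and minimises the squared distance between the embeddings of consecutive frames, renormalising the projection so it can't collapse to zero; the toy "video", the dimensions and the learning rate are my own assumptions, not something from the talks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "video": each frame is a small perturbation of a slowly drifting signal,
# so consecutive frames should end up close together in the embedding space.
T, d_in, d_emb = 200, 20, 3
frames = np.cumsum(rng.normal(scale=0.05, size=(T, d_in)), axis=0)
frames += rng.normal(scale=0.01, size=(T, d_in))

# Linear embedding as a stand-in for one layer of a deep network.
W = rng.normal(scale=0.1, size=(d_in, d_emb))
lr = 0.01

for step in range(200):
    Z = frames @ W                      # embed every frame
    diff = Z[1:] - Z[:-1]               # consecutive-frame differences
    coherence = (diff ** 2).sum()       # temporal-coherence penalty to minimise
    # Gradient of the penalty w.r.t. W for this linear map.
    grad = 2 * (frames[1:] - frames[:-1]).T @ diff
    W -= lr * grad
    # Renormalise the columns so the embedding cannot collapse to zero.
    W /= np.linalg.norm(W, axis=0, keepdims=True)

print("temporal-coherence penalty after training:", coherence)
```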
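
Here is a minimal sketch of the curriculum idea, again with assumptions of my own: a toy logistic-regression task where "difficulty" is taken to be the distance to the true decision boundary, and where the training pool is widened from the easiest 20% of examples up to the full set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification task: "difficulty" is the distance to the true
# decision boundary, so easy examples are far from it and hard ones are close.
n, d = 2000, 10
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
margin = X @ w_true
y = (margin > 0).astype(float)
difficulty = -np.abs(margin)            # larger value = closer to the boundary = harder

order = np.argsort(difficulty)          # curriculum: easiest examples first
# order = rng.permutation(n)            # baseline: plain IID shuffling

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

w, lr = np.zeros(d), 0.1
for epoch in range(5):
    # Gradually widen the pool: 20%, 40%, ... 100% of examples, easiest first.
    pool = order[: int(n * (epoch + 1) / 5)]
    for i in rng.permutation(pool):
        p = sigmoid(X[i] @ w)
        w -= lr * (p - y[i]) * X[i]     # SGD step on the logistic loss

accuracy = ((sigmoid(X @ w) > 0.5) == y).mean()
print("training accuracy:", accuracy)
```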
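
And a minimal sketch of the hashing trick: string features are hashed straight into a fixed number of buckets, so the classifier's input dimensionality stays bounded no matter how large the vocabulary grows. The bucket count and the signed-hashing detail are standard choices of mine, not something specific from the conference.

```python
import hashlib
from collections import defaultdict

def hashed_features(tokens, n_buckets=2**18):
    """Map arbitrary string features into a fixed-size sparse vector (the hashing trick)."""
    vec = defaultdict(float)
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        idx = h % n_buckets                           # which bucket the feature falls into
        sign = 1.0 if (h >> 127) % 2 == 0 else -1.0   # top hash bit gives the sign, so collisions tend to cancel out
        vec[idx] += sign
    return vec

doc = "the quick brown fox jumps over the lazy dog the fox".split()
x = hashed_features(doc, n_buckets=1024)
print(len(x), "non-zero dimensions out of 1024")
```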
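
Finally, a small illustration of the sparse-transformation point, built on a toy problem of my own choosing: a sparse one-hot expansion of a single feature (in the same spirit as the kernel trick, mapping into a higher-dimensional space) makes a non-linearly-separable problem linearly separable, whereas PCA would go the other way, compressing toward fewer dense dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1-D toy problem that is NOT linearly separable in the original space:
# class 1 lives in the middle of the range, class 0 at both ends.
x = rng.uniform(-3, 3, size=500)
y = (np.abs(x) < 1).astype(float)

# Sparse expansion: one-hot binning blows the single feature up into many
# mostly-zero features, and the problem becomes linearly separable there.
edges = np.linspace(-3, 3, 13)
bins = np.digitize(x, edges[1:-1])
X_sparse = np.eye(len(edges) - 1)[bins]          # (500, 12), exactly one non-zero per row

def linear_fit_accuracy(X, y):
    """Fit a linear least-squares classifier and return its training accuracy."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, 2 * y - 1, rcond=None)
    return ((Xb @ w > 0) == y).mean()

print("raw 1-D feature:  ", linear_fit_accuracy(x[:, None], y))
print("sparse expansion: ", linear_fit_accuracy(X_sparse, y))
```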
Research on neural networks had stalled because we couldn't estimate error bounds due to their non-convex cost function; there was no theoretical framework left. SVMs came to the rescue by providing three major benefits: a convex cost function, better generalisation (margin maximisation) and less parameter tuning.
Unfortunately, they don't scale well, the kernel is hard or impossible to choose to reach an optimal solution, and they don't allow deep architectures. For the same capacity, a shallow architecture needs more neurons than a deep one, and large shallow architectures are much more prone to numerical issues.
Deep architectures came back with convolutional deep neural networks applied to object recognition, and then Hinton proposed a breakthrough: a generative approach to initialise the parameters.
Unsupervised learning leads neural networks to a much better initialisation state, and its regularisation provides better generalisation. But even with better initialisation, we still aren't able to explore the function space any better, which, in my opinion, leaves the question open: is this an optimisation problem? The local minima we observe might be an illusion created by opposite gradients cancelling each other out, an optimisation problem induced by the leaky assumption of uncorrelated features, which leads people to optimise all parameters at the same time. ICML was inspiring.