Wednesday, September 2, 2009

What's the relationship between Machine Learning and Data-Mining?

Machine Learning and Data-Mining are closely related, but the link between them isn't clear to most people. I'll try to clarify it in this short post.

Let's start with definitions:
  • Data-Mining (DM) is the process of extracting patterns from data. The main goal is to understand relationships, validate models or identify unexpected relationships.
  • Machine Learning (ML) algorithms allow computers to learn from data. The learning process consists of extracting patterns, but the end goal is to use that knowledge to make predictions on new data.
In both ML and DM, we start by extracting patterns. In DM, the process ends there, by looking at the patterns. In ML, we reuse the learned patterns to make predictions.

One important difference in pattern extraction is that machine learning algorithms don't need the representation of the patterns to be understandable, but data-miners do. As an example, it is hard to understand exactly what a neural network has learned, whereas decision trees are easy to understand and compare. On the other hand, comprehensible patterns allow machine learning practitioners to identify data problems and, by fixing them, improve the prediction accuracy of their models.

So basically, the patterns mined by any machine learning algorithm are used to make predictions on new data.

Some people might simply say that they are the same and that the only difference is how you use the learned patterns: to understand or to predict.
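
To make the distinction concrete, here is a minimal sketch in Python. The toy data (temperature readings tagged with a failure flag) and the helper names learn_threshold and predict are made up for illustration. The same extracted pattern, a single threshold, is first read by a human (the data-mining use) and then reused on a new reading (the machine learning use).

```python
def learn_threshold(samples):
    """Find the cut on a single feature that best separates two classes.

    samples: list of (value, label) pairs with label in {0, 1}.
    Returns (threshold, errors), where the rule is "label 1 if value > threshold".
    Returns None if there are fewer than two distinct values.
    """
    values = sorted({v for v, _ in samples})
    best = None
    # Candidate cuts sit halfway between consecutive distinct values.
    for a, b in zip(values, values[1:]):
        cut = (a + b) / 2
        errors = sum((v > cut) != bool(label) for v, label in samples)
        if best is None or errors < best[1]:
            best = (cut, errors)
    return best


def predict(threshold, value):
    """Reuse the learned pattern on a new value."""
    return 1 if value > threshold else 0


# Hypothetical toy dataset: (temperature, machine failed?)
data = [(60, 0), (65, 0), (70, 0), (85, 1), (90, 1), (95, 1)]

threshold, errors = learn_threshold(data)

# Data-mining use: look at the pattern itself.
print("pattern: failure when temperature > %s (%d training errors)"
      % (threshold, errors))

# Machine learning use: apply the pattern to a reading never seen before.
print("prediction for a reading of 88:", predict(threshold, 88))
```

A threshold is about the simplest pattern that is both readable and usable for prediction; a neural network would cover the second use but not the first.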

note:
Unsupervised learning can be considered data-mining because it doesn't involve prediction. To understand the differences between the discovered clusters, we can simply apply supervised learning to a dataset tagged with the discovered patterns.
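
The idea in this note can be sketched as follows, under heavy simplifying assumptions: one-dimensional data, exactly two clusters, and hypothetical helper names (two_means, separating_threshold). The first step is unsupervised; the second treats the cluster ids as labels and learns a readable rule that explains how the clusters differ.

```python
def two_means(values, iterations=10):
    """Tiny 1-D k-means with k=2; returns one cluster tag (0 or 1) per value.

    Assumes the input contains at least two distinct values.
    """
    low, high = min(values), max(values)  # initial centers: the two extremes
    tags = []
    for _ in range(iterations):
        # Assign each value to its nearest center.
        tags = [0 if abs(v - low) <= abs(v - high) else 1 for v in values]
        group0 = [v for v, t in zip(values, tags) if t == 0]
        group1 = [v for v, t in zip(values, tags) if t == 1]
        # Move each center to the mean of its group.
        low = sum(group0) / len(group0)
        high = sum(group1) / len(group1)
    return tags


def separating_threshold(values, tags):
    """Supervised step: find the cut that best reproduces the cluster tags."""
    ordered = sorted(set(values))
    best_cut, best_errors = None, len(values) + 1
    for a, b in zip(ordered, ordered[1:]):
        cut = (a + b) / 2
        errors = sum((v > cut) != (t == 1) for v, t in zip(values, tags))
        if errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut, best_errors


# Hypothetical unlabeled measurements.
values = [1.0, 2.0, 1.5, 8.0, 9.0, 8.5]

tags = two_means(values)                          # unsupervised: discover clusters
cut, errors = separating_threshold(values, tags)  # supervised: explain them

print("cluster tags:", tags)
print("cluster 1 is 'value > %s' (%d disagreements)" % (cut, errors))
```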

Sunday, July 5, 2009

digipy 0.1.1 - Hand Digit Real Time Demo is available

At Montreal-Python6, I presented a real-time hand digit recognition demo.
This demo lets you do real-time digit recognition from your digital camera. It lets you load any trained neural network and apply the same feature extraction in real time. You can also train networks, extract features, use trained neural networks inside the real-time demo, visualize features in 2D along with their frequency distributions, and get the discriminant weight of each feature.

Version 0.1.1 of the demo is now packaged and available on PyPI:
(unfortunately, some dependencies aren't supported by easy_install, so you have to do 4 steps instead of 1)
  • install opencv (sudo aptitude install python2.5-opencv)
  • install PyQt (sudo aptitude install pyqt4-dev-tools)*
  • install matplotlib (sudo aptitude install python2.5-matplotlib)*
  • sudo easy_install digipy

* unfortunately, this package isn't supported by easy_install

Here is the noise robustness comparison of the trained neural network on the raw pixels vs extracted features (digit surface + image convolution with train digits means (0-9)):

If you aren't convinced by now that feature extraction is absolutely required, I have failed.
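
For readers who want a feel for what such features look like, here is a minimal sketch. It is not digipy's actual code: the function names are hypothetical, images are assumed to be flat lists of pixel intensities, and the "convolution with the training digit means" is approximated by a plain dot product between the image and the mean image of each class.

```python
def class_means(images, labels, n_classes):
    """Mean image of each class; every image is a flat list of pixel values."""
    n_pixels = len(images[0])
    sums = [[0.0] * n_pixels for _ in range(n_classes)]
    counts = [0] * n_classes
    for image, label in zip(images, labels):
        counts[label] += 1
        for i, pixel in enumerate(image):
            sums[label][i] += pixel
    # max(count, 1) avoids dividing by zero for a class with no example.
    return [[s / max(count, 1) for s in row]
            for row, count in zip(sums, counts)]


def extract_features(image, means):
    """Feature vector: [surface, response to mean 0, response to mean 1, ...]."""
    surface = sum(image)  # total amount of "ink" in the image
    responses = [sum(p * m for p, m in zip(image, mean)) for mean in means]
    return [surface] + responses


# Hypothetical 2x2 "images" (flattened) for a two-class toy problem.
train_images = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
train_labels = [0, 0, 1]

means = class_means(train_images, train_labels, n_classes=2)
print(extract_features([1, 1, 0, 0], means))
```

With ten digit classes this turns an image of hundreds of raw pixels into 11 numbers, and averaging over many pixels is one plausible reason such features would tolerate pixel noise better than raw inputs.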

Once installed, you will get access to these command line tools:

  1. digipy: Real-time hand digit recognition demo application (ex: digipy --test)
  2. digipy-features2D : demo of feature 2D visualization to see possible clusters
  3. digipy-train: demo training of a Neural Network using mlboost
  4. digipy-compare: compare noise effect on test error on raw inputs and feature extracted datasets
  5. digipy-freq-analysis: demo feature analysis (frequency distributions)
  6. digipy-extract-features: demo features extraction
  7. digipy-see-data: show dataset train and test samples



If you have any trouble using it, just let me know (fpieraut at gmail).
Source code is available here http://bitbucket.org/fraka6/.

(note: digipy now uses the mlboost.nn module for its NeuralNetwork instead of the mlboost.flayers swig wrapper)

Wednesday, June 17, 2009

ICML highlights summary

Let's try to summarize in a few sentences what I have learned:
  • Language acquisition: Children lose their capacity to distinguish some phonemes, which reduces the scope of choices and helps them learn the language of their environment. Acquiring phoneme categories with a bottom-up approach (signal processing + unsupervised clustering) isn't sufficient; lexical minimal pairs (ex: n-gram) seem to be required to ensure learning.
  • Trying to learn the best kernel while restricting the optimization to a convex problem seems to be a dead end. It might be time to change paradigm or move to the non-convex dark side.
  • Boosting is too sensitive to noise, but a robust framework has been presented by Yoav Freund.
  • Deep architectures seem to be the next big thing. Regularisation, auto-encoders and RBMs can be used to pre-train networks from unlabeled data. Temporal coherence (similarity of consecutive frames in video) can be used as an unsupervised regularisation technique in the embedding space. Unsupervised training is a regularisation technique that enforces better clustering. The more unlabeled examples are used, the better the generalization will be.
  • Training from IID samples isn't optimal; curriculum learning (i.e.: gradually increasing the complexity of the examples) seems to smooth the cost function and lead to faster training and better generalization.
  • GPUs are the way to go to make ML algorithms scalable.
  • Feature hashing is an efficient strategy for dimensionality reduction and can be used to train classifiers.
  • Sparse transformations simplify the optimisation process (i.e.: the same idea used by the kernel trick in SVMs). PCA does the opposite.
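
As an illustration of the feature hashing bullet above, here is a minimal sketch of the hashing trick applied to bag-of-words features. The function name hash_features, the bucket count and the use of SHA-256 as the hash function are assumptions made for this example. Each token is hashed to one of a fixed number of buckets, and a second hash bit picks a sign so that colliding tokens tend to cancel out rather than pile up.

```python
import hashlib


def hash_features(tokens, n_buckets=16):
    """Map a variable-size bag of tokens to a fixed-size vector.

    No vocabulary is stored: the bucket index and the sign of each token
    both come from a hash of the token itself.
    """
    vector = [0.0] * n_buckets
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_buckets
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vector[index] += sign
    return vector


# Hypothetical document, reduced to 8 dimensions whatever its vocabulary size.
document = "the cat sat on the mat".split()
print(hash_features(document, n_buckets=8))
```

The resulting vector has the same length for any document, so it can be fed directly to a linear classifier; the price is that distinct tokens may share a bucket.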
Research on neural networks stalled because we couldn't estimate error bounds, due to their non-convex cost function; there was no theoretical framework left. SVMs came to the rescue by providing 3 major benefits: a convex cost function, a better generalization process (margin maximization) and less parameter tuning.
Unfortunately, SVMs don't scale well, the kernel is hard or impossible to choose in order to reach the optimal solution, and they don't allow deep architectures. For the same capacity, a shallow architecture needs more neurons than a deep architecture, and large shallow architectures are much more prone to numerical issues.
Deep architectures came back with deep convolutional neural networks applied to object recognition, and then Hinton proposed a breakthrough: a generative approach to initialise the parameters.
Unsupervised learning leads neural networks to a much better initialization state, and its regularisation provides better generalisation. But now, even with better initialisation, we still aren't able to explore the function space better, which in my opinion leaves a question open: is this an optimization problem? The local minima we observe might be an illusion created by opposite gradients cancelling each other out, which is an optimisation problem induced by the leaky assumption of uncorrelated features that leads people to optimize all parameters at the same time. ICML was inspiring.

Wednesday, June 10, 2009

International Conference on Machine Learning (ICML2009)

The 26th International Conference on Machine Learning (ICML 2009) will take place in Montreal next week (June 14-18, 2009).

The 3 invited speakers are quite interesting. I look forward to hearing them:
  • Emmanuel Dupoux, from Ecole Normale Superieure, on: How do infants bootstrap into spoken language?
  • Yoav Freund, from University of California, on: Drifting games, boosting and online learning
  • Corinna Cortes, from Google, on: Can learning kernels help performance?

I have to decide which tutorials I will attend this Sunday:

  • T6 Machine Learning in IR: Recent Successes and New Opportunities, by Paul Bennett, Misha Bilenko, and Kevyn Collins-Thompson
  • T8 Large Social and Information Networks: Opportunities for ML, by Jure Leskovec
  • T9 Structured Prediction for Natural Language Processing, by Noah Smith

Here are some interesting papers:

  • Curriculum Learning
  • Deep Learning from Temporal Coherence in Video
  • Good Learners for Evil Teachers
  • Using Fast Weights to Improve Persistent Contrastive Divergence
  • Online Dictionary Learning for Sparse Coding
  • A Novel Lexicalized HMM-based Learning Framework for Web Opinion Mining
  • A Scalable Framework for Discovering Coherent Co-clusters in Noisy Data
  • Bayesian Clustering for Email Campaign Detection
  • Feature Hashing for Large Scale Multitask Learning
  • Grammatical Inference as a Principal Component Analysis Problem
  • Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations

I have to decide between those workshops on Thursday:

I look forward to meeting old colleagues, friends and new researchers. Next week will be awesome.

Friday, May 29, 2009

Is Python really slow? A practical comparison with C++

The common perception is that Python's implementation is slow, but you can often write fast Python if you know how to profile your code effectively.

I have tried it. I compared a highly CPU-intensive algorithm: the training of a simple neural network with one hidden layer. To do so, I used my old C++ NeuralNetwork library (flayers) and an implementation in Python with Numpy. I wrote a simple neural net in Python and optimized all the loops with numpy, as suggested in a profiling presentation I saw at PyCon 2009.

I compared the training time of a simple fully connected NeuralNetwork with 100 hidden neurons for 10 iterations on the letters dataset (cost function = mean squared error).
Here is the time to do 10 iterations with flayers (C++):
./fexp / -h 100 -l 0.01 --oh -e 10
...
Optimization: Standard
Creating Connector [16|100] [inputs | hiddens]
Creating Connector [100|26] [hiddens | outputs]
...

real 0m11.187s
user 0m10.837s
sys 0m0.012s
Here is the time to do 10 iterations on the full letters dataset with basic Python:

time ./bpnn.py -e 10 --h 100 -f letters.dat -n
Creation of an NN <16:100:26>
...
real 85m48.646s
user 85m9.163s
sys 0m1.632s
Here is the time to do 10 iterations on the full letters dataset with Python and numpy:
time ./bpnn.py -e 10 --h 100 -f letters.dat
Creation of an NN <16:100:26>
...
real 1m37.066s
user 1m36.026s
sys 0m0.100s

So if you do the math:
  • The numpy implementation is about 53 times faster than the basic Python implementation (85m48s vs 1m37s).
  • My C++ implementation is a little under 9 times faster than my simple Python numpy implementation (1m37s vs 11.2s).
The numpy implementation is definitely worth it because it reduces the code and has a significant performance impact. C++ might be required for extreme performance, but the trade-off in code complexity and development time may not be worth it. Now that I have the choice, I will still use my C++ lib.
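
The speed-ups can be recomputed directly from the `real` times in the logs above. A minimal sketch (the function and variable names are only for this example):

```python
def to_seconds(minutes, seconds):
    """Convert the 'XmY.YYYs' format printed by `time` into seconds."""
    return minutes * 60 + seconds


# `real` times copied from the three runs above.
cpp_time = to_seconds(0, 11.187)            # flayers (C++)
basic_python_time = to_seconds(85, 48.646)  # bpnn.py with -n
numpy_time = to_seconds(1, 37.066)          # bpnn.py with numpy

numpy_vs_basic = basic_python_time / numpy_time
cpp_vs_numpy = numpy_time / cpp_time
cpp_vs_basic = basic_python_time / cpp_time

print("numpy vs basic Python: %.1fx" % numpy_vs_basic)
print("C++ vs numpy:          %.1fx" % cpp_vs_numpy)
print("C++ vs basic Python:   %.0fx" % cpp_vs_basic)
```

With these timings, numpy comes out roughly 53 times faster than basic Python, and C++ roughly 8.7 times faster than numpy.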


Sunday, May 10, 2009

Short Essay: Engineer vs Scientist

The difference between engineers and scientists is obscure to many people, myself included at the beginning.
Back in 2001, during a chat with a research professor about the competence war between the engineering and science departments of the University of Montreal, I got an initial hint about it: he told me that part of the computer science department's tension with software engineers was about placement rates. The engineers' rate was much higher than the computer scientists'. I should have asked for an explanation. What is this war about? The computer science department requires engineers to take some qualifying courses even if they have great grades, and vice versa.

In my short career, I have observed a major difference that I will try to illustrate with examples.
After my degree in engineering, I took most of my graduate courses with scientists. In my first final exam, after some discussions with classmates, I realized that I was the only one who had used an estimate to get past a bottleneck calculation in a problem. Most of the others lost more than half an hour solving it exactly and couldn't finish the full exam.
Several times, I had to step into projects to set a schedule in order to force people to cut some corners. Perfection is an endless path. If you have to fill a vase with rocks, sand and water, you will most likely start with the rocks, then the sand, and then the water, if the effort is inversely proportional to the size, no? Yes, it might be more interesting to optimize cool things and advanced features, but most likely they aren't part of the basic blocks required to allow your project to move to the next step.

What is the point of having half of a perfect solution if you could have built an imperfect but complete one? I am not saying that every scientist's need for perfection leads them to lose the big picture, or that engineers aren't perfectionists, but simply that engineers' bias toward a system view helps them better gauge when it is time to aim for perfection. Time kills projects and over-perfection can eat up too much of it... but perfection should remain the goal.
Yes, I am generalizing and simplifying, but you get the big picture. Define your priorities and assign resources accordingly. So, am I an engineer, a scientist, or a little of both? Since my manager told me I was more of a scientist, I must be a little of both...

Monday, April 20, 2009

Montreal Python6 & Pair Programming

The major packaging and UI improvements of the Montreal Python 6 demo are the result of a pair programming session with apt-get install ygingras. In this world where time is the biggest constraint, I could not have finalized some of the demo improvements and the packaging without help. bitbucket, the Mercurial project host, has been very useful to store the demo parts (i.e.: flayers, mlboost and digipy). I don't often work off-line, but being able to commit on the plane or on the bus, exchange bundles and create branches easily has been very useful.
You can play with the demo, just type:
easy_install digipy (on your favorite linux distribution)
Simply type digipy and press tab to see all the digipy programs used in the live demo (i.e.: use --help to get more details). I have been impressed by Python's powerful setup.py-related distutils tools for creating packages (sdist, register, upload to push code to the Python Package Index is just GREAT!)
Virtual Environment (virtualenv), Ian Bicking's tool to create isolated Python environments, has been really useful to test the packages.
Pair programming is the type of approach that can take the productivity of your team from the sum of its members' individual productivity to their product or, at best, to an exponential function of the team member count. Pair programming has allowed me to get quickly past many extremely time-consuming technical show-stoppers that were burning up my too-scarce available time.