Tuesday, April 30, 2013

Simplifying clustering visualization with mlboost


Are you looking for a simple way to visualize your supervised or semi-supervised data clusters with different dimensionality reduction algorithms like PCA, LDA, Isomap, LLE, MDS, random trees, spectral embedding, etc.?
Here is an example output on the 4 newsgroups dataset.

If you are following the sklearn loading convention, with mlboost you can do it by changing 2 lines of code (lines #5 and #6) or by modifying this example, then running it with python yourvisu.py -m y

1 import sys
2 from mlboost.clustering import visu
3
4 # add your data loading function that returns data_train and data_test
5 from X import LOAD_DATASET_Y
6 visu.add_loading_dataset_fct('y', LOAD_DATASET_Y)
7 visu.main(sys.argv[1:])
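For example, a hypothetical loading function for the 4 newsgroups dataset above, following the sklearn convention (the function name and category list are just an illustration):

from sklearn.datasets import fetch_20newsgroups

CATEGORIES = ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

def load_4_newsgroups():
    # returns data_train and data_test, as visu expects
    data_train = fetch_20newsgroups(subset='train', categories=CATEGORIES)
    data_test = fetch_20newsgroups(subset='test', categories=CATEGORIES)
    return data_train, data_test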
By the way, if you click on the legend, it removes that class from the plot, as you can see here when I remove the green class 2. In the semi-supervised context, simply set the class of unlabeled samples to "?" (dataset.target[i]).
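A sketch of how you could mark unlabeled samples, assuming data_train comes from a loading function like the one above and that visu accepts string targets:

import numpy as np

# hypothetical: mark half the samples as unlabeled ('?') for the semi-supervised case
target = data_train.target.astype(object)      # int labels -> object array
unlabeled = np.random.rand(len(target)) < 0.5  # pretend half the labels are unknown
target[unlabeled] = '?'
data_train.target = target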
 

Without scikit-learn and matplotlib, it wouldn't be that easy to experiment with visualization.

Thursday, April 25, 2013

How to remap your keyboard on Linux if you spill water that has changed your key mapping? ex: pressing the 'c' key and getting '+' displayed...

Last week, I dropped some liquid on my keyboard and a weird thing happened: my 'c' key mutated to a '+', which is quite annoying.
After some research I came across the Linux Screw FAQ "How to disable/remap a keyboard key in Linux?", but it wasn't clear enough, so here is what you should do.

  1. run the keycode command xev in your prompt
  2. press the key that has the bad mapping
  3. identify the keycode in the xev output: state 0x10, keycode 86 (keysym 0x63, c), ...
  4. change the keycode mapping: xmodmap -e 'keycode 86=c C'
  5. add the command from step 4 to your .bashrc, otherwise you will need to redo it in every terminal
As mentioned in the Linux Screw FAQ, the 'c' key should be keycode 54, not 86, so I simply remapped keycode 86 to c and C (when Shift is pressed).

Quite happy I can use my keyboard normally again.  


Saturday, April 6, 2013

Finding the optimal K in k-means: an incremental k-means in Python

I was looking for a good implementation of an incremental k-means where I don't have to set the optimal K. There are interesting papers (X-means, G-means, etc.), but I couldn't find any Python implementation.

I decided to write an incremental version on top of sklearn.
The idea is simple:
  1. Start at K=x
  2. Identify the worst cluster based on an unsupervised measure (ex: silhouette)
  3. Split the worst cluster into 2 clusters
  4. Measure the global improvement with the new clusters
  5. If you get an improvement, continue adding clusters (see the sketch below)
You can find the source code in mlboost/clustering/ikmeans.py
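To give a feel for the loop, here is a minimal sketch on top of sklearn, using silhouette as the unsupervised measure; it is a simplified illustration, not the actual ikmeans.py code:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

def worst_cluster(X, labels):
    # worst cluster = lowest mean per-sample silhouette
    sil = silhouette_samples(X, labels)
    return min(np.unique(labels), key=lambda c: sil[labels == c].mean())

def incremental_kmeans(X, k_start=2, k_max=20):
    labels = KMeans(n_clusters=k_start).fit(X).labels_
    best = silhouette_score(X, labels)
    for k in range(k_start, k_max):
        worst = worst_cluster(X, labels)
        mask = labels == worst
        if mask.sum() < 2:  # nothing left to split
            break
        # split the worst cluster into 2 sub-clusters
        sub = KMeans(n_clusters=2).fit(X[mask]).labels_
        candidate = labels.copy()
        candidate[mask] = np.where(sub == 0, worst, k)  # new cluster gets id k
        score = silhouette_score(X, candidate)
        if score <= best:  # no global improvement: stop
            break
        labels, best = candidate, score
    return labels, best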
Special thanks to the scikit-learn library for letting me prototype this version so quickly.

Thursday, April 4, 2013

k-means with a configurable distance function: how to hack sklearn k-means to use a different distance function?

Like others, I was looking for a good k-means implementation where I could set the distance function.
Unfortunately, sklearn's k-means doesn't allow you to change the cost function; Euclidean distance is hard-coded.
As pointed out by the sklearn team, it is quite complex to generalize, because the centroid update (the mean) minimizes the squared Euclidean distance, not an arbitrary metric.
So, if your distance function is cosine, which has the same mean as Euclidean on normalized vectors, you can monkey-patch sklearn.cluster.k_means_.euclidean_distances this way (put this code before calling kmeans.fit(X)):


from sklearn.metrics.pairwise import cosine_similarity

def new_euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False):
    # drop-in replacement with the same signature; returns cosine similarity
    # (depending on your sklearn version, you may want 1 - cosine_similarity
    # to keep the "smaller is closer" semantics of a distance)
    return cosine_similarity(X, Y)

# monkey patch (ensure the cosine function is used instead of euclidean)
from sklearn.cluster import k_means_
k_means_.euclidean_distances = new_euclidean_distances


Warning: you need to normalize your input vectors, for example as in the sketch below.
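A minimal usage sketch with the patch applied (X here is just a stand-in for your data):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

X = np.random.rand(100, 20)  # stand-in for your data
X_norm = normalize(X)        # unit-length rows, as the warning above requires
km = KMeans(n_clusters=5)
km.fit(X_norm)               # uses the patched distance function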