Tuesday, April 30, 2013

Simplifying clustering visualization with mlboost


Are you looking for a simple way to visualize your supervised or semi-supervised data clusters with different dimensionality reduction algorithms like PCA, LDA, Isomap, LLE, MDS, random trees, spectral embedding, etc.?
Here is an example output on the 4 newsgroups dataset.

If you are following the sklearn loading convention, with mlboost you can do it by changing 2 lines of code (lines #5 and #6) or by modifying this example, then running it with python yourvisu.py -m y

1 import sys
2 from mlboost.clustering import visu
3
4 # add your data loading function that returns data_train and data_test
5 from X import LOAD_DATASET_Y
6 visu.add_loading_dataset_fct('y', LOAD_DATASET_Y)
7 visu.main(sys.argv[1:])
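For example, a hypothetical loading function for the 4 newsgroups dataset above, following the sklearn convention (the function name and category list are just an illustration):

from sklearn.datasets import fetch_20newsgroups

CATEGORIES = ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

def load_4_newsgroups():
    # returns data_train and data_test, as visu expects
    data_train = fetch_20newsgroups(subset='train', categories=CATEGORIES)
    data_test = fetch_20newsgroups(subset='test', categories=CATEGORIES)
    return data_train, data_test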
By the way, if you click on the legend, it removes that class from the plot, as you can see here when I remove the green class 2. In the semi-supervised context, simply set the class of unlabeled samples to "?" (dataset.target[i]).
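A sketch of how you could mark unlabeled samples, assuming data_train comes from a loading function like the one above and that visu accepts string targets:

import numpy as np

# hypothetical: mark half the samples as unlabeled ('?') for the semi-supervised case
target = data_train.target.astype(object)      # int labels -> object array
unlabeled = np.random.rand(len(target)) < 0.5  # pretend half the labels are unknown
target[unlabeled] = '?'
data_train.target = target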
 

Without scikit-learn and matplotlib, it wouldn't be that easy to experiment with visualization.

Thursday, April 25, 2013

How to remap your keyboard on Linux if you spill water that has changed your key mapping? ex: pressing the 'c' key and getting '+' displayed...

Last week, I dropped some liquid on my keyboard and a weird thing happened: my 'c' key mutated to a '+', which is quite annoying.
After some research I came across the Linux Screw FAQ "How to disable/remap a keyboard key in Linux?", but it wasn't clear enough, so here is what you should do.

  1. run the keycode command xev in your prompt
  2. press the key that has the bad mapping
  3. identify the keycode in the xev output: state 0x10, keycode 86 (keysym 0x63, c), ...
  4. change the keycode mapping: xmodmap -e 'keycode 86=c C'
  5. add the command from step 4 to your .bashrc, otherwise you will need to redo it in every terminal
As mentioned in the Linux Screw FAQ, the 'c' key should be keycode 54, not 86, so I simply remapped keycode 86 to c and C (when Shift is pressed).

Quite happy I can use my keyboard normally again.  


Saturday, April 6, 2013

Finding the optimal K in k-means: an incremental k-means in Python

I was looking for a good implementation of an incremental k-means where I don't have to set the optimal K. There are interesting papers (X-means, G-means, etc.), but I couldn't find any Python implementation.

I decided to write an incremental version on top of sklearn.
The idea is simple:
  1. Start at K=x
  2. Identify the worst cluster based on an unsupervised measure (ex: silhouette)
  3. Split the worst cluster into 2 clusters
  4. Measure the global improvement with the new clusters
  5. If you get an improvement, continue adding clusters (see the sketch below)
You can find the source code in mlboost/clustering/ikmeans.py
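To give a feel for the loop, here is a minimal sketch on top of sklearn, using silhouette as the unsupervised measure; it is a simplified illustration, not the actual ikmeans.py code:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

def worst_cluster(X, labels):
    # worst cluster = lowest mean per-sample silhouette
    sil = silhouette_samples(X, labels)
    return min(np.unique(labels), key=lambda c: sil[labels == c].mean())

def incremental_kmeans(X, k_start=2, k_max=20):
    labels = KMeans(n_clusters=k_start).fit(X).labels_
    best = silhouette_score(X, labels)
    for k in range(k_start, k_max):
        worst = worst_cluster(X, labels)
        mask = labels == worst
        if mask.sum() < 2:  # nothing left to split
            break
        # split the worst cluster into 2 sub-clusters
        sub = KMeans(n_clusters=2).fit(X[mask]).labels_
        candidate = labels.copy()
        candidate[mask] = np.where(sub == 0, worst, k)  # new cluster gets id k
        score = silhouette_score(X, candidate)
        if score <= best:  # no global improvement: stop
            break
        labels, best = candidate, score
    return labels, best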
Special thanks to the scikit-learn library for letting me prototype this version so quickly.

Thursday, April 4, 2013

k-means with a configurable distance function: how to hack sklearn k-means to use a different distance function?

Like others, I was looking for a good k-means implementation where I could set the distance function.
Unfortunately, sklearn's k-means doesn't allow you to change the cost function; Euclidean distance is hard-coded.
As pointed out by the sklearn team, it is quite complex to generalize, because the centroid update (the mean) minimizes the squared Euclidean distance, not an arbitrary metric.
So, if your distance function is cosine, which has the same mean as Euclidean on normalized vectors, you can monkey-patch sklearn.cluster.k_means_.euclidean_distances this way (put this code before calling kmeans.fit(X)):


from sklearn.metrics.pairwise import cosine_similarity

def new_euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False):
    # drop-in replacement with the same signature; returns cosine similarity
    # (depending on your sklearn version, you may want 1 - cosine_similarity
    # to keep the "smaller is closer" semantics of a distance)
    return cosine_similarity(X, Y)

# monkey patch (ensure the cosine function is used instead of euclidean)
from sklearn.cluster import k_means_
k_means_.euclidean_distances = new_euclidean_distances


Warning: you need to normalize your input vectors, for example as in the sketch below.
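A minimal usage sketch with the patch applied (X here is just a stand-in for your data):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

X = np.random.rand(100, 20)  # stand-in for your data
X_norm = normalize(X)        # unit-length rows, as the warning above requires
km = KMeans(n_clusters=5)
km.fit(X_norm)               # uses the patched distance function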