Thursday, April 4, 2013

kmeans with configurable distance function: How to hack sklearn.kmeans to use a different distance function?

Like others, I was looking for a good k-means implementation where I can set the distance function.
Unfortunately, sklearn.kmeans doesn't let you change the cost function. Euclidean is hard-coded.
As pointed out by the sklearn team, it is quite complex to generalize due to:
So if your distance function is cosine, which has the same mean as Euclidean, you can monkey-patch sklearn.cluster.k_means_.euclidean_distances this way: (put this code before calling

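from sklearn.metrics.pairwise import cosine_similarity

def new_euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False):
    # Same signature as sklearn's euclidean_distances, so k-means can call it unchanged.
    # NOTE: cosine_similarity grows as vectors get closer, whereas k-means picks the
    # smallest value; cosine_distances (1 - similarity) has the ordering k-means expects.
    return cosine_similarity(X, Y)

# monkey patch (ensure cosine dist function is used)
from sklearn.cluster import k_means_
k_means_.euclidean_distances = new_euclidean_distances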

Warning: you need to normalize your input vectors. 
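
To see why normalization matters, here is a small sketch in plain Python (no sklearn needed). For unit-length vectors, the squared Euclidean distance equals twice the cosine distance, so both rank neighbors the same way. The helper names l2_normalize, cosine_distance and squared_euclidean are made up for this illustration and are not part of sklearn.

```python
import math

def l2_normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def cosine_distance(a, b):
    """Return 1 - cosine similarity of a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def squared_euclidean(a, b):
    """Return the squared Euclidean distance between a and b."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two arbitrary example vectors (illustrative data only).
a = [3.0, 4.0, 0.0]
b = [1.0, 0.0, 2.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# For unit vectors: ||a - b||^2 == 2 * (1 - cos(a, b))
lhs = squared_euclidean(a_unit, b_unit)
rhs = 2.0 * cosine_distance(a, b)
assert math.isclose(lhs, rhs)
```

Without the normalization step the identity no longer holds, and the Euclidean mean used for the cluster centers is pulled toward the longer vectors.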

1 comment:

  1. Easiest way I've found ever. But cosine_distances should be used here instead of cosine_similarity.