Like others, I was looking for a good k-means implementation where I can set the distance function.
Unfortunately, sklearn's KMeans doesn't allow changing the distance function. Euclidean distance is hard-coded.
As pointed out by the sklearn team, it is quite complex to generalize due to:
- the centroid is not always the arithmetic mean, and is often not trivial to compute (http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf see table 8.2 page 501)
- a generalized version isn't always compatible with the optimized implementation
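To make the first point concrete, here is a minimal, dependency-free sketch of Lloyd-style k-means where both the distance and the centroid rule are passed in. It is not how sklearn implements KMeans; the names kmeans, columnwise, squared_euclidean and manhattan are made up for this illustration. It shows that switching to Manhattan distance only makes sense if the centroid switches from the mean to the coordinate-wise median as well.

```python
from statistics import mean, median


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def columnwise(reducer):
    """Build a centroid function that applies `reducer` to each coordinate."""
    def centroid(points):
        return tuple(reducer(col) for col in zip(*points))
    return centroid


def kmeans(points, init_centers, distance, centroid, n_iter=10):
    """Lloyd-style iterations with a pluggable distance and centroid rule."""
    centers = [tuple(c) for c in init_centers]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to the center with the smallest distance.
        labels = [
            min(range(len(centers)), key=lambda j: distance(p, centers[j]))
            for p in points
        ]
        # Update step: recompute each center with the rule that matches the distance.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append(centroid(members) if members else old)
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, labels


# Toy 2-D data (assumed for illustration): the point at x=10 pulls the mean, not the median.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (20.0, 0.0), (21.0, 0.0), (22.0, 0.0)]
init = [(0.0, 0.0), (22.0, 0.0)]

mean_centers, mean_labels = kmeans(points, init, squared_euclidean, columnwise(mean))
median_centers, median_labels = kmeans(points, init, manhattan, columnwise(median))

print(mean_centers)
print(median_centers)
```

Both runs produce the same cluster assignment on this toy data, but the first center differs: the mean is dragged toward the point at x=10, while the median stays at x=1. That coupling between distance and centroid is one reason a single "metric" parameter is harder to support than it looks.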
from sklearn.metrics.pairwise import cosine_similarity

def new_euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False):
    return cosine_similarity(X, Y)

# monkey patch (ensure cosine dist function is used)
# note: k_means_ is a private module that only exists in older scikit-learn releases
from sklearn.cluster import k_means_
k_means_.euclidean_distances = new_euclidean_distances
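Easiest way I've ever found. But cosine_distances should be used here instead of cosine_similarity, because k-means assigns each point to the center with the smallest value, and a similarity is largest for the closest center.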
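A small plain-Python check of that point, as a sketch only. The two helper functions below are stand-ins written for this example, not the sklearn functions of the same name, and they assume that cosine distance is defined as 1 minus cosine similarity. Taking the minimum over similarities picks the center pointing in the most different direction.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (illustrative stand-in)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Assumed definition: 1 - cosine similarity, so smaller means closer."""
    return 1.0 - cosine_similarity(a, b)


point = (1.0, 0.1)
centers = [(1.0, 0.0), (0.0, 1.0)]  # center 0 points almost the same way as `point`

# The assignment step takes the index with the smallest returned value.
by_similarity = min(range(len(centers)), key=lambda j: cosine_similarity(point, centers[j]))
by_distance = min(range(len(centers)), key=lambda j: cosine_distance(point, centers[j]))

print(by_similarity, by_distance)
```

With the similarity, the point lands on center 1, which is nearly orthogonal to it; with the distance, it lands on center 0 as expected.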