Saturday, May 25, 2013

How to generate confusion matrix visualization in python and how to use it in scikit-learn

Confusion matrices are quite useful for understanding your classifier's problems. scikit-learn lets you easily retrieve the confusion matrix (metrics.confusion_matrix(y_true, y_pred)), but the raw matrix is hard to read.
An image representation is a great way to look at it.


From a confusion matrix, you can derive the classification error, precision and recall, and extract confusion highlights. mlboost now has a simple utility class, ConfMatrix, that does all of this. Here is an example:

from sklearn import metrics
from mlboost.util.confusion_matrix import ConfMatrix

# clf is any scikit-learn classifier; X_train, y_train are your training data
clf.fit(X_train, y_train)
pred = clf.predict(X_train)  # predictions on the training set itself
labels = sorted(set(y_train))
cm = ConfMatrix(metrics.confusion_matrix(y_train, pred), labels)
cm.save_matrix('conf_matrix.p')
cm.get_classification()
cm.gen_conf_matrix('conf_matrix')
cm.gen_highlights('conf_matrix_highlights')
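
To make the derived numbers concrete, here is a minimal pure-Python sketch of how the error, precision, recall and confusion highlights can be computed from a matrix laid out like scikit-learn's (rows are true labels, columns are predicted labels). This is not the ConfMatrix implementation; the function names summarize and highlights and the toy matrix are assumptions made up for illustration.

```python
def summarize(cm, labels):
    """Error rate plus per-class precision/recall.

    cm[i][j] = number of samples whose true label is labels[i]
    and whose predicted label is labels[j].
    """
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(labels)))
    report = {"error": 1.0 - correct / total}
    for i, label in enumerate(labels):
        predicted = sum(row[i] for row in cm)  # column sum: predicted as this class
        actual = sum(cm[i])                    # row sum: truly in this class
        report[label] = {
            "precision": cm[i][i] / predicted if predicted else 0.0,
            "recall": cm[i][i] / actual if actual else 0.0,
        }
    return report


def highlights(cm, labels, top=3):
    """Largest off-diagonal cells as (true label, predicted label, count)."""
    cells = [
        (labels[i], labels[j], cm[i][j])
        for i in range(len(labels))
        for j in range(len(labels))
        if i != j and cm[i][j] > 0
    ]
    return sorted(cells, key=lambda cell: -cell[2])[:top]


# Toy 3-class matrix (hypothetical data)
toy_labels = ["a", "b", "c"]
toy_cm = [
    [5, 1, 0],
    [2, 3, 1],
    [0, 0, 4],
]
report = summarize(toy_cm, toy_labels)
worst_confusions = highlights(toy_cm, toy_labels)
print(report)
print(worst_confusions)
```

The highlights list is simply the off-diagonal cells sorted by count, so the first entry is the pair of classes the classifier confuses most often.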

Thursday, May 16, 2013

The simplest python server example ;)

Today, I was looking for a simple python server template but couldn't find a good one, so here is what I was looking for (yes, the title is a little arrogant ;):

#!/usr/bin/env python
''' simple python server example; 
    output format supported = html, raw or json '''
import sys
import json
from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer

FORMATS = ('html','json','raw')
format = FORMATS[0]

class Handler(BaseHTTPRequestHandler):

    #handle GET command
    def do_GET(self):
        # pick the MIME type and the body from the selected output format
        if format == 'html':
            content_type, body = 'text/html', "body"
        elif format == 'json':
            content_type, body = 'application/json', json.dumps({'path': self.path})
        else:
            content_type, body = 'text/plain', "%s\t%s" % ('path', self.path)
        # every format needs a status line and headers, not only html
        self.send_response(200)
        self.send_header('Content-type', content_type)
        self.end_headers()
        self.wfile.write(body)
        return

def run(port=8000):

    print('http server is starting...')
    #ip and port of server
    server_address = ('127.0.0.1', port)
    httpd = HTTPServer(server_address, Handler)
    print('http server is running...listening on port %s' %port)
    httpd.serve_forever()

if __name__ == '__main__':
    from optparse import OptionParser
    op = OptionParser(__doc__)

    op.add_option("-p", default=8000, type="int", dest="port",
                  help="port #")
    op.add_option("-f", default='json', dest="format",
                  help="format available %s" %str(FORMATS))
    op.add_option("--no_filter", default=True, action='store_false',
                  dest="filter", help="don't filter")

    opts, args = op.parse_args(sys.argv)

    format = opts.format
    run(opts.port)
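
The script above targets Python 2 (the BaseHTTPServer module was renamed http.server in Python 3). Below is a hedged sketch of the same idea for Python 3, not a drop-in replacement: the helper names render, make_handler and CONTENT_TYPES are made up for this example, and the html body echoes the request path instead of the fixed "body" string.

```python
import html
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# MIME type sent for each supported output format
CONTENT_TYPES = {
    "html": "text/html",
    "json": "application/json",
    "raw": "text/plain",
}


def render(fmt, path):
    """Build the response text for a request path in the given format."""
    if fmt == "html":
        return "<html><body>%s</body></html>" % html.escape(path)
    if fmt == "json":
        return json.dumps({"path": path})
    return "%s\t%s" % ("path", path)


def make_handler(fmt):
    """Create a handler class bound to one output format (avoids a global)."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Python 3 sockets carry bytes, so the text must be encoded
            body = render(fmt, self.path).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", CONTENT_TYPES[fmt] + "; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return Handler


def run(port=8000, fmt="json"):
    """Serve on localhost until interrupted."""
    httpd = HTTPServer(("127.0.0.1", port), make_handler(fmt))
    print("http server is running...listening on port %s" % port)
    httpd.serve_forever()
```

Calling run(8000, "json") from a script or an interactive session starts the server; option parsing can be added with argparse in the same way the original uses optparse.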

Tuesday, April 30, 2013

Simplifying clustering visualization with mlboost


Are you looking for a simple way to visualize your supervised or semi-supervised data clusters with different dimension reduction algorithms like PCA, LDA, isomap, LLE, MDS, random trees, spectral embedding, etc.?
Here is an output example on a 4 newsgroups dataset.

If you follow the sklearn dataset loading convention, with mlboost you can do it by changing 2 lines of code (lines #5 and #6), or by modifying this example (python yourvisu.py -m y):

import sys
from mlboost.clustering import visu

# add your data loading function that returns data_train and data_test
from X import LOAD_DATASET_Y                        # line 5: import your loader
visu.add_loading_dataset_fct('y', LOAD_DATASET_Y)   # line 6: register it under the name 'y'
visu.main(sys.argv[1:])
Btw, if you click on the legend, the corresponding class is removed, as you can see here where I removed the green class 2. For semi-supervised data, simply set the class of the unlabeled samples to "?" (dataset.target[i]).
 

Without scikit-learn and matplotlib, it wouldn't be that easy to experiment with visualization.

Thursday, April 25, 2013

How to remap your keyboard on linux after a water spill has changed your key mapping (ex: pressing the 'c' key displays '+')

Last week, I dropped some liquid on my keyboard and a weird thing happened: my 'c' key mutated into a '+', which is quite annoying.
After some research, I found the linux screw FAQ "How to disable/remap a keyboard key in Linux?", but it wasn't clear enough, so here is what you should do.

  1. run the keycode command xev in your terminal
  2. press the key that has a bad mapping
  3. identify the keycode in the xev output: state 0x10, keycode 86 (keysym 0x63, c), ....
  4. change the keycode mapping: xmodmap -e 'keycode 86=c C'
  5. add the command from (4) to your .bashrc, otherwise you will need to redo it in every terminal
As mentioned in the linux screw FAQ, the 'c' key should be keycode 54, not 86, so I just remapped keycode 86 to c and C (when shift is pressed).

Quite happy I can use my keyboard normally again.  



Saturday, April 6, 2013

Finding the optimal K in k-means: an incremental k-means in python

I was looking for a good implementation of an incremental k-means where I don't have to set the optimal K. There are interesting papers (x-means, g-means, etc.) but I couldn't find any python implementation.

I have decided to write an incremental version on top of sklearn.
The idea is simple:
  1. start at K=x
  2. identify the worst cluster based on an unsupervised measure (ex: silhouette)
  3. split the worst cluster into 2 clusters
  4. measure the global improvement with the new clusters
  5. if you get an improvement, continue adding clusters
You can find the source code in mlboost/clustering/ikmeans.py
A special thanks to the scikit-learn lib for letting me prototype this version so fast.
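
The five steps can be sketched in a few lines. To stay dependency-free, the sketch below is pure Python and is not the code in mlboost/clustering/ikmeans.py: it replaces sklearn's KMeans with a tiny Lloyd's loop using a deterministic farthest-point initialization, computes the silhouette by hand, and assumes the starting K is at least 2 and that the data contains at least two distinct points. All function names are made up for the example.

```python
import math


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def kmeans(points, k, iters=100):
    """Lloyd's algorithm; returns the non-empty clusters as lists of points."""
    centers = [points[0]]
    while len(centers) < k:  # farthest-point init: deterministic, no random seed
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        updated = [mean(cl) if cl else c for cl, c in zip(clusters, centers)]
        if updated == centers:
            break
        centers = updated
    return [cl for cl in clusters if cl]


def silhouettes(clusters):
    """Silhouette value of every point, grouped by cluster (needs >= 2 clusters)."""
    result = []
    for i, cluster in enumerate(clusters):
        values = []
        for p in cluster:
            if len(cluster) == 1:
                values.append(0.0)  # usual convention for singleton clusters
                continue
            a = sum(dist(p, q) for q in cluster) / (len(cluster) - 1)
            b = min(sum(dist(p, q) for q in other) / len(other)
                    for j, other in enumerate(clusters) if j != i)
            values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
        result.append(values)
    return result


def average(values):
    return sum(values) / len(values)


def global_score(clusters):
    """Mean silhouette over all points."""
    return average([s for values in silhouettes(clusters) for s in values])


def incremental_kmeans(points, k_start=2, k_max=10):
    clusters = kmeans(points, k_start)                       # 1. start at K=x
    best = global_score(clusters)
    while len(clusters) < k_max:
        per_cluster = [average(v) for v in silhouettes(clusters)]
        worst = per_cluster.index(min(per_cluster))          # 2. worst cluster
        halves = kmeans(clusters[worst], 2) if len(clusters[worst]) > 1 else []
        if len(halves) < 2:                                  # nothing left to split
            break
        candidate = clusters[:worst] + clusters[worst + 1:] + halves  # 3. split it
        score = global_score(candidate)                      # 4. global measure
        if score <= best:                                    # 5. stop if no gain
            break
        clusters, best = candidate, score
    return clusters, best


# Three tight, well-separated blobs of 4 points each (toy data)
offsets = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
blobs = [(0.0, 0.0), (3.0, 0.0), (20.0, 0.0)]
points = [(cx + dx, cy + dy) for cx, cy in blobs for dx, dy in offsets]
found, score = incremental_kmeans(points, k_start=2)
print(len(found), score)
```

Starting from K=2, the two nearby blobs are first merged into one cluster; that cluster has the lowest silhouette, so it is the one that gets split, and the loop stops once a further split lowers the global score.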

Thursday, April 4, 2013

kmeans with configurable distance function: How to hack sklearn.kmeans to use a different distance function?

Like others, I was looking for a good k-means implementation where I can set the distance function.
Unfortunately, sklearn.kmeans doesn't allow changing the cost function: Euclidean is hard-coded.
As pointed out by the sklearn team, it is quite complex to generalize: k-means updates each center as the mean of its cluster, and the mean is only the right center for the Euclidean distance.
So if your distance function is cosine, which has the same mean as Euclidean, you can monkey patch sklearn.cluster.k_means_.euclidean_distances this way (put this code before calling kmean.fit(X)):


from sklearn.metrics.pairwise import cosine_similarity

def new_euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False):
    # k-means assigns each sample to the center with the smallest value,
    # so return a distance (1 - similarity), not the similarity itself
    return 1.0 - cosine_similarity(X, Y)

# monkey patch (ensure cosine dist function is used)
from sklearn.cluster import k_means_
k_means_.euclidean_distances = new_euclidean_distances


Warning: you need to normalize your input vectors. 
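
A small numeric check shows why the normalization matters. For two unit-length vectors, the squared Euclidean distance equals exactly twice the cosine distance (||a - b||^2 = 2 - 2 cos(a, b)), so the two measures rank pairs of points the same way; for raw vectors the vector lengths leak into the Euclidean distance and the relation breaks. The sketch below is pure Python with made-up helper names and toy vectors; it only illustrates the identity and is not part of the patch.

```python
import math


def normalize(v):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_sim(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


raw_a, raw_b = [3.0, 4.0], [1.0, 2.0]
unit_a, unit_b = normalize(raw_a), normalize(raw_b)

# Unit vectors: both numbers agree (up to floating-point rounding)
unit_gap = abs(squared_euclidean(unit_a, unit_b) - 2 * (1 - cosine_sim(unit_a, unit_b)))
# Raw vectors: the numbers disagree, the identity no longer holds
raw_gap = abs(squared_euclidean(raw_a, raw_b) - 2 * (1 - cosine_sim(raw_a, raw_b)))
print(unit_gap, raw_gap)
```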

Thursday, March 21, 2013

Google Analytics Hidden Business Model

What is GA (Google Analytics) Business Model?

Analytics is used by Google to improve its adwords product (the source of most of Google's revenue). Nothing is free.

Here are the most common reasons given for why GA is free:
  • Provide accountability and transparency to existing Google advertisers (adwords clients)
  • Provide confidence and prove the value of online advertising to potential new advertisers
Here are 2 other ones:
  • Provide advertisers with tools to speed up their sites (real-user monitoring)
  • Increase ads relevance (hidden)
Why real-user monitoring?
Since 2011, Google ranking considers site speed. The faster sites are, the more searches people will do.

How can GA be used by google to increase ads relevance?

In order to understand this concept, let's look at google eco-system.

If a user visits several sites monitored by Google Analytics, GA information can be leveraged to better recommend ads anywhere and to improve Google search ads.

With GA, you get quite interesting information about users: their behavior, the content and sites they visit, how often they visit and how much time they spend, and all this across multiple sites. GA is an amazing source of user information.

In order to get better customized ads, you need to:
  • collect session data (behavior, sites visited, content type, frequency, etc.)
  • aggregate session data into user data
  • get more information about the user (not simply an id: a google+ or gmail account)
Once you have the data, the biggest challenge is merging the sessions, which requires a way to link them to the same user. Here are the ways Google does it:
  • Google Analytics (same session id between sites)
  • gmail login (get a unique id)
  • Chrome browser (control the browser, control the browser global id)
  • google+ and gmail account (user info)

The next biggest challenge is leveraging the social graph, which explains the google+ war against facebook.
The ad platform has to leverage GA, gmail, google+, etc. to improve its search.
