Monday, November 25, 2013

Visualizing High Dimensional Data (PCA, LLE, t-SNE)

Here is a great talk about data visualization: Visualizing Data Using t-SNE - YouTube

Here is the PCA 2-dimension reduction of the MNIST data (28x28 digit images). 

Now with a better transformation that preserves the distance function.

Isomap


LLE (collapses many points to one)

t-SNE:
So now, which one is better in the context of unlabeled data?

You can now use t-SNE to visualize millions of objects. Very exciting!  http://arxiv.org/abs/1301.3342
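For readers who want to try this themselves, here is a minimal sketch with scikit-learn (using the small built-in digits dataset rather than full MNIST to keep the run short; the 30-component PCA pre-reduction is a common speed-up, an assumption of this sketch, not something prescribed above):

```python
# Sketch: PCA to reduce dimensionality/noise, then t-SNE down to 2-D for plotting.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = load_digits().data[:300]                   # keep it small, t-SNE is costly
X_pca = PCA(n_components=30).fit_transform(X)  # pre-reduce before t-SNE
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X_pca)
# X_2d is a (300, 2) array, ready for a scatter plot colored by digit label
```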


Monday, November 18, 2013

Simplicity's impact on conversion

What drives online business is conversion. What is the biggest driver of increase in conversions?
  • More users landing (PPC)
  • Payment simplicity
What is the point of increasing the number of users landing on your site if you don't first optimize your biggest show stopper: the payment funnel?

Digital currencies such as Bitcoin can simplify this funnel by making it one step instead of four. 
So which currency has less friction?


Does simplicity really increase conversion?

Tuesday, November 12, 2013

Want to start a startup?

A compilation of fraka6 posts on startups + important concept references

Serial entrepreneur rules: You will not stop at one 
Funding: Network, then Money
Strategy: Disrupt an existing market
Information theory applied to business
Recruiting/talent management: Hire personality, train skills
Why MBAs aren't designed for startups
Bootstrapping: It's feasible
Contract: found a template (20 items)
IP: Don't get caught in the lawyers' gray-zone game
Startup vs Enterprise: The future of innovation is startups inside the beast

+ important concepts: Start with why, get organized and focus
David S. Rose: How to pitch to a VC (You | Integrity | Passion | Experience | Knowledge | Skills | Leadership | Commitment | Vision | Realism | Coachable)
Importance of the advisor agreement (skipping it = biggest mistake you can make)
FREE ENTREPRENEUR TOOL (Profit spotify is an interesting tool to mine Amazon best sellers to identify markets)
Amazing selling machine

Sunday, November 10, 2013

Why aren't MBAs designed for startups?



Why standard MBAs don't bring any value to startups is quite simple: they are trained to be successful in a big corporation, not in a struggling early-stage startup.

Thursday, November 7, 2013

Information theory applied to business

If you want to validate a business need, you have to master information theory. 

#1 Evaluating people feedback
        - Indifference = Nice to Have = No value
        - Hate or Desire = High Information = High potential

Potential is proportional to reaction, good or bad. A neutral reaction is a sign of a nice-to-have, which means no value.


#2 Value
    - Ask = Little value (nice to have)
    - Pain = High Value
    - Observe = a business opportunity

Never ask people what they want; observe them and focus on their pain.


#3 Don't forget: Benefits >> Cost  


Last, there is no business if the benefits aren't significantly higher than the cost.

Saturday, October 19, 2013

Machine Learning API challenges ahead

Here is a summary of the ML challenges to reach mass market with APIs:
  1. Simplify the preprocessing (data cleaning, feature extraction & selection -> 90% of the work) & integration into a mining or ML pipeline (10%) & http://scikit-learn.org (open-source project supported by Google & the INRIA research group)
  2. Simplification of data visualization
  3. Simplification of semi-supervised tagging (reduce the tagging/labelling effort) 
  4. Simplification of parameter selection (model included): Hyperopt: A Python library for optimizing machine learning algorithms; SciPy 2013 - YouTube
  5. For services in the cloud, the biggest show stoppers are data transfer (way too slow) & confidentiality

Players to watch: 
PredictionIO : open source machine learning server 
Apache Mahout: Scalable Machine Learning and data-mining

Deep Learning history and most important pointers



Prior to 2006, neural networks weren't cool at all and were ignored by the machine learning community (in paper selection). After the Bayesian network domination, the SVM introduced by Vapnik, with the kernel trick, dominated the literature (boring kernel-engineering papers). SVMs had a major advantage: their convergence could be proven, which made them more interesting for theoretical paper work than gradient descent, a simple greedy optimization procedure prone to local minima.
Despite their amazing theoretical capacity (being able to model any function), no one was able to train big neural networks, and it was even worse for deep ones, which could represent similar functions with fewer parameters. Basically, moving beyond shallow architectures implied moving back to the non-convex optimization nightmare... back to the painful path of optimization research. 
Despite the community momentum, LeCun, Hinton and Bengio kept doing research on the subject (thinking outside the box). The deep learning revolution seems to have started from LeCun's discovery, with his application of convolutional networks to face recognition: the structure of the topology could drastically affect the training capability. How could we learn this embedding? Then Hinton came with his revolutionary idea of deep belief networks and auto-encoders (generative models) to better initialize parameters and force the lower layers to learn low-level features, an embedding initialization state, basically adding data-driven constraints matching the natural manifold inherent to the problem.  

Deep learning is not the answer to AI but a major breakthrough.
Why? Deep learning not only combines unsupervised and supervised learning, it also partially solves the training of deep architectures, which is critical to learning layers of intelligence (beyond shallow architectures). Furthermore, with neural networks, we don't need to make any assumptions about distributions; they are scalable and not that hard to implement. Basically, we use unlabeled data to initialize a deep architecture and then train it with our labeled data.
Concerning deep learning's ability to learn higher-level reasoning, it is the next big step. Learning to detect objects from images seems to me the right step towards that goal. An object is a higher level of representation, as an idea is of a sentence, and so on. Concerning "Thinking, Fast and Slow", we can see the fast part as the evaluation of the model and the slow part as its retraining during sleep (dreams are part of the retraining process). Deep learning is a revolution in the field of machine learning/AI, and I am looking forward to seeing how Ng, Hinton, Bengio, LeCun and others will make it evolve to learn more layers of intelligence.
The next deep learning improvements are fuzzy auto-encoders, maxout and word2vec.
We shouldn't forget that SVMs don't scale, and with the explosion of available data, scalable approaches had to emerge despite the machine learning community's obstinacy in focusing on un-scalable solutions because convergence is easier to prove.   

Here is a good summary on Learning Deep Architectures for AI: http://www.iro.umontreal.ca/~lisa/pointeurs/TR1312.pdf
To learn more about deep learning and the available implementations (maintained by Bengio's LISA lab), take a look at http://deeplearning.net/tutorial/ and http://fastml.com/pylearn2-in-practice/

Sunday, September 29, 2013

Attracting, evaluating & criteria about Investors

How to attract good investors?

Apply the core ninja principle:

If you go out looking for cash….
You’ll find people who tell you how to improve your business….

If you go out looking for people to improve your business (real entrepreneurs)....
You will find cash! (and you will have people to help you improve your business too...cool huh?!)

How to evaluate an investor?

The real value of an investor is inversely proportional to how much he will try to take advantage of you.
Basically, the more he tries to screw you, the lower his value. … apply the ninja rule...

What to look for in an investor?
1- Network
2- Experience
3- Money

How to select your investor?
1- Identify your need
2- Identify the best people who can help (use the beachhead model)
3- Request help (apply the ninja principle)

Wednesday, September 25, 2013

A simple way to identify outliers and focus on clusters

A simple way to identify outliers is to apply a 2D dimension reduction and click on the outliers to retrieve the original record, as follows.
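A non-interactive sketch of the same idea (illustrative names, not the mlboost implementation): reduce to 2-D with PCA, flag the points farthest from the projection centroid as candidate outliers, and map them back to the original records by index. Clicking a point in the plot plays the role of the index lookup below.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                        # original records (64-d)
X_2d = PCA(n_components=2).fit_transform(X)   # 2-D projection

# distance of each projected point to the centroid of the projection
dist = np.linalg.norm(X_2d - X_2d.mean(axis=0), axis=1)
outlier_idx = np.argsort(dist)[-5:]           # indices of the 5 farthest points
outliers = X[outlier_idx]                     # retrieve the original records
```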

This feature is now integrated by default in mlboost.
Now, can you zoom in/focus on specific clusters? You can, if you have some tagged data, by excluding classes (-e) or limiting to specific classes (-o) like this. 


Don't have too much fun, and don't forget you can specify the dimension reduction transformation used. 

Friday, September 6, 2013

The Big Data Dead Valley Dilemma

Here is the dump of my perception of Big Data (presented at BigData MTL 7): basically, the Big Data Dead Valley Dilemma.
Basically, where are real Big Data applications most likely going to happen?
First, to do Big Data, your organization needs to be technologically mature (big limitation #1). The required level of maturity is mostly available in start-ups or big enterprises. SMBs aren't considered because they have no big data problems, no long-term vision and little technology maturity in general.
Second, you have to consider the risk of being able to do big data for real (big limitation #2). Most startups don't reach big data due to funding issues, data issues (not enough) and data availability (privacy constraints).
Once you have the data (enterprise), you are stuck with IT constraints (fear of outsourcing, politics, incompetence) and data quality (unusable data due to limited integrated QA).
Basically, big data is a dead valley. Big Data is a marketing carrot to attract SMBs to invest indefinitely in projects that will most likely fail; a failing project generates more revenue for big data consulting firms. The 2 main places where big data will succeed are in enterprises like Google, Amazon, Yahoo and Nuance, and in successful startups like Facebook, LinkedIn and Twitter. Big Data is not a big market: there are too many barriers to entry, and the benefits will be destroyed by the complexity of the integration cost. Big data can't succeed if one part of the chain is broken, which is the case 99.9% of the time.   

Wednesday, July 17, 2013

Bizplan = entrepreneur handcuff for investors

A business plan is an amazing tool to help entrepreneurs clarify their vision, but some investors use it as a risk-reduction tool: basically, handcuffs for entrepreneurs. Many investors (no-value investors / money-only investors) will use it to reduce their risk. The less value your investors bring to the table, the more they will use your own business plan to hang you, and the worse the valuation they will give you. Basically, the less value they bring to the table, the more they will try to screw you. Learn to negotiate and stay in a strong position; it's like war.

5 Lessons Entrepreneurs Can Take From D-Day
http://www.linkedin.com/today/post/article/20130607022027-37142409-5-lessons-entrepreneurs-can-take-from-d-day?_mSplash=1

Tuesday, June 25, 2013

How to disrupt Google AdWords PPC? Why will AdWords never move from PPC to PProiC?

Everyone knows that Google's cash cow is AdWords and that the best way to attract customers is to help them find you. The AdWords business model is quite simple: Pay Per Click (PPC). But what about ROI?
Most businesses care about paying customers and don't want to waste money on the others, no?
As mentioned in the AdWords FAQ, ROI is simple to compute: (Revenue - Cost) / Cost = ROI. 
Why do neither AdWords nor Google Analytics let you PProiC (Pay Per ROI Click) easily?
Because you are selling a brand, cost is hard to measure, we can't measure the exact ROI, blablabla....
The truth is that if most companies could measure real PPC easily, they would stop using standard PPC because it's simply too expensive. Let's take an example: $1/click, 8266 clicks, 39 sales -> $211.95/sale. If your average sale is $150, you are losing $61.95 per sale and need to pump more money into your business?
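Working through the example's numbers with the ROI formula above (illustrative values from the example, not real campaign data):

```python
clicks, cost_per_click = 8266, 1.0      # $1/click
sales, avg_sale = 39, 150.0             # 39 sales at $150 on average

cost = clicks * cost_per_click          # $8266 of ad spend
cost_per_sale = cost / sales            # ~$211.95 to acquire one sale
loss_per_sale = cost_per_sale - avg_sale    # ~$61.95 lost on every sale
roi = (avg_sale * sales - cost) / cost      # (Revenue - Cost) / Cost, ~ -29%
```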
Why do you think so many companies are doing email marketing with MailChimp or Marketo, or trying to optimize their campaigns with Acquisio? They want positive ROI / to limit the dilution of their margin, and are desperately looking for a PPC alternative like an ROI dashboard; but who can compete with the free Google Analytics? QMining?
PPC is a great solution to increase sales but also a powerful tool that kills many companies looking for a better way to reach their audience. Google has to be disrupted, but who is strong enough to take on that challenge...
Why might Google never move to PProiC? Because it will never go against its core business model; would you? No one does until they hit the wall, like Microsoft did last month.

Saturday, May 25, 2013

How to generate a confusion matrix visualization in python and how to use it with scikit-learn

Confusion matrices are quite useful to understand your classifier's problems. scikit-learn lets you retrieve the confusion matrix easily (metrics.confusion_matrix(y_true, y_pred)), but it is hard to read.
An image representation is a great way to look at it, like this.


From a confusion matrix, you can derive classification error, precision, recall and extract confusion highlights. mlboost has a simple util class ConfMatrix to do all of this now. Here is an example:

 from sklearn import metrics  
 from mlboost.util.confusion_matrix import ConfMatrix  
 clf.fit(X_train, y_train)  
 pred = clf.predict(X_train)  
 labels = sorted(set(y_train))  
 cm = ConfMatrix(metrics.confusion_matrix(y_train, pred), labels)  
 cm.save_matrix('conf_matrix.p')  
 cm.get_classification()  
 cm.gen_conf_matrix('conf_matrix')  
 cm.gen_highlights('conf_matrix_highlights')  
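The derived metrics mentioned above come straight from the matrix itself; here is a minimal numpy sketch with a made-up 3-class matrix (rows = true labels, columns = predictions), independent of mlboost:

```python
import numpy as np

cm = np.array([[50,  2,  3],    # hypothetical 3-class confusion matrix
               [ 4, 45,  1],
               [ 6,  0, 44]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / total
error = 1.0 - accuracy                    # classification error
precision = np.diag(cm) / cm.sum(axis=0)  # per class: tp / predicted as class
recall = np.diag(cm) / cm.sum(axis=1)     # per class: tp / truly in class
```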

Thursday, May 16, 2013

The simplest python server example ;)

Today, I was looking for a simple python server template but couldn't find a good one, so here is what I was looking for (yes, the title is a little arrogant ;):

#!/usr/bin/env python
''' simple python server example; 
    output format supported = html, raw or json '''
import sys
import json
from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer

FORMATS = ('html','json','raw')
format = FORMATS[0]

class Handler(BaseHTTPRequestHandler):

    #handle GET command
    def do_GET(self):
        if format == 'html':
            self.send_response(200)
            self.send_header('Content-type', 'text/html')
            self.end_headers()
            self.wfile.write("body")
        elif format == 'json':
            self.request.sendall(json.dumps({'path':self.path}))
        else:
            self.request.sendall("%s\t%s" %('path', self.path))
        return

def run(port=8000):

    print('http server is starting...')
    #ip and port of server
    server_address = ('127.0.0.1', port)
    httpd = HTTPServer(server_address, Handler)
    print('http server is running...listening on port %s' %port)
    httpd.serve_forever()

if __name__ == '__main__':
    from optparse import OptionParser
    op = OptionParser(__doc__)

    op.add_option("-p", default=8000, type="int", dest="port",
                  help="port #")
    op.add_option("-f", default='json', dest="format",
                  help="format available %s" % str(FORMATS))
    op.add_option("--no_filter", default=True, action='store_false',
                  dest="filter", help="don't filter")

    opts, args = op.parse_args(sys.argv)

    format = opts.format
    run(opts.port)

Tuesday, April 30, 2013

Simplifying clustering visualization with mlboost


Are you looking for a simple way to visualize your supervised or semi-supervised data clusters with different dimension reduction algorithms like PCA, LDA, isomap, LLE, MDS, random trees, spectral embedding, etc.?
Here is an output example on 4 newsgroups dataset.

If you follow the sklearn loading standard, with mlboost you can do it by changing 2 lines of code (#5 and #6 of the snippet below: the dataset import and its registration) or by modifying this example. (python yourvisu.py -m y)

import sys
from mlboost.clustering import visu

# add your data loading function that return data_train and data_test
from X import LOAD_DATASET_Y
visu.add_loading_dataset_fct('y', LOAD_DATASET_Y)
visu.main(sys.argv[1:])
Btw, if you click on the legend, it will remove the class, as you can see here when I remove the green class 2. In a semi-supervised context, simply set a sample's class to "?" (dataset.target[i]). 
 

Without scikit-learn and matplotlib, it wouldn't be that easy to experiment with visualization. 

Thursday, April 25, 2013

How to remap your keyboard on linux if you dropped water that changed your keyboard key mapping? ex: pressing the 'c' key and getting '+' displayed....

Last week, I dropped some liquid on my keyboard and a weird thing happened: my 'c' key mutated to a '+', which is quite annoying.
After some research I found the linux screw FAQ: How to disable/remap a keyboard key in Linux? but it wasn't clear enough, so here is what you should do.

  1. run the keycode command xev in your prompt
  2. press the key that has a bad mapping
  3. identify the keycode in the xev output: state 0x10, keycode 86 (keysym 0x63, c), ....
  4. change the keycode mapping: xmodmap -e 'keycode 86=c C'
  5. add the (4) cmd to your .bashrc, otherwise you will need to redo it in every terminal
As mentioned in the linux screw FAQ, the 'c' key should be keycode 54, not 86, so I just remapped keycode 86 to c and C (when shift is pressed).

Quite happy I can use my keyboard normally again.  


Saturday, April 6, 2013

Finding the optimal K in kmeans: an incremental kmeans in python

I was looking for a good implementation of an incremental k-means where I don't have to set the optimal K. There are interesting papers (x-means, g-means, etc.) but I couldn't find any python implementation.

I have decided to write an incremental version on top of sklearn.
The idea is simple:
  1. Start at K=x
  2. Identify the worst cluster based on an unsupervised measure (ex: silhouette)
  3. Split the worst cluster into 2 clusters
  4. Measure the global improvement with the new clusters
  5. If you get an improvement, continue adding clusters
You can find the source code in mlboost/clustering/ikmeans.py
A special thanks to the scikit-learn lib for letting me prototype this version so fast. 
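The 5 steps above can be sketched on top of sklearn roughly like this (a simplified illustration of the idea, not the actual mlboost/clustering/ikmeans.py code; silhouette is used as the unsupervised measure):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

def incremental_kmeans(X, k_start=2, k_max=10):
    # 1. start at K=k_start
    labels = KMeans(n_clusters=k_start, n_init=10, random_state=0).fit_predict(X)
    best_score = silhouette_score(X, labels)
    while len(set(labels)) < k_max:
        # 2. identify the worst cluster (lowest mean silhouette)
        sil = silhouette_samples(X, labels)
        worst = min(set(labels), key=lambda c: sil[labels == c].mean())
        mask = labels == worst
        if mask.sum() < 2:      # can't split a singleton cluster
            break
        # 3. split the worst cluster into 2 clusters
        sub = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[mask])
        new_labels = labels.copy()
        new_labels[mask] = np.where(sub == 0, worst, labels.max() + 1)
        # 4. measure the global improvement with the new clusters
        score = silhouette_score(X, new_labels)
        if score <= best_score:
            break
        # 5. improvement: keep the split and continue adding clusters
        labels, best_score = new_labels, score
    return labels

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = incremental_kmeans(X, k_start=2)
```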

Thursday, April 4, 2013

kmeans with configurable distance function: How to hack sklearn.kmeans to use a different distance function?

Like others, I was looking for a good k-means implementation where I can set the distance function.
Unfortunately, sklearn.kmeans doesn't allow you to change the cost function; Euclidean is hard-coded. 
As pointed out by the sklearn team, it is quite complex to generalize to arbitrary distance functions.
So if your distance function is cosine, which has the same mean as euclidean on normalized vectors, you can monkey patch sklearn.cluster.k_means_.euclidean_distances this way (put this code before calling kmeans.fit(X)):


from sklearn.metrics.pairwise import cosine_similarity

def new_euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False):
    return cosine_similarity(X, Y)

# monkey patch (ensure the cosine function is used instead of euclidean)
from sklearn.cluster import k_means_
k_means_.euclidean_distances = new_euclidean_distances


Warning: you need to normalize your input vectors. 
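A tiny illustration of that warning (made-up vectors): cosine ignores vector length, so scale each row to unit L2 norm before calling fit.

```python
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[3.0, 4.0],
              [1.0, 0.0]])
X_norm = normalize(X)                        # each row scaled to unit L2 norm
row_norms = np.linalg.norm(X_norm, axis=1)   # all equal to 1.0
```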

Thursday, March 21, 2013

Google Analytics Hidden Business Model

What is GA (Google Analytics) Business Model?

Analytics is used by Google to improve the AdWords product (most of Google's revenue). Nothing is free.

Here is the most common reason why GA is free:
  • Provide accountability and transparency to existing Google advertisers (adwords clients)
  • Provide confidence and prove the value of online advertising to potential new advertisers
Here are 2 other ones:
  • Provide advertiser tools to speed their site (real-user monitoring)
  • Increase ads relevance  (hidden)
Why real-user monitoring?
Since 2011, Google's ranking considers site speed. The faster sites are, the more searches people will do.

How can GA be used by google to increase ads relevance?

In order to understand this concept, let's look at google eco-system.

If a user visits several sites monitored by Google Analytics, GA information can be leveraged to better recommend ads anywhere and improve Google search ads.

With GA, you get quite interesting information about users: their behavior, the content and sites they visit, their frequency and time spent and this across multiple sites. GA is an amazing source of user information. 

In order to get better customized ads, you need:
  • session data (behavior, site visited, content type, frequency, etc.)
  • aggregate data to user data
  • get more information about the user (not simply an id: a google+ or gmail account) 
Once you get the data, the biggest challenge is merging the sessions. Here are the ways Google does it:
  • Google analytics (same sessions id between sites)
  • gmail login (get a uniq id)
  • Chrome browser (control browser, control browser global id)
  • google+ and gmail account (user info)

The next biggest challenge is leveraging the social graph, which explains Google+'s war against Facebook. 
The ad platform has to leverage GA, Gmail, Google+ etc. to improve its search.