Sunday, January 3, 2010
Gmail, a powerful target marketing tool
I thought at one point that Google's primary end goal was to launch a corporate email portal, so that companies would not need to hire highly paid sysadmins to support a mail server, and at the same time to give employees worldwide something far better than the Outlook/Exchange server that pollutes our lives. They are already doing that, but I think their real goal was target marketing, just not the traditional kind.
What could be better than a user's emails for understanding their profile and doing target marketing? They get the highest-quality information from your emails. Yes, your emails.
In my experience with more traditional target marketing at Bell and at Microcell-Lab, practitioners have little information about users and derive new information from it to try to generate better predictions. For example, they use your postal code to estimate your household income, and then use that estimate to better predict whether you will buy X or Y.
Traditional target marketing practitioners use a lift approach to find the top N most probable buyers for a given product or service, then approach those people with a promotion, an email, etc. With Gmail it is much simpler: you use profile information such as the words in a user's emails (by the way, they are parsing your emails; take a close look and you will see), feed it to a prediction engine, and show users the ads they are most likely to like or buy from, directly, because they are using your mail service.
Gmail is an amazing target marketing tool because it gets profile information directly from the user, can make far better predictions than traditional target marketing techniques, has direct access to the customer, and scales well as more users join. Your prediction is only as good as your data. What is the point of improving the algorithm if you can get better, higher-quality data or, as Google does, both? Larry Page and Sergey Brin are simply visionary target marketers!
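To make the lift idea concrete, here is a minimal sketch in Python. The customer scores and purchase flags are made-up illustration data, and the function name `top_n_lift` is hypothetical; it only shows how the purchase rate among the top N scored customers compares to the overall rate.

```python
def top_n_lift(scored, n):
    """Lift of the n highest-scored customers over the overall purchase rate.

    `scored` is a list of (model_score, bought) pairs, with bought in {0, 1}.
    Assumes at least one customer bought, otherwise the base rate is zero.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    top = ranked[:n]
    base_rate = sum(bought for _, bought in scored) / len(scored)
    top_rate = sum(bought for _, bought in top) / len(top)
    return top_rate / base_rate


# Made-up data: (predicted probability of buying, actually bought?)
customers = [(0.9, 1), (0.8, 1), (0.7, 0), (0.4, 1),
             (0.3, 0), (0.2, 0), (0.1, 0), (0.05, 0)]

lift_top2 = top_n_lift(customers, 2)   # top 2 versus everyone
lift_all = top_n_lift(customers, 8)    # targeting everyone gives no lift
```

A lift above 1 means the model concentrates buyers in the top of the ranking, which is what makes contacting only the top N worthwhile.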
Friday, January 1, 2010
Jython, pyPdf, reportlab experimentation & patch proposals
I am currently experimenting with Jython in order to manipulate PDF files. I ran into several problems and I want to share some of the solutions (time to give back). As an intro: only pure Python code and libraries work on both Python and Jython. Python packages that depend on C code are not compatible. Jython provides Python syntax on top of the Java VM. One great thing is that you can use Java classes from within Python. Jython 2.5.1 was released last September.
1) manual pdf text modification
I thought it would be simple to modify a PDF template to change some text, but I was wrong. Even if you manage to re-encode the new text and update its length, you will hit walls. It is more complex than that (xref, etc.). Most PDF libraries provide encoding helper functions, but you will have a hard time finding decoding ones, for example for ASCII85. After some time, I decided to try to make reportlab work with Jython.
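For readers who only need the missing ASCII85 decoding step, here is a small sketch. It assumes a Python 3.4+ interpreter, whose standard `base64` module provides `a85decode`; it is not something that was available in Jython 2.5.1. The helper name `decode_ascii85_stream` and the sample content are illustrative only.

```python
import base64


def decode_ascii85_stream(data: bytes) -> bytes:
    """Decode an Adobe-style ASCII85 payload terminated by b'~>'."""
    # adobe=True tells the decoder to expect the Adobe end marker.
    return base64.a85decode(data, adobe=True)


# Round trip on a made-up content stream, to show both directions.
original = b"BT /F1 12 Tf (Hello) Tj ET"
encoded = base64.a85encode(original, adobe=True)
decoded = decode_ascii85_stream(encoded)
```

Decoding the stream is only one piece of the puzzle: after editing the text, the stream length and the xref offsets still have to be kept consistent, which is where the manual approach becomes painful.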
2) reportlab import error with jython
I tried to use reportlab, a powerful library for creating PDFs, but importing reportlab.pdfgen raised this error: java.lang.ClassFormatError: java.lang.ClassFormatError: Invalid method Code length 66566 in class file reportlab/pdfbase/_fontdata$py. According to this thread on warkmail, there was a simple solution, but the patch wasn't working. You can find the working patch that I created here and proposed to the reportlab team.
3) Saving pdf to memory instead of files
In order to do in-memory PDF manipulations, I used pyPdf, the pure Python library from Mathieu Fenniak. Basically, I tried to save a canvas in memory and couldn't figure out why it wasn't working. I was doing outputStream.writelines(c._
4) Simple comparison python/jython
I was wondering how much slower Jython was compared to Python. As you can see, it is slower, and it degrades as some parameters grow (e.g. the number of pages). In this example, it also uses 4 to 6 times more memory.
5) Jython out of memory
If you get:
OutOfMemoryError: java.lang.OutOfMemoryError: GC overhead limit exceeded
use the -J-Xmx1024m Jython option to allow a larger heap size for the Java VM.
6) Threading optimization
Jython doesn't suffer from the GIL problem. Watch the video "Mindblowing Python GIL" for more information. Basically, Jython can do real multi-threading. In my context, I could easily parallelize part of my code, so I tried it using Christopher Arndt's threadpool. Unfortunately, I still haven't been able to make it faster. pyPdf wasn't designed for a truly multi-threaded environment (a PdfFileReader can't be shared between threads), which introduces limitations.
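One common way around an object that can't be shared between threads is to give each worker thread its own instance. The sketch below illustrates that pattern only: `FakeReader` is a hypothetical stand-in for a non-thread-safe reader such as PdfFileReader, and it uses the modern standard-library `concurrent.futures` pool rather than the threadpool module mentioned above. Under CPython the GIL still limits CPU-bound speedups.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeReader:
    """Hypothetical stand-in for a reader that must stay on one thread."""

    def __init__(self, path):
        self.path = path
        self.owner = threading.get_ident()

    def page_length(self, number):
        # A real reader would parse the page; here we only check ownership.
        assert self.owner == threading.get_ident(), "reader shared across threads"
        return len("%s:%d" % (self.path, number))


_local = threading.local()


def get_reader(path):
    """Return the calling thread's private reader, creating it on first use."""
    readers = getattr(_local, "readers", None)
    if readers is None:
        readers = _local.readers = {}
    if path not in readers:
        readers[path] = FakeReader(path)
    return readers[path]


def process_page(job):
    path, number = job
    return get_reader(path).page_length(number)


jobs = [("template.pdf", n) for n in range(8)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_page, jobs))
```

The cost of this design is that every thread re-opens and re-parses the file, which can easily cancel out the gain from parallelism.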
7) pyPDF profiling
pyprof2calltree is an amazing tool for profiling, as you can see in the figure. Guessing what needs optimization is a path we should never take, because most of the time we are wrong and waste our time on non-critical optimizations that make the code unreadable. I saw a great presentation from Mike C. Fletcher on Python profilers at PyCon 2009. I might try to optimize the readObject function of pyPdf. It is amazing to see the tremendous effort people put into making Python syntax available on every platform (Java -> Jython; .NET -> IronPython, etc.). It is a sign of how great Python's syntax is.
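The measure-before-optimizing approach can be sketched with the standard-library `cProfile` and `pstats` modules. The functions `read_object` and `build_document` below are hypothetical stand-ins, not pyPdf code; the point is only to show how a profile reveals where the time goes.

```python
import cProfile
import io
import pstats


def read_object(n):
    """Hypothetical stand-in for a hot function such as pyPdf's readObject."""
    return sum(i * i for i in range(n))


def build_document(pages):
    return [read_object(2000) for _ in range(pages)]


profiler = cProfile.Profile()
profiler.enable()
build_document(50)
profiler.disable()

# Collect the ten most expensive entries, sorted by cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
```

The same profiler object can also be saved with `profiler.dump_stats(...)` so that a visualization tool can read the results.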
Sunday, December 13, 2009
Business Contract 2.0 - Found a template
RIM (Régionale des ingénieurs de Montréal / Ordre des ingénieurs du Québec) organized a training workshop called "Business Contracts 2.0". The presentation by Gilles Thibault of Edilex was extremely interesting.
Currently, there is no established standard for writing contracts. Basically, there are as many forms as there are lawyers out there. Most of the time, lawyers are the only ones comfortable with them, because they wrote them and they are in legal jargon. Unfortunately, that doesn't much help the people who actually use them. Most of the time, clients and contractors have a hard time understanding them, but it is a lucrative process for lawyers with high hourly rates. According to Gilles, lawyers will disappear if they can't provide better contract services (preaching for his own business).
Edilex proposes a contract template to ensure that nothing is missing and to enforce a structure that sets clear expectations between the parties. A contract needs a table of contents; it is like a plan.
Proposed Template blocks:
- Identification & location (name + birth->legal name + rights applicable)
- Party identification (natural/legal person; company/union/co-op -> representative, liquidator, trustee, delegation of power, etc.)
- Preamble (context used for fall-back defect clauses)
- Lexicon (clarification/disambiguation & shorter sentences)
- Object (simple/multiple;utility conditions/redaction)
- Cost (adjustment/payment method/warranty/phase delivery etc.)
- Attestations party A (not obligation/warranty-improve trust-don't want to fight about obligation and duty to disclose information)
- Attestations party B
- Reciprocal obligations
- Obligations party A (Align with Business process/order of execution)
- Obligations party B
- Special provisions (orphan/specific/bi-directional)
- General provisions
- End of contract (resolution/termination)
- Start of the contract
- Duration
- Scope
- Annexes
Benefits of the template:
- Clearly defines what is included and what isn't (you can't remove sections)
- Provides a table of contents (no need to reread the whole contract)
- Finds holes and unclear points that could lead to conflicts
- Dramatically reduces the judge's room for interpretation during a conflict
- Can help you decide not to get involved in a project (risk/client honesty)
- Provides a structure/uniform frame
- Enforces clarity
Wednesday, November 18, 2009
Leaky assumption and Gradient Descent - part 2/3
Last February, I posted the first part of this post.
Basically, I was claiming that “uncorrelated inputs” was a leaky abstraction and the root cause of back-propagation's poor results when training huge or deep neural networks. In my view, this simplification is fine as long as the number of parameters remains small. My hypothesis was that optimization problems grow with the number of parameters, which implies an implicit limit on the use of this abstraction and explains those poor results.
During my research between 2001 and 2003, I focused on finding a way to train a neural network faster, as presented on the left side of the figure. Unfortunately, I didn’t find that revolutionary algorithm; I simply documented various effects of optimization problems and ways to reduce or eliminate them, with experimental results.
In 2008, I went to see Nicolas Leroux's PhD defense, and close to the end he suggested that optimization problems could be the issue, without presenting solutions, which revived my research interest.
It reminded me of one last crazy experiment I had done in 2003: I had found an algorithm with the characteristics of the right side of the figure, but it did not hold my attention much at the time. Reducing optimization problems doesn’t necessarily mean reaching a given classification error faster in time, but it should do so per iteration (i.e. per epoch).
For a year, I tried to reproduce that experiment in my free time. PA came to the rescue to discuss the underlying assumptions, to brainstorm, and to help reproduce the experiment within flayers. Flayers no longer suited our needs: the effort needed to get back into the details of the implementation was dramatically reducing our experimentation throughput. We finally decided to drop flayers and rewrite it in Python (optbprop) to ensure better collaboration and much faster experimentation. We had to make several optimizations to bring Python's speed to an acceptable level, but it was still slower than flayers (~10; see post). Even though it was slower, ultra-fast experimentation became possible, and our research speed increased dramatically as we tried to recreate the experiment. The complexity resided in the order in which the parameters were optimized: what was the right recipe?
In June 2009, some time after ICML, Jeremy joined as the third collaborator and I finally reproduced it; the results were even better. I used an output max-sensitivity ordering followed by a max hidden-sensitivity backprop strategy.
Unfortunately, it was too good to be true: Jeremy found a critical problem in the solution that led to extremely poor generalization. At that point, motivation was too low, so we decided to stop our research. I was disappointed, but it was also a true relief: I could finally move on to something else.
Without this team, I couldn’t have reached that point. I am still not 100% convinced that this research path is dead, but its value is back to a level well below my minimum motivation threshold.
What could explain such bad results? We were expecting lightning-fast learning and, at worst, small improvements compared to standard stochastic backprop. Our hypothesis is that uncorrelated stochastic learning breaks an implicit normalization process required for generalization. Basically, for each example, we update the parameters to predict its class correctly, but the update is too violent and unlearns previous examples much faster. Unfortunately, we lose the higher-level goal, which is generalization. Normalization could be reintegrated with batch learning, but we haven't experimented with it.
If it is true that “When you're ready to quit, you're closer than you think”, I might write part 3 of this post, but it will take some time. There is an interplay between machine learning and optimization, but people tend to forget it.
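The "too violent" per-example update can be illustrated with a toy example. This is not the optbprop algorithm, only a sketch under simple assumptions: a one-parameter linear model with squared error, two made-up examples that disagree, and a deliberately large learning rate. All function names are hypothetical.

```python
def gradient(w, x, y):
    """Gradient of 0.5 * (w * x - y) ** 2 with respect to w."""
    return (w * x - y) * x


def stochastic_epoch(w, data, lr):
    """Update w after every example, in order."""
    for x, y in data:
        w -= lr * gradient(w, x, y)
    return w


def batch_epoch(w, data, lr):
    """Update w once, using the gradient averaged over all examples."""
    mean_grad = sum(gradient(w, x, y) for x, y in data) / len(data)
    return w - lr * mean_grad


def mean_loss(w, data):
    """Average squared-error loss over the whole data set."""
    return sum(0.5 * (w * x - y) ** 2 for x, y in data) / len(data)


# Two made-up examples that pull w in different directions.
data = [(1.0, 1.0), (1.0, 3.0)]

# With lr=1.0 each stochastic step fits the current example exactly,
# which throws away what was learned from the previous one.
w_sgd = stochastic_epoch(0.0, data, lr=1.0)

# The batch step averages both gradients and lands on a compromise.
w_batch = batch_epoch(0.0, data, lr=1.0)
```

Here the stochastic pass ends up fitting only the last example, while the averaged update gives a lower loss over both examples, which matches the intuition that batch learning reintroduces some normalization.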
Monday, November 16, 2009
Toward better Web Monitoring Solutions
Web applications are slowly becoming the new standard. No more installations or upgrades: they are accessible from any internet-connected device. While they are the Holy Grail for users, web applications can be a nightmare for engineers, as ensuring quality of service becomes harder.
In fact, web applications create a high level of testing complexity, bringing new challenges to quality and availability of service.
As more and more businesses rely on web applications, techniques such as real-time web monitoring, incident detection and root cause analysis have become critical.
We will present these new problems in detail, followed by a short history of techniques used to measure and estimate the quality of web-based applications. We will review the most popular monitoring technologies, pointing out their advantages and shortcomings.
This presentation will be given in collaboration with Sebastien Pierre.
Knowledge Workers - Talent is not patient, and it is not faithful
Of course, neither HR, managers, nor VPs are the ones struggling to compensate when critical team members leave; the teams beneath them are. Those decision makers have to remember that knowledge workers are not a free lunch, and that mismanaging them has consequences.
Better management of knowledge workers leads to much more productive teams and low turnover, while the opposite could cause your decline. What is so hard about creating a win/win approach instead of the lose/lose approach that so many companies seem to fall into? Maybe a generational clash, or simply missing competencies.
Wednesday, September 2, 2009
What's the relationship between Machine Learning and Data-Mining
- Data-Mining (DM) is the process of extracting patterns from data. The main goal is to understand relationships, validate models or identify unexpected relationships.
- Machine Learning (ML) algorithms allow computers to learn from data. The learning process consists of extracting patterns, but the end goal is to use that knowledge to make predictions on new data.
