Thursday, December 30, 2010

List of Best Paper awards in CS/AI/ML conferences

Below is a great list of best paper awards for WWW, SIGIR, CIKM, AAAI, CHI, KDD, SIGMOD, ICML, VLDB, IJCAI, and UIST since 1996:

http://jeffhuang.com/best_paper_awards.html

Interesting thing to note: Google is ranked last in frequency, Microsoft first.

This needs NIPS and possibly UAI added to it.

Posted via email from nealrichter's posterous

Wednesday, December 29, 2010

Managing Open Source Licenses

From time to time I have helped companies do open source audits of their own source code. Basically this consists of scanning the proprietary code to find open source code embedded in it.

These code audits are particularly important during software releases and M&A events. I've helped companies do this for releases and have been on both sides of audits driven by M&A events.

If the developers have kept the attributions with any open source code they have re-used, then grep is a fine tool for auditing. However, this is a big IF. If your developers are sloppy and do not keep the attributions (i.e., copyright and license notices) with code they lift from open source, you have a problem. A software tool needs to be used to scan the corporate source for matches against open source repositories.

There are at least three companies providing software to do this.

Ideally the outcome of this process is as follows:
  1. A clear company policy is set on what open source licenses are allowed and how developers can use open source code or components.
  2. The corporate code is cleanly annotated with any third party attributions (see below).
  3. Open Source code that has bad licenses for commercial usage is identified and removed before release.
  4. A Bill of Materials is created for each release listing third-party software in the release.
  5. Necessary copyright or other notices appear in About dialogs, manuals or product websites.

Example comment block:

/*
 * XYZ.com Third-party or Open Source Declaration
 * Name: Bart Simpson
 * Date of first commit: 04/25/2009
 * Release: 3.5 "The Summer Lager Release"
 * Component: tinyjson
 * Description: C++ JSON object serializer/deserializer
 * Homepage: http://blog.beef.de/projects/tinyjson/
 * License: MIT style license
 * Copyright: Copyright (c) 2008 Thomas Jansen (thomas@beef.de)
 * Note: See below for original declarations from the code
 */

If the above were upgraded to a javadoc-style comment, then a tool could be built to auto-magically generate a Bill of Materials for each release.
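
For example, something like this (the @thirdparty tag names are my own invention here, not an existing standard):

/**
 * XYZ.com Third-party or Open Source Declaration
 *
 * @thirdparty.name        Bart Simpson
 * @thirdparty.date        04/25/2009
 * @thirdparty.release     3.5 "The Summer Lager Release"
 * @thirdparty.component   tinyjson
 * @thirdparty.description C++ JSON object serializer/deserializer
 * @thirdparty.homepage    http://blog.beef.de/projects/tinyjson/
 * @thirdparty.license     MIT style license
 * @thirdparty.copyright   Copyright (c) 2008 Thomas Jansen (thomas@beef.de)
 */

The Bill of Materials generator then only needs to grep for the tags. A rough Java sketch, assuming the tag format above and scanning a few common source extensions:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Walks a source tree and prints one line per @thirdparty.* tag found,
// yielding a crude tab-separated Bill of Materials for the release.
public class BomGenerator {
    private static final Pattern TAG =
            Pattern.compile("@thirdparty\\.(\\w+)\\s+(.*)");

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        try (var files = Files.walk(root)) {
            files.filter(p -> {
                String s = p.toString();
                return s.endsWith(".java") || s.endsWith(".cpp") || s.endsWith(".h");
            }).forEach(BomGenerator::scan);
        }
    }

    private static void scan(Path file) {
        try {
            for (String line : Files.readAllLines(file)) {
                Matcher m = TAG.matcher(line);
                if (m.find()) {
                    System.out.println(file + "\t" + m.group(1) + "\t" + m.group(2).trim());
                }
            }
        } catch (IOException e) {
            System.err.println("skipping " + file + ": " + e.getMessage());
        }
    }
}

Run it at the root of the repo and sort the output by component to get the release's third-party list.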

There is one grey area in all this: how to handle developers using code from discussion sites like PHP.net, CodeProject, StackOverflow and similar sites. Generally, code posted in these types of forums has no defined license. In that case the code is copyrighted by either the site or the author of the post... and developers should not use it without getting an explicit license. However, developers generally feel that people put the code up there to share. This conflict means the company policy on usage of this type of code must be clearly communicated to all developers.

This is a nice review article of other considerations for open source auditing:


Posted via email from nealrichter's posterous

Friday, December 17, 2010

Stochastic Universal Sampling/Selection

Stochastic Universal Sampling is a method of weighted random sampling exhibiting less bias and spread than classic roulette-wheel sampling. The intuition is a roulette wheel with n equally spaced steel balls spinning in unison around the wheel: a single spin selects all n items at once. This method has better properties and is more efficient than doing repeated samples from the wheel, with or without replacement of the selected items.

Stochastic universal sampling

Baker, James E. (1987). "Reducing Bias and Inefficiency in the Selection Algorithm". Proceedings of the Second International Conference on Genetic Algorithms and their Application (Hillsdale, New Jersey: L. Erlbaum Associates): 14–21.

Reference implementations on the web are scarce, so here are a few:

https://github.com/dwdyer/watchmaker/blob/master/framework/src/java/main/org/uncommons/watchmaker/framework/selection/StochasticUniversalSampling.java

See the SUSSelection.java buried in the latest tarball.
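
Since they're scarce, here is my own minimal sketch in Java of the single-spin idea described above. The method signature and weight handling are my choices, not taken from either implementation linked:

import java.util.Random;

// Stochastic Universal Sampling: one random spin places n equally spaced
// pointers on the cumulative weight wheel, selecting n items in one pass.
public class StochasticUniversalSampler {
    // Returns the indices of the n selected items (duplicates are possible,
    // since heavy items can span several pointers).
    public static int[] select(double[] weights, int n, Random rng) {
        double total = 0.0;
        for (double w : weights) total += w;
        double step = total / n;                 // spacing between pointers
        double start = rng.nextDouble() * step;  // the single random draw
        int[] picks = new int[n];
        double cumulative = 0.0;
        int i = -1;
        for (int k = 0; k < n; k++) {
            double pointer = start + k * step;
            while (cumulative <= pointer) {      // advance the wheel
                i++;
                cumulative += weights[i];
            }
            picks[k] = i;
        }
        return picks;
    }
}

Note the cost: one random number and a single pass over the weights, versus n separate spins for repeated roulette-wheel selection.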

Posted via email from nealrichter's posterous

Monday, November 22, 2010

Computing, economics and the financial meltdown (a collection of links)

This editor's letter from CACM last year is interesting: The Financial Meltdown and Computing by Moshe Y. Vardi

Information technology has enabled the development of a global financial system of incredible sophistication. At the same time, it has enabled the development of a global financial system of such complexity that our ability to comprehend it and assess risk, both localized and systemic, is severely limited. Financial-oversight reform is now a topic of great discussion. The focus of these talks is primarily over the structure and authority of regulatory agencies. Little attention has been given to what I consider a key issue—the opaqueness of our financial system—which is driven by its fantastic complexity. The problem is not a lack of models. To the contrary, the proliferation of models may have created an illusion of understanding and control, as is argued in a recent report titled "The Financial Crisis and the Systemic Failure of Academic Economics."

Krugman's essay at the time, How Did Economists Get It So Wrong?, gave a nice history of economic ideas, the models behind them, and his interpretation of their correctness.

The theoretical model that finance economists developed by assuming that every investor rationally balances risk against reward — the so-called Capital Asset Pricing Model, or CAPM (pronounced cap-em) — is wonderfully elegant
[snip]
Economics, as a field, got in trouble because economists were seduced by the vision of a perfect, frictionless market system.
[snip]
H. L. Mencken: “There is always an easy solution to every human problem — neat, plausible and wrong.”
I read this months ago and it's been percolating in my thoughts since then: My Manhattan Project - How I helped build the bomb that blew up Wall Street by Michael Osinski. Osinski wrote much of the software and models used to form CMOs and CDOs. Essentially the software aggregated debt instruments from mortgage and other debt markets and allowed a bond designer to issue a tailor-made portfolio of debt while mitigating its default risk via that aggregation. He called it his sausage grinder.

“You put chicken into the grinder”—he laughed with that infectious Wall Street black humor—“and out comes sirloin.”

Here's a large collection of links from that period that are worth reading. My thought at the moment is this nugget from Twitter:

Posted via email from nealrichter's posterous

Wednesday, October 27, 2010

Review of "Learning to Rank with Partially-Labeled Data"


I've been attending the University of Utah Machine Learning seminar (when I can) this fall. PhD student Piyush Kumar Rai is organizing it.

I volunteered to take the group through Learning to Rank with Partially-Labeled Data by Duh & Kirchhoff. I have some experience researching and implementing LTR algorithms, mostly using reinforcement learning or ant-system type approaches. Some general intro here.

The paper presents the main transductive learning algorithm as a framework, then fills in the blanks with Kernel PCA and RankBoost. Several kernels are used: linear, polynomial, radial basis function, and kNN diffusion. RankBoost learns a kind of ensemble of 'weak learners' with simple thresholds.

The main reason to read the paper if you are already familiar with LTR is the use of the transductive algorithm.

Note the DISCOVER() & LEARN() functions. These are the unsupervised and supervised algorithm blanks that they fill with Kernel PCA and RankBoost. What the first actually does is learn a function, call it EXTRACT(), that can extract or create features for later use. They do show that the basic idea of layering unlabeled data in with labeled data is a net gain.
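
Here is the shape of the framework as I read it, sketched in Java. The DISCOVER/LEARN names are from the paper; the types and signatures are my own framing:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Per-query transductive loop: DISCOVER (Kernel PCA in the paper) learns an
// EXTRACT() from the unlabeled test documents; LEARN (RankBoost in the paper)
// trains a scorer on labeled training data re-represented via that extractor.
interface Extractor { double[] apply(double[] raw); }
interface Scorer { double score(double[] features); }
interface Discover { Extractor run(List<double[]> unlabeledDocs); }
interface Learn { Scorer run(List<double[]> docs, List<Integer> labels); }

final class TransductiveLTR {
    static List<double[]> rankOneQuery(Discover discover, Learn learn,
            List<double[]> testDocs, List<double[]> trainDocs,
            List<Integer> trainLabels) {
        Extractor extract = discover.run(testDocs);          // unsupervised step
        List<double[]> projected = new ArrayList<>();
        for (double[] d : trainDocs) projected.add(extract.apply(d));
        Scorer scorer = learn.run(projected, trainLabels);   // supervised step
        List<double[]> ranked = new ArrayList<>(testDocs);   // score & sort, best first
        ranked.sort(Comparator.comparingDouble(
                (double[] d) -> -scorer.score(extract.apply(d))));
        return ranked;
    }
}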

There are some issues with the paper. First, the computational time performance, as they admit, is not good. The other is that their use of Kernel PCA in an information retrieval context is a bit naive IMO. The IR literature is full of hard-won knowledge about extracting decent features from documents (see this book for example), and that knowledge is mostly ignored here.

The more confusing thing is the use of k-nearest-neighbor diffusion kernels. Basically they take the vectors of the documents, form a matrix from Euclidean distances, and then random-walk the matrix for a set number of time steps. The Kernel PCA then takes this 'kernel' output and solves the eigenvalue problem to get the eigenvectors. This all seems a round-about way of saying they approximated the Perron-Frobenius eigenvector (sometimes called PageRank) by iterating the matrix a set number of times and zeroing out low-order cells. Or at least I see no effective difference between what they did and what I just described. Basically they just make the matrix sparse so it is easier to solve (i.e., this is the dual).
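
For concreteness, here is my reading of that kernel construction as a sketch. The exponential similarity and the exact normalization are my assumptions and may differ from the paper's details:

import java.util.Arrays;

// Sketch of a kNN-diffusion kernel: pairwise Euclidean similarities,
// sparsified to each row's k nearest neighbors, row-normalized into a
// random-walk matrix P, then iterated t steps (K = P^t).
public class KnnDiffusion {
    public static double[][] kernel(double[][] docs, int k, int t) {
        int n = docs.length;
        double[][] p = new double[n][n];
        for (int i = 0; i < n; i++) {
            double[] sim = new double[n];
            for (int j = 0; j < n; j++) sim[j] = Math.exp(-dist(docs[i], docs[j]));
            sim[i] = 0.0;                                  // no self-loop
            double kth = kthLargest(sim, k);
            double rowSum = 0.0;
            for (int j = 0; j < n; j++) {                  // keep only k nearest
                p[i][j] = sim[j] >= kth ? sim[j] : 0.0;
                rowSum += p[i][j];
            }
            for (int j = 0; j < n; j++) p[i][j] /= rowSum; // row-normalize
        }
        double[][] kMat = p;                               // P^1
        for (int step = 1; step < t; step++) kMat = multiply(kMat, p);
        return kMat;
    }

    static double dist(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    static double kthLargest(double[] v, int k) {          // ties may keep extras
        double[] c = v.clone();
        Arrays.sort(c);
        return c[Math.max(0, c.length - k)];
    }

    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        double[][] r = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int l = 0; l < n; l++)
                for (int j = 0; j < n; j++)
                    r[i][j] += a[i][l] * b[l][j];
        return r;
    }
}

Iterating P a fixed number of steps is exactly the power-iteration flavor described above; let t grow and (under the usual connectivity conditions) the rows converge toward the stationary Perron-Frobenius distribution.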

Their use of various classic IR features like TFIDF, BM25, etc. needed help. There's plenty of IR wisdom on how to use such features; why let DISCOVER() wander about attempting to rediscover it? The results were also muddled, with only one of the three data sets showing a significant improvement over a baseline technique.

All that aside, it's worth a read for the intro to the transductive algorithm used on an IR-centric task.

Posted via email from nealrichter's posterous

Tuesday, October 26, 2010

Stanford Computational Advertising course - Fall 2010

Andrei Broder and Vanja Josifovski of Yahoo Research Labs are again offering a Fall course on Computational Advertising at Stanford.

Great intro to the area.

Posted via email from nealrichter's posterous

Monday, August 16, 2010

Strategic Marketing

I highly recommend this ExecEd course at MIT Sloan: Strategic Marketing for the Technical Executive.  

Duncan Simester taught it. It looks to be a condensed form of this course: 15.810 Marketing Management

Key takeaways:
1) Making decisions by experimentation versus meetings+intuition is crucial.
2) Don't assume your role is to know the answer. Your role is really to work out how to find the answer as quickly as possible.
3) Brand is unimportant when customers can observe that you are meeting their needs.
4) Brand is important when they can't search/observe and must reason with less data.
5) Don't price your products based upon cost or competition; work out your true value to the customer.
6) Your actual efficiencies might be different than you assume.
7) Experiment and iterate as often as you can.

Posted via email from nealrichter's posterous

Tuesday, July 27, 2010

Dissertation done!

The big event of the spring was defending and completing the dissertation below.
Really happy to be done.

Advice for working professionals attempting a PhD:
1) pick something relevant to your work.
2) think twice about a theoretical topic.
3) don't make it longer/bigger than necessary.
4) don't grow your family during this time.

I did not follow this advice, and it likely resulted in a four-year delay. The outcome was great, though, and the topic is now relevant to my new job at Rubicon.

http://nealrichter.com/research/dissertation/

On Mutation and Crossover in the Theory of Evolutionary Algorithms

Abstract:
The Evolutionary Algorithm is a population-based metaheuristic optimization algorithm. The EA employs mutation, crossover and selection operators inspired by biological evolution. It is commonly applied to find exact or approximate solutions to combinatorial search and optimization problems.

This dissertation describes a series of theoretical and experimental studies on a variety of evolutionary algorithms and models of those algorithms. The effects of the crossover and mutation operators are analyzed. Multiple examples of deceptive fitness functions are given where the crossover operator is shown or proven to be detrimental to the speedy optimization of a function. While other research monographs have shown the benefits of crossover on various fitness functions, this is one of the few (if not the only one) doing the inverse.

A background literature review is given of both population genetics and evolutionary computation with a focus on results and opinions on the relative merits of crossover and mutation. Next, a family of new fitness functions is introduced and proven to be difficult for crossover to optimize. This is followed by the construction and evaluation of executable theoretical models of EAs in order to explore the effects of parameterized mutation and crossover.

These models link the EA to the Metropolis-Hastings algorithm. Dynamical systems analysis is performed on models of EAs to explore their attributes and fixed points. Additional crossover deceptive functions are shown and analyzed to examine the movement of fixed points under changing parameters. Finally, a set of online adaptive parameter experiments with common fitness functions is presented.

Finalized April 19, 2010
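
For readers new to the area, a minimal generational EA of the kind studied looks something like this. It is my own sketch, not code from the dissertation:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Minimal generational EA: bitstring individuals, binary tournament
// selection, one-point crossover, and independent per-bit mutation.
public class SimpleEA {
    interface Fitness { double eval(boolean[] genome); }

    public static boolean[] run(Fitness f, int genomeLen, int popSize,
                                int generations, double mutRate, Random rng) {
        List<boolean[]> pop = new ArrayList<>();
        for (int i = 0; i < popSize; i++) pop.add(randomGenome(genomeLen, rng));
        boolean[] best = pop.get(0);
        for (boolean[] ind : pop) if (f.eval(ind) > f.eval(best)) best = ind;
        for (int g = 0; g < generations; g++) {
            List<boolean[]> next = new ArrayList<>();
            while (next.size() < popSize) {
                boolean[] child = crossover(tournament(pop, f, rng),
                                            tournament(pop, f, rng), rng);
                mutate(child, mutRate, rng);
                next.add(child);
                if (f.eval(child) > f.eval(best)) best = child;
            }
            pop = next;
        }
        return best;
    }

    static boolean[] randomGenome(int len, Random rng) {
        boolean[] g = new boolean[len];
        for (int i = 0; i < len; i++) g[i] = rng.nextBoolean();
        return g;
    }

    static boolean[] tournament(List<boolean[]> pop, Fitness f, Random rng) {
        boolean[] a = pop.get(rng.nextInt(pop.size()));
        boolean[] b = pop.get(rng.nextInt(pop.size()));
        return f.eval(a) >= f.eval(b) ? a : b;
    }

    static boolean[] crossover(boolean[] a, boolean[] b, Random rng) {
        int cut = rng.nextInt(a.length);       // one-point crossover
        boolean[] c = new boolean[a.length];
        for (int i = 0; i < a.length; i++) c[i] = i < cut ? a[i] : b[i];
        return c;
    }

    static void mutate(boolean[] g, double rate, Random rng) {
        for (int i = 0; i < g.length; i++)     // flip each bit with prob. rate
            if (rng.nextDouble() < rate) g[i] = !g[i];
    }
}

The deceptive functions in the dissertation are, roughly, fitness functions where that crossover step actively hurts the search.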

Posted via email from nealrichter's posterous