Thursday, December 30, 2010

List of Best Paper awards in CS/AI/ML conferences

The below is a great list of best paper awards for WWW, SIGIR, CIKM, AAAI, CHI, KDD, SIGMOD, ICML, VLDB, IJCAI and UIST since 1996.

Interesting thing to note: Google is ranked last in frequency, Microsoft first.

This needs NIPS and possibly UAI added to it.

Posted via email from nealrichter's posterous

Wednesday, December 29, 2010

Managing Open Source Licenses

From time to time I have helped companies audit their own source code for open source code. Basically this consists of scanning the codebase to find code that originated in open source projects.

These code audits are particularly important during software releases and M&A events. I've helped companies do this for releases and been on both sides of M&A-driven audits.

If the developers have kept the attributions with any open source code they have re-used, then grep is a fine tool for auditing. However, this is a big IF. If your developers are sloppy and do not keep the attributions (i.e. copyright and license notices) with code they lift from open source, you have a problem. In that case a software tool needs to be used to scan the corporate source for hits in open source repositories.
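For the easy case, where attributions were kept, the grep-style pass can be sketched in a few lines of Python. This is only a rough sketch: the marker patterns, the function names and the list of file suffixes are assumptions for illustration, not a complete list of license notices.

```python
import re
from pathlib import Path

# Hypothetical marker list; extend it to cover the notices your policy cares about.
MARKERS = re.compile(
    r"copyright\s+\(c\)|licensed under|GNU General Public License|MIT License|Apache License",
    re.IGNORECASE,
)

def find_attributions(text):
    """Return (line_number, line) pairs for lines that look like license or copyright notices."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if MARKERS.search(line)
    ]

def scan_tree(root, suffixes=(".c", ".cpp", ".h", ".java", ".py")):
    """Walk a source tree and map each matching file path to its attribution hits."""
    hits = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            found = find_attributions(path.read_text(errors="ignore"))
            if found:
                hits[str(path)] = found
    return hits
```

Like grep, this only finds notices that are still present in the code. It cannot detect code that was copied with its attribution stripped.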

There are at least three companies providing software to do this:

Ideally the outcome of this process is as follows:
  1. A clear company policy is set on what open source licenses are allowed and how developers can use open source code or components.
  2. The corporate code is cleanly annotated with any third party attributions (see below).
  3. Open Source code that has bad licenses for commercial usage is identified and removed before release.
  4. A Bill of Materials is created for each release listing third-party software in the release.
  5. Necessary copyright or other notices appear in About dialogs, manuals or product websites.

Example comment block:


* Third-party or Open Source Declaration
* Name: Bart Simpson
* Date of first commit: 04/25/2009
* Release: 3.5 “The Summer Lager Release”
* Component: tinyjson
* Description: C++ JSON object serializer/deserializer
* Homepage:
* License: MIT style license
* Copyright: Copyright (c) 2008 Thomas Jansen (
* Note: See below for original declarations from the code


If the above were upgraded to be in a javadoc style comment then a tool could be built to auto-magically generate a Bill of Materials for each release.
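A rough sketch of such a tool in Python might look like the following. It assumes the declaration sits in a `/** ... */` block that contains the title line "Third-party or Open Source Declaration" and uses `* Field: value` lines as in the example. That convention, the function names and the output format are all assumptions for illustration.

```python
import re

# Assumed convention: a /** ... */ block that contains this title line.
DECLARATION_TITLE = "Third-party or Open Source Declaration"
BLOCK = re.compile(r"/\*\*(.*?)\*/", re.DOTALL)
FIELD = re.compile(r"^\s*\*\s*([A-Za-z][A-Za-z -]*?):\s*(.*)$")

def parse_declarations(source):
    """Return one dict of field -> value per declaration block found in source."""
    declarations = []
    for block in BLOCK.findall(source):
        if DECLARATION_TITLE not in block:
            continue  # ordinary doc comment, not a third-party declaration
        fields = {}
        for line in block.splitlines():
            match = FIELD.match(line)
            if match:
                fields[match.group(1).strip()] = match.group(2).strip()
        declarations.append(fields)
    return declarations

def bill_of_materials(declarations):
    """Format declarations as sorted, de-duplicated 'component | license | copyright' lines."""
    rows = {
        (d.get("Component", "?"), d.get("License", "?"), d.get("Copyright", "?"))
        for d in declarations
    }
    return [" | ".join(row) for row in sorted(rows)]
```

Running `parse_declarations` over every source file in a release and passing the combined list to `bill_of_materials` would give one line per third-party component.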

There is one grey area in all this: how to handle developers using code from discussion sites like CodeProject, StackOverflow and similar sites. Generally code posted in these types of forums has no defined license. In this case the code is copyrighted either by the site or by the author of the post, and developers should not use the code without getting an explicit license. However, developers generally feel that people put the code up there to share. This conflict means the company policy on usage of this type of code must be clearly communicated to all developers.

This is a nice review article of other considerations for open source auditing:


Friday, December 17, 2010

Stochastic Universal Sampling/Selection

Stochastic Universal Sampling is a method of weighted random sampling exhibiting less bias and spread than classic roulette wheel sampling. The intuition is a roulette wheel with n equally spaced steel balls spinning in unison around the wheel. This method has better properties and is more efficient than doing repeated samples from the wheel, with or without replacement of the selected items.
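To make the intuition concrete, here is a minimal Python sketch of the idea: one random spin places n equally spaced pointers on the wheel, and each pointer selects the slot it lands in. The function name, its parameters and the handling of edge cases are my own choices for illustration, not taken from Baker's paper.

```python
import random

def stochastic_universal_sampling(weights, n, rng=random):
    """Select n indices with probability proportional to weights using one random draw."""
    total = float(sum(weights))
    if n <= 0 or total <= 0 or any(w < 0 for w in weights):
        raise ValueError("need n > 0 and non-negative weights with a positive sum")
    step = total / n                 # distance between adjacent pointers
    start = rng.random() * step      # a single spin positions every pointer
    selected = []
    index = 0
    cumulative = weights[0]          # right edge of the current wheel slot
    for k in range(n):
        pointer = start + k * step
        # Advance to the wheel slot that contains this pointer.
        while cumulative <= pointer and index < len(weights) - 1:
            index += 1
            cumulative += weights[index]
        selected.append(index)
    return selected
```

Because the pointers are evenly spaced, an item with expected count n * w / total is picked either the floor or the ceiling of that value, which is where the reduced spread comes from. The result comes back in wheel order, so shuffle it if the order of the selected items matters.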

Stochastic universal sampling

Baker, James E. (1987). "Reducing Bias and Inefficiency in the Selection Algorithm". Proceedings of the Second International Conference on Genetic Algorithms and their Application (Hillsdale, New Jersey: L. Erlbaum Associates): 14–21.

Reference implementations on the web are scarce, so here are a few:

See the buried in the latest tarball.
