aicoder

Break.com Case Study on ad serving.

2013-11-21T20:59:00.000-08:00

Guillaume Roels of the UCLA Anderson School of Management created a 2008 case study on ad serving for use in his MBA courses.

pdf_GR07.pdf

The case covers a 90-day period in the life of an ad server matching supply and demand. There are three associated data elements:

Traffic data for the four inventory zones. This gives expected "supply" on a daily basis in the form of ad unit requests (impressions) per zone per day on average.
A set of booked orders with flight dates, budget and impression allocation limits per zone.
A set of proposed orders with the same data elements as orders.

For convenience I have put the data from the case study in an Excel file as well as CSV files.

There are many interesting questions to ask with the case:

How much daily or weekly surplus/unsold inventory is available per zone?
How many of the booked orders will be completed at a given level of supply?
How much total revenue results from fully completed orders?
How much total revenue results if partial credit is given for incomplete orders?
How many of the proposals should be accepted such that they will complete?
If the orders and proposals were treated equally which of the total set would complete?
How much revenue results given either completed and/or partially complete orders?
(advanced) Is there an optimal serving schedule (other than price order) that would result in more total revenue or more completed orders?

While the MBA student might approach this with a spreadsheet analysis, writing code is likely a better approach. When implementing the simulation in code one has to make a few assumptions:

ASAP pacing of orders versus even pacing of volume per day across the flight dates.
Which traffic level to use per day (low/baseline/high).
The mechanism to decide the priority of orders to be served in a given day/zone.
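The three assumptions above can be made concrete with a small simulation. This is only a sketch under simplifying assumptions that are not in the case itself: each order targets a single zone, supply is a fixed number of impressions per zone per day, and priority within a zone is highest CPM first. The `Order` fields and the two sample orders are made up for illustration and are not the case-study data.

```python
from dataclasses import dataclass

@dataclass
class Order:
    name: str
    zone: str
    start: int          # first flight day, inclusive
    end: int            # last flight day, inclusive
    goal: int           # booked impressions
    cpm: float          # price per 1000 impressions
    delivered: int = 0

def simulate(orders, supply, days, pacing="even"):
    """Serve orders day by day; within a zone the highest CPM is served first."""
    for day in range(days):
        for zone, available in supply.items():
            live = [o for o in orders if o.zone == zone and o.start <= day <= o.end]
            for o in sorted(live, key=lambda o: o.cpm, reverse=True):
                remaining = o.goal - o.delivered
                if pacing == "even":
                    days_left = o.end - day + 1
                    want = -(-remaining // days_left)   # ceiling division
                else:                                   # "asap"
                    want = remaining
                served = min(want, available)
                o.delivered += served
                available -= served
    return orders

def revenue(orders, partial_credit=False):
    """Revenue from completed orders, optionally crediting partial delivery."""
    return sum(o.cpm * o.delivered / 1000
               for o in orders
               if partial_credit or o.delivered >= o.goal)

def make_orders():
    # Made-up orders, not the case-study data.
    return [Order("A", "home", 0, 9, 6000, 2.0),
            Order("B", "home", 0, 4, 4000, 1.0)]

for pacing in ("even", "asap"):
    done = simulate(make_orders(), {"home": 1000}, days=10, pacing=pacing)
    print(pacing, [(o.name, o.delivered) for o in done],
          revenue(done), revenue(done, partial_credit=True))
```

With these toy numbers order A completes under both pacing modes, while the cheaper order B receives 2000 impressions under even pacing and none under ASAP pacing, so the pacing assumption alone changes the partial-credit revenue.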

This problem relates to both "matchmaking" and "mechanism design" in economics. It is also a simple example of the problems one might solve in software in "computational advertising". See below for a course with lecture slides touring the discipline.

Stanford MS&E 239: Introduction to Computational Advertising

Von Neumann on Empirics vs Hayek on Abstraction

2012-12-13T09:47:00.001-08:00

Recently Carson Chow posted a short quote on the musings of Von Neumann on straying into pure mathematics when an idea originated in an empirical context. [Hat tip Daniel Lemarie for retweeting it around.]

“[M]athematical ideas originate in empirics, although the genealogy is sometimes long and obscure. But, once they are so conceived, the subject begins to live a peculiar life of its own and is better compared to a creative one, governed by almost entirely aesthetic considerations, than to anything else, and, in particular, to an empirical science. There is, however, a further point which, I believe, needs stressing. As a mathematical discipline travels far from its empirical source, or still more, if it is a second and third generation only indirectly inspired by ideas coming from ‘reality’, it is beset with very grave dangers. It becomes more and more purely aestheticising, more and more purely l’art pour l’art. This need not be bad, if the field is surrounded by correlated subjects, which still have closer empirical connections, or if the discipline is under the influence of men with an exceptionally well-developed taste. But there is a grave danger that the subject will develop along the line of least resistance, that the stream, so far from its source, will separate into a multitude of insignificant branches, and that the discipline will become a disorganised mass of details and complexities. In other words, at a great distance from its empirical source, or after much ‘abstract’ inbreeding, a mathematical subject is in danger of degeneration.”

The quote was unsourced in the post. It originally appeared in Von Neumann's 1947 essay The Mathematician, which is republished here:
Part 1 and Part 2

Recently I was reading Emanuel Derman's book Models.Behaving.Badly.: Why Confusing Illusion with Reality Can Lead to Disaster, on Wall Street and in Life

Friedrich Hayek, the Austrian economist who received the 1974 Nobel Memorial Prize in Economics, pointed out that in the physical sciences we know the macroscopic through concrete experience and proceed to the microscopic by abstraction. [...] In economics, Hayek argued, the order of abstraction should be reversed: we know the individual agents and players from concrete personal experience, and the macroscopic "economy" is the abstraction. If the correct way to proceed is from the concrete to abstract, he argued, in economics we should begin with agents and proceed to economies and markets rather than vice versa.

These are highly contrasting views!

Hayek argued that the market economy was not designed but rather emerged from the fairly simple actions of actors within the economy (negotiating on price, etc.). Hayek is essentially saying that we should be able to reproduce the complexity of a given economy if only we can capture an accurate description of all of the 'rules' the actors are obeying.

Looking back on my own intellectual development I was first drawn to these emergent ideas within the 'bottom-up' branch of Artificial Intelligence. It was fascinating to set up a simple rule in cellular automata and watch it generate amazing complexity. Similarly the basic Darwinian model of evolution can be duplicated in a very short bit of code that given an objective function can solve problems indirectly via a massive number of randomized trials directed via the function.
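To show how short such code can be, here is a sketch of one of the simplest evolutionary algorithms, the (1+1) EA: keep one bit-string parent, mutate it at random, and keep the child whenever the objective function says it is no worse. The choice of this particular algorithm, the function name `one_plus_one_ea` and the OneMax objective (count the 1 bits) are assumptions made for illustration, not something taken from the post.

```python
import random

def one_plus_one_ea(fitness, n_bits, max_evals=10_000, target=None, seed=0):
    """(1+1) evolutionary algorithm over bit strings, maximising `fitness`."""
    rng = random.Random(seed)
    parent = [rng.randint(0, 1) for _ in range(n_bits)]
    best = fitness(parent)
    for _ in range(max_evals):
        if target is not None and best >= target:
            break
        # Randomised trial: flip each bit independently with probability 1/n.
        child = [b ^ (rng.random() < 1.0 / n_bits) for b in parent]
        score = fitness(child)
        # Selection: the objective function directs which trial survives.
        if score >= best:
            parent, best = child, score
    return parent, best

# OneMax: the objective is simply the number of 1 bits.
solution, score = one_plus_one_ea(sum, n_bits=20, target=20)
print(solution, score)
```

The algorithm never looks inside the problem; it only sees the scores returned by the objective function, which is the "indirect" solving described above.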

Lately I'm a 'data scientist', which means that I'm attempting to extract models from data. Common methods attempt to wash out enough complexity in the data to derive a model that a person can inspect and understand. This is fundamentally grounded in empirics, as the extracted model is tested for accurate prediction against other data. The inspected models are used mostly for 'storytelling', helping users of the system understand what is being learned from the data, yet we leave specific predictions exclusively to the algorithms.

Of course one can't resolve these abstract differences in a blog post; that's a tall order.

I would however assert that every learning algorithm should have two fundamentals.

It accurately predicts outcomes given input data
It emits a model that is inspectable and can inform the building of better 'toy models' of the system under study.

No one disputes the first, yet many seem all too happy to ignore the second and are OK with building increasingly powerful black boxes with 'big data' machinery.

Posted via email from aicoder - nealrichter's blog

Software pattern for proportional control of QPS in a webservice

2012-01-07T09:13:00.001-08:00

Problem Statement

Imagine you are writing a webservice that must call a back-end service, such as a data store. Let's assume (without loss of generality) that the data store (and the hardware supporting it) has some limit on the QPS it can handle. We'd like the client system (your web service) to impose a limit on the QPS to the back-end service. Also assume that this is a distributed webservice: lots of worker threads on lots of different machines.

Requirements

Given a goal in QPS manage the maximum outgoing requests per second to that goal.
Be fast. Maintain a fast controller settling time when the goal or queries change.
Be adaptive. Respond to swings of incoming requests that need to be queried against the service.
Be distributed. Locally active against global numbers without knowing the number of workers.
Be robust. Handle additions and subtractions of worker/clients to the system without coordination. Minimize overshoot.

Design

Assume that querying this back-end service, while valuable and mostly needed, is optional under duress. It's far more important for your front-end service to be responsive and return some error or 'No Content' than to hang on a busted back-end. As a result we'll use a sampling rate 'r' to denote the fraction of the time that the web service should query the back-end. Under normal conditions this rate is 1 (100%). Also assume that the goal in QPS to the back-end is set in some configuration area in your system. Under duress the rate r will be adaptively tuned to obey the QPS goal. Also assume you have smartly implemented some monitoring system like Ganglia/Nagios/Cacti and are emitting events to it when you call the back-end service.

Inputs

G = Goal QPS
M = Current measured QPS (from Ganglia).
r = Current sampling rate [0,1]

Outputs

r_new = A sampling rate [0,1]

Adaptive Mechanism

r_new = r * (G/M)
r_new = MAX(0,MIN(1,r_new)) //clamp r_new between [0,1]
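The update rule is small enough to sketch directly. The version below assumes a non-zero measured QPS and idealised conditions (every worker sees the same global M at the same moment); the function names and the traffic numbers in the simulation are made up for illustration.

```python
def update_rate(r, goal_qps, measured_qps):
    """One proportional step: scale r by G/M, then clamp to [0, 1]."""
    r_new = r * (goal_qps / measured_qps)
    return max(0.0, min(1.0, r_new))

def simulate(incoming, goal_qps, steps):
    """Each worker has its own incoming QPS but sees only the global M."""
    rates = [1.0] * len(incoming)
    history = []
    for _ in range(steps):
        measured = sum(q * r for q, r in zip(incoming, rates))
        history.append(measured)
        rates = [update_rate(r, goal_qps, measured) for r in rates]
    return history

# Three workers with uneven load, 1000 QPS in total, goal of 400 QPS.
print(simulate([300, 500, 200], goal_qps=400, steps=3))
```

In this idealised setting the outgoing QPS reaches the goal after a single update. In a real system measurement lag and noise are what produce the overshoot and settling behaviour discussed further down.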

Benefits

Needs only global G and M as inputs.
No coordination needed between workers/servers other than the globally observed M.
Adaptively moves the per-worker sampling rate independent of all other workers' rates.
Workers can have different incoming QPS rates from a load balancer, the controller will adapt.

Failure Modes

If the sensor for M fails to be updated then the controller is blind
If the Goal is set or re-set to zero, then the controller will stop traffic
Both of these can be addressed easily.
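One possible way to address them is sketched below. The specific policies are assumptions, not part of the original design: hold the last rate when the measurement is missing or stale, honour a zero goal explicitly, and keep a small floor (`MIN_RATE`, a hypothetical constant) so that the multiplicative update can recover, since r * (G/M) stays at zero forever once r is zero.

```python
MIN_RATE = 0.01  # hypothetical floor; lets the multiplicative update restart from r = 0

def guarded_update(r, goal_qps, measured_qps, sample_age_s, max_age_s=30.0):
    """Proportional update with guards for a blind sensor and a zero goal."""
    if measured_qps is None or sample_age_s > max_age_s:
        return r                    # sensor is blind: hold the last known rate
    if goal_qps <= 0:
        return 0.0                  # operator explicitly asked for no traffic
    r = max(r, MIN_RATE)            # recover after a period at r = 0
    if measured_qps <= 0:
        return 1.0                  # G/M is unbounded; the clamp yields 1
    return max(0.0, min(1.0, r * goal_qps / measured_qps))

print(guarded_update(0.5, 400, 800, sample_age_s=1.0))    # normal step
print(guarded_update(0.5, 400, 800, sample_age_s=120.0))  # stale measurement
```

Jumping straight to 1.0 when nothing is measured mirrors the clamped formula, but it can overshoot; a gentler ramp may be preferable in practice.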

Desired Response

The overshoot/undershoot is called 'ringing'
The time to approach the goal is called the 'settling time'

Implementation Notes

Emit a ganglia/graphite/XXX counter for both requests sent and skipped
Use a smoothed average of the measured QPS to smooth out controller jitter.
(Optional) Each worker should do a bit of randomization of when it performs its sampling-rate-update to smooth out any startup/restart ringing.
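These notes can be sketched as three small helpers. The exponentially weighted moving average is just one choice of smoother, and the class name, the in-process `Counter` (standing in for a Ganglia/Graphite emitter) and the default intervals are all assumptions made for illustration.

```python
import random
from collections import Counter

counters = Counter()  # stand-in for counters emitted to a monitoring system

def should_query(r, rng=random):
    """Sample the back-end call at rate r and count sent vs skipped requests."""
    sent = rng.random() < r
    counters["sent" if sent else "skipped"] += 1
    return sent

class SmoothedQps:
    """Exponentially weighted moving average of measured QPS samples."""
    def __init__(self, alpha=0.2):
        self.alpha = alpha
        self.value = None

    def add(self, sample):
        if self.value is None:
            self.value = float(sample)
        else:
            self.value += self.alpha * (sample - self.value)
        return self.value

def next_update_delay(base_s=10.0, jitter=0.5, rng=random):
    """Randomise the update interval so workers do not all adjust in lockstep."""
    return base_s * (1.0 + rng.uniform(-jitter, jitter))

smoother = SmoothedQps(alpha=0.5)
for sample in (100, 200, 200):
    print(smoother.add(sample))
```

A smaller `alpha` damps jitter more but lengthens the settling time, so the two requirements pull against each other.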

Invertibility

This design could be inverted for a server implementation. If your webservice has a set of APIs that are heavy to execute, then this controller could be used to control the incoming QPS that is delegated to the heavy work.

Receive request
Submit to sampling rate
If Yes then delegate the request to the executor
If No then respond to the client with 'HTTP 204 No Content' or an equivalent empty response.
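The four steps above could look like the following sketch. The `handle_request` name and the `(status, body)` return shape are assumptions for illustration; a real service would hook this into its own request framework.

```python
import random

def handle_request(request, rate, executor, rng=random):
    """Gate an expensive handler behind the sampling rate."""
    if rng.random() < rate:
        return 200, executor(request)   # delegate to the heavy work
    return 204, None                    # 'No Content': shed the request cheaply

print(handle_request("report", 1.0, str.upper))
print(handle_request("report", 0.0, str.upper))
```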

Stanford's Introduction to Computational Advertising course

2011-12-27T09:10:00.001-08:00

Andrei Broder and Vanja Josifovski, of Yahoo! Research, have wrapped up their Stanford course again this fall. As always the lecture slides are a great intro to the area.

MS&E 239: Introduction to Computational Advertising

Computational advertising is an emerging new scientific sub-discipline, at the intersection of large scale search and text analysis, information retrieval, statistical modeling, machine learning, classification, optimization, and microeconomics. The central problem of computational advertising is to find the "best match" between a given user in a given context and a suitable advertisement. The context could be a user entering a query in a search engine ("sponsored search"), a user reading a web page ("content match" and "display ads"), a user watching a movie on a portable device, and so on. The information about the user can vary from scarily detailed to practically nil. The number of potential advertisements might be in the billions. Thus, depending on the definition of "best match" this problem leads to a variety of massive optimization and search problems, with complicated constraints, and challenging data representation and access problems. The solution to these problems provides the scientific and technical foundations for the $20 billion online advertising industry.

This course aims to provide a good introduction to the main algorithmic issues and solutions in computational advertising, as currently applied to building platforms for various online advertising formats. At the same time we intend to briefly survey the economics and marketplace aspects of the industry, as well as some of the research frontiers. The intended audience are students interested in the practical and theoretical aspects of web advertising.


Summer Program on Computational Advertising

2011-12-27T09:04:00.001-08:00

Deepak Agarwal of Yahoo Research is organizing a summer program on Computational Advertising. It appears to be geared for grad students.

http://www.samsi.info/programs/summer-program-august-6-17-2012-computational-advertising

This two-week program will run from August 6 to August 17, 2012. The first week will be at the Radisson RTP in the Research Triangle Park, NC. The location is in close proximity to SAMSI. The first three days will be spent on technical presentations by leading researchers and industry experts, to bring everyone up to speed on the currently used methodology. On the fourth day, the participants will self-organize into working groups, each of which will address one of the key problem areas (it is permitted that people join more than one group, and the organizers will try to arrange the working group schedules to facilitate that). The second week will be spent at SAMSI headquarters in Research Triangle Park.


What Software Engineers should know about Control Theory

2011-11-15T08:24:00.001-08:00

Over the years I've noticed an interesting lack of specific domain knowledge among CS and software people. Other than the few co-workers that majored in Electrical Engineering, almost no one has heard of the field of 'Control Theory'.

From Wikipedia

Control theory is an interdisciplinary branch of engineering and mathematics, that deals with the behavior of dynamical systems. The desired output of a system is called the reference. When one or more output variables of a system need to follow a certain reference over time, a controller manipulates the inputs to a system to obtain the desired effect on the output of the system.

Let's imagine that you write internet web services for a living: some REST or SOAP APIs that take an input and give an output.

Your boss walks up to you one day and asks for a system that does the following:

Create a webservice that calls another (or three) for data/inputs, then does X with them.
Meters the usage of the other web services.
Your webservice must respond within Y milliseconds with good output or a NULL output.
Support high concurrency, i.e. not use too many servers.

The problem is that these other third-party webservices are not your own. What is their response time? Will they give bad data? How should your webservice react to failures of the others?
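The "respond within Y milliseconds with good output or a NULL output" requirement can be sketched with a deadline around the third-party call. This is only an illustration: `call_with_deadline` and the pool size are hypothetical, and a timed-out call keeps running in its worker thread, which is exactly the kind of resource pressure a controller then has to manage.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

_pool = ThreadPoolExecutor(max_workers=8)   # hypothetical shared pool

def call_with_deadline(fn, deadline_s, *args):
    """Return fn(*args), or None if it fails or misses the deadline."""
    future = _pool.submit(fn, *args)
    try:
        return future.result(timeout=deadline_s)
    except TimeoutError:
        future.cancel()     # only helps if the call has not started yet
        return None
    except Exception:
        return None         # a failing third party also maps to NULL output

def slow_service(x):
    time.sleep(0.3)         # simulated slow third-party webservice
    return x

print(call_with_deadline(len, 0.5, "fast"))
print(call_with_deadline(slow_service, 0.05, "too slow"))
```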

Does this sound familiar? It should to many. This is the replicated-and-shared-connector problem (MySQL, memcached), the partitioned-services problem (federated search, and large scale search engines) and the API-as-a-service problem (Mashery, etc).

There are two basic types of controls relevant here:

Open Loop, Feed-forward: Requires good model of system inputs and response of the system.

Closed Loop, Feed-back

Types of adaptive control are as follows:

Linear Feedback
Stability Analysis
Frequency response
response time

Adaptive Schemes

Gain Scheduling
Model Reference Adaptive Systems
Self-tuning regulators
Dual Control

Here's one survey deck from a lecture. Unfortunately for software engineers, most of the presentations of the above are in linear system form rather than an algorithmic form.

Dr. Joe L Hellerstein of Google and co-workers taught a course at U of Washington in 2008 that was more software focused. He's also written a textbook on it and a few papers.

http://research.microsoft.com/en-us/um/people/liuj/cse590k2008winter/
Joseph L Hellerstein et al "Feedback control of computing systems" 2004 Wiley
Hellerstein 2003 IBM Tech Report "Challenges in Control Engineering of Computing Systems"
Hellerstein et al "Research challenges in control engineering of computing systems" Volume: 6 Issue: 4, 2010 IEEE Trans on Network and Service Management

The course page has a collection of great links to applications papers on controllers for software systems.

I'd like to see a 'software patterns' set created for easier use by software engineers. I'll attempt to present a couple common forms as patterns in a future blog post.

Open RTB panel - IAB Ad Ops Summit 2011

2011-11-10T10:23:00.000-08:00

Monday November 7th I was on an IAB Ops panel on OpenRTB.

The clip shows an exchange after Steve from the IAB asked a question about how webpage inventory is described in RTB. I described an example of differentiating a simple commodity, barley.

Two of the major uses of barley in the US are animal feed and malting for making beer. Malting barley has specific requirements in terms of moisture content, protein percentage and other factors. Farmers don't always know what quality their crop will finish at. They count on having two general markets: if the tested quality meets malting standards, the premium over feed prices can be healthy. A 2011 report noted that malting barley provided a 70% premium over feedstock barley. Growing specific varieties and/or using organic farming methods can provide additional premiums over generic feed barley. The curious can follow the links below.

How does this relate to publishers and advertising and OpenRTB? In my opinion we need several things standardized:

1) Inventory registration and description API. Allows publishers influence on how their inventory is exposed in various demand-side and trading-desk platforms. Publishers should fully describe their inventory in a common format. Buy-side GUIs and algorithms will benefit from increased annotation and categorization. This can also harmonize the brand-safety ratings that are not connected between the sell and buy sides.

2) Standardization of the emerging 'Private Marketplace' models in RTB. A set of best practices and trading procedures for PM needs to be defined such that the market can grow properly.

While the main bid request/response API of OpenRTB has been criticized as being 'too late' given the large implementations in production, it is not too late to define standards for the above. These things will help the buy-side better differentiate quality inventory.

SchemaMgr - MySQL schema management tool

2011-07-12T22:49:00.001-07:00

SchemaMgr is a simple Perl tool to manage schema changes in MySQL databases. I wrote this in 2007/2008 and it has been used in production for many years.

https://github.com/nealrichter/schemamgr

Each change is assigned a version number and placed in a file. When the SQL in the file is executed successfully, a special table is set with that version number. Subsequent runs install only the higher versioned files.

It can also be used to reinstall views and stored procedures.

The best practice is to copy the script and replace X, in both the filename and the $DB_NAME variable, with the name of your database.

$ ./bin/schemamgr_X.pl
Usage: either create or upgrade X database
schemamgr_X.pl -i -uUSERNAME -pPASSWORD [-vVERSION] [-b]
  updates DB of to current (default) or requested version
schemamgr_X.pl -s -uUSERNAME -pPASSWORD
  reinstalls all stored procedures
schemamgr_X.pl -w -uUSERNAME -pPASSWORD
  reinstalls all views
schemamgr_X.pl -q -uUSERNAME -pPASSWORD
  Requests and prints current version
Optional Params
 -vXX -- upgrades upto a specific version number XX
 -b   -- backs up the database (with data) before upgrades
 -nYY -- runs the upgrades against database YY - default is X

By convention you create ONE create_X_objects_v1 file with a date.
All other files are update files with greater than v1 numbers.

build/
|-- create_X_objects_v1_20110615.sql
|-- update_X_objects_v2_20110701.sql
`-- update_X_objects_v3_20110702.sql
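The versioning idea can be illustrated in a few lines. This is not SchemaMgr's actual code (which is Perl against MySQL); it is a sketch in Python using an in-memory SQLite database, and the `schema_version` table name, the filename regex and the function names are assumptions made for illustration.

```python
import re
import sqlite3

# Matches names like update_X_objects_v2_20110701.sql and captures the version.
FILE_RE = re.compile(r"^(?:create|update)_\w+_objects_v(\d+)_\d{8}\.sql$")

def pending(files, current_version, target=None):
    """Return (version, filename) pairs above current_version, lowest first."""
    found = []
    for name in files:
        m = FILE_RE.match(name)
        if m:
            version = int(m.group(1))
            if version > current_version and (target is None or version <= target):
                found.append((version, name))
    return sorted(found)

def upgrade(conn, scripts, target=None):
    """Apply pending scripts (filename -> SQL text); return the new version."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    current = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0
    for version, name in pending(scripts, current, target):
        conn.executescript(scripts[name])       # run the change file
        conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
        conn.commit()                           # record success for later runs
        current = version
    return current

scripts = {
    "create_X_objects_v1_20110615.sql": "CREATE TABLE t (id INTEGER);",
    "update_X_objects_v2_20110701.sql": "ALTER TABLE t ADD COLUMN name TEXT;",
    "update_X_objects_v3_20110702.sql": "CREATE INDEX t_name ON t (name);",
}
conn = sqlite3.connect(":memory:")
print(upgrade(conn, scripts, target=2))   # stop at a requested version
print(upgrade(conn, scripts))             # later run installs only higher versions
```

Because the version is recorded only after a file executes successfully, re-running the upgrade is safe: files at or below the stored version are skipped.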


Managing yourself to tasks and finishing them.

2011-06-28T09:02:00.001-07:00

I saw these two short articles from HBR come across my twitter stream and read them. They stuck with me for more than a week as they triggered some connections with proof methods.

Treat Every Task as Three Steps, Not One

The essence of the advice is "Prep-Do-Review". For each task, make and review a plan before starting. Once the task is completed, review the plan. Did you finish? What did you learn? What would you do differently next time?

How to Become a Great Finisher

The essence of this advice is to think in terms of "to-go" versus "to-date" performance on a task. When you entertain to-date thinking it's very easy to see how much you have accomplished so far. This can lead you to lower your ongoing effort or to become distracted and work on other tasks.

So the optimal algorithm that combines the two is:

PrepForWork();
Do {
    IncrementalWork();
    X = DistanceToFinish();
} Until (X == 0)
ReviewResults();

Don't look back until you are done!

Note that there is a formal method of proof in mathematics and CS that goes something like this:

1) Define a non-negative integral (integer-valued) metric f(x) measuring the distance to the goal

2) Define the starting distance

3) Show that your "algorithm/method" strictly decreases f(x) at every step

4) Infer that the goal will be reached

5) (Optional) Calculate the minimum number of steps required to reach the goal.

The important point is the rigor of the mathematical proof version. Your idea gets no partial credit for so-far progress, the algorithm either gets to the goal or it fails. The proof is either true or it is false. Thus you are either Done or you are NOT Done.
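A standard instance of this proof pattern is Euclid's algorithm, where the metric is the second argument: it is a non-negative integer and it strictly decreases on every iteration, so the loop must reach zero. The sketch below checks that argument at runtime with an assertion; the function name and the step counter are illustrative additions.

```python
def gcd_with_variant(a, b):
    """Euclid's algorithm, asserting the termination argument as it runs."""
    assert a >= 0 and b >= 0
    steps = 0
    while b != 0:
        before = b                  # f(x): non-negative integer distance to goal
        a, b = b, a % b
        assert 0 <= b < before      # strict decrease on every step
        steps += 1
    return a, steps

print(gcd_with_variant(1071, 462))
```

Since the metric drops by at least one per step, the starting distance also gives an upper bound on the number of steps.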


JSON parsing speed in various Node.JS versions

2011-04-07T14:21:00.001-07:00

We use Node.JS for a very high capacity service at the Rubicon Project. It often drives or handles in excess of 10B HTTP requests per day sending or receiving JSON data.
Out of curiosity I ran some tests on JSON parsing speed in different versions of Node.JS

node.js code:

var sys = require('sys');
var data = "{ \"item_uuid\": \"8ec56438-d3cf-442a-bbf7-7f076f229f35\", \"return_code\": 0, \"data\": [ { \"valid\": true, \"votes\": 2345, \"date\":\"Thu, 07 Apr 2011 15:17:17 EDT\", \"headline\": \"Senate Majority Leader Harry Reid indicates there likely will be a government shutdown on Friday. Lawmakers have been unable to agree on a new federal budget\", \"source\": \"Yahoo News\", \"published\":{\"hour\":\"19\",\"timezone\":\"UTC\",\"second\":\"17\",\"month\":\"4\",\"minute\":\"17\",\"utime\":\"1302203837\",\"day\":\"7\",\"day_of_week\":\"4\",\"year\":\"2011\"} } ] }";
try {
  // Parse the same JSON string one million times.
  for (var i = 0; i < 1000000; i++) {
    var tmp = JSON.parse(data);
  }
} catch (e) { sys.puts("ERROR: on parsing JSON with v8 parser"); }
sys.puts(data);
var tmp = JSON.parse(data);
sys.puts(JSON.stringify(tmp));
sys.puts("\n DONE \n");
process.exit();

Essentially this re-parses the same example JSON (I created a fake RSS-like JSON package) 1M times.

Here are the results:

Node 0.1.3x: real 0m30.050s
Node 0.2.6: real 0m30.050s
Node 0.3.8: real 0m9.915s
Node 0.4.5: real 0m9.999s

For reference I ran the same test against a very fast tokenizing parser in C called jsmn, and a C++ one called vjson.

jsmn: real 0m2.276s
vjson: real 0m7.465s Note that vjson is a destructive parser, and I had to fix that first.

Interestingly, the JSON parser in Node 0.4.5 and prior versions appears to be written in pure JavaScript. See the file: node-v0.4.5/deps/v8/src/json.js

It's unclear if the speed improvements are a result of improvements to the parser implementation or in some efficiency/speed leap in versions of V8 included in Node versions.

Node 0.1.33: v8: 2010-03-17: Version 2.1.5
Node 0.2.6: v8: 2010-08-16: Version 2.3.8
Node 0.3.8: v8: 2011-02-02: Version 3.1.1
Node 0.4.5: v8: 2011-03-02: Version 3.1.8


Pamela Samuelson on startups and software patents

2011-03-22T21:26:00.001-07:00

Following up on my last post here is the view from Pamela Samuelson:

Why software startups decide to patent ... or not

Two-thirds of the approximately 700 software entrepreneurs who participated in the 2008 Berkeley Patent Survey report that they neither have nor are seeking patents for innovations embodied in their products and services. These entrepreneurs rate patents as the least important mechanism among seven options for attaining competitive advantage in the marketplace. Even software startups that hold patents regard them as providing only a slight incentive to invest in innovation.

She also lists a variety of reasons why these software entrepreneurs decided to forgo patenting their last invention. It's a very interesting write up.


Comments re "The Noisy Channel: A Practical Rant About Software Patents"

2011-03-10T14:30:00.001-08:00

The Noisy Channel: A Practical Rant About Software Patents - [My comments cross-posted here]

Daniel, nice writeup.

I worked for a BigCo and filed many patents. It was a mixed bag. The time horizon is so long that even after I’ve been gone for 3.5 years many of them are still lost in the USPTO. Average time for me to see granted patents was 5+ years.

Here are my biased opinions:

1) Patents really matter for BigCos operating on a long time horizon. It’s a strategic investment.

2) Patents are nearly worthless for a Startup or SmallCo. The time horizon is way past your foreseeable future, and thus the whole effort is akin to planning for an alternate reality different than the current business context. Throwing coins in a fountain for good luck is about as relevant. You simply are better off getting a filing date on a provisional design writeup and hiring an engineer with the money you’d spend on Patent lawyers.

3) As an Acquiring company looking at a company to acquire, Provisional or Pending Patents are a liability not an asset. They take time and resources to push to completion for a strategy of deterrence.

4) Patents are mostly ignored in the professional literature. Take Sentiment Analysis as one example. Sentiment Analysis exploded in 2001 w.r.t. Academic publishing, yet there are more than a few older patents discussing good technical work on Sentiment Analysis. I’ve NEVER seen an algorithm in a patent cited in a paper as previous work. And I have seen academic papers with algorithms already 90% covered by an older patent… and the papers are cited as ‘novel work’.

5) Finding relevant patents is ludicrously hard. It might be the most challenging problem in IR w.r.t. a corpus IMO. Different words mean the same thing and vice versa due to the pseudo-ability in a Patent to redefine a word away from the obvious meaning. Two different lawyers rendering the same technical design into a writeup and claims will produce wildly different work product.

6) I’ve seen some doozy granted Patents. Things that appear to be either implementations of very old CS ideas in new domains... or worse, stuff that would be a class project for an undergrad.

It’s just plain ugly in this realm.


On Strategic Plans

2011-03-10T01:20:00.001-08:00

This needs absolutely no comment.

“We have a ‘strategic plan.’ It’s called doing things.” ~ Herb Kelleher


Hilarious system calls in the BeOS

2011-03-06T10:20:00.001-08:00

These system calls in the BeOS still make me smile.

int32 is_computer_on(void)

Returns 1 if the computer is on. If the computer isn't on, the value returned by this function is undefined.

double is_computer_on_fire(void)

Returns the temperature of the motherboard if the computer is currently on fire. If the computer isn't on fire, the function returns some other value.

#include <stdio.h>
#include <be/kernel/OS.h>
int main(void)
{
    /* int32 may be a long on some builds, so cast before printing with %d */
    printf("[%d] = is_computer_on()\n", (int)is_computer_on());
    printf("[%f] = is_computer_on_fire()\n", is_computer_on_fire());
    return 0;
}

These functions serve a similar purpose to getpid() in Unix, essentially no-op calls that can be used to test the kernel's intrinsic response time under load.
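As a rough illustration of that kind of probe, the sketch below times a tight loop of `os.getpid()` calls. It is only an approximation: the figure includes Python's own call overhead, and whether each call actually enters the kernel depends on the platform's C library. The function name and the default call count are arbitrary.

```python
import os
import time

def syscall_latency_ns(calls=100_000):
    """Average wall-clock nanoseconds per os.getpid() call."""
    start = time.perf_counter_ns()
    for _ in range(calls):
        os.getpid()
    return (time.perf_counter_ns() - start) / calls

print(f"{syscall_latency_ns():.0f} ns per call")
```

Running it on an idle machine and again under load gives a crude before/after comparison of response time.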

A write-up of BeOS history is here. Haiku is an open source clone of the BeOS that is, curiously, under active development.


Contractor Needed: HTML/CSS/Javascript Ninja

2011-03-04T09:21:00.001-08:00

The Rubicon Project is looking for an in-browser HTML/CSS/Javascript Ninja to restructure the workflow of an application GUI. The server side code is perl/mod_perl. Please contact me if you are interested and available. The contract is 4-6 weeks.


Job Post: Software Engineer/Scientist: Ad Serving, Optimization and Core Team

2011-03-03T12:26:00.001-08:00

LOCATION: the Rubicon Project HQ in West Los Angeles or Salt Lake City

the Rubicon Project is on a mission to automate buying and selling for the $65 billion global online advertising industry. Backed by $42 million in funding, we are currently looking for the best engineers in the world to work with us.

Team Description

The mission of the Core Team is to build robust, scalable, maintainable and well documented systems for ad serving, audience analytics, and market analysis. Every day we serve billions of ads, process terabytes of data and provide valuable data and insights to our publishers. If building software that touches 500+ million people every month is interesting to you, you'll fit in well here.

Some of the custom software we've built to solve these problems include:

A patented custom ad engine delivering thousands of ad impressions per second with billions of real time auctions daily
A real time bid engine designed to scale out to billions of bid requests daily
Optimization Algorithms capable of scheduling and planning adserving opportunities to maximize revenue
Client side Javascript that performs real-time textual analysis of web pages to extract semantically meaningful data and structures
A web-scale key value store based on ideas from the Amazon Dynamo paper used to store 100s of millions of data points
Unique audience classification system using various technologies such as Solr and Javascript for rich, real-time targeting of web site visitors
Data Mining buying and selling strategies from a torrent of transactional data
Analytics systems capable of turning a trillion data points into real business insight

Job Description

Your job, should you accept it, is to build new systems, new features and extend the functionality of our existing systems. You will be expected to architect new systems from scratch, add incremental features on existing systems, fix bugs in other people's code and help manage production operations of the services you build. Sometimes you'll have to (or want to) do this work when you are not in the office, so working remote can't scare you off.

Most of our systems are written in Perl, Java, and C, but we have pieces of Python, Clojure and server-side Javascript as well. Hopefully you have deep expertise in at least one of these; you'll definitely need to have a desire to quickly learn and work on systems written in all of the above.

You should also have worked with and/or designed service oriented architectures, advanced db schemas, big data processing, and highly scalable and available web services, and be well aware of the issues surrounding the software development lifecycle. We expect that your resume will itemize your 3+ years of experience, mention your BS or MS in Computer Science and be Big Data Buzzword Compliant.

Bonus points for experience with some of the technologies we work with:

Hadoop
NodeJS
MySql
Solr/Lucene
RabbitMQ
MongoDB
Thrift
Amazon EC2
Memcached
MemcacheQ
Machine Learning
Optimization Algorithms
Economic Modeling

Apply Now! Click the Apply button!


A note on software teams and individuals

2011-02-28T13:06:00.001-08:00

I'm currently running a loosely coupled team of people all working on a common initiative. While this is not my first time running a team, the same set of things seems to happen with all 'new' teams. Here's a quick set of observations.

The first major observation is that teams of engineers can quickly fall into operating like a "golf team" versus a "football team". In Golf, each team member generally competes against all other players (and their different teams) as an individual. A given team wins if it's individual players collectively do better than some other team's players. Football (or Soccer or Basketball) is very different. A team wins in the face of good opposition only if it plays as a team.

For software teams, done means one thing: the team is done with the milestone or project. Done means finished, tested and shipped code. Does does not mean "my part works", or "my tasks are done".

IMO each team member should answer these questions to the group every day:

What direction I am going relative to team goals.
What specific items I am working on today.
Does anyone need any help from me?
Do I need any help with my work?

Team managers, both the overall and functional leads, should ask or answer these questions for the group every day:

Are we as a group going the right direction (towards the goal)?
Will we meet the timeline and/or functional goals?
Is there any functional or task ambiguity that needs working out?
Are any course corrections needed?

The second observation is that there are two major indicators of whether a given individual is a good addition to the team:

Does this person communicate well and often?
Does this person have the capability and desire to resolve ambiguity on their own when possible?

Whether a candidate has the second skill, resolving ambiguity, is in my opinion the primary question that a software hiring manager needs to answer in the affirmative... assuming of course the candidate has the needed technical skills.

Much of this also circles back to a blog post that Jordan Mitchell wrote years ago when I was hip-deep in code at Others Online.

Actual vs. Perceived Progress

RightNow - Our cowboys ride code

2011-02-07T15:13:00.000-08:00

This is a neat little ad in the January Delta Sky Magazine for RightNow Technologies, where I worked from 1999-2007.

RightNow Technologies serves about 10 billion [customer interactions] a year through the companies and institutions it works with. “Every person in North America has used one of our solutions about 25 times,” says Gianforte.

Why RightNow keeps its headquarters in Bozeman, MT

The quality of life here is a huge advantage, but more importantly, says Gianforte, “there’s a ranch saying around here that goes, ‘When something needs to get done, well then, we’re just gonna get ‘er done.’ In many environments, they have to form a committee, pull in consultants and such to make things happen, but our clients appreciate that when something needs to get done, we can easily make that happen because of the work ethic here.”

The Provenance of Data, Data Branding and "Big Data" Hype

2011-01-30T23:45:00.001-08:00

The credibility of where data comes from in all these "big data" plays is absolutely crucial. Waving hands re "algorithms" won't cut it. @nealrichter Jan 27, 1010 Tweet

To expand on this tweet here's the argument: If one of your key products as a startup or business is to "crunch data" and derive or extract value from it then you should be concerned about data provenance. This is true whether you are crunching your own data or third-party data.

Some examples:

Web analytics - crunch web traffic and distill visitation and audience analytics reports for web site owners. Site owners often use these summaries to make decisions and to sell their ad space to advertisers.
Semantic Web APIs - crunch webpages, tweets etc and return topical and semantic annotations of the content
Comparison shopping - gather up product catalogs and pricing to aggregate for visitors
Web publishers - companies who run websites
Prediction services - companies that use data to predict something

In each of the above categories the provenance of the input data and brand of the output data is key. For each of the above one could name a company with either solid-gold data OR a powerful brand-name and good-enough data. Conversely we can find examples of companies with great tech but crappy data or a weak brand.

For web publishers, those that host user-generated content have poor provenance in general compared to news sites (for example). A notable exception is Wikipedia, which has a pure "UGC" model but a solid community process and standards to improve the provenance of its articles (those without references are targeted for improvement).

In comparison shopping Kayak.com has good data (directly from the airlines) and has built a good brand. The same is true of PriceGrabber and Nextag. TheFind.com on the other hand appears to have great data and tech, but no well-known brand.

(I'm refraining from going into specific examples or opinions on big data companies to avoid poking friends in the eye.)

The issue of Provenance and Branding is especially important in sales situations where you are providing a tool (analytics) that helps your customer (a sales person) sell something to a third party (their customer). If the input data you are using has either a demonstrable provenance or a good brand, you'll have an easier time convincing people that the output of your product is worth having (and reselling).

The old saying for this in computer science is Garbage In, Garbage Out.

In the "big data" world of startups that is blowing by Web 2.0 as the new hotness, there is a startling lack of concern about data provenance. The essential ethos is that if we (the Data Scientists) accumulate enough data and crunch it with magical algorithms then solid-gold data will come out... or at least that's what the hype machine says.

The lesson from the financial meltdown is that magical algorithms making CDOs, CMOs and other derivatives should be viewed through a lens of mistrust. The GIGO principle was forgotten and no one even cared about the provenance (read credit quality) of the base financial instruments making up the derivatives. The credit rating agencies were just selling their brand and cared little about quality.

In my opinion, there is a clear parallel here to "big data". Trust must be part of the platform and not just tons of CPUs and disk-space. A Brand is a brittle object that is easily broken, so concentrate on quality.

Posted via email from nealrichter's posterous

Finance for Engineers

2011-01-14T00:23:00.001-08:00

Last summer I took a great mini-course at MIT Sloan on Finance. It's essentially a breadth-first review of the MBA course, complete with three case studies and a review of project evaluation methods via net present value analysis. Approximately 80% of the attendees were engineers/techies with 10+ years of experience, and maybe 25% had PhDs.

Fundamentals of Finance for the Technical Executive

The textbook was Higgins, Analysis for Financial Management.

The first case study is Wilson Lumber from Harvard. The material is copyrighted, yet these links look like accurate distillations by business students.

The initial position is that Wilson Lumber is a growing small business with good suppliers and loyal customers. Volume and revenue are up period over period. The question is whether the bank should increase its line of credit to fund the business. Once you break down the financial statements and model the business, the answer is No. Essentially Mr Wilson is overextended by many measures and is growing at the expense of his balance sheet; loaning him money will only make the problem bigger down the road. His basic options are to take in a partner as co-owner for cash, go broke, or raise prices to lower volume, improve margins and slowly rebuild the balance sheet.

We then went through two NPV exercises. The first was a basic go/no-go analysis of an engineering project, done bottom up by putting all cost/benefit assumptions in a model and iterating through possibilities. The second was an analysis of a joint venture between two biotech companies. Everything from external capital and deal structure to market penetration projections was worked in. Very informative and pretty interesting work for engineers to do once the terminology and methods were explained.
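For readers who have not seen an NPV calculation, here is a minimal sketch of the go/no-go idea. The cash flows and discount rates are made-up illustration numbers, not figures from the course, and the function name npv is just a label chosen here.

```python
def npv(rate, cash_flows):
    """Net present value of a series of yearly cash flows.

    cash_flows[0] happens today (t = 0); cash_flows[t] arrives at the
    end of year t and is discounted by (1 + rate) ** t.
    """
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))


# Hypothetical project: spend 1000 now, receive 500 a year for three years.
project = [-1000, 500, 500, 500]

# "Iterating through possibilities": try several discount rates and see
# where the decision flips from go (NPV > 0) to no-go (NPV < 0).
for rate in (0.05, 0.10, 0.25):
    value = npv(rate, project)
    decision = "go" if value > 0 else "no-go"
    print(f"rate={rate:.2f}  NPV={value:8.2f}  {decision}")
```

With these assumed numbers the project clears the bar at a 10% discount rate but not at 25%, which is the kind of sensitivity a bottom-up model is meant to expose.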

Professor Jenter shared two amusing anecdotes:

His MIT and Stanford MBA students often run off to found start-ups and forget the basic Wilson Lumber case. By the time they approach him for help it's too late and they are in Mr Wilson's position: shut down, take in $$ with lots of equity dilution (and loss of control), or slow growth dramatically.
Also a quote along the lines of "Startups founded by MIT PhDs fail at a rate far above average".

This certainly hammered home the lesson that strategic planning for growth is very important, even for what look like non hyper-growth (software) companies. I'd recommend this course to any engineer wanting a quick structured intro to basic financial management.

List of Best Paper awards in CS/AI/ML conferences

2010-12-30T09:34:00.001-08:00

Below is a great list of best paper awards for WWW, SIGIR, CIKM, AAAI, CHI, KDD, SIGMOD, ICML, VLDB, IJCAI and UIST since 1996.

http://jeffhuang.com/best_paper_awards.html

Interesting thing to note: Google is ranked last in frequency, Microsoft first.

This needs NIPS and possibly UAI added to it.

http://nips.cc/ConferenceInformation/PaperAwards

Managing Open Source Licenses

2010-12-29T11:18:00.001-08:00

From time to time I have helped companies do Open Source code audits in their own source code. Basically this consists of auditing their code to find open source code.

These code audits are particularly important during software releases and M&A events. I've helped companies do this for releases and been on both sides of M&A event driven audits.

If the developers have kept the attributions with any open source code they have re-used then grep is a fine tool for auditing. However this is a big IF. If your developers are sloppy and do not keep the attributions (ie copyright and license notices) with code they lift from open source you have a problem. A software tool needs to be used to scan the corporate source for hits in open source repositories.

There are at least three companies providing software to do this:

Ideally the outcome of this process is as follows:

A clear company policy is set on what open source licenses are allowed and how developers can use open source code or components.
The corporate code is cleanly annotated with any third party attributions (see below).
Open Source code that has bad licenses for commercial usage is identified and removed before release.
A Bill of Materials is created for each release listing third-party software in the release.
Necessary copyright or other notices appear in About dialogs, manuals or product websites.

Example comment block:

/*
 * XYZ.com Third-party or Open Source Declaration
 * Name: Bart Simpson
 * Date of first commit: 04/25/2009
 * Release: 3.5 “The Summer Lager Release”
 * Component: tinyjson
 * Description: C++ JSON object serializer/deserializer
 * Homepage: http://blog.beef.de/projects/tinyjson/
 * License: MIT style license
 * Copyright: Copyright (c) 2008 Thomas Jansen (thomas@beef.de)
 * Note: See below for original declarations from the code
 */

If the above were upgraded to be in a javadoc style comment then a tool could be built to auto-magically generate a Bill of Materials for each release.
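As a rough illustration of such a tool, here is a minimal Python sketch. It is an assumption-laden toy: it parses the plain "Key: value" lines of the comment block shown above rather than real javadoc tags, it treats the phrase "Open Source Declaration" as the marker, and the names extract_declarations and bill_of_materials are invented for this example.

```python
import re

# Any C-style /* ... */ comment block.
COMMENT_BLOCK = re.compile(r"/\*(.*?)\*/", re.DOTALL)
# Lines of the form " * Key: value" inside a block.
FIELD = re.compile(r"^\s*\*\s*([A-Za-z ]+?):\s*(.+?)\s*$", re.MULTILINE)
# Assumed marker that distinguishes declarations from ordinary comments.
MARKER = "Open Source Declaration"


def extract_declarations(source_text):
    """Return one dict of fields per declaration block found in the text."""
    records = []
    for block in COMMENT_BLOCK.findall(source_text):
        if MARKER in block:
            records.append(dict(FIELD.findall(block)))
    return records


def bill_of_materials(records):
    """Format each record as 'component | license | homepage'."""
    return [
        f"{r.get('Component', '?')} | {r.get('License', '?')} | {r.get('Homepage', '?')}"
        for r in records
    ]


SAMPLE = """
/*
 * XYZ.com Third-party or Open Source Declaration
 * Name: Bart Simpson
 * Component: tinyjson
 * Homepage: http://blog.beef.de/projects/tinyjson/
 * License: MIT style license
 */
int main() { return 0; }
"""

for line in bill_of_materials(extract_declarations(SAMPLE)):
    print(line)
```

A real version would walk the source tree, read every file and collect the records per release; the parsing step would stay much the same.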

There is one grey area in all this: how to handle developers using code from discussion sites like PHP.net, CodeProject, StackOverflow and similar sites. Generally code put in these types of forums has no defined license. In this case the code is copyrighted either by the site or by the author of the post... and developers should not use the code without getting an explicit license. However, developers generally feel that people put the code up there to share. This conflict means the company policy on usage of this type of code must be clearly communicated to all developers.

This is a nice review article of other considerations for open source auditing:

Dr Dobbs: Managing Open Source Licensing by Kamal Hassin

Stochastic Universal Sampling/Selection

2010-12-17T09:11:00.001-08:00

Stochastic Universal Sampling is a method of weighted random sampling exhibiting less bias and spread than classic roulette wheel sampling. The intuition is a roulette wheel with n equally spaced steel balls spinning in unison around the wheel. This method has better properties and is more efficient than doing repeated samples from the wheel with or without replacement of the selected items.

Baker, James E. (1987). "Reducing Bias and Inefficiency in the Selection Algorithm". Proceedings of the Second International Conference on Genetic Algorithms and their Application (Hillsdale, New Jersey: L. Erlbaum Associates): 14–21.
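To make the intuition concrete, here is a minimal Python sketch of the idea: one random spin, then n equally spaced pointers walked across the cumulative weights. It is an illustration written for this post, not code taken from Baker's paper or from any of the implementations linked below, and the function and parameter names are my own choices.

```python
import random


def stochastic_universal_sampling(weights, n, rng=random):
    """Pick n indices with probability proportional to weights.

    A single random offset places n equally spaced pointers on the
    "wheel"; each pointer selects the item whose slice it lands in.
    """
    total = sum(weights)
    step = total / n                   # distance between pointers
    start = rng.uniform(0, step)       # the one and only random spin
    picks = []
    idx = 0
    cumulative = weights[0]            # right edge of the current slice
    for k in range(n):
        pointer = start + k * step
        # Advance until the pointer falls inside the current slice.
        # The idx guard protects against floating point overshoot.
        while pointer >= cumulative and idx < len(weights) - 1:
            idx += 1
            cumulative += weights[idx]
        picks.append(idx)
    return picks


# Item 0 holds 3/4 of the wheel, item 1 holds 1/4.
print(stochastic_universal_sampling([3, 1], 4, random.Random(0)))
```

With weights of 3 and 1 and four pointers, item 0 is picked three times and item 1 once on every run. That is the "less spread" property: each item's count never strays more than one from its expected value, whereas four independent roulette spins could return item 1 four times.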

Reference implementations on the web are scarce, so here are a few:

Christian Borgelt

http://fuzzy.cs.uni-magdeburg.de/studium/ga/src/sus.c

Dan Dyer

https://github.com/dwdyer/watchmaker/blob/master/framework/src/java/main/org/uncommons/watchmaker/framework/selection/StochasticUniversalSampling.java

University of New Mexico

http://epr.adaptive.cs.unm.edu/asm/code.html

GMU's ECJ

http://cs.gmu.edu/~eclab/projects/ecj/docs/classdocs/ec/select/SUSSelection.html

See the SUSSelection.java buried in the latest tarball.

Computing, economics and the financial meltdown (a collection of links)

2010-11-22T15:58:00.001-08:00

This editor's letter from CACM last year is interesting: The Financial Meltdown and Computing by Moshe Y. Vardi

Information technology has enabled the development of a global financial system of incredible sophistication. At the same time, it has enabled the development of a global financial system of such complexity that our ability to comprehend it and assess risk, both localized and systemic, is severely limited. Financial-oversight reform is now a topic of great discussion. The focus of these talks is primarily over the structure and authority of regulatory agencies. Little attention has been given to what I consider a key issue—the opaqueness of our financial system—which is driven by its fantastic complexity. The problem is not a lack of models. To the contrary, the proliferation of models may have created an illusion of understanding and control, as is argued in a recent report titled "The Financial Crisis and the Systemic Failure of Academic Economics."

Krugman's essay at the time, How Did Economists Get It So Wrong?, gave a nice history of economic ideas, the models behind them and his interpretation of their correctness.

The theoretical model that finance economists developed by assuming that every investor rationally balances risk against reward — the so-called Capital Asset Pricing Model, or CAPM (pronounced cap-em) — is wonderfully elegant

[snip]

Economics, as a field, got in trouble because economists were seduced by the vision of a perfect, frictionless market system.

[snip]

H. L. Mencken: “There is always an easy solution to every human problem — neat, plausible and wrong.”

I read this months ago and it's been percolating in my thoughts since then: My Manhattan Project - How I helped build the bomb that blew up Wall Street by Michael Osinski. Osinski wrote much of the software and models used to form CMOs and CDOs. Essentially the software aggregated debt instruments from mortgage and other debt markets and allowed a bond designer to issue a tailor-made portfolio of debt while mitigating the default risk of the debt via that aggregation. He called it his sausage grinder.

“You put chicken into the grinder”—he laughed with that infectious Wall Street black humor—“and out comes sirloin.”

Here's a large collection of links from that period that are worth reading. My thought at the moment is this nugget from Twitter:

Poormojo "Any sufficiently advanced financial instrument is indistinguishable from fraud."

Recipe for Disaster: The Formula That Killed Wall Street

Wall Street’s Math Wizards Forgot a Few Variables

Tales From Lehman’s Crypt

Economic View: Flaw in Free Markets: Humans

Andrew Lo: This is your brain on prosperity

A crisis of politics, not economics: Complexity, Ignorance, and policy failure.

Revenge of the Nerd: Paul Wilmott is out to save Wall Street's soul—one dork at a time.

A Conversation with David E. Shaw

Don't Blame The Quants by Steven Shreve

http://www.nytimes.com/2009/10/14/opinion/14trillin.html?em=&adxnnl=1&adxnnlx=1255543552-DPpTSk3i4f5lEJZALsigRA

Sciam: Does Economics Violate the Laws of Physics?

Systemic Risk and Fannie Mae

Geeks trump alpha males as algos dominate Wall St

Review of "Learning to Rank with Partially-Labeled Data"

2010-10-27T00:02:00.001-07:00

I've been attending the University of Utah Machine Learning seminar (when I can) this fall. PhD student Piyush Kumar Rai is organizing it.

I volunteered to take the group through Learning to Rank with Partially-Labeled Data by Duh & Kirchhoff. I have some experience researching and implementing LTR algorithms, mostly using reinforcement learning or ant-system type approaches. Some general intro here.

The paper presents the main Transductive Learning algorithm as a framework, then fills in the blanks with Kernel PCA and RankBoost. Several kernels are used: linear, polynomial, radial basis function and kNN-diffusion. RankBoost learns a kind of ensemble of 'weak learners' with simple thresholds.

The main reason to read the paper if you are already familiar with LTR is the use of the transductive algorithm.

Note the DISCOVER() & LEARN() functions. These are the unsupervised and supervised algorithm blanks they fill with Kernel PCA and RankBoost. What the first actually does is learn a function we could call EXTRACT() that can extract or create features for later use. They do show that the basic idea of layering in unlabeled data with labeled data is a net gain.
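The shape of that framework can be sketched in a few lines of Python. This is only an outline of the control flow as described above, under loud assumptions: the discover() stand-in just standardizes features using the unlabeled documents of one test query (it is not Kernel PCA), the learn() stand-in weights features by their covariance with the label (it is not RankBoost), and training data is flattened into (features, label) pairs instead of per-query lists.

```python
from statistics import mean, pstdev


def discover(docs):
    """Unsupervised step (stand-in for Kernel PCA): look only at the
    unlabeled docs of one test query and return an extract() function."""
    columns = list(zip(*docs))
    mu = [mean(c) for c in columns]
    sd = [pstdev(c) or 1.0 for c in columns]   # avoid dividing by zero

    def extract(doc):
        return [(x - m) / s for x, m, s in zip(doc, mu, sd)]

    return extract


def learn(examples):
    """Supervised step (stand-in for RankBoost): one weight per feature,
    equal to its covariance with the relevance label."""
    xs = [x for x, _ in examples]
    ys = [y for _, y in examples]
    y_bar = mean(ys)
    weights = []
    for j in range(len(xs[0])):
        col = [x[j] for x in xs]
        c_bar = mean(col)
        weights.append(sum((a - c_bar) * (b - y_bar) for a, b in zip(col, ys)) / len(ys))

    def score(x):
        return sum(w * v for w, v in zip(weights, x))

    return score


def transductive_rank(train, test_lists):
    """For each test query: DISCOVER on its docs, re-express the labeled
    training data with the extracted features, LEARN, then rank."""
    rankings = {}
    for qid, docs in test_lists.items():
        extract = discover(docs)
        score = learn([(extract(x), y) for x, y in train])
        scores = [score(extract(d)) for d in docs]
        rankings[qid] = sorted(range(len(docs)), key=lambda i: -scores[i])
    return rankings


train = [([1.0, 0.0], 0), ([2.0, 0.0], 1), ([3.0, 0.0], 2)]
test_lists = {"q1": [[0.5, 1.0], [2.5, 1.0], [1.5, 1.0]]}
print(transductive_rank(train, test_lists))
```

The point to notice is that a fresh model is trained for every test query, which is also where the poor running time discussed below comes from.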

There are some issues with the paper. First the computational time performance, as they admit, is not good. The other is that their use of Kernel PCA in an information retrieval context is a bit naive IMO. The IR literature is full of hard-won knowledge of extracting decent features from documents. See this book for example. This is mostly ignored here.

The more confusing thing is the use of K-Nearest Neighbor diffusion kernels. Basically they take the vector of documents, form a matrix by Euclidean distance and then random-walk the matrix for a set number of time-steps. The PCA then takes this 'kernel' output and solves the eigenvalue problem to get the eigenvectors. This all seems a roundabout way of saying they approximated the Perron-Frobenius eigenvector (sometimes called PageRank) by iterating the matrix a set number of times and zeroing out low-order cells. Or at least I see no effective difference between what they did and what I just described. Basically they just make the matrix sparse to solve it more easily (i.e. this is the dual).
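A rough pure-Python sketch of the steps as described in the paragraph above, to show how little is going on. It is not the paper's code: the function name, the unweighted symmetric neighbor graph and the default values of k and steps are all assumptions made for illustration.

```python
import math


def knn_diffusion_kernel(docs, k=2, steps=3):
    """Build a kNN graph from Euclidean distances, row-normalize it into
    a random-walk transition matrix and take `steps` steps of the walk."""
    n = len(docs)
    dist = [[math.dist(a, b) for b in docs] for a in docs]

    # Connect each document to its k nearest neighbors (symmetric edges).
    # Everything else stays zero, which is what makes the matrix sparse.
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for j in nearest:
            adj[i][j] = adj[j][i] = 1.0

    # Row-normalize so each row is a probability distribution.
    trans = [[v / sum(row) for v in row] for row in adj]

    # Repeated multiplication: the transition matrix to the power `steps`.
    kernel = trans
    for _ in range(steps - 1):
        kernel = [
            [sum(kernel[i][m] * trans[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return kernel


docs = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
for row in knn_diffusion_kernel(docs, k=1, steps=2):
    print([round(v, 2) for v in row])
```

Raising a transition matrix to a fixed power like this is the same repeated multiplication that power iteration uses to approach the dominant eigenvector, which is the comparison being made above.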

Their use of various classic IR features like TFIDF, BM25 etc. needed help. There's plenty of IR wisdom on how to use such features, so why let DISCOVER() wander about attempting to rediscover this? The results were also muddled, with only one of the three data sets showing a significant improvement over a baseline technique.

All that aside, it's worth a read for the intro to the transductive algorithm used with an IR-centric task.
