Tuesday, December 27, 2011

Stanford's Introduction to Computational Advertising course

Andrei Broder and Vanja Josifovski, of Yahoo! Research, have wrapped up their Stanford course again this fall. As always, the lecture slides are a great introduction to the area.

MS&E 239: Introduction to Computational Advertising

Computational advertising is an emerging new scientific sub-discipline, at the intersection of large scale search and text analysis, information retrieval, statistical modeling, machine learning, classification, optimization, and microeconomics. The central problem of computational advertising is to find the "best match" between a given user in a given context and a suitable advertisement. The context could be a user entering a query in a search engine ("sponsored search"), a user reading a web page ("content match" and "display ads"), a user watching a movie on a portable device, and so on. The information about the user can vary from scarily detailed to practically nil. The number of potential advertisements might be in the billions. Thus, depending on the definition of "best match" this problem leads to a variety of massive optimization and search problems, with complicated constraints, and challenging data representation and access problems. The solution to these problems provides the scientific and technical foundations for the $20 billion online advertising industry.

This course aims to provide a good introduction to the main algorithmic issues and solutions in computational advertising, as currently applied to building platforms for various online advertising formats. At the same time we intend to briefly survey the economics and marketplace aspects of the industry, as well as some of the research frontiers. The intended audience are students interested in the practical and theoretical aspects of web advertising.


Summer Program on Computational Advertising

Deepak Agarwal of Yahoo! Research is organizing a summer program on computational advertising. It appears to be geared toward graduate students.

http://www.samsi.info/programs/summer-program-august-6-17-2012-computational-advertising

This two-week program will run from August 6 to August 17, 2012. The first week will be at the Radisson RTP in Research Triangle Park, NC, in close proximity to SAMSI. The first three days will be spent on technical presentations by leading researchers and industry experts, to bring everyone up to speed on the currently used methodology. On the fourth day, the participants will self-organize into working groups, each of which will address one of the key problem areas (people may join more than one group, and the organizers will try to arrange the working-group schedules to facilitate that). The second week will be spent at SAMSI headquarters in Research Triangle Park.


Tuesday, November 15, 2011

What Software Engineers should know about Control Theory

Over the years I've noticed an interesting gap in domain knowledge among CS and software people. Other than the few co-workers who majored in Electrical Engineering, almost no one has heard of the field of 'Control Theory'.

Control theory is an interdisciplinary branch of engineering and mathematics that deals with the behavior of dynamical systems. The desired output of a system is called the reference. When one or more output variables of a system need to follow a certain reference over time, a controller manipulates the inputs to the system to obtain the desired effect on its output.

Let's imagine that you write internet web services for a living: some REST or SOAP APIs that take an input and give an output.

Your boss walks up to you one day and asks for a system that does the following:
  • Create a webservice that calls another (or three) for data/inputs, then does X with them.
  • Meter the usage of those other web services.
  • Respond within Y milliseconds with good output or a NULL output.
  • Support high concurrency, i.e., don't use too many servers.

The problem is that these other third-party webservices are not your own. What is their response time? Will they give bad data? How should your webservice react to failures of the others?

Does this sound familiar? It should to many. This is the replicated-and-shared-connector problem (MySQL, memcached), the partitioned-services problem (federated search, large-scale search engines), and the API-as-a-service problem (Mashery, etc.).
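
That "respond within Y milliseconds with good output or a NULL output" requirement boils down to a deadline wrapper around each downstream call. Here is a minimal Node-style sketch of the idea; the function names and the 50 ms example deadline are hypothetical, not from any production system:

// Hypothetical sketch: bound a downstream call at deadlineMs; on timeout,
// hand back null so the caller can degrade gracefully.
function callWithDeadline(invokeService, deadlineMs, callback) {
  var finished = false;
  var timer = setTimeout(function () {
    if (!finished) { finished = true; callback(null); } // deadline hit: NULL output
  }, deadlineMs);
  invokeService(function (result) {
    if (!finished) {
      finished = true;
      clearTimeout(timer);
      callback(result); // good output arrived in time
    }
  });
}

// Usage: callWithDeadline(fetchFromThirdParty, 50, handleResultOrNull);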

There are two basic types of control relevant here:
  • Open loop (feed-forward): requires a good model of the system inputs and of the system's response to them.
  • Closed loop (feed-back): measures the system's output and feeds it back to correct the inputs.

Linear feedback brings with it the classic analysis topics:
  • Stability analysis
  • Frequency response
  • Response time

The adaptive schemes are as follows:
  • Gain scheduling
  • Model reference adaptive systems
  • Self-tuning regulators
  • Dual control

Here's one survey deck from a lecture. Unfortunately for software engineers, most presentations of the above are in linear-systems form rather than algorithmic form.
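
To give a flavor of the algorithmic form, here is a minimal sketch of a closed-loop (feedback) controller that adjusts a concurrency limit so that observed latency tracks a reference value. All names, gains, and bounds are invented for illustration; real gains need the tuning and stability analysis the lectures cover:

// Hypothetical sketch: a proportional-integral (PI) feedback controller.
// The reference is a target latency; the control input is the number of
// in-flight requests we allow against the third-party service.
var TARGET_MS = 50;      // reference signal: desired response time
var KP = 0.5, KI = 0.05; // proportional and integral gains (made up)

var limit = 20;          // control input: max concurrent requests
var errorSum = 0;        // integral-term accumulator

function onLatencySample(observedMs) {
  var error = TARGET_MS - observedMs; // positive error => headroom to grow
  errorSum += error;
  var delta = KP * error + KI * errorSum;
  // Clamp to keep the actuator (and the system) in a sane range.
  limit = Math.max(1, Math.min(200, Math.round(limit + delta)));
}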

Dr. Joe L. Hellerstein of Google and co-workers taught a more software-focused course at the University of Washington in 2008. He's also written a textbook on the subject and a few papers.

I'd like to see a 'software patterns' set created for easier use by software engineers. I'll attempt to present a couple of common forms as patterns in a future blog post.


Thursday, November 10, 2011

Open RTB panel - IAB Ad Ops Summit 2011

On Monday, November 7th, I was on an IAB Ad Ops panel about OpenRTB.



The clip shows an exchange after Steve from the IAB asked a question about how webpage inventory is described in RTB. I described an example of differentiating a simple commodity, barley.

Two of the major uses of barley in the US are animal feed and malting for making beer. Malting barley has specific requirements in terms of moisture content, protein percentage, and other factors. Farmers don't always know what quality their crop will finish at, so they count on having two general markets: if the tested quality meets malting standards, the premium over feed prices can be healthy. A 2011 report noted that malting barley provided a 70% premium over feedstock barley. Growing specific varieties and/or using organic farming methods can provide additional premiums over generic feed barley. The curious can follow the links below.

How does this relate to publishers and advertising and OpenRTB? In my opinion we need several things standardized:

1) Inventory registration and description API. This gives publishers influence over how their inventory is exposed in the various demand-side and trading-desk platforms. Publishers should fully describe their inventory in a common format; buy-side GUIs and algorithms will benefit from the increased annotation and categorization. It can also harmonize the brand-safety ratings that are currently not connected between the sell and buy sides. (A hypothetical sketch appears at the end of this post.)

2) Standardization of the emerging 'Private Marketplace' models in RTB. A set of best practices and trading procedures for PM needs to be defined such that the market can grow properly.

While the main bid request/response API of OpenRTB has been criticized as 'too late' given the large implementations already in production, it is not too late to define standards for the above. These things will help the buy side better differentiate quality inventory.
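
To make item 1 concrete, here is a purely hypothetical sketch of what a registered inventory description might look like. None of these field names come from the OpenRTB spec; they only illustrate the kind of common-format annotation the buy side could consume:

{
  "site_id": "example-pub-001",
  "domain": "news.example.com",
  "categories": ["news", "business"],
  "brand_safety_rating": "G",
  "placements": [
    { "id": "atf-leaderboard", "size": "728x90", "position": "above_the_fold" }
  ],
  "private_marketplace": { "eligible": true, "floor_cpm": 2.50 }
}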

Tuesday, July 12, 2011

SchemaMgr - MySQL schema management tool

SchemaMgr is a simple Perl tool to manage schema changes in MySQL databases. I wrote it in 2007/2008 and it has been used in production for many years.

https://github.com/nealrichter/schemamgr

Each change is assigned a version number and placed in a file. When the SQL in a file executes successfully, a special bookkeeping table is updated with that version number. Subsequent runs install only the higher-versioned files (the logic is sketched after the usage listing below).

It can also be used to reinstall views and stored procedures.

The best practice is to copy the file and change X in the filename and in the $DB_NAME variable.

$ ./bin/schemamgr_X.pl
Usage: either create or upgrade X database
schemamgr_X.pl -i -uUSERNAME -pPASSWORD [-vVERSION] [-b]
    updates DB to current (default) or requested version
schemamgr_X.pl -s -uUSERNAME -pPASSWORD
    reinstalls all stored procedures
schemamgr_X.pl -w -uUSERNAME -pPASSWORD
    reinstalls all views
schemamgr_X.pl -q -uUSERNAME -pPASSWORD
    requests and prints current version
Optional Params
    -vXX -- upgrades up to a specific version number XX
    -b   -- backs up the database (with data) before upgrades
    -nYY -- runs the upgrades against database YY (default is X)

By convention you create ONE create_X_objects_v1 file with a date. All other files are update files with version numbers greater than v1.

build/
|-- create_X_objects_v1_20110615.sql
|-- update_X_objects_v2_20110701.sql
`-- update_X_objects_v3_20110702.sql
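
The version-gating logic is small enough to sketch. Here is a hypothetical JavaScript outline of the idea; schemamgr itself is Perl, and the filename regex and bookkeeping details below are assumptions rather than its actual implementation:

// Hypothetical sketch of the version-gating logic (schemamgr itself is Perl).
// A one-row bookkeeping table records the installed schema version; only
// files with a higher version number are applied, lowest first.
function pendingMigrations(files, installedVersion) {
  return files
    .map(function (f) {
      var m = /_v(\d+)_/.exec(f); // e.g. update_X_objects_v3_20110702.sql
      return { file: f, version: m ? parseInt(m[1], 10) : 0 };
    })
    .filter(function (e) { return e.version > installedVersion; })
    .sort(function (a, b) { return a.version - b.version; });
}
// After each file's SQL runs successfully, the bookkeeping table is
// updated to that file's version, making reruns idempotent.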


Tuesday, June 28, 2011

Managing yourself to tasks and finishing them.

I saw these two short articles from HBR come across my Twitter stream and read them. They stuck with me for more than a week because they triggered some connections with proof methods.

The essence of the first article is "Prep-Do-Review": for each task, make and review a plan before starting; once the task is completed, review it against the plan. Did you finish? What did you learn? What would you do differently next time?

The essence of the second is to think in terms of "to-go" rather than "to-date" performance on a task. With to-date thinking it's very easy to see how much you have accomplished so far, which can lower your ongoing effort or let you become distracted and work on other tasks.

So the optimal algorithm that combines the two is:

prepForWork();
var x;
do {
  incrementalWork();
  x = distanceToFinish(); // the "to-go" metric
} while (x > 0);
reviewResults();

Don't look back until you are done!

Note that there is a formal method of proof in mathematics and CS that goes something like this:

1) Define an integer-valued metric f(x) measuring the distance to the goal.
2) Establish the starting distance.
3) Show that your algorithm/method strictly decreases f(x) at every step.
4) Infer that the goal will be reached.
5) (Optional) Bound the number of steps required to reach the goal.
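
In symbols, this is the standard monovariant (well-ordering) argument, sketched in LaTeX:

\[
f : S \to \mathbb{N}, \qquad f(s_0) = d, \qquad f(s_{k+1}) \le f(s_k) - 1
\;\Longrightarrow\; f(s_k) = 0 \text{ for some } k \le d.
\]

Because f is integer-valued, bounded below by zero, and strictly decreasing, it cannot decrease forever; the goal is reached in at most d steps, which is exactly the bound from step 5.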

The important point is the rigor of the mathematical proof version. Your idea gets no partial credit for progress so far; the algorithm either gets to the goal or it fails. The proof is either true or it is false. Thus you are either Done or you are NOT Done.


Thursday, April 07, 2011

JSON parsing speed in various Node.JS versions

We use Node.JS for a very high capacity service at the Rubicon Project. It often drives or handles in excess of 10B HTTP requests per day, sending or receiving JSON data.

Out of curiosity I ran some tests on JSON parsing speed in different versions of Node.JS.

node.js code:

var sys = require('sys');
var data = "{ \"item_uuid\": \"8ec56438-d3cf-442a-bbf7-7f076f229f35\", \"return_code\": 0, \"data\": [ { \"valid\": true, \"votes\": 2345, \"date\":\"Thu, 07 Apr 2011 15:17:17 EDT\", \"headline\": \"Senate Majority Leader Harry Reid indicates there likely will be a government shutdown on Friday. Lawmakers have been unable to agree on a new federal budget\", \"source\": \"Yahoo News\", \"published\":{\"hour\":\"19\",\"timezone\":\"UTC\",\"second\":\"17\",\"month\":\"4\",\"minute\":\"17\",\"utime\":\"1302203837\",\"day\":\"7\",\"day_of_week\":\"4\",\"year\":\"2011\"} } ] }";

try {
  // Re-parse the same document 1,000,000 times to exercise JSON.parse.
  for (var i = 0; i < 1000000; i++) {
    var tmp = JSON.parse(data);
  }
} catch (e) {
  sys.puts("ERROR: on parsing JSON with v8 parser");
}

// Sanity check: print the input and a round-tripped copy.
sys.puts(data);
var tmp = JSON.parse(data);
sys.puts(JSON.stringify(tmp));
sys.puts("\n DONE \n");
process.exit();

Essentially this re-parses the same example JSON (I created a fake RSS-like JSON package) 1M times.

Here are the results:
  • Node 0.1.3x: real 0m30.050s
  • Node 0.2.6: real 0m30.050s
  • Node 0.3.8: real 0m9.915s
  • Node 0.4.5: real 0m9.999s

For reference I ran the same test against a very fast tokenizing parser in C called jsmn, and a C++ one called vjson.
  • jsmn: real 0m2.276s
  • vjson: real 0m7.465s (note that vjson is a destructive parser, and I had to fix that before testing)

Interestingly, the JSON parser in Node 0.4.5 and prior versions appears to be written in pure Javascript. See the file: node-v0.4.5/deps/v8/src/json.js

It's unclear whether the speed improvements are a result of improvements to the parser implementation or of some efficiency leap in the V8 versions bundled with each Node release.
  • Node 0.1.33: v8: 2010-03-17: Version 2.1.5
  • Node 0.2.6: v8: 2010-08-16: Version 2.3.8
  • Node 0.3.8: v8: 2011-02-02: Version 3.1.1
  • Node 0.4.5: v8: 2011-03-02: Version 3.1.8


Tuesday, March 22, 2011

Pamela Samuelson on startups and software patents

Following up on my last post, here is the view from Pamela Samuelson:

Why software startups decide to patent ... or not

Two-thirds of the approximately 700 software entrepreneurs who participated in the 2008 Berkeley Patent Survey report that they neither have nor are seeking patents for innovations embodied in their products and services. These entrepreneurs rate patents as the least important mechanism among seven options for attaining competitive advantage in the marketplace. Even software startups that hold patents regard them as providing only a slight incentive to invest in innovation.

She also lists a variety of reasons why these software entrepreneurs decided to forgo patenting their last invention. It's a very interesting write-up.


Thursday, March 10, 2011

Comments re "The Noisy Channel: A Practical Rant About Software Patents"

The Noisy Channel: A Practical Rant About Software Patents - [My comments cross-posted here]

Daniel, nice writeup.

I worked for a BigCo and filed many patents. It was a mixed bag. The time horizon is so long that even after I've been gone for 3.5 years, many of them are still lost in the USPTO. The average time for me to see a patent granted was 5+ years.

Here are my biased opinions:

1) Patents really matter for BigCos operating on a long time horizon. It’s a strategic investment.

2) Patents are nearly worthless for a Startup or SmallCo. The time horizon is way past your foreseeable future, and thus the whole effort is akin to planning for an alternate reality different from the current business context. Throwing coins in a fountain for good luck is about as relevant. You are simply better off getting a filing date on a provisional design write-up and hiring an engineer with the money you'd spend on patent lawyers.

3) To an acquiring company evaluating a target, provisional or pending patents are a liability, not an asset. They take time and resources to push to completion for a strategy of deterrence.

4) Patents are mostly ignored in the professional literature. Take sentiment analysis as one example: academic publishing on it exploded in 2001, yet there are more than a few older patents containing good technical work on sentiment analysis. I've NEVER seen an algorithm from a patent cited in a paper as previous work. And I have seen academic papers with algorithms already 90% covered by an older patent… and the papers are cited as 'novel work'.

5) Finding relevant patents is ludicrously hard. It might be the most challenging problem in IR with respect to a corpus, IMO. Different words mean the same thing and vice versa, due to the pseudo-ability in a patent to redefine a word away from its obvious meaning. Two different lawyers rendering the same technical design into a write-up and claims will produce wildly different work product.

6) I’ve seen some doosey granted Patents. Things that appear to either be implementations of very old CS ideas into new domains.. or worse stuff that would be a class project as an undergrad.

It’s just plain ugly in this realm.



On Strategic Plans

This needs absolutely no comment.

“We have a ‘strategic plan.’ It’s called doing things.” ~ Herb Kelleher


Sunday, March 06, 2011

Hilarious system calls in the BeOS

These system calls in the BeOS still make me smile.

int32 is_computer_on(void)

Returns 1 if the computer is on. If the computer isn't on, the value returned by this function is undefined.

double is_computer_on_fire(void)

Returns the temperature of the motherboard if the computer is currently on fire. If the computer isn't on fire, the function returns some other value.

#include <stdio.h>
#include <be/kernel/OS.h>

int main()
{
    printf("[%d] = is_computer_on()\n", is_computer_on());
    printf("[%f] = is_computer_on_fire()\n", is_computer_on_fire());
    return 0;
}

These functions serve a similar purpose to getpid() in Unix: essentially no-op calls that can be used to test the kernel's intrinsic response time under load.

A write-up of BeOS history is here; Haiku is an open-source clone of the BeOS that is, curiously, still under active development.


Friday, March 04, 2011

Contractor Needed: HTML/CSS/Javascript Ninja

The Rubicon Project is looking for an in-browser HTML/CSS/Javascript Ninja to restructure the workflow of an application GUI. The server-side code is Perl/mod_perl. Please contact me if you are interested and available. The contract is 4-6 weeks.


Thursday, March 03, 2011

Job Post: Software Engineer/Scientist: Ad Serving, Optimization and Core Team

LOCATION: the Rubicon Project HQ in West Los Angeles or Salt Lake City

the Rubicon Project is on a mission to automate buying and selling for the $65 billion global online advertising industry. Backed by $42 million in funding, we are currently looking for the best engineers in the world to work with us.

Team Description


The mission of the Core Team is to build robust, scalable, maintainable and well documented systems for ad serving, audience analytics, and market analysis. Every day we serve billions of ads, process terabytes of data and provide valuable data and insights to our publishers. If building software that touches 500+ million people every month is interesting to you, you'll fit in well here.

Some of the custom software we've built to solve these problems includes:

  • A patented custom ad engine delivering thousands of ad impressions per second, with billions of real-time auctions daily
  • A real-time bid engine designed to scale out to billions of bid requests daily
  • Optimization algorithms capable of scheduling and planning ad-serving opportunities to maximize revenue
  • Client-side Javascript that performs real-time textual analysis of web pages to extract semantically meaningful data and structures
  • A web-scale key-value store, based on ideas from the Amazon Dynamo paper, used to store 100s of millions of data points
  • A unique audience classification system using technologies such as Solr and Javascript for rich, real-time targeting of web site visitors
  • Data mining of buying and selling strategies from a torrent of transactional data
  • Analytics systems capable of turning a trillion data points into real business insight

Job Description


Your job, should you accept it, is to build new systems, add new features, and extend the functionality of our existing systems. You will be expected to architect new systems from scratch, add incremental features to existing systems, fix bugs in other people's code, and help manage production operations of the services you build. Sometimes you'll have to (or want to) do this work when you are not in the office, so working remotely can't scare you off.

Most of our systems are written in Perl, Java, and C, but we have pieces of Python, Clojure and server-side Javascript as well. Hopefully you have deep expertise in at least one of these; you'll definitely need to have a desire to quickly learn and work on systems written in all of the above.

You should also have worked with and/or designed service-oriented architectures, advanced DB schemas, big-data processing, and highly scalable and available web services, and be well aware of the issues surrounding the software development lifecycle. We expect that your resume will itemize your 3+ years of experience, mention your BS or MS in Computer Science, and be Big Data Buzzword Compliant.

Bonus points for experience with some of the technologies we work with:

  • Hadoop
  • NodeJS
  • MySQL
  • Solr/Lucene
  • RabbitMQ
  • MongoDB
  • Thrift
  • Amazon EC2
  • Memcached
  • MemcacheQ
  • Machine Learning
  • Optimization Algorithms
  • Economic Modeling


Monday, February 28, 2011

A note on software teams and individuals

I'm currently running a loosely coupled team of people all working on a common initiative. While this is not my first time running a team, the same set of things seems to happen with every 'new' team. Here's a quick set of observations.

The first major observation is that teams of engineers can quickly fall into operating like a "golf team" versus a "football team". In golf, each team member generally competes as an individual against all the other players (and their teams); a given team wins if its individual players collectively do better than some other team's players. Football (or soccer or basketball) is very different: a team wins in the face of good opposition only if it plays as a team.

For software teams, done means one thing: the team is done with the milestone or project. Done means finished, tested, and shipped code. Done does not mean "my part works" or "my tasks are done".

IMO each team member should answer these questions to the group every day:
  1. What direction am I going relative to team goals?
  2. What specific items am I working on today?
  3. Does anyone need any help from me?
  4. Do I need any help with my work?
Team managers, both the overall and functional leads, should ask or answer these questions for the group every day:
  1. Are we as a group going the right direction (towards the goal)?
  2. Will we meet the timeline and/or functional goals?
  3. Is there any functional or task ambiguity that needs working out?
  4. Are any course corrections needed?
The second observation is that there are two major indicators of whether a given individual is a good addition to the team:
  1. Does this person communicate well and often?
  2. Does this person have the capability and desire to resolve ambiguity on their own when possible?
The second trait, resolving ambiguity, is in my opinion the primary thing a software hiring manager needs to establish about a given candidate... assuming, of course, the candidate has the needed skills.

Much of this also circles back on a blog post that Jordan Mitchell wrote years ago when I was hip-deep in code at Others Online.


Monday, February 07, 2011

RightNow - Our cowboys ride code


This is a neat little ad in the January Delta Sky Magazine for RightNow Technologies, where I worked from 1999-2007.

RightNow Technologies serves about 10 billion [customer interactions] a year through the companies and institutions it works with. "Every person in North America has used one of our solutions about 25 times," says Gianforte.

Why RightNow keeps its headquarters in Bozeman, MT:

The quality of life here is a huge advantage, but more importantly, says Gianforte, "there's a ranch saying around here that goes, 'When something needs to get done, well then, we're just gonna get 'er done.' In many environments, they have to form a committee, pull in consultants and such to make things happen, but our clients appreciate that when something needs to get done, we can easily make that happen because of the work ethic here."

Sunday, January 30, 2011

The Provenance of Data, Data Branding and "Big Data" Hype

The credibility of where data comes from in all these "big data" plays is absolutely crucial. Waving hands re "algorithms" won't cut it. ~ @nealrichter, Jan 27, 2011 tweet

To expand on this tweet, here's the argument: if one of your key products as a startup or business is to "crunch data" and derive or extract value from it, then you should be concerned about data provenance. This is true whether you are crunching your own data or third-party data.

Some examples:
  • Web analytics - crunch web traffic and distill visitation and audience analytics reports for web site owners. Often they use these summaries to make decisions and sell their ad-space to advertisers.
  • Semantic Web APIs - crunch webpages, tweets etc and return topical and semantic annotations of the content
  • Comparison shopping - gather up product catalogs and pricing to aggregate for visitors
  • Web publishers - companies who run websites
  • Prediction services - companies that use data to predict something
In each of the above categories, the provenance of the input data and the brand of the output data are key. For each category one could name a company with either solid-gold data OR a powerful brand name and good-enough data. Conversely, we can find examples of companies with great tech but crappy data, or a weak brand.

For web publishers, those that host user-generated content have poor provenance in general compared to, say, news sites. A notable exception is Wikipedia, which has a pure "UGC" model but solid community processes and standards to improve the provenance of its articles (those without references are targeted for improvement).

In comparison shopping, Kayak.com has good data (directly from the airlines) and has built a good brand. The same is true of PriceGrabber and Nextag. TheFind.com, on the other hand, appears to have great data and tech, but no well-known brand.

(I'm refraining from going into specific examples or opinions on big data companies to avoid poking friends in the eye.)

The issue of provenance and branding is especially important in sales situations where you are providing a tool (analytics) that helps your customer (a salesperson) sell something to a third party (their customer). If the input data you are using has either a demonstrable provenance or a good brand, you'll have an easier time convincing people that the output of your product is worth having (and reselling).

The old saying for this in computer science is Garbage In, Garbage Out.

In the "big data" world of startups that is blowing by Web 2.0 as the new hotness, there is a startling lack of concern about data provenance. The essential ethos is that if we (the Data Scientists) accumulate enough data and crunch it with magical algorithms, then solid-gold data will come out... or at least that's what the hype machine says.

The lesson from the financial meltdown is that magical algorithms making CDOs, CMOs, and other derivatives should be viewed with a lens of mistrust. The GIGO principle was forgotten, and no one even cared about the provenance (read: credit quality) of the base financial instruments making up the derivatives. The credit rating agencies were just selling their brand and cared little about quality.

In my opinion, there is a clear parallel here to "big data". Trust must be part of the platform and not just tons of CPUs and disk-space. A Brand is a brittle object that is easily broken, so concentrate on quality.


Friday, January 14, 2011

Finance for Engineers

Last summer I took a great mini-course on finance at MIT Sloan. It's essentially a breadth-first review of the MBA course, complete with three case studies and a review of project-evaluation methods via net present value (NPV) analysis. Approximately 80% of the attendees were engineers/techies with 10+ years of experience, and maybe 25% had PhDs.

The first case study is Wilson Lumber from Harvard. The material is copyrighted, yet these links look like accurate distillations by business students.

The initial position is that Wilson Lumber is a growing small business with good suppliers and loyal customers. Volume and revenue are all up period over period. The question is whether the bank should increase its line of credit to fund the business. Once you break down the financial statements and model the business, the answer is no: Mr. Wilson is overextended by many measures and is growing at the expense of his balance sheet, so loaning him money will only make the problem bigger down the road. His basic options are to take in a partner as co-owner for cash, go broke, or raise prices to lower volume, improve margins, and slowly rebuild the balance sheet.

We then went through two NPV exercises. The first was a basic go/no-go analysis of an engineering project, done bottom-up by putting all cost/benefit assumptions into a model and iterating through possibilities. The second was an analysis of a joint venture between two biotech companies; everything from external capital and deal structure to market-penetration projections was worked in. Very informative, and pretty interesting work for engineers to do once the terminology and methods were explained.
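
For the curious, the NPV mechanics behind these exercises are tiny. Here is a minimal sketch in JavaScript; the 10% rate and the cash-flow stream are invented for illustration:

// Hypothetical sketch: net present value of a cash-flow stream at rate r.
// cashflows[t] is the net cash in period t; index 0 is the upfront outlay.
function npv(rate, cashflows) {
  return cashflows.reduce(function (acc, c, t) {
    return acc + c / Math.pow(1 + rate, t);
  }, 0);
}

// A $100 project returning $40/year for 3 years at a 10% hurdle rate:
// npv(0.10, [-100, 40, 40, 40]) ~= -0.53, so the project just misses.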

Professor Jenter shared two amusing anecdotes:
  • His MIT and Stanford MBA students often run off to found start-ups and forget the basic Wilson Lumber case. By the time they approach him for help it's too late, and they are in Mr. Wilson's position: shut down, take in $$ with lots of equity dilution (and loss of control), or slow growth dramatically.
  • He also offered a quote along the lines of "Startups founded by MIT PhDs fail at a rate far above average."
This certainly hammered home the lesson that strategic planning for growth is very important, even for what look like non-hyper-growth (software) companies. I'd recommend this course to any engineer wanting a quick, structured intro to basic financial management.
