aicoder: natural language processing

Friday, March 28, 2008

Semantic Features for Contextual Advertising

Andrei Broder's group at Yahoo! Research has a focus on Computational Advertising. At SIGIR 2007 they released a paper on using Semantic Taxonomy to do contextual matching for advertising. This is a similar problem to the previous post about MS Research, deriving lists of keywords from a document to use as queries to an advertising system. Unlike the MS Research paper, Yahoo has built a large taxonomy of "commercial interest queries" with 6000 nodes and approx 100 items attached to each node.

The essential approach is to classify a document into the taxonomy as well as all of the ads and match ads to documents on the basis of topical distance. The distance score is combined with a more standard IR type approach forming a combined score. The top-k matching ads ordered by lowest distance are the ads displayed the page.

The TaxScore() function is fairly interesting, it attempts to generalize the given term within the taxonomy. It seems that this type of approach could work well with using WordNet's Hypernyms in a more regular IR/Search setting.

I have to read it again more carefully to see if I missed it, however I did not see anywhere in the formulas using any weighting of a keyword's bid value (or advert count). Maybe this was omitted for trade secrecy?? .. it seems obvious that it should be used to some degree to maximize $$ yield or eCPM of clicked ads. The idea is not to let it affect the matching of the ads to keywords, just the final rank order to some degree.

In my own experiments @ OO, using some proxy for bid value seems to increase eCPM. The biggest challenge is getting comprehensive data for your dictionary if you are not Google, Yahoo or MS.

Postscript:
I have it confirmed from two independent sources (current and ex Y!ers) that Yahoo is working in a new Content Match codebase as the old version didn't work. Hard to say what status Broder's above technique is in (production usage or internal testing).. or if it was part of the old system?

Scraping Documents for Advertising Keywords

Lately I've been working on extracting keywords from text that would be associated with good keyword advertising performance. This is fairly related to the 'text summarization' problem, yet that usually works towards a goal of readable summaries of documents. This is a simpler problem as I don't want to build readable summaries.

'Finding Advertising Keywords on Web Pages' from MS Research (Yih, Goodman, and Carvalho) was interesting reading. To boil it down to its essence, the authors used a collection of standard text indexing and NLP techniques and datasets to derive 'features' from the documents, then used a feature-selection method to decide what features were best in deciding good advertising keywords in a document. They judged the algorithms against a human generated set of advertising keywords associated with a group of web pages. Their 'annotators' read the documents then chose prominent words from the document to use as viable keyword advertising inputs.

Note that this is not an attempt to do topic classification, where you could produce a keyword describing a document that did not exist in the document.. for example labeling a news article about the Dallas Cowboys with 'sports' or 'event tickets' if those labels did not exist in the article.

Interestingly the algorithm learned that the most important features predicting a word's advertising viability was the query frequency in MSN Live Search (a dead obvious conclusion now supported by experiments), and the TF-IDF metric. Other features like capitalization, link text, phrase & sentence length and title/headings words were not as valuable alone.. yet (unsurprisingly) the best system used nearly all features. The shocker was that the part-of-speech information was best left unused.

I emailed the lead author and learned that the MS lawyers killed the idea of releasing the list of labeled URLs.

Post Script: The second author is Joshua Goodman, who had a hilarious exchange with some authors from La Sapienza University in Rome. They wrote a 2002 Physical Review Letters paper on using gzip for analyzing the similarity of human languages. Goodman responded with this critique, causing the original authors to respond with this response. Looks like there are other follow ups by third-parties. The mark of an effective paper is that it is talked about and remembered.