Friday, March 28, 2008

Scraping Documents for Advertising Keywords

Lately I've been working on extracting keywords from text that would be associated with good keyword advertising performance. This is fairly related to the 'text summarization' problem, yet that usually works towards a goal of readable summaries of documents. This is a simpler problem as I don't want to build readable summaries.

'Finding Advertising Keywords on Web Pages' from MS Research (Yih, Goodman, and Carvalho) was interesting reading. To boil it down to its essence, the authors used a collection of standard text indexing and NLP techniques and datasets to derive 'features' from the documents, then used a feature-selection method to decide what features were best in deciding good advertising keywords in a document. They judged the algorithms against a human generated set of advertising keywords associated with a group of web pages. Their 'annotators' read the documents then chose prominent words from the document to use as viable keyword advertising inputs.

Note that this is not an attempt to do topic classification, where you could produce a keyword describing a document that did not exist in the document.. for example labeling a news article about the Dallas Cowboys with 'sports' or 'event tickets' if those labels did not exist in the article.

Interestingly the algorithm learned that the most important features predicting a word's advertising viability was the query frequency in MSN Live Search (a dead obvious conclusion now supported by experiments), and the TF-IDF metric. Other features like capitalization, link text, phrase & sentence length and title/headings words were not as valuable alone.. yet (unsurprisingly) the best system used nearly all features. The shocker was that the part-of-speech information was best left unused.

I emailed the lead author and learned that the MS lawyers killed the idea of releasing the list of labeled URLs.

Post Script: The second author is Joshua Goodman, who had a hilarious exchange with some authors from La Sapienza University in Rome. They wrote a 2002 Physical Review Letters paper on using gzip for analyzing the similarity of human languages. Goodman responded with this critique, causing the original authors to respond with this response. Looks like there are other follow ups by third-parties. The mark of an effective paper is that it is talked about and remembered.

No comments: