Thursday, April 17, 2008

Text Summarization and Advertising

Recently read another CompAdvert paper (CIKM'07) from the Yahoo group, Just-in-Time Contextual Advertising. They describe a system where the current page is scraped on-line via a javascript tag, summarized and then that summary is passed to servers to match with Ad listings. Interesting points are:
  • 5% of page text carefully chosen can yield 97+% of full-text advert matching relevance
  • Best parts of the document are URL, referring URL, title, Meta and headings.
  • Adding in a classification against a topical taxonomy adds to the accuracy of the ad matches.
  • They judged the ad matching relevance against human judgments of ad to page relevance.
I found these papers within the last few months as OthersOnline focused on behavioral based advertising. In many ways their finding are interesting, affirming and unsurprising. Interesting in that they are pushing the state of the art in advert matching, and affirming in that we @ OO are on the right track. Unsurprising in that using the document-fields about is the classic approach to indexing webpages and documents.
Of course internet search engines used this for years (it defines the SEO industry's eco-system), and the old/retired open source engine HtDig has had special treatment of those fields since the late 90s. The difference now is the direction, the documents are the "query" and the hits are the ads. Best part about the method is that it's cheap... javascript + the browser becomes your distributed spider and summarizer of the web.
I do love finding these papers.. we just don't have the time or resources to have a study like this and confirm the approach with a paid human factors study. Just go forward on gut educated feel day to day and the human measure is if we get clicks on the ads.
This approach is similar to the one we outlined and implemented before finding this paper. The difference is what we do with the resulting "query", using the signal to learn a predictive interest model of users.
Still no mention of any relative treatment of words within the same field... one would assume this would move the needle on relevance as well.
I still believe that this type of summarization approach can be used to make an implicit page tagger and social recommender like del.icio.us ... if you can filter the summary based upon some knowledge of the users real (as opposed to statistical) interests. Key route to auto-personalization of the web.

No comments: