Sunday, March 30, 2008

OpenOffice and LaTeX

I am stunned. I am working on the dissertation and need to make a LaTeX monograph out of some publications. The biggest pain is that I had written two of the papers in MS Word, and previous experience cutting and pasting from Word into a text file was horrible. I also no longer have MS Word installed on my new machine (it didn't come pre-installed from Dell), so I have been using OpenOffice. So how do I get from Word to LaTeX when I only have OO?

More as procrastination than anything else, I did a search and found Writer2LaTeX. Some saint named Henrik Just created this to convert OO docs (either sxw or odf format) to LaTeX. It looks like he has been working on it for a while, with a set of faithful followers submitting feature requests. Between OO 2.4 being able to flawlessly open the MS Word docs and save them to ODF, and Writer2LaTeX handling the conversion, I probably saved myself two days of irritating work.

It's not perfect, as I still need to convert some equation images to proper LaTeX, yet the mind-numbing job of cutting and pasting is rendered unnecessary.

Thanks Henrik and the rest of the OpenOffice team!

Postscript:
While not open source, TeXaide from Design Science is a very nice free LaTeX equation editor. Write your equation, then highlight and copy it... it pastes magically as text markup into your LaTeX code. Now if only I could drag and drop an image onto it and have it convert the equation in the image.

Friday, March 28, 2008

Semantic Features for Contextual Advertising

Andrei Broder's group at Yahoo! Research has a focus on Computational Advertising. At SIGIR 2007 they released a paper on using a semantic taxonomy to do contextual matching for advertising. This is a problem similar to the one in the previous post about MS Research: deriving lists of keywords from a document to use as queries to an advertising system. Unlike the MS Research paper, Yahoo! has built a large taxonomy of "commercial interest queries" with 6000 nodes and approximately 100 items attached to each node.

The essential approach is to classify the document into the taxonomy, classify all of the ads into the same taxonomy, and then match ads to documents on the basis of topical distance. The distance score is combined with a more standard IR-style relevance score to form a combined score, and the top-k matching ads (those with the best combined scores) are displayed on the page.
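
Here is a rough sketch of how such a combined score might be wired up. The closeness and similarity functions below are crude stand-ins of my own (path overlap and Jaccard), not the paper's actual TaxScore or IR formulas, and alpha is a made-up blend weight.

    from dataclasses import dataclass

    # Sketch only: a "category" here is the path of taxonomy node names from the
    # root down to the assigned node, e.g. ("Consumer Electronics", "Cameras").
    @dataclass
    class Item:
        category: tuple  # taxonomy path for the page or the ad
        terms: set       # bag of words from the page text or the ad creative

    def taxonomy_closeness(a, b):
        """Fraction of the shorter path shared from the root -- a crude stand-in
        for the paper's taxonomy-distance component."""
        shared = 0
        for x, y in zip(a, b):
            if x != y:
                break
            shared += 1
        return shared / max(1, min(len(a), len(b)))

    def syntactic_score(a_terms, b_terms):
        """Jaccard overlap as a stand-in for the standard IR similarity."""
        if not a_terms or not b_terms:
            return 0.0
        return len(a_terms & b_terms) / len(a_terms | b_terms)

    def combined_score(page, ad, alpha=0.8):
        # alpha is an invented blend weight, not the paper's.
        return (alpha * taxonomy_closeness(page.category, ad.category)
                + (1 - alpha) * syntactic_score(page.terms, ad.terms))

    def top_k_ads(page, ads, k=5):
        return sorted(ads, key=lambda ad: combined_score(page, ad), reverse=True)[:k]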

The TaxScore() function is fairly interesting: it attempts to generalize the given term within the taxonomy. It seems this type of approach could work well with WordNet hypernyms in a more regular IR/search setting.
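
As a rough illustration of that generalization idea (not the paper's TaxScore), here is what walking up WordNet hypernyms looks like with NLTK, assuming the WordNet corpus has been downloaded:

    # Requires the NLTK package and its WordNet data (nltk.download('wordnet')).
    from nltk.corpus import wordnet as wn

    def generalizations(term, max_levels=2):
        """Collect hypernyms (more general terms) up to max_levels above each sense."""
        found = set()
        frontier = wn.synsets(term)
        for _ in range(max_levels):
            parents = [h for s in frontier for h in s.hypernyms()]
            found.update(name.replace('_', ' ') for s in parents for name in s.lemma_names())
            frontier = parents
        return found

    print(generalizations('camera'))
    # Yields broader terms such as 'photographic equipment' and 'equipment' --
    # candidates a matcher could also score against, at a discount versus the exact term.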

I have to read it again more carefully to see if I missed it; however, I did not see the formulas use any weighting by a keyword's bid value (or advert count). Maybe this was omitted for trade secrecy?... it seems obvious that it should be used to some degree to maximize $$ yield or eCPM of clicked ads. The idea is not to let it affect the matching of ads to keywords, just the final rank order to some degree.
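
The kind of bid-aware adjustment I have in mind would only touch the final ordering of already-matched ads; something like this sketch, where the log dampening and alpha are arbitrary choices of mine:

    import math

    def rerank_with_bids(matched_ads, relevance, bid, alpha=0.2):
        """Re-order already-matched ads, nudging the order by bid value.

        matched_ads: ids of ads that survived the relevance matching step
        relevance:   dict of ad id -> relevance score from the matcher
        bid:         dict of ad id -> advertiser bid (or a proxy such as advert count)
        """
        def score(ad):
            # log dampening keeps a huge bid from overwhelming relevance
            return relevance[ad] * (1.0 + alpha * math.log1p(bid.get(ad, 0.0)))
        return sorted(matched_ads, key=score, reverse=True)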

In my own experiments @ OO, using some proxy for bid value seems to increase eCPM. The biggest challenge is getting comprehensive data for your dictionary if you are not Google, Yahoo or MS.

Postscript:
I have it confirmed from two independent sources (current and ex Y!ers) that Yahoo is working on a new Content Match codebase, as the old version didn't work. Hard to say what status Broder's technique above is in (production usage or internal testing)... or whether it was part of the old system.

Scraping Documents for Advertising Keywords

Lately I've been working on extracting keywords from text that would be associated with good keyword-advertising performance. This is fairly related to the 'text summarization' problem, but that usually works toward the goal of readable summaries of documents. Mine is a simpler problem, as I don't need to build readable summaries.

'Finding Advertising Keywords on Web Pages' from MS Research (Yih, Goodman, and Carvalho) was interesting reading. To boil it down to its essence, the authors used a collection of standard text-indexing and NLP techniques and datasets to derive 'features' from the documents, then used a feature-selection method to determine which features best identified good advertising keywords in a document. They judged the algorithms against a human-generated set of advertising keywords associated with a group of web pages. Their 'annotators' read the documents and then chose prominent words from each document as viable keyword-advertising inputs.

Note that this is not an attempt to do topic classification, where you could produce a keyword describing a document that does not itself appear in the document... for example, labeling a news article about the Dallas Cowboys with 'sports' or 'event tickets' when those terms do not appear in the article.

Interestingly, the algorithm learned that the most important features for predicting a word's advertising viability were its query frequency in MSN Live Search (a dead obvious conclusion, now supported by experiments) and its TF-IDF score. Other features like capitalization, link text, phrase and sentence length, and title/heading words were not as valuable alone... yet (unsurprisingly) the best system used nearly all of the features. The shocker was that part-of-speech information was best left unused.
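
To make the setup concrete, here is a toy reconstruction of the pipeline as I understand it: compute a few of the features above for each candidate word, then train a classifier on the annotator labels. The feature values, the tiny training set, and the scikit-learn model choice are all my own stand-ins, not the paper's.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Toy stand-ins for the paper's feature sources; the real system used search-log
    # query frequencies, tf-idf over a large corpus, layout/title features, and more.
    QUERY_LOG_FREQ = {"camera": 0.9, "lens": 0.7, "tripod": 0.4, "the": 0.0}

    def features(word, doc_words, title_words):
        tf = doc_words.count(word) / max(1, len(doc_words))
        return [
            QUERY_LOG_FREQ.get(word, 0.0),        # query frequency in a search log
            tf,                                   # crude term frequency (stand-in for TF-IDF)
            1.0 if word in title_words else 0.0,  # appears in the title
        ]

    # Tiny fabricated training set: label 1 means the human annotators chose the
    # word as an advertising keyword for its page, 0 means they did not.
    doc = "the camera ships with a lens and the tripod is sold separately".split()
    title = {"camera", "lens"}
    candidates = ["camera", "lens", "tripod", "the"]
    X = np.array([features(w, doc, title) for w in candidates])
    y = np.array([1, 1, 1, 0])

    model = LogisticRegression().fit(X, y)
    # Rank candidate words from a page by predicted probability of being a good keyword.
    ranked = sorted(candidates, reverse=True,
                    key=lambda w: model.predict_proba([features(w, doc, title)])[0, 1])
    print(ranked)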

I emailed the lead author and learned that the MS lawyers killed the idea of releasing the list of labeled URLs.

Postscript: The second author is Joshua Goodman, who had a hilarious exchange with some authors from La Sapienza University in Rome. They wrote a 2002 Physical Review Letters paper on using gzip to analyze the similarity of human languages. Goodman responded with this critique, prompting the original authors to reply with this response. It looks like there are other follow-ups by third parties. The mark of an effective paper is that it is talked about and remembered.

Frequent Pattern Mining

In the weekly RightNow AI Colloquium @ MSU CS, we read a paper by Jiawei Han et al. called "Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach".
Basically the problem is this: given a sequence of transactions, each involving multiple items, what are the frequent itemsets? Itemsets are groups of m items that tend to be purchased together. The earlier SQL-based version of FP-Tree looks interesting as well.

FP-Tree beats the Apriori association rule mining algorithm (Java Code) in time complexity. Not sure how it compares with the raft of other algorithms available at the FIMI repository.
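
For anyone unfamiliar with the problem, here is a toy Apriori-style miner, nothing like FP-Tree in efficiency, just the definitions made concrete:

    from itertools import combinations
    from collections import Counter

    def frequent_itemsets(transactions, min_support):
        """Return every itemset appearing in at least min_support transactions.
        Level-wise search: candidates of size k+1 are built from frequent
        itemsets of size k, then counted against the data."""
        transactions = [frozenset(t) for t in transactions]
        current = [frozenset([i]) for i in {i for t in transactions for i in t}]
        frequent, k = {}, 1
        while current:
            counts = Counter()
            for t in transactions:
                for c in current:
                    if c <= t:
                        counts[c] += 1
            level = {c: n for c, n in counts.items() if n >= min_support}
            frequent.update(level)
            current = list({a | b for a, b in combinations(level, 2) if len(a | b) == k + 1})
            k += 1
        return frequent

    baskets = [{"beer", "chips"}, {"beer", "chips", "salsa"},
               {"chips", "salsa"}, {"beer", "diapers"}]
    print(frequent_itemsets(baskets, min_support=2))
    # {frozenset({'beer'}): 3, frozenset({'chips'}): 3, frozenset({'salsa'}): 2,
    #  frozenset({'beer', 'chips'}): 2, frozenset({'chips', 'salsa'}): 2}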

I'd love to use an algorithm like this to extract pairs and triples of keywords from documents in a clickstream... basically looking for recurring patterns in browsing/hunting behavior in document repositories.

I would say the biggest issue with using either of the two algorithms above in a 'production' system is that they are not incremental. Ideally one could process a batch of transactions one record at a time and maintain a rolling frequent-itemset collection. Itemsets would move in and out of the collection as they gained or lost 'support', and the cost of adding another transaction would be minimal... as in not having to rescan all previous transactions.

My initial idea for how to do this would be to devise an incremental approximation layer on top of any association miner. At the end of a full run of your miner, you would keep the final itemsets AND the itemsets that just missed the support threshold. The incremental algorithm would then process new transactions up to some tolerance, say log(n) of the original transaction set size, and look to promote the 'just missed' itemsets if support arrives. Some attempt could also be made to remove itemsets if the additional transactions sank their support below the cut-line. After more than log(n) new transactions arrive, you reprocess the entire set, or trim off the first log(n) of the old transactions and include the new ones.
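
A rough sketch of that bookkeeping, sitting on top of whatever batch miner you use (entirely hypothetical, not a tested algorithm):

    import math

    class RollingItemsets:
        """Approximate incremental wrapper around a batch frequent-itemset miner.

        After a full batch run, keep both the frequent itemsets and the near
        misses, then only update counts for those as new transactions trickle in.
        Once roughly log(n) new transactions accumulate, signal that it is time
        for a full re-run.
        """

        def __init__(self, frequent, near_misses, min_support, n_transactions):
            self.counts = dict(frequent)      # itemset -> support count (made the cut)
            self.counts.update(near_misses)   # itemsets just below the cut-line
            self.min_support = min_support
            self.budget = max(1, int(math.log(max(2, n_transactions))))
            self.seen_since_rebuild = 0

        def add(self, transaction):
            t = frozenset(transaction)
            for itemset in self.counts:
                if itemset <= t:
                    self.counts[itemset] += 1
            self.seen_since_rebuild += 1
            return self.seen_since_rebuild >= self.budget  # True => do a full re-run

        def frequent(self):
            return {s: c for s, c in self.counts.items() if c >= self.min_support}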

There are likely some speedups to be had in subsequent reprocessings. If the previous full run found that a collection of itemsets made the cut, you could prune those itemsets out of the old transactions... restarting the algorithm with the previous run's itemsets in the "found pool".

Of course with an algorithm like FP-Tree you must save the tree for the next run, and devise a tree rebalancing algorithm to make it incremental (relative frequencies of items change with new information). It gets messy quick.

Monday, March 24, 2008

Road Coloring Problem solved

I'm not sure why this hit my buttons, as it's not my area of expertise, yet I found it very interesting. Avraham Trahtman, a 63-year-old Russian mathematician who used to work as a plumber and maintenance man in Israel, has solved (paper with solution) an interesting problem in graph theory: the Road Coloring Problem. Unlike other famous problems with few applications, this one appears to have some (network routing), and he has published an updated paper with a sub-quadratic algorithm.
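
To give a flavor of what the problem asks (my own toy illustration, nothing to do with Trahtman's actual construction): a coloring turns the graph into a machine where each color is an instruction, and the coloring 'works' (is synchronizing) if some sequence of instructions drives every starting vertex to the same vertex. The standard pair-merging test checks whether a given coloring does that:

    from itertools import combinations
    from collections import deque

    def is_synchronizing(delta):
        """delta[vertex][color] = next vertex. The coloring is synchronizing iff
        every pair of vertices can be driven to a single vertex by some sequence
        of colors (a classic pairwise characterization from automata theory)."""
        vertices = list(delta)
        colors = list(next(iter(delta.values())))
        for p, q in combinations(vertices, 2):
            seen, queue, merged = {(p, q)}, deque([(p, q)]), False
            while queue and not merged:
                a, b = queue.popleft()
                for c in colors:
                    na, nb = delta[a][c], delta[b][c]
                    if na == nb:
                        merged = True
                        break
                    pair = (min(na, nb), max(na, nb))
                    if pair not in seen:
                        seen.add(pair)
                        queue.append(pair)
            if not merged:
                return False
        return True

    # Three vertices, out-degree two; 'r' and 'b' label each vertex's outgoing edges.
    coloring = {0: {'r': 1, 'b': 0}, 1: {'r': 2, 'b': 0}, 2: {'r': 0, 'b': 0}}
    print(is_synchronizing(coloring))  # True: the single instruction 'b' sends every vertex to 0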

I think it has received news coverage more as a result of his age and story than the importance of the problem.