Sunday, December 28, 2008

Statutory Inventions and the Public Domain

This is interesting: USPTO Statutory Invention Registration

Basically it's a way of publishing an invention to the public via the USPTO. Rarely used for obvious reasons.

Question #1: Why don't open source people apply for these USPTO invention registrations? It seems to be the patent law equivalent of a BSD/MIT license.. i.e. "use/extend it for any purpose but it's still my work and you can't claim it as your own work."

The more interesting tidbit there, while obvious, is a potential source of great technical material. Since 1999, patent applications are published 18 months after the filing date. Once a published application is abandoned, its contents become public domain.

Question #2: Are rejected patents whose appeals have run out then public domain? I can't seem to find a clear answer.

How many rejected/abandoned software patent applications from Microsoft, Oracle, IBM, etc. are out there that contain very valuable algorithms and techniques that are now public domain? Yes, this is a bit like looking for gold in the trash can...

Notes:

The European Patent Office by treaty publishes many USPTO patent apps.. and honestly has a better interface for getting the status of your patent than I have yet found at the USPTO.

The Patent Reform Act of 2005 (Republican sponsored) was an attempt to close the publication 'loophole'. The Patent Reform Act of 2007 (Democrat sponsored) keeps the current publication system in place. Neither is law (yet).

Saturday, December 27, 2008

New Year's Resolutions - Blog more

New Year's resolution - blog more. I've been very busy doing cool stuff at Others Online and in the process developed some tunnel vision to stay on task.

Here's to hoping that more blogging will cause me to see things in a different light more easily as well as get me more in 'writing mode' to finish the PhD before summer 2009.

Twittering has replaced blogging as my outlet for the second half of the year.. yet the 140 char format isn't much good for a personal musing and research blog.

Marshall Kirkpatrick and Data Mining

[somehow this got lost in drafts in my blog - back in August]

Marshall Kirkpatrick's RWW post Four Ad-Free Ways that Mined Data Can Make Money is interesting.

In approx 2002 I wrote version 1.0 of the RightNow Tech sentiment analysis software to analyze the positive versus negative overall tone of incoming support requests/emails in the CRM system. I called it 'Emotix', but the Marketing people renamed it SmartSense. Later Steve Durbin and I bolted on a POS tagger to get a bit more accuracy, since language forms like 'I am not very happy' and 'I am very angry' require the modifiers to be taken into account.

Basically Emotix was tasked to attach a numerical positive/negative emotional score to each incoming request.. such that the queue of requests could be ordered to service angry customers first. We weren't interested in extreme accuracy.. just a decent ordering that was fairly predictive.
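
For the curious, here's a minimal sketch of the general idea in Python: a word polarity dictionary plus crude modifier handling, used to sort the queue. The word scores and modifier rules below are made-up illustrations, not the actual Emotix/SmartSense dictionary or logic.

```python
# Minimal sketch of a dictionary-based sentiment scorer in the spirit of
# Emotix/SmartSense. Word scores and modifier handling are illustrative only.
POLARITY = {"happy": 2.0, "great": 2.5, "angry": -3.0, "terrible": -3.5,
            "broken": -2.0, "refund": -1.0}
MODIFIERS = {"very": 1.5, "extremely": 2.0, "not": -1.0, "never": -1.0}

def score_message(text):
    """Return a rough positive/negative score for one incoming request."""
    score, modifier = 0.0, 1.0
    for w in text.lower().split():
        if w in MODIFIERS:
            modifier *= MODIFIERS[w]   # 'not very happy' flips/scales the next hit
        elif w in POLARITY:
            score += modifier * POLARITY[w]
            modifier = 1.0             # reset after applying to a sentiment word
    return score

# Order the support queue so the angriest customers are serviced first.
queue = ["I am very angry , this is terrible",
         "I am not very happy with the broken widget",
         "Great product , thanks!"]
for msg in sorted(queue, key=score_message):
    print(round(score_message(msg), 2), msg)
```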

There were two interesting stories to version 1.0. The first concerned the negative/neutral/positive word dictionary. Basically my office mate and I sat down and compiled a list of every positive and negative word we could find and put them on a wide numerical scale. When it was time for swear words, we shut the door and howled in laughter as we threw mock insults around the room. The Wicked Words book was an invaluable source of inspiration.

When it came time to litmus test our word ratings we put co-workers in front of a terminal that would display random words from the list and ask them to agree or disagree with the rating. Needless to say we had to forewarn everyone that it WOULD be offensive and that this did not constitute any form of harassment. Watching the process was excruciating and hilarious.

The second humorous story concerned testing and training the system on real support messages. My favorite data set was from a well known customer that made specialty ice cream. Their customers tended to begin each service contact with a large block of text extolling the virtues of the company and its ice cream.. with the negative comments on their experience with the ice cream last... usually written in apologetic terms. Obviously the creamery wanted to get the customers with real negative experiences to the top of the queue.. ice cream is all about the eating experience. But how do you filter out and bias for the overall 'fan mail' tone of 90% of the requests? Fun stuff to work on. Some details here, others here.

Recently Steve extended it to work with both the RNT Voice product and the marketing-automation product as well. The big lesson here as an engineer is that good enough can be just fine and often it will be used in unexpected ways down the line. The largely un-refactored code is still running and processing billions of textual contacts every year. Ok this is exaggerated a bit.. but it hasn't really been rewritten, just optimized frequently.

Monday, September 08, 2008

A story about Don Haskins and my father

This is not about AI, but a personal story.

Don Haskins died yesterday at 78 years old. AP Obituary

I grew up for the first 14 years in El Paso, Texas before moving back to the family home in Montana. Don Haskins and UTEP basketball were a big deal. Coach Haskins is of course the guy who coached UTEP (then Texas Western College) to the 1966 NCAA championship with five black players... which was wonderfully rendered into the movie Glory Road. It's hard to overstate what an effect he had on El Paso and UTEP.

I've always been more of a fan of coaches than players, and Haskins was at the top of my biased list of basketball idols. Years ago my father and I finally met him at my grandmother's 70th birthday party. [Photo of me bringing in flowers he brought to the party] I'd read every book, and article about him but managed to mostly stutter upon meeting him. My dad gave me some crap about that.

When Glory Road opened on Friday the 13th 2006, the family went to see it. Great flick even with the inaccuracies. After the movie Dad asked me how I liked it, I said "It was great". He then said "A movie about your hero Don Haskins", I said "yep.. but oh Dad you are my hero". We both snorted and chuckled as grown men do about emotions. A few minutes later we parted ways with 'I love you' and hugs.

Dad died the next day and those were the last words we exchanged. I didn't watch Glory Road again till a couple months ago.. too much emotion. When I heard the news that Haskins died, time stopped a bit as I thought about him, my childhood idols and my father.

The stories about this man that circulated in El Paso and in the family were legendary. Here's an example, or two.

RIP Coach Haskins (and I still miss ya Dad!)

Friday, August 29, 2008

Open Source Search Engine Rodeo: Solr v. Sphinx v. MySQL-FT

Last summer Anthony Arnone and I did a study on the performance of three open source search engines.
We chose these three as two of them have close ties to MySQL and the other is a well used and performant offering from Apache. There are many that we skipped.

Here's the Report in PDF.

Solr was the clear winner. Sphinx was a close second, with blindingly fast indexing times.

At this point the report's results are somewhat dated as both Sphinx and Solr are readying new releases. So your mileage may vary, and I'm sure Peter Zaitsev and the Sphinx team could show us how to improve the performance of their engine.

Updates: The Sphinx team contacted me and suggested some ways to improve Sphinx performance. New results will be published some time soon. They will likely also publish a test using Wikipedia as the document repository.

More Updates: I have started a new Solr project and may test Sphinx again.

Tuesday, August 12, 2008

Great comment on MapReduce

Spot on comment from on the Database People Hating on MapReduce blog post.
I think this document is comparing things that are not comparable. They are talking about MapReduce as if it were a distributed database. But that's completely wrong. Hadoop is a distributed computing platform, not a distributed database prepared for OLAP.
MapReduce is a re-implementation of LISP's map and reduce in a parallel setting. Now the function/task that you give to Map is where the rubber meets the road of reading data from some data store.
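
To make that lineage concrete, here is a toy single-process word count written as explicit map and reduce steps in Python. The example data is made up; the point is that Hadoop's contribution is distributing the map calls and the grouping/reduce over many machines, not the idiom itself.

```python
# Toy, single-process word count expressed as map + reduce, to make the
# lineage from Lisp's map/reduce explicit.
from functools import reduce
from collections import defaultdict

docs = ["the cat sat", "the cat ran", "a dog sat"]

# Map: each document emits (word, 1) pairs.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle/group: collect values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: fold the counts for each key.
counts = {word: reduce(lambda a, b: a + b, vals) for word, vals in grouped.items()}
print(counts)   # {'the': 2, 'cat': 2, 'sat': 2, 'ran': 1, 'a': 1, 'dog': 1}
```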

MapReduce versus RDBMS - Round 2

Round 2 (for me anyway - this discussion has been raging for a while). Nice read on how Rackspace Now Uses MapReduce and Hadoop. They started with shell scripts, evolved to remote RPCs of shell scripts, moved to MySQL, iterated on MySQL and then jumped to a heterogeneous Hadoop + Solr + HDFS setup. Terabytes of data.

The MySQL evolution was interesting as I'm going through a similar process of attempting/planning to continually refine MySQL performance. We need UPDATE statements in a big way, so it's a bit different than appending to log structures.

I've been playing with daily summarizing and distributed ETL with MySQL. Basically, with creative use of Views and the Federated Engine one can do a scheduled daily map and reduce. I hold no hope that this is a solution for ad-hoc queries, it's not at all that flexible. Wiki page describing this system coming soon.

Still trying to find a solution other than a 'union view' across the federated tables from the n data partition servers as the Map. The Reduce will be a set of stored procedures against the union-view. Perhaps this post and hacked code holds promise for gluing Hadoop to JDBC/MySQL storage engines. This would better enable ad-hoc queries.
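
As a rough illustration of the pattern (not the Federated-Engine implementation itself), here is a client-side Python sketch where the "Map" is one aggregation query pushed to each partition server and the "Reduce" merges the partial results. The hostnames, table, and columns are hypothetical.

```python
# Client-side sketch of the daily map/reduce over partitioned MySQL servers.
# The hosts, credentials and the daily_events table are assumptions, not a
# description of the actual system.
import mysql.connector

PARTITIONS = ["db-part1.example.com", "db-part2.example.com"]  # hypothetical hosts

MAP_SQL = """
    SELECT keyword, COUNT(*) AS hits, SUM(clicks) AS clicks
    FROM daily_events
    WHERE event_date = CURDATE() - INTERVAL 1 DAY
    GROUP BY keyword
"""

def run_map(host):
    # "Map": run the same aggregation on one partition server.
    conn = mysql.connector.connect(host=host, user="etl",
                                   password="etl_password", database="analytics")
    cur = conn.cursor()
    cur.execute(MAP_SQL)
    rows = cur.fetchall()
    conn.close()
    return rows

def reduce_partials(partials):
    # "Reduce": merge the per-partition partial aggregates.
    totals = {}
    for rows in partials:
        for keyword, hits, clicks in rows:
            h, c = totals.get(keyword, (0, 0))
            totals[keyword] = (h + hits, c + clicks)
    return totals

daily_summary = reduce_partials(run_map(host) for host in PARTITIONS)
```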

I also wonder if MySQL Proxy is useful here.. looking into it.. but at first glance it doesn't naturally fit the distributed Map operation.

Open question: If one could publish MySQL stores to a column oriented DB like MonetDB or LucidDB and then do Hadoop map-reduce operations over that, would I have what I want for ad-hoc queries?

Monday, August 11, 2008

MapReduce versus RDBMS

I managed to stumble upon an interesting article while looking for a MySQL multi-database federated query tool.

David DeWitt and Michael Stonebraker write: MapReduce: A major step backwards.

They rightly point out that MapReduce is a 25 year old idea. Lisp has had this functionality for decades.. and it's actually at least 30 years old. Griss & Kessler 1978 is apparently the earliest description of a parallel Reduce function. That said, it's only in the last 10 years that an idea this great could have been implemented widely with the advent of cheap machines.

Their second point is that MapReduce is a poor implementation as it doesn't support or utilize indexes.
One could argue that value of MapReduce is automatically providing parallel execution on a grid of computers. This feature was explored by the DBMS research community in the 1980s, and multiple prototypes were built including Gamma [2,3], Bubba [4], and Grace [5]. Commercialization of these ideas occurred in the late 1980s with systems such as Teradata.

In summary to this first point, there have been high-performance, commercial, grid-oriented SQL engines (with schemas and indexing) for the past 20 years. MapReduce does not fare well when compared with such systems.
Great point and point taken. However, where are the open source implementations of the things you mention? This is a bit of the 'if a tree falls in the woods and no one is there to hear it' problem. A major reason MapReduce has seen uptake (other than being a child of Google) is that an example implementation is available for the Horde to steal, copy, improve & translate.

The modern user generated content web is mostly built on Open Source these days, so the fact that I can get the above technology in commercial databases is a non-starter.

I'm a SQL junkie and am searching in vain (it seems so far) for a decent extension to MySQL that does cross-database query and reduction of tables I know to be neatly partitioned. No luck so far. Starting to look into other SQL engines as there may be an ODBC wrapper for the federation layer. It's got to be mostly functional and EASY to adopt.. or you'll continue to have people spouting the MapReduce dogma.

Post scripts:
  • Nice summary of Dr. Stonebraker's accomplishments here
  • Funny link to comp.lang.lisp with some newbies asking whether Lisp has MapReduce.

Monday, August 04, 2008

Improving Software Release Management

I just found this in my inbox, 7 Ways to Improve Your Software Release Management. It's an excellent overview of 'doing things differently'. Pretty similar to the change that I experienced at the last job... and something that Mike Dierken at the current job just seems to know instinctively.

I need to find some stuff on the best ways to do personal lightweight processes. I'd like to be more efficient in producing software... especially software that is based upon speculative ideas. So much of the time data mining and machine learning coding is subject to the vagaries of the data set you are working against and it's difficult to know ahead of time if a given algorithm will work well.. how much data cleaning needs to be done... etc.

How can you adapt lightweight processes and the things mentioned above for producing software that is not so cut and dried in what needs to be done? In those situations I tend to ping-pong between little process (seat of the pants coding) and too much (excessive research and design before coding start).

Post script: Had to add this:

Five Things Linus Torvalds Has Learned About Managing Software Projects

Tuesday, July 29, 2008

How not to launch software

From cnet:

Cuil shows us how not to launch a search engine

Google challenger Cuil launched last night in a blaze of glory. And it went down in a ball of flames. Immediately after launch, the criticism started to pile on: results were incomplete, weird, and missing.
The various articles on Cuil's failure revealed much about their architecture. Apparently they are categorizing a user query into a topic and shipping that out to topical servers. While this sort of 'topical partitioning' is interesting, it has zilch to do with relevance ranking... and it suffers from a failure-point issue: if a topic server goes down then queries against that topic get junk results or no results at all.

Questions and points of discussion:
  • Is it really true that a data schema partition results in a better engine than Google? No, a better engine is made by better relevance. Perhaps the partitioning is what the PR/Marketing people focused on rather than relevance.
  • How do you simulate post-launch load when you have no idea how widely the free press will be distributed?
  • Free post-launch press is invaluable to your business.. squandering it might be a deathblow.
  • Why not launch more quietly in early adopter tech-press and then go try and get mainstream press when you have proven the system?
  • Trading on your status as ex-Googlers (and not early ones at that) seems VERY dubious. Stand on your own feet rather than someone else's.
  • The absolute hottest area of information retrieval research right now is using user click-streams to improve the relevance live and on-line (learning to rank), as well as personalize results. These are differentiating features (if they result in improved relevance).
  • Cuil keeps ZERO user history and assigns no session/user-ids. This will make it very difficult to follow this trend.. unless they are using someone else's cookies to do the identification via an analytics partner (no evidence of this).
  • The other hot area of IR research is using semantic analysis and NLP to break away from simple keyword based inverted indices. Hakia still seems to be doing it better than Cuil... or at least will appear to as long as the topical partitions keep crashing under the load.
  • Risk analysis is a fantastic tool in organizing and prioritizing your work on new products.. it seems they missed that part before deciding to launch.
I feel bad for these guys, Anna Patterson et al. have done some great work in the past and I just hate to see good people stumble like this.

I still think they are wrong to go out as a consumer engine... Enterprise is a better play.. however if their leading market differentiator is a topical partitioning of back end servers.. then they probably aren't even considering it, as individual Enterprise customers may not be big enough to need hundreds of servers to distribute the index like that.

Hindsight is always 20/20, and I hope to hell I am not standing there red-faced as software I helped create fails upon a high volume launch.

Sunday, July 27, 2008

Dubious results from Cuil .. and the Majors

I searched for 'evolution recombination mutation' on cuil.com

The first 2 times, I got a no results page. After a few variations and an hour or so, I tried again and got a nice result set. One really jumped out at me: 'What's Driving Evolution; Mutations or Genetic Recombination'. The problem is that it links to an intelligent design group disputing Evolution in general.

http://nwcreation.net/geneticrecombination.html

After trying the same search on Google, Yahoo, Live.com, Ask and Hakia.. that same page shows up in the top ten.

Really? This is authoritative and the best that modern world-class search can do? I was hoping for some kind of page summarizing epic battles between Ernst Mayr and Motoo Kimura.

While this is likely caused by both keyword matches via good SEO and the fact that this page is probably highly linked to... this is a semantic failure! Imagine if a query about the Holocaust had anti-Holocaust propaganda links appearing above genuine factual information!

At least Hakia and Yahoo put up a page refuting the author of the above link in the top 10 results. Google and the rest fail to do so.

Admittedly, not everyone believes in evolution.. and that is fine with me.. but I'm not sure that fact refutes the semantic/authoritative failure of the engines.

Anna Patterson's new company - cuil.com

Just spotted a NYT article on a new search engine called cuil from Anna Patterson and Tom Costello. A while back I read Dr. Patterson's article "Why Writing a Search Engine is Hard". Nice quick read on the issues. I also stumbled upon several patents she wrote.

So the recent challengers of note are Ask.com, Powerset, Hakia, Wikia, Mahalo and now Cuil. I'm rooting for the algorithmic ones, not really sure how Wikia and Mahalo can scale to be non-niche engines.

While I hope Cuil is successful (I like to see Academics go out and build companies), I'm not sure that it's possible to beat Google at this point. It seems far more likely that Microsoft will just try and swallow Hakia and Cuil and attempt to brew something out of the parts.

I recommend reading Danny Sullivan's post on Cuil, he hit most of the obvious points.

My thoughts:

I still think building intelligence on top of an existing index/engine is the way to go and I'm not sure that bragging about your index size or your back end architecture is going to get you any meaningful marketshare.

Also, Enterprise Search is still far behind in NLP technology vis-a-vis the big 4 and the NLP startups. It's still 1999 there; enterprises are just now figuring out how to expose their vast document sets to an internal crawler and provide a UI that is not overly simplistic for savvy users. They've tried the classic approaches of commodity engines and found them wanting. Link analysis doesn't help either as most of these documents are not web documents with links. This just cries out for an approach like Vivisimo's clustering + Hakia's semantics + Delicious' user driven tagging.

Corporate searchers need a good advanced interface and results that can be grouped by things like time and originating department.... but not be forced to drown in overly similar hits. That is a market worth getting into with a $33M VC investment. The landscape is littered with vendors that over-promised during the sales cycle and Enterprise customers will switch products if it solves the problem better.

Look at the $500M acquisition of Verity by Autonomy in 2005. That's 5X more than Powerset (in 2005 dollars) and they actually had loads of paying customers.

I worry that going directly at Google with a consumer search engine is just so much tilting at windmills. Sometimes just selling a product to people willing to pay for it is easier.

Others Online news and welcome Rance and Vik!

I've been busy in the last few weeks getting new products launched at Others Online. We've deployed a set of new 'Audience Affinity Analytics'. By adding a simple Javascript tag to your webpages, our software delivers free audience summary reports that detail at a keyword/phrase level what people are paying attention to on your site! Short video here.

We've also hired Rance Harmon (MS student at Montana State U) and Vik Jakkula (MS student at Washington State U) as interns for the summer. Rance will be working on general web and Java coding as well as testing frameworks. Vik will be working on some new data mining algorithms. Welcome guys!

Thursday, May 15, 2008

Looking for an intern

I'm looking for resumes for a possible internship this summer with a startup company I work for - Others Online

Min 20 hours per week, possibly 40 hours.

Job Description:
  • Assist with Java servlet, SQL and Ajax UI coding. Data mining and search engine work possible if background allows.
  • Potentially assisting with highly scalable system work using partitioned architecture migrating to Amazon EC2 & S3 system.

Required Skills (4+ out of 6)
  • Java/JSP proficiency
  • SQL knowledge (MySQL)
  • Ajax experience (YUI, Google Web Toolkit, or from scratch javascript)
  • Some Sys Admin skill in Linux
  • Perl, Ruby or Python scripting
  • Familiarity with VMWare, Tomcat/Resin & Apache

Preferred Skills
  • Data Mining, NLP or Search Engine experience/coursework
  • WordNet or other taxonomy experience
    • Text processing, clustering and classification
    • On-line model building
    • Ant System methods
  • Partitioned high-availability system knowledge
    • Design fundamentals of this type of architecture
    • Amazon EC2/S3 experience
  • Firefox browser extensions
If you've taken a data mining or AI course, there is a very real possibility of doing some interesting work this summer with a conference research paper a likely outcome.

If you want to assist with the high-availability partitioned architecture implementation and migration, you'll be learning cutting edge design patterns that I have yet to see in a textbook. The other software engineer (Mike Dierken) worked at Amazon for a few years implementing very scalable systems.

Splitting time between the above two work areas is possible and in fact likely. Either way you'll come out with some real world skills desirable in industry.

We're a startup trying to explode on the scene this summer and the outcome post-summer is unknown. It could turn into a full-time salary job or continued part-time work in the fall.

Thursday, May 08, 2008

A Subtle Art

I loved this quote from a NYT Blog post:

"By attracting a commanding share of the search advertising activity, Google also has the best data with which to create equations that maximize the money it makes from each search. It turns out that picking which ad to display when is a subtle art that can have a great effect."

I've been working on effectively harvesting and choosing keywords to best power ad networks like Google/Yahoo/MSN. I believe it's much harder than simply producing good search results. Search results are targeted at people explicitly looking for information, so in that sense it's like an auto-generated yellowpages entry, people want to click on something. The adverts on these pages benefit from the explicit nature of the search.

How do you do the same for display advertising on the rest of the non-SERP pages of the web? You get one shot at providing value, a bit like trying to hit the green from the rough and through a stand of trees. A subtle art indeed, and great fun to work on.

Thursday, April 24, 2008

AI Luminaries and their claims

This is a bit of a rant. In our weekly AI colloquium we have covered two high profile AI people, Dr. Robert Hecht-Nielsen and his Confabulation theory and Jeff Hawkins and his book On Intelligence plus his ideas around Hierarchical Temporal Memory.

Neither of the two above people are terribly famous in AI academic circles per se, but they do get some press. They are both founders of very successful businesses in HNC (now FairIsaac) and Palm/PalmPilot. Particularly in the case of HNC's use of AI for fraud detection in financial transactions, I am very impressed. Yet this is where it gets ugly.

If you read Hawkins's book or listen to a YouTube lecture of Dr. Hecht-Nielsen you would think these guys have invented the next great AI of the 'I have modeled the brain' variety. They both demonstrate very interesting neural network inspired architectures and basic computation. Both are complete with deep layering, feedback loops and other structures that NN people have known will work for years.

Yet both of them will just repeat similar claims that this is how the brain actually works and with this architecture real cognition is possible, even potentially trivial? Hogwash. Batshit Insanity.

These aren't my words, they were used to describe Stephen Wolfram's work on Cellular Automata and his claims that his flavor of CAs can build anything and describe all physics. He makes lots of strong claims about a theory he spent decades toiling on nearly alone.

Would Peter Norvig ever make these types of claims? I tend to think no way.

First, how can you make the claim that this architecture is really how the brain works and as such will lead to cognition or reasoning? To me that is what they are.. architectures of computation.

Where are the programs? Where is the embedded/encoded/learned method that can actually reason in some logical or nearly logical fashion? Ie chaining facts together to create derived knowledge? Picking apart disparate background statements/data to answer queries for information? How does this architecture answer the question of what are necessary and sufficient conditions to produce computerized general reasoning?

It's a massive leap of faith for me to take for granted that a neural architecture, with all its bells and whistles, will just lead to reasoning. Doesn't it take just one counter-example of something that such an architecture and associated learning methods can not learn correctly to break the bold claims?

To me that is a central lesson of genetic algorithm theory: people for years went around insisting that the GA was a better optimizer (mousetrap) than all that had come before. They invented theories to describe its behavior and made bold claims. Yet Wolpert and Macready came along and showed the No Free Lunch theorems. It basically blew many of the bolder claims of GA superiority to hell. It has now spread into general Machine Learning as well.

I think people, particularly ones in AI with monster egos, need to exercise some humility and not make such strong claims. Didn't that type of behavior contribute to the AI Winter?

Every time I hear or read these types of claims my mind sees an image of Tom Hanks in Cast Away dancing around his fire and saying "Yes! Look what I have created! I have made fire!! I... have made fire!". The horrible irony of the scene is that he remains fundamentally lost, couldn't find his island on a map and is no closer to finding home as a result of making his fire.

Sunday, April 20, 2008

Series of Economist Stories on Digital Nomads

During a flight to Silicon Valley last week for AdTech I read a special report in the Economist on Digital Nomads. The intro article sets the tone and talks about how the mobility of cellular devices is obviously quickly changing the world. I love the description of early digital nomads as more akin to astronauts who must carry everything with them (cables, disks, dongles etc).

My comments:
I do tend to carry a good inch of paper everywhere I go, 90% of which is CS/AI research papers or dissertation notes to work on as I have the time. Opening the notebook is way too much of a hassle as a mobile reading device (on an airplane). The article also states that engineers at Google tend to carry only a smart phone of some kind and no laptop when traveling. Is this true? I find this a bit hard to believe unless they are managers and not active coders.. I can't imagine writing code on a BlackBerry or, worse, an iPhone.
I'll admit I have no smart phone and do carry a laptop everywhere. I'd like it to be smaller, but not at the cost of horsepower or decent display size. I tried using a Microsoft powered Samsung smart phone and hated it, what it gave me in increased cool functions I sacrificed in phone function.

The next article in the series discusses the new found benefits and costs of our ability to work everywhere and anywhere.

My comments:
My laptop is definitely a desktop replacement for me, I want mobility.. but I am old fashioned enough to want a regular desk to work at.. even if that is my current count of three different ones at different times of the week. I have tried the coffee shop thing.. it works to a point but I find myself only being efficient for the first 2-3 hours then it goes to hell as I start hearing and being distracted by the conversations around me. I suppose it would be fine if I needed to do mostly emailing and short attention span coding. I also think that most business conversations are sensitive enough that talking in a public space is not acceptable in my mind... nothing to hide, yet why broadcast every mundane and not so mundane detail to the people around you?
The parts about culture clashes from old cubical work and new mobile work are spot on. It points to the need to trust every employee and set a tone that what matters is production and not the act of working.

The third article in the series explores the need for new types of spaces and architecture for this new way of working.

My comments:
I loved this one, even if it assumes that I 100% embrace the nomad ethos. Perhaps if I had such a space to work, and people to work with in that space I'd abandon some of my older ways. The closest thing I can imagine these 'third places' being is a combination of a student-union building and a college library. Very non uniform places with nooks and crannies for every type of 'work'. The Bozeman paper just had an article on a new business for 'on demand offices' and I saw two others described in a Seattle tech magazine at the airport. Winning idea.. 50% of the reason I go to campus two days a week is for socialization.

Great quote:
These places are "physically inhabited but psychologically evacuated" ... leaving people feeling "more isolated than if the cafe were merely empty".

Great food for thought in the articles. I do notice one thing in my travels that disturbs me: some people's inability to ignore their cell phone or crackberry when they are engaged in a face-to-face conversation or meeting. I was very appreciative of one executive's recent demonstration of 100% ignoring his device when it rang or vibrated.. he didn't even flinch. This was the exception to the rule over the past week.

I need to let this brew more.. at some point I am sure it will spur some good ideas. Perhaps there is an algorithm or platform waiting to be discovered that will spur us to look up from our devices and engage each other again. I suspect a big part of our addiction to them is that they are a much higher bandwidth information pipe than simple conversations are.

Friday, April 18, 2008

Using human relevance judgements in search and advertising

This is old news on a couple of dimensions. Read Write Web had a post on how Google uses human relevance studies to help judge/QA their search results. This resulted from an interview that Peter Norvig gave to MIT Technology Review and caused some commenting in the blogosphere (New York Times Tech Blog, Google Blogoscoped). Old news on old news.

We now know that both Yahoo and Microsoft are using (to some degree) human studies to evaluate computational advertising algorithms (see this and this). The correlation between what an algorithm predicts is relevant to a context and what humans judge to be relevant is the performance metric of your algorithm.
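
A toy sketch of that evaluation loop, with made-up numbers: score the same ad-to-context pairs with the algorithm and with a human panel, then measure how well the two track each other.

```python
# The metric is simply how well the algorithm's relevance scores track the
# human panel's judgments. Scores and ratings below are invented.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

algo_scores   = [0.91, 0.40, 0.75, 0.10]   # algorithm's ad-to-context relevance
human_ratings = [4.5,  2.0,  4.0,  1.5]    # mean panel rating, 1-5 scale
print(pearson(algo_scores, human_ratings)) # higher = algorithm agrees with humans
```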

Question: When will TREC have a computational advertising contest?

Thursday, April 17, 2008

Text Summarization and Advertising

Recently read another CompAdvert paper (CIKM'07) from the Yahoo group, Just-in-Time Contextual Advertising. They describe a system where the current page is scraped on-line via a javascript tag, summarized and then that summary is passed to servers to match with Ad listings. Interesting points are:
  • 5% of page text carefully chosen can yield 97+% of full-text advert matching relevance
  • Best parts of the document are URL, referring URL, title, Meta and headings.
  • Adding in a classification against a topical taxonomy adds to the accuracy of the ad matches.
  • They judged the ad matching relevance against human judgments of ad to page relevance.
I found these papers within the last few months as OthersOnline focused on behavioral based advertising. In many ways their findings are interesting, affirming and unsurprising. Interesting in that they are pushing the state of the art in advert matching, and affirming in that we @ OO are on the right track. Unsurprising in that using the document fields above is the classic approach to indexing webpages and documents.
Of course internet search engines used this for years (it defines the SEO industry's eco-system), and the old/retired open source engine HtDig has had special treatment of those fields since the late 90s. The difference now is the direction, the documents are the "query" and the hits are the ads. Best part about the method is that it's cheap... javascript + the browser becomes your distributed spider and summarizer of the web.
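
Here's a rough sketch of what that kind of page summarizer might look like: keep only the high-signal fields (URL tokens, title, meta keywords/description, headings) and turn them into a weighted term "query" for the ad server. The field weights are assumptions of mine, not numbers from the paper.

```python
# Crude sketch of the "5% summary": only URL, title, meta and headings are kept
# and weighted into a term vector that becomes the ad-matching query.
from html.parser import HTMLParser
import re

FIELD_WEIGHTS = {"url": 1.0, "title": 3.0, "headings": 2.0, "meta": 2.0}  # assumed

class PageSummary(HTMLParser):
    """Collects only the high-signal fields: title, h1-h3, meta keywords/description."""
    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "headings": [], "meta": []}
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._current = "title"
        elif tag in ("h1", "h2", "h3"):
            self._current = "headings"
        elif tag == "meta" and attrs.get("name") in ("keywords", "description"):
            self.fields["meta"].append(attrs.get("content") or "")

    def handle_endtag(self, tag):
        self._current = None

    def handle_data(self, data):
        if self._current:
            self.fields[self._current].append(data)

def ad_query(url, html):
    """Build a weighted term vector from the page summary to send as the ad 'query'."""
    parser = PageSummary()
    parser.feed(html)
    sources = dict(parser.fields, url=[re.sub(r"[/\-_.?=&]", " ", url)])
    terms = {}
    for field, chunks in sources.items():
        for word in " ".join(chunks).lower().split():
            terms[word] = terms.get(word, 0.0) + FIELD_WEIGHTS.get(field, 1.0)
    return terms

page = ("<html><head><title>Dallas Cowboys Tickets</title>"
        "<meta name='keywords' content='cowboys tickets'></head>"
        "<body><h1>Game Schedule</h1><p>lots of body text ignored here</p></body></html>")
print(ad_query("http://example.com/cowboys-tickets", page))
```
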
I do love finding these papers.. we just don't have the time or resources to run a study like this and confirm the approach with a paid human factors study. We just go forward on educated gut feel day to day, and the human measure is whether we get clicks on the ads.
This approach is similar to the one we outlined and implemented before finding this paper. The difference is what we do with the resulting "query", using the signal to learn a predictive interest model of users.
Still no mention of any relative treatment of words within the same field... one would assume this would move the needle on relevance as well.
I still believe that this type of summarization approach can be used to make an implicit page tagger and social recommender like del.icio.us ... if you can filter the summary based upon some knowledge of the users real (as opposed to statistical) interests. Key route to auto-personalization of the web.

Friday, April 04, 2008

Itemset mining in data streams

We covered a nice paper by de Graaf et al of Leiden University in the AI colloquium. Clustering Co-occurrences of Maximal Frequent Patterns in Streams. It deals with the problem of finding frequent itemsets in a datastream. Ideally you want an incremental algorithm for this with an 'updatable model' rather than being forced to reprocess the entire data/transaction sequence when adding new data. The paper's approach has the extra benefit that the itemsets are clustered by similarity as well. I really enjoy using and learning about algorithms that have nice side-effects.

A rough overview of the algorithm is that it performs three basic operations on incoming data. First it builds a support model of patterns encountered. It does this with a reinforcement and decay technique, reinforcing support for patterns encountered in the current transaction and decaying those that aren't. Second it maintains a geometric organization of itemsets according to a distance metric in a boxed 2-D area. As new data is processed, itemsets' coordinates in the (x,y) box shift around according to their similarity with other patterns. Third it uses a pattern merging/splitting mechanism to derive new patterns for the model to track. New patterns get a random (x,y) position.
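
A stripped-down sketch of the first two operations (reinforce/decay the support model, then nudge co-occurring itemsets closer together in the 2-D box); the constants are made up and the merging/splitting step is omitted.

```python
# Sketch of the per-transaction update: reinforcement/decay of support plus a
# SOM-like nudge of co-occurring itemsets toward each other. Constants invented.
import random

DECAY, REINFORCE, STEP = 0.95, 1.0, 0.05
support = {}    # frozenset(itemset) -> float support estimate
position = {}   # frozenset(itemset) -> (x, y) in the unit box

def track(itemset):
    s = frozenset(itemset)
    support.setdefault(s, 0.0)
    position.setdefault(s, (random.random(), random.random()))

def process(transaction):
    txn = set(transaction)
    hits = [s for s in support if s <= txn]
    # 1) reinforcement and decay of the support model
    for s in support:
        support[s] *= DECAY
    for s in hits:
        support[s] += REINFORCE
    # 2) move co-occurring itemsets a little closer together in the 2-D box
    for a in hits:
        for b in hits:
            if a != b:
                ax, ay = position[a]; bx, by = position[b]
                position[a] = (ax + STEP * (bx - ax), ay + STEP * (by - ay))

for s in [{"beer"}, {"chips"}, {"beer", "chips"}, {"diapers"}]:
    track(s)
for txn in [{"beer", "chips", "salsa"}, {"beer", "chips"}, {"diapers", "wipes"}]:
    process(txn)
```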

At the termination of processing some amount of data, you are left with a list of itemsets and their support/frequency as well as a nice grouping by similarity.

One advantage of their presentation is that it is stripped of all excess complexity. They duly note that it learns an approximation of what you would get from a full data scan by a traditional itemset miner. Fine with me.. I don't get hung up on exactness and have lots of faith that incremental model building works well in practice.

The minor flaw of the paper is that they fail to point out (or notice??) that what they have built is a hybrid of Swarm Intelligence and a Self-Organizing Map. The Swarm/Ant portion comes from the reinforcement & decay of the support model, and the SOM from the geometric clustering of the itemsets. One could duplicate this algorithm in spirit by implementing Ant System + SOM with the merging/splitting for new pattern production. By 'Ant System' here I refer to the spirit of an ant system where you use pheromone reinforcement and decay of a model; actual ants traversing a path in a graph are not necessary. The cells in the SOM would contain a sparse vector of itemsets and apply the standard rules for updating.

Yet, even as I see the connection, this is a pointy-headed comment. The paper is nice in that the algorithm is presented without flourish in a straightforward way... sometimes using the word 'hybrid' and casting your work that way is just a form of academic buzzword bingo.

I'll definitely look to implement something similar to this ASAP. I may skip the merging/splitting and instead run a standard itemset miner offline over the 'last X transactions' to form an itemset pattern dictionary. Only itemsets in the dictionary will be tracked and clustered with the data stream, and every so often the offline algorithm runs to learn new patterns and support to merge into the dictionary.

Sunday, March 30, 2008

OpenOffice and LaTeX

I am stunned. I am working on the dissertation and need to make a LaTeX monograph out of some publications. The biggest pain is that I had written two of the papers in MS Word, and previous experience cut-and-pasting from Word to a text file was horrible. I also no longer have MS Word installed on my new machine (didn't come pre-installed from Dell), so I have been using OpenOffice. So how do I get from Word to LaTeX when I only have OO?

More as procrastination I did a search and found Writer2LaTeX. Some saint named Henrik Just created this to convert OO docs (either sxw or odf format) to LaTeX. Looks like he has been working on it for a while with a set of faithful followers submitting feature requests. Between OO 2.4 being able to flawlessly open the MS Word docs and save to ODF, plus Writer2LaTeX, I probably saved myself 2 days of irritating work.

It's not perfect as I need to convert some equations images to proper LaTeX, yet the mind-numbing job of cut and paste is rendered unnecessary.

Thanks Henrik and the rest of the OpenOffice team!

Postscript:
While not open source, TeXaide from Design Science is a very nice free LaTeX equation editor. Write your equation, then highlight and copy it... it pastes magically as text markup into your LaTeX code. Now if only I could just drag and drop an image on it and have it convert the equation in the image.

Friday, March 28, 2008

Semantic Features for Contextual Advertising

Andrei Broder's group at Yahoo! Research has a focus on Computational Advertising. At SIGIR 2007 they released a paper on using Semantic Taxonomy to do contextual matching for advertising. This is a similar problem to the previous post about MS Research, deriving lists of keywords from a document to use as queries to an advertising system. Unlike the MS Research paper, Yahoo has built a large taxonomy of "commercial interest queries" with 6000 nodes and approx 100 items attached to each node.

The essential approach is to classify a document into the taxonomy as well as all of the ads, and match ads to documents on the basis of topical distance. The distance score is combined with a more standard IR type approach forming a combined score. The top-k matching ads ordered by lowest distance are the ads displayed on the page.
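
A toy version of that combined ranking in Python, with assumed weights and toy distance/similarity functions; the paper's actual scoring formula is more involved than this.

```python
# Toy combined ranking: topical closeness in the taxonomy mixed with a standard
# IR similarity, then take the top-k ads. Alpha and the toy functions are assumptions.
def taxonomy_distance(page_node, ad_node):
    """Toy distance: path steps remaining after the two taxonomy paths diverge."""
    common = 0
    for a, b in zip(page_node, ad_node):
        if a != b:
            break
        common += 1
    return len(page_node) + len(ad_node) - 2 * common

def ir_similarity(page_terms, ad_terms):
    """Toy cosine similarity over term-count dicts."""
    dot = sum(page_terms.get(t, 0) * w for t, w in ad_terms.items())
    na = sum(v * v for v in page_terms.values()) ** 0.5
    nb = sum(v * v for v in ad_terms.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def rank_ads(page, ads, k=3, alpha=0.7):
    def combined(ad):
        topical = 1.0 / (1.0 + taxonomy_distance(page["node"], ad["node"]))
        return alpha * topical + (1 - alpha) * ir_similarity(page["terms"], ad["terms"])
    return sorted(ads, key=combined, reverse=True)[:k]

page = {"node": ("sports", "football", "nfl"), "terms": {"cowboys": 3, "tickets": 1}}
ads = [
    {"id": "ticket-broker", "node": ("sports", "football"), "terms": {"tickets": 5, "nfl": 2}},
    {"id": "mortgage",      "node": ("finance", "loans"),   "terms": {"refinance": 4}},
]
print([ad["id"] for ad in rank_ads(page, ads, k=1)])   # ['ticket-broker']
```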

The TaxScore() function is fairly interesting; it attempts to generalize the given term within the taxonomy. It seems this type of approach could work well using WordNet's hypernyms in a more regular IR/search setting.
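
For example, a hypernym-based generalization in that spirit might look like the sketch below. This is my guess at the idea, not Yahoo's TaxScore, and it assumes NLTK with the WordNet corpus downloaded.

```python
# Back a query term off to broader WordNet senses so 'terrier' can still match
# content classified under 'dog'. Requires: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def generalizations(term, max_levels=3):
    terms = {term}
    frontier = wn.synsets(term, pos=wn.NOUN)
    for _ in range(max_levels):
        frontier = [h for s in frontier for h in s.hypernyms()]
        terms.update(lemma.name() for s in frontier for lemma in s.lemmas())
    return terms

print(generalizations("terrier"))   # includes 'terrier', 'hunting_dog', 'dog', ...
```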

I have to read it again more carefully to see if I missed it, however I did not see the formulas use any weighting of a keyword's bid value (or advert count). Maybe this was omitted for trade secrecy?? .. it seems obvious that it should be used to some degree to maximize $$ yield or eCPM of clicked ads. The idea is not to let it affect the matching of the ads to keywords, just the final rank order to some degree.

In my own experiments @ OO, using some proxy for bid value seems to increase eCPM. The biggest challenge is getting comprehensive data for your dictionary if you are not Google, Yahoo or MS.

Postscript:
I have it confirmed from two independent sources (current and ex Y!ers) that Yahoo is working on a new Content Match codebase as the old version didn't work. Hard to say what status Broder's above technique is in (production usage or internal testing).. or if it was part of the old system?

Scraping Documents for Advertising Keywords

Lately I've been working on extracting keywords from text that would be associated with good keyword advertising performance. This is fairly related to the 'text summarization' problem, yet that usually works towards a goal of readable summaries of documents. This is a simpler problem as I don't want to build readable summaries.

'Finding Advertising Keywords on Web Pages' from MS Research (Yih, Goodman, and Carvalho) was interesting reading. To boil it down to its essence, the authors used a collection of standard text indexing and NLP techniques and datasets to derive 'features' from the documents, then used a feature-selection method to decide what features were best in deciding good advertising keywords in a document. They judged the algorithms against a human generated set of advertising keywords associated with a group of web pages. Their 'annotators' read the documents then chose prominent words from the document to use as viable keyword advertising inputs.

Note that this is not an attempt to do topic classification, where you could produce a keyword describing a document that did not exist in the document.. for example labeling a news article about the Dallas Cowboys with 'sports' or 'event tickets' if those labels did not exist in the article.

Interestingly the algorithm learned that the most important features predicting a word's advertising viability were the query frequency in MSN Live Search (a dead obvious conclusion now supported by experiments) and the TF-IDF metric. Other features like capitalization, link text, phrase & sentence length and title/headings words were not as valuable alone.. yet (unsurprisingly) the best system used nearly all features. The shocker was that the part-of-speech information was best left unused.
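
As a rough illustration of just those two dominant features, here's a toy keyword ranker combining a query-log frequency signal with TF-IDF. The query-frequency table, the weights, and the corpus are all made up; the paper's learned model is far richer.

```python
# Toy ranker over candidate advertising keywords: query-log frequency + TF-IDF,
# combined with assumed weights.
import math

QUERY_FREQ = {"cowboys": 90_000, "tickets": 250_000, "stadium": 40_000}  # hypothetical counts

def rank_keywords(doc, corpus, w_qf=0.6, w_tfidf=0.4, top=5):
    words = doc.lower().split()
    n_docs = len(corpus)
    scores = {}
    for w in set(words):
        tf = words.count(w) / len(words)
        df = sum(1 for d in corpus if w in d.lower().split())
        idf = math.log((1 + n_docs) / (1 + df))
        qf = math.log(1 + QUERY_FREQ.get(w, 0))
        scores[w] = w_qf * qf + w_tfidf * (tf * idf)
    return sorted(scores, key=scores.get, reverse=True)[:top]

corpus = ["the cowboys play at the stadium", "buy tickets for the game", "the game was long"]
print(rank_keywords("cowboys tickets at the stadium", corpus))
```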

I emailed the lead author and learned that the MS lawyers killed the idea of releasing the list of labeled URLs.

Post Script: The second author is Joshua Goodman, who had a hilarious exchange with some authors from La Sapienza University in Rome. They wrote a 2002 Physical Review Letters paper on using gzip for analyzing the similarity of human languages. Goodman responded with this critique, causing the original authors to respond with this response. Looks like there are other follow ups by third-parties. The mark of an effective paper is that it is talked about and remembered.

Frequent Pattern Mining

In the weekly RightNow AI Colloquium @ MSU CS , we read a paper by Jiawei Han et al. called
Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach.
Basically the problem is this: given a sequence of transactions involving multiple items per transaction, what are the frequent itemsets? Itemsets are groups of m items that tend to be purchased together. The earlier SQL-based version of FP-Tree looks interesting as well.

FP-Tree outcompetes the Apriori Association Rule Miner algorithm (Java Code) in time complexity. Not sure how it compares with the raft of other algorithms available at the FIMI.

I'd love to use an algorithm like this to extract pairs and triples of keywords from documents in a clickstream.. basically looking for recurring patterns in browsing/hunting behavior in document repositories.

I would say the biggest issue using either of the two above algorithms in a 'production' system is that they are not incremental. Ideally one could process a batch of transactions a record at a time and form a rolling frequent itemset collection. Itemsets would move in and out of the collection as they achieved or lost 'support', and the cost of adding another transaction is minimal... as in not having to rescan all previous transactions.

My initial idea for how to do this would be to devise an incremental approximation addition to any association miner. At the end of a full run of your miner, you would keep the final itemsets AND the itemsets that just missed the support threshold. The incremental algorithm would process up to a tolerance level of new transactions, say log(n) of the original transaction set size, and look to promote the 'just missed' itemsets if support arrives. Maybe some attempt could be made to remove itemsets if the additional transactions sank their support level to below the cut-line. After more than log(n) new transactions arrive, you can reprocess the entire set or trim off the first log(n) of the old transactions plus the new ones.
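
A bare-bones sketch of that promote/demote bookkeeping; the support threshold, the log(n) tolerance, and the dict layout are all just illustrative choices, not a worked-out design.

```python
# Incremental approximation: keep the frequent itemsets plus a 'just missed'
# pool between full runs, update counts as new transactions arrive, and promote
# or demote across the support cut-line. Full re-mining happens after ~log(n)
# new transactions.
import math

def update_incremental(frequent, just_missed, new_txns, min_support, seen_txns):
    """frequent / just_missed: dicts mapping frozenset(itemset) -> absolute count."""
    pools = {**frequent, **just_missed}
    for txn in new_txns:
        txn = set(txn)
        for itemset in pools:
            if itemset <= txn:
                pools[itemset] += 1
    total = seen_txns + len(new_txns)
    cut = min_support * total
    new_frequent = {s: c for s, c in pools.items() if c >= cut}
    new_missed   = {s: c for s, c in pools.items() if c < cut}
    return new_frequent, new_missed

def needs_full_rerun(new_count, seen_txns):
    # After more than ~log(n) new transactions, fall back to a full FP-growth run.
    return new_count > math.log(max(seen_txns, 2))
```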

There are likely some speedups to be had in subsequent reprocessings. If from the previous full-run you found that a collection of itemsets made the cut, you could prune out those itemsets from the old transactions.. restarting the algorithm with the previous run's itemsets in the "found pool".

Of course with an algorithm like FP-Tree you must save the tree for the next run, and devise a tree rebalancing algorithm to make it incremental (relative frequencies of items change with new information). It gets messy quick.

Monday, March 24, 2008

Road Coloring Problem solved

I'm not sure why this hit my buttons, as it's not my area of expertise, yet I found it very interesting. Avraham Trahtman, a 63-year-old Russian-born mathematician who used to work as a plumber & maintenance guy in Israel, has solved (paper with solution) an interesting problem in graph theory. Unlike other famous problems with few applications, this one appears to have some (network routing), and he's published an updated paper with a sub-quadratic algorithm.

I think it has been captured in news coverage more as a result of his age and story than the importance of the problem.

Tuesday, February 12, 2008

Refocusing of Others Online to Behavioral Targeting and Computational Advertising

On the work/professional front Others Online is going into a bit of a transformation.

A rethinking exercise demonstrated that what we do best is connect people with content based upon implicit attention streams (these can be anything: click streams, blog posts, twitter, searches). That content can be other people (social referrals), content (web pages & blogs) and ads.

Optimizing the process of selecting ads for individual people/groups and context rather than broad categories is what we'll focus on for a while, particularly me, as I concentrate on being the computational advertising guy at OO.

As for the social networking part of the software, we're refocusing on adding value to existing community and topical sites and other like groups that tend to connect people around a specific topic or geography.

Personally I think the refocusing on computational advertising and targeting is a great choice. When was the last time a web advertisement was really relevant to you? Do you really want to see noisy advertisements for mortgage refinancing and Viagra-like products? The key to all of this is to be socially responsible and give users complete control so that we don't fall into the Facebook-Beacon backlash.

Well, that and some clever algorithms to do the learning.

Dual Purpose trip to Germany

I was recently in Germany for two purposes. The first was for the 'Theory of Evolutionary Algorithms' conference at Dagstuhl Castle in Saarland Germany. This was a joy as I was able to focus nearly entirely on my PhD research and dissertation progress. At this point I'm in the home stretch.. I think I have two provable theorems sketched out and a set of new tractable items to explore to finish off the meat of the dissertation.

The second week in Germany was a beer tourist adventure through Bavaria and Bohemia. We hit Munich on day one and drank expensive tourist beer, then took a train to the Czech Republic to see Plzen (home of Pilsener beer) and Budweis (Ceske Budejovice - original home of Budweiser beer). Plzen was very interesting and Budweis was an adventure of train & bus travel. Two days later we took a train back into Germany to Bamberg. Bamberg is stunningly beautiful and is the home of a wide selection of locally crafted smoked lager beers. More on this trip in future posts.