Sunday, January 30, 2011

The Provenance of Data, Data Branding and "Big Data" Hype

The credibility of where data comes from in all these "big data" plays is absolutely crucial. Waving hands re "algorithms" won't cut it. @nealrichter Jan 27, 1010 Tweet

To expand on this tweet, here's the argument: if one of your key products as a startup or business is to "crunch data" and derive or extract value from it, then you should be concerned about data provenance. This is true whether you are crunching your own data or third-party data.

Some examples:
  • Web analytics - crunch web traffic and distill visitation and audience analytics reports for web site owners. Site owners often use these summaries to make decisions and sell their ad space to advertisers.
  • Semantic Web APIs - crunch webpages, tweets, etc. and return topical and semantic annotations of the content
  • Comparison shopping - gather up product catalogs and pricing to aggregate for visitors
  • Web publishers - companies who run websites
  • Prediction services - companies that use data to predict something
In each of the above categories the provenance of the input data and the brand of the output data are key. For each of the above one could name a company with either solid-gold data OR a powerful brand-name and good-enough data. Conversely, we can find examples of companies with great tech but crappy data or a weak brand.

For web publishers, those that host user-generated content have poor provenance in general compared to, for example, news sites. A notable exception is Wikipedia, which has a pure "UGC" model but a solid community process and standards to improve the provenance of its articles (those without references are targeted for improvement).

In comparison shopping, Kayak.com has good data (directly from the airlines) and has built a good brand. The same is true of PriceGrabber and Nextag. TheFind.com, on the other hand, appears to have great data and tech, but no well-known brand.

(I'm refraining from going into specific examples or opinions on big data companies to avoid poking friends in the eye.)

The issue of Provenance and Branding is especially important in sales situations where you are providing a tool (analytics) that helps your customer (a salesperson) sell something to a third party (their customer). If the input data you are using has either a demonstrable provenance or a good brand, you'll have an easier time convincing people that the output of your product is worth having (and reselling).

The old saying for this in computer science is Garbage In, Garbage Out.

In the "big data" world of startups that is blowing by Web 2.0 as the new hotness, there is a startling lack of concern about data provenance. The essential ethos is that if we (the Data Scientists) accumulate enough data and crunch it with magical algorithms then solid-gold data will come out... or at least that's what the hype machine says.

The lesson from the financial meltdown is that magical algorithms making CDOs, CMOs and other derivatives should be viewed through a lens of mistrust. The GIGO principle was forgotten and no one even cared about the provenance (read: credit quality) of the base financial instruments making up the derivatives. The credit rating agencies were just selling their brand and cared little about quality.

In my opinion, there is a clear parallel here to "big data". Trust must be part of the platform and not just tons of CPUs and disk-space. A Brand is a brittle object that is easily broken, so concentrate on quality.
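As a rough illustration of what "trust as part of the platform" could look like, here is a minimal Python sketch in which every value carries the names of the sources it came from, and any derived number inherits the union of its inputs' sources. The names `Sourced`, `from_source`, `average` and `is_trusted`, and the feed names in the usage example, are all hypothetical and chosen for illustration; this is one possible approach under those assumptions, not a description of any particular product.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Sourced:
    """A value bundled with the names of the sources it was derived from."""
    value: float
    sources: frozenset[str]


def from_source(value: float, source: str) -> Sourced:
    """Wrap a raw value with the single source it was read from."""
    return Sourced(value, frozenset({source}))


def average(items: Iterable[Sourced]) -> Sourced:
    """Average the values and carry forward the union of all input sources."""
    items = list(items)
    if not items:
        raise ValueError("cannot average an empty collection")
    total = sum(item.value for item in items)
    sources = frozenset().union(*(item.sources for item in items))
    return Sourced(total / len(items), sources)


def is_trusted(result: Sourced, trusted: set[str]) -> bool:
    """Garbage in, garbage out: trusted only if every input source is trusted."""
    return result.sources <= trusted
```

For example, averaging one fare from a hypothetical `airline_feed` with one from a hypothetical `scraped_forum` yields a number whose provenance includes both, so it fails a check that only trusts the airline feed:

```python
fares = [from_source(199.0, "airline_feed"), from_source(215.0, "scraped_forum")]
avg = average(fares)
print(avg.value)                           # 207.0
print(is_trusted(avg, {"airline_feed"}))   # False
```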

Posted via email from nealrichter's posterous

1 comment:

Christopher Smith said...

I wonder how much was really wrong with the provenance of the data as opposed to the models themselves. I would argue part of the problem isn't so much the quality of the data as how you build a model from it.

For example, I'd argue that UGC is actually loaded with data and in the right contexts is much richer than other data types, provided you look at it through the right lens. Sure, you shouldn't treat it like a credit report, and you need to make sure the incentives for creating content aren't completely out of whack, but the act of writing/creating/etc. is so much more telling than the comparatively passive activities such as found on most quality content sites. Even if everything you write online is a lie, the *way that you lie and the things you lie about* are quite telling. I've seen experimental results that back this. I'm sure it isn't hard to figure out a ton of things about me merely from analyzing this entry.

Part of the problem we have is that the techniques and models employed are all labeled as "proprietary" for obvious reasons, which makes it hard to evaluate the quality of them. As a consequence, all people *can* evaluate is branding and provenance.

The solution in my mind is to start independent evaluations of data quality. This doesn't have to be very hard, but no one has invested enough time in it to really make it a winner. The real test of big data is whether it gives you predictive information that you can later prove actually panned out or didn't. Even the greatest brand with the best provenance is nothing more than a sales pitch if it doesn't deliver on that benefit. Unfortunately, sales teams tend to shy away from even opening this Pandora's box because it exposes them to risk on something they typically don't control.