The credibility of where data comes from in all these "big data" plays is absolutely crucial. Waving hands re "algorithms" won't cut it. @nealrichter Jan 27, 1010 Tweet
To expand on this tweet here's the argument: If one of your key products as a startup or business is to "crunch data" and derive or extract value from it then you should be concerned about data provenance. This is true whether you are crunching your own data or third-party data.
Some examples:
- Web analytics - crunch web traffic and distill visitation and audience analytics reports for web site owners. Often they use these summaries to make decisions and sell their ad-space to advertisers.
- Semantic Web APIs - crunch webpages, tweets etc and return topical and semantic annotations of the content
- Comparison shopping - gather up product catalogs and pricing to aggregate for visitors
- Web publishers - companies who run websites
- Prediction services - companies that use data to predict something
In each of the above categories the provenance of the input data and brand of the output data is key. For each of the above one could name a company with either solid-gold data OR a powerful brand-name and good-enough data. Conversely we can find examples of companies with great tech but crappy data or a weak brand.
For web publishers, those that host user-generated content have poor provenance in general compared to news sites (for example). A notable exception is Wikipedia who has a pure "UGC" model but a solid community process and standards to improve provenance of their articles (those without references are targeted for improvement).
In comparison shopping Kayak.com has good data (directly from the airlines) and has built a good brand. The same is true of PriceGrabber and Nextag. TheFind.com on the other hand appears to have great data and tech, but no well known brand.
(I'm refraining from going into specific examples or opinions on big data companies to avoid poking friends in the eye.)
The issue of Provenance and Branding is especially important in sales situations where you are providing a tool (analytics) that helps your customer (a sales person) sell something to a third-party (their customer). If the input data you are using either has a demonstrable provenance or a good brand you'll have an easier time convincing people that the output of your product is worth having (and reselling).
The old saying for this in computer science is Garbage In, Garbage Out.
In "big data" world of startups that is blowing by Web 2.0 as the new hotness there is a startling lack of concern about data provenance. The essentially ethos is that if we (the Data Scientists) accumulate enough data and crunch it with magical algorithms then solid-gold data will come out... or at least that's what the hype machine says.
The lesson from the financial melt down is that magical algorithms making CDOs, CMOs and other derivatives should be viewed with a lens of mistrust. The GIGO principle was forgotten and no one even cared about the provenance (read credit quality) of the base financial instruments making up the derivatives. The credit rating agencies were just selling their brand and cared little about quality.
In my opinion, there is a clear parallel here to "big data". Trust must be part of the platform and not just tons of CPUs and disk-space. A Brand is a brittle object that is easily broken, so concentrate on quality.