Sunday, January 30, 2011

The Provenance of Data, Data Branding and "Big Data" Hype

The credibility of where data comes from in all these "big data" plays is absolutely crucial. Waving hands re "algorithms" won't cut it. @nealrichter Jan 27, 2011 Tweet

To expand on this tweet, here's the argument: if one of your key products as a startup or business is to "crunch data" and derive or extract value from it, then you should be concerned about data provenance. This is true whether you are crunching your own data or third-party data.

Some examples:
  • Web analytics - crunch web traffic and distill visitation and audience analytics reports for web site owners. Often they use these summaries to make decisions and sell their ad-space to advertisers.
  • Semantic Web APIs - crunch webpages, tweets etc and return topical and semantic annotations of the content
  • Comparison shopping - gather up product catalogs and pricing to aggregate for visitors
  • Web publishers - companies who run websites
  • Prediction services - companies that use data to predict something
In each of the above categories the provenance of the input data and brand of the output data is key. For each of the above one could name a company with either solid-gold data OR a powerful brand-name and good-enough data. Conversely we can find examples of companies with great tech but crappy data or a weak brand.

For web publishers, those that host user-generated content generally have poorer provenance than, for example, news sites. A notable exception is Wikipedia, which has a pure "UGC" model but a solid community process and standards to improve the provenance of its articles (those without references are targeted for improvement).

In comparison shopping has good data (directly from the airlines) and has built a good brand. The same is true of PriceGrabber and Nextag. on the other hand appears to have great data and tech, but no well known brand.

(I'm refraining from going into specific examples or opinions on big data companies to avoid poking friends in the eye.)

The issue of Provenance and Branding is especially important in sales situations where you are providing a tool (analytics) that helps your customer (a sales person) sell something to a third party (their customer). If the input data you are using has either a demonstrable provenance or a good brand, you'll have an easier time convincing people that the output of your product is worth having (and reselling).

The old saying for this in computer science is Garbage In, Garbage Out.

In the "big data" world of startups that is blowing by Web 2.0 as the new hotness, there is a startling lack of concern about data provenance. The essential ethos is that if we (the Data Scientists) accumulate enough data and crunch it with magical algorithms then solid-gold data will come out... or at least that's what the hype machine says.

The lesson from the financial meltdown is that magical algorithms making CDOs, CMOs and other derivatives should be viewed through a lens of mistrust. The GIGO principle was forgotten and no one even cared about the provenance (read: credit quality) of the base financial instruments making up the derivatives. The credit rating agencies were just selling their brand and cared little about quality.

In my opinion, there is a clear parallel here to "big data". Trust must be part of the platform and not just tons of CPUs and disk-space. A Brand is a brittle object that is easily broken, so concentrate on quality.

Posted via email from nealrichter's posterous

Friday, January 14, 2011

Finance for Engineers

Last summer I took a great mini-course at MIT Sloan on Finance. It's essentially a breadth-first review of the MBA course, complete with three case studies and a review of project evaluation methods via net present value analysis. Approximately 80% of the attendees were engineers/techies with 10+ years of experience, and maybe 25% had PhDs.

The first case study is Wilson Lumber from Harvard. The material is copyrighted, yet these links look like accurate distillations by business students.
The initial position is that Wilson Lumber is a growing small business with good suppliers and loyal customers. Volume and revenue are both up period over period. The question is whether the bank should increase his line of credit to fund the business. Once you break down the financial statements and model the business, the answer is No. Essentially Mr Wilson is overextended by many measures and is growing at the expense of his balance sheet; loaning him money will only make the problem bigger down the road. His basic options are to take in a partner as co-owner for cash, go broke, or raise prices to lower volume, improve margins and slowly rebuild the balance sheet.
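To make "growing at the expense of his balance sheet" concrete, here is a rough Python sketch of the cash gap that growth opens up. The function name, the simple one-year model and every number in it are my own assumptions for illustration; none of them are figures from the Wilson Lumber case.

```python
def external_funding_needed(sales, growth, working_capital_ratio, net_margin):
    """Rough one-year cash gap for a growing business (illustrative model only).

    sales                 -- current annual sales
    growth                -- expected sales growth rate (0.30 means 30%)
    working_capital_ratio -- assumed extra inventory + receivables needed
                             per extra dollar of sales
    net_margin            -- assumed net profit per dollar of sales,
                             all of it retained in the business
    """
    new_sales = sales * (1 + growth)
    added_working_capital = (new_sales - sales) * working_capital_ratio
    retained_profit = new_sales * net_margin
    # A positive result means growth eats more cash than the business earns.
    return added_working_capital - retained_profit


# Hypothetical numbers: thin margins plus fast growth leave a cash gap,
# while better margins and slower growth close it.
fast_thin = external_funding_needed(2_000_000, 0.30, 0.25, 0.02)
slow_fat = external_funding_needed(2_000_000, 0.05, 0.25, 0.04)
print(f"fast growth, thin margin: {fast_thin:,.0f}")
print(f"slow growth, better margin: {slow_fat:,.0f}")
```

Under these assumed inputs the fast-growing, thin-margin business needs outside cash every year it grows, which is why a bigger credit line only postpones the problem, while raising prices (lower growth, higher margin) lets retained profit cover the working capital.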

We then went through two NPV exercises. The first was a basic go/no-go analysis of an engineering project, done bottom up by putting all cost/benefit assumptions in a model and iterating through possibilities. The second was an analysis of a joint venture between two biotech companies. Everything from external capital and deal structure to market penetration projections was worked in. Very informative and pretty interesting work for engineers to do once the terminology and methods were explained.
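For engineers who haven't seen it, the go/no-go exercise boils down to something like the Python sketch below. This is not the course's model: the `npv` helper, the 10% discount rate, the up-front cost and the benefit scenarios are all assumptions I made up to show the mechanics.

```python
def npv(rate, cash_flows):
    """Net present value of cash_flows, one entry per period.

    cash_flows[0] happens today (not discounted); cash_flows[t] is
    discounted by (1 + rate) ** t.
    """
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))


# Hypothetical project: pay 100,000 up front, then collect a yearly
# benefit for 5 years. Iterate over benefit assumptions to see where
# the decision flips.
DISCOUNT_RATE = 0.10    # assumed cost of capital
UPFRONT_COST = 100_000  # assumed initial investment

for yearly_benefit in (20_000, 30_000, 40_000):
    flows = [-UPFRONT_COST] + [yearly_benefit] * 5
    value = npv(DISCOUNT_RATE, flows)
    decision = "go" if value > 0 else "no-go"
    print(f"benefit {yearly_benefit:>6,}/yr -> NPV {value:>12,.0f} -> {decision}")
```

The useful part of the exercise is less the formula than the loop: once every assumption is an explicit input, you can see which ones the decision is actually sensitive to.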

Professor Jenter shared two amusing anecdotes:
  • His MIT and Stanford MBA students often run off to found start-ups and forget the basic Wilson Lumber case. By the time they approach him for help it's too late and they are in Mr Wilson's position: shut down, take in cash at the price of heavy equity dilution (and loss of control), or slow growth dramatically.
  • Also a quote along the lines of "Startups founded by MIT PhDs fail at a rate far above average".
This certainly hammered home the lesson that strategic planning for growth is very important, even for what look like non hyper-growth (software) companies. I'd recommend this course to any engineer wanting a quick structured intro to basic financial management.
