Friday, August 28, 2009

The GPL and scripting languages

I was recently asked by an associate how the GPL and LGPL impact scripting languages. The short answer is that it's not clear.

My general take on the LGPL and scripting languages is that if one cut-and-pastes LGPL code A into code B, then code B falls under the LGPL. However, if one script-includes code A into code B, via the only real inclusion mechanism scripting languages offer, then code B does not fall under the LGPL.

php: include 'Foo.php';
perl: require Foo::Bar;
python: import Foo
ruby: require "foo"
javascript: <script type="text/javascript" src="external.js"></script>


This differs from the application of the LGPL to compiled languages like C/C++, where the programmer decides how the machine code is combined: statically versus dynamically linked. In Java, if one has an LGPL jar file, the code can be used in the standard way. If your code is combined with the contents of the LGPL jar file into a new jar, then your code falls under the LGPL.

What does the scripting language really do under the hood? It's different in each language. Some may physically include the file and parse-interpret it as one block of code; others may parse-interpret the files separately and resolve references between them much like a dynamic linker. Do the details matter? Unknown, yet one could reasonably assume that the author of a piece of script code released under the LGPL intended script-include style uses to be acceptable. If in doubt, check with the author and save that email.
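To make the distinction concrete, here is a minimal Python sketch (foo.py and its greet() function are hypothetical, standing in for some LGPL-licensed module): the import is resolved at runtime against a separate file, much like dynamic linking, rather than the LGPL text being merged into our code.

# app.py -- our own code.  foo.py (the hypothetical LGPL module) sits in its
# own file and is never copied into this one; the interpreter resolves the
# reference at runtime, roughly analogous to dynamic linking.
import importlib

import foo                              # loads and binds foo.py at runtime

print(foo.greet("world"))               # call into the (hypothetical) LGPL code
print(foo.__file__)                     # the LGPL code still lives in its own file

foo2 = importlib.import_module("foo")   # the same resolution, done explicitly
print(foo2 is foo)                      # True: one shared module object, not a textual copy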

What about the GPL? That one is clearer: no mixing of any kind can be done. One could argue that interpreted script is not linked in the way that compiled languages are... yet conservatively this is a "thin reed" to stand on.

What does this mean for the myriad of GPL JavaScript out there intermingling with non-GPL and proprietary JavaScript in browsers visiting script-heavy websites everywhere? It's a dog's breakfast of legal issues.

The post Twenty questions about the GPL is pretty informative on these issues and worth a read... it goes deeper than the above.


Thursday, August 20, 2009

What can we learn from the Google File System?

I found this via my twitter herd and I've been thinking about it for days. Kirk McKusick interviews Sean Quinlan about the Google File System. Fascinating stuff.

We're very fortunate to have storage scalability challenges ourselves at 'Undisclosed' (formerly OthersOnline). We're amassing mountains of modest chunks of information, keyed by many hundreds of millions of keys. Our storage has evolved thusly:

  1. MySQL table with single key and many values, one row per value.
  2. MySQL key/value table with one value per key
  3. memcachedb - memcached backed by Berkeley DB
  4. Nginx + WebDAV system
  5. Our own Sam Tingleff's valkyrie - Consistent Hashing + Tokyo Tyrant
#5 is the best performing, yet we still aren't going to escape the unpredictable I/O performance of EC2 disks. To my understanding, it serves a role much like the chunk servers of GFS. We need low-latency read access to storage on one side and high-throughput write access on the other.
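For illustration only, here is a minimal sketch of the consistent-hashing idea behind #5 (not valkyrie's actual code; node names and the replica count are made up): keys and storage nodes are hashed onto the same ring, each key is owned by the first node clockwise from it, and adding or removing a node only remaps a small slice of the keys.

import bisect
import hashlib

class HashRing(object):
    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self.ring = []                  # sorted list of (position, node) on the hash ring
        for node in nodes:
            self.add_node(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        # virtual nodes smooth out the key distribution across physical nodes
        for i in range(self.replicas):
            bisect.insort(self.ring, (self._hash("%s#%d" % (node, i)), node))

    def node_for(self, key):
        idx = bisect.bisect_left(self.ring, (self._hash(key),))
        if idx == len(self.ring):       # wrap around the ring
            idx = 0
        return self.ring[idx][1]

ring = HashRing(["tyrant-1:1978", "tyrant-2:1978", "tyrant-3:1978"])
print(ring.node_for("user:12345"))      # which Tokyo Tyrant instance owns this key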

Combining the insights from the above interview with the excellent Pathologies of Big Data, I'm left with the impression that one must absolutely limit the number of disk seeks and do some 'perfect' number of I/O ops against chunks of X MB.
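As a rough sketch of what that impression implies for the write path (the file name and the 64 MB chunk size are arbitrary choices for the example), small records get buffered in memory and flushed as one large sequential append, so the disk sees a few big I/Os instead of one seek per record:

import os

CHUNK_BYTES = 64 * 1024 * 1024          # the 'X' in question; arbitrary here

class ChunkedWriter(object):
    """Buffer small records and append them to disk in large sequential chunks."""
    def __init__(self, path):
        self.f = open(path, "ab")       # append-only, GFS-chunk-file style
        self.buf = []
        self.buffered = 0

    def write(self, record):
        self.buf.append(record)
        self.buffered += len(record)
        if self.buffered >= CHUNK_BYTES:
            self.flush()

    def flush(self):
        if not self.buf:
            return
        self.f.write(b"".join(self.buf))    # one big sequential write, few seeks
        self.f.flush()
        os.fsync(self.f.fileno())           # push it through the OS cache
        self.buf, self.buffered = [], 0

w = ChunkedWriter("/tmp/values.chunk")
w.write(b"key123\tsome modest blob of profile data\n")
w.flush()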

Random questions:
What is X for some random piece of hardware in the cloud? How do we find it? What if X changes as the disk fills up? What kind of mapping function should the application key be sent through to return a storage key, so that we get the best sequential disk access on write and the best memory caching of chunks on read? What about fragmentation?
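I don't have answers, but the first couple of questions at least seem empirically testable. Something like this rough benchmark sketch (paths, sizes and block-size choices are invented for the example) could be run on an EC2 node to see which write size the disk actually prefers:

import os
import time

TOTAL_MB = 256                          # total volume written per trial; arbitrary
for block_mb in (1, 4, 16, 64, 128):
    block = b"x" * (block_mb * 1024 * 1024)
    start = time.time()
    with open("/tmp/iobench.dat", "wb") as f:
        for _ in range(TOTAL_MB // block_mb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())            # make the disk do the work, not the cache
    elapsed = time.time() - start
    print("%4d MB blocks: %6.1f MB/s" % (block_mb, TOTAL_MB / elapsed))
os.remove("/tmp/iobench.dat")

A real run would need repeated trials and care to defeat the page cache, but it gives a ballpark for X on a given box.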

It does seem as though the newish adage is very true: Web 2.0 is mostly about exposing old-school Unix commands and system calls over HTTP. I keep thinking this must have been solved a dozen times before. The cycbuf feature of Usenet servers, perhaps?


Sunday, August 02, 2009

Response to Dr. Lance Fortnow's CACM opinion

Dr. Lance Fortnow published a strong argument against the current 'strong conference' CS system in the August issue of CACM. His essential desire is that CS move to a system similar to the hard sciences and engineering: conferences would accept any paper of reasonable quality and would not publish proceedings, with Journals held as the sole vehicle for publication.

My Questions:

  1. While CS should have a stronger Journal system, why should that come at the expense of quality conferences?

    The reputation of CS conferences as a venue of publication means that it is acceptable both to publish and to cite papers from these conferences. As a result, one can read a conference paper, compose a citing follow-up and publish it within a year. This fluidity of ideas, without the sometimes 18-month to two-year wait on Journal publication, is a great advantage!

    Yes, other disciplines publish pre-prints to arXiv.org... but is that really a solution when a paper has been rejected yet remains available on arXiv.org for 18 months?

  2. Is this problem really that acute in all communities? Can't it be solved at the community level?

    Certainly in AI, Machine Learning, Data Mining and Evolutionary Computation, I perceive that the Journals are held in high regard (all papers are great) and that conference proceedings can be a mixed bag of strong and weak papers. I am getting strong pressure from my advisors and peers to consider Journal versions of some of my conference papers. EC specifically has a non-proceedings conference for meeting and discussing less settled results.

    If Dr. Fortnow feels that his area of theoretical computer science is too fragmented, then a solution would be to found more Journals and push the best conference papers into those Journals more heavily. Perhaps the conference system of that area would then shrink as the Journals established themselves.

  3. Why would industrial researchers and scientists participate in publications with such long time cycles?

    Again, the fluidity is an advantage. Were CS suddenly to switch to a soft conference system (little benefit to participation other than networking), I fear that industry participation in publication venues would suffer. The time scales of Journals mean that, at publication time, the results were done an eternity ago in industry terms. Convincing your supervisor to allow participation and publication in a conference is a far easier sell than an extended Journal submission effort.

    One also wonders if the Journal system is not implicitly biased towards academic communities where participants are chasing tenure. The 'ranking of people' referenced by Dr. Fortnow is very much for academic institutions and not of much value to industry, IMHO.

  4. Why do we want ANY such top-down forcing of CS organization?

    The culture of CS is much more aligned with self-organization and with communities forming out of a birds-of-a-feather effect. This also aligns with the changing face of corporate cultures, and of culture in general. Such a top-down driven reorganization would likely both fail and break the inherent fluidity of ideas and results in CS.

    I also have an unfounded suspicion that such a top-down forced re-org would result in a clustering of power and influence towards the traditional centers of power in academia. If one picks up the conference proceedings in my favorite CS areas and does a frequency count of the authors' institutions, the distribution is very much long-tail. The 'elite' universities do not dominate the results, meaning that the 'in crowd' effect is much weaker in CS.

    Feudal systems are dying fast for a reason.

  5. Obviously the current system is serving a need; doesn't that speak for itself?

    If CS researchers and scientists continue to attend, publish at, and found conferences, is this not evidence that the system is serving a real need?



While Dr. Fortnow is correct on his points about the problems faced by conference program committees, the correct response is best made within the affected community: found some Journals and compete with the conferences for publications and reputation. I don't accept that a strong Journal system can only be created by first wiping away a very fluid and successful conference system.

My personal solution to strengthening the Journal system? I'll set a goal of submitting a Journal publication or two in the next year. I am completely remiss in not yet having submitted anything to a Journal.

As a counterpoint to the above arguments: if I could have a wish, it would be for a Journal system with fast review times and immediate publication of accepted papers, one that markets itself by cherry-picking great papers from conferences and encouraging those authors to submit to the Journal. Something like the Journal of Machine Learning Research. Publishing referee reviews also sounds interesting.