<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-17598372</id><updated>2012-01-24T00:49:44.507-08:00</updated><category term='czech'/><category term='journals'/><category term='solr'/><category term='beer'/><category term='bigdata'/><category term='htdig'/><category term='ai history'/><category term='resolutions'/><category term='data mining'/><category term='search engines'/><category term='news'/><category term='ant system'/><category term='influence metrics'/><category term='dogma'/><category term='scripting languages'/><category term='recommender systems'/><category term='meaning'/><category term='snakeoil'/><category term='analytics'/><category term='refocusing'/><category term='open source'/><category term='people search'/><category term='mapreduce'/><category term='cs'/><category term='hadoop'/><category term='classification'/><category term='itemset mining'/><category term='devices'/><category term='computational advertising'/><category term='collective search'/><category term='tokyo tyrant'/><category term='keyword'/><category term='AI'/><category term='attention data'/><category term='consulting'/><category term='software engineering'/><category term='natural language processing'/><category term='mahalo'/><category term='som'/><category term='germany'/><category term='hype'/><category term='startups'/><category term='apache'/><category term='facebook'/><category term='implicit web'/><category term='enterprise search'/><category term='learning from past mistakes'/><category term='interns'/><category term='scalability'/><category term='personal'/><category term='mysql'/><category term='adword'/><category 
term='reranking'/><category term='nomads'/><category term='law'/><category term='vacation'/><category term='bozeman'/><category term='chacha'/><category term='semantic web'/><category term='culture'/><category term='social search'/><category term='human relevance'/><category term='streams'/><category term='learning structure'/><category term='key-value stores'/><category term='sentiment analysis'/><category term='learning to rank'/><category term='human metrics'/><category term='blog'/><category term='query understanding'/><category term='personalized search'/><category term='databases'/><category term='scoble'/><category term='montana'/><category term='introductions'/><category term='copyright'/><category term='patent'/><category term='memoriam'/><category term='monopoly'/><category term='food'/><category term='text summarization'/><category term='twitter'/><category term='behavior'/><category term='intellectual property'/><category term='seattle'/><category term='rightnow'/><category term='old ideas'/><category term='evolutionary algorithms'/><category term='machine learning'/><category term='othersonline'/><category term='community sites'/><category term='highly available'/><category term='conferences'/><category term='google'/><title type='text'>aicoder</title><subtitle type='html'>Musings about artificial intelligence, search engines, machine learning, computational advertising, intellectual property law, social media &amp;amp; widgets, and good beer.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/'/><link rel='hub' 
href='http://pubsubhubbub.appspot.com/'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>86</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-17598372.post-5284234227984640920</id><published>2012-01-07T09:13:00.001-08:00</published><updated>2012-01-07T09:28:36.040-08:00</updated><title type='text'>Software pattern for proportional control of QPS in a webservice</title><content type='html'>&lt;div class="posterous_autopost"&gt;&lt;h3 style="padding-left: 0px; padding-right: 0px; margin-right: 0px; padding-top: 0px; text-align: left; font-size: 15pt; margin-left: 0px; margin-bottom: 4px; font-family: Helvetica,Arial,sans-serif; margin-top: 28px; padding-bottom: 0px;"&gt; Problem Statement&lt;/h3&gt;&lt;p style="padding-right: 0px; padding-left: 0px; padding-top: 0px; text-align: left; margin-bottom: 10px; padding-bottom: 0px; line-height: 17px; margin-right: 0px; font-size: 13px; margin-left: 0px; font-family: Helvetica,Arial,sans-serif; margin-top: 10px;"&gt; Imagine you are writing a webservice that must call a back-end service, such as a data store.  Let's assume (without loss of generality) that the data store (and the hardware supporting it) has some limit on the QPS that it can handle.  We'd like the client system (your web service) to impose a limit on the QPS to the back-end service.  
Also assume that this is a distributed webservice, with many worker threads on many different machines.&lt;/p&gt; &lt;h3 style="padding-left: 0px; padding-right: 0px; margin-right: 0px; padding-top: 0px; text-align: left; font-size: 15pt; margin-left: 0px; margin-bottom: 4px; font-family: Helvetica,Arial,sans-serif; margin-top: 28px; padding-bottom: 0px;"&gt; &lt;a name="133c4f9fd80a0777_DesignPatternforaQPSController-Requirements"&gt;&lt;/a&gt;Requirements&lt;/h3&gt;&lt;ol style="line-height: 17px; text-align: left; font-size: 13px; font-family: Helvetica,Arial,sans-serif;"&gt; &lt;li style="font-size: 10pt; line-height: 13pt; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px;"&gt;Given a goal in QPS, hold the maximum outgoing requests per second to that goal.&lt;/li&gt; &lt;li style="font-size: 10pt; line-height: 13pt; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px;"&gt;&lt;i&gt;Be fast.&lt;/i&gt; Maintain a fast controller settling time when the goal or queries change.&lt;/li&gt; &lt;li style="font-size: 10pt; line-height: 13pt; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px;"&gt;&lt;i&gt;Be adaptive.&lt;/i&gt;  Respond to swings in incoming requests that need to be queried against the service.&lt;/li&gt; &lt;li style="font-size: 10pt; line-height: 13pt; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px;"&gt;&lt;i&gt;Be distributed.&lt;/i&gt; Act locally against global numbers without knowing the number of workers.&lt;/li&gt; &lt;li style="font-size: 10pt; line-height: 13pt; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; 
padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px;"&gt;&lt;i&gt;Be robust.&lt;/i&gt;  Handle additions and subtractions of workers/clients to the system without coordination. Minimize overshoot.&lt;/li&gt; &lt;/ol&gt;&lt;p&gt;&lt;/p&gt;&lt;div style="text-align: left;"&gt;&lt;img src="https://wiki.rubiconproject.com/download/attachments/9654871/feedback_controller.jpg?version=1&amp;amp;modificationDate=1321489117000" border="1" style="font-family: Helvetica, Arial, sans-serif; font-size: 13px; line-height: 17px; background-color: rgb(255, 255, 255);" width="500" /&gt;&lt;/div&gt; &lt;div style="text-align: left;"&gt; &lt;p style="padding-left: 0px; padding-right: 0px; padding-top: 0px; margin-right: 0px; margin-left: 0px; margin-bottom: 10px; margin-top: 10px; padding-bottom: 0px;"&gt;&lt;/p&gt;&lt;h3 style="font-family: Helvetica,Arial,sans-serif; line-height: normal; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; font-size: 15pt; margin-top: 28px; margin-right: 0px; margin-bottom: 4px; margin-left: 0px;"&gt; Design&lt;/h3&gt;&lt;div&gt;&lt;span style="line-height: 17px;"&gt;Assume that querying this back-end service, while valuable and mostly needed, is optional under duress.  It's far more important for your front-end service to be responsive and return some error or 'No Content' than to hang on a busted back-end.   As a result we'll use a sampling rate 'r' to denote the fraction of requests for which the web service should query the back-end.  Under normal conditions this rate is 1 (100%).  Also assume that the goal in QPS to the back-end is set in some configuration area in your system.  Under duress the rate r will be adaptively tuned to obey the QPS goal.  
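&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="line-height: 17px;"&gt;To make the sampling idea concrete, here is a minimal Python sketch of a per-worker gate. It is only an illustration under stated assumptions: the class name QpsSampler and its method names are hypothetical, the update applies the r_new = r * (G/M) rule with clamping that is listed under Adaptive Mechanism in this post, and leaving r unchanged when the measured QPS is zero is an assumption added for the sketch rather than part of the design.&lt;/span&gt;&lt;/div&gt;
&lt;pre&gt;
```python
import random


class QpsSampler:
    """Hypothetical per-worker gate: sampling rate r plus a proportional update."""

    def __init__(self, rate=1.0, rng=None):
        self.rate = rate  # r: fraction of requests that query the back-end
        self.rng = rng or random.Random()

    def update(self, goal_qps, measured_qps):
        """Apply r_new = r * (G / M), then clamp r_new to [0, 1]."""
        if measured_qps == 0:
            # Assumption for this sketch: with no measurement, keep the current r.
            return self.rate
        r_new = self.rate * (goal_qps / measured_qps)
        self.rate = max(0.0, min(1.0, r_new))
        return self.rate

    def should_query(self):
        """Return True for roughly a fraction r of calls."""
        weights = (self.rate, 1.0 - self.rate)
        return self.rng.choices((True, False), weights=weights)[0]
```
&lt;/pre&gt;
&lt;div&gt;&lt;span style="line-height: 17px;"&gt;For example, a worker at r = 1.0 that sees M = 200 against G = 100 moves to r = 0.5, and roughly half of its requests then skip the back-end.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="line-height: 17px;"&gt;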
Also assume you have smartly implemented some monitoring system like Ganglia/Nagios/Cacti and are emitting events to it when you call the back-end service.&lt;/span&gt;&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;p style="padding-left: 0px; padding-right: 0px; padding-top: 0px; margin-right: 0px; line-height: 17px; font-size: 13px; margin-left: 0px; margin-bottom: 10px; font-family: Helvetica,Arial,sans-serif; margin-top: 10px; padding-bottom: 0px;"&gt; &lt;b&gt;Inputs&lt;/b&gt;&lt;/p&gt;&lt;ul style="line-height: 17px; font-size: 13px; font-family: Helvetica,Arial,sans-serif;"&gt;&lt;li style="font-size: 10pt; line-height: 13pt; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px;"&gt; G = Goal QPS&lt;/li&gt;&lt;li style="font-size: 10pt; line-height: 13pt; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px;"&gt;M = Current measured QPS (from Ganglia).&lt;/li&gt; &lt;li style="font-size: 10pt; line-height: 13pt; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px;"&gt;r = Current sampling rate [0,1]&lt;/li&gt; &lt;/ul&gt;&lt;p style="padding-left: 0px; padding-right: 0px; padding-top: 0px; margin-right: 0px; line-height: 17px; font-size: 13px; margin-left: 0px; margin-bottom: 10px; font-family: Helvetica,Arial,sans-serif; margin-top: 10px; padding-bottom: 0px;"&gt; &lt;b&gt;Outputs&lt;/b&gt;&lt;/p&gt;&lt;ul style="line-height: 17px; font-size: 13px; font-family: Helvetica,Arial,sans-serif;"&gt;&lt;li style="font-size: 10pt; line-height: 13pt; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px;"&gt; r_new = A sampling rate [0,1]&lt;/li&gt;&lt;/ul&gt;&lt;p style="padding-left: 0px; 
padding-right: 0px; padding-top: 0px; margin-right: 0px; line-height: 17px; font-size: 13px; margin-left: 0px; margin-bottom: 10px; font-family: Helvetica,Arial,sans-serif; margin-top: 10px; padding-bottom: 0px;"&gt; &lt;b&gt;Adaptive Mechanism&lt;/b&gt;&lt;/p&gt;&lt;ul style="line-height: 17px; font-size: 13px; font-family: Helvetica,Arial,sans-serif;"&gt;&lt;li style="font-size: 10pt; line-height: 13pt; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px;"&gt; r_new = r * (G/M)&lt;/li&gt;&lt;li style="font-size: 10pt; line-height: 13pt; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px;"&gt; r_new = MAX(0,MIN(1,r_new))    // clamp r_new to [0,1]&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;&lt;p style="padding-left: 0px; padding-right: 0px; padding-top: 0px; margin-right: 0px; line-height: 17px; font-size: 13px; margin-left: 0px; margin-bottom: 10px; font-family: Helvetica,Arial,sans-serif; margin-top: 10px; padding-bottom: 0px;"&gt; &lt;b&gt;Benefits&lt;/b&gt;&lt;/p&gt;&lt;ul style="line-height: 17px; font-size: 13px; font-family: Helvetica,Arial,sans-serif;"&gt;&lt;li style="font-size: 10pt; line-height: 13pt; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px;"&gt; Needs only global G and M as inputs.&lt;/li&gt;&lt;li style="font-size: 10pt; line-height: 13pt; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px;"&gt; No coordination needed between workers/servers other than the globally observed M.&lt;/li&gt;&lt;li style="font-size: 10pt; line-height: 13pt; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; 
padding-bottom: 0px; padding-left: 0px;"&gt; Adaptively moves the per-worker sampling rate independently of all other workers' rates.  &lt;/li&gt;&lt;li style="font-size: 10pt; line-height: 13pt; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px;"&gt; Workers can have different incoming QPS rates from a load balancer; the controller will adapt.&lt;/li&gt;&lt;/ul&gt;&lt;p style="padding-left: 0px; padding-right: 0px; padding-top: 0px; margin-right: 0px; line-height: 17px; font-size: 13px; margin-left: 0px; margin-bottom: 10px; font-family: Helvetica,Arial,sans-serif; margin-top: 10px; padding-bottom: 0px;"&gt; &lt;b&gt;Failure Modes&lt;/b&gt;&lt;/p&gt;&lt;ul style="line-height: 17px; font-size: 13px; font-family: Helvetica,Arial,sans-serif;"&gt;&lt;li style="font-size: 10pt; line-height: 13pt; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px;"&gt; If the sensor for M stops being updated, the controller is blind.&lt;/li&gt;&lt;li style="font-size: 10pt; line-height: 13pt; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px;"&gt; If the Goal is set or reset to zero, the controller will stop traffic.&lt;/li&gt;&lt;li style="font-size: 10pt; line-height: 13pt; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px;"&gt; &lt;em&gt;Both of these can be addressed easily.&lt;/em&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;&lt;b style="font-family: Helvetica,Arial,sans-serif; font-size: 13px; line-height: 17px; background-color: rgb(255,255,255);"&gt;Desired Response&lt;/b&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;ul style="font-size: 13px; line-height: 17px; font-family: 
Helvetica,Arial,sans-serif; background-color: rgb(255,255,255);"&gt; &lt;li style="font-size: 10pt; line-height: 13pt; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px;"&gt;The overshoot/undershoot is called 'ringing'&lt;/li&gt; &lt;li style="font-size: 10pt; line-height: 13pt; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px;"&gt;The time to approach the goal is called the 'settling time'&lt;/li&gt; &lt;/ul&gt;&lt;p style="font-size: 13px; line-height: 17px; margin-top: 10px; margin-right: 0px; margin-bottom: 10px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; font-family: Helvetica,Arial,sans-serif; background-color: rgb(255,255,255);"&gt; &lt;span class="image-wrap"&gt;&lt;img src="https://wiki.rubiconproject.com/download/attachments/9654871/Adaptive_Response.png?version=1&amp;amp;modificationDate=1321548383000" border="1" /&gt;&lt;/span&gt;&lt;/p&gt;&lt;h3 style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; font-size: 15pt; margin-top: 28px; margin-right: 0px; margin-bottom: 4px; margin-left: 0px; font-family: Helvetica,Arial,sans-serif; background-color: rgb(255,255,255);"&gt; &lt;a name="DesignPatternforaQPSController-ImplementationNotes"&gt;&lt;/a&gt;Implementation Notes&lt;/h3&gt;&lt;ul style="background-color: rgb(255, 255, 255); "&gt;&lt;li style="font-family: Helvetica, Arial, sans-serif; font-size: 10pt; line-height: 13pt; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; "&gt; Emit a ganglia/graphite/XXX counter for both requests sent and skipped&lt;/li&gt;&lt;li style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; 
padding-right: 0px; padding-bottom: 0px; padding-left: 0px; "&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="line-height: 13pt;"&gt;Use a smoothed average of the measured QPS to &lt;/span&gt;&lt;span class="Apple-style-span" style="line-height: 17px;"&gt;damp&lt;/span&gt;&lt;span class="Apple-style-span" style="line-height: 13pt;"&gt; controller jitter.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;li style="font-family: Helvetica, Arial, sans-serif; font-size: 10pt; line-height: 13pt; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; "&gt;(Optional) Each worker should slightly randomize when it performs its sampling-rate update to smooth out any startup/restart ringing.&lt;/li&gt;&lt;/ul&gt;&lt;h3 style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; font-size: 15pt; margin-top: 28px; margin-right: 0px; margin-bottom: 4px; margin-left: 0px; font-family: Helvetica,Arial,sans-serif; background-color: rgb(255,255,255);"&gt; &lt;a name="DesignPatternforaQPSController-Invertability"&gt;&lt;/a&gt;Invertibility&lt;/h3&gt;&lt;p style="font-size: 13px; line-height: 17px; margin-top: 10px; margin-right: 0px; margin-bottom: 10px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; font-family: Helvetica,Arial,sans-serif; background-color: rgb(255,255,255);"&gt; This design could be inverted for a server implementation.  
If your webservice has a set of APIs that are heavy to execute, then this controller could be used to control the incoming QPS that is delegated to the heavy work.&lt;/p&gt; &lt;ol style="font-size: 13px; line-height: 17px; font-family: Helvetica,Arial,sans-serif; background-color: rgb(255,255,255);"&gt;&lt;li style="font-size: 10pt; line-height: 13pt; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px;"&gt; Receive the request.&lt;/li&gt;&lt;li style="font-size: 10pt; line-height: 13pt; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px;"&gt;Test the request against the sampling rate.&lt;/li&gt; &lt;li style="font-size: 10pt; line-height: 13pt; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px;"&gt;If Yes, delegate the request to the executor.&lt;/li&gt; &lt;li style="font-size: 10pt; line-height: 13pt; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px;"&gt;If No, respond to the client with 'HTTP 204 No Content' or an equivalent empty response.&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;/div&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-5284234227984640920?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/5284234227984640920/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=5284234227984640920' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/17598372/posts/default/5284234227984640920'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/5284234227984640920'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2012/01/software-pattern-for-proportional.html' title='Software pattern for proportional control of QPS in a webservice'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-7606394556795092235</id><published>2011-12-27T09:10:00.001-08:00</published><updated>2011-12-27T09:10:44.040-08:00</updated><title type='text'>Stanford's Introduction to Computational Advertising course</title><content type='html'>&lt;div class='posterous_autopost'&gt;&lt;div&gt;&lt;div&gt;&lt;b style="font-size: 13px; font-family: verdana,arial,helvetica,sans-serif; background-color: rgb(255,255,255);"&gt;&lt;a href="http://research.yahoo.com/Andrei_Broder"&gt;Andrei Broder&lt;/a&gt; and &lt;/b&gt;&lt;a href="http://research.yahoo.com/Vanja_Josifovski" style="font-size: 13px; font-family: verdana,arial,helvetica,sans-serif; background-color: rgb(255,255,255);"&gt;&lt;b&gt;Vanja Josifovski&lt;/b&gt;&lt;/a&gt;&lt;span style="font-size: 13px; font-family: verdana,arial,helvetica,sans-serif; background-color: rgb(255,255,255);"&gt;, of Yahoo! Research, have wrapped up their Stanford course again this fall.  
As always the lecture slides are a great intro to the area.&lt;/span&gt;&lt;/div&gt; &lt;/div&gt;&lt;p /&gt;&lt;div&gt;&lt;a href="http://www.stanford.edu/class/msande239/"&gt;MS&amp;amp;E 239: Introduction to Computational Advertising&lt;/a&gt;&lt;br /&gt; &lt;div&gt;&lt;p /&gt;&lt;div&gt;&lt;div&gt;&lt;span style="font-size: 14px;"&gt;Computational advertising is an emerging new scientific sub-discipline, at the intersection of large scale search and text analysis, information retrieval, statistical modeling, machine learning, classification, optimization, and microeconomics. The central problem of computational advertising is to find the &amp;quot;best match&amp;quot; between a given user in a given context and a suitable advertisement. The context could be a user entering a query in a search engine (&amp;quot;sponsored search&amp;quot;), a user reading a web page (&amp;quot;content match&amp;quot; and &amp;quot;display ads&amp;quot;), a user watching a movie on a portable device, and so on. The information about the user can vary from scarily detailed to practically nil. The number of potential advertisements might be in the billions. Thus, depending on the definition of &amp;quot;best match&amp;quot; this problem leads to a variety of massive optimization and search problems, with complicated constraints, and challenging data representation and access problems. The solution to these problems provides the scientific and technical foundations for the $20 billion online advertising industry.&lt;/span&gt;&lt;/div&gt; &lt;p /&gt;&lt;div&gt;&lt;span style="font-size: 14px;"&gt;This course aims to provide a good introduction to the main algorithmic issues and solutions in computational advertising, as currently applied to building platforms for various online advertising formats. At the same time we intend to briefly survey the economics and marketplace aspects of the industry, as well as some of the research frontiers. 
The intended audience are students interested in the practical and theoretical aspects of web advertising.&lt;/span&gt;&lt;/div&gt; &lt;/div&gt;&lt;/div&gt;&lt;/div&gt; &lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/standfords-introduction-to-computational-adve"&gt;aicoder - nealrichter's blog&lt;/a&gt; &lt;/p&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-7606394556795092235?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/7606394556795092235/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=7606394556795092235' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/7606394556795092235'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/7606394556795092235'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2011/12/standford-introduction-to-computational.html' title='Stanford&amp;#39;s Introduction to Computational Advertising course'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-3091712282576090854</id><published>2011-12-27T09:04:00.001-08:00</published><updated>2011-12-27T09:04:36.142-08:00</updated><title type='text'>Summer Program on Computational Advertising</title><content 
type='html'>&lt;div class='posterous_autopost'&gt;&lt;div&gt;&lt;a href="http://research.yahoo.com/Deepak_K_Agarwal"&gt;Deepak Agarwal of Yahoo Research&lt;/a&gt; is organizing a summer program on Computational Advertising.  It appears to be geared toward grad students.&lt;/div&gt;&lt;p /&gt;&lt;a href="http://www.samsi.info/programs/summer-program-august-6-17-2012-computational-advertising"&gt;http://www.samsi.info/programs/summer-program-august-6-17-2012-computational-advertising&lt;/a&gt;&lt;br /&gt; &lt;p /&gt;&lt;div&gt;&lt;span style="color: rgb(51,51,51); font-family: Arial,sans-serif; font-size: 13px; line-height: 18px; background-color: rgb(255,255,255);"&gt;This two-week program will run from August 6 to August 17, 2012. The first week will be at the &lt;/span&gt;&lt;a href="http://www.radisson.com/samsi" target="_blank" style="color: rgb(0,51,102); font-family: Arial,sans-serif; font-size: 13px; line-height: 18px; background-color: rgb(255,255,255);"&gt;Radisson RTP&lt;/a&gt;&lt;span style="color: rgb(51,51,51); font-family: Arial,sans-serif; font-size: 13px; line-height: 18px; background-color: rgb(255,255,255);"&gt; in the Research Triangle Park, NC. The location is in close proximity to SAMSI. The first three days will be spent on technical presentations by leading researchers and industry experts, to bring everyone up to speed on the currently used methodology. On the fourth day, the participants will self-organize into working groups, each of which will address one of the key problem areas (it is permitted that people join more than one group, and the organizers will try to arrange the working group schedules to facilitate that).  
The second week will be spent at SAMSI headquarters in Research Triangle Park.&lt;/span&gt;&lt;/div&gt; &lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/summer-program-on-computational-advertising"&gt;aicoder - nealrichter's blog&lt;/a&gt; &lt;/p&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-3091712282576090854?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/3091712282576090854/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=3091712282576090854' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/3091712282576090854'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/3091712282576090854'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2011/12/summer-program-on-computational.html' title='Summer Program on Computational Advertising'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-1614580115658149392</id><published>2011-11-15T08:24:00.001-08:00</published><updated>2011-11-15T08:35:18.033-08:00</updated><title type='text'>What Software Engineers should know about Control Theory</title><content type='html'>&lt;div class="posterous_autopost"&gt;Over the years I've noticed an interesting 
lack of specific domain knowledge among CS and software people.  Other than the few co-workers who majored in Electrical Engineering, almost no one has heard of the field of 'Control Theory'.&lt;p&gt;&lt;/p&gt;&lt;div&gt;From &lt;a href="http://en.wikipedia.org/wiki/Control_theory" target="_blank"&gt;Wikipedia&lt;/a&gt;:&lt;/div&gt;&lt;blockquote class="gmail_quote" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex;"&gt; &lt;span style="font-family: sans-serif; font-size: 13px; line-height: 19px;"&gt;&lt;b&gt;Control theory&lt;/b&gt; is an interdisciplinary branch of &lt;a href="http://en.wikipedia.org/wiki/Engineering" title="Engineering" target="_blank" style="text-decoration: none; color: rgb(6, 69, 173); background-color: initial;"&gt;engineering&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/Mathematics" title="Mathematics" target="_blank" style="text-decoration: none; color: rgb(6, 69, 173); background-color: initial;"&gt;mathematics&lt;/a&gt;, that deals with the behavior of &lt;a href="http://en.wikipedia.org/wiki/Dynamical_system" title="Dynamical system" target="_blank" style="text-decoration: none; color: rgb(11, 0, 128); background-color: initial;"&gt;dynamical systems&lt;/a&gt;. The desired output of a system is called the &lt;i&gt;reference&lt;/i&gt;. 
When one or more output variables of a system need to follow a certain reference over time, a &lt;a href="http://en.wikipedia.org/wiki/Controller_(control_theory)" title="Controller (control theory)" target="_blank" style="text-decoration: none; color: rgb(6, 69, 173); background-color: initial;"&gt;controller&lt;/a&gt; manipulates the inputs to a system to obtain the desired effect on the output of the system.&lt;/span&gt;&lt;/blockquote&gt; &lt;div&gt; &lt;/div&gt;&lt;div&gt;Let's imagine that you write internet web services for a living: some REST or SOAP APIs that take an input and give an output.&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;Your boss walks up to you one day and asks for a system that does the following:&lt;/div&gt; &lt;div&gt;&lt;ul&gt;&lt;li&gt;Creates a webservice that calls another (or three) for data/inputs, then does X with them.&lt;/li&gt;&lt;li&gt;Meters the usage of the other web services.&lt;/li&gt;&lt;li&gt;Responds within Y milliseconds with good output or a NULL output.&lt;/li&gt;&lt;li&gt;Supports high concurrency, i.e., does not use too many servers.&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;The problem is that these other third-party webservices are not your own.  What is their response time?  Will they give bad data?  How should your webservice react to failures of the others?&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;Does this sound familiar?  It should to many.  
This is the replicated-and-shared-connector problem (MySQL, memcached), the partitioned-services problem (federated search and large-scale search engines) and the API-as-a-service problem (&lt;a href="http://www.mashery.com/" target="_blank"&gt;Mashery&lt;/a&gt;, etc.).&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;There are two basic types of control relevant here:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;Open Loop, Feed-forward: Requires a good model of the system inputs and of the system's response.&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;&lt;div class="p_embed p_image_embed"&gt; &lt;img alt="Feedforward" height="77" src="http://getfile3.posterous.com/getfile/files.posterous.com/nealrichter/3Itl7WJfodA5QyGwJuvKCw2nMkKewpvpFxsa8CvfNjXvxVYrfgDEZ0d0cQHE/feedforward.png" width="450" /&gt; &lt;/div&gt;&lt;br /&gt;&lt;/div&gt;&lt;ul&gt;&lt;li&gt;Closed Loop, Feed-back: Measures the system output and feeds the error against the reference back to the controller.&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;&lt;a href="http://wiki/File:Feedback_loop_with_descriptions.svg" target="_blank" style="text-decoration: underline; color: rgb(6, 69, 173); background-color: initial;"&gt;&lt;img src="http://upload.wikimedia.org/wikipedia/commons/thumb/2/24/Feedback_loop_with_descriptions.svg/400px-Feedback_loop_with_descriptions.svg.png" height="101" alt="" style="border-top-style: solid; border-right-style: solid; border-bottom-style: solid; border-left-style: solid; border-color: initial; vertical-align: middle; border-top-width: 1px; border-right-width: 1px; border-bottom-width: 1px; border-left-width: 1px; border-top-color: rgb(204, 204, 204); border-right-color: rgb(204, 204, 204); border-bottom-color: rgb(204, 204, 204); border-left-color: rgb(204, 204, 204); background-color: rgb(255, 255, 255);" width="400" /&gt;&lt;/a&gt;&lt;/div&gt; &lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;Classical control topics that adaptive control builds on:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;Linear Feedback&lt;/li&gt;&lt;li&gt;Stability Analysis&lt;/li&gt;&lt;li&gt;Frequency Response&lt;/li&gt;&lt;li&gt;Response Time&lt;/li&gt; 
&lt;/ul&gt;&lt;div&gt;Adaptive Schemes&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;Gain Scheduling&lt;/li&gt;&lt;li&gt;Model Reference Adaptive Systems&lt;/li&gt;&lt;li&gt;Self-tuning regulators&lt;/li&gt;&lt;li&gt;Dual Control&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;Here's one &lt;a href="http://www.igi.tugraz.at/helmut/Presentations/Hauser_2004_AdaptiveControl.pdf"&gt;survey deck from a lecture&lt;/a&gt;. Unfortunately for software engineers, most of the presentations of the above are in linear system form rather than an algorithmic form.&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;Dr. Joe L Hellerstein of Google and co-workers taught a course at U of Washington in 2008 that was more software focused.  He's also written a textbook on it and a few papers.&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://research.microsoft.com/en-us/um/people/liuj/cse590k2008winter/" target="_blank"&gt;http://research.microsoft.com/en-us/um/people/liuj/cse590k2008winter/&lt;/a&gt;&lt;/li&gt; &lt;li&gt;Joseph L Hellerstein et al "Feedback control of computing systems" 2004 Wiley   &lt;a href="http://books.google.com/books?id=J9aMWtoELWYC&amp;amp;printsec=frontcover&amp;amp;dq=Feedback-Control-Computing-Systems-Hellerstein&amp;amp;source=bl&amp;amp;ots=JwL91gojs8&amp;amp;sig=Mq3e1BFwVP0mDmfw5n1om7KuUnU&amp;amp;hl=en&amp;amp;ei=j_crTc7VJ4KosQPJg4GGBg&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=2&amp;amp;ved=0CB4Q6AEwAQ#v=onepage&amp;amp;q&amp;amp;f=false" target="_blank"&gt;Google Books&lt;/a&gt;    &lt;a href="http://www.amazon.com/Feedback-Control-Computing-Systems-Hellerstein/dp/047126637X" target="_blank"&gt;Amazon&lt;/a&gt;&lt;/li&gt; &lt;li&gt;&lt;a href="http://www.research.ibm.com/PM/RC23159.pdf" target="_blank"&gt;Hellerstein 2003 IBM Tech Report "Challenges in Control Engineering of Computing Systems"&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5374029" 
target="_blank"&gt;Hellerstein et al  "Research challenges in control engineering of computing systems" Volume: 6 Issue: 4, 2010 IEEE Trans on Network and Service Management&lt;/a&gt;&lt;/li&gt; &lt;/ul&gt;&lt;div&gt;The course page has a collection of great links to applications papers on controllers for software systems.&lt;/div&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;I'd like to see a 'software patterns' set created for easier use by software engineers.  I'll attempt to present a couple common forms as patterns in a future blog post.&lt;/div&gt; &lt;p style="font-size: 10px;"&gt;&lt;br /&gt;&lt;/p&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-1614580115658149392?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/1614580115658149392/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=1614580115658149392' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1614580115658149392'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1614580115658149392'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2011/11/over-years-i-noticed-interesting-lack.html' title='What Software Engineers should know about Control Theory'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-539924906209132847</id><published>2011-11-10T10:23:00.000-08:00</published><updated>2011-11-11T10:54:41.353-08:00</updated><title type='text'>Open RTB panel - IAB Ad Ops Summit 2011</title><content type='html'>On Monday, November 7th, I was on an &lt;a href="http://www.iab.net/events_training/2011/adops/overview"&gt;IAB Ad Ops Summit&lt;/a&gt; panel on OpenRTB.&lt;br /&gt;&lt;br /&gt;&lt;iframe width="420" height="315" src="http://www.youtube.com/embed/K02hD023g-Y" frameborder="0" allowfullscreen=""&gt;&lt;/iframe&gt;&lt;br /&gt;&lt;br /&gt;The clip shows an exchange after Steve from the IAB asked a question about how webpage inventory is described in RTB.  I gave an example of differentiating a simple commodity, barley.&lt;br /&gt;&lt;br /&gt;Two of the major uses of barley in the US are animal feed and malting for making beer.  Malting barley has specific requirements in terms of moisture content, protein percentage and other factors.  Farmers don't always know what quality their crop will finish at.  They count on having two general markets: if the tested quality meets malting standards, the premium over feed prices can be healthy.  A 2011 report noted that malting barley provided a 70% premium over feedstock barley. Growing specific varieties and/or using organic farming methods can provide additional premiums over generic feed barley. 
The curious can follow the links below.&lt;br /&gt;&lt;ul&gt;&lt;li&gt; &lt;a href="http://msuextension.org/publications/AgandNaturalResources/EB0186.pdf"&gt;Montana Barley Production Guide&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt; &lt;a href="http://www.agmrc.org/commodities__products/grains__oilseeds/barley_profile.cfm"&gt;Agricultural Marketing Resource Center - Barley Profile&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;How does this relate to publishers and advertising and OpenRTB?  In my opinion we need several things standardized:&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1) An inventory registration and description API.  This gives publishers influence over how their inventory is exposed in various demand-side and trading-desk platforms.  Publishers should fully describe their inventory in a common format.  Buy-side GUIs and algorithms will benefit from increased annotation and categorization.  This can also harmonize the brand-safety ratings that are not connected between the sell and buy sides.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;2) Standardization of the emerging 'Private Marketplace' models in RTB.   A set of best practices and trading procedures for PM needs to be defined so that the market can grow properly.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;While the main bid request/response API of OpenRTB has been criticized as being 'too late' given the large implementations in production, it is not too late to define standards for the above.  
These things will help the buy-side better differentiate quality inventory.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-539924906209132847?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/539924906209132847/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=539924906209132847' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/539924906209132847'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/539924906209132847'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2011/11/open-rtb-panel-iab-ad-ops-summit-2011.html' title='Open RTB panel - IAB Ad Ops Summit 2011'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://img.youtube.com/vi/K02hD023g-Y/default.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-4076018834110172191</id><published>2011-07-12T22:49:00.001-07:00</published><updated>2011-07-13T08:28:26.688-07:00</updated><title type='text'>SchemaMgr - MySQL schema management tool</title><content type='html'>&lt;div class="posterous_autopost"&gt;SchemaMgr is a simple perl tool to manage schema change in MySQL DBs.  
I wrote this in 2007/2008 and it has been used in production for many years.&lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;a href="https://github.com/nealrichter/schemamgr" target="_blank"&gt;https://github.com/nealrichter/schemamgr&lt;/a&gt;&lt;p&gt; Each change is assigned a version number and placed in a file.  When the SQL in the file is executed successfully, a special table is set with that version number.  Subsequent runs install only the higher versioned files.&lt;/p&gt;&lt;div&gt;&lt;p&gt;It can also be used to reinstall views and stored procedures.&lt;/p&gt;&lt;p&gt; The best practice is to copy the file and change X in the filename and in the $DB_NAME variable.&lt;/p&gt;&lt;div&gt;&lt;div class="CodeRay"&gt;&lt;div class="code"&gt;&lt;pre&gt;$ ./bin/schemamgr_X.pl&lt;br /&gt;Usage: either create or upgrade X database&lt;br /&gt;schemamgr_X.pl -i -uUSERNAME -pPASSWORD [-vVERSION] [-b]&lt;br /&gt;  updates DB of to current (default) or requested version&lt;br /&gt;schemamgr_X.pl -s -uUSERNAME -pPASSWORD&lt;br /&gt;  reinstalls all stored procedures&lt;br /&gt;schemamgr_X.pl -w -uUSERNAME -pPASSWORD&lt;br /&gt;  reinstalls all views&lt;br /&gt;schemamgr_X.pl -q -uUSERNAME -pPASSWORD&lt;br /&gt;  Requests and prints current version&lt;br /&gt;Optional Params&lt;br /&gt; -vXX -- upgrades upto a specific version number XX&lt;br /&gt; -b   -- backs up the database (with data) before upgrades&lt;br /&gt; -nYY -- runs the upgrades against database YY - default is X&lt;br /&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;By convention you create ONE create_X_objects_v1 file with a date.&lt;br /&gt;All other files are update files with greater than v1 numbers.&lt;br /&gt;&lt;pre&gt;build/&lt;br /&gt;|-- create_X_objects_v1_20110615.sql&lt;br /&gt;|-- update_X_objects_v2_20110701.sql&lt;br /&gt;`-- update_X_objects_v3_20110702.sql&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;/div&gt; &lt;/div&gt;&lt;/div&gt; &lt;p style="font-size: 10px;"&gt; &lt;a 
href="http://posterous.com/"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/schemamgr-mysql-schema-management-tool"&gt;aicoder - nealrichter's blog&lt;/a&gt; &lt;/p&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-4076018834110172191?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/4076018834110172191/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=4076018834110172191' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4076018834110172191'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4076018834110172191'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2011/07/schemamgr-mysql-schema-management-tool.html' title='SchemaMgr - MySQL schema management tool'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-3930012462473876263</id><published>2011-06-28T09:02:00.001-07:00</published><updated>2011-06-28T12:45:32.592-07:00</updated><title type='text'>Managing yourself to tasks and finishing them.</title><content type='html'>&lt;div class="posterous_autopost"&gt;I saw these two short articles from HBR come across my twitter stream and read them.  
They stuck with me for more than a week as they triggered some connections with proof methods.&lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;a href="http://web.hbr.org/email/archive/managementtip.php?date=060811"&gt;Treat Every Task as Three Steps, Not One&lt;/a&gt;&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;The essence of the advice is "Prep-Do-Review".  For each task, make and review a plan before starting.  Once the task is completed, review the plan.  Did you finish? What did you learn?  What would you do differently next time?&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;a href="http://blogs.hbr.org/cs/2011/06/how_to_become_a_great_finisher.html"&gt;How to Become a Great Finisher&lt;/a&gt;&lt;/div&gt;&lt;p&gt; &lt;/p&gt;&lt;div&gt; The essence of this advice is to think in terms of "to-go" versus "to-date" performance on a task.  When you entertain to-date thinking it's very easy to see how much you have accomplished so far.  This can lead you to lower your ongoing effort or to become distracted and work on other tasks.&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;So the optimal algorithm that combines the two is:&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;PrepForWork();&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;Do {&lt;/span&gt;&lt;/div&gt; &lt;div&gt;&lt;span class="Apple-style-span"&gt;   IncrementalWork();&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;   X = DistanceToFinish();&lt;/span&gt;&lt;/div&gt; &lt;div&gt;&lt;span class="Apple-style-span"&gt;} Until (X == 0)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;ReviewResults();&lt;/span&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;Don't look back until you are done!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Note that there is a formal method of proof in mathematics and CS that goes something like this:&lt;/div&gt;&lt;div&gt;&lt;div style="font-family: arial, sans-serif; 
border-collapse: collapse; color: rgb(80, 0, 80); font-size: 13px; "&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span"&gt;1) Define a non-negative, integer-valued metric &lt;/span&gt;&lt;span class="Apple-style-span" style="font-size: small; "&gt;f(x) &lt;/span&gt;&lt;span class="Apple-style-span" style="font-size: small; "&gt;measuring the distance to the goal&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"&gt;2) Define the starting distance&lt;/span&gt;&lt;/div&gt;&lt;div style="border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"&gt;3) Show that every step of your "algorithm/method" strictly decreases f(x)&lt;/span&gt;&lt;/div&gt;&lt;div style="border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"&gt;4) Infer that the goal will be reached &lt;/span&gt;&lt;/div&gt;&lt;div style="border-collapse: collapse; "&gt;&lt;span class="Apple-style-span"&gt;5) (Optional) Calculate the minimum number of steps required to reach the goal.&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="border-collapse: collapse; font-size: 13px; "&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;The important point is the rigor of the mathematical proof version.  Your idea gets no partial credit for so-far progress: the algorithm either gets to the goal or it fails.  The proof is either true or it is false.  
Thus you are either &lt;i&gt;Done&lt;/i&gt; or you are &lt;i&gt;NOT Done.&lt;/i&gt;&lt;/div&gt;&lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com/"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/managing-yourself-to-tasks-and-finishing-them"&gt;aicoder - nealrichter's blog&lt;/a&gt; &lt;/p&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-3930012462473876263?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/3930012462473876263/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=3930012462473876263' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/3930012462473876263'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/3930012462473876263'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2011/06/managing-yourself-to-tasks-and.html' title='Managing yourself to tasks and finishing them.'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-4360642603740051525</id><published>2011-04-07T14:21:00.001-07:00</published><updated>2011-04-08T11:27:07.190-07:00</updated><title type='text'>JSON parsing speed in various Node.JS versions</title><content type='html'>&lt;div class="posterous_autopost"&gt;We use &lt;a href="http://nodejs.org/"&gt;Node.JS&lt;/a&gt; 
for a very high capacity service at &lt;a href="http://rubiconproject.com/"&gt;the Rubicon Project&lt;/a&gt;.  It often drives or handles in excess of 10B HTTP requests per day sending or receiving JSON data.&lt;br /&gt;Out of curiosity I ran some tests on JSON parsing speed in different versions of Node.JS.&lt;p&gt;node.js code:&lt;br /&gt;&lt;/p&gt;&lt;blockquote&gt;var sys = require('sys');&lt;br /&gt;var data = "{ \"item_uuid\": \"8ec56438-d3cf-442a-bbf7-7f076f229f35\", \"return_code\": 0, \"data\": [ { \"valid\": true, \"votes\": 2345, \"date\":\"Thu, 07 Apr 2011 15:17:17 EDT\", \"headline\": \"Senate Majority Leader Harry Reid indicates there likely will be a government shutdown on Friday. Lawmakers have been unable to agree on a new federal budget\", \"source\": \"Yahoo News\", \"published\":{\"hour\":\"19\",\"timezone\":\"UTC\",\"second\":\"17\",\"month\":\"4\",\"minute\":\"17\",\"utime\":\"1302203837\",\"day\":\"7\",\"day_of_week\":\"4\",\"year\":\"2011\"} } ] }";&lt;p&gt; // Re-parse the same JSON string 1M times.&lt;br /&gt;try {&lt;br /&gt;  for(var i = 0; i &amp;lt; 1000000; i++)&lt;br /&gt;   {&lt;br /&gt;       var tmp = JSON.parse(data);&lt;br /&gt;   }&lt;br /&gt;} catch(e) { sys.puts("ERROR: on parsing JSON with v8 parser"); }&lt;/p&gt;&lt;p&gt; sys.puts(data);&lt;br /&gt;var tmp = JSON.parse(data);&lt;br /&gt;sys.puts(JSON.stringify(tmp));&lt;br /&gt;sys.puts("\n DONE \n");&lt;br /&gt;process.exit();&lt;br /&gt;&lt;/p&gt;&lt;/blockquote&gt;Essentially this re-parses the same example JSON (a fake RSS-like JSON package I created) 1M times.&lt;br /&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;Here are the results:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;Node 0.1.3x:    real 0m30.050s&lt;/li&gt;&lt;li&gt;Node 0.2.6:      real 0m30.050s&lt;/li&gt;&lt;li&gt;Node 0.3.8:      real 0m9.915s&lt;/li&gt; &lt;li&gt;Node 0.4.5:      real 0m9.999s&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For reference I ran the same test against a very fast tokenizing parser in C called &lt;a 
href="http://zserge.bitbucket.org/jsmn.htm"&gt;jsmn&lt;/a&gt;, and a C++ one called &lt;a href="http://code.google.com/p/vjson/"&gt;vjson&lt;/a&gt;.&lt;/div&gt; &lt;div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;jsmn: real 0m2.276s&lt;/li&gt;&lt;li&gt;vjson: real 0m7.465s        &lt;i&gt;Note that vjson is a destructive parser, and I had to fix that first.&lt;/i&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;Interestingly, the JSON parser in node 0.4.5 and prior versions appears to be written in pure JavaScript.  See the file:  node-v0.4.5/deps/v8/src/json.js&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;It's unclear whether the speed improvements come from improvements to the parser implementation or from some efficiency/speed leap in the versions of V8 included in these Node versions.&lt;/div&gt;&lt;div&gt;&lt;ul&gt; &lt;li&gt;Node 0.1.33:    v8: 2010-03-17: Version 2.1.5&lt;/li&gt;&lt;li&gt;Node 0.2.6:      v8: 2010-08-16: Version 2.3.8&lt;/li&gt;&lt;li&gt;Node 0.3.8:      v8: 2011-02-02: Version 3.1.1&lt;/li&gt; &lt;li&gt;Node 0.4.5:      v8: 2011-03-02: Version 3.1.8&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt; &lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com/"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/json-parsing-speed-in-various-nodejs-versions"&gt;aicoder - nealrichter's blog&lt;/a&gt; &lt;/p&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-4360642603740051525?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/4360642603740051525/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=4360642603740051525' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/17598372/posts/default/4360642603740051525'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4360642603740051525'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2011/04/json-parsing-speed-in-various-nodejs.html' title='JSON parsing speed in various Node.JS versions'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-3768236879648620168</id><published>2011-03-22T21:26:00.001-07:00</published><updated>2011-03-22T21:26:26.331-07:00</updated><title type='text'>Pamela Samuelson on startups and software patents</title><content type='html'>&lt;div class='posterous_autopost'&gt;Following up on my last post here is the view from Pamela Samuelson:&lt;p /&gt;&lt;a href="http://radar.oreilly.com/2010/07/why-software-startups-decide-t.html"&gt;Why software startups decide to patent ... or not&lt;/a&gt;&lt;p /&gt;&lt;blockquote class="gmail_quote" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex;"&gt; &lt;span style="color: rgb(51, 51, 51); font-family: Arial, sans-serif; font-size: 14px; line-height: 18px;"&gt;Two-thirds of the approximately 700 software entrepreneurs who participated in the 2008 Berkeley Patent Survey report that they neither have nor are seeking patents for innovations embodied in their products and services. These entrepreneurs rate patents as the least important mechanism among seven options for attaining competitive advantage in the marketplace. 
Even software startups that hold patents regard them as providing only a slight incentive to invest in innovation.&lt;/span&gt;&lt;/blockquote&gt; &lt;p /&gt;&lt;p /&gt;&lt;div&gt;She also lists a variety of reasons why these software entrepreneurs decided to forgo patenting their last invention.  It&amp;#39;s a very interesting write up. &lt;/div&gt; &lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/pamela-samuelson-on-startups-and-software-pat"&gt;aicoder - nealrichter's blog&lt;/a&gt; &lt;/p&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-3768236879648620168?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/3768236879648620168/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=3768236879648620168' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/3768236879648620168'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/3768236879648620168'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2011/03/pamela-samuelson-on-startups-and.html' title='Pamela Samuelson on startups and software patents'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-4895651453428269603</id><published>2011-03-10T14:30:00.001-08:00</published><updated>2011-03-10T14:30:42.596-08:00</updated><title type='text'>Comments re "The Noisy Channel: A Practical Rant About Software Patents"</title><content type='html'>&lt;div class='posterous_autopost'&gt;&lt;div style="color: rgb(51, 51, 51); font-family: Georgia, Times New Roman, Times, serif; font-size: 14px; line-height: 23px;"&gt;&lt;p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1em; margin-left: 0px;"&gt; &lt;a href="http://thenoisychannel.com/2011/03/07/a-practical-rant-about-software-patents/"&gt;The Noisy Channel: A Practical Rant About Software Patents&lt;/a&gt; - [My comments cross-posted here]&lt;/p&gt;&lt;p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1em; margin-left: 0px;"&gt; Daniel, nice writeup.&lt;/p&gt;&lt;p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1em; margin-left: 0px;"&gt;I worked for a BigCo and filed many patents. It was a mixed bag. The time horizon is so long that even after I’ve been gone for 3.5 years many of them are still lost in the USPTO. 
Average time for me to see granted patents was 5+ years.&lt;/p&gt; &lt;p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1em; margin-left: 0px;"&gt;Here are my biased opinions:&lt;/p&gt;&lt;p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1em; margin-left: 0px;"&gt; 1) Patents really matter for BigCos operating on a long time horizon. It’s a strategic investment.&lt;/p&gt;&lt;p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1em; margin-left: 0px;"&gt; 2) Patents are nearly worthless for a Startup or SmallCo. The time horizon is way past your foreseeable future, and thus the whole effort is akin to planning for an alternate reality different than the current business context. Throwing coins in a fountain for good luck is about as relevant. You simply are better off getting a filing date on a provisional design writeup and hiring an engineer with the money you’d spend on Patent lawyers.&lt;/p&gt; &lt;p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1em; margin-left: 0px;"&gt;3) As an Acquiring company looking at a company to acquire, Provisional or Pending Patents are a liability not an asset. They take time and resources to push to completion for a strategy of deterrence.&lt;/p&gt; &lt;p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1em; margin-left: 0px;"&gt;4) Patents are mostly ignored in the professional literature. Take Sentiment Analysis as one example. Sentiment Analysis exploded in 2001 w.r.t. Academic publishing, yet there are more than a few older patents discussing good technical work on Sentiment Analysis. 
I’ve NEVER seen an algorithm in a patent cited in a paper as previous work. And I have seen academic papers with algorithms already 90% covered by an older patent… and the papers are cited as ‘novel work’.&lt;/p&gt; &lt;p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1em; margin-left: 0px;"&gt;5) Finding relevant patents is ludicrously hard. It might be the most challenging problem in IR w.r.t. a corpus IMO. Different words mean the same thing and vice versa due to the pseudo-ability in a Patent to redefine a word away from the obvious meaning. Two different lawyers rendering the same technical design into a writeup and claims will produce wildly different work product.&lt;/p&gt; &lt;p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1em; margin-left: 0px;"&gt;6) I’ve seen some doozy granted Patents. Things that appear to be either implementations of very old CS ideas in new domains... 
or worse stuff that would be a class project as an undergrad.&lt;/p&gt; &lt;p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1em; margin-left: 0px;"&gt;It’s just plain ugly in this realm.&lt;/p&gt;&lt;p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1em; margin-left: 0px;"&gt; &lt;br /&gt;&lt;/p&gt;&lt;/div&gt; &lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/comments-re-the-noisy-channel-a-practical-ran"&gt;aicoder - nealrichter's blog&lt;/a&gt; &lt;/p&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-4895651453428269603?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/4895651453428269603/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=4895651453428269603' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4895651453428269603'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4895651453428269603'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2011/03/comments-re-noisy-channel-practical.html' title='Comments re &amp;quot;The Noisy Channel: A Practical Rant About Software Patents&amp;quot;'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-4413902474566564148</id><published>2011-03-10T01:20:00.001-08:00</published><updated>2011-03-10T01:20:37.891-08:00</updated><title type='text'>On Strategic Plans</title><content type='html'>&lt;div class='posterous_autopost'&gt;This needs absolutely no comment.&lt;p /&gt;&lt;div&gt; &lt;span style="font-family: Lucida Grande, Lucida Sans Unicode, verdana, sans-serif; font-size: 12px; line-height: 22px;"&gt;“We have a ‘&lt;em&gt;strategic plan&lt;/em&gt;.’ It’s called &lt;strong&gt;&lt;em&gt;doing things&lt;/em&gt;&lt;/strong&gt;.” ~ &lt;em&gt;Herb Kelleher&lt;/em&gt;&lt;/span&gt;&lt;br /&gt; &lt;p /&gt;&lt;div&gt;&lt;div style="color: rgb(86, 60, 19); font-family: Helvetica, Arial, Verdana, sans-serif; font-size: 10px; line-height: 10px;"&gt;&lt;img class="photo" src="http://distillery.s3.amazonaws.com/media/2011/03/09/ebe98db490154bbb9cd008f55ca5e0f8_7.jpg" style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-color: initial; font-family: inherit; font-size: 10px; font-style: inherit; font-weight: inherit; line-height: 1; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; text-align: left; vertical-align: baseline; height: 480px;" /&gt;&lt;p /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt; &lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/on-strategic-plans"&gt;aicoder - nealrichter's blog&lt;/a&gt; &lt;/p&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-4413902474566564148?l=aicoder.blogspot.com' alt='' 
/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/4413902474566564148/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=4413902474566564148' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4413902474566564148'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4413902474566564148'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2011/03/on-strategic-plans.html' title='On Strategic Plans'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-5081932943540629538</id><published>2011-03-06T10:20:00.001-08:00</published><updated>2011-03-06T10:20:05.657-08:00</updated><title type='text'>Hilarious system calls in the BeOS</title><content type='html'>&lt;div class='posterous_autopost'&gt;These system calls in the BeOS still make me smile.&lt;p /&gt;&lt;div&gt;&lt;span style=""&gt;int32&lt;b&gt;&lt;tt&gt; is_computer_on(&lt;/tt&gt;&lt;/b&gt;void&lt;b&gt;&lt;tt&gt;)&lt;/tt&gt;&lt;/b&gt;&lt;/span&gt;&lt;/div&gt; &lt;blockquote class="webkit-indent-blockquote" style="margin: 0 0 0 40px; border: none; padding: 0px;"&gt;&lt;div&gt;&lt;p style="font-family: Times New Roman; font-size: medium;"&gt;Returns 1 if the computer is on. 
If the computer isn&amp;#39;t on, the value returned by this function is undefined.&lt;/p&gt; &lt;/div&gt;&lt;/blockquote&gt;&lt;div&gt;&lt;div&gt;&lt;span style="font-family: Times New Roman; font-size: medium;"&gt;&lt;span style=""&gt;double&lt;b&gt;&lt;tt&gt; is_computer_on_fire(&lt;/tt&gt;&lt;/b&gt;void&lt;b&gt;&lt;tt&gt;)&lt;/tt&gt;&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt; &lt;/div&gt;&lt;p /&gt;&lt;blockquote class="webkit-indent-blockquote" style="margin: 0 0 0 40px; border: none; padding: 0px;"&gt;&lt;div&gt;&lt;div&gt;&lt;span style="font-family: Times New Roman; font-size: medium;"&gt;Returns the temperature of the motherboard if the computer is currently on fire. If the computer isn&amp;#39;t on fire, the function returns some other value.&lt;/span&gt;&lt;/div&gt; &lt;/div&gt;&lt;/blockquote&gt;&lt;div&gt;&lt;p /&gt;&lt;div&gt;&lt;blockquote&gt;&lt;span style="font-size: 15px;"&gt;#include &amp;lt;stdio.h&amp;gt; &lt;br /&gt; &lt;/span&gt;&lt;span style="font-size: 15px;"&gt;#include &amp;lt;be/kernel/OS.h&amp;gt; &lt;/span&gt;&lt;span style="font-size: 15px;"&gt;&lt;br /&gt; &lt;/span&gt;&lt;span style="font-size: 15px;"&gt;int main() &lt;br /&gt;&lt;/span&gt;&lt;span style="font-size: 15px;"&gt;{ &lt;br /&gt; &lt;/span&gt;&lt;span style="font-size: 15px;"&gt;printf(&amp;quot;[%d] = &lt;/span&gt;&lt;span style="font-size: 15px;"&gt;is_computer_on()&lt;/span&gt;&lt;span style="font-size: 15px;"&gt;\n&amp;quot;, is_computer_on()); &lt;br /&gt; &lt;/span&gt;&lt;span style="font-size: 15px;"&gt;printf(&amp;quot;[%f] = &lt;/span&gt;&lt;span style="font-size: 15px;"&gt;is_computer_on_fire()&lt;/span&gt;&lt;span style="font-size: 15px;"&gt;\n&amp;quot;, is_computer_on_fire()); &lt;br /&gt; &lt;/span&gt;&lt;span style="font-size: 15px;"&gt;} &lt;/span&gt;&lt;/blockquote&gt;&lt;/div&gt;&lt;div&gt; These functions serve a similar purpose to getpid() in Unix, essentially no-op calls that can be used to test the kernel&amp;#39;s intrinsic response time under 
load.&lt;/div&gt;&lt;p /&gt;&lt;div&gt;Write up of &lt;a href="http://en.wikipedia.org/wiki/BeOS"&gt;BeOS history&lt;/a&gt; is here, &lt;a href="http://www.haiku-os.org/"&gt;Haiku&lt;/a&gt; is an open source clone of the BeOS that is curiously under &lt;a href="http://dev.haiku-os.org/changeset"&gt;active development&lt;/a&gt;.&lt;/div&gt; &lt;/div&gt; &lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/hilarious-system-calls-in-the-beos"&gt;aicoder - nealrichter's blog&lt;/a&gt; &lt;/p&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-5081932943540629538?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/5081932943540629538/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=5081932943540629538' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/5081932943540629538'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/5081932943540629538'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2011/03/hilarious-system-calls-in-beos.html' title='Hilarious system calls in the BeOS'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-3410708172321763012</id><published>2011-03-04T09:21:00.001-08:00</published><updated>2011-03-04T09:21:43.852-08:00</updated><title type='text'>Contractor Needed: HTML/CSS/Javascript Ninja</title><content type='html'>&lt;div class='posterous_autopost'&gt;The Rubicon Project is looking for an in-browser HTML/CSS/Javascript Ninja to restructure the workflow of an application GUI.  The server side code is perl/mod_perl.  Please contact me if you are interested and available.  The contract is 4-6 weeks. &lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/contractor-needed-htmlcssjavascript-ninja"&gt;aicoder - nealrichter's blog&lt;/a&gt; &lt;/p&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-3410708172321763012?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/3410708172321763012/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=3410708172321763012' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/3410708172321763012'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/3410708172321763012'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2011/03/contractor-needed-htmlcssjavascript.html' title='Contractor Needed: HTML/CSS/Javascript 
Ninja'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-2065546214091168604</id><published>2011-03-03T12:26:00.001-08:00</published><updated>2011-03-03T12:41:31.353-08:00</updated><title type='text'>Job Post: Software Engineer/Scientist: Ad Serving, Optimization and Core Team</title><content type='html'>&lt;div class="posterous_autopost"&gt;&lt;h1&gt;&lt;span class="Apple-style-span" style="font-size: 16px; font-weight: normal; "&gt;&lt;strong&gt;LOCATION:&lt;/strong&gt; &lt;span style="font-weight: normal;"&gt;the Rubicon Project HQ in West Los Angeles or Salt Lake City&lt;/span&gt;&lt;/span&gt;&lt;/h1&gt;  &lt;p&gt;&lt;span style="font-weight: normal;"&gt;the Rubicon Project is on a mission to automate buying and selling for the $65 billion global online advertising industry. Backed by $42 million in funding, we are currently looking for the best engineers in the world to work with us.&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Team Description&lt;/strong&gt;&lt;/p&gt; &lt;p&gt;&lt;br /&gt;&lt;span style="font-weight: normal;"&gt;The mission of the Core Team is to build robust, scalable, maintainable and well documented systems for ad serving, audience analytics, and market analysis. Every day we serve billions of ads, process terabytes of data and provide valuable data and insights to our publishers. 
If building software that touches 500+ million people every month is interesting to you, you'll fit in well here.&lt;/span&gt;&lt;/p&gt;&lt;p&gt;Some of the custom software we've built to solve these problems includes:&lt;/p&gt;&lt;p&gt;A patented custom ad engine delivering thousands of ad impressions per second with billions of real time auctions daily&lt;br /&gt;A real time bid engine designed to scale out to billions of bid requests daily&lt;br /&gt;Optimization Algorithms capable of scheduling and planning adserving opportunities to maximize revenue&lt;br /&gt;Client side Javascript that performs real-time textual analysis of web pages to extract semantically meaningful data and structures&lt;br /&gt;A web-scale key value store based on ideas from the Amazon Dynamo paper used to store 100s of millions of data points&lt;br /&gt;Unique audience classification system using various technologies such as Solr and Javascript for rich, real-time targeting of web site visitors&lt;br /&gt;Data Mining buying and selling strategies from a torrent of transactional data&lt;br /&gt;Analytics systems capable of turning a trillion data points into real business insight&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Job Description&lt;/strong&gt;&lt;/p&gt; &lt;p&gt;&lt;br /&gt;&lt;span style="font-weight: normal;"&gt;Your job, should you accept it, is to build new systems and new features and extend the functionality of our existing systems. You will be expected to architect new systems from scratch, add incremental features on existing systems, fix bugs in other people's code and help manage production operations of the services you build. Sometimes you'll have to (or want to) do this work when you are not in the office, so working remotely can't scare you off.&lt;/span&gt;&lt;/p&gt;&lt;p&gt;Most of our systems are written in Perl, Java, and C, but we have pieces of Python, Clojure and server-side Javascript as well. 
Hopefully you have deep expertise in at least one of these; you'll definitely need to have a desire to quickly learn and work on systems written in all of the above.&lt;/p&gt;&lt;p&gt;You should also have worked with and/or designed service oriented architectures, advanced db schemas, big data processing, highly scalable and available web services and be well aware of the issues surrounding the software development lifecycle. We expect that your resume will itemize your 3+ years of experience, mention your BS or MS in Computer Science and be Big Data Buzzword Compliant.&lt;/p&gt;&lt;p&gt;Bonus points for experience with some of the technologies we work with:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;&lt;span style="font-weight: normal;"&gt;Hadoop&lt;/span&gt;&lt;/li&gt; &lt;li&gt;&lt;span style="font-weight: normal;"&gt;NodeJS&lt;/span&gt;&lt;/li&gt; &lt;li&gt;&lt;span style="font-weight: normal;"&gt;MySql&lt;/span&gt;&lt;/li&gt; &lt;li&gt;&lt;span style="font-weight: normal;"&gt;Solr/Lucene&lt;/span&gt;&lt;/li&gt; &lt;li&gt;&lt;span style="font-weight: normal;"&gt;RabbitMQ&lt;/span&gt;&lt;/li&gt; &lt;li&gt;&lt;span style="font-weight: normal;"&gt;MongoDB&lt;/span&gt;&lt;/li&gt; &lt;li&gt;&lt;span style="font-weight: normal;"&gt;Thrift&lt;/span&gt;&lt;/li&gt; &lt;li&gt;&lt;span style="font-weight: normal;"&gt;Amazon EC2&lt;/span&gt;&lt;/li&gt; &lt;li&gt;&lt;span style="font-weight: normal;"&gt;Memcached&lt;/span&gt;&lt;/li&gt; &lt;li&gt;&lt;span style="font-weight: normal;"&gt;MemcacheQ&lt;/span&gt;&lt;/li&gt; &lt;li&gt;&lt;span style="font-weight: normal;"&gt;Machine Learning&lt;/span&gt;&lt;/li&gt; &lt;li&gt;&lt;span style="font-weight: normal;"&gt;Optimization Algorithms&lt;/span&gt;&lt;/li&gt; &lt;li&gt;&lt;span style="font-weight: normal;"&gt;Economic Modeling&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;&lt;span style="font-weight: normal;"&gt;&lt;a href="http://www.rubiconproject.com/about/hiring/software-engineer-core-team/"&gt;Apply Now! 
Click the Apply button!&lt;/a&gt;&lt;/span&gt;&lt;/div&gt; &lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com/"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/job-post-software-engineerscientist-ad-servin"&gt;aicoder - nealrichter's blog&lt;/a&gt; &lt;/p&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-2065546214091168604?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/2065546214091168604/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=2065546214091168604' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/2065546214091168604'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/2065546214091168604'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2011/03/job-post-software-engineerscientist-ad.html' title='Job Post: Software Engineer/Scientist: Ad Serving, Optimization and Core Team'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-5452104154334367104</id><published>2011-02-28T13:06:00.001-08:00</published><updated>2011-02-28T13:17:16.717-08:00</updated><title type='text'>A note on software teams and individuals</title><content type='html'>&lt;div class="posterous_autopost"&gt;I'm currently running a loosely coupled team of people all 
working on a common initiative.  While this is not my first time running a team, the same set of things seems to happen with all 'new' teams.  Here's a quick set of observations.&lt;/div&gt;&lt;div class="posterous_autopost"&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;The first major observation is that teams of engineers can quickly fall into operating like a "golf team" versus a "football team".  In Golf, each team member generally competes against all other players (and their different teams) as an individual.  A given team wins if its individual players collectively do better than some other team's players.  Football (or Soccer or Basketball) is very different.  A team wins in the face of good opposition &lt;i&gt;only if it plays as a team.&lt;/i&gt;&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;For software teams, done means one thing:  the team is done with the milestone or project.  Done means finished, tested and shipped code. Done does not mean "my part works", or "my tasks are done".&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;IMO each team member should answer these questions to the group every day:&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;What direction am I going relative to team goals?&lt;/li&gt;&lt;li&gt;What specific items am I working on today?&lt;/li&gt;&lt;li&gt; Does anyone need any help from me?&lt;/li&gt;&lt;li&gt;Do I need any help with my work?&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;div&gt;Team managers, both the overall and functional leads, should ask or answer these questions for the group every day:&lt;/div&gt;&lt;div&gt; &lt;ol&gt;&lt;li&gt;Are we as a group going the right direction (towards the goal)?&lt;/li&gt;&lt;li&gt;Will we meet the timeline and/or functional goals?&lt;/li&gt;&lt;li&gt;Is there any functional or task ambiguity that needs working out?&lt;/li&gt;&lt;li&gt;Are any course corrections needed?&lt;/li&gt; &lt;/ol&gt;&lt;/div&gt;&lt;div&gt;The second observation is that there are two major indicators of whether a given individual is a good 
addition to the team:&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;Does this person communicate well and often?&lt;/li&gt;&lt;li&gt;Does this person have the capability and desire to resolve ambiguity on their own when possible?&lt;/li&gt; &lt;/ol&gt;&lt;div&gt;The second skill, resolving ambiguity, is in my opinion the primary question that a software hiring manager needs to answer in the affirmative about a given candidate... assuming of course the candidate has the needed skills.&lt;/div&gt; &lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;Much of this also circles back on a blog post that Jordan Mitchell wrote years ago when I was hip-deep in code at &lt;a href="http://www.othersonline.com/"&gt;Others Online&lt;/a&gt;.&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt; &lt;a href="http://kickstand.typepad.com/metamuse/2008/12/actual-vs-perceived-progress.html"&gt;Actual vs. Perceived Progress&lt;/a&gt;&lt;/div&gt; &lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com/"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/a-note-on-software-teams-and-individuals"&gt;aicoder - nealrichter's blog&lt;/a&gt; &lt;/p&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-5452104154334367104?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/5452104154334367104/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=5452104154334367104' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/5452104154334367104'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/5452104154334367104'/><link rel='alternate' type='text/html' 
href='http://aicoder.blogspot.com/2011/02/note-on-software-teams-and-individuals.html' title='A note on software teams and individuals'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-1895155204282241993</id><published>2011-02-07T15:13:00.000-08:00</published><updated>2011-02-07T16:23:27.701-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='rightnow'/><category scheme='http://www.blogger.com/atom/ns#' term='montana'/><category scheme='http://www.blogger.com/atom/ns#' term='bozeman'/><title type='text'>RightNow - Our cowboys ride code</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_eZUtZFDoqLA/TVB9QaIgLOI/AAAAAAAAADw/RT6U_xbkoKA/s1600/RightNowCodeCowboys.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 258px;" src="http://4.bp.blogspot.com/_eZUtZFDoqLA/TVB9QaIgLOI/AAAAAAAAADw/RT6U_xbkoKA/s400/RightNowCodeCowboys.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5571090459903667426" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;div&gt;This is a neat little ad in the &lt;a href="http://msp.imirus.com//Mpowered/imirus.jsp?volume=ds11&amp;amp;issue=1&amp;amp;page=0"&gt;January Delta Sky Magazine&lt;/a&gt; for RightNow Technologies, where I worked from 1999-2007.&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;/div&gt;&lt;blockquote&gt;&lt;div&gt;RightNow Technologies serves about 10 billion [customer interactions] a year through the companies and institutions it works with. 
“Every person in North America has used one of our solutions about 25 times,” says Gianforte.&lt;/div&gt;&lt;/blockquote&gt;&lt;div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;Why RightNow keeps its headquarters in Bozeman, MT&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;blockquote&gt;&lt;div&gt; The quality of life here is a huge advantage, but more importantly, says Gianforte, “there’s a ranch saying around here that goes, ‘When something needs to get done, well then, we’re just gonna get ‘er done.’ In many environments, they have to form a committee, pull in consultants and such to make things happen, but our clients appreciate that when something needs to get done, we can easily make that happen because of the work ethic here.”&lt;/div&gt;&lt;/blockquote&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-1895155204282241993?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/1895155204282241993/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=1895155204282241993' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1895155204282241993'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1895155204282241993'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2011/02/rightnow-our-cowboys-ride-code.html' title='RightNow - Our cowboys ride code'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_eZUtZFDoqLA/TVB9QaIgLOI/AAAAAAAAADw/RT6U_xbkoKA/s72-c/RightNowCodeCowboys.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-9025897904671449897</id><published>2011-01-30T23:45:00.001-08:00</published><updated>2011-01-31T00:40:04.073-08:00</updated><title type='text'>The Provenance of Data, Data Branding and "Big Data" Hype</title><content type='html'>&lt;div class="posterous_autopost"&gt;&lt;blockquote class="gmail_quote" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex;"&gt; The credibility of where data comes from in all these "big data" plays is absolutely crucial. Waving hands re "algorithms" won't cut it.  @nealrichter Jan 27, 1010 Tweet&lt;/blockquote&gt;&lt;p&gt; &lt;/p&gt;&lt;div&gt;To expand on this tweet here's the argument:  If one of your key products as a startup or business is to "crunch data" and derive or extract value from it then you should be concerned about data provenance.  This is true whether you are crunching your own data or third-party data.&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;Some examples:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;Web analytics - crunch web traffic and distill visitation and audience analytics reports for web site owners.  
Often they use these summaries to make decisions and sell their ad-space to advertisers.&lt;/li&gt; &lt;li&gt;Semantic Web APIs - crunch webpages, tweets, etc. and return topical and semantic annotations of the content&lt;/li&gt;&lt;li&gt;Comparison shopping - gather up product catalogs and pricing to aggregate for visitors&lt;/li&gt;&lt;li&gt;Web publishers - companies who run websites&lt;/li&gt; &lt;li&gt;Prediction services - companies that use data to predict something&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;In each of the above categories the provenance of the input data and brand of the output data are key.  For each of the above one could name a company with either solid-gold data OR a powerful brand-name and good-enough data.  Conversely we can find examples of companies with great tech but crappy data or a weak brand. &lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;For web publishers, those that host user-generated content have poor provenance in general compared to news sites (for example).  A notable exception is Wikipedia, which has a pure "UGC" model but a solid community process and standards to improve provenance of its articles (those without references are targeted for improvement).&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;In comparison shopping Kayak.com has good data (directly from the airlines) and has built a good brand.  The same is true of PriceGrabber and Nextag.  TheFind.com on the other hand appears to have great data and tech, but no well-known brand.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;i&gt;(I'm refraining from going into specific examples or opinions on big data companies to avoid poking friends in the eye.)&lt;/i&gt;&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;The issue of Provenance and Branding is especially important in sales situations where you are providing a tool (analytics) that helps your customer (a sales person) sell something to a third-party (their customer).  
If the input data you are using has either a demonstrable provenance or a good brand you'll have an easier time convincing people that the output of your product is worth having (and reselling).&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;The old saying for this in computer science is &lt;a href="http://en.wikipedia.org/wiki/Garbage_In,_Garbage_Out" target="_blank"&gt;Garbage In, Garbage Out&lt;/a&gt;.  &lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;In the "big data" world of startups, which is blowing by Web 2.0 as the new hotness, there is a startling lack of concern about data provenance.  The essential ethos is that if we (the Data Scientists) accumulate enough data and crunch it with magical algorithms then solid-gold data will come out... or at least that's what the hype machine says.&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;The lesson from the financial meltdown is that magical algorithms making CDOs, CMOs and other derivatives should be viewed with a lens of mistrust.  The GIGO principle was forgotten and no one even cared about the provenance (read credit quality) of the base financial instruments making up the derivatives.  The credit rating agencies were just selling their brand and cared little about quality.&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;In my opinion, there is a clear parallel here to "big data".  Trust must be part of the platform and not just tons of CPUs and disk-space.  
A Brand is a brittle object that is easily broken, so concentrate on quality.&lt;/div&gt; &lt;div&gt; &lt;/div&gt; &lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com/"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/the-provenance-of-data-data-branding-and-big"&gt;nealrichter's posterous&lt;/a&gt; &lt;/p&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-9025897904671449897?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/9025897904671449897/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=9025897904671449897' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/9025897904671449897'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/9025897904671449897'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2011/01/provenance-of-data-data-branding-and.html' title='The Provenance of Data, Data Branding and &amp;quot;Big Data&amp;quot; Hype'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-1256027623960545937</id><published>2011-01-14T00:23:00.001-08:00</published><updated>2011-01-14T00:59:20.557-08:00</updated><title type='text'>Finance for Engineers</title><content type='html'>&lt;div class="posterous_autopost"&gt;&lt;div&gt;Last summer I took a 
great mini-course at MIT Sloan on Finance.  It's essentially a breadth-first review of the &lt;a href="http://ocw.mit.edu/courses/sloan-school-of-management/15-402-finance-theory-ii-spring-2003/" target="_blank"&gt;MBA course&lt;/a&gt; complete with three case studies and a review of project evaluation methods via &lt;a href="http://en.wikipedia.org/wiki/Net_present_value" target="_blank"&gt;net present value analysis&lt;/a&gt;.  Approximately 80% of the attendees were engineers/techies with 10+ years of experience, and maybe 25% had PhDs.&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;a href="http://mitsloan.mit.edu/execed/coursedetails.php?id=779" target="_blank"&gt;Fundamentals of Finance for the Technical Executive&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;The textbook was: &lt;a href="http://www.amazon.com/Analysis-Financial-Management-Mcgraw-Hill-Insurance/dp/0077297652/ref=dp_ob_title_bk"&gt;Higgins - Analysis for Financial Management&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;The first case study is &lt;a href="http://hbr.org/product/wilson-lumber-co/an/286122-PDF-ENG" target="_blank"&gt;Wilson Lumber from Harvard&lt;/a&gt;.  The material is copyrighted, yet these links look like accurate distillations by business students. &lt;/div&gt; &lt;div&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.detobey.com/work/Clarkson.pdf" target="_blank"&gt;Clarkson Lumber Overview PPT&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family: arial, sans-serif; color: rgb(14, 119, 74); line-height: 15px;"&gt;&lt;a href="http://elvis.sob.tulane.edu/Documents/Assignments/Butler.xls" target="_blank"&gt;Butler Lumber Spreadsheet&lt;/a&gt;&lt;/span&gt;&lt;/li&gt; &lt;/ul&gt;&lt;/div&gt; &lt;div&gt;The initial position is that Wilson Lumber is a growing small business with good suppliers and loyal customers.  Volume and revenue are both up period over period.  The question is whether the bank should increase his line of credit to fund the business.  
Once you break down the financial statements and model the business, the answer is No.  Essentially Mr Wilson is overextended by many measures and is growing at the expense of his balance sheet; loaning him money will only make the problem bigger down the road.  His basic options are to take in a partner as co-owner for cash, go broke, or raise prices to lower volume, improve margins and slowly rebuild the balance sheet.&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;We then went through two NPV exercises.  The first was a basic go/no-go analysis of an engineering project, done bottom-up by putting all cost/benefit assumptions in a model and iterating through possibilities. The second was an analysis of a joint-venture between two biotech companies.  Everything from external capital and deal structure to market penetration projections was worked in.  Very informative and pretty interesting work for engineers to do once the terminology and methods were explained.&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;a href="http://www.stanford.edu/~djenter/index.html"&gt;Professor Jenter&lt;/a&gt; shared two amusing anecdotes:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;His MIT and Stanford MBA students often run off to found start-ups and forget the basic Wilson Lumber case.  By the time they approach him for help it's too late and they are in Mr Wilson's position: shut down, take in $$ with lots of equity dilution (and loss of control), or slow growth dramatically.&lt;/li&gt; &lt;li&gt;Also a quote along the lines of "Startups founded by MIT PhDs fail at a rate far above average".&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;This certainly hammered home the lesson that strategic planning for growth is very important, even for what look like non hyper-growth (software) companies.  
I'd recommend this course to any engineer wanting a quick structured intro to basic financial management.&lt;/div&gt; &lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com/"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/finance-for-engineers"&gt;nealrichter's posterous&lt;/a&gt; &lt;/p&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-1256027623960545937?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/1256027623960545937/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=1256027623960545937' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1256027623960545937'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1256027623960545937'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2011/01/finance-for-engineers.html' title='Finance for Engineers'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-475019776368968841</id><published>2010-12-30T09:34:00.001-08:00</published><updated>2010-12-30T09:46:02.029-08:00</updated><title type='text'>List of Best Paper awards in CS/AI/ML conferences</title><content type='html'>&lt;div class="posterous_autopost"&gt;&lt;div&gt;The below is a great list of best paper awards for WWW, SIGIR, CIKM, AAAI, CHI, KDD, 
SIGMOD, ICML, VLDB, IJCAI, UIST since 1996&lt;/div&gt;&lt;p&gt;&lt;a href="http://jeffhuang.com/best_paper_awards.html"&gt;http://jeffhuang.com/best_paper_awards.html&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;Interesting thing to note:  Google is ranked last in frequency, Microsoft first.&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;This needs NIPS and possibly UAI added to it.&lt;/div&gt;&lt;div&gt;&lt;a href="http://nips.cc/ConferenceInformation/PaperAwards"&gt;http://nips.cc/ConferenceInformation/PaperAwards&lt;/a&gt;&lt;/div&gt; &lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com/"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/list-of-best-paper-awards-in-csaiml-conferenc"&gt;nealrichter's posterous&lt;/a&gt; &lt;/p&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-475019776368968841?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/475019776368968841/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=475019776368968841' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/475019776368968841'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/475019776368968841'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2010/12/list-of-best-paper-awards-in-csaiml.html' title='List of Best Paper awards in CS/AI/ML conferences'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-3265675031363218973</id><published>2010-12-29T11:18:00.001-08:00</published><updated>2010-12-29T11:48:43.363-08:00</updated><title type='text'>Managing Open Source Licenses</title><content type='html'>&lt;div class="posterous_autopost"&gt;&lt;div&gt;From time to time I have helped companies do Open Source audits of their own source code.  Basically this consists of scanning their code to find any open source code it contains.  &lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;These code audits are particularly important during software releases and M&amp;amp;A events.  I've helped companies do this for releases and been on both sides of M&amp;amp;A event driven audits.&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;If the developers have kept the attributions with any open source code they have re-used then grep is a fine tool for auditing.  However, this is a big IF.  If your developers are sloppy and do not keep the attributions (i.e. copyright and license notices) with code they lift from open source, you have a problem.  In that case a software tool is needed to scan the corporate source for hits in open source repositories. 
&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;There are at least three companies providing software to do this:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.blackducksoftware.com/"&gt;BlackDuck Software&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.palamida.com/"&gt;Palamida&lt;/a&gt;&lt;/li&gt; &lt;li&gt;&lt;a href="http://protecode.com/"&gt;Protecode&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;Ideally the outcome of this process is as follows:&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;A clear company policy is set on what open source licenses are allowed and how developers can use open source code or components.&lt;/li&gt;&lt;li&gt;The corporate code is cleanly annotated with any third party attributions (see below).&lt;/li&gt; &lt;li&gt;Open Source code that has bad licenses for commercial usage is identified and removed before release.&lt;/li&gt;&lt;li&gt;A Bill of Materials is created for each release listing third-party software in the release.&lt;/li&gt;&lt;li&gt;Necessary copyright or other notices appear in About dialogs, manuals or product websites.&lt;/li&gt; &lt;/ol&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;Example comment block:&lt;/div&gt;&lt;blockquote class="gmail_quote" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex;"&gt;        &lt;p style="margin-bottom: 0in; line-height: 100%;"&gt;/*&lt;/p&gt; &lt;p style="margin-bottom: 0in; line-height: 100%;"&gt; * XYZ.com Third-party or Open Source Declaration&lt;/p&gt; &lt;p style="margin-bottom: 0in; line-height: 100%;"&gt; * Name: Bart Simpson&lt;/p&gt; &lt;p style="margin-bottom: 0in; line-height: 100%;"&gt; * Date of first commit: 04/25/2009&lt;/p&gt; &lt;p style="margin-bottom: 0in; line-height: 100%;"&gt; * Release: 3.5 “The Summer Lager Release”&lt;/p&gt; &lt;p style="margin-bottom: 0in; line-height: 
100%;"&gt; * Component: tinyjson&lt;/p&gt; &lt;p style="margin-bottom: 0in; line-height: 100%;"&gt; * Description: C++ JSON object serializer/deserializer  &lt;/p&gt; &lt;p style="margin-bottom: 0in; line-height: 100%;"&gt; * Homepage: &lt;a href="http://blog.beef.de/projects/tinyjson/"&gt;http://blog.beef.de/projects/tinyjson/&lt;/a&gt;&lt;/p&gt; &lt;p style="margin-bottom: 0in; line-height: 100%;"&gt; * License: MIT style license&lt;/p&gt; &lt;p style="margin-bottom: 0in; line-height: 100%;"&gt; * Copyright: Copyright (c) 2008 Thomas Jansen (&lt;a href="mailto:thomas@beef.de"&gt;thomas@beef.de&lt;/a&gt;)&lt;/p&gt; &lt;p style="margin-bottom: 0in; line-height: 100%;"&gt; * Note: See below for original declarations from the code&lt;/p&gt; &lt;p style="margin-bottom: 0in; line-height: 100%;"&gt; */&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;If the above were upgraded to a javadoc-style comment then a tool could be built to auto-magically generate a Bill of Materials for each release.&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;There is one grey area in all this: how to handle developers using code from discussion sites like PHP.net, CodeProject, StackOverflow and similar sites.  Generally code put in these types of forums has no defined license.  In this case the code is copyrighted either by the site or by the author of the post... and developers should not use the code without getting an explicit license.  However, developers generally feel that people put the code up there to share.  
This conflict means the company policy on usage of this type of code must be clearly communicated to all developers.&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;This is a nice review article of other considerations for open source auditing:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;a href="http://www.drdobbs.com/article/printableArticle.jhtml;jsessionid=JG1RLBQMN0ERNQE1GHOSKHWATMY32JVN?articleId=228000261&amp;amp;dept_url=/open-source/"&gt;Dr Dobbs: Managing Open Source Licensing by Kamal Hassin&lt;/a&gt; &lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com/"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/managing-open-source-licenses"&gt;nealrichter's posterous&lt;/a&gt; &lt;/p&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-3265675031363218973?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/3265675031363218973/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=3265675031363218973' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/3265675031363218973'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/3265675031363218973'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2010/12/managing-open-source-licenses.html' title='Managing Open Source Licenses'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-8609216717587597872</id><published>2010-12-17T09:11:00.001-08:00</published><updated>2010-12-17T09:14:25.646-08:00</updated><title type='text'>Stochastic Universal Sampling/Selection</title><content type='html'>&lt;div class="posterous_autopost"&gt;&lt;div&gt;Stochastic Universal Sampling is a method of weighted random sampling exhibiting less bias and spread than classic roulette wheel sampling.  The intuition is a roulette wheel with n equally spaced steel balls spinning in unison around the wheel.  This method has better properties and is more efficient than doing repeated samples from the wheel with or without replacement of the selected items.&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;p style="font-family: Times New Roman; font-size: medium;"&gt;&lt;img src="http://docs.happycoders.org/unsorted/ai/genetic_algorithms/geatbx/selsus.gif" align="BOTTOM" alt="Stochastic universal sampling" width=400/&gt;&lt;/p&gt; &lt;dl style="font-family: Times New Roman; font-size: medium;"&gt;&lt;/dl&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;span style="font-family: sans-serif; font-size: 13px; line-height: 19px;"&gt;Baker, James E. (1987). "Reducing Bias and Inefficiency in the Selection Algorithm". &lt;i&gt;Proceedings of the Second International Conference on Genetic Algorithms and their Application&lt;/i&gt; (Hillsdale, New Jersey: L. 
Erlbaum Associates): 14–21.&lt;/span&gt;&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;Reference implementations on the web are scarce, so here are a few:&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;a href="http://www.borgelt.net/"&gt;Christian Borgelt&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://fuzzy.cs.uni-magdeburg.de/studium/ga/src/sus.c"&gt;http://fuzzy.cs.uni-magdeburg.de/studium/ga/src/sus.c&lt;/a&gt;&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;a href="https://github.com/dwdyer"&gt;Dan Dyer&lt;/a&gt;&lt;/div&gt;&lt;a href="https://github.com/dwdyer/watchmaker/blob/master/framework/src/java/main/org/uncommons/watchmaker/framework/selection/StochasticUniversalSampling.java"&gt;https://github.com/dwdyer/watchmaker/blob/master/framework/src/java/main/org/uncommons/watchmaker/framework/selection/StochasticUniversalSampling.java&lt;/a&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;a href="http://epr.adaptive.cs.unm.edu/asm/code.html"&gt;University of New Mexico&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://epr.adaptive.cs.unm.edu/asm/code.html"&gt;http://epr.adaptive.cs.unm.edu/asm/code.html&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;a href="http://cs.gmu.edu/~eclab/projects/ecj/"&gt;GMU's ECJ&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://cs.gmu.edu/~eclab/projects/ecj/docs/classdocs/ec/select/SUSSelection.html"&gt;http://cs.gmu.edu/~eclab/projects/ecj/docs/classdocs/ec/select/SUSSelection.html&lt;/a&gt;&lt;/div&gt; &lt;div&gt;See the SUSSelection.java buried in the latest tarball.&lt;/div&gt; &lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com/"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/stochastic-universal-samplingselection"&gt;nealrichter's posterous&lt;/a&gt; &lt;/p&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-8609216717587597872?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link 
rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/8609216717587597872/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=8609216717587597872' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/8609216717587597872'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/8609216717587597872'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2010/12/stochastic-universal-samplingselection.html' title='Stochastic Universal Sampling/Selection'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-5832506008616792955</id><published>2010-11-22T15:58:00.001-08:00</published><updated>2010-11-22T16:00:11.164-08:00</updated><title type='text'>Computing, economics and the financial meltdown (a collection of links)</title><content type='html'>&lt;div class="posterous_autopost"&gt;This editor's letter from CACM last year is interesting: &lt;a href="http://cacm.acm.org/magazines/2009/9/38884-the-financial-meltdown-and-computing/fulltext" target="_blank"&gt;The Financial Meltdown and Computing by Moshe Y. Vardi&lt;/a&gt; &lt;div&gt;&lt;br /&gt;&lt;blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"&gt;Information technology has enabled the development of a global financial system of incredible sophistication. 
At the same time, it has enabled the development of a global financial system of such complexity that our ability to comprehend it and assess risk, both localized and systemic, is severely limited. Financial-oversight reform is now a topic of great discussion. The focus of these talks is primarily over the structure and authority of regulatory agencies. Little attention has been given to what I consider a key issue—the opaqueness of our financial system—which is driven by its fantastic complexity. The problem is not a lack of models. To the contrary, the proliferation of models may have created an illusion of understanding and control, as is argued in a recent report titled "&lt;a href="http://ideas.repec.org/p/kie/kieliw/1489.html" target="_blank"&gt;The Financial Crisis and the Systemic Failure of Academic Economics.&lt;/a&gt;"&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;Krugman's essay at the time, &lt;a href="http://www.nytimes.com/2009/09/06/magazine/06Economic-t.html?_r=1&amp;amp;em=&amp;amp;pagewanted=all" target="_blank"&gt;How Did Economists Get It So Wrong?&lt;/a&gt;, gave a nice history of economic ideas, the models behind them, and his interpretation of their correctness.&lt;p&gt; &lt;/p&gt;&lt;blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"&gt;The theoretical model that finance economists developed by assuming that every investor rationally balances risk against reward — the so-called Capital Asset Pricing Model, or CAPM (pronounced cap-em) — is wonderfully elegant&lt;/blockquote&gt;&lt;blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"&gt;[snip]&lt;br /&gt;&lt;/blockquote&gt;&lt;blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"&gt; Economics, as a field, got in trouble because economists were seduced by the vision of a perfect, 
frictionless market system.&lt;br /&gt;&lt;/blockquote&gt;&lt;blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"&gt; [snip]&lt;br /&gt;&lt;/blockquote&gt;&lt;blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"&gt;H. L. Mencken: “There is always an easy solution to every human problem — neat, plausible and wrong.”&lt;/blockquote&gt; &lt;div&gt;  &lt;/div&gt;I read this months ago and it's been percolating in my thoughts since then. &lt;a href="http://nymag.com/news/business/55687/" target="_blank"&gt;My Manhattan Project - How I helped build the bomb that blew up Wall Street by Michael Osinski.&lt;/a&gt; Osinski wrote much of the software and models used to form &lt;a href="http://en.wikipedia.org/wiki/Collateralized_mortgage_obligation" target="_blank"&gt;CMOs&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/Collateralized_debt_obligation" target="_blank"&gt;CDOs&lt;/a&gt;. Essentially the software aggregated debt instruments from mortgage and other debt markets and allowed a bond designer to issue a tailor-made portfolio of debt while mitigating the default risk of the debt via that aggregation.  He called it his sausage grinder.&lt;p&gt; &lt;/p&gt;&lt;blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"&gt;“You put chicken into the grinder”—he laughed with that infectious Wall Street black humor—“and out comes sirloin.”&lt;/blockquote&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;Here's a large collection of links from that period that are worth reading.  
My thought at the moment is this nugget from Twitter:&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;blockquote class="gmail_quote" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex;"&gt; &lt;span style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px;"&gt;&lt;a href="http://twitter.com/Poormojo" class="tweet-url screen-name" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px;"&gt;Poormojo&lt;/a&gt;&lt;/span&gt; &lt;span class="Apple-style-span" style="font-family: 'Lucida Grande', sans-serif; font-size: 14px; line-height: 16px; "&gt;"Any sufficiently advanced financial instrument is indistinguishable from fraud."&lt;/span&gt;&lt;/blockquote&gt;&lt;blockquote class="gmail_quote" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Lucida Grande', sans-serif; font-size: 14px; line-height: 16px; "&gt;&lt;br /&gt;&lt;/span&gt;&lt;/blockquote&gt;&lt;a href="http://www.wired.com/techbiz/it/magazine/17-03/wp_quant?currentPage=all" target="_blank"&gt;Recipe for Disaster: The Formula That Killed Wall Street&lt;/a&gt;&lt;p&gt;&lt;a href="http://www.nytimes.com/2009/09/13/business/13unboxed.html?em=&amp;amp;pagewanted=print" target="_blank"&gt;Wall Street’s Math Wizards Forgot a Few Variables &lt;/a&gt;&lt;/p&gt;&lt;p&gt; &lt;a href="http://www.nytimes.com/2009/09/13/business/13lehman.html?_r=1&amp;amp;hp=&amp;amp;adxnnlx=1252814676-J4YWny4n5Bq8ehtgH1Fu3A&amp;amp;pagewanted=print" target="_blank"&gt;Tales From Lehman’s Crypt 
&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;a href="http://www.nytimes.com/2009/09/13/business/economy/13view.html?src=linkedin" target="_blank"&gt;Economic View: Flaw in Free Markets: Humans&lt;/a&gt;&lt;/div&gt; &lt;p&gt;&lt;a href="http://freakonomics.blogs.nytimes.com/2009/01/09/this-is-your-brain-on-prosperity-andrew-lo-on-fear-greed-and-crisis-management/?pagemode=print" target="_blank"&gt;Andrew Lo: This is your brain on prosperity&lt;/a&gt;&lt;br /&gt;&lt;/p&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;a href="http://bit.ly/1PwZ5V" target="_blank"&gt;A crisis of politics, not economics: Complexity, Ignorance, and policy failure.&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt; &lt;a href="http://www.newsweek.com/id/200015" target="_blank"&gt;Revenge of the Nerd: Paul Wilmott is out to save Wall Street's soul—one dork at a time.&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;a href="http://cacm.acm.org/magazines/2009/10/42365-a-conversation-with-david-e-shaw/abstract" target="_blank"&gt;A Conversation with David E. 
Shaw&lt;/a&gt;&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;a href="http://www.forbes.com/2008/10/07/securities-quants-models-oped-cx_ss_1008shreve.html" target="_blank"&gt;Don't Blame The Quants by Steven Shreve&lt;/a&gt;&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;a href="http://www.nytimes.com/2009/10/14/opinion/14trillin.html?em=&amp;amp;adxnnl=1&amp;amp;adxnnlx=1255543552-DPpTSk3i4f5lEJZALsigRA" target="_blank"&gt;http://www.nytimes.com/2009/10/14/opinion/14trillin.html?em=&amp;amp;adxnnl=1&amp;amp;adxnnlx=1255543552-DPpTSk3i4f5lEJZALsigRA&lt;/a&gt;&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;a&gt;Sciam: Does Economics Violate the Laws of Physics?&lt;/a&gt;&lt;/div&gt;&lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;a href="http://online.wsj.com/article/SB10001424052748704204304574543503520372002.html" target="_blank"&gt;Systemic Risk and Fannie Mae&lt;/a&gt;&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;a href="http://www.reuters.com/article/marketsNews/idUSN1641435720091202?pageNumber=2&amp;amp;virtualBrandChannel=0&amp;amp;sp=true" target="_blank"&gt;Geeks trump alpha males as algos dominate Wall St&lt;/a&gt;&lt;/div&gt; &lt;/div&gt; &lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com/"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/computing-economics-and-the-financial-meltdow"&gt;nealrichter's posterous&lt;/a&gt; &lt;/p&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-5832506008616792955?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/5832506008616792955/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=5832506008616792955' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/17598372/posts/default/5832506008616792955'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/5832506008616792955'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2010/11/computing-economics-and-financial.html' title='Computing, economics and the financial meltdown (a collection of links)'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-8562294602962586814</id><published>2010-10-27T00:02:00.001-07:00</published><updated>2010-10-27T09:18:11.533-07:00</updated><title type='text'>Review of "Learning to Rank with Partially-Labeled Data"</title><content type='html'>&lt;div class="posterous_autopost"&gt;&lt;br /&gt;&lt;div&gt;I've been attending the University of Utah &lt;a href="http://www.cs.utah.edu/~suresh/mediawiki/index.php/MLRG/fall10"&gt;Machine Learning seminar&lt;/a&gt; (when I can) this fall. PhD student &lt;a href="http://www.cs.utah.edu/~piyush/"&gt;Piyush Kumar Rai&lt;/a&gt; is organizing it.&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;I volunteered to take the group through &lt;span style=""&gt;&lt;a href="https://ssli.ee.washington.edu/people/duh/papers/sigir.pdf" title="https://ssli.ee.washington.edu/people/duh/papers/sigir.pdf" class="external text" rel="nofollow" style="text-decoration: none; color: rgb(51, 102, 187); background-image: ; background-color: initial; padding-right: 16px; background-position: 100% 50%;"&gt;Learning to Rank with Partially-Labeled Data&lt;/a&gt; by Duh &amp;amp; Kirchhoff.  
I have some experience researching and implementing LTR algorithms, mostly using reinforcement learning or ant-system type approaches.  Some general intro &lt;a href="http://en.wikipedia.org/wiki/Learning_to_rank"&gt;here&lt;/a&gt;.&lt;/span&gt;&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;span style=""&gt;The paper presents the main &lt;a href="http://en.wikipedia.org/wiki/Transduction_(machine_learning)"&gt;Transductive Learning&lt;/a&gt; algorithm as a framework, then fills in the blanks with &lt;a href="http://en.wikipedia.org/wiki/Kernel_principal_component_analysis"&gt;Kernel PCA&lt;/a&gt; and &lt;a href="http://jmlr.csail.mit.edu/papers/volume4/freund03a/freund03a.pdf"&gt;RankBoost&lt;/a&gt;.  Several Kernels are used:  Linear, polynomial, radial basis function and knn-diffusion.  RankBoost learns a kind of ensemble of 'weak learners' with simple thresholds.&lt;/span&gt;&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;span style=""&gt;The main reason to read the paper if you are already familiar with LTR is the use of the transductive algorithm.&lt;/span&gt;&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;span class="Apple-style-span" style="font-size: 16px; "&gt;&lt;img src="http://posterous.com/getfile/files.posterous.com/nealrichter/LfPNH6yZ7w2ku6Nkuw2V647tJK9XAtUCRaWKSncqzxRHWRqTum5gNmPREX2u/Transductive.png" width="400" height="300" /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;span style=""&gt;Note the DISCOVER() &amp;amp; LEARN() functions.  These are the unsupervised and supervised algorithm blanks they fill with Kernel PCA and RankBoost.  What the first actually does is learn a function we could call EXTRACT() that can extract or create features for later use.  
They do show that the basic idea of layering in unlabeled data with labeled data is a net gain.&lt;/span&gt;&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;span style=""&gt;There are some issues with the paper.  First, the running time, as they admit, is not good.  Second, their use of Kernel PCA in an information retrieval context is a bit naive IMO.  The IR literature is full of hard-won knowledge about extracting decent features from documents.  See &lt;a href="http://nlp.stanford.edu/IR-book/information-retrieval-book.html"&gt;this book&lt;/a&gt; for example.  This is mostly ignored here.&lt;/span&gt;&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;span style=""&gt;The more confusing thing is the use of K-Nearest Neighbor diffusion kernels.  Basically they take the vector of documents, form a matrix by Euclidean distance and then random-walk the matrix for a set number of time-steps.  The PCA then takes this 'kernel' output and solves the eigenvalue problem to get the eigenvectors.  This all seems a roundabout way of saying they approximated the Perron-Frobenius eigenvector (sometimes called PageRank) by iterating the matrix a set number of times and zeroing out low-order cells.  Or at least I see no effective difference between what they did and what I just described.  Basically they just make the matrix sparse so that it is easier to solve (i.e. this is the dual).&lt;/span&gt;&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;span style=""&gt;Their use of various classic IR features like TFIDF, BM25 etc. needed help.  There's plenty of IR wisdom on how to use such features, so why let DISCOVER() wander about attempting to rediscover it? 
The results were also muddled with only one of the three data sets showing a significant improvement over a baseline technique.&lt;/span&gt;&lt;/div&gt; &lt;p&gt;&lt;/p&gt;&lt;div&gt;&lt;span style=""&gt;All that aside, it's worth a read for the intro to the&lt;/span&gt;&lt;span style=""&gt; transductive alg used with an IR centric task.&lt;/span&gt;&lt;/div&gt; &lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com/"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/review-of-learning-to-rank-with-partially-lab"&gt;nealrichter's posterous&lt;/a&gt; &lt;/p&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-8562294602962586814?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/8562294602962586814/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=8562294602962586814' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/8562294602962586814'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/8562294602962586814'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2010/10/review-of-to-rank-with-partially.html' title='Review of &amp;quot;Learning to Rank with Partially-Labeled Data&amp;quot;'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-6636858392845550846</id><published>2010-10-26T23:07:00.001-07:00</published><updated>2010-10-26T23:07:34.249-07:00</updated><title type='text'>Stanford Computational Advertising course - Fall 2010</title><content type='html'>&lt;div class='posterous_autopost'&gt;Andrei Broder and Vanja Josifovski of Yahoo Research Labs are again offering a Fall course on &lt;a href="http://www.stanford.edu/class/msande239/"&gt;Computational Advertising&lt;/a&gt; at Stanford. &lt;p /&gt;&lt;div&gt;Great intro to the area.&lt;/div&gt; &lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/stanford-computational-advertising-course-fal"&gt;nealrichter's posterous&lt;/a&gt; &lt;/p&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-6636858392845550846?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/6636858392845550846/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=6636858392845550846' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/6636858392845550846'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/6636858392845550846'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2010/10/stanford-computational-advertising.html' title='Stanford Computational Advertising course - Fall 
2010'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-4488606948604415089</id><published>2010-08-16T12:44:00.001-07:00</published><updated>2010-08-16T12:44:35.973-07:00</updated><title type='text'>Strategic Marketing</title><content type='html'>&lt;div class='posterous_autopost'&gt;I highly recommend this ExecEd course at MIT Sloan: &lt;a href="http://mitsloan.mit.edu/execed/coursedetails.php?id=743" target="_blank"&gt;Strategic Marketing for the Technical Executive&lt;/a&gt;.  &lt;p /&gt;Duncan Simester taught it. The course looks to be a condensed form of this course: &lt;a href="http://ocw.mit.edu/courses/sloan-school-of-management/15-810-marketing-management-fall-2004/" target="_blank"&gt;15.810 Marketing Management&lt;/a&gt;&lt;p /&gt; Key takeaways:&lt;br /&gt;1) Making decisions by experimentation versus meetings+intuition is crucial.&lt;br /&gt;2) Don&amp;#39;t assume your role is to know the answer. 
Your role is really to work out how to find the answer as quickly as possible.&lt;br /&gt; 3) Brand is unimportant when customers can observe you are meeting their needs.&lt;br /&gt;4) Brand is important when they can&amp;#39;t search/observe and must reason with less data.&lt;br /&gt;5) Don&amp;#39;t price your products based upon cost or competition, work out your true value to the customer.&lt;div&gt; 6) Your actual efficiencies might be different than you assume&lt;br /&gt;&lt;div&gt;7) Experiment and iterate as often as you can&lt;/div&gt;&lt;/div&gt; &lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/strategic-marketing"&gt;nealrichter's posterous&lt;/a&gt; &lt;/p&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-4488606948604415089?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/4488606948604415089/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=4488606948604415089' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4488606948604415089'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4488606948604415089'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2010/08/strategic-marketing.html' title='Strategic Marketing'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-5147939697407523106</id><published>2010-07-27T17:44:00.001-07:00</published><updated>2010-08-02T12:48:36.219-07:00</updated><title type='text'>Dissertation done!</title><content type='html'>&lt;div class="posterous_autopost"&gt;The big event of the spring was defending and completing the below.&lt;br /&gt;Really happy to be done. &lt;p&gt; Advice for working professionals attempting a PhD:&lt;br /&gt;1) pick something relevant to your work.&lt;br /&gt;2) think twice about a theoretical topic.&lt;br /&gt;3) don't make it longer/bigger than necessary.&lt;br /&gt;4) don't grow your family during this time &lt;/p&gt;&lt;p&gt; I did not follow this advice and this likely resulted in a 4 year delay. The outcome was great and the topic is now relevant to the new job at Rubicon. &lt;/p&gt;&lt;p&gt; &lt;a href="http://nealrichter.com/research/dissertation/"&gt;http://nealrichter.com/research/dissertation/&lt;/a&gt; &lt;/p&gt;&lt;p&gt; On Mutation and Crossover in the Theory of Evolutionary Algorithms &lt;/p&gt;&lt;p&gt; Abstract:&lt;br /&gt;The Evolutionary Algorithm is a population-based metaheuristic optimization algorithm. The EA employs mutation, crossover and selection operators inspired by biological evolution. It is commonly applied to find exact or approximate solutions to combinatorial search and optimization problems. &lt;/p&gt;&lt;p&gt;   This dissertation describes a series of theoretical and experimental studies on a variety of evolutionary algorithms and models of those algorithms. The effects of the crossover and mutation operators are analyzed. Multiple examples of deceptive fitness functions are given where the crossover operator is shown or proven to be detrimental to the speedy optimization of a function. 
While other research monographs have shown the benefits of crossover on various fitness functions, this is one of the few (or only) doing the inverse. &lt;/p&gt;&lt;p&gt;   A background literature review is given of both population genetics and evolutionary computation with a focus on results and opinions on the relative merits of crossover and mutation. Next, a family of new fitness functions is introduced and proven to be difficult for crossover to optimize. This is followed by the construction and evaluation of executable theoretical models of EAs in order to explore the effects of parameterized mutation and crossover. &lt;/p&gt;&lt;p&gt;These models link the EA to the Metropolis-Hastings algorithm. Dynamical systems analysis is performed on models of EAs to explore their attributes and fixed points. Additional crossover deceptive functions are shown and analyzed to examine the movement of fixed points under changing parameters. Finally, a set of online adaptive parameter experiments with common fitness functions is presented.&lt;/p&gt;&lt;p&gt; Finalized April 19, 2010 &lt;/p&gt;&lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com/"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/dissertation-done"&gt;nealrichter's posterous&lt;/a&gt; &lt;/p&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-5147939697407523106?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/5147939697407523106/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=5147939697407523106' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/5147939697407523106'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/17598372/posts/default/5147939697407523106'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2010/07/dissertation-done.html' title='Dissertation done!'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-341023655708657021</id><published>2009-11-02T17:47:00.001-08:00</published><updated>2009-11-02T17:49:59.794-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='key-value stores'/><category scheme='http://www.blogger.com/atom/ns#' term='tokyo tyrant'/><title type='text'>RFP: Bounding memory usage in Tokyo Cabinet and Tokyo Tyrant</title><content type='html'>I'm soliciting proposals to implement an absolute (or at least soft) bound on memory usage of TT + TC. The reward is cash. Send me a proposal. The patch needs to work, not crash TT/TC or corrupt data. &lt;p&gt; Here's a start: &lt;a href="http://github.com/nealrichter/tokyotyrant_rsshack"&gt;http://github.com/nealrichter/tokyotyrant_rsshack&lt;/a&gt; &lt;/p&gt;&lt;p&gt; I've attempted to do this myself and have not had the time to finish or fully test it. I've asked Mikio for feedback/help finishing this and he's been nearly silent on the request. &lt;/p&gt;&lt;p&gt; At the moment we (myself, Sam Tingleff and Mike Dierken) work around this issue by continuing to play with various parameter TC settings and restarting the daemon when the memory usage grows beyond a comfort level. &lt;/p&gt;&lt;p&gt; I'm a C coder and have hacked the internals of BerkeleyDB in the past, so can help review code, trade ideas, etc. We (as a team) don't have the time to work on this at the moment. 
&lt;/p&gt;&lt;p&gt; If you are interested contact me! We've got a few other ideas for TT enhancements as well... &lt;/p&gt;&lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com/"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/rfp-bounding-memory-usage-in-tokyo-cabinet-an"&gt;nealrichter's posterous&lt;/a&gt; &lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-341023655708657021?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/341023655708657021/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=341023655708657021' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/341023655708657021'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/341023655708657021'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2009/11/rfp-bounding-memory-usage-in-tokyo.html' title='RFP: Bounding memory usage in Tokyo Cabinet and Tokyo Tyrant'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-1423157201813845704</id><published>2009-10-28T23:11:00.001-07:00</published><updated>2009-10-28T23:18:48.053-07:00</updated><title type='text'>Current Computational Advertising Course and Events</title><content type='html'>Andrei Broder and Vanja Josifovski of Yahoo Research Labs are offering 
a Fall 2009 course on &lt;a href="http://www.stanford.edu/class/msande239/" target="_blank"&gt;Computational Advertising at Stanford&lt;/a&gt;.  The slides for lectures #1-#4 are up and it looks to be an interesting course.  I will definitely continue to follow along remotely.&lt;p&gt; The National Institute of Statistical Sciences is holding a &lt;a href="http://www.niss.org/event/niss-affiliates-workshop-computational-advertising" target="_blank"&gt;workshop on CompAdvert&lt;/a&gt; in early November.  The upcoming &lt;a href="http://ira09.soe.ucsc.edu/" target="_blank"&gt;WINE'09&lt;/a&gt; conference in Rome includes a few accepted papers on CompAdvert.  &lt;a href="http://www.sigkdd.org/kdd2010/"&gt;SigKDD 2010&lt;/a&gt; mentions it in the CFP.&lt;/p&gt;&lt;p&gt; Luckily the search engine results for 'Computational Advertising' are still free of noise.  &lt;a href="http://www.google.com/search?hl=en&amp;amp;client=firefox-a&amp;amp;rls=com.ubuntu:en-US:unofficial&amp;amp;hs=baS&amp;amp;q=computational+advertising&amp;amp;start=1&amp;amp;sa=N" target="_blank"&gt;Google&lt;/a&gt;, &lt;a href="http://search.yahoo.com/search?p=computational+advertising&amp;amp;ei=UTF-8&amp;amp;fr=ubuntu" target="_blank"&gt;Yahoo&lt;/a&gt;, &lt;a href="http://www.bing.com/search?q=computational+advertising&amp;amp;form=OSDSRC" target="_blank"&gt;Bing&lt;/a&gt;&lt;/p&gt;&lt;p&gt; Prior Events:&lt;br /&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://ira09.soe.ucsc.edu/" target="_blank"&gt;http://ira09.soe.ucsc.edu/&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://ijcai-09.org/tutorials/tutorial-SA4.html" target="_blank"&gt;http://ijcai-09.org/tutorials/tutorial-SA4.html&lt;/a&gt;&lt;/li&gt; &lt;li&gt; &lt;a href="http://www.sigecom.org/ec08/schedule_tutorials.html" target="_blank"&gt;http://www.sigecom.org/ec08/schedule_tutorials.html&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.ling.ohio-state.edu/acl08/cft.html#tut_1" 
target="_blank"&gt;http://www.ling.ohio-state.edu/acl08/cft.html#tut_1&lt;/a&gt;&lt;/li&gt; &lt;li&gt;&lt;a href="http://www.sigir2008.org/tutorials.html#t9" target="_blank"&gt;http://www.sigir2008.org/tutorials.html#t9&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt; &lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com/"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/current-computational-advertising-course-and"&gt;nealrichter's posterous&lt;/a&gt; &lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-1423157201813845704?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/1423157201813845704/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=1423157201813845704' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1423157201813845704'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1423157201813845704'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2009/10/current-computational-advertising.html' title='Current Computational Advertising Course and Events'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-4021747070190554853</id><published>2009-09-18T23:21:00.003-07:00</published><updated>2009-09-19T19:07:00.482-07:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='streams'/><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><title type='text'>References for mining from streaming data</title><content type='html'>While reading a lower-quality paper on the subject I found these references worth tracking down.  Essentially the idea is that you can make only one pass through the data and must produce approximate statistics of it, ideally with bounded approximation error.&lt;p&gt; Gibbons et al. 1997 &lt;a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.100.9995&amp;amp;rep=rep1&amp;amp;type=url&amp;amp;i=0"&gt;Fast Incremental Maintenance of Approximate Histograms&lt;/a&gt;&lt;br /&gt; &lt;i&gt;Incremental maintenance of histograms, primarily for database query planners&lt;/i&gt;&lt;/p&gt;&lt;p&gt; Manku and Motwani 2002 &lt;a href="http://infolab.stanford.edu/%7Emanku/papers/02vldb-freq.pdf"&gt;Approximate frequency counts over data streams&lt;/a&gt;&lt;br /&gt;&lt;i&gt;They show algorithms called sticky sampling and lossy counting with proven error bounds.&lt;/i&gt;&lt;/p&gt;&lt;p&gt; Zhu et al. 
2002 &lt;a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.8732&amp;amp;rep=rep1&amp;amp;type=pdf"&gt;StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time&lt;/a&gt;&lt;br /&gt;&lt;i&gt;  Basic stats and high correlations between all pairs of data streams&lt;/i&gt;&lt;/p&gt;&lt;p&gt;  Cormode and Muthukrishnan's 2003 &lt;a href="http://people.cis.ksu.edu/%7Egpranshu/Trackingfrequentitems.pdf"&gt;What's Hot and What's Not: Tracking Most Frequent Items Dynamically&lt;/a&gt;&lt;br /&gt;&lt;i&gt;   Introduced groupTest probabilistic Monte Carlo algorithm for frequent item tracking&lt;/i&gt;&lt;/p&gt;&lt;p&gt;  Metwally et al 2005 &lt;a href="http://www.cs.ucsb.edu/%7Edsl/publications/2005/ICDT2005-metwally.pdf%20"&gt;Efficient Computation of Frequent and Top-k Elements in Data Streams&lt;/a&gt;&lt;br /&gt; &lt;i&gt;Uses counters to monitor data streams with a new stream-summary data structure&lt;/i&gt;&lt;/p&gt;&lt;p&gt; Cheng et al 2005 &lt;a href="http://www.dbsj.org/Japanese/DBSJLetters/vol4/no1/papers/cheng.pdf"&gt;Time-Decaying Bloom Filters for Data Streams with Skewed Distributions&lt;/a&gt;&lt;br /&gt; &lt;i&gt;Dampened Bloom-filters for frequent items&lt;/i&gt;&lt;/p&gt;&lt;p&gt; Three papers on frequent itemsets (different than frequent items):&lt;/p&gt;&lt;p&gt;Jiang and Gruenwald have a pair of 2006 papers&lt;br /&gt;&lt;a href="http://www.cs.ou.edu/%7Edatabase/documents/Research%20Issues%20in%20Association%20Rule%20Mining%20for%20Data%20Streams.pdf"&gt;Research Issues in Data Stream Association Rule Mining&lt;/a&gt;&lt;br /&gt;&lt;i&gt;  Survey paper of issues and previous results&lt;/i&gt;&lt;/p&gt;&lt;p&gt;&lt;i&gt;&lt;/i&gt;&lt;a href="http://www.cs.ou.edu/%7Edatabase/documents/CFI-Stream%20Mining%20Closed%20Frequent%20Itemsets%20in%20Data%20Streams.pdf"&gt;CFI-Stream: Mining Closed Frequent Itemsets in Data Streams&lt;/a&gt;&lt;br /&gt; &lt;i&gt;New stream based itemset miner&lt;br /&gt;&lt;/i&gt;&lt;br 
/&gt;Tsai and Chen 2009 &lt;a href="http://dspace.lib.fcu.edu.tw/bitstream/2377/11247/1/ce07ics002008000165.pdf"&gt;Mining Frequent Itemsets for data streams over Weighted Sliding Windows&lt;/a&gt;&lt;br /&gt;&lt;i&gt;   Apriori like itemset miner on windows of data with differential weighting&lt;/i&gt;&lt;/p&gt;Langford et al. 2008 &lt;a href="http://jmlr.csail.mit.edu/papers/volume10/langford09a/langford09a.pdf"&gt;Sparse Online Learning via Truncated Gradient&lt;/a&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Induced sparsity in the weights of online learning algorithms with convex loss functions&lt;/span&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;(9/19) Corrected the StatStream link - hat tip &lt;a href="http://twitter.com/dataspora/"&gt;@dataspora&lt;br /&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Added Langford - hat tip &lt;a href="http://twitter.com/gappy3000/"&gt;@gappy3000&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com/"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/references-for-mining-from-streaming-data"&gt;nealrichter's posterous&lt;/a&gt; &lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-4021747070190554853?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/4021747070190554853/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=4021747070190554853' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4021747070190554853'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4021747070190554853'/><link rel='alternate' type='text/html' 
href='http://aicoder.blogspot.com/2009/09/references-for-mining-from-streaming_18.html' title='References for mining from streaming data'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-2401061769245282308</id><published>2009-09-17T12:30:00.001-07:00</published><updated>2009-09-17T20:54:20.692-07:00</updated><title type='text'>Others Online acquired by the Rubicon Project</title><content type='html'>I'm thrilled to say that Others Online has been scooped up by &lt;a href="http://www.rubiconproject.com/" target="_blank"&gt;the Rubicon Project&lt;/a&gt;.  Press release &lt;a href="http://rubiconproject.com/about/press/the-rubicon-project-makes-others-online-its-first-acquisition/" target="_blank"&gt;here&lt;/a&gt;.  I'm joining as a Data Scientist.  
Jordan authored a wrap-up post here: &lt;a href="http://blog.othersonline.com/"&gt;http://blog.othersonline.com/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Update:  Nice Forbes article on the opportunity in front of us.&lt;br /&gt;&lt;a href="http://www.forbes.com/2009/09/17/internet-advertising-rubicon-systems-business-media-rubicon.html"&gt;For Advertisers Drowning In Data, A Lifeguard &lt;/a&gt;&lt;br /&gt;&lt;p style="font-size: 10px;"&gt;&lt;a href="http://posterous.com/"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/others-online-acquired-by-the-rubicon-project"&gt;nealrichter's posterous&lt;/a&gt; &lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-2401061769245282308?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/2401061769245282308/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=2401061769245282308' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/2401061769245282308'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/2401061769245282308'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2009/09/others-online-acquired-by-rubicon.html' title='Others Online acquired by the Rubicon Project'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-4165142573163865402</id><published>2009-09-14T11:54:00.001-07:00</published><updated>2009-09-14T16:07:42.900-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='adword'/><category scheme='http://www.blogger.com/atom/ns#' term='keyword'/><category scheme='http://www.blogger.com/atom/ns#' term='google'/><category scheme='http://www.blogger.com/atom/ns#' term='computational advertising'/><title type='text'>Google AdWords now personalized</title><content type='html'>Hat Tip: Found via Greg Linden's blog: &lt;a href="http://glinden.blogspot.com/2009/09/google-adwords-now-personalized.html"&gt;Google AdWords now personalized&lt;/a&gt;. Below are my thoughts and questions:&lt;p&gt; Google is now reaching back into your previous search history and presumably choosing a better previous search if the current one is not sufficiently monetizable.  &lt;/p&gt;&lt;p&gt;Questions:&lt;br /&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Is the goal of Google to increase the fill-rate of ads or to show more valuable ads in general?&lt;br /&gt;&lt;/li&gt;&lt;li&gt;What criteria is used to reach back into a user's history?  Boolean commercial/non-commercial then select last commercial search versus choosing based upon some selection algorithm from the last N searches (see previous point).&lt;/li&gt; &lt;li&gt;Will the reach-back cross a topic boundary or is it only to enhance context for an ambiguous search?&lt;br /&gt;&lt;/li&gt;&lt;li&gt;What effect will this have on the &lt;a href="https://adwords.google.com/select/KeywordToolExternal"&gt;Google Keyword Tool&lt;/a&gt; that helps advertisers forecast demand and price for a keyword?  
The volume numbers must now be adjusted for the fraction of impressions that are shifted to alternate keywords.&lt;/li&gt; &lt;li&gt;How much will this starve the long tail of searches?  Depending on the aggressiveness of the selection, long-tail searches may suffer a decrease in volume for adwords.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Even the most modest change of merely using recent previous searches only 'about' the current search to augment the adwords auction query should have a dramatic effect on the auction process.  By definition it expands the number of bidders for a particular query.  It may also curtail the effectiveness of arbitrage done by some adwords buyers who buy ambiguous lower value keywords as proxies for high value ones due to user sessions with query reformulations.  Why?  It should have the effect of driving up prices for the penny keywords if they are sufficiently related to high value keywords.&lt;p&gt; It will be interesting to watch what happens.  This is likely a non-trivial change in the keyword market. 
&lt;/p&gt;&lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com/"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/google-adwords-now-personalized"&gt;nealrichter's posterous&lt;/a&gt; &lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-4165142573163865402?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/4165142573163865402/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=4165142573163865402' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4165142573163865402'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4165142573163865402'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2009/09/google-adwords-now-personalized.html' title='Google AdWords now personalized'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-5804339610923689653</id><published>2009-08-28T15:36:00.001-07:00</published><updated>2009-08-28T15:47:45.538-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='open source'/><category scheme='http://www.blogger.com/atom/ns#' term='scripting languages'/><title type='text'>The GPL and scripting languages</title><content type='html'>I was recently asked by an associate about how the GPL and LGPL impacts scripting languages.  
The short answer is that it's not clear.&lt;p&gt;My general take on the LGPL and scripting languages is that if one cut-and-pastes LGPL code A into code B, then code B falls under the LGPL.  However, if one script-includes code A into code B via the only real mechanism scripting languages allow, then code B does not fall under the LGPL.&lt;/p&gt;&lt;p&gt; php: &lt;code&gt;&lt;span style="color: rgb(0, 0, 0);"&gt;&lt;span style="color: rgb(0, 119, 0);"&gt;include &lt;/span&gt;&lt;span style="color: rgb(221, 0, 0);"&gt;'Foo.php'&lt;/span&gt;&lt;span style="color: rgb(0, 119, 0);"&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;br /&gt;perl: &lt;a href="http://perldoc.perl.org/functions/require.html" class="l_k" style="font-family: courier new,monospace;"&gt;require&lt;/a&gt;&lt;span style="font-family:courier new,monospace;"&gt; &lt;/span&gt;&lt;span class="w"  style="font-family:courier new,monospace;"&gt;Foo::Bar&lt;/span&gt;&lt;span class="sc"  style="font-family:courier new,monospace;"&gt;;&lt;/span&gt;&lt;br /&gt;python: &lt;span class="pykeyword" style="color: rgb(0, 0, 153);font-family:courier new,monospace;" &gt;import&lt;/span&gt;&lt;span style="font-family:courier new,monospace;"&gt; Foo&lt;/span&gt;&lt;br /&gt;ruby:&lt;span style="font-family:courier new,monospace;"&gt; &lt;span style="color: rgb(0, 153, 0);"&gt;require "foo"&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;javascript: &lt;span style="color: rgb(255, 102, 0);font-family:courier new,monospace;" &gt;&amp;lt;script type="text/javascript" src="external.js"&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;&lt;/p&gt;&lt;pre class="verbatim"&gt;&lt;br /&gt;&lt;/pre&gt;This differs from the application of the LGPL to compiled languages like C/C++, where the programmer decides how the machine code is combined: statically versus dynamically linked.  In Java if one has an LGPL jar file, then the code can be used in the standard way.  
If your code is combined with the contents of the LGPL jar file in a new jar, then your code falls under the LGPL.&lt;p&gt; What does the scripting language really do under the hood?  It's different in each language.  Some may physically include the file and parse-interpret it as one block of code, others may parse-interpret separately and resolve references between them much like a dynamic linker.  Do the details matter?  Unknown, yet one could reasonably assume the author of a piece of script code under the LGPL must have intended script-include style uses by others to be OK.  If in doubt check with the author and save that email.&lt;br /&gt;&lt;/p&gt;&lt;p&gt; What about the GPL?  It is clearer: no mixing of any kind can be done.  One could argue that interpreted script is not linked in the way that compiled languages are... yet conservatively this is a "thin reed" to stand on.&lt;/p&gt;&lt;p&gt; What does this mean for the myriad GPL javascript out there intermingling with non-GPL and proprietary javascript within browsers visiting script heavy websites everywhere?  It's a dog's breakfast of legal issues.&lt;/p&gt;&lt;p&gt; The post &lt;a href="http://jacobian.org/writing/gpl-questions/"&gt;Twenty questions about the GPL&lt;/a&gt; is pretty informative on the issues and worth a read; it goes deeper than the above. 
&lt;/p&gt;&lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com/"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/the-gpl-and-scripting-languages"&gt;nealrichter's posterous&lt;/a&gt; &lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-5804339610923689653?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/5804339610923689653/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=5804339610923689653' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/5804339610923689653'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/5804339610923689653'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2009/08/gpl-and-scripting-languages.html' title='The GPL and scripting languages'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-7838435797875950182</id><published>2009-08-20T23:59:00.001-07:00</published><updated>2009-08-20T23:59:36.568-07:00</updated><title type='text'>What can we learn from the Google File System?</title><content type='html'>I found this via my twitter herd and I&amp;#39;ve been thinking about it for days. 
&lt;a href="http://queue.acm.org/detail.cfm?id=1594206"&gt;Kirk McKusick interviews Sean Quinlan&lt;/a&gt; about the &lt;a href="http://labs.google.com/papers/gfs-sosp2003.pdf"&gt;Google File System&lt;/a&gt;. Fascinating stuff.&lt;p /&gt; We&amp;#39;re very fortunate to have storage scalability challenges ourselves at &amp;#39;Undisclosed&amp;#39; (formerly OthersOnline). We&amp;#39;re amassing mountains of modest chunks of information via a set of many many hundreds of millions of keys. We&amp;#39;ve evolved thusly:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;MySQL table with single key and many values, one row per value.&lt;/li&gt;&lt;li&gt;MySQL key/value table with one value per key&lt;/li&gt;&lt;li&gt;&lt;a href="http://memcachedb.org/"&gt;memcachedb&lt;/a&gt; - memached backed by Berkeley DB&lt;/li&gt; &lt;li&gt;&lt;a href="http://nginx.net/"&gt;Nginx&lt;/a&gt; + Webdav system&lt;/li&gt;&lt;li&gt;Our own Sam Tingleff&amp;#39;s &lt;a href="http://github.com/samtingleff/tree/master"&gt;valkyrie&lt;/a&gt; - Consistent Hashing + Tokyo Tyrant&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;#5 is the best performing, yet we still aren&amp;#39;t going to escape the unpredictable I/O performance of EC2 disks. To my understanding this serves a role much like the chunk servers of GFS. We need low latency read access to storage on one side, and high throughput write access on the other side.&lt;p /&gt;Combining the insights with the above interview and the excellent &lt;a href="http://queue.acm.org/detail.cfm?id=1563874"&gt;Pathologies of Big Data&lt;/a&gt;, I&amp;#39;m left with the impression that one must absolutely limit the number of disk seeks and do some &amp;#39;perfect&amp;#39; number of I/O ops against of X MB chunks.&lt;p /&gt;Random questions:&lt;br /&gt;What is X for some random hardware in the cloud? How do we find it? What if X changes as the disk gets full? 
What kind of mapping function do we send the application key into to return a storage key to get the best amount of sequential disk access on write and best memory caching of chunks on read? What about fragmentation?&lt;p /&gt;It does seem as though the newish adage is very true.. web 2.0 is mostly about exposing old-school unix commands and system calls over HTTP. I keep thinking this must have been solved a dozen times before. cycbuf feature of usenet servers? &lt;p style="font-size: 10px;"&gt; &lt;a href="http://posterous.com"&gt;Posted via email&lt;/a&gt;  from &lt;a href="http://nealrichter.posterous.com/what-can-we-learn-from-the-google-file-system"&gt;nealrichter's posterous&lt;/a&gt; &lt;/p&gt;   &lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-7838435797875950182?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/7838435797875950182/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=7838435797875950182' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/7838435797875950182'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/7838435797875950182'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2009/08/what-can-we-learn-from-google-file.html' title='What can we learn from the Google File System?'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-1455185087590110722</id><published>2009-08-02T08:13:00.000-07:00</published><updated>2009-08-03T08:46:14.963-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='journals'/><category scheme='http://www.blogger.com/atom/ns#' term='cs'/><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>Response to Dr Lance Fortnow's CACM opinion</title><content type='html'>Dr. Lance Fortnow published a strong argument against the current 'strong conference' CS system in the &lt;a href="http://cacm.acm.org/magazines/2009/8/34492-time-for-computer-science-to-grow-up/fulltext"&gt;August issue of CACM.&lt;/a&gt;  His essential desire is that CS move to a system similar to the hard sciences and engineering.  Conferences should accept any paper of reasonable quality, not publish proceedings and hold Journals as the sole vehicle for publications.&lt;br /&gt;&lt;br /&gt;My Questions:&lt;br /&gt;&lt;ol&gt;&lt;br /&gt;&lt;li&gt;While CS should have a stronger Journal system, why should that come at the expense of quality conferences?&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;    The reputation of CS conferences as a method of publication means that it is both acceptable to publish and cite papers from these confs.  The result of this is that one can read a conf paper, compose a citing follow-up and publish it &lt;i&gt;within&lt;/i&gt; 1 year.  This fluidity of ideas without the sometimes 18 month to 2 year wait on Journal publication is a great advantage!&lt;br /&gt;&lt;br /&gt;    Yes other disciplines publish pre-prints to arXiv.org.. is this really a solution when a paper has been rejected yet avail on arXiv.org for 18 months?&lt;br /&gt;&lt;br /&gt;&lt;li&gt;Is this problem really that acute in all communities?  
Can't it be solved at the community level?&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;    Certainly in AI, Machine Learning, Data Mining and Evolutionary Computation I perceive that the Journals are held in high regard (all papers are great) and that the conference proceedings can be a mixed bag with strong and weak papers.  I am getting strong pressure from my advisors and peers to consider Journal versions of some of my conf papers.  EC specifically has a non-proceedings &lt;a href="http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=08051"&gt;conference&lt;/a&gt; to meet and discuss less settled results.&lt;br /&gt;&lt;br /&gt;    If Dr. Fortnow feels that his area of theoretical computer science is too fragmented, then a solution would be to found more Journals and push the best conference papers into those Journals more heavily.  Perhaps that would mean that the conference system of that area would shrink in response as the Journals established themselves.&lt;br /&gt;&lt;br /&gt;&lt;li&gt;Why would industrial researchers and scientists participate in publications with such long time cycles?&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;    Again the fluidity is an advantage.  Were CS suddenly to switch to a soft conference system (low benefit to participation other than networking) I fear that participation of industry in publication venues would suffer.  The time scales of Journals in general mean that at publication time the results were done an eternity ago in industry terms.  Convincing your supervisor to back participation and publication in a conference is a far easier sell than an extended Journal submission effort.&lt;br /&gt;&lt;br /&gt;    One also wonders if the Journal system is not implicitly biased towards academic communities where participants are chasing tenure.  This 'ranking of people' referenced by Dr. 
Fortnow is very much for academic institutions and not of much value to industry IMHO.&lt;br /&gt; &lt;br /&gt;&lt;li&gt;Why do we want ANY such top-down forcing of CS organization?&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;    The culture of CS is much more aligned with self-organization and communities forming out of a birds-of-a-feather effect.  This also aligns with the changing face of corporate cultures and culture in general.  Such a top-down driven reorg would likely both fail and break the inherent fluidity of ideas and results in CS.&lt;br /&gt;&lt;br /&gt;    I also have an unfounded suspicion that such a top-down forced re-org would result in a clustering of power and influence towards traditional centers of power in academia.  If one picks up conference proceedings in my favorite CS areas and does a frequency count of the authors' institutions, the distribution is very much long tail.  The 'elite' universities are not dominating the results, meaning that the 'in crowd' effect is much weaker in CS.&lt;br /&gt;&lt;br /&gt;    Feudal systems are dying fast for a reason.&lt;br /&gt;&lt;br /&gt;&lt;li&gt;Obviously the current system is serving a need, doesn't that speak for itself?&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;    If CS researchers and scientists continue to attend, publish at and found conferences, is this not evidence that the system is serving a real need?&lt;br /&gt;&lt;br /&gt;&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;While Dr. Fortnow is correct on his points w.r.t. the problems faced by conference program committees (PCs)... the response is best made within that community.  Found some Journals and compete with the conferences for publications and reputation.  I don't accept that a strong Journal system can only be created by first wiping away a very fluid and successful conference system.&lt;br /&gt;&lt;br /&gt;My personal solution to strengthening the Journal system?  I'll set a goal of submitting a Journal publication or two in the next year.  
I am completely remiss in not yet submitting anything to a Journal.&lt;br /&gt;&lt;br /&gt;As a counter point to the above arguments.. if I could have a wish it would be a Journal system with fast review times, immediate publication of accepted papers and that markets itself by cherry picking great papers from conferences and encouraging those authors to submit to the Journal.  Something like the  &lt;a href="http://jmlr.csail.mit.edu/"&gt;Journal of Machine Learning Research&lt;/a&gt;.  The publishing of referee reviews also sounds interesting.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-1455185087590110722?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/1455185087590110722/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=1455185087590110722' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1455185087590110722'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1455185087590110722'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2009/08/response-to-dr-lance-fortnows-cacm.html' title='Response to Dr Lance Fortnow&apos;s CACM opinion'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-7017169098304195036</id><published>2009-07-14T09:41:00.000-07:00</published><updated>2011-01-14T08:57:50.027-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='highly available'/><category scheme='http://www.blogger.com/atom/ns#' term='apache'/><title type='text'>Hacking Apache's mod_proxy_http to enforce an SLA</title><content type='html'>Following up on the last post about HTTP &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;SLAs&lt;/span&gt;, let's say you have a web-service exposing &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;ReST&lt;/span&gt; &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;APIs&lt;/span&gt; for your awesome data miner/processor.  It has data input/output &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;APIs&lt;/span&gt; of various kinds.  The software &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_4"&gt;architecture&lt;/span&gt; consists of front-end &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;apache&lt;/span&gt; servers and back-end tomcat plus various data stores.  
Apache's mod_proxy and some load &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_6"&gt;balancer&lt;/span&gt; (&lt;a href="http://haproxy.1wt.eu/"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_7"&gt;HAProxy&lt;/span&gt;&lt;/a&gt;, &lt;a href="http://httpd.apache.org/docs/2.2/mod/mod_proxy_balancer.html"&gt;mod_proxy_&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_8"&gt;balancer&lt;/span&gt;&lt;/a&gt;) pushes the incoming requests to &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_9"&gt;backend&lt;/span&gt; servers.&lt;br /&gt;&lt;br /&gt;A client wants a guarantee that your &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_10"&gt;APIs&lt;/span&gt; will accept requests and return valid data and response codes within &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_11"&gt;XXms&lt;/span&gt; for 95% of requests (see &lt;a href="http://en.wikipedia.org/wiki/Service_level_agreement"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_12"&gt;Wikipedia's&lt;/span&gt; &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_13"&gt;SLA&lt;/span&gt;&lt;/a&gt; for other examples of service guarantees).  How can one be absolutely sure that the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_14"&gt;SLA&lt;/span&gt; is met?  Now add in the wrinkle that there might be different &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_15"&gt;SLAs&lt;/span&gt; for the various &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_16"&gt;APIs&lt;/span&gt;.  In addition, the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_17"&gt;SLA&lt;/span&gt; could specify that as close to 100% as possible of the requests return HTTP codes within the 2xx range.. 
suppressing any 3xx, 4xx or 5xx codes from coming back to the outside world.&lt;br /&gt;&lt;br /&gt;The issues with making &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_18"&gt;apache&lt;/span&gt; do this are as follows:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_19"&gt;ProxyTimeout&lt;/span&gt; is global or scoped to the particular &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_20"&gt;vhost&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_21"&gt;ErrorDocuments&lt;/span&gt; still return the error code (503, 404, etc)&lt;/li&gt;&lt;li&gt;No way to tie &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_22"&gt;ErrorDocuments&lt;/span&gt; and &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_23"&gt;ProxyTimeouts&lt;/span&gt; to particular requests.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;A key insight from Ronald Park is to use mod_rewrite and then pass various environment arguments to mod_proxy that are specific to the URL being addressed by mod_rewrite.  
This was the approach taken by Ronald Park in his attempts to solve this problem in &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_24"&gt;apache&lt;/span&gt; 2.0.x &lt;a href="http://www.nabble.com/mod_proxy-timeouts-td15278208.html"&gt;here&lt;/a&gt; and &lt;a href="http://www.mail-archive.com/dev@httpd.apache.org/msg39924.html"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The below example is a rewrite rule that makes no changes to the URL itself for a &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_25"&gt;JS&lt;/span&gt; &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_26"&gt;API&lt;/span&gt; presumably returning data in &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_27"&gt;JSON&lt;/span&gt;.&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_28"&gt;RewriteRule&lt;/span&gt; ^/api/(.*).js\?*(.*)$ http://backendproxy/api/$1.js?$2 [P,QSA,E=proxy-timeout:900ms,E=error-suppress:true,E=error-headers:/errorapi.js.HTTP,E=error-document:/errorapi.js]&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;With the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_29"&gt;SLA&lt;/span&gt; enforcement &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_30"&gt;modifications&lt;/span&gt; enabled, the URL will return data from the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_31"&gt;backend&lt;/span&gt; system within 900ms or a timeout occurs.   
At this point &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_32"&gt;apache&lt;/span&gt; will stop waiting for the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_33"&gt;backend&lt;/span&gt; response and serve back the static files &lt;code&gt;/errorapi.js.HTTP&lt;/code&gt; as HTTP headers and &lt;code&gt;/&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_34"&gt;errorapi&lt;/span&gt;.&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_35"&gt;js&lt;/span&gt;&lt;/code&gt; as contents.&lt;br /&gt;&lt;code&gt;&lt;br /&gt;$cat /var/www/html/errorapi.js.HTTP&lt;br /&gt;Status: 204&lt;br /&gt;Content-type: application/javascript&lt;br /&gt;&lt;br /&gt;$cat /var/www/html/&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_36"&gt;errorapi&lt;/span&gt;.&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_37"&gt;js&lt;/span&gt;&lt;br /&gt;var xxx_&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_38"&gt;api&lt;/span&gt;_data={data:[]}; /* ERR */&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;There are four environment variables the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_39"&gt;SLA&lt;/span&gt; hack looks for:&lt;ul&gt;&lt;li&gt;&lt;code&gt;proxy-timeout:&lt;/code&gt; - time in seconds or &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_40"&gt;milliseconds&lt;/span&gt; to wait until timing out&lt;/li&gt;&lt;li&gt;&lt;code&gt;error-suppress:&lt;/code&gt; - true/false switch on suppressing all non 2xx errors from the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_41"&gt;backend&lt;/span&gt;.&lt;/li&gt;&lt;li&gt;&lt;code&gt;error-headers:&lt;/code&gt; - file of syntax correct HTTP headers to return to the client&lt;/li&gt;&lt;li&gt;&lt;code&gt;error-document:&lt;/code&gt; - file of content body to be returned to the client&lt;/li&gt;&lt;/ul&gt;Leaving off the &lt;code&gt;proxy-timeout&lt;/code&gt; will only suppress errors from the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_42"&gt;backend&lt;/span&gt; 
after the global timeout occurs. Leaving off &lt;code&gt;error-suppress:true&lt;/code&gt; will ensure that the 5xx timeout error from mod_&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_43"&gt;proxy&lt;/span&gt;_http is returned intact to the client.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://github.com/nealrichter/mod_proxy_http_sla/tree/master"&gt;Source code here&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;There are two versions checked into &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_44"&gt;github&lt;/span&gt; for &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_45"&gt;Ubuntu&lt;/span&gt; 9.04's &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_46"&gt;apache&lt;/span&gt;2 2.2.11 and &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_47"&gt;Centos&lt;/span&gt; &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_48"&gt;el&lt;/span&gt;5.2's &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_49"&gt;httpd&lt;/span&gt; 2.2.3-11.  It's advisable to diff the changes with the 'stock' file and likely re-do hack code in your version of &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_50"&gt;apache&lt;/span&gt; 2.2.  See Ron Park's code for 2.0.x and fold in the other mods supporting error-suppress etc.&lt;br /&gt;&lt;br /&gt;The hack is being tested in a production environment, stay tuned.  This will get posted to the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_51"&gt;apache&lt;/span&gt;-&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_52"&gt;dev&lt;/span&gt; list..hopefully with responses suggesting &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_53"&gt;improvements&lt;/span&gt;.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Update for 2011:  This has handled billions of requests per month at this point and works great.  
No issues.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-7017169098304195036?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/7017169098304195036/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=7017169098304195036' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/7017169098304195036'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/7017169098304195036'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2009/07/hacking-apaches-modproxyhttp-to-enforce.html' title='Hacking Apache&apos;s mod_proxy_http to enforce an SLA'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-6691558891819292216</id><published>2009-07-03T10:18:00.000-07:00</published><updated>2009-07-03T10:28:50.216-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='highly available'/><category scheme='http://www.blogger.com/atom/ns#' term='apache'/><title type='text'>HTTP request with SLA</title><content type='html'>Here's a (nonAI) problem I'd like to solve.  Configure apache to receive a request and proxy/forward it off to a backend app server (tomcat) .. wait a specified period of time ... 
if no response is received, send back a static file or a return code like 204.&lt;br /&gt;&lt;br /&gt;Essentially it's a combination of the mod_alias and mod_rewrite directives below, plus a timeout.  The example below uses Solr without loss of generality.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;RedirectMatch 204 /search/(.*)$&lt;br /&gt;RewriteRule ^/search/(.*)$ http://backend:8080/solr/$1 [P]&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Logic below:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;If url matches rewrite-rule regex then&lt;br /&gt;{&lt;br /&gt;        set timer for 500ms with timer_callback()&lt;br /&gt;        force proxy to second url after rewriting it&lt;br /&gt;        if(response from proxy is &amp;gt; 199 and &amp;lt; 300)&lt;br /&gt;           return response&lt;br /&gt;        else&lt;br /&gt;           return 204 or static default file&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;timer_callback()&lt;br /&gt;{&lt;br /&gt;       return 204 or static default file&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Instead of returning 204 one could also serve back a static file like /noresults.xml.&lt;br /&gt;&lt;br /&gt;The general idea is to expose a url that has a near-guaranteed response time limit (assuming apache is alive) where a 204 or a static default is acceptable behaviour.  
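&lt;br /&gt;&lt;br /&gt;For comparison, here is a rough sketch of how close stock directives seem to get. It assumes Apache 2.2 with mod_proxy, mod_proxy_http and mod_rewrite loaded, reuses the illustrative /search/ URL and /noresults.xml file from above, and has not been run against a real backend. Which status mod_proxy emits when the backend is slow or down may vary by version, so several are mapped.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;# Sketch only, under the assumptions stated above&lt;br /&gt;RewriteEngine On&lt;br /&gt;# Proxy /search/... to the backend, as in the rule above&lt;br /&gt;RewriteRule ^/search/(.*)$ http://backend:8080/solr/$1 [P]&lt;br /&gt;# Stop waiting on the backend after 1 second (whole seconds, applies to every proxied URL in this server or vhost)&lt;br /&gt;ProxyTimeout 1&lt;br /&gt;# Replace error responses coming from the backend with local error documents&lt;br /&gt;ProxyErrorOverride On&lt;br /&gt;# Use the static default as the body for proxy failures&lt;br /&gt;ErrorDocument 502 /noresults.xml&lt;br /&gt;ErrorDocument 503 /noresults.xml&lt;br /&gt;ErrorDocument 504 /noresults.xml&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;This falls short in three ways: the timeout is in whole seconds rather than 500ms, it cannot differ per URL, and the client still sees the 5xx status code even though the body is the static default.&lt;br /&gt;&lt;br /&gt;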
I suspect that we'll need to write an apache module to do this, yet surely this question has been asked and solved before!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-6691558891819292216?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/6691558891819292216/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=6691558891819292216' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/6691558891819292216'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/6691558891819292216'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2009/07/http-request-with-sla.html' title='HTTP request with SLA'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-4960399364472413264</id><published>2009-03-07T16:12:00.000-08:00</published><updated>2009-03-07T16:52:34.011-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='law'/><category scheme='http://www.blogger.com/atom/ns#' term='copyright'/><title type='text'>Copyright, Licenses and First Sale doctrine</title><content type='html'>Very interesting 'Legally Speaking' article in the &lt;a 
href="http://portal.acm.org/toc.cfm?id=1467247&amp;amp;coll=ACM&amp;amp;dl=ACM&amp;amp;type=issue&amp;amp;idx=J79&amp;amp;part=magazine&amp;amp;WantType=Magazines&amp;amp;title=Communications%20of%20the%20ACM&amp;amp;CFID=25070395&amp;amp;CFTOKEN=55762258"&gt;March 2009 CACM&lt;/a&gt;, &lt;a href="http://portal.acm.org/citation.cfm?id=1467247.1467258&amp;amp;coll=ACM&amp;amp;dl=ACM&amp;amp;idx=J79&amp;amp;part=magazine&amp;amp;WantType=Magazines&amp;amp;title=Communications%20of%20the%20ACM&amp;amp;CFID=25070395&amp;amp;CFTOKEN=55762258" class="medium-text" target="_self"&gt;&lt;strong&gt;When is a "license" really a sale?&lt;/strong&gt;&lt;/a&gt; by &lt;a href="http://people.ischool.berkeley.edu/%7Epam/"&gt;Pamela Samuelson&lt;/a&gt;.  She covers the principle of '&lt;a href="http://en.wikipedia.org/wiki/First-sale_doctrine"&gt;first sale&lt;/a&gt;' and the recent decisions in UMG v. Augusto and &lt;a href="http://www.citizen.org/litigation/forms/cases/CaseDetails.cfm?cID=437"&gt;Vernor v. Autodesk&lt;/a&gt;.  Essentially both Augusto and Vernor were selling used CDs and software on eBay and the copyright owners of the products went after them for these sales.  Both parties are using the first-sale doctrine to defend their rights to sell the media.  The copyright owners are trying to use contract law and the licenses to restrict transfer of the physical media and the right to use it.  Samuelson predicts that both cases will end up before the 9th Circuit Court of Appeals.&lt;br /&gt;&lt;br /&gt;I hesitate to issue an opinion here; I see both sides.  More debate &lt;a href="http://blogs.siliconvalley.com/gmsv/2008/05/this-garage-sale-held-pursuant-to-the-ruling-in-vernor-v-autodesk.html"&gt;here&lt;/a&gt;, &lt;a href="http://www.blog.cadnauseam.com/2009/02/21/vernor-v-autodesk-why-i-think-autodesk-is-right/"&gt;here&lt;/a&gt; and &lt;a href="http://williampatry.blogspot.com/2008/05/first-sale-victory-in-vernor.html"&gt;here&lt;/a&gt;.  
A nice little post on &lt;a href="http://blogs.lib.berkeley.edu/shimenawa.php/2008/05/21/on_owning_books"&gt;first sale here&lt;/a&gt;.  From the perspective of books and music, I side with the first sale doctrine.. and mostly see the 'one click' license notices as problematic.  If you receive a physical copy of something for $$ or as a gift, you should be allowed to resell it at your whim.&lt;br /&gt;&lt;br /&gt;It's worth hunting down the &lt;a href="http://www.informatik.uni-trier.de/%7Eley/db/indices/a-tree/s/Samuelson:Pamela.html"&gt;previous articles&lt;/a&gt; in this series.. they stretch back many years.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-4960399364472413264?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/4960399364472413264/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=4960399364472413264' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4960399364472413264'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4960399364472413264'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2009/03/copyright-licenses-and-first-sale.html' title='Copyright, Licenses and First Sale doctrine'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-1321339897139071143</id><published>2009-03-05T22:36:00.000-08:00</published><updated>2009-03-05T22:52:17.318-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='evolutionary algorithms'/><category scheme='http://www.blogger.com/atom/ns#' term='memoriam'/><category scheme='http://www.blogger.com/atom/ns#' term='germany'/><title type='text'>In memoriam -  Prof. Dr. Ingo Wegener, 1950-2008</title><content type='html'>Ingo Wegener passed away last November, see his obituary here: &lt;a href="http://ls2-www.cs.uni-dortmund.de/%7Ewegener/obituary.html"&gt;Nachruf/Obituary.&lt;/a&gt;  He taught at &lt;a href="http://www.tu-dortmund.de"&gt;University of Dortmund/TU Dortmund&lt;/a&gt; and ran a group of Evolutionary Algorithm researchers looking into the basic theory and run-time analysis of EAs.  His papers and the basic approach of his 'school' have been very influential in my PhD research and he was quite encouraging of my proposed ideas.  
If at the end of our lives we are professionally remembered half as fondly as he will be then that is success.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-1321339897139071143?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/1321339897139071143/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=1321339897139071143' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1321339897139071143'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1321339897139071143'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2009/03/in-memoriam-prof-dr-ingo-wegener-1950.html' title='In memoriam -  Prof. Dr. 
Ingo Wegener, 1950-2008'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-9158106228953902599</id><published>2009-02-22T21:48:00.001-08:00</published><updated>2009-02-22T22:02:33.688-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='behavior'/><category scheme='http://www.blogger.com/atom/ns#' term='search engines'/><title type='text'>Predicting search engine switching behavior</title><content type='html'>Following up on the &lt;a href="http://aicoder.blogspot.com/2009/02/can-we-measure-googles-monopoly-like.html"&gt;previous post&lt;/a&gt;, I found a few interesting papers (via Google) on user switching behavior of search engines.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.springerlink.com/content/p72176q523241436/"&gt;An Analysis of Search Engine Switching Behavior Using Click Streams&lt;/a&gt;&lt;br /&gt;Juan &amp;amp;  Chang of Yahoo Inc&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.ils.unc.edu/ISSS/papers/papers/pedersen.pdf"&gt;Making Sense of Search Result Pages&lt;/a&gt; by  Pedersen of Yahoo&lt;/li&gt;&lt;li&gt;&lt;a href="http://portal.acm.org/citation.cfm?id=1367712"&gt;Defection detection: predicting search engine switching&lt;/a&gt;&lt;br /&gt;Heath &amp;amp; White of Microsoft&lt;/li&gt;&lt;li&gt;&lt;a href="http://portal.acm.org/citation.cfm?id=1390344"&gt;Enhancing web search by promoting multiple search engine use&lt;/a&gt;&lt;br /&gt;White, Heath and co-workers at Microsoft&lt;/li&gt;&lt;li&gt;&lt;a href="http://research.microsoft.com/pubs/71393/predicting-target-events-kdd08.pdf"&gt;Stream Prediction Using A Generative Model Based On 
Frequent Episodes In Event Sequences&lt;/a&gt; by Laxman, Tankasali and White of Microsoft&lt;/li&gt;&lt;/ul&gt;Definitely worth reading in detail as the ways of building the models might be applicable to other behaviorally driven events.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-9158106228953902599?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/9158106228953902599/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=9158106228953902599' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/9158106228953902599'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/9158106228953902599'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2009/02/predicting-search-engine-switching.html' title='Predicting search engine switching behavior'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-7642211647852810904</id><published>2009-02-22T16:32:00.000-08:00</published><updated>2009-02-22T17:38:49.139-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='behavior'/><category scheme='http://www.blogger.com/atom/ns#' term='search engines'/><category scheme='http://www.blogger.com/atom/ns#' term='monopoly'/><category scheme='http://www.blogger.com/atom/ns#' term='google'/><title type='text'>Can we measure 
Google's monopoly like PageRank is measured?</title><content type='html'>&lt;a href="http://www.fxpal.com/?p=jeremy"&gt;Jeremy Pickens&lt;/a&gt; posted an &lt;a href="http://wiakli.net/2009/02/22/one-click-away/"&gt;interesting note on his new IR blog: &lt;/a&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;Is it really true that Google is competing on a click-by-click basis?  In the user studies that Google does, which of the following happens more often when the user types in a query to Google, and sees that Google has not succeeded in producing the information that they sought (fails): &lt;/p&gt;&lt;ol&gt;&lt;li&gt;Does the user reformulate his or her query, and click “Search Google” again (one click)?  Or,&lt;/li&gt;&lt;li&gt;Does the user leave Google (one click), and try his or her query on Yahoo or Ask or MSN (second click), instead?&lt;/li&gt;&lt;/ol&gt;&lt;/blockquote&gt;&lt;p&gt;His points about actions 1 versus 2 are very astute. I’d guess that #2 happens a LOT on the # 2-10 search engines. Meaning people give that engine a try.. maybe attempt a reformulation.. then abandon that engine and try on Google. And I’m betting that people ‘abandon’ Google at a far lower rate than other engines.. 
i.e. an asymmetry of abandonment.&lt;/p&gt; &lt;p&gt;I’d love to do the following analysis given a browser log of search behavior:&lt;/p&gt; &lt;p&gt;Form a graph where the major search engines are nodes in the graph &lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_eZUtZFDoqLA/SaH3T_JcPEI/AAAAAAAAACo/s2Vggzab0dM/s1600-h/search_engine_graph.png"&gt;&lt;img style="cursor: pointer; width: 320px; height: 193px;" src="http://3.bp.blogspot.com/_eZUtZFDoqLA/SaH3T_JcPEI/AAAAAAAAACo/s2Vggzab0dM/s320/search_engine_graph.png" alt="" id="BLOGGER_PHOTO_ID_5305793758759763010" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;For each pair of searches found in the log at time t and time t+1 for a given user, increment the counter on the edge SearchEngine(t) -&gt; SearchEngine(t+1). Once the entire log is processed, normalize the weights on the edges leaving each node so that they sum to 1.&lt;/p&gt; &lt;p&gt;We now have a Markov chain of engine usage behavior. The directed edges in the graph represent the probability of switching to another engine; self-loops are the probability of sticking with the current engine.&lt;/p&gt; &lt;p&gt;If we calculate the stationary distribution of this matrix of transition probabilities, we should have a probability distribution that closely matches the market shares of the major engines. (FYI - this is what PageRank version 1.0 is - the stationary distribution of a random walk over the link graph of the entire web)&lt;/p&gt; &lt;p&gt;What else can we do? We can analyze it like it’s a random walk and calculate the expected # of searches until a given user of any internet search engine will end up using Google. If the probabilities on the graph are highly asymmetric.. which I think they are.. this is a measure of the monopolistic power of people’s Google habit.&lt;/p&gt; &lt;p&gt;This should also predict the lifetime of a given ‘new’ MSN Live or Ask.com user.. 
meaning the number of searches they do before abandoning it for some other engine.&lt;/p&gt; &lt;p&gt;Predicted End Result: Google is the near-absorbing state of the graph.. meaning that all other engines are transient states on the route to Google sucking up market share. Of course this is patently obvious unless one of the bigs changes the game.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-7642211647852810904?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/7642211647852810904/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=7642211647852810904' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/7642211647852810904'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/7642211647852810904'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2009/02/can-we-measure-googles-monopoly-like.html' title='Can we measure Google&apos;s monopoly like PageRank is measured?'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_eZUtZFDoqLA/SaH3T_JcPEI/AAAAAAAAACo/s2Vggzab0dM/s72-c/search_engine_graph.png' height='72' 
width='72'/><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-1053492921501455806</id><published>2009-02-21T21:37:00.000-08:00</published><updated>2009-02-21T20:38:19.631-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='analytics'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>Scalable Analytics - random notes</title><content type='html'>As long as I'm posting about 'rethinking' any reliance on an RDBMS for 'big data' systems, here are a few more notes:&lt;br /&gt;&lt;br /&gt;This post on highscalability.com about &lt;a href="http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data"&gt;Rackspace and MapReduce&lt;/a&gt; is highly enlightening.  The post takes you through a failed use of MySQL for large scale analytics and their conversion to Hadoop.&lt;br /&gt;&lt;br /&gt;I can't say I know (yet) the pros and cons of putting structured bigdata on a 'column store DB' versus Hadoop+HDFS.  Will probably end up using both systems in various ways.  Currently exploiting Sam Tingleff's &lt;a href="http://github.com/samtingleff/dgrid/tree/master"&gt;DGrid&lt;/a&gt; and poking at &lt;a href="http://www.luciddb.org/"&gt;LucidDB&lt;/a&gt; for "row filtering and aggregation" style analytics apps.&lt;br /&gt;&lt;br /&gt;Looking forward to setting up &lt;a href="http://hadoop.apache.org/hive/"&gt;Hive&lt;/a&gt; and &lt;a href="http://hadoop.apache.org/pig/"&gt;Pig&lt;/a&gt; next.&lt;br /&gt;&lt;br /&gt;On the plus side for MySQL, &lt;a href="http://dev.mysql.com/doc/refman/5.0/en/federated-storage-engine.html"&gt;the federated engine&lt;/a&gt; has been quite useful for accumulating data from a sharded/partitioned MySQL setup.. as long as the data being accumulated is under 100K rows; beyond that it seems to hit a wall.  
It's also quite brittle if your MySQL instances are having any performance issues.. a failed connection can cause other ETLs that depend on that connection to fail in odd ways.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-1053492921501455806?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/1053492921501455806/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=1053492921501455806' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1053492921501455806'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1053492921501455806'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2009/02/scalable-analytics-shifting-from-mysql.html' title='Scalable Analytics - random notes'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-3294080475271378912</id><published>2009-02-20T15:38:00.000-08:00</published><updated>2009-02-21T15:18:06.799-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='bigdata'/><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category scheme='http://www.blogger.com/atom/ns#' term='scalability'/><title type='text'>Scalable data storage for model learning</title><content type='html'>I'll admit it, sometimes as an AI &amp;amp; Data wonk you want all your data 
structured and easily accessible via slicing and dicing, so SQL databases and relational schemas look like a great idea.  And it works for a while... until you're dealing with a torrent of incoming data.&lt;br /&gt;&lt;br /&gt;This should not be unfamiliar: the &lt;a href="http://archive.ics.uci.edu/ml/"&gt;UC-Irvine datasets&lt;/a&gt; look like a great place to cut your teeth with Machine Learning .. until you realize that many algorithms and software packages written and litmus tested against such data totally fall down on 'big data'.&lt;br /&gt;&lt;br /&gt;This quote from &lt;a href="http://twitter.com/dcancel"&gt;Dave Cancel&lt;/a&gt; on Twitter stuck with me:  "Databases are the training wheels of software development. Fly free brother, fly free. - University of Sed, Grep, &amp;amp; Awk".  My first reaction was, meh.. databases are meant for storing lots of data.    I love and use tools like those for prototyping.. but then move to compiled code + SQL for 'production'.&lt;br /&gt;&lt;br /&gt;Mental sea change!  Let's say you are building a massive scale system for absorbing click data, processing it and turning it into a recommender system.  &lt;a href="http://www.mysqlperformanceblog.com/2008/12/22/high-performance-click-analysis-with-mysql/"&gt;These are the problems you will see using MySQL at scale&lt;/a&gt;.  Hint:  it tips over at 1K read/write operations per second against the same table.&lt;br /&gt;&lt;br /&gt;Don't try and make your read store also your write store.  You may not /really/ need a low latency (for model updates) system.&lt;br /&gt;&lt;br /&gt;More tips from Dave:&lt;br /&gt;&lt;blockquote&gt;As for storage of [models], I suggest storing them in text file (format up to you), 1 per profile, then stick them behind a reverse caching proxy like Varnish. Infinite scale. For extra points store the text files on S3 and use the built-in webserver to serve them to your reverse proxy. 
HTTP is your friend as is REST SOAs.&lt;/blockquote&gt;Here's another dead simple storage mechanism:&lt;br /&gt;    &lt;a href="http://etherpad.com/9JrpcyvXyK"&gt;http://etherpad.com/9JrpcyvXyK&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;"simple DB" projects like &lt;a href="http://project-voldemort.com/"&gt;Voldemort&lt;/a&gt; and &lt;a href="http://tokyocabinet.sourceforge.net/spex-en.html"&gt;Tokyo Cabinet&lt;/a&gt; and &lt;a href="http://memcachedb.org/"&gt;MemcacheDB&lt;/a&gt; are options as well.&lt;br /&gt;&lt;br /&gt;If you can't depend on a room full of DBAs to keep your SQL DBs from being dog slow (or on buying $$$ systems from Oracle and Microsoft), you must think differently.   Pull yourself out of the Math and AI thinking and simplify.  Big Data will eat you alive otherwise.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-3294080475271378912?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/3294080475271378912/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=3294080475271378912' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/3294080475271378912'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/3294080475271378912'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2009/02/scalable-data-storage-for-model.html' title='Scalable data storage for model learning'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-1116856575381006468</id><published>2009-02-18T21:27:00.000-08:00</published><updated>2009-02-18T23:27:22.885-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='human metrics'/><category scheme='http://www.blogger.com/atom/ns#' term='influence metrics'/><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><title type='text'>TunkRank Scoring Improvement</title><content type='html'>Recently &lt;a href="http://twitter.com/dtunkelang"&gt;Daniel Tunkelang&lt;/a&gt; blogged about an &lt;a href="http://thenoisychannel.com/2009/01/13/a-twitter-analog-to-pagerank/"&gt;influence rank for Twitter&lt;/a&gt;.  &lt;a href="http://twitter.com/ealdent"&gt;&lt;span class="fn"&gt;Jason Adams&lt;/span&gt;&lt;/a&gt; took up the implementation challenge and coined it &lt;a href="http://tunkrank.com/"&gt;TunkRank&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://latex.codecogs.com/gif.latex?Influence(X)%20=%20\sum_{Y\in%20Followers(X)}%201%20+%20\frac{p*Influcence(Y))}{%20\left%20\|Following(Y)%20\right%20\|%20\right}"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 384px; height: 47px;" src="http://latex.codecogs.com/gif.latex?Influence(X)%20=%20\sum_{Y\in%20Followers(X)}%201%20+%20\frac{p*Influcence(Y))}{%20\left%20\|Following(Y)%20\right%20\|%20\right}" border="0" alt="" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;After some poking at it, I'm suggesting a scoring improvement.  At the moment the primary rank in the UI is the percentile, though the raw score is given as well.  I checked a few users and put together the table below, and it &lt;span style="font-style: italic;"&gt;feels&lt;/span&gt; wrong.  
It saturates @ 100 too quickly and there is not enough differentiation between people with healthy versus massive influence.&lt;br /&gt;&lt;br /&gt;Why 'feel'?  Human interpretable numbers need a tactile sense to them in my opinion.  One critique of the metric system is that the English system just feels more human compatible: an inch is not too small, a foot is reasonable, a mile is a long way and 100 miles per hour is darn fast.&lt;br /&gt;&lt;br /&gt;I'm proposing two new scoring possibilities.  Both are based upon logarithms and span from 1-100.  The slight difference between them is how 'linear' the resulting rank feels across the accounts I compared.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;LEAST(100,ROUND(POWER(LN( tunkrank-raw-score +1),1.82)))&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://latex.codecogs.com/gif.latex?TunkRank(X)%20=%20ln(Influence(X)%20+%201)^{1.82}"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 307px; height: 19px;" src="http://latex.codecogs.com/gif.latex?TunkRank(X)%20=%20ln(Influence(X)%20+%201)^{1.82}" border="0" alt="" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;ol start="2"&gt;&lt;li&gt;LEAST(100,ROUND(LN( tunkrank-raw-score+1)/LN(3.5) * 10))&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://latex.codecogs.com/gif.latex?TunkRank(X)%20=%2010%20*%20ln(Influence(X)%20+%201)/ln(3.5)"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 374px; height: 16px;" src="http://latex.codecogs.com/gif.latex?TunkRank(X)%20=%2010%20*%20ln(Influence(X)%20+%201)/ln(3.5)" border="0" alt="" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;What are the constants?  They are magic numbers to map Barack Obama to a TunkRank of 100 as well as provide an interesting spread between the test accounts below.  Comments welcome!  Which is my choice?  I can't decide.. 
#1 smells more accurate, #2 tastes more natural.&lt;br /&gt;&lt;br /&gt;Yes this is an inexact science.&lt;br /&gt;&lt;br /&gt;Possible TunkRank bug?  Check out dewitt's rank.. looks off given his number of followers and that he's an influential guy from Google.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;NAME       PERCENTILE RAW SCORE  NEW SCORE #1 NEW SCORE #2&lt;br /&gt;BarackObama        100     277770        100        100&lt;br /&gt;wilw               100      79118         82         90&lt;br /&gt;guykawasaki        100      62543         79         88&lt;br /&gt;JasonCalaca        100      59075         78         88&lt;br /&gt;THErealDVOR        100      43207         74         85&lt;br /&gt;anamariecox        100      38177         73         84&lt;br /&gt;WilliamShat        100      13932         61         76&lt;br /&gt;fredwilson         100      13340         60         76&lt;br /&gt;abdur              100       1351         36         58&lt;br /&gt;johnhcook           99        407         26         48&lt;br /&gt;johndcook           94         61         13         33&lt;br /&gt;gutelius            84         20          8         24&lt;br /&gt;nealrichter         81         16          7         23&lt;br /&gt;ealdent             80         16          7         23&lt;br /&gt;dtunkelang          79         15          6         22&lt;br /&gt;dewitt               1          2          1          9&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-1116856575381006468?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/1116856575381006468/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=1116856575381006468' title='2 Comments'/><link rel='edit' 
type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1116856575381006468'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1116856575381006468'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2009/02/tunkrank-scoring-improvement.html' title='TunkRank Scoring Improvement'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-4930637324772228411</id><published>2009-01-27T07:22:00.000-08:00</published><updated>2009-01-27T07:26:34.008-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='blog'/><category scheme='http://www.blogger.com/atom/ns#' term='enterprise search'/><title type='text'>response to Noisy Channel post on Lucid Imagination</title><content type='html'>&lt;div class="entry"&gt;     &lt;p&gt;&lt;span style="font-size:85%;"&gt;Cross posted to my blog since it's a long response ;-)&lt;/span&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;To: Daniel Tunkelang&lt;/p&gt;&lt;p&gt;RE: Noisy Channel blog &lt;a href="http://thenoisychannel.com/2009/01/27/lucid-imagination/"&gt;post on Lucid Imagination&lt;/a&gt;&lt;br /&gt;&lt;/p&gt; &lt;p&gt; I don’t think it’s their aim to compete with Enterprise search directly (business suicide), though I suspect they might pressure the pricing in the mid-market of search. The small market has been mostly eliminated by Google/Yahoo/MSN site search and open source engines.&lt;/p&gt; &lt;p&gt; Note also that they do not seem to (yet) provide support for Nutch or Droids.. meaning that they are missing a spidering/crawling engine. Same with Tika (office document support). 
Search result clustering may be coming soon via SOLR-769. No content-management or versioning. (These are fixable pieces given all the open source out there)&lt;/p&gt; &lt;p&gt; There is no good native support for rich taxonomies in Solr/Lucene, nor is there native support for some of the interesting semantic-web data driven features. No self-learning or auto-personalization of results. No analytics (though one could go elsewhere for that).&lt;/p&gt; &lt;p&gt; Lucid is also not offering a hosted Solr service .. so they are not an SaaS play either.&lt;/p&gt; &lt;p&gt; All that said, they obviously have some huge wins within the software industry.. but it’s a tough road to go after accounts like Home Depot, Albertson’s, or the government entities. &lt;/p&gt; &lt;p&gt; Enterprise search is mostly about finished feature sets and a near full admin GUI for non-programmers. The question in these lean economic times is whether a given customer considering “build versus buy” is willing to risk starting a professional services engagement to build what they want for cheap, versus purchasing a commercial ES product with way more features than they think they need. &lt;/p&gt; &lt;p&gt; I do think that a smart customer will have new leverage during the sales cycle to credibly threaten the ‘build’ option and get the ‘buy’ price down. 
And Lucid certainly should limit the ability of the ES companies to get a customer bought in and then milk them for professional services, integration and customization fees… Lucid provides a credible switching threat to cut bait and start over.&lt;/p&gt; &lt;p&gt; Google, Yahoo and open source projects like Lucene have commoditized basic search, so ES is about value-added features, innovative R&amp;amp;D and taking away customer pain and complexity.&lt;/p&gt; &lt;p&gt; Some of the people in Lucid have big plans (Grant Ingersoll comes to mind), and there is absolutely no question that Lucene has made some search vendors look like dinosaurs with slow engines and archaic index structures. &lt;/p&gt; &lt;p&gt; It will be some time before open source catches up to ES.. but it just might not be as long as some would hope.&lt;/p&gt; &lt;p&gt;Disclaimer:  The above is my opinion and some fact-looking statements might be wrong.. so Lucene guys jump in!&lt;/p&gt;          &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-4930637324772228411?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/4930637324772228411/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=4930637324772228411' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4930637324772228411'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4930637324772228411'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2009/01/response-to-noisy-channel-post-on-lucid.html' title='response to Noisy Channel post on Lucid 
Imagination'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-958226152920300844</id><published>2009-01-26T14:28:00.000-08:00</published><updated>2009-01-26T14:55:43.592-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='solr'/><category scheme='http://www.blogger.com/atom/ns#' term='search engines'/><category scheme='http://www.blogger.com/atom/ns#' term='startups'/><category scheme='http://www.blogger.com/atom/ns#' term='htdig'/><category scheme='http://www.blogger.com/atom/ns#' term='consulting'/><title type='text'>Lucid Imagination and Sematex</title><content type='html'>Kudos to the Solr/Lucene gang for launching &lt;a href="http://www.lucidimagination.com/"&gt;Lucid Imagination&lt;/a&gt;.  Grant Ingersoll's &lt;a href="http://lucene.grantingersoll.com/2009/01/26/lucid-imagination/"&gt;announcement&lt;/a&gt;.  People involved &lt;a href="http://www.lucidimagination.com/About/Technical-Leadership/"&gt;here&lt;/a&gt; and &lt;a href="http://www.lucidimagination.com/About/Management/"&gt;here&lt;/a&gt;.  Some time ago Otis Gospodnetić launched &lt;a href="http://www.sematext.com/"&gt;Sematext&lt;/a&gt;.   Good luck to Lucid and Sematext!&lt;br /&gt;&lt;br /&gt;Both of these companies follow the 'support and consulting' model.  
This is wise, as going into Enterprise search directly is a tough road, and competing with Endeca, Verity (Autonomy), FAST (Microsoft), GoogleBox and the other &lt;a href="http://en.wikipedia.org/wiki/List_of_enterprise_search_vendors"&gt;vendors&lt;/a&gt; would be suicidal.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Aside:&lt;/span&gt;&lt;br /&gt;Long ago (2003) I thought of hanging up a shingle for supporting HtDig (a once-popular CGI-based search engine), but wisely decided that would be a mistake given that even then I could see that Doug Cutting's Java Lucene and Nutch were going to smoke the creaky 8+ year old C++ indexing kernel.  I ended up getting RightNow Tech to sponsor conversion of the guts to CLucene, where it still runs today indexing many tens of millions of documents.  Then Solr was announced .... and HtDig development died and I started using Solr.&lt;br /&gt;&lt;br /&gt;Just touched base with Geoff Hutchinson the other day and we're going to release the 4.0 CLucene branch of HtDig, and put up an announcement of HtDig end-of-life and encourage people to migrate to Solr.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-958226152920300844?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/958226152920300844/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=958226152920300844' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/958226152920300844'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/958226152920300844'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2009/01/lucid-imagination-and-sematex.html' title='Lucid 
Imagination and Sematex'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-1382081250991966620</id><published>2009-01-26T09:32:00.000-08:00</published><updated>2009-01-26T09:58:18.410-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='solr'/><category scheme='http://www.blogger.com/atom/ns#' term='classification'/><category scheme='http://www.blogger.com/atom/ns#' term='text summarization'/><title type='text'>Text Classification with Solr</title><content type='html'>Starting to look at using &lt;a href="http://lucene.apache.org/solr/"&gt;Solr/&lt;/a&gt;&lt;a href="http://lucene.apache.org/"&gt;Lucene&lt;/a&gt; for text mining.  Between OpenNLP, the Python &lt;a href="http://www.nltk.org/"&gt;Natural Language Toolkit &lt;/a&gt;and various other projects it's time to toss my ad-hoc mishmash of tools and start over.&lt;br /&gt;&lt;br /&gt;Looks like Grant Ingersoll is working on similar things in his &lt;span style="text-decoration: underline;"&gt;&lt;/span&gt;&lt;a href="http://www.manning.com/ingersoll/"&gt;Taming Text&lt;/a&gt; project.  This is a nice beginner's overview of the area as Grant sees it, &lt;a href="http://www.charlottejug.org/wp-content/uploads/2008/10/taming_text-10-15-08.ppt"&gt;Search and Text Analysis PPT&lt;/a&gt;.  
Also looks like others are scheming about blending &lt;a href="http://lucene.apache.org/mahout/"&gt;Mahout&lt;/a&gt; and Solr in some &lt;a href="http://markmail.org/message/5i2om3pxci3ejed7"&gt;future version&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The basic idea is to take an ontology/taxonomy like &lt;a href="http://www.dmoz.org/"&gt;Dmoz&lt;/a&gt; or &lt;a href="http://www.freebase.com"&gt;FreeBase&lt;/a&gt;, with entries of the form {label: "X", tags: "a,b,c,d,e"}, index it and then classify documents into the taxonomy by pushing parsed documents into the Solr search API.  Why?  Lucene/Solr's ability to do weighted term boosting at both search and index time has lots of obvious uses here.&lt;br /&gt;&lt;br /&gt;Now that my readership (by data-mining and semantic-web geeks) is up slightly (i.e. above zero!) due to &lt;a href="http://twitter.com/nealrichter"&gt;Twitter&lt;/a&gt; traffic, I'm hoping people contact me with ideas, code, etc. Heh.&lt;br /&gt;&lt;br /&gt;Initial ideas:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Use More-Like-This code to 'pass in' a term vector without storing it&lt;/li&gt;&lt;li&gt;Write a Solr plugin to execute the search, post-process hits and do any outgoing classification and biasing math.&lt;/li&gt;&lt;/ul&gt;Once this is proven out, then the obvious next step is to figure out how to index the various RDF/OWL datasets out there.  Many of these parts have probably?? 
been done before, I just need to find them, examine their merits and do some LEGO style layering to get a prototype up.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-1382081250991966620?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/1382081250991966620/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=1382081250991966620' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1382081250991966620'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1382081250991966620'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2009/01/text-mining-and-solr.html' title='Text Classification with Solr'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-110920107316654581</id><published>2009-01-02T10:01:00.000-08:00</published><updated>2009-01-02T14:00:38.001-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='resolutions'/><title type='text'>New Year's Resolutions and Goals</title><content type='html'>Here are this year's technical/career goals &amp;amp; resolutions.  Most of these are general and many encompass specific current and forward looking &lt;a href="http://www.othersonline.com/"&gt;work&lt;/a&gt; projects.  
Others are just motivational resolutions.&lt;br /&gt;&lt;br /&gt;Goals and Resolutions:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Turn in Dissertation.  It's 75% complete and the rest is all typewriter work.  Be done.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Be a better &lt;a href="http://en.wikipedia.org/wiki/The_Numerati"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;Numerati&lt;/span&gt;&lt;/a&gt;.  The point of modeling is to &lt;span style="font-style: italic;"&gt;predict&lt;/span&gt;... so this goal is a lifelong career goal with a new label.&lt;/li&gt;&lt;li&gt;Practice at &lt;a href="http://steve-yegge.blogspot.com/2008/06/done-and-gets-things-smart.html"&gt;Done and Get Things Smart&lt;/a&gt; and &lt;a href="http://norvig.com/21-days.html"&gt;Teach Yourself Programming in Ten Years&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Make damn sure that &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;OthersOnline&lt;/span&gt;.com doesn't have any &lt;a href="http://www.google.com/search?q=Fail+Whale"&gt;Fail Whale&lt;/a&gt; events (technical or business).&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Read more tech (academic, business and research blogs) - Seed and water the creative juices.&lt;/li&gt;&lt;li&gt;Economy willing, hire a full-time minion.&lt;/li&gt;&lt;li&gt;See if I can practice some 'startup karma' (hat tip Todd Sawicki) for other startups.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Specific items:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Suck down more '&lt;a href="http://delicious.com/search?p=tag%3Aadvertising+tag%3Acomputational&amp;amp;u=nrichter&amp;amp;chk=&amp;amp;context=userposts&amp;amp;fr=del_icio_us&amp;amp;lc=1"&gt;computational advertising&lt;/a&gt;' research and write some myself.&lt;/li&gt;&lt;li&gt;Cherry pick new &lt;a href="http://delicious.com/search?p=tag%3Asemantic+or+tag%3Asemanticweb&amp;amp;u=nrichter&amp;amp;chk=&amp;amp;context=userposts&amp;amp;fr=del_icio_us&amp;amp;lc=1"&gt;Semantic&lt;/a&gt; techniques from the rat's nest of the Semantic 
Web.&lt;/li&gt;&lt;li&gt;&lt;a href="http://delicious.com/search?p=tag%3Aextraction+tag%3Adatamining&amp;amp;u=nrichter&amp;amp;chk=&amp;amp;context=userposts&amp;amp;fr=del_icio_us&amp;amp;lc=1"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;NLP&lt;/span&gt; and Extraction&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Sharpen skills from classic &lt;a href="http://delicious.com/search?p=tag%3Amodeling+or+tag%3Asampling&amp;amp;u=nrichter&amp;amp;chk=&amp;amp;context=userposts&amp;amp;fr=del_icio_us&amp;amp;lc=1"&gt;modeling/filtering/sampling&lt;/a&gt; methods.&lt;/li&gt;&lt;li&gt;&lt;a href="http://delicious.com/search?p=database+column&amp;amp;u=nrichter&amp;amp;lc=1"&gt;Column &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;DBs&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Modern &lt;a href="http://delicious.com/search?context=userposts&amp;amp;p=mapreduce&amp;amp;lc=1&amp;amp;u=nrichter"&gt;Map-Reduce&lt;/a&gt;&lt;/li&gt;&lt;li&gt;More &lt;a href="http://delicious.com/search?context=userposts&amp;amp;p=scalability&amp;amp;lc=1&amp;amp;u=nrichter"&gt;Scalability&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Data mining from &lt;a href="http://delicious.com/nrichter/vldb"&gt;VLDB&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Contribute to &lt;a href="http://delicious.com/search?context=userposts&amp;amp;p=contribute%20opensource&amp;amp;lc=1&amp;amp;u=nrichter"&gt;open source projects&lt;/a&gt; again&lt;/li&gt;&lt;li&gt;File patent(s) and publish a paper(s)&lt;/li&gt;&lt;li&gt;Clean the Garage&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Thank heavens I have all year.  This is really a post that should evolve all year.. why can't some blog posts be Wiki-like and not so time ordered?  
&lt;span style="font-style: italic;"&gt;I reserve the right to violate blog-etiquette laws and edit this post.&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-110920107316654581?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/110920107316654581/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=110920107316654581' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/110920107316654581'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/110920107316654581'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2009/01/new-years-resolutions-and-goals.html' title='New Year&apos;s Resolutions and Goals'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-1851380573123393654</id><published>2008-12-28T19:26:00.000-08:00</published><updated>2008-12-28T20:02:22.698-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='patent'/><category scheme='http://www.blogger.com/atom/ns#' term='intellectual property'/><title type='text'>Statutory Inventions and the Public Domain</title><content type='html'>This is interesting: &lt;a href="http://en.wikipedia.org/wiki/United_States_Statutory_Invention_Registration"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;USPTO&lt;/span&gt; Statutory 
Invention Registration&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Basically it's a way of publishing an invention to the public via the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;USPTO&lt;/span&gt;.  Rarely used for obvious reasons.&lt;br /&gt;&lt;br /&gt;Question #1:  Why don't open source people apply for these &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;USPTO&lt;/span&gt; invention declarations?  It seems to be the patent law equivalent of a BSD/MIT license.. &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;ie&lt;/span&gt; "use/extend it for any purpose but it's still my work and you can't claim it as your own work."&lt;br /&gt;&lt;br /&gt;The &lt;span style="font-style: italic;"&gt;more&lt;/span&gt; interesting tidbit on there is, while obvious, a potential source of great technical material.  Since 1999, when you file a patent application it is published 18 months after the filing date.  Once an application is abandoned, the application is published by the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;USPTO&lt;/span&gt; and it becomes public domain.&lt;br /&gt;&lt;br /&gt;Question #2:  Are rejected patents whose appeals have run out then public domain?  I can't seem to find a clear answer.&lt;br /&gt;&lt;br /&gt;How many rejected/&lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_5"&gt;abandoned&lt;/span&gt; software patents from Microsoft, Oracle, IBM, etc. are out there that contain very valuable algorithms and techniques that are now public domain?  Yes, this is a bit like looking for gold in the trash can...&lt;br /&gt;&lt;br /&gt;Notes:&lt;br /&gt;&lt;br /&gt;The &lt;a href="http://ep.espacenet.com/"&gt;European Patent Office&lt;/a&gt; by treaty publishes many &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_6"&gt;USPTO&lt;/span&gt; patent apps.. 
and honestly has a better interface for getting the status of your patent than I have yet found at the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_7"&gt;USPTO&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;The &lt;a href="http://en.wikipedia.org/wiki/Patent_Reform_Act_of_2005#Publish_patent_applications"&gt;Patent Reform Act of 2005&lt;/a&gt; (Republican sponsored) was an attempt to close the publication 'loophole'.  The &lt;a href="http://en.wikipedia.org/wiki/Patent_Reform_Act_of_2007"&gt;Patent Reform Act of 2007&lt;/a&gt; (Democrat sponsored) keeps the current publication system in place.  Neither is law (yet).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-1851380573123393654?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/1851380573123393654/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=1851380573123393654' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1851380573123393654'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1851380573123393654'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2008/12/statutory-inventions-and-public-domain.html' title='Statutory Inventions and the Public Domain'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-4763033150247064379</id><published>2008-12-27T20:39:00.000-08:00</published><updated>2008-12-28T21:47:11.586-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='resolutions'/><title type='text'>New Year's Resolutions - Blog more</title><content type='html'>New Years resolution - blog more.  I've been very busy doing cool stuff at &lt;a href="http://www.othersonline.com/"&gt;Others Online&lt;/a&gt; and in the process developed some tunnel vision to stay on task.&lt;br /&gt;&lt;br /&gt;Here's to hoping that more blogging will cause me to see things in a different light more easily as well as get me more in 'writing mode' to finish the PhD before summer 2009.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://twitter.com/nealrichter"&gt;Twittering&lt;/a&gt; has replaced blogging as my outlet for the second half of the year.. 
yet the 140 char format isn't much good for a personal musing and research blog.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-4763033150247064379?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/4763033150247064379/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=4763033150247064379' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4763033150247064379'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4763033150247064379'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2008/12/new-years-resolutions-blog-more.html' title='New Year&apos;s Resolutions - Blog more'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-2468542567800913147</id><published>2008-12-27T20:30:00.000-08:00</published><updated>2008-12-27T20:42:56.492-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sentiment analysis'/><title type='text'>Marshall Kirkpatrick and Data Mining</title><content type='html'>[somehow this got lost in drafts in my blog - back in August]&lt;br /&gt;&lt;br /&gt;Marshall &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;KirkPatrick's&lt;/span&gt; &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;RRW&lt;/span&gt; post &lt;a 
href="http://www.readwriteweb.com/archives/four_adfree_ways_that_mined_da.php"&gt;&lt;span style="font-size:100%;"&gt;Four Ad-Free Ways that Mined Data Can Make Money&lt;/span&gt;&lt;/a&gt; is interesting.&lt;br /&gt;&lt;br /&gt;In approx 2002 I wrote version 1.0 of the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;RightNow&lt;/span&gt; Tech sentiment analysis software to analyze the positive versus negative overall tone of incoming support requests/emails in the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;CRM&lt;/span&gt; system.  I called it '&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;Emotix&lt;/span&gt;', but the Marketing people renamed it &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;SmartSense&lt;/span&gt;.  Later Steve &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_6"&gt;Durbin&lt;/span&gt; and I bolted on a &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_7"&gt;POS&lt;/span&gt; tagger to get a bit more accuracy, given that language forms like 'I am not very happy' and 'I am very angry' require the modifiers to be taken into account.&lt;br /&gt;&lt;br /&gt;Basically &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_8"&gt;Emotix&lt;/span&gt; was tasked to attach a numerical positive/negative emotional score to each &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_9"&gt;incoming&lt;/span&gt; request.. such that the queue of requests could be ordered to service angry customers first.  We weren't interested in extreme accuracy.. just a decent ordering that was fairly predictive.&lt;br /&gt;&lt;br /&gt;There were two interesting stories to version 1.0.  The first concerned the negative/&lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_10"&gt;neutral&lt;/span&gt;/positive word dictionary.  Basically my office mate and I sat down and compiled a list of every positive and negative word we could find and put them on a wide numerical scale.  
When it was time for swear words, we shut the door and howled in laughter as we threw mock insults around the room.  The &lt;a href="http://www.amazon.com/Wicked-Words-Put-Downs-Unprintable-Anglo-Saxon/dp/"&gt;Wicked Words&lt;/a&gt; book was an invaluable source of &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_11"&gt;inspiration&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;When it came time to litmus test our word ratings we put co-workers in front of a terminal that would display random words from the list and ask them to agree or &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_12"&gt;disagree&lt;/span&gt; with the rating.  Needless to say we had to forewarn everyone that it WOULD be offensive and that this did not constitute any form of &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_13"&gt;harassment&lt;/span&gt;.  Watching the process was &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_14"&gt;excruciating&lt;/span&gt; and &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_15"&gt;hilarious&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;The second &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_16"&gt;humorous&lt;/span&gt; story concerned testing and training the system on real support messages.  My favorite data set was from a well known customer that made specialty ice cream.   Their customers tended to begin each service contact with a large block of text extolling the virtues of the company and its ice cream.. with the negative comments on their experience with the ice cream last... usually written in &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_17"&gt;apologetic&lt;/span&gt; terms.  &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_18"&gt;Obviously&lt;/span&gt; the creamery wanted to get the customers with real negative &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_19"&gt;experience&lt;/span&gt; problems to the top of the queue.. 
ice cream is all about the eating experience.  But how do you filter out and &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_20"&gt;bias&lt;/span&gt; for the overall '&lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_21"&gt;fan mail&lt;/span&gt;' tone of 90% of the requests?  Fun stuff to work on.  &lt;a href="http://www.cs.montana.edu/%7Erichter/affective_rating.pdf"&gt;Some details here&lt;/a&gt;, others &lt;a href="http://www.google.com/patents?id=vbGRAAAAEBAJ"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Recently Steve extended it to work with both the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_22"&gt;RNT&lt;/span&gt; Voice product and the marketing-automation product as well.  The big lesson here as an engineer is that good enough can be just fine and often it will be used in unexpected ways down the line.  The largely &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_23"&gt;un-refactored&lt;/span&gt; code is still running and processing billions of textual contacts every year.  Ok this is exaggerated a bit.. 
but it hasn't really been rewritten, just optimized frequently.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-2468542567800913147?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/2468542567800913147/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=2468542567800913147' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/2468542567800913147'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/2468542567800913147'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2008/08/marshall-kirkpatrick-and-data-mining.html' title='Marshall Kirkpatrick and Data Mining'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-6306881584950540467</id><published>2008-09-08T00:02:00.000-07:00</published><updated>2008-09-08T01:02:38.019-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='personal'/><title type='text'>A story about Don Haskins and my father</title><content type='html'>This is not about AI, but a personal story.&lt;br /&gt;&lt;br /&gt;Don &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;Haskins&lt;/span&gt; died yesterday at 78 years old. 
&lt;a href="http://ap.google.com/article/ALeqM5jT8txB2fK-v8iFffuPnqFkE0bS6gD9328EL80"&gt;AP Obituary&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I grew up for the first 14 years in El &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;Paso&lt;/span&gt;, Texas before moving back to the family home in Montana.   &lt;a href="http://en.wikipedia.org/wiki/Don_Haskins"&gt;Don &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;Haskins&lt;/span&gt;&lt;/a&gt; and &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;UTEP&lt;/span&gt; basketball were a big deal.  Coach &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;Haskins&lt;/span&gt; is of course the guy who coached &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;UTEP&lt;/span&gt; (then Texas Western College) to the 1966 NCAA championship with five black players... which was wonderfully rendered into the movie &lt;a href="http://en.wikipedia.org/wiki/Glory_Road_%28film%29"&gt;Glory Road&lt;/a&gt;.  It's hard to overstate what an effect he had on El Paso and UTEP.&lt;br /&gt;&lt;br /&gt;I've always been more of a fan of coaches than players, and &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_6"&gt;Haskins&lt;/span&gt; was at the top of my biased list of basketball idols.  Years ago my father and I finally met him at my grandmother's 70&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_7"&gt;th&lt;/span&gt; birthday party. [&lt;a href="http://www.cs.montana.edu/%7Erichter/haskins2.jpg"&gt;Photo&lt;/a&gt; of me bringing in flowers he brought to the party]  I'd read every book, and article about him but managed to mostly stutter upon meeting him.  My dad gave me some crap about that.&lt;br /&gt;&lt;br /&gt;When Glory Road opened on Friday the 13&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_8"&gt;th&lt;/span&gt; 2006, the family went to see it.  Great flick even with the inaccuracies.  After the movie Dad asked me how I liked it, I said "It was great".  
He then said "A movie about your &lt;span class="nfakPe"&gt;hero&lt;/span&gt; Don &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_9"&gt;Haskins&lt;/span&gt;", I said "yep.. but oh Dad you are &lt;span class="nfakPe"&gt;my&lt;/span&gt; &lt;span class="nfakPe"&gt;hero&lt;/span&gt;".  We both snorted and chuckled as grown men do about emotions. A few minutes later we parted ways with 'I love you' and hugs.&lt;br /&gt;&lt;br /&gt;Dad died the next day and those were the last words we exchanged.  I didn't watch Glory Road again till a couple months ago.. too much emotion.  When I heard the news that &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_10"&gt;Haskins&lt;/span&gt; died, time stopped a bit as I thought about him, my childhood idols and my father. &lt;br /&gt;&lt;br /&gt;The stories about this man that circulated in El Paso and in the family were legendary.  Here's an &lt;a href="http://www.elpasotimes.com/haskins/ci_10407784"&gt;example&lt;/a&gt;, or &lt;a href="http://www.elpasotimes.com/haskins/ci_10407781"&gt;two&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;RIP Coach &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_11"&gt;Haskins&lt;/span&gt;  (and I still miss ya Dad!)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-6306881584950540467?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/6306881584950540467/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=6306881584950540467' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/6306881584950540467'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/6306881584950540467'/><link rel='alternate' type='text/html' 
href='http://aicoder.blogspot.com/2008/09/story-about-don-haskins-and-my-father.html' title='A story about Don Haskins and my father'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-6375468631845203467</id><published>2008-08-29T13:44:00.000-07:00</published><updated>2008-12-28T21:40:00.965-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='open source'/><category scheme='http://www.blogger.com/atom/ns#' term='search engines'/><title type='text'>Open Source Search Engine Rodeo: Solr v. Sphinx v. MySQL-FT</title><content type='html'>Last summer Anthony Arnone and I did a study on the performance of three open source search engines.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html"&gt;MySQL native fulltext search&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.sphinxsearch.com/"&gt;Sphinx SQL full-text search&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://lucene.apache.org/solr/"&gt;SOLR&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;We chose these three as two of them have close ties to MySQL and the other is a well used and performant offering from Apache.  There are many that we skipped.&lt;br /&gt;&lt;br /&gt;Here's the &lt;a href="http://www.cs.montana.edu/%7Erichter/Search_Engine_Rodeo.pdf"&gt;Report in PDF&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Solr was the clear winner.  Sphinx was in a close second with blindingly fast indexing times.&lt;br /&gt;&lt;br /&gt;At this point the report's results are somewhat dated as both Sphinx and Solr are readying new releases.  
So your mileage may vary, and I'm sure Peter Zaitsev and the Sphinx team could show us how to improve the performance of their engine.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 0, 0); font-style: italic;"&gt;Updates:  The Sphinx team contacted me and suggested some ways to improve Sphinx performance. New results will be published some time soon. They will likely also publish a test using Wikipedia as the document repository.&lt;br /&gt;&lt;br /&gt;More Updates:  I have started a new Solr project and may test Sphinx again.&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-6375468631845203467?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/6375468631845203467/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=6375468631845203467' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/6375468631845203467'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/6375468631845203467'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2008/08/open-source-search-engine-rodeo-solr-v.html' title='Open Source Search Engine Rodeo: Solr v. Sphinx v. 
MySQL-FT'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-4386990823888702228</id><published>2008-08-12T21:58:00.000-07:00</published><updated>2008-08-12T22:06:00.480-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='databases'/><category scheme='http://www.blogger.com/atom/ns#' term='mapreduce'/><title type='text'>Great comment on MapReduce</title><content type='html'>Spot on comment from &lt;a href="http://highscalability.com/database-people-hating-mapreduce#comment-926"&gt;&lt;span class="submitted"&gt; Iván de Prado&lt;/span&gt;&lt;/a&gt; on the &lt;a href="http://highscalability.com/database-people-hating-mapreduce"&gt;&lt;span style="font-size:100%;"&gt;Database People Hating on MapReduce&lt;/span&gt;&lt;/a&gt; blog post.&lt;br /&gt;&lt;blockquote&gt;I think this document is comparing things that are not comparable. They are talking about MapReduce as if it were a distributed database. But that's completely wrong. &lt;a class="glossary-term" href="http://highscalability.com/tags/hadoop"&gt;&lt;acronym title="Hadoop: Hadoop is a framework for running applications on large clusters of commodity hardware. Hadoop implements a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed or reexecuted on any node in the cluster.     More on Hadoop"&gt;Hadoop&lt;/acronym&gt;&lt;/a&gt; is a distributed computed platform, not a distributed database prepared for OLAP.&lt;/blockquote&gt;MapReduce is a re-implementation of LISP's map and reduce in a parallel setting.  
Now the function/task that you give to Map is where the rubber meets the road of reading data from some data store.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-4386990823888702228?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/4386990823888702228/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=4386990823888702228' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4386990823888702228'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4386990823888702228'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2008/08/great-comment-on-mapreduce.html' title='Great comment on MapReduce'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-2495491080105206892</id><published>2008-08-12T20:32:00.000-07:00</published><updated>2008-08-12T22:06:23.891-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='databases'/><category scheme='http://www.blogger.com/atom/ns#' term='mapreduce'/><title type='text'>MapReduce versus RDBMS - Round 2</title><content type='html'>Round 2 (for me anyway - this discussion has been raging for a while).  
Nice read on &lt;a href="http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data"&gt;&lt;span style="font-size:100%;"&gt;how &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;Rackspace&lt;/span&gt;&lt;/span&gt; Now Uses &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;MapReduce&lt;/span&gt;&lt;/span&gt; and &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;Hadoop&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;.  They started with shell scripts, evolved to remote &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;RPCs&lt;/span&gt;&lt;/span&gt; of shell scripts, moved to MySQL, &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;iterated&lt;/span&gt;&lt;/span&gt; on MySQL and then jumped to a &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_5"&gt;heterogeneous&lt;/span&gt; &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;Hadoop&lt;/span&gt; + &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_6"&gt;Solr&lt;/span&gt; + &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_7"&gt;HDFS&lt;/span&gt; stack.  &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_8"&gt;Terabytes&lt;/span&gt; of data.&lt;br /&gt;&lt;br /&gt;The MySQL evolution was interesting as I'm going through a similar process of attempting/planning to continually refine MySQL performance.  We need UPDATE statements in a big way, so it's a bit different than appending to log structures.&lt;br /&gt;&lt;br /&gt;I've been playing with a daily summarizing and distributed &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_9"&gt;ETL&lt;/span&gt; with MySQL.  
Basically, with creative use of Views and the Federated Engine one can do a scheduled daily map and reduce.  I hold no hope that this is a solution for &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_10"&gt;ad-hoc&lt;/span&gt; queries; it's not nearly flexible enough.  Wiki page describing this system &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_11"&gt;coming&lt;/span&gt; soon.&lt;br /&gt;&lt;br /&gt;Still trying to find a solution other than a 'union view' across the federated tables from the n data partition servers as the Map.  The Reduce will be a set of stored procedures against the union-view.  Perhaps &lt;a href="http://www.nabble.com/Hadoop-MapReduce-%2B-MySQL-td14649514.html"&gt;this post and hacked code&lt;/a&gt; hold promise for gluing &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_12"&gt;Hadoop&lt;/span&gt; to &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_13"&gt;JDBC&lt;/span&gt;/MySQL storage engines.  This would better enable ad-&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_14"&gt;hoc&lt;/span&gt; queries.&lt;br /&gt;&lt;br /&gt;I also wonder if MySQL Proxy is useful here.. looking into it.. 
but at first glance it doesn't &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_15"&gt;inherently&lt;/span&gt; pattern the distributed Map operation well.&lt;br /&gt;&lt;br /&gt;Open question:  If one could publish MySQL stores to a column oriented DB like &lt;a href="http://monetdb.cwi.nl/projects/monetdb/Home/index.html"&gt;MonetDB&lt;/a&gt; or &lt;a href="http://www.luciddb.org/"&gt;LucidDB&lt;/a&gt; and then do Hadoop map-reduce operations then do I have what I want for ad-hoc queries?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-2495491080105206892?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/2495491080105206892/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=2495491080105206892' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/2495491080105206892'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/2495491080105206892'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2008/08/mapreduce-versus-rdbms-round-2.html' title='MapReduce versus RDBMS - Round 2'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-9031152646507823573</id><published>2008-08-11T14:02:00.000-07:00</published><updated>2008-08-11T15:49:41.886-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' 
term='databases'/><category scheme='http://www.blogger.com/atom/ns#' term='mapreduce'/><category scheme='http://www.blogger.com/atom/ns#' term='dogma'/><title type='text'>MapReduce versus RDBMS</title><content type='html'>I managed to stumble upon an interesting article while looking for a MySQL multi-database federated query tool.&lt;br /&gt;&lt;br /&gt;David &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;DeWitt&lt;/span&gt; and Michael &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;Stonebraker&lt;/span&gt; write: &lt;a href="http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html"&gt;&lt;span style="font-size:100%;"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;MapReduce&lt;/span&gt;: A major step backwards.&lt;/span&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;They rightly point out that &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;MapReduce&lt;/span&gt; is a 25 year old idea.  Lisp has had this functionality for decades.. and it's actually at least 30 years old. &lt;a href="http://portal.acm.org/citation.cfm?id=800132.804322"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;Griss&lt;/span&gt; &amp;amp; &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;Kessler&lt;/span&gt; 1978&lt;/a&gt; is apparently the earliest &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_6"&gt;description&lt;/span&gt; of a parallel Reduce function.  
That said, it's only in the last 10 years that an idea this great could have been implemented widely with the advent of cheap machines.&lt;br /&gt;&lt;br /&gt;Their second point is that &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_7"&gt;MapReduce&lt;/span&gt; is a poor implementation as it doesn't support or utilize indexes.&lt;br /&gt;&lt;blockquote&gt;One could argue that value of &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_8"&gt;MapReduce&lt;/span&gt; is automatically providing parallel execution on a grid of computers. This feature was explored by the DBMS research community in the 1980s, and multiple prototypes were built including Gamma [2,3],  &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_9"&gt;Bubba&lt;/span&gt; [4], and Grace [5]. Commercialization of these ideas occurred in the late 1980s with systems such as &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_10"&gt;Teradata&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;In summary to this first point, there have been high-performance, commercial, grid-oriented &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_11"&gt;SQL&lt;/span&gt; engines (with &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_12"&gt;schemas&lt;/span&gt; and indexing) for the past 20 years. &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_13"&gt;MapReduce&lt;/span&gt; does not fare well when compared with such systems.&lt;/blockquote&gt;Great point and point taken.  However, where are the open source &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_14"&gt;implementations&lt;/span&gt; of the things you mention?  This is a bit of the 'if a tree falls in the woods and no one is there to hear it' problem.  
&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_15"&gt;A major reason MapReduce&lt;/span&gt; has seen uptake (other than being a child of Google) is that an example implementation is available for the Horde to steal, copy, improve &amp;amp; translate.&lt;br /&gt;&lt;br /&gt;The modern user-generated-content web is mostly built on Open Source these days, so the fact that I can get the above technology only in commercial databases makes it a non-starter.&lt;br /&gt;&lt;br /&gt;I'm a &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_16"&gt;SQL&lt;/span&gt; junkie and am searching in vain (it seems so far) for a decent extension to MySQL that does cross-database queries and &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_17"&gt;reduction&lt;/span&gt; of tables I know to be neatly partitioned.  No luck so far.  Starting to look into &lt;span class="a"&gt;other &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_18"&gt;SQL&lt;/span&gt; engines as there may be an &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_19"&gt;ODBC&lt;/span&gt; wrapper for the federation layer.  It's got to be mostly functional and EASY to adopt.. or you'll continue to have people spouting the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_20"&gt;MapReduce&lt;/span&gt; dogma.&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;Postscripts:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Nice summary of Dr. 
&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_21"&gt;Stonebraker's&lt;/span&gt; accomplishments &lt;a href="http://www.sigmod.org/sigmod/record/issues/0503/p13.special.dewitt.pdf"&gt;here&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Funny link to &lt;a href="http://groups.google.com/group/comp.lang.lisp/browse_thread/thread/e14717e3be4bc1b3/456e1d61f2b2ab72?lnk=st&amp;amp;q=Is+there+a+map+reduce+implementation+in+common+lisp%3F+like+hadoop%3F#456e1d61f2b2ab72"&gt;comp.lang.lisp&lt;/a&gt; some newbies asking about if Lisp has Map Reduce.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;blockquote&gt;&lt;/blockquote&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-9031152646507823573?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/9031152646507823573/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=9031152646507823573' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/9031152646507823573'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/9031152646507823573'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2008/08/mapreduce-versus-rdbms.html' title='MapReduce versus RDBMS'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-8909096431817662885</id><published>2008-08-04T08:50:00.000-07:00</published><updated>2008-08-04T22:11:18.618-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='software engineering'/><title type='text'>Improving Software Release Management</title><content type='html'>I just found this in my inbox, &lt;a href="http://www.cio.com/article/print/440101"&gt;&lt;span style="font-size:100%;"&gt;7 Ways to Improve Your Software Release Management&lt;/span&gt;&lt;/a&gt;.  It's an excellent overview of 'doing things differently'.   Pretty similar to the change that I experienced at the last job... and something that &lt;a href="http://korrespondence.blogspot.com/"&gt;Mike &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;Dierken&lt;/span&gt;&lt;/a&gt; at the current job just seems to know instinctively.&lt;br /&gt;&lt;br /&gt;I need to find some stuff on the best ways to do personal lightweight processes.  I'd like to be more efficient in producing software... especially software that is based upon speculative &lt;span style="font-style: italic;"&gt;ideas&lt;/span&gt;.  So much of the time data mining and machine learning coding is subject to the vagaries of the data set you are working against and it's difficult to know ahead of time if a given algorithm will work well.. how much data cleaning needs to be done... etc.&lt;br /&gt;&lt;br /&gt;How can you adapt lightweight processes and the things mentioned above for producing software that is not so cut and dried in what needs to be done?  
In those situations I tend to ping-pong between little process (seat of the pants coding) and too much (excessive research and design before coding start).&lt;br /&gt;&lt;br /&gt;Post script:  Had to add this:&lt;br /&gt;&lt;h1 style="font-weight: normal;"&gt;&lt;a href="http://www.cio.com/article/print/441215"&gt;&lt;span style="font-size:100%;"&gt;Five Things Linus Torvalds Has Learned About Managing Software Projects&lt;/span&gt;&lt;/a&gt;&lt;/h1&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-8909096431817662885?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/8909096431817662885/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=8909096431817662885' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/8909096431817662885'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/8909096431817662885'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2008/08/improving-software-release-management.html' title='Improving Software Release Management'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-4921488169436235526</id><published>2008-07-29T23:55:00.000-07:00</published><updated>2008-07-30T00:35:46.147-07:00</updated><title type='text'>How not to launch software</title><content type='html'>From cnet:&lt;br 
/&gt;&lt;span style="font-size:100%;"&gt;&lt;br /&gt;&lt;a href="http://news.cnet.com/cuil-shows-us-how-not-to-launch-a-search-engine/"&gt;Cuil shows us how not to launch a search engine&lt;/a&gt;&lt;/span&gt;&lt;br /&gt;&lt;blockquote&gt;Google challenger &lt;b&gt;&lt;a href="http://www.cuil.com/"&gt;Cuil&lt;/a&gt;&lt;/b&gt; launched last night in blaze of glory. And it went down in a ball of flames. Immediately after launch, the criticism started to pile on: results were incomplete, weird, and missing.&lt;/blockquote&gt;The various articles on Cuil's failure revealed much about their architecture.  Apparently they are categorizing a user query into a topic and shipping that out to topical servers.   While this sort of 'topical partitioning' is interesting, it has zilch to do with relevance ranking... and suffers from a single-point-of-failure issue.. if that topic server goes down then queries against that topic will get junk results or zero results.&lt;br /&gt;&lt;br /&gt;Questions and points of discussion:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Is it really true that a &lt;span style="font-style: italic;"&gt;data schema partition&lt;/span&gt; results in a better engine than Google? No, a better engine is made by better relevance.  Perhaps this is what the PR/Marketing people focused on rather than relevance.&lt;/li&gt;&lt;li&gt;How do you simulate post-launch load when you have no idea how widely the free press will be distributed?&lt;/li&gt;&lt;li&gt;Free post-launch press is invaluable to your business.. squandering it might be a deathblow.&lt;/li&gt;&lt;li&gt;Why not launch more quietly in the early-adopter tech press and then go try and get mainstream press when you have proven the system?&lt;/li&gt;&lt;li&gt;Trading on your status as ex-Googlers (and not early ones at that) seems VERY dubious.  
Stand on your own feet rather than someone else's.&lt;/li&gt;&lt;li&gt;The absolute hottest area of information retrieval research right now is using user click-streams to improve relevance live and online (learning to rank), as well as to personalize results.  These are differentiating features (if they result in improved relevance).&lt;/li&gt;&lt;li&gt;Cuil keeps ZERO user history and assigns no session/user-ids.  This will make it very difficult to follow this trend.. unless they are using someone else's cookies to do the identification via an analytics partner (no evidence of this).&lt;/li&gt;&lt;li&gt;The other hot area of IR research is using semantic analysis and NLP to break away from simple keyword-based inverted indices.  Hakia still seems to be doing it better than Cuil... or at least will appear to as long as the topical partitions keep crashing under the load.&lt;/li&gt;&lt;li&gt;Risk analysis is a fantastic tool in organizing and prioritizing your work on new products.. it seems they missed that part before deciding to launch.&lt;/li&gt;&lt;/ul&gt;I feel bad for these guys.  Anna Patterson et al. have done some great work in the past and I just hate to see good people stumble like this.&lt;br /&gt;&lt;br /&gt;I still think they are wrong to go out as a consumer engine... Enterprise is a better play.. however if their leading market differentiator is a topical partitioning of back end servers.. 
then they aren't even considering this as individual Enterprise customers may not be big enough to need hundreds of servers to distribute the index like that.&lt;br /&gt;&lt;br /&gt;Hindsight is always 20/20 and hope to hell I am not standing there redfaced as software I helped create fails upon high volume launch.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-4921488169436235526?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/4921488169436235526/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=4921488169436235526' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4921488169436235526'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4921488169436235526'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2008/07/how-not-to-launch-software.html' title='How not to launch software'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-6503810255949959784</id><published>2008-07-27T23:17:00.000-07:00</published><updated>2008-07-27T23:57:53.121-07:00</updated><title type='text'>Dubious results from Cuil .. 
and the Majors</title><content type='html'>I searched for &lt;a href="http://www.cuil.com/search?q=evolution+recombination+mutation&amp;amp;sl=long"&gt;'evolution recombination mutation'&lt;/a&gt; on &lt;a href="http://cuil.com"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;cuil&lt;/span&gt;.com&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The first 2 times, I got a no-results page.  After a few variations and an hour or so, I tried again and got a nice result set.  One really jumped out at me:  'What's Driving Evolution; Mutations or Genetic Recombination'.  The problem is that it links to an intelligent design group disputing Evolution in general.&lt;br /&gt;&lt;br /&gt;http://nwcreation.net/geneticrecombination.html&lt;br /&gt;&lt;br /&gt;After trying the same search on Google, Yahoo, Live.com, Ask and &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;Hakia&lt;/span&gt;.. that same page shows up in the top ten.&lt;br /&gt;&lt;br /&gt;Really?  This is authoritative and the best that modern world-class search can do?  I was hoping for some kind of page summarizing epic battles between Ernst &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;Mayr&lt;/span&gt; and &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;Motoo&lt;/span&gt; &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;Kimura&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;While this is likely caused by both keyword matches via good &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;SEO&lt;/span&gt; and the fact that this page is probably highly linked to... this is a semantic failure!  Imagine if a query about the Holocaust had Holocaust-denial propaganda links appearing above genuine factual information!&lt;br /&gt;&lt;br /&gt;At least &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_6"&gt;Hakia&lt;/span&gt; and Yahoo put up a page refuting the author of the above link in the top 10 results.  
Google and the rest fail to do so.&lt;br /&gt;&lt;br /&gt;Admittedly, not everyone believes in evolution.. and that is fine with me.. but I'm not sure that fact refutes the semantic/authoritative failure of the engines.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-6503810255949959784?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/6503810255949959784/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=6503810255949959784' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/6503810255949959784'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/6503810255949959784'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2008/07/dubious-results-from-cuil-and-majors.html' title='Dubious results from Cuil .. 
and the Majors'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-432437109641342081</id><published>2008-07-27T21:47:00.000-07:00</published><updated>2008-07-27T23:11:02.233-07:00</updated><title type='text'>Anna Patterson's new company - cuil.com</title><content type='html'>Just spotted a &lt;a href="http://www.nytimes.com/2008/07/28/technology/28cool.html?_r=1&amp;amp;oref=login&amp;amp;ref=business&amp;amp;pagewanted=print"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;NYT&lt;/span&gt; article&lt;/a&gt; on a new search engine called &lt;a href="http://www.cuil.com/"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;cuil&lt;/span&gt;&lt;/a&gt; from &lt;a href="http://en.wikipedia.org/wiki/Anna_Patterson"&gt;Anna Patterson &lt;/a&gt;and &lt;a href="http://www-formal.stanford.edu/tjc/"&gt;Tom Costello&lt;/a&gt;.  A while back I read Dr. Patterson's article &lt;a href="http://www.acmqueue.com/modules.php?name=Content&amp;amp;pa=showpage&amp;amp;pid=143&amp;amp;page=1"&gt;"Why Writing a Search Engine is Hard"&lt;/a&gt;.  Nice quick read on the issues.  
I also stumbled upon &lt;a href="http://blog.searchenginewatch.com/blog/060523-175358"&gt;several patents&lt;/a&gt; she wrote.&lt;br /&gt;&lt;br /&gt;So the recent challengers of note are Ask.com, &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;Powerset&lt;/span&gt;, &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;Hakia&lt;/span&gt;, &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;Wikia&lt;/span&gt;, &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;Mahalo&lt;/span&gt; and now &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_6"&gt;Cuil&lt;/span&gt;.  I'm rooting for the algorithmic ones, not really sure how &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_7"&gt;Wikia&lt;/span&gt; and &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_8"&gt;Mahalo&lt;/span&gt; can scale to be non-niche engines.&lt;br /&gt;&lt;br /&gt;While I hope Cuil is successful (I like to see Academics go out and build companies), I'm not sure that it's possible to beat Google at this point.  
It seems far more likely that Microsoft will just try and swallow &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_9"&gt;Hakia&lt;/span&gt; and &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_10"&gt;Cuil&lt;/span&gt; and attempt to brew something out of the parts.&lt;br /&gt;&lt;br /&gt;I recommend reading &lt;a href="http://searchengineland.com/080728-000100.php"&gt;Danny Sullivan's post&lt;/a&gt; on Cuil; he hit most of the obvious points.&lt;br /&gt;&lt;br /&gt;My thoughts:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;I still think building intelligence on top of an existing index/engine is the way to go and I'm not sure that bragging about your index size or your back end architecture is going to get you any &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_19"&gt;meaningful&lt;/span&gt; &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_20"&gt;marketshare&lt;/span&gt;.&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Also, Enterprise Search is still far behind in &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_21"&gt;NLP&lt;/span&gt; technology &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_22"&gt;vis&lt;/span&gt;-a-&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_23"&gt;vis&lt;/span&gt; the big 4 and the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_24"&gt;NLP&lt;/span&gt; &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_25"&gt;startups&lt;/span&gt;.  It's still 1999 there: enterprises are just now figuring out how to expose their vast document sets to an internal crawler and provide a &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_26"&gt;UI&lt;/span&gt; that is not overly simplistic for savvy users.  They've tried the classic approaches of commodity engines and found them wanting.  Link analysis doesn't help either as most of these documents are not web documents with links.  
This just cries out for an approach like Vivisimo's clustering + Hakia's semantics + Delicious' user driven tagging.&lt;br /&gt;&lt;br /&gt;Corporate searchers need a good &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_27"&gt;advanced&lt;/span&gt; interface and results that can be grouped by things like time and originating department.... but not be forced to &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_28"&gt;drown&lt;/span&gt; in overly similar hits. That is a market worth getting into with a $33M &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_29"&gt;VC&lt;/span&gt; investment.   The landscape is littered with vendors that over-promised during the sales cycle and Enterprise customers will switch products if it solves the problem better.&lt;br /&gt;&lt;br /&gt;Look at the $500M acquisition of Verity by Autonomy in 2005.  That's 5X more than Powerset (in 2005 dollars) and they actually had loads of paying customers.&lt;br /&gt;&lt;br /&gt;I worry that going directly at Google with a consumer search engine is just so much tilting at windmills.  
Sometimes just selling a product to people willing to pay for it is easier.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-432437109641342081?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/432437109641342081/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=432437109641342081' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/432437109641342081'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/432437109641342081'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2008/07/anna-pattersons-new-company-cuilcom.html' title='Anna Patterson&apos;s new company - cuil.com'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-9083579748209629826</id><published>2008-07-27T16:06:00.000-07:00</published><updated>2008-07-27T21:27:58.666-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='interns'/><category scheme='http://www.blogger.com/atom/ns#' term='news'/><category scheme='http://www.blogger.com/atom/ns#' term='othersonline'/><title type='text'>Others Online news and welcome Rance and Vik!</title><content type='html'>I've been busy in the last few weeks getting new products launched at Others Online.  We've deployed a set of new 'Audience Affinity Analytics'. 
By adding a simple Javascript tag to your webpages, our software delivers free audience summary reports that detail at a keyword/phrase level what people are paying attention to on your site!  &lt;a href="http://www.othersonline.com/"&gt;Short video here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;We've also hired Rance Harmon (MS student at Montana State U) and &lt;a href="http://eecs.wsu.edu/%7Evjakkula/"&gt;Vik Jakkula&lt;/a&gt; (MS student at Washington State U) as interns for the summer.  Rance will be working on general web and Java coding as well as testing frameworks.  Vik will be working on some new data mining algorithms.  Welcome guys!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-9083579748209629826?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/9083579748209629826/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=9083579748209629826' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/9083579748209629826'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/9083579748209629826'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2008/07/others-online-news-and-welcome-rance.html' title='Others Online news and welcome Rance and Vik!'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-8448696336700084039</id><published>2008-05-15T22:13:00.000-07:00</published><updated>2008-05-15T22:18:12.936-07:00</updated><title type='text'>Looking for an intern</title><content type='html'>I'm looking for resumes for a possible internship this summer with a startup company I work for - &lt;a href="http://www.othersonline.com"&gt;Others Online&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;   Min 20 hours per week, possibly 40 hours.&lt;br /&gt;&lt;br /&gt;   Job Description:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;      Assist with Java servlet, SQL and Ajax UI coding.   Data mining and search engine work possible if background allows.&lt;/li&gt;&lt;li&gt;       Potentially assisting with highly scalable system work using partitioned architecture migrating to Amazon EC2 &amp;amp; S3 system.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;   Required Skills (4+ out of 6)&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Java/JSP proficiency&lt;/li&gt;&lt;li&gt;SQL knowledge (MySQL)&lt;/li&gt;&lt;li&gt;Ajax experience (YUI, Google Web Toolkit, or from scratch javascript)&lt;/li&gt;&lt;li&gt;Some Sys Admin skill in Linux&lt;/li&gt;&lt;li&gt;Perl, Ruby or Python scripting&lt;/li&gt;&lt;li&gt;Familiarity with VMWare, Tomcat/Resin &amp;amp; Apache&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;   Preferred Skills&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Data Mining, NLP or Search Engine experience/coursework&lt;/li&gt;&lt;li&gt;WordNet or other taxonomy experience&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Text processing, clustering and classification&lt;/li&gt;&lt;li&gt;On-line model building&lt;/li&gt;&lt;li&gt;Ant System methods&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;Partitioned high-availability system knowledge&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Design fundamentals of this type of architecture&lt;/li&gt;&lt;li&gt;Amazon EC2/S3 experience&lt;/li&gt;&lt;/ul&gt;&lt;li&gt; 
Firefox browser extensions&lt;/li&gt;&lt;/ul&gt;If you've taken a data mining or AI course, there is a very real possibility of doing some interesting work this summer with a conference research paper a likely outcome.&lt;br /&gt;&lt;br /&gt;If you want to assist with the high-availability partitioned architecture implementation and migration, you'll be learning cutting edge design patterns that I have yet to see in a textbook.  The other software engineer (&lt;a href="http://twitter.com/dierken"&gt;Mike Dierken&lt;/a&gt;) worked at Amazon for a few years implementing very scalable systems.&lt;br /&gt;&lt;br /&gt;Splitting time between the above two work areas is possible and in fact likely.  Either way you'll come out with some real world skills desirable in industry.&lt;br /&gt;&lt;br /&gt;We're a startup trying to explode on the scene this summer and the outcome post-summer is unknown.  It could turn into a full-time salary job or continued part-time work in the fall.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-8448696336700084039?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/8448696336700084039/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=8448696336700084039' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/8448696336700084039'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/8448696336700084039'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2008/05/looking-for-intern.html' title='Looking for an 
intern'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-3498165073359163880</id><published>2008-05-08T08:48:00.000-07:00</published><updated>2008-05-08T08:58:55.511-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='computational advertising'/><title type='text'>A Subtle Art</title><content type='html'>I loved this quote from a &lt;a href="http://bits.blogs.nytimes.com/2008/05/05/how-googles-checkbook-stymied-microsoft/index.html"&gt;NYT Blog post&lt;/a&gt;:&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;"By attracting a commanding share of the search advertising activity, Google also has the best data with which to create equations that maximize the money it makes from each search. It turns out that picking which ad to display when is a subtle art that can have a great effect."&lt;/blockquote&gt;&lt;br /&gt;I've been working on effectively harvesting and choosing keywords to best power ad networks like Google/Yahoo/MSN.   I believe it's much harder than simply producing good search results.  Search results are targeted at people explicitly looking for information, so in that sense it's like an auto-generated yellowpages entry: people &lt;span style="font-weight: bold;"&gt;want &lt;/span&gt;to click on something.  The adverts on these pages benefit from the explicit nature of the search.&lt;br /&gt;&lt;br /&gt;How do you do the same for display advertising on the rest of the non-SERP pages of the web?  You get one shot at providing value, a bit like trying to hit the green from the rough and through a stand of trees.   
A subtle art indeed, and great fun to work on.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-3498165073359163880?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/3498165073359163880/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=3498165073359163880' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/3498165073359163880'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/3498165073359163880'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2008/05/subtle-art.html' title='A Subtle Art'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-4288256072888833163</id><published>2008-04-24T23:31:00.000-07:00</published><updated>2008-04-25T00:19:31.538-07:00</updated><title type='text'>AI Luminaries and their claims</title><content type='html'>This is a bit of a rant.  In our weekly AI colloquium we have covered two high profile AI people, Dr. 
Robert Hecht-Nielsen and his &lt;a href="http://www.scholarpedia.org/article/Confabulation_theory"&gt;Confabulation theory&lt;/a&gt;  and Jeff Hawkins and his book &lt;a href="http://www.onintelligence.org/"&gt;On Intelligence&lt;/a&gt; plus his ideas around &lt;a href="http://en.wikipedia.org/wiki/Hierarchical_Temporal_Memory"&gt;Hierarchical Temporal Memory.&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Neither of these two is terribly famous in AI academic circles per se, but they do get some press.  They are both founders of very successful businesses: HNC (now Fair Isaac) and Palm/PalmPilot.  Particularly in the case of HNC's use of AI for fraud detection in financial transactions, I am very impressed.  Yet this is where it gets ugly.&lt;br /&gt;&lt;br /&gt;If you read Hawkins's book or listen to a YouTube lecture of Dr. Hecht-Nielsen you would think these guys have invented the next great AI of the 'I have modeled the brain' variety.  They both demonstrate very interesting neural network inspired architectures and basic computation.  They are both complete with deep layering, feedback loops and other structures that NN people have known for years will work.&lt;br /&gt;&lt;br /&gt;Yet both of them will just repeat similar claims that this is how the brain actually works and with this architecture real cognition is possible, even potentially trivial.  Hogwash.  Batshit Insanity.&lt;br /&gt;&lt;br /&gt;These aren't my words; they were used to describe &lt;a href="http://en.wikipedia.org/wiki/Stephen_Wolfram"&gt;Stephen Wolfram&lt;/a&gt;'s work on Cellular Automata and his claims that his flavor of CAs can build anything and describe all physics.  He makes lots of strong claims based on a theory he spent decades toiling on nearly alone.&lt;br /&gt;&lt;br /&gt;Would Peter Norvig ever make these types of claims?   
I tend to think no way.&lt;br /&gt;&lt;br /&gt;First, how can you make the claim that this &lt;span style="font-style: italic;"&gt;architecture &lt;/span&gt;is really how the brain works and as such will lead to cognition or reasoning?  To me that is what they are.. architectures of computation.&lt;br /&gt;&lt;br /&gt;Where are the programs?  Where is the embedded/encoded/learned method that can actually reason in some logical or nearly logical fashion?  I.e., chaining facts together to create derived knowledge?  Picking apart disparate background statements/data to answer queries for information?  How does this architecture answer the question of what are necessary and sufficient conditions to produce computerized general reasoning?&lt;br /&gt;&lt;br /&gt;It's a massive leap of faith for me to take for granted that a neural architecture, with all its bells and whistles, will just lead to reasoning.  Doesn't it take just one counter-example of something that such an architecture and associated learning methods cannot learn correctly to break the bold claims?&lt;br /&gt;&lt;br /&gt;To me that is a central lesson of genetic algorithm theory: people for years went around insisting that the GA was a better optimizer (mousetrap) than all that had come before.  They invented theories to describe its behavior and made bold claims.  Yet Wolpert and Macready came along and showed the &lt;a href="http://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization"&gt;No Free Lunch theorems&lt;/a&gt;.  It basically blew many of the bolder claims of GA superiority to hell.  It has now spread into general Machine Learning as well.&lt;br /&gt;&lt;br /&gt;I think people, particularly ones in AI with monster egos, need to exercise some humility and not make such strong claims.  
Didn't that type of behavior contribute to the &lt;a href="http://en.wikipedia.org/wiki/AI_winter"&gt;AI Winter&lt;/a&gt;?&lt;br /&gt;&lt;br /&gt;Every time I hear or read these types of claims my mind sees an image of Tom Hanks in Cast Away dancing around his fire and saying "Yes! Look what I have created! I have made fire!! I... have made fire!".  The horrible irony of the scene is that he remains fundamentally lost, couldn't find his island on a map and is no closer to finding home as a result of making his fire.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-4288256072888833163?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/4288256072888833163/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=4288256072888833163' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4288256072888833163'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4288256072888833163'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2008/04/ai-luminaries-and-their-claims.html' title='AI Luminaries and their claims'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-1628644667913758549</id><published>2008-04-20T23:08:00.000-07:00</published><updated>2008-04-21T00:03:41.702-07:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='culture'/><category scheme='http://www.blogger.com/atom/ns#' term='devices'/><category scheme='http://www.blogger.com/atom/ns#' term='nomads'/><title type='text'>Series of Economist Stories on Digital Nomads</title><content type='html'>During a flight to Silicon Valley last week for &lt;a href="http://www.ad-tech.com/"&gt;AdTech&lt;/a&gt; I read a special report in the Economist on &lt;a href="http://www.economist.com/specialreports/displayStory.cfm?STORY_ID=10950394"&gt;Digital Nomads&lt;/a&gt;.  The intro article sets the tone and talks about how the mobility of cellular devices is quickly changing the world.  I love the description of early digital nomads as more akin to astronauts who must carry everything with them (cables, disks, dongles etc).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;My comments:&lt;/span&gt;&lt;br /&gt;I do tend to carry a good inch of paper everywhere I go, 90% of which are CS/AI research papers or dissertation notes to work on as I have the time.  Opening the notebook is way too much of a hassle as a mobile reading device (on an airplane).  The article also states that engineers at Google tend to carry only a smart phone of some kind and no laptop when traveling.  Is this true?  I find this a bit hard to believe unless they are managers and not active coders.. I can't imagine writing code on a blackberry or, worse, an iPhone.&lt;br /&gt;I'll admit I have no smart phone and do carry a laptop everywhere.  I'd like it to be smaller, but not at the cost of horsepower or decent display size.  
I tried using a Microsoft powered Samsung smart phone and hated it: what it gave me in increased cool functions I sacrificed in phone function.&lt;br /&gt;&lt;br /&gt;The &lt;a href="http://www.economist.com/specialreports/displaystory.cfm?story_id=10950378"&gt;next article in the series&lt;/a&gt; discusses the new found benefits and costs of our ability to work everywhere and anywhere.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;My comments:&lt;/span&gt;&lt;br /&gt;My laptop is definitely a desktop replacement for me, I want mobility.. but I am old fashioned enough to want a regular desk to work at.. even if that is my current count of three different ones at different times of the week.  I have tried the coffee shop thing.. it works to a point but I find myself only being efficient for the first 2-3 hours then it goes to hell as I start hearing and being distracted by the conversations around me.  I suppose it would be fine if I needed to do mostly emailing and short attention span coding.  I also think that most business conversations are sensitive enough that talking in a public space is not acceptable in my mind... nothing to hide, yet why broadcast every mundane and not so mundane detail to the people around you?&lt;br /&gt;The parts about culture clashes from old cubicle work and new mobile work are spot on.  It points to the need to trust every employee and set a tone that what matters is production and not the act of working.&lt;br /&gt;&lt;br /&gt;The &lt;a href="http://www.economist.com/specialreports/displaystory.cfm?story_id=10950463"&gt;third article in the series&lt;/a&gt; explores the need for new types of spaces and architecture for this new way of working.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;My comments:&lt;br /&gt;&lt;/span&gt;  I loved this one, even if it assumes that I 100% embrace the nomad ethos.  
Perhaps if I had such a space to work, and people to work with in that space I'd abandon some of my older ways.   The closest thing I can imagine these 'third places' being is a combination of a student-union building and a college library.  Very non uniform places with nooks and crannies for every type of 'work'.   The Bozeman paper just had an article on a new business for 'on demand offices' and I saw two others described in a Seattle tech magazine at the airport.  Winning idea.. 50% of the reason I go to campus two days a week is for socialization.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Great quote:&lt;/span&gt;&lt;br /&gt;These places are "physically inhabited but psychologically evacuated" ... leaving people feeling "more isolated than if the cafe were merely empty".&lt;br /&gt;&lt;br /&gt;Great food for thought in the articles.  I do notice that in my travels I see one thing that disturbs me.  It is some people's inability to ignore their cell phone or crackberry when they are engaged in a face to face conversation or meeting.  I was very appreciative of one executive's recent demonstration of 100% ignoring his device when it rang or vibrated.. he didn't even flinch.  This was the exception to the rule over the past week.&lt;br /&gt;&lt;br /&gt;I need to let this brew more.. at some point I am sure it will spur some good ideas.  Perhaps there is an algorithm or platform waiting to be discovered that will spur us to look up from our devices and engage each other again.  
I suspect a big part of our addiction to them is that they are a much higher-bandwidth information pipe than simple conversations are.&lt;br /&gt;&lt;span style="font-style: italic;"&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-1628644667913758549?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/1628644667913758549/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=1628644667913758549' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1628644667913758549'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1628644667913758549'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2008/04/series-of-economist-stories-on-digital.html' title='Series of Economist Stories on Digital Nomads'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-8009317220913963349</id><published>2008-04-18T12:05:00.000-07:00</published><updated>2008-04-18T15:39:20.044-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='human relevance'/><category scheme='http://www.blogger.com/atom/ns#' term='search engines'/><category scheme='http://www.blogger.com/atom/ns#' term='computational advertising'/><title type='text'>Using human relevance judgements in search and advertising</title><content type='html'>This 
is old news on a couple of dimensions.  Read Write Web had a &lt;a href="http://www.readwriteweb.com/archives/google_hires_people_for_feedba.php"&gt;post&lt;/a&gt; on how Google uses human relevance studies to help judge/QA their search results.  This resulted from an interview that &lt;a href="http://norvig.com/"&gt;Peter Norvig&lt;/a&gt; gave to MIT Technology Review and caused some commenting in the blogosphere (&lt;a href="http://bits.blogs.nytimes.com/2007/12/18/the-people-inside-googles-black-box/?ref=technology"&gt;New York Times Tech Blog&lt;/a&gt;, &lt;a href="http://blogoscoped.com/archive/2007-12-18-n89.html"&gt;Google Blogoscoped&lt;/a&gt;).  Old news on old news.&lt;br /&gt;&lt;br /&gt;We now know that both Yahoo and Microsoft are using (to some degree) human studies to evaluate computational advertising algorithms (see &lt;a href="http://aicoder.blogspot.com/2008/03/scraping-documents-for-advertising.html"&gt;this&lt;/a&gt; and &lt;a href="http://aicoder.blogspot.com/2008/04/text-summarization-and-advertising.html"&gt;this&lt;/a&gt;). 
The performance metric of your algorithm is how well the informational items it predicts to be relevant to a context correlate with what humans judge to be relevant.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Question:&lt;/span&gt;  When will TREC have a computational advertising contest?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-8009317220913963349?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/8009317220913963349/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=8009317220913963349' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/8009317220913963349'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/8009317220913963349'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2008/04/using-human-relevance-judgements-in.html' title='Using human relevance judgements in search and advertising'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-5538913207706951253</id><published>2008-04-17T15:57:00.000-07:00</published><updated>2008-04-18T13:26:06.310-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category scheme='http://www.blogger.com/atom/ns#' term='text summarization'/><category scheme='http://www.blogger.com/atom/ns#' term='computational 
advertising'/><title type='text'>Text Summarization and Advertising</title><content type='html'>Recently read another CompAdvert paper (CIKM'07) from the Yahoo group, &lt;a href="http://portal.acm.org/citation.cfm?id=1321488&amp;amp;jmp=cit&amp;amp;coll=&amp;amp;dl=GUIDE"&gt;Just-in-Time Contextual Advertising&lt;/a&gt;.  They describe a system where the current page is scraped on-line via a javascript tag, summarized and then that summary is passed to servers to match with Ad listings.  Interesting points are:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;5% of page text carefully chosen can yield 97+% of full-text advert matching relevance&lt;/li&gt;&lt;li&gt;Best parts of the document are URL, referring URL, title, Meta and headings.&lt;/li&gt;&lt;li&gt;Adding in a classification against a topical taxonomy adds to the accuracy of the ad matches.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;They judged the ad matching relevance against human judgments of ad to page relevance.&lt;/li&gt;&lt;/ul&gt;I found these papers within the last few months as &lt;a href="http://www.othersonline.com/"&gt;OthersOnline&lt;/a&gt; focused on behavioral based advertising.  In many ways their findings are interesting, affirming and unsurprising.   Interesting in that they are pushing the state of the art in advert matching, and affirming in that we @ OO are on the right track.  Unsurprising in that using the document fields above is the classic approach to indexing webpages and documents.&lt;br /&gt;Of course internet search engines used this for years (it defines the SEO industry's eco-system), and the old/retired open source engine HtDig has had special treatment of those fields since the late 90s.  The difference now is the direction: the documents are the "query" and the hits are the ads.  Best part about the method is that it's cheap... javascript + the browser becomes your distributed spider and summarizer of the web.&lt;br /&gt;I do love finding these papers.. 
we just don't have the time or resources to have a study like this and confirm the approach with a paid human factors study.  We just go forward on educated gut feel day to day, and the human measure is whether we get clicks on the ads.&lt;br /&gt;This approach is similar to the one we outlined and implemented before finding this paper.  The difference is what we do with the resulting "query": we use the signal to learn a predictive interest model of users.&lt;br /&gt;Still no mention of any relative treatment of words within the same field... one would assume this would move the needle on relevance as well.&lt;br /&gt;I still believe that this type of summarization approach can be used to make an implicit page tagger and social recommender like del.icio.us ... if you can filter the summary based upon some knowledge of the user's real (as opposed to statistical) interests.  Key route to auto-personalization of the web.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-5538913207706951253?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/5538913207706951253/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=5538913207706951253' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/5538913207706951253'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/5538913207706951253'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2008/04/text-summarization-and-advertising.html' title='Text Summarization and Advertising'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-861800636429238657</id><published>2008-04-04T07:24:00.000-07:00</published><updated>2008-04-04T08:08:10.780-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='som'/><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category scheme='http://www.blogger.com/atom/ns#' term='itemset mining'/><category scheme='http://www.blogger.com/atom/ns#' term='ant system'/><title type='text'>Itemset mining in data streams</title><content type='html'>We covered a nice paper by &lt;a href="http://www.liacs.nl/home/edegraaf/"&gt;de Graaf&lt;/a&gt; et al. of &lt;a href="http://www.liacs.nl/"&gt;Leiden University&lt;/a&gt; in the AI colloquium.&lt;a href="http://arxiv.org/pdf/0705.0588"&gt; Clustering Co-occurrences of Maximal Frequent Patterns in Streams.&lt;/a&gt;  It deals with the problem of finding frequent itemsets in a data stream.  Ideally you want an incremental algorithm for this with an 'updatable model' rather than being forced to reprocess the entire data/transaction sequence when adding new data.  The paper's approach has the extra benefit that the itemsets are clustered by similarity as well.  I really enjoy using and learning about algorithms that have nice side-effects.&lt;br /&gt;&lt;br /&gt;A rough overview of the algorithm is that it does three basic operations with incoming data.  First it builds a support model of patterns encountered.  It does this with a reinforcement and decay technique, reinforcing support for patterns encountered in the current transaction and decaying support for those that are not.  Second it maintains a geometric organization of itemsets according to a distance metric in a boxed 2-D area.  
As new data is processed, itemsets' coordinates in the (x,y) box move and shift around according to their similarity with other patterns.  Third it applies a pattern merging/splitting mechanism to derive new patterns for the model to track. New patterns get a random (x,y) position.&lt;br /&gt;&lt;br /&gt;At the termination of processing some amount of data, you are left with a list of itemsets and their support/frequency as well as a nice grouping by similarity.&lt;br /&gt;&lt;br /&gt;One advantage of his presentation is that it is stripped of all excess complexity.  They rightly note that it learns an approximation of what you would get from a full-data-scan of a traditional itemset miner.  Fine with me.. I don't get hung up on exactness and have lots of faith that incremental model building works well in practice.&lt;br /&gt;&lt;br /&gt;The minor flaw of the paper is that they fail to point out (or notice??) that what they have built is a hybrid of a &lt;a href="http://en.wikipedia.org/wiki/Swarm_intelligence"&gt;Swarm Intelligence&lt;/a&gt; system and a &lt;a href="http://en.wikipedia.org/wiki/Self-organizing_map"&gt;Self Organizing Map.&lt;/a&gt; The Swarm/Ant portion comes from the reinforcement &amp;amp; decay of the support model, and the SOM from the geometric clustering of the itemsets.   One could duplicate this algorithm in spirit by implementing Ant System + SOM with the merging/splitting for new pattern production.  By 'Ant System' here I refer to the spirit of an ant system where you use pheromone reinforcement and decay of a model; actual ants traversing a path in a graph are not necessary.  The cells in the SOM would contain a sparse vector of itemsets and apply the standard rules for updating.&lt;br /&gt;&lt;br /&gt;Yet, even as I see the connection, this is a pointy-headed comment.  The paper is nice in that the algorithm is presented without flourish in a straightforward way... 
sometimes using the word 'hybrid' and casting your work that way is just a form of academic &lt;a href="http://en.wikipedia.org/wiki/Buzzword_bingo"&gt;buzzword bingo&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I'll definitely look to implement something similar to this ASAP.  I may skip the merging/splitting and use a standard itemset miner offline over the 'last X transactions' and form an itemset pattern dictionary.  Only itemsets in the dictionary will be tracked and clustered with the data stream, and every so often I would run the offline algorithm to learn new patterns and support to merge into the dictionary.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-861800636429238657?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/861800636429238657/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=861800636429238657' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/861800636429238657'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/861800636429238657'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2008/04/itemset-mining-in-data-streams.html' title='Itemset mining in data streams'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-1916464493650876279</id><published>2008-03-30T22:31:00.000-07:00</published><updated>2008-04-18T13:29:29.147-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='open source'/><title type='text'>OpenOffice and LaTeX</title><content type='html'>I am stunned.  I am working on the dissertation and need to make a &lt;a href="http://www.latex-project.org/"&gt;LaTeX&lt;/a&gt; monograph out of some publications.   The biggest pain is that I had written two of the papers in MS Word, and previous experience cut-and-pasting from Word to a text file was horrible.  I also no longer have MS Word installed on my new machine (didn't come pre-installed from Dell), so I have been using &lt;a href="http://www.openoffice.org/"&gt;OpenOffice&lt;/a&gt;.  So how do I get from Word to LaTeX when I only have OO?&lt;br /&gt;&lt;br /&gt;More as procrastination I did a search and found &lt;a href="http://www.hj-gym.dk/%7Ehj/writer2latex/"&gt;Writer2LaTeX.&lt;/a&gt;  Some saint named Henrik Just created this to convert OO docs (either swx or odf format) to LaTeX.  Looks like he has been working on it for a while with a set of faithful followers submitting feature requests.  
Between OO 2.4 being able to flawlessly open the MS Word docs and save to ODF plus Writer2LaTeX, I probably saved myself 2 days of irritating work.&lt;br /&gt;&lt;br /&gt;It's not perfect, as I need to convert some equation images to proper LaTeX, yet the mind-numbing job of cut and paste is rendered unnecessary.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Thanks Henrik and the rest of the OpenOffice team!&lt;br /&gt;&lt;br /&gt;Postscript:&lt;br /&gt;&lt;/span&gt;&lt;span&gt;While not open source, &lt;a href="http://www.dessci.com/en/products/texaide/"&gt;TeXaide &lt;/a&gt;from &lt;a href="http://www.dessci.com/en/"&gt;Design Science&lt;/a&gt; is a very nice free LaTeX equation editor.  Write your equation, then highlight and copy it... then it pastes magically as text markup into your LaTeX code.  Now if only I could just drag and drop an image on it and have it convert the equation in the image.&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-1916464493650876279?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/1916464493650876279/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=1916464493650876279' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1916464493650876279'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1916464493650876279'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2008/03/openoffice-and-latex.html' title='OpenOffice and LaTeX'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-1559508668644277996</id><published>2008-03-28T10:49:00.000-07:00</published><updated>2008-04-01T10:57:57.283-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='semantic web'/><category scheme='http://www.blogger.com/atom/ns#' term='natural language processing'/><category scheme='http://www.blogger.com/atom/ns#' term='computational advertising'/><title type='text'>Semantic Features for Contextual Advertising</title><content type='html'>&lt;a href="http://research.yahoo.com/Andrei_Broder"&gt;Andrei Broder's&lt;/a&gt; group at Yahoo! Research has a focus on &lt;a href="http://research.yahoo.com/Computational_Advertising"&gt;Computational Advertising.&lt;/a&gt;  At SIGIR 2007 they released a paper on using a semantic taxonomy to do contextual matching for advertising.  This is a similar problem to the previous post about MS Research, deriving lists of keywords from a document to use as queries to an advertising system.  Unlike the MS Research paper, Yahoo has built a large taxonomy of "commercial interest queries" with 6000 nodes and approx 100 items attached to each node.&lt;br /&gt;&lt;br /&gt;The essential approach is to classify both the document and all of the ads into the taxonomy, then match ads to documents on the basis of topical distance.  The distance score is combined with a more standard IR type approach forming a combined score.  The top-k matching ads ordered by lowest distance are the ads displayed on the page.&lt;br /&gt;&lt;br /&gt;The &lt;span style="font-style: italic;"&gt;TaxScore()&lt;/span&gt; function is fairly interesting: it attempts to generalize the given term within the taxonomy.  
It seems that this type of approach could work well using WordNet's &lt;a href="http://en.wikipedia.org/wiki/Hypernym"&gt;Hypernyms&lt;/a&gt; in a more regular IR/Search setting.&lt;br /&gt;&lt;br /&gt;I have to read it again more carefully to see if I missed it, however I did not see any weighting of a keyword's bid value (or advert count) anywhere in the formulas.  Maybe this was omitted for trade secrecy?? .. it seems obvious that it should be used to some degree to maximize $$ yield or eCPM of clicked ads.  The idea is not to let it affect the matching of the ads to keywords, just the final rank order to some degree.&lt;br /&gt;&lt;br /&gt;In my own experiments @ OO, using some proxy for bid value seems to increase eCPM.  The biggest challenge is getting comprehensive data for your dictionary if you are not Google, Yahoo or MS.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Postscript:&lt;/span&gt;&lt;br /&gt;I have it confirmed from two independent sources (current and ex Y!ers)  that Yahoo is working on a new Content Match codebase as the old version didn't work.   Hard to say what status Broder's above technique is in (production usage or internal testing).. 
or if it was part of the old system?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-1559508668644277996?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/1559508668644277996/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=1559508668644277996' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1559508668644277996'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1559508668644277996'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2008/03/semantic-features-for-contextual.html' title='Semantic Features for Contextual Advertising'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-3224197226838781610</id><published>2008-03-28T09:41:00.000-07:00</published><updated>2008-03-28T10:21:26.932-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='natural language processing'/><category scheme='http://www.blogger.com/atom/ns#' term='search engines'/><category scheme='http://www.blogger.com/atom/ns#' term='computational advertising'/><title type='text'>Scraping Documents for Advertising Keywords</title><content type='html'>Lately I've been working on extracting keywords from text that would be associated with good keyword advertising performance.   
This is closely related to the 'text summarization' problem, yet that usually works towards a goal of readable summaries of documents.  This is a simpler problem as I don't want to build readable summaries.&lt;br /&gt;&lt;br /&gt;'&lt;a href="http://www2006.org/programme/files/pdf/533.pdf"&gt;Finding Advertising Keywords on Web Pages&lt;/a&gt;' from MS Research (Yih, Goodman, and Carvalho) was interesting reading.  To boil it down to its essence, the authors used a collection of standard text indexing and NLP techniques and datasets to derive 'features' from the documents, then used a feature-selection method to decide which features best identified good advertising keywords in a document.  They judged the algorithms against a human-generated set of advertising keywords associated with a group of web pages.  Their 'annotators' read the documents then chose prominent words from the document to use as viable keyword advertising inputs.&lt;br /&gt;&lt;br /&gt;Note that this is not an attempt to do topic classification, where you could produce a keyword that describes a document without appearing in it.. for example labeling a news article about the Dallas Cowboys with 'sports' or 'event tickets' even if those labels did not exist in the article.&lt;br /&gt;&lt;br /&gt;Interestingly the algorithm learned that the most important features predicting a word's advertising viability were the query frequency in MSN Live Search (a dead obvious conclusion now supported by experiments), and the TF-IDF metric.   Other features like capitalization, link text, phrase &amp;amp; sentence length and title/headings words were not as valuable alone.. yet (unsurprisingly) the best system used nearly all features.  
The shocker was that the part-of-speech information was best left unused.&lt;br /&gt;&lt;br /&gt;I emailed the lead author and learned that the MS lawyers killed the idea of releasing the list of labeled URLs.&lt;br /&gt;&lt;br /&gt;Post Script:  The second author is Joshua Goodman, who had a hilarious exchange with some authors from &lt;a href="http://www.uniroma1.it/"&gt;La Sapienza University&lt;/a&gt; in Rome.  They wrote a 2002 &lt;a href="http://www.ccs.neu.edu/home/jaa/CSG399.05F/Topics/Papers/BenedettoCaLo.pdf"&gt;Physical Review Letters paper&lt;/a&gt; on using gzip for analyzing the similarity of human languages.  Goodman responded with &lt;a href="http://citeseer.ist.psu.edu/503042.html"&gt;this critique&lt;/a&gt;, causing the original authors to respond with this &lt;a href="http://arxiv.org/abs/cond-mat/0203275"&gt;response&lt;/a&gt;.  Looks like there are other follow ups by third-parties.  The mark of an effective paper is that it is talked about and remembered.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-3224197226838781610?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/3224197226838781610/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=3224197226838781610' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/3224197226838781610'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/3224197226838781610'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2008/03/scraping-documents-for-advertising.html' title='Scraping Documents for Advertising 
Keywords'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-1140480127892565501</id><published>2008-03-28T08:12:00.000-07:00</published><updated>2008-03-28T09:38:39.413-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>Frequent Pattern Mining</title><content type='html'>In the weekly &lt;a href="http://labs.rightnow.com/colloquium/papers.php"&gt;RightNow AI Colloquium @ MSU CS &lt;/a&gt;, we read a paper by  Jiawei Han et al. called&lt;br /&gt;&lt;a href="http://www.cs.uiuc.edu/%7Ehanj/pdf/dami04_fptree.pdf"&gt;Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach.&lt;/a&gt;&lt;br /&gt;Basically the problem is this: given a sequence of transactions involving multiple items per transaction, what are the frequent itemsets?  Frequent itemsets are groups of m items that tend to be purchased together.  
The &lt;a href="http://portal.acm.org/citation.cfm?id=968027"&gt;earlier SQL-based version of FP-Tree&lt;/a&gt; looks interesting as well.&lt;br /&gt;&lt;br /&gt;FP-Tree outcompetes, in time complexity, the &lt;a href="http://en.wikipedia.org/wiki/Apriori_algorithm"&gt;Apriori Association Rule Miner Algorithm&lt;/a&gt; (&lt;a href="http://www2.cs.uregina.ca/%7Edbd/cs831/notes/itemsets/itemset_prog1.html"&gt;Java Code).&lt;/a&gt;  Not sure how it compares with the raft of other algorithms available at the &lt;a href="http://fimi.cs.helsinki.fi/"&gt;FIMI.&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I'd love to use an algorithm like this to extract pairs and triples of keywords from documents in a clickstream.. basically looking for recurring patterns in browsing/hunting behavior in document repositories.&lt;br /&gt;&lt;br /&gt;I would say the biggest issue using either of the two above algorithms in a 'production' system is that they are not incremental.  Ideally one could process a batch of transactions a record at a time and form a rolling frequent itemset collection.  Itemsets would move in and out of the collection as they achieved or lost 'support', and the cost of adding another transaction would be minimal... as in not having to rescan all previous transactions.&lt;br /&gt;&lt;br /&gt;My initial idea for how to do this would be to devise an incremental approximation addition to any association miner.  At the end of a full-run of your miner, you would keep the final itemsets AND the itemsets that just missed the support threshold.   The incremental algorithm would process up to a tolerance level of new transactions, say log(n) of the original transaction set size, and look to promote the 'just missed' itemsets if support arrives.  Maybe some attempt could be made to remove itemsets if the additional transactions sank their support level to below the cut-line.  
After more than log(n) new transactions arrive, you can reprocess the entire set or trim off the first log(n) of the old transactions plus the new ones.&lt;br /&gt;&lt;br /&gt;There are likely some speedups to be had in subsequent reprocessings.  If from the previous full-run you found that a collection of itemsets made the cut, you could prune out those itemsets from the old transactions.. restarting the algorithm with the previous run's itemsets in the "found pool".&lt;br /&gt;&lt;br /&gt;Of course with an algorithm like FP-Tree you must save the tree for the next run, and devise a tree rebalancing algorithm to make it incremental (relative frequencies of items change with new information).  It gets messy quick.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-1140480127892565501?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/1140480127892565501/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=1140480127892565501' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1140480127892565501'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1140480127892565501'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2008/03/frequent-pattern-mining.html' title='Frequent Pattern Mining'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-1399259994377710564</id><published>2008-03-24T08:00:00.000-07:00</published><updated>2008-03-28T08:10:42.092-07:00</updated><title type='text'>Road Coloring Problem solved</title><content type='html'>I'm not sure why this hit my buttons, as it's not my area of expertise, yet I found it very interesting. Avraham Trahtman, 63 year old Russian mathematician who used to work as a plumber &amp;amp; maintenance guy in Israel, has solved (&lt;a href="http://arxiv.org/pdf/0709.0099v4"&gt;paper with solution&lt;/a&gt;) an interesting &lt;a href="http://en.wikipedia.org/wiki/Road_Coloring_Conjecture"&gt;problem in graph theory&lt;/a&gt;. Unlike other famous problems with few applications, this appears to have some (network routing) and he's published an &lt;a href="http://arxiv.org/abs/0801.2838"&gt;updated paper with a sub-quadratic algorithm&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I think it has been captured in &lt;a href="http://www.msnbc.msn.com/id/23729600/"&gt;news coverage&lt;/a&gt; more as a result of his age and story than importance of the problem.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-1399259994377710564?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/1399259994377710564/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=1399259994377710564' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1399259994377710564'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/17598372/posts/default/1399259994377710564'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2008/03/road-coloring-problem-solved_24.html' title='Road Coloring Problem solved'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-4602108780228797275</id><published>2008-02-12T10:09:00.000-08:00</published><updated>2008-02-12T13:03:30.425-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='community sites'/><category scheme='http://www.blogger.com/atom/ns#' term='refocusing'/><category scheme='http://www.blogger.com/atom/ns#' term='computational advertising'/><title type='text'>Refocusing of Others Online to Behavioral Targeting and Computational Advertising</title><content type='html'>On the work/professional front Others Online is going into a bit of a transformation.&lt;br /&gt;&lt;br /&gt;A rethinking exercise demonstrated that what we do best was connect people with content based upon implicit attention streams (this can be anything, click streams, blog posts, twitter, searches).   
That content was other people (social referrals), content (web pages &amp;amp; blogs) and ads.&lt;br /&gt;&lt;br /&gt;Optimizing the process of selecting ads for individual people/groups and context rather than broad categories is what we'll focus on for a while, particularly me, as I concentrate on being the computational advertising guy at OO.&lt;br /&gt;&lt;br /&gt;As for the social networking part of the software, we're refocusing on adding value to existing community and topical sites and other like groups that tend to connect people around a specific topic or geography.&lt;br /&gt;&lt;br /&gt;Personally I think the refocusing on computational advertising and targeting is a great choice.  When was the last time a web advertisement was &lt;span style="font-style: italic;"&gt;really relevant&lt;/span&gt; to you?  Do you really want to see noisy advertisements for mortgage refinancing and Viagra-like products?  The key to all of this is to be socially responsible and give users complete control so that we don't fall into the Facebook-Beacon backlash.&lt;br /&gt;&lt;br /&gt;Well, that and some clever algorithms to do the learning.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-4602108780228797275?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/4602108780228797275/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=4602108780228797275' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4602108780228797275'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4602108780228797275'/><link rel='alternate' type='text/html' 
href='http://aicoder.blogspot.com/2008/02/refocusing-of-others-online-to.html' title='Refocusing of Others Online to Behavioral Targeting and Computational Advertising'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-5365573131205807719</id><published>2008-02-12T09:52:00.001-08:00</published><updated>2008-02-12T13:04:18.626-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='czech'/><category scheme='http://www.blogger.com/atom/ns#' term='beer'/><category scheme='http://www.blogger.com/atom/ns#' term='vacation'/><category scheme='http://www.blogger.com/atom/ns#' term='germany'/><title type='text'>Dual Purpose trip to Germany</title><content type='html'>I was recently in Germany for two purposes.  The first was for the &lt;a href="http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=2008051"&gt;'Theory of Evolutionary Algorithms'&lt;/a&gt; conference at &lt;a href="http://en.wikipedia.org/wiki/Dagstuhl"&gt;Dagstuhl Castle&lt;/a&gt; in Saarland Germany.  This was a joy as I was able to focus nearly entirely on my PhD research and dissertation progress.  At this point I'm in the home stretch.. I think I have two provable theorems sketched out and a set of new tractable items to explore to finish off the meat of the dissertation.&lt;br /&gt;&lt;br /&gt;The second week in Germany was a beer tourist adventure through Bavaria and Bohemia.  We hit Munich on day one and drank expensive tourist beer, then took a train to the Czech Republic to see Plzen (home of Pilsener beer) and Budweiss (Ceske Budejovice - original home of Budweiser beer). 
Plzen was very interesting and Budweiss was an adventure of train &amp;amp; bus travel. Two days later we took a train back into Germany to Bamberg.   Bamberg is stunningly beautiful and is the home of a wide selection of locally crafted smoked lager beers.  More on this trip in future posts.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-5365573131205807719?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/5365573131205807719/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=5365573131205807719' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/5365573131205807719'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/5365573131205807719'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2008/02/dual-purpose-trip-to-germany.html' title='Dual Purpose trip to Germany'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-4023445525611440049</id><published>2007-12-13T21:46:00.000-08:00</published><updated>2007-12-13T22:01:58.307-08:00</updated><title type='text'>Work and School News</title><content type='html'>It seems I have committed the most frequent sin of blogging.. sporadic posting patterns.  All of November and half of December and no posts.  
It's been a busy month at &lt;a href="http://www.othersonline.com/"&gt;OthersOnline&lt;/a&gt;.  We got a huge spike in user traffic and sign-ups, which is awesome.. and predictably exposed some performance issues.   That work, plus working on a better user behavior capture algorithm (to better match users to other users and content) was the bulk of the time.&lt;br /&gt;&lt;br /&gt;In school news, I just turned in a first draft of the dissertation to my adviser.   It's really a disposable organizational draft to help us plan out what the flow of topics and structure is, and what is left to add.  Nice milestone anyway.  Goal is late spring for a near-final draft with the defense goal in August.&lt;br /&gt;&lt;br /&gt;Some advice for people doing PhD work while employed full-time:  If you are working in your field of graduate study, and your employer permits it.. do the dissertation on a work-related topic!  I could easily have done mine on a topic related to my AI work at RightNow and have finished by now.  I chose to not only do a different topic in AI from work (to be more well-rounded) but to do it on the &lt;span style="font-style: italic;"&gt;theory&lt;/span&gt; of that topic (genetic algorithms). &lt;br /&gt;&lt;br /&gt;While this seemed a good choice at first, the dissertation sank to 4th place on the priority list (after family, career and misc leisure activities).  Don't make this mistake!  Take the shortest path and get done... 
then pursue the side topic under no pressure.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-4023445525611440049?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/4023445525611440049/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=4023445525611440049' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4023445525611440049'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4023445525611440049'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2007/12/work-and-school-news.html' title='Work and School News'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-3455463368993796241</id><published>2007-10-29T09:26:00.001-07:00</published><updated>2007-10-29T22:31:52.756-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='data mining'/><category scheme='http://www.blogger.com/atom/ns#' term='learning structure'/><category scheme='http://www.blogger.com/atom/ns#' term='meaning'/><category scheme='http://www.blogger.com/atom/ns#' term='collective search'/><title type='text'>Inferring meaning from data and structure</title><content type='html'>Jeremy &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;Liew&lt;/span&gt; has quite the thread going on his blog about '&lt;a 
href="http://lsvp.wordpress.com/2007/10/22/meaning-data-structure/"&gt;Meaning = Data + Structure (User Generated)&lt;/a&gt;',   &lt;a href="http://lsvp.wordpress.com/2007/10/29/meaning-data-structure-inferring-structure-from-domain-knowledge/"&gt;Part 2 on Inferring Structure &lt;/a&gt; and a &lt;a href="http://lsvp.wordpress.com/2007/10/27/meaning-data-structure-more-thoughts-on-user-generated-structure/"&gt;Guest Post by Peter Moore&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The post by Moore is a wonderful summary of approaches and their difficulties, and I'll post more on this as I think about it.  My initial response is that we should stop looking/waiting for some near holy-grail {fully functional semantic web} and use a lot of good-enough {technologies, algorithms, ontologies} to make progress.  I think that the perfection-in-reasoning stuff is great for the teleportation version of personal search, whereas the good-enough techniques are applicable now to the orienteering version of personal search.  See this &lt;a href="http://nform.ca/blog/2007/05/orienteering-vs-teleporting"&gt;post&lt;/a&gt; and this &lt;a href="http://haystack.lcs.mit.edu/papers/chi2004-perfectse.pdf"&gt;paper&lt;/a&gt; for orienteering vs teleportation in search.&lt;br /&gt;&lt;br /&gt;Last week the Bozeman AI group read a paper on &lt;a href="http://www.eml-r.villa-bosch.de/english/homes/strube/papers/aaai07.pdf"&gt;Deriving a Large Scale Taxonomy from Wikipedia&lt;/a&gt;. I look at this as an example of the main idea above, deriving structure from user-generated content. True, Wikipedia is already structured, but not necessarily in a way that a computer program can use to reason with.&lt;br /&gt;&lt;br /&gt;The killer thing about this idea is that it's finally time to do it.  Essentially this is what machine learning and data mining have been about for years.  
I've read/&lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_1"&gt;perused&lt;/span&gt; hundreds of academic papers where the basic premise is that we write a suite of algorithms to learn/extract structure from a pool of data.  A big chunk of papers in the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;KDD&lt;/span&gt; conferences each year (&lt;a href="http://www.kdd2007.com/"&gt;2007&lt;/a&gt;, &lt;a href="http://www.kdd2006.com/"&gt;2006&lt;/a&gt;, &lt;a href="http://www.sigkdd.org/kdd/2005/"&gt;2005&lt;/a&gt;) operates on this premise, and this field is quite old (decades).&lt;br /&gt;&lt;br /&gt;Really pointy-headed CS types are &lt;span style="font-style: italic;"&gt;horrible &lt;/span&gt;at monetizing their work.  At approximately the same time that the Google founders were inventing &lt;a href="http://en.wikipedia.org/wiki/PageRank"&gt;PageRank&lt;/a&gt;, Jon Kleinberg was creating &lt;a href="http://en.wikipedia.org/wiki/HITS_algorithm"&gt;HITS&lt;/a&gt;.  Both are link-analysis algorithms to augment what at the time were poor-quality search engines.  Over the past 10 years, when they are evaluated head-to-head on some Information Retrieval task, HITS works on par with PageRank.  Yet Kleinberg is not now worth 40 billion dollars like Brin and Page of Google.&lt;br /&gt;&lt;br /&gt;I fear that the Semantic Web people/researchers have been building sand castles for a decade rather than monetizing what they have to subsidize more research on it.  Perhaps if they had been, Delicious, Digg, Wikipedia, et al. 
would be contributing to the Semantic Web natively, rather than forcing people to figure out a way to export that data into RDF/OWL.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-3455463368993796241?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/3455463368993796241/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=3455463368993796241' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/3455463368993796241'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/3455463368993796241'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2007/10/inferring-meaning-from-data-and.html' title='Inferring meaning from data and structure'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-4572145612932248376</id><published>2007-10-24T22:22:00.000-07:00</published><updated>2007-10-25T08:50:29.638-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hype'/><category scheme='http://www.blogger.com/atom/ns#' term='semantic web'/><category scheme='http://www.blogger.com/atom/ns#' term='search engines'/><category scheme='http://www.blogger.com/atom/ns#' term='ai history'/><title type='text'>Semantic Wishfull Thinking? 
Or Semantics for turing lead into gold?</title><content type='html'>I'm seeing quite the meme these days on the 'Semantic Web' as a way to build the next big thing (See &lt;a href="http://www.readwriteweb.com/archives/twine_first_mainstream_semantic_web_app.php"&gt;Twine&lt;/a&gt;, &lt;a href="http://www.readwriteweb.com/archives/adaptiveblue_semantic_web_smartlinks.php"&gt;AdaptiveBlue&lt;/a&gt;, &lt;a href="http://www.readwriteweb.com/archives/blueorganizer_semantic_web.php"&gt;more&lt;/a&gt;).  The essence of the &lt;a href="http://en.wikipedia.org/wiki/Semantic_web"&gt;Semantic Web&lt;/a&gt; is the markup of knowledge in such a way as to enable machines to reason about it.&lt;br /&gt;&lt;br /&gt;The idea of having every HTML page you download contain markup that enables a smart web browser or search engine to know that you are looking for (or browsing about) Anthrax &lt;a href="http://en.wikipedia.org/wiki/Anthrax_%28UK_band%29"&gt;the UK punk band&lt;/a&gt;, &lt;a href="http://en.wikipedia.org/wiki/Anthrax_%28band%29"&gt;the US heavy metal band&lt;/a&gt;, &lt;a href="http://en.wikipedia.org/wiki/Anthrax_%28fly%29"&gt;the fly&lt;/a&gt;, or the &lt;a href="http://en.wikipedia.org/wiki/Anthrax_toxin"&gt;toxin.&lt;/a&gt;  This vision is basically one of the &lt;a href="http://www.readwriteweb.com/archives/structured_web_primer.php"&gt;Structured Web.&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;There are issues in my mind:&lt;br /&gt;&lt;br /&gt;1) The Semantic Web has been around for years.  During all those years the content of the web grew from nearly nothing to the mountain of (mostly unstructured) goo we all browse daily.  Why/How will all that knowledge be 'structured'?&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Take Home:&lt;/span&gt;  People do not want to 'structure' knowledge themselves.  They are writing their content for people and not machines (except the SEO people).&lt;br /&gt;&lt;br /&gt;2) Formally structured data is an OLD idea in AI.  
See &lt;a href="http://en.wikipedia.org/wiki/Expert_systems"&gt;expert systems&lt;/a&gt;.  How will the 'semantic web' overcome the basic problem that structuring human knowledge is DAMN hard?  And by hard I mean making it consistent (this is what mostly broke expert systems).&lt;br /&gt;&lt;br /&gt;Have you been following what &lt;a href="http://www.cyc.com/"&gt;Cyc Corp&lt;/a&gt; has been doing since 1984?  Attempting to structure human knowledge.  These guys have invented whole new ways of representing human knowledge.. where is it on the web?  Can anyone tell me an application that uses it?  I am very certain that the Cyc Corp guys could build (and likely already have) a way to export their databases into RDF, OWL, etc.&lt;br /&gt;&lt;br /&gt;Also.. the old white-haired guys of AI invented various forms of Semantics and 'Knowledge Representation' &lt;span style="font-style: italic;"&gt;way-back&lt;/span&gt; in AI history (see chapter 10 of &lt;a href="http://aima.cs.berkeley.edu/"&gt;Russell-Norvig&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Take Home:&lt;/span&gt;  Once you have the knowledge structured and embedded, what happens next?  Magic?  Merely inventing a representation of knowledge relies on the 'if you build it they will come' doctrine of AI.. which has &lt;span style="font-weight: bold;"&gt;NEVER &lt;/span&gt;been true.&lt;br /&gt;&lt;br /&gt;3) Reasoning with said structured knowledge is unsolved in general.  Given a specific knowledge base (or Semantic database) and specific questions (or semantic queries), systems can reason about the question and deliver results.. 
but it's still a garbage-in-garbage-out world.&lt;br /&gt;&lt;br /&gt;This is especially true when most people really expect a search engine to &lt;a href="http://glinden.blogspot.com/2007/10/searchers-say-please-read-my-mind.html"&gt;read their minds&lt;/a&gt; (Sorry Udi - I agree with Greg) or they tend to &lt;a href="http://www.calacanis.com/2007/10/23/search-engine-fatigue-we-hear-ya/"&gt;give up on their search queries&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;How do we prevent such systems from becoming SEO-spammed?  I suppose a reputation system on the source of semantic markup data could be created.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Take Home:&lt;/span&gt;  How in the hell do I build a search engine that uses Semantics, really understands what I am looking for, and delivers me the&lt;span style="font-weight: bold;"&gt; Answer&lt;/span&gt;?  Such a system pretty much is an AI Oracle.&lt;br /&gt;&lt;br /&gt;Ok.. enough with the half-empty-glass negativity!  What can we really do with the Semantic Web NOW?&lt;br /&gt;&lt;br /&gt;For sure we can build a semantically enhanced 'filter' of the web.  Google/Yahoo/MSN/Ask are great, but in the end they are giant databases that serve you up link-graph-weighted &amp;amp; keyword-filtered URLs.&lt;br /&gt;&lt;br /&gt;However, if you are trying to build a money-making business, a new search box that returns URLs seems like an insane idea.. unless you can co-opt the browser and augment the results that the big boys are returning (See &lt;a href="http://www.readwriteweb.com/archives/search_radar_adds_suggestions_to_search_results.php"&gt;Search Radar&lt;/a&gt;).    Or pull a &lt;a href="http://www.stumbleupon.com/"&gt;StumbleUpon&lt;/a&gt; strategy.&lt;br /&gt;&lt;br /&gt;For a business, the Semantic Web is a potential tool, one step along the path to creating a valuable application.  Remember your history here.. 
creating a giant repository and/or formal structure of knowledge will not alone result in something novel.. nor is using it required to create novelty in AI.&lt;br /&gt;&lt;br /&gt;I'd probably make the argument that &lt;a href="http://www.techcrunch.com/2007/09/06/exclusive-screen-shots-and-feature-overview-of-delicious-20-preview/"&gt;delicious&lt;/a&gt; itself (and similar data) is a growing embodiment of a user-generated database that clever software could derive semantic data from.&lt;br /&gt;&lt;br /&gt;I am NOT arguing that the Semantic Web is a bad idea... but be careful of the hype you read.  The Semantic Web is merely the first step (and a hard one) at stitching together knowledge in a way that can be used to &lt;span style="font-style: italic;"&gt;reason&lt;/span&gt;.   The Semantic Web is as necessary for a smarter web as databases are for useful applications... yet the database is the data-store and NOT the application logic.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-4572145612932248376?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/4572145612932248376/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=4572145612932248376' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4572145612932248376'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4572145612932248376'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2007/10/semantic-wishfull-thinking-or-semantics.html' title='Semantic Wishfull Thinking? 
Or Semantics for turing lead into gold?'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-4831391229603507324</id><published>2007-10-19T08:18:00.000-07:00</published><updated>2007-10-19T08:41:25.767-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='recommender systems'/><category scheme='http://www.blogger.com/atom/ns#' term='query understanding'/><title type='text'>Recommender Systems</title><content type='html'>Greg Linden has another insightful post on his blog about &lt;a href="http://glinden.blogspot.com/2007/10/recommender-systems-and-diversity.html"&gt;Recommender Systems&lt;/a&gt;.  He argues that the systems can be tuned to recommend diversity (ala-Netflix), rather than the more too-similar echo chamber of stuff you see sometimes on Amazon.&lt;br /&gt;&lt;br /&gt;Jeremy Liew at LightSpeed VCP had a good post recently about &lt;a href="http://lsvp.wordpress.com/2007/10/02/search-improvements-are-more-about-understanding-queries-better-not-understanding-results-better/"&gt;search query understanding&lt;/a&gt; being the future direction of search.&lt;br /&gt;&lt;br /&gt;In my mind, recommender systems are part of that vision.  A truly great search engine will seek to understand your queries, your query history, personal interests and recommend content.. rather than just give you a keyword-filtered &amp;amp; ranked slice of the web.&lt;br /&gt;&lt;br /&gt;Yet there are other ways to achieve that kind of output.  
Search engines and AI in general are a good distance away from real query understanding (it requires some form of &lt;a href="http://www.ai.rutgers.edu/aaai25/mitchell.htm"&gt;machine reading&lt;/a&gt;).  If instead we consider bootstrapping a recommender system that is driven by people's recommendations on a topic.. we can potentially get there quicker.  This is how you train product recommender systems (with purchase history).&lt;br /&gt;&lt;br /&gt;A system that implicitly follows you around the web and allows your content to be communally shared into an index would at a minimum be a very fresh index of what people are looking at now.  Combining this index with a social network of people (enabling matching of topically relevant users to you) and we have something of a human-filter of the web driving a content recommender.&lt;br /&gt;&lt;br /&gt;Yes, this is what many social URL sharing sites are building now... but do they have the pieces all together to drive people to directed content rather than allowing them to surf the wave of current topics?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-4831391229603507324?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/4831391229603507324/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=4831391229603507324' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4831391229603507324'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/4831391229603507324'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2007/10/recommender-systems.html' title='Recommender 
Systems'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-252052762020160579</id><published>2007-10-18T12:34:00.000-07:00</published><updated>2007-10-19T15:30:49.214-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='beer'/><category scheme='http://www.blogger.com/atom/ns#' term='seattle'/><category scheme='http://www.blogger.com/atom/ns#' term='food'/><title type='text'>Seattle Beer notes</title><content type='html'>My September trip to Seattle included stops at &lt;a href="http://www.kellsirish.com/"&gt;Kells&lt;/a&gt;.  I really loved the Roslyn Brookside Lager from &lt;a href="http://www.roslynbrewery.com/"&gt;Roslyn Brewing&lt;/a&gt;.  It has a wonderful fruity complexity to it, which is unusual for a lager (more like a &lt;a href="http://en.wikipedia.org/wiki/K%C3%B6lsch_%28beer%29"&gt;Kölsch&lt;/a&gt;).  A quick email to the brewer and I learned that he ferments it warm with a lager yeast.&lt;br /&gt;&lt;br /&gt;I also enjoyed the &lt;a href="http://www.baronbeer.com/"&gt;Baron Brewing&lt;/a&gt; Helles Bock served at the &lt;a href="http://tomdouglas.com/palace/index.html"&gt;Palace Kitchen&lt;/a&gt;.  Great malt flavor.  Great food at PK at reasonable prices.&lt;br /&gt;&lt;br /&gt;On this October trip to Seattle I loved the &lt;a href="http://www.feierabendseattle.com/"&gt;Feierabend Pub&lt;/a&gt;.  They have about 18 beers (all German styles) on tap.  I tried/sampled about 5 kinds of Oktoberfest and several other lagers.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.taphousegrill.com/"&gt;Tap House Grill&lt;/a&gt;, 160 draft beers.. need I say more?  This place was impressive.  
I tried two more Baron beers (Pils &amp;amp; Uber-Weiss) and the &lt;a href="http://www.ommegang.com/index.php?mcat=1&amp;amp;scat=3&amp;amp;yr=1"&gt;Brewery Ommegang Hennepin Farmhouse Saison&lt;/a&gt;.  It's good, but my taste buds still prefer the &lt;a href="http://www.newbelgium.com/beers_saison.php"&gt;New Belgium Saison&lt;/a&gt;.. the NB has a nice earthy taste.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-252052762020160579?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/252052762020160579/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=252052762020160579' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/252052762020160579'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/252052762020160579'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2007/10/seattle-beer-notes.html' title='Seattle Beer notes'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-8648453561386550536</id><published>2007-10-17T11:38:00.001-07:00</published><updated>2007-10-17T12:04:01.551-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='people search'/><category scheme='http://www.blogger.com/atom/ns#' term='personalized search'/><category scheme='http://www.blogger.com/atom/ns#' term='attention 
data'/><category scheme='http://www.blogger.com/atom/ns#' term='implicit web'/><title type='text'>Attention IR and People Search</title><content type='html'>The &lt;a href="http://www.sigir2007.org/"&gt;SIGIR 2007&lt;/a&gt; conference also had a couple of gems in the &lt;a href="http://www.sigir2007.org/doctoralconsortium.html"&gt;Doctoral Consortium&lt;/a&gt; workshop. &lt;br /&gt;&lt;p&gt;&lt;em&gt;Krisztian Balog (University of Amsterdam) &lt;a href="http://staff.science.uva.nl/%7Ekbalog/"&gt;homepage&lt;/a&gt;&lt;/em&gt;&lt;br /&gt;&lt;a href="http://staff.science.uva.nl/%7Ekbalog/files/talks/sigir2007-dc.pdf"&gt;   People Search in the Enterprise&lt;span style="font-style: italic;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;Balog's abstract looked at two areas of people search: profiling people, and enabling search of those people based upon both their topical and social profiles.  Who is an expert on X?  Whom do I know (or could get introduced to) who is an expert on X?  His research seems to be just beginning.. I'll be checking his page for new papers.&lt;br /&gt;&lt;a href="http://staff.science.uva.nl/%7Ekbalog/files/talks/sigir2007-dc.pdf"&gt;&lt;span style="font-style: italic;"&gt;&lt;/span&gt;&lt;/a&gt;   &lt;/p&gt;&lt;p&gt;&lt;em&gt;Georg Buscher (German Research Center for AI) &lt;a href="http://www.dfki.uni-kl.de/%7Ebuscher/"&gt;homepage&lt;/a&gt;&lt;/em&gt;&lt;br /&gt;&lt;a href="http://www.dfki.uni-kl.de/%7Ebuscher/publications/Buscher07_AttIR_long.pdf"&gt;   Attention-Based Information Retrieval&lt;/a&gt;   &lt;/p&gt;Buscher won the best presentation award at the workshop.  His slides outline how attention data can be used to bias/rerank IR results to enable re-finding old information/documents, as well as to do query expansion (profile based???) given the current user's attention data.  
His research is also fairly new.&lt;br /&gt;&lt;br /&gt;Both of these topics are obviously of interest to &lt;a href="http://www.othersonline.com/"&gt;Others Online&lt;/a&gt; and the idea of connecting people together through a common topic or set of topics that are learned as implicitly related to the users.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-8648453561386550536?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/8648453561386550536/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=8648453561386550536' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/8648453561386550536'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/8648453561386550536'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2007/10/attention-ir-and-people-search.html' title='Attention IR and People Search'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-2879880806191765300</id><published>2007-10-17T10:16:00.000-07:00</published><updated>2007-10-17T11:58:24.716-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='search engines'/><category scheme='http://www.blogger.com/atom/ns#' term='social search'/><category scheme='http://www.blogger.com/atom/ns#' term='learning to rank'/><category 
scheme='http://www.blogger.com/atom/ns#' term='reranking'/><title type='text'>Learning to Rank</title><content type='html'>&lt;a href="http://www.sigir2007.org/"&gt;SIGIR 2007&lt;/a&gt; (which I unfortunately did not attend) had a really great workshop called '&lt;a href="http://research.microsoft.com/users/LR4IR-2007/"&gt;Learning to Rank&lt;/a&gt;' or LTR.  The weekly RightNow-organized Bozeman &lt;a href="http://labs.rightnow.com/colloquium/papers.php"&gt;AI Colloquium&lt;/a&gt; recently covered two papers in this area.  Essentially the idea is that a search engine can implicitly learn to rank documents for a given query by looking at user behavior.&lt;br /&gt;&lt;br /&gt;The first one we covered (&lt;a href="http://jenyuan.yeh.googlepages.com/jyyeh-LR4IR07.pdf"&gt;by Yeh, Lin, Ke &amp;amp; Yang&lt;/a&gt;) used genetic programming to do the learning.  Needless to say this caught my eye.  Evolutionary Algorithms are built to learn rankings, usually based upon a fitness function.  I found this paper interesting; however, even the authors admit that their algorithm is very slow.&lt;br /&gt;&lt;br /&gt;In my mind they picked too complex an algorithm.  There are far simpler EAs that can do this job.  The well-known (n+1) EA could do this task (per query).  I'll likely be writing a paper on this for &lt;a href="http://www.sigevo.org/gecco-2008/index.html"&gt;GECCO 2008&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Many of the workshop papers reference work by Joachims and Radlinski (&lt;a href="http://www.cs.cornell.edu/People/tj/"&gt;find them here&lt;/a&gt;).  Their recent paper in IEEE Computer (not available for free) was interesting in that they used an LTR method to re-rank Google results and then did a user study to look at how effective the method was.&lt;br /&gt;&lt;br /&gt;Personally I think that LTR should be a component of every search engine.  
The ranking of search results should change as fast as users interact with the content, rather than how fast the content itself changes.   This is something that the big search engines are fairly quiet on, not sure why.&lt;br /&gt;&lt;br /&gt;Sure it's an incremental rather than revolutionary step (Powerset is trying to take a revolutionary step), however can anyone give me a good argument why LTR should not be done?  The idea can be applied to any engine.. keyword, link-graph (Google) or NLP based (Powerset).&lt;br /&gt;&lt;br /&gt;Taking the next step beyond that, the next big thing could very well be doing an LTR method per-person or per-peer-group for each query family.  This effectively would allow the engine to self-learn to personalize results.  One can imagine how this could be glued into the idea of using the 'social graph' to establish the peer-group on a given topic/query.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-2879880806191765300?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/2879880806191765300/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=2879880806191765300' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/2879880806191765300'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/2879880806191765300'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2007/10/learning-to-rank.html' title='Learning to Rank'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-8317107088128217824</id><published>2007-09-21T15:00:00.000-07:00</published><updated>2007-10-17T11:59:31.806-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='snakeoil'/><category scheme='http://www.blogger.com/atom/ns#' term='old ideas'/><category scheme='http://www.blogger.com/atom/ns#' term='learning from past mistakes'/><title type='text'>More old wine in new Web 2.0 bottles</title><content type='html'>In light of last week's post on "beware of old AI wine in new Web 2.0 bottles" I wanted to post this link from &lt;a href="http://www.joelonsoftware.com/items/2007/09/18.html"&gt;Joel Spolsky&lt;/a&gt;. (A buddy of mine brought my attention to it)&lt;br /&gt;&lt;br /&gt;Once in a while Joel posts "strategy letters".  This one addresses history repeating itself in the old-html-web -> ajax-web paralleling  the text-terminal -> windows-api flow.  Very true.  I did find it odd that he did not mention the &lt;a href="http://code.google.com/webtoolkit/"&gt;Google Web Toolkit&lt;/a&gt; in his thoughts about the potential game-changing "NewSDK" and how it needs fancy new compilers.  As a CS geek I think the idea of compiling Java to cross-browser-compliant JavaScript is a simply amazing technical achievement.&lt;br /&gt;&lt;br /&gt;The interesting thing about this topic is that it's what Java was supposed to do for the browser back in the 90s.  
Didn't work, no one could keep their browser &amp;amp; the JVMs synced and integrated well, plus Microsoft managed to run good interference via IE just being crappy at Java/JVMs at that time.&lt;br /&gt;&lt;br /&gt;Turns out that Java succeeded wildly in reinventing the way back-end web services are written (CGIs just don't cut it for some things despite the PHP/Perl/Python crowd making CGIs way more useful than before.)  On the browser today's Java-JVM is the javascript-engine (which is not java at all).  The idea of using JS as byte-code makes me cringe, but it's where we are.&lt;br /&gt;&lt;br /&gt;Nice post Joel!   Would be nice if he'd follow up on why he thinks that the Google Toolkit, &lt;a href="http://developer.yahoo.com/yui/"&gt;Yahoo's YUI&lt;/a&gt; and others aren't yet (or won't get to) the definition of his NewSDK.&lt;br /&gt;&lt;br /&gt;&lt;a onclick="return top.js.OpenExtLink(window,event,this)" href="http://www.joelonsoftware.com/items/2007/09/18.html" target="_blank"&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-8317107088128217824?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/8317107088128217824/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=8317107088128217824' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/8317107088128217824'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/8317107088128217824'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2007/09/more-old-wine-in-new-web-20-bottles.html' title='More old wine in new Web 2.0 
bottles'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-1933520327526697644</id><published>2007-09-10T14:43:00.000-07:00</published><updated>2007-09-10T23:47:26.087-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='personalized search'/><category scheme='http://www.blogger.com/atom/ns#' term='search engines'/><category scheme='http://www.blogger.com/atom/ns#' term='implicit web'/><category scheme='http://www.blogger.com/atom/ns#' term='collective search'/><title type='text'>The Implicit Web flowing into Collective Search</title><content type='html'>Here are some recent articles that I read and kept thinking about again and again.  What is cool about this moment in time is that these things are gelling.  Entrepreneurs and innovators are trying to build this stuff, rather than the ideas rotting unfulfilled in the mind of some AI/Search-Engine geek.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.readwriteweb.com/archives/the_implicit_web_lastfm_amazon_google.php"&gt;Read/Write Web's Implicit Web&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Important point here is that systems should both learn what users are interested in implicitly and allow users control over the learned topics.  The former point is what algorithms like collaborative filtering were intended to do.  The latter is a great point that users should  have visibility and control into their learned topics.&lt;br /&gt;&lt;br /&gt;This has been a frequent  critique against Amazon's  recommender system.. while personalized, it can learn goofy things.  
I have no desire to be a frequent buyer of items similar to what I bought for a niece as a gift last year.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://glinden.blogspot.com/2007/08/collective-search-versus.html"&gt;Collective Search by Greg Linden&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I just learned that Greg is one of the brains behind Amazon's AI.  Thinking about the data Amazon has and what could be done with it always makes me drool.  Greg's post here is an aggregation of points he came up with while reading transcripts of the recent SES 2007 conference.&lt;br /&gt;&lt;br /&gt;I'll join Ask's Jim Lanzone (isn't the new Ask.com much better than Google?) in saying that collective search is potentially better than personalized search.  Greg is arguing for a redefinition of 'personalization' here, but we have to pick descriptive terms for abstract ideas.  I would define personalization as skewing of search results by what &lt;span style="font-weight: bold;"&gt;you&lt;/span&gt; are interested in, whereas I'd read collective search as letting the collective behaviors of a &lt;span style="font-weight: bold;"&gt;group&lt;/span&gt; of similar users influence/skew search results.  This is the flavor of stuff I worked on at RightNow.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.informationweek.com/story/showArticle.jhtml?articleID=201202986"&gt;Ultimate Answer Engine @ Information Week&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Favorite quote:  "Who said an edit box and 10 blue links is what search is?" asks Microsoft's Satya Nadella.&lt;br /&gt;&lt;br /&gt;This great piece has several items that just jumped out at me.  "Queryless Search": essentially, this uses what the system knows about you and your path to the engine to run an implicit query.  (We also worked on and patented variations of this idea at RightNow.)  The "Personalization" and "Social Skills" sections deal with the ideas in Greg's post above.   
More to come on that re 'The Social Graph'.&lt;br /&gt;&lt;br /&gt;Another good quote: "Serendipity is an amazing teacher".  This is what Others Online is all about... focused on People, not necessarily documents/media.&lt;br /&gt;&lt;br /&gt;After reading all three of these in the current context of what people are willing to spend time and money  on... I can't help but be totally jacked about the opportunities at hand!&lt;br /&gt;&lt;br /&gt;Loads of academics have been working on this stuff for years, check out any ACM SIGIR and various data mining conference proceedings for the last 10+ years.  Personally, I've been thinking and working on many of the things above since 2000 when Doug Warner and I started doing a deep dive into the academic literature.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-1933520327526697644?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/1933520327526697644/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=1933520327526697644' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1933520327526697644'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/1933520327526697644'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2007/09/implicit-web-flowing-into-collective.html' title='The Implicit Web flowing into Collective Search'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-886996210906972978</id><published>2007-09-07T10:26:00.000-07:00</published><updated>2007-09-07T16:17:56.151-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='AI'/><category scheme='http://www.blogger.com/atom/ns#' term='search engines'/><category scheme='http://www.blogger.com/atom/ns#' term='facebook'/><category scheme='http://www.blogger.com/atom/ns#' term='chacha'/><category scheme='http://www.blogger.com/atom/ns#' term='scoble'/><category scheme='http://www.blogger.com/atom/ns#' term='mahalo'/><title type='text'>The "social graph" and search engines</title><content type='html'>Robert Scoble recently posted about &lt;a href="http://scobleizer.com/2007/08/26/why-mahalo-techmeme-and-facebook-are-going-to-kick-googles-butt-in-four-years/"&gt;Mahalo, TechMeme and Facebook versus Google.&lt;/a&gt;  His thesis is basically that somehow blending social networks with search engines will be the next big thing.  He also comments (as have others) that searching blogs can sometimes get better results than the major search engines.&lt;br /&gt;&lt;br /&gt;Danny Sullivan chimed in with a &lt;a href="http://searchengineland.com/070827-121805.php"&gt;blistering commentary&lt;/a&gt; on both Scoble's "new ideas" and &lt;a href="http://www.mahalo.com/"&gt;Mahalo&lt;/a&gt; (run by &lt;a href="http://www.calacanis.com/"&gt;Jason Calacanis&lt;/a&gt;).  Mahalo and &lt;a href="http://www.chacha.com/"&gt;ChaCha&lt;/a&gt; are both 'human powered' search engines.  Basically they take popular search terms and use editors to augment and/or reorganize Google results.&lt;br /&gt;&lt;br /&gt;First, a history review.  
Way back, Yahoo built its people-powered directory; while initially useful, it could not keep up with the growth of the internet.  Google came along with a simple idea called PageRank (it essentially forms a Markov model of the web and computes the &lt;a href="http://en.wikipedia.org/wiki/Pagerank"&gt;stationary distribution of the Markov matrix&lt;/a&gt; - an 80+ year old idea applied to the web) and killed Yahoo's directory as well as purely keyword-based engines like AltaVista.&lt;br /&gt;&lt;br /&gt;More History.  Once upon a time in the 60s-80s &lt;a href="http://en.wikipedia.org/wiki/Expert_systems"&gt;expert systems&lt;/a&gt; were seen as the next big thing in AI.  The promise: solve all the world's problems by letting a formal system of rules and facts answer questions posed to it.  ES was a miserable failure at these lofty goals.  Why?  Growing the rulebase is hard.  Humans do a terrible job at crafting rulesets that are complete and consistent (no conflicts).  Even worse is when you throw multiple people at crafting rules together.  You end up with trash.&lt;br /&gt;&lt;br /&gt;Why is this relevant here?  The lesson of ES seems to be lost on efforts like ChaCha and Mahalo.  These systems are built on very basic rules (if query X then return A, B, C, D ...).  Granted, these are much simpler rules than a typical ES, and the engines don't support real reasoning using backward or forward chaining either.  This may not save them: the rules will still suffer from the huge maintenance problem in a context where the information captured is dynamic and changing.  Just ask any of the dozen 80s companies that tried to build medical diagnosis expert systems.  The rules suffered from inattention to medical advances as well as being contradictory (multiple doctors with different ideas making rules).&lt;br /&gt;&lt;br /&gt;Nowadays we call this "linkrot" on the web.  While successful, sites like About.com suffered from linkrot on pages not frequently edited.  
How will ChaCha and Mahalo avoid this without having a massive number of editors?  &lt;a href="http://del.icio.us/"&gt;Del.icio.us&lt;/a&gt; itself suffers from the same issues: people tag stuff and it mostly rots, unorganized and unmaintained.&lt;br /&gt;&lt;br /&gt;Yet More History.  From about 1999 to 2003 AskJeeves.com sold software in the emerging web eCRM space in addition to having a search engine.  Web eCRM (or web self-service) is essentially creating a customer service portal for corporate websites.  The portal contains a collection of FAQs, articles, HowTos, manuals, etc.  The essential function of the portal is to help people find what they are looking for and keep them from dialing the 1-800 customer service number (which typically costs a company about $30 per call).  AskJeeves sold their CRM and enterprise search unit in 2003 for less than 5 million dollars.  Why?  Their system required manual input of a huge set of rules linking search queries and documents, as well as complex rules to equate queries to other queries and attempt to do some Natural Language Processing and Inference.&lt;br /&gt;&lt;br /&gt;It didn't work: there was no way in hell that an average business user who maintained this set of articles, FAQs, etc. was prepared to do the massive amount of structuring.  AskJeeves attempted to hire a team of people to optimize and tune the implementations.  It took weeks of learning the business and translating that into structure for the engine to use.  Nowadays we call this SEO.&lt;br /&gt;&lt;br /&gt;Another example in CRM is the 'chatbot'.  These are software products that try to give a user a good customer experience by putting a cute face/persona on the search box and having it talk back to you in a conversational style.  They have never really taken off, despite the CRM industry analysts that love them.  They suffer from the same basic problem that expert systems (chatbots are expert systems of a sort) suffered from: 
structuring information is hard for most people to do.&lt;br /&gt;&lt;br /&gt;For the past 8 years I've been working for a CRM company (&lt;a href="http://www.rightnow.com/"&gt;RightNow Tech&lt;/a&gt;) that had a simple idea to help customer service web portals: implicitly learn from what users are doing in the portal to optimize the engine automatically.  (See patents &lt;a href="http://www.google.com/patents?id=v9kLAAAAEBAJ"&gt;6434550&lt;/a&gt;, &lt;a href="http://www.google.com/patents?id=cgoPAAAAEBAJ"&gt;6665655&lt;/a&gt;, and &lt;a href="http://www.google.com/patents?id=Y6QTAAAAEBAJ"&gt;6842748&lt;/a&gt; - at the moment the RNT systems process about 100 million searches per month).  The cutting edge of eservice CRM at the moment is taking that type of idea and THEN adding (or learning) structure to it.&lt;br /&gt;&lt;br /&gt;Lessons learned and observations:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Study the basic history of AI.&lt;/span&gt;  Here's a good book: &lt;a href="http://aima.cs.berkeley.edu/"&gt;Artificial Intelligence: A Modern Approach&lt;/a&gt;.&lt;br /&gt;Note that one of the authors (Peter Norvig) is the Director of Research at Google.  Prabhakar Raghavan is his counterpart at Yahoo.  Ask.com and Microsoft also have strong AI people.  There is no secret as to why these four companies are hiring all the good AI people they can relocate to the Bay Area, Seattle and New Jersey.  &lt;span style="font-weight: bold;"&gt;You will not beat them with an expert system.&lt;/span&gt;  A secondary lesson of AI is to never believe someone who tries to tell you that a new algorithm will create intelligence (neural networks anyone?  
Fuzzy Logic?).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Look at industries like CRM as a microcosm of the search industry.&lt;/span&gt;  For every new idea you have, someone in CRM has likely tried it already on a smaller scale.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Beware of old wine in new bottles.&lt;/span&gt;  You might be able to spend enough money on PR to help you get attention.. but you will likely die unless you invest in real scalable algorithms to do the work.&lt;br /&gt;&lt;br /&gt;I'm certainly not intending to down-grade ChaCha and Mahalo as viable businesses.  Often the viability of a business is independent of the technology used.  They seem to have plenty of funding, and will likely adapt as they see problems.  A babe-in-the-woods can't get 20 million in VC money.  Neither of these systems will require boiling-the-ocean and implementing strong AI.  Spinning a tight loop on what users are looking for and optimizing those results as fast as possible might work long enough to make some cash... it worked to bootstrap Yahoo after all.&lt;br /&gt;&lt;br /&gt;As for the social-network blending into standard search?  Stay tuned, I'll post some thoughts on that soon.  There are plenty of good AI people working on graph based data mining.&lt;br /&gt;&lt;br /&gt;Circling back to expert systems, if you can automatically 'read' text, and induce a rule-base.. then use that to help with queries, then we have something.  I believe the direction of search engines will slowly head in this direction... 
&lt;a href="http://www.ai.rutgers.edu/aaai25/mitchell.htm"&gt;machine reading&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Jordan Mitchell (my new boss at OthersOnline.com) recently posted on the same subject on his &lt;a href="http://kickstand.typepad.com/metamuse/2007/08/social-graph-mi.html"&gt;blog&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Other interesting links about this:&lt;br /&gt;&lt;a href="http://www.skrenta.com/2007/08/some_thoughts_on_mahalo.html"&gt;Skrentablog on Mahalo&lt;/a&gt;&lt;br /&gt;&lt;a href="http://feedblog.org/2007/08/27/google-will-index-the-social-graph/"&gt;Kevin Burton's Thoughts on the Social Graph&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-886996210906972978?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/886996210906972978/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=886996210906972978' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/886996210906972978'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/886996210906972978'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2007/09/social-graph-and-search-engines.html' title='The &quot;social graph&quot; and search engines'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-2990092337601187726</id><published>2007-09-06T22:50:00.001-07:00</published><updated>2007-09-06T23:45:13.419-07:00</updated><title type='text'>New Job - OthersOnline.com</title><content type='html'>I just started a new job at &lt;a href="http://www.othersonline.com/"&gt;OthersOnline.com&lt;/a&gt;.  It's a new startup with a social networking spin. We let users declare themselves, their pages and interests, then be syndicated around the web via the OO Widget (see it to the right).  We also have a browser toolbar that allows users to see other people relevant to the user's own interests and the content of the current webpage.   I think my official title is the "Search Guy" or "AI Guy" or something.  The potential of these two basic ideas is huge, and I'm wading in chest deep to put some great AI ideas into the systems.  More posts coming soon on these topics.&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;I spent the last (nearly) eight years working at &lt;a href="http://www.rightnow.com/"&gt;RightNow Technologies&lt;/a&gt; (a CRM SAAS company - once upon a time it was a small startup as well) in the &lt;a href="http://labs.rightnow.com/"&gt;AI Research Labs&lt;/a&gt;.  At RNT I was in charge of implementing various search engines, data mining &amp;amp; nlp algorithms, swarm techniques, user interfaces, analytics, and whatever AI I could throw at the basic problem of enabling endusers to find information on approx 2000+ customer service portals around the web (here is &lt;a href="http://leapfrog.custhelp.com/cgi-bin/leapfrog.cfg/php/enduser/std_alp.php"&gt;Leapfrog's Portal&lt;/a&gt;).  
I spent most of the last six months becoming the project manager of the group, responsible for multiple projects, coordinating with product management, initiating new feature ideas, etc.  It's a fantastic group to work for, and has an application for about any advanced CS topic there is.  A more complete synopsis is on my &lt;a href="http://www.cs.montana.edu/%7Erichter/resume-2007.php"&gt;resume&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;(At some point in 2008 I will hopefully finish a PhD in CS at Montana State - topic is Theory of Genetic Algorithms)&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-2990092337601187726?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/2990092337601187726/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=2990092337601187726' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/2990092337601187726'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/2990092337601187726'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2007/09/new-job-othersonlinecom.html' title='New Job - OthersOnline.com'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-17598372.post-7897739146933306862</id><published>2007-09-06T22:01:00.000-07:00</published><updated>2008-12-08T15:44:35.929-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='introductions'/><title type='text'>New Blog</title><content type='html'>I have put off starting a blog for too long; the old homepage is too static.  I'll use this space to muse about artificial intelligence, search engines, machine learning, social media &amp;amp; widgets, my career, PhD dissertation progress, Montana, fishing and good beer.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.cs.montana.edu/%7Erichter/"&gt;My Montana State University homepage&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;RSS Feed of this Blog &lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_eZUtZFDoqLA/Rxfol0fwt9I/AAAAAAAAAA8/OFUj3x-sTi8/s1600-h/rss_icon.gif"&gt;&lt;img style="cursor: pointer;" src="http://1.bp.blogspot.com/_eZUtZFDoqLA/Rxfol0fwt9I/AAAAAAAAAA8/OFUj3x-sTi8/s200/rss_icon.gif" alt="" id="BLOGGER_PHOTO_ID_5122818837601892306" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/17598372-7897739146933306862?l=aicoder.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://aicoder.blogspot.com/feeds/7897739146933306862/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=17598372&amp;postID=7897739146933306862' title='1 
Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/7897739146933306862'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/17598372/posts/default/7897739146933306862'/><link rel='alternate' type='text/html' href='http://aicoder.blogspot.com/2007/09/new-blog.html' title='New Blog'/><author><name>Neal</name><uri>http://www.blogger.com/profile/06306714297735275545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_eZUtZFDoqLA/SzkW1gTbAKI/AAAAAAAAAC4/Nj4N5fPUVxA/S220/headshot_highdef.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_eZUtZFDoqLA/Rxfol0fwt9I/AAAAAAAAAA8/OFUj3x-sTi8/s72-c/rss_icon.gif' height='72' width='72'/><thr:total>1</thr:total></entry></feed>
