I've been attending the University of Utah Machine Learning seminar (when I can) this fall. PhD student Piyush Kumar Rai is organizing it.
I volunteered to take the group through Learning to Rank with Partially-Labeled Data by Duh & Kirchhoff. I have some experience researching and implementing LTR algorithms, mostly using reinforcement learning or ant-system-type approaches. Some general intro here.
The paper presents the main transductive learning algorithm as a framework, then fills in the blanks with Kernel PCA and RankBoost. Several kernels are used: linear, polynomial, radial basis function, and kNN-diffusion. RankBoost learns a kind of ensemble of 'weak learners' with simple thresholds.
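For reference, the first three kernels are just the standard formulas. Here's a minimal numpy sketch (the hyperparameter names and defaults are mine, not the paper's):

```python
import numpy as np

def linear_kernel(X, Y):
    # K(x, y) = x . y
    return X @ Y.T

def polynomial_kernel(X, Y, degree=2, c=1.0):
    # K(x, y) = (x . y + c)^d
    return (X @ Y.T + c) ** degree

def rbf_kernel(X, Y, gamma=1.0):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Y**2, axis=1)[None, :]
                - 2 * X @ Y.T)
    return np.exp(-gamma * sq_dists)
```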
The main reason to read the paper if you are already familiar with LTR is the use of the transductive algorithm.
Note the DISCOVER() and LEARN() functions. These are the unsupervised and supervised blanks in the framework, which they fill with Kernel PCA and RankBoost respectively. What the first actually does is learn a function we could call EXTRACT() that extracts or creates features for later use. They do show that the basic idea of layering unlabeled data in with labeled data is a net gain.
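Here is a schematic of the loop as I read it. This is my own simplification for illustration, not the authors' code: DISCOVER() and LEARN() are the paper's names, but kernel_pca_discover, the per-query data layout, and the single shared ranker are my assumptions.

```python
import numpy as np

def kernel_pca_discover(X_all, kernel, n_components):
    """DISCOVER(): Kernel PCA over labeled + unlabeled documents;
    returns the projected (extracted) features for every document."""
    K = kernel(X_all, X_all)
    n = K.shape[0]
    one_n = np.ones((n, n)) / n
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n   # center the kernel
    vals, vecs = np.linalg.eigh(Kc)
    idx = np.argsort(vals)[::-1][:n_components]           # top eigenpairs
    alphas = vecs[:, idx] / np.sqrt(np.maximum(vals[idx], 1e-12))
    return Kc @ alphas                                    # projections of every doc

def transductive_rank(queries, kernel, n_components, learn):
    """Sketch: run DISCOVER() per query on that query's labeled + unlabeled
    documents, then hand the labeled documents' new features to LEARN()
    (e.g. RankBoost). `queries` is a list of (X_labeled, y, X_unlabeled)."""
    Z_train, y_train = [], []
    for X_labeled, y, X_unlabeled in queries:
        X_all = np.vstack([X_labeled, X_unlabeled])
        Z = kernel_pca_discover(X_all, kernel, n_components)  # DISCOVER()
        Z_train.append(Z[:len(X_labeled)])                    # labeled docs' features
        y_train.append(y)
    return learn(np.vstack(Z_train), np.concatenate(y_train)) # LEARN()
```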
There are some issues with the paper. First, the computational performance, as they admit, is not good. The other is that their use of Kernel PCA in an information retrieval context is a bit naive, IMO. The IR literature is full of hard-won knowledge about extracting decent features from documents. See this book for example. That knowledge is mostly ignored here.
The more confusing thing is the use of K-nearest-neighbor diffusion kernels. Basically they take the document vectors, form a matrix from Euclidean distances, and then random-walk the matrix for a set number of time steps. The PCA then takes this 'kernel' output and solves the eigenvalue problem to get the eigenvectors. This all seems a roundabout way of saying they approximated the Perron-Frobenius eigenvector (sometimes called PageRank) by iterating the matrix a set number of times and zeroing out low-order cells. Or at least I see no effective difference between what they did and what I just described. Basically they just make the matrix sparse to make it easier to solve (i.e. this is the dual).
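To make the comparison concrete, here's a rough sketch of the diffusion construction as I understand it, next to plain power iteration. The Gaussian weighting, the k and t defaults, and the row normalization are my guesses at the details, not the paper's exact recipe:

```python
import numpy as np

def knn_diffusion_kernel(X, k=10, t=5):
    """Build a kNN affinity graph from Euclidean distances, row-normalize it
    into a transition matrix, and run the random walk for t steps."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T        # pairwise squared distances
    A = np.exp(-d2)                                      # similarity weights (my choice)
    np.fill_diagonal(A, 0.0)
    for i in range(A.shape[0]):                          # keep each row's k nearest
        cutoff = np.sort(A[i])[-k]                       # neighbours (sparsify)
        A[i, A[i] < cutoff] = 0.0
    A = np.maximum(A, A.T)                               # symmetrize
    P = A / A.sum(axis=1, keepdims=True)                 # row-stochastic transition matrix
    return np.linalg.matrix_power(P, t)                  # t-step random walk

def power_iteration(P, iters=50):
    """Left power iteration on the same matrix: for an ergodic walk this
    converges to the stationary (Perron-Frobenius / PageRank-style) vector
    that running the walk 'long enough' would also give you."""
    v = np.ones(P.shape[0]) / P.shape[0]
    for _ in range(iters):
        v = v @ P
        v /= np.abs(v).sum()
    return v
```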
Their use of various classic IR features like TF-IDF, BM25, etc. needed help. There's plenty of IR wisdom on how to use such features; why let DISCOVER() wander about attempting to rediscover it? The results were also muddled, with only one of the three data sets showing a significant improvement over the baseline technique.
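For readers who haven't met BM25, this is roughly the kind of feature that gets fed in. A minimal sketch of a common variant of the formula (the parameter defaults are the conventional ones, not taken from the paper):

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_doc_len, k1=1.2, b=0.75):
    """BM25 score of a document (list of terms) against a query (list of terms).
    doc_freq maps a term to the number of documents containing it."""
    score, doc_len = 0.0, len(doc_terms)
    for term in query_terms:
        df = doc_freq.get(term, 0)
        if df == 0:
            continue
        tf = doc_terms.count(term)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return score
```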
All that aside, it's worth a read for the intro to the transductive algorithm applied to an IR-centric task.