Overview
This book, written by Tie-Yan Liu of Microsoft Research Asia (MSRA), introduces the basic concepts and methods of learning to rank. Learning to rank refers to using machine learning methods to rank web pages; for readers with a machine learning background, the material should be fairly easy to follow. The following excerpt (part of the table of contents and the opening of the Introduction) is taken from the book.
Contents (excerpt):

4.1 Direct Optimization of IR Evaluation Measures
4.2 Minimization of Listwise Ranking Losses
4.3 Discussions
5 Analysis of the Approaches
5.1 The Pointwise Approach
5.2 The Pairwise Approach
5.3 The Listwise Approach
5.4 Discussions
6 Benchmarking Learning-to-Rank Algorithms
6.1 The LETOR Collection
6.2 Experimental Results on LETOR
7 Statistical Ranking Theory
7.1 Conventional Generalization Analyses on Ranking
7.2 A Query-level Ranking Framework
7.3 Query-level Generalization Analysis
7.4 Discussions
8 Summary and Outlook
References
Acknowledgements

1 Introduction

With the fast development of the Web, every one of us is experiencing a flood of information. It was estimated that there were about 25 billion pages on the Web as of October 2008 (http://www.worldwidewebsize.com/), which makes it generally impossible for common users to locate their desired information by browsing the Web. As a consequence, efficient and effective information retrieval (IR) has become more important than ever, and search engines (or IR systems) have become an essential tool for many people.

Ranking is a central problem in IR. Many IR problems are by nature ranking problems, such as document retrieval, collaborative filtering [58], key term extraction [30], definition finding [130], important email routing [23], sentiment analysis [94], product rating [36], and anti Web spam [56]. In this tutorial, we will mainly take document retrieval as an example. Note that document retrieval is not a narrow task: Web pages, emails, academic papers, books, and news stories are just a few of the many examples of documents. There are also many different ranking scenarios for document retrieval of interest.

Scenario 1: Rank the documents purely according to their relevance with regard to the query.

Scenario 2: Consider the relationships of similarity [118], website structure [35], and diversity [139] between documents in the ranking process. This is also referred to as relational ranking [102].

Scenario 3: Aggregate several candidate ranked lists to get a better ranked list. This scenario is also referred to as meta search [7]. The candidate ranked lists may come from different index servers or different vertical search engines, and the target ranked list is the final result presented to users.

Scenario 4: Find whether and to what degree a property of a webpage influences the ranking result. This is referred to as "reverse engineering" in search engine optimization (SEO) (see http://www.search-marketing.info/newsletter/reverse-engineering.htm).

To tackle the problem of document retrieval, many heuristic ranking models have been proposed and used in the literature of IR. Recently, given the amount of potential training data available, it has become possible to leverage machine learning (ML) technologies to build effective ranking models. Specifically, we call those methods that learn how to combine predefined features for ranking by means of discriminative learning "learning-to-rank" methods.
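To make "combining predefined features by discriminative learning" concrete, here is a minimal, hypothetical sketch (not from the book): each document is represented by a vector of predefined features, and a weight vector learned by some ranking algorithm turns the features into a single relevance score. The feature choice (BM25 and PageRank scores) and the weight values are illustrative assumptions only.

```python
import numpy as np

def rank_documents(feature_matrix: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Rank documents for one query with a learned linear scoring function.

    feature_matrix: (n_docs, n_features) predefined features per document
                    (hypothetically: BM25 score, PageRank score).
    weights:        (n_features,) weight vector assumed to come from some
                    discriminative learning-to-rank algorithm.
    Returns document indices sorted from most to least relevant.
    """
    scores = feature_matrix @ weights   # one relevance score per document
    return np.argsort(-scores)          # sort descending by score

# Illustrative usage: 3 candidate documents, 2 features, made-up weights.
X = np.array([[2.1, 0.3],
              [1.4, 0.9],
              [0.5, 0.1]])
w = np.array([0.7, 0.3])                # hypothetical learned weights
print(rank_documents(X, w))             # -> [0 1 2]
```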
In recent years, learning to rank has become a very hot research direction in IR, and a large number of learning-to-rank algorithms have been proposed. We can foresee that learning to rank will have an even bigger impact on IR in the future. When a research area comes to this stage, several questions naturally arise.

• To what respect are these learning-to-rank algorithms similar, and in which aspects do they differ? What are the strengths and weaknesses of each algorithm?
• Empirically, which of those many learning-to-rank algorithms perform the best? What kind of datasets can be used to make a fair comparison among different learning-to-rank algorithms?
• Theoretically, is ranking a new machine learning problem, or can it be simply reduced to existing machine learning problems? What are the unique theoretical issues for ranking that should be investigated?
• Are there many remaining issues regarding learning to rank to study in the future? What are they?

The above questions have been brought to the attention of the IR and ML communities in a variety of contexts, especially during recent years. The aim of this tutorial is to review recent work that attempts to answer these questions. Needless to say, a comprehensive understanding of the task of ranking in IR is the key to finding the right answers. Therefore, we will first give a brief introduction of ranking in IR, and then formalize the problem of learning to rank so as to set the stage for the upcoming detailed reviews.

1.1 Ranking in Information Retrieval

In this section, we briefly review representative ranking models in the literature of IR, and introduce how these models are evaluated.

1.1.1 Conventional Ranking Models for IR

In the literature of IR, many ranking models have been proposed [8]. They can be roughly categorized as query-dependent models and query-independent models.

Query-dependent Models

The early models retrieve documents based on the occurrences of the query terms in the documents. Examples include the Boolean model. Basically these models can only predict whether a document is relevant to the query or not, but cannot predict the degree of relevance.

To further model the relevance degree, the Vector Space Model (VSM) was proposed [8]. Both documents and queries are represented as vectors in a Euclidean space, in which the inner product of two vectors can be used to measure their similarity. To get an effective vector representation of the query and the documents, TF-IDF weighting has been widely used. The TF of a term t in a vector is defined as the normalized number of its occurrences in the document, and its IDF is defined as follows:

IDF(t) = \log \frac{N}{n(t)}    (1.1)

where N is the total number of documents in the collection, and n(t) is the number of documents containing term t.
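As a small illustration of TF-IDF weighting and Eqn. (1.1), here is a hedged sketch (not from the book): documents are plain token lists, TF is the normalized count defined above, and query-document similarity is the inner product restricted to query terms (query-side weights taken as 1). The zero-IDF fallback for terms absent from the collection is an assumption.

```python
import math
from collections import Counter

def tf(term: str, doc: list[str]) -> float:
    # Normalized term frequency: occurrences divided by document length.
    return doc.count(term) / len(doc)

def idf(term: str, docs: list[list[str]]) -> float:
    # Eqn. (1.1): log(N / n(t)); n(t) = number of docs containing the term.
    n_t = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_t) if n_t else 0.0   # assumed fallback

def tfidf_similarity(query: list[str], doc: list[str],
                     docs: list[list[str]]) -> float:
    # Inner product of query and document TF-IDF vectors over query terms.
    return sum(tf(t, doc) * idf(t, docs) for t in set(query))

docs = [["learning", "to", "rank"], ["rank", "pages"], ["web", "pages"]]
print(tfidf_similarity(["rank", "pages"], docs[1], docs))
```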
While the VSM implies an assumption of independence between terms, Latent Semantic Indexing (LSI) [37] tries to avoid this assumption. In particular, Singular Value Decomposition (SVD) is used to linearly transform the feature space, and thus a "latent semantic space" is generated. Similarity in this new space is then used to define the relevance between the query and the documents.

As compared with the above, models based on the probabilistic ranking principle [83] garnered more attention and achieved more success in the past decades. Famous ranking models like the BM25 model [111] and the language model for IR can both be categorized as probabilistic ranking models.

The basic idea of BM25 is to rank documents by the log-odds of their relevance. Actually BM25 is not a single model, but defines a whole family of ranking models, with slightly different components and parameters. One of the popular instantiations of the model is as follows. Given a query q containing terms t_1, ..., t_M, the BM25 score of a document d is computed as below:

BM25(d, q) = \sum_{i=1}^{M} \frac{IDF(t_i) \cdot TF(t_i, d) \cdot (k_1 + 1)}{TF(t_i, d) + k_1 \cdot (1 - b + b \cdot \frac{LEN(d)}{avdl})}    (1.2)

where TF(t, d) is the term frequency of t in document d, LEN(d) is the length (number of words) of document d, and avdl is the average document length in the text collection from which documents are drawn. k_1 and b are free parameters, and IDF(t) is the IDF weight of term t, computed by using Eqn. (1.1), for example.

Note that there are many different definitions of TF and IDF in the literature. Some are purely based on the frequency and the others include smoothing or normalization [116]. Here we just give some simple examples to illustrate the main idea. (The name of the actual model is BM25. In the right context, however, it is usually referred to as "Okapi BM25", since the Okapi system was the first system to implement this model.)
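Here is a minimal sketch of the BM25 instantiation in Eqn. (1.2) (not from the book): the parameter values k1 = 1.2 and b = 0.75 are common defaults assumed for illustration, IDF is computed with Eqn. (1.1), and raw counts are used for TF.

```python
import math

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """Score one document against a query using Eqn. (1.2).

    query: list of query terms; doc: list of terms in the document;
    docs:  the whole collection, used for IDF and average document length.
    k1, b: free parameters; 1.2 and 0.75 are assumed common defaults.
    """
    N = len(docs)
    avdl = sum(len(d) for d in docs) / N           # average document length
    score = 0.0
    for t in query:
        n_t = sum(1 for d in docs if t in d)       # docs containing t
        if n_t == 0:
            continue                               # skip unseen terms
        idf = math.log(N / n_t)                    # Eqn. (1.1)
        tf = doc.count(t)                          # raw term frequency
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avdl))
    return score

docs = [["learning", "to", "rank"], ["rank", "web", "pages"], ["web", "search"]]
print(bm25_score(["rank", "pages"], docs[1], docs))
```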
The language model for IR [96] is an application of statistical language modeling to IR. A statistical language model assigns a probability to a sequence of terms. When used in IR, a language model is associated with each document. With query q as input, documents are ranked based on the query likelihood, i.e., the probability that the document's language model would generate the terms in the query, P(q|d). By further assuming independence among terms, one has P(q|d) = \prod_{i=1}^{M} p(t_i|d), if query q contains terms t_1, ..., t_M.

To learn the document's language model, a maximum likelihood method is used. As in many maximum likelihood methods, the issue of smoothing the estimate is critical. Usually a background language model estimated using the entire collection is used for this purpose. Then the document's language model can be constructed as follows:

p(t_i|d) = (1 - \lambda) \frac{TF(t_i, d)}{LEN(d)} + \lambda \, p(t_i|C)    (1.3)

where p(t_i|C) is the background language model for term t_i, and \lambda \in [0, 1] is a smoothing factor.

There are many variants of the language model for IR, some of which even go beyond the query likelihood retrieval model (e.g., the models based on K-L divergence [140]). We will not introduce more about them here; readers are encouraged to read the tutorial on this topic.

In addition to the above examples, many other models have also been proposed to compute the relevance between a query and a document. Some of them [119] take the proximity of the query terms into consideration, and some others consider the relationship between documents in terms of content similarity [118], hyperlink structure [113], website structure [101], and topic diversity [139].

Query-independent Models

In the literature of IR, there are also many models that rank documents based on their own importance. We will take PageRank [92] as an example for illustration. This model is particularly applicable to Web search because it makes use of the hyperlink structure of the Web for ranking. PageRank uses the probability that a surfer randomly clicking on links will arrive at a particular webpage to rank the web pages. In the general case, the PageRank value for any page d_u can be expressed as:

PR(d_u) = \sum_{d_v \in B_u} \frac{PR(d_v)}{U(d_v)}    (1.4)

That is, the PageRank value for page d_u depends on the PageRank values of each page d_v in the set B_u (containing all pages linking to page d_u), divided by U(d_v), the number of outlinks from page d_v.

To get a meaningful solution of Eqn. (1.4), a smoothing term is introduced. When the random surfer walks on the link graph, she/he does not necessarily always follow the existing hyperlinks: there is a small probability that she/he will jump to any other page uniformly. This small probability can be represented by (1 - \alpha), where \alpha is called the damping factor. Accordingly, the PageRank model is refined as follows:

PR(d_u) = \alpha \sum_{d_v \in B_u} \frac{PR(d_v)}{U(d_v)} + \frac{1 - \alpha}{N}    (1.5)

where N is the total number of pages on the Web.

There are many works discussing the theoretical properties, variations, and efficient implementations of the PageRank model. Furthermore, there are also many other link analysis algorithms, such as Hyperlink-Induced Topic Search (HITS) [72] and TrustRank [57]. Some of these algorithms even leverage the content or topic information in the process of link analysis [91].

1.1.2 Query-level Position-based Evaluations in IR

Given the large number of ranking models introduced in the previous subsection, a standard evaluation mechanism is needed to select the most effective model. For this purpose, one usually proceeds as follows (see the code sketch at the end of this excerpt):

• Collect a large number of (randomly sampled) queries to form a test set.
• For each query q:
  – Collect the documents {d_j} associated with the query.
  – Get the relevance judgment for each document by human assessment.
  – Use a given ranking model to rank the documents.
  – Measure the difference between the ranking results and the relevance judgment using an evaluation measure.
• Use the average measure on all the queries in the test set to evaluate the performance of the ranking model.

As for collecting the documents associated with a query, a number of strategies can be used. For example, one can simply collect all the documents containing the query word. One can also choose to use some predefined rankers to get documents that are more likely to be relevant. A popular strategy is the pooling method used in TREC (http://trec.nist.gov). In this method, a pool of possibly relevant documents is created by taking a sample of documents selected by the various participating systems. In particular, the top 100 documents retrieved in each submitted run for a given query are selected and merged into the pool for human assessment. On average, an assessor judges the relevance of approximately 1500 documents per query.

As for the relevance judgment, three strategies were used in the literature: (1) specifying whether a document is relevant or not to the query (i.e., binary judgment 1 or 0), or further specifying the degree of relevance.
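To illustrate the query-level evaluation procedure described above, here is a small sketch (not from the book): Precision@k is used as the evaluation measure purely for simplicity, and the scores and judgments are made-up data.

```python
def precision_at_k(ranked_labels, k=3):
    # Fraction of the top-k ranked documents that are relevant (binary labels).
    top = ranked_labels[:k]
    return sum(top) / len(top)

def evaluate(scores_per_query, labels_per_query, k=3):
    """Average a query-level measure over all queries in the test set.

    scores_per_query: list (one entry per query) of per-document model scores.
    labels_per_query: matching binary relevance judgments (1/0).
    """
    measures = []
    for scores, labels in zip(scores_per_query, labels_per_query):
        # Rank documents by descending model score, then measure the ranking.
        ranked = [lab for _, lab in sorted(zip(scores, labels),
                                           key=lambda p: -p[0])]
        measures.append(precision_at_k(ranked, k))
    return sum(measures) / len(measures)

# Two hypothetical queries with four candidate documents each.
scores = [[0.9, 0.2, 0.7, 0.1], [0.3, 0.8, 0.5, 0.4]]
labels = [[1, 0, 1, 0], [0, 1, 0, 1]]
print(evaluate(scores, labels))   # mean Precision@3 over the two queries
```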
【核心代码】
标签:
小贴士
感谢您为本站写下的评论,您的评论对其它用户来说具有重要的参考价值,所以请认真填写。
- 类似“顶”、“沙发”之类没有营养的文字,对勤劳贡献的楼主来说是令人沮丧的反馈信息。
- 相信您也不想看到一排文字/表情墙,所以请不要反馈意义不大的重复字符,也请尽量不要纯表情的回复。
- 提问之前请再仔细看一遍楼主的说明,或许是您遗漏了。
- 请勿到处挖坑绊人、招贴广告。既占空间让人厌烦,又没人会搭理,于人于己都无利。
关于好例子网
本站旨在为广大IT学习爱好者提供一个非营利性互相学习交流分享平台。本站所有资源都可以被免费获取学习研究。本站资源来自网友分享,对搜索内容的合法性不具有预见性、识别性、控制性,仅供学习研究,请务必在下载后24小时内给予删除,不得用于其他任何用途,否则后果自负。基于互联网的特殊性,平台无法对用户传输的作品、信息、内容的权属或合法性、安全性、合规性、真实性、科学性、完整权、有效性等进行实质审查;无论平台是否已进行审查,用户均应自行承担因其传输的作品、信息、内容而可能或已经产生的侵权或权属纠纷等法律责任。本站所有资源不代表本站的观点或立场,基于网友分享,根据中国法律《信息网络传播权保护条例》第二十二与二十三条之规定,若资源存在侵权或相关问题请联系本站客服人员,点此联系我们。关于更多版权及免责申明参见 版权及免责申明
网友评论
我要评论