
Information Retrieval 6: tf-idf

Document format: PPT | 47 pages | Size: 407.50 KB | Published 2024-12-12 | Document ID: 253393261
Internet Information Search
School of Computer and Communication, Hunan University
Liu Yufeng

Internet Information Search 6: tf-idf and vector spaces

Review
  1. Chinese word segmentation
  2. Dictionary compression
  3. Postings list compression
  4. tf-idf

Scoring documents
  • How do we construct an index?
  • What strategies can we use with limited main memory?

Scoring
  • We wish to return, in order, the documents most likely to be useful to the searcher
  • How can we rank order the docs in the corpus with respect to a query?
  • Assign a score, say in [0,1], for each doc on each query
  • Begin with a perfect world: no spammers
      • Nobody stuffing keywords into a doc to make it match queries
      • More on "adversarial IR" under web search

Linear zone combinations
  • First generation of scoring methods: use a linear combination of Booleans
  • E.g., Score = 0.6*[…] + 0.3*[…] + 0.05*[…] + 0.05*[…]
    ([…] marks a Boolean zone expression missing from the extracted text)
  • Each expression such as […] takes on a value in {0,1}.
  • Then the overall score is in [0,1].
  • For this example the scores can only take on a finite set of values: what are they?

Exercise
  • On the query bill OR rights, suppose that we retrieve the following docs from the various zone indexes:
    [Figure: postings lists for bill and rights in the Author, Title, and Body zone indexes. Extracted docIDs, list boundaries lost: 1, 5, 2, 8, 3, 3, 5, 9, 2, 5, 1, 5, 8, 3, 9, 9]
  • Compute the score for each doc based on the weightings 0.6, 0.3, 0.1

General idea
  • We are given a weight vector whose components sum up to 1.
      • There is a weight for each zone/field.
  • Given a Boolean query, we assign a score to each doc by adding up the weighted contributions of the zones/fields.
  • Typically users want to see the K highest-scoring docs.

Index support for zone combinations
  • In the simplest version we have a separate inverted index for each zone
  • Variant: have a single index with a separate dictionary entry for each term and zone
  • E.g., bill.author, bill.title, bill.body
    [Figure: postings lists for these three entries. Extracted docIDs, list boundaries lost: 1, 2, 5, 8, 3, 2, 5, 1, 9]
  • Of course, compress zone names like author/title/body.

Zone combinations index
  • The above scheme is still wasteful: each term is potentially replicated for each zone
  • In a slightly better scheme, we encode the zone in the postings:
      bill:   1.author, 1.body; 2.author, 2.body; 3.title
  • As before, the zone names get compressed.
  • At query time, accumulate contributions to the total score of a document from the various postings, e.g.,
      bill:   1.author, 1.body; 2.author, 2.body; 3.title
      rights: 3.title, 3.body; 5.title, 5.body

Score accumulation
  • As we walk the postings for the query bill OR rights, we accumulate scores for each doc in a linear merge as before.
      Doc:    1    2    3    5
      Score:  0.7  0.7  0.4  0.4
  • Note: we get both bill and rights in the Title field of doc 3, but score it no higher.
  • Should we give more weight to more hits?

Term-document count matrices
  • Consider the number of occurrences of a term in a document:
      • Bag of words model
      • Document is a vector: a column below

Bag of words view of a doc
  • Thus the doc "John is quicker than Mary." is indistinguishable from the doc "Mary is quicker than John."
  • Which of the indexes discussed so far distinguish these two docs?

Counts vs. frequencies
  • WARNING: In a lot of IR literature, "frequency" is used to mean "count"
      • Thus term frequency in IR literature is used to mean number of occurrences in a doc
      • Not divided by document length (which would actually make it a frequency)
  • We will conform to this misnomer
      • In saying term frequency we mean the number of occurrences of a term in a document.

Term frequency tf
  • Long docs are favored because they're more likely to contain query terms
  • Can fix this to some extent by normalizing for document length
  • But is raw tf the right measure?

Document frequency
  • But document frequency (df) may be better:
  • df = number of docs in the corpus containing the term
      Word        cf      df
      ferrari     10422   17
      insurance   10440   3997
  • Document/collection frequency weighting is only possible in a known (static) collection.
  • So how do we make use of df?

tf x idf term weights
  • tf x idf measure combines:
      • term frequency (tf), or wf: some measure of term density in a doc
      • inverse document frequency (idf): measure of informativeness of a term, i.e. its rarity across the whole corpus
          • could just be the raw count of the number of documents the term occurs in (idf_i = 1/df_i)
          • but by far the most commonly used version is: [formula not preserved in the extracted text]
  • See Kishore Papineni, NAACL 2, 2002 for theoretical justification

Summary: tf x idf (or tf.idf)
  • Assign a tf.idf weight to each term i in each document d
  • Increases with the number of occurrences within a doc
  • Increases with the rarity of the term across the whole corpus

TF revisited

Real-valued term-document matrices
  • Function (scaling) of count of a word in a document:
      • Bag of words model
      • Each is a vector in ℝ^v
      • Here log-scaled tf.idf
  • Note: can be >1!

Documents as vectors
  • Each doc j can now be viewed as a vector of wf × idf values, one component for each term
  • So we have a vector space
      • terms are axes
      • docs live in this space
      • even with stemming, may have 20,000+ dimensions
  • (The corpus of documents gives us a matrix, which we could also view as a vector space in which words live: transposable data)

Why turn docs into vectors?
  • First application: Query-by-example
      • Given a doc d, find others "like" it.
  • Now that d is a vector, find vectors (docs) "near" it.

Intuition
  • Postulate: Documents that are "close together" in the vector space talk a…
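The zone-weighted score accumulation from the "Zone combinations index" and "Score accumulation" slides can be sketched in Python as follows. This is a minimal illustration, not the lecture's implementation: the names `ZONE_WEIGHTS`, `POSTINGS`, `zone_scores`, and `top_k` are hypothetical, the assignment of the weights 0.6/0.3/0.1 to author/title/body is an assumption (it is the one that reproduces the 0.7 and 0.4 scores on the slide), and a dictionary of accumulators stands in for the linear merge over sorted postings.

```python
from collections import defaultdict

# Assumed mapping of the exercise weights to zones (author, title, body).
ZONE_WEIGHTS = {"author": 0.6, "title": 0.3, "body": 0.1}

# Postings with the zone encoded in each entry, copied from the slide.
POSTINGS = {
    "bill": [(1, "author"), (1, "body"), (2, "author"), (2, "body"), (3, "title")],
    "rights": [(3, "title"), (3, "body"), (5, "title"), (5, "body")],
}


def zone_scores(query_terms, postings=POSTINGS, weights=ZONE_WEIGHTS):
    """Score docs for a Boolean OR query.

    A zone contributes its weight once if any query term occurs in it,
    so two hits in the same zone score no higher than one.
    """
    matched_zones = defaultdict(set)
    for term in query_terms:
        for doc_id, zone in postings.get(term, []):
            matched_zones[doc_id].add(zone)
    return {
        doc_id: sum(weights[zone] for zone in zones)
        for doc_id, zones in sorted(matched_zones.items())
    }


def top_k(scores, k):
    """Return the k highest-scoring (doc_id, score) pairs, ties by doc_id."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]


scores = zone_scores(["bill", "rights"])
best_two = top_k(scores, 2)
```

Under these assumptions, docs 1 and 2 each score about 0.7 (author plus body) and docs 3 and 5 each score about 0.4 (title plus body), up to floating-point rounding. Doc 3 matches both query terms in its title zone, yet the set-based accumulation counts that zone only once, which is the behavior the slide questions with "Should we give more weight to more hits?".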
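The bag-of-words view and the tf x idf weighting described in the slides can be illustrated with a short Python sketch. Several things here are assumptions rather than content from the deck: the slide's own idf formula did not survive extraction, so the code uses the widely used form idf = log10(N / df); the log-scaled term weight wf = 1 + log10(tf) is likewise one common choice; the third document in `DOCS` is invented for illustration; and the function names are hypothetical.

```python
import math
from collections import Counter

# Toy corpus. The first two docs are the slide's example;
# the third is a made-up doc added for illustration.
DOCS = [
    "john is quicker than mary",
    "mary is quicker than john",
    "insurance insurance ferrari",
]


def term_counts(doc):
    """Bag-of-words view: term -> number of occurrences (the IR 'tf')."""
    return Counter(doc.split())


def idf(term, docs):
    """Assumed idf form: log10(N / df), where df counts docs containing term.

    Raises ZeroDivisionError if the term occurs in no document.
    """
    df = sum(1 for doc in docs if term in doc.split())
    return math.log10(len(docs) / df)


def tfidf_vector(doc, docs):
    """Sparse doc vector of wf * idf, with wf = 1 + log10(tf) (assumed)."""
    return {
        term: (1 + math.log10(count)) * idf(term, docs)
        for term, count in term_counts(doc).items()
    }


same_bag = term_counts(DOCS[0]) == term_counts(DOCS[1])
vec = tfidf_vector(DOCS[2], DOCS)
```

In this sketch `same_bag` is true, since the two "quicker than" sentences contain the same words with the same counts. In the third doc, "insurance" and "ferrari" have the same idf because each appears in one document, so the repeated "insurance" ends up with the larger weight through the log-scaled tf factor alone.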
