
A Framework for Selective Query Expansion

Steve Cronen-Townsend, Yun Zhou, W. Bruce Croft
{crotown, yzhou, croft}
Center for Intelligent Information Retrieval, Department of Computer Science, University of Massachusetts, Amherst, MA 01003

ABSTRACT

Query expansion is a well-known technique that has been shown to improve average retrieval performance. This technique has not been used in many operational systems because it can greatly degrade the performance of some individual queries. We show how comparison between language models of the unexpanded and expanded retrieval results can be used to predict when the expanded retrieval has strayed from the original sense of the query. In these cases the unexpanded results are used, while the expanded results are used in the remaining cases (where such straying is not detected). We evaluate this method on a wide variety of TREC collections.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Query Formulation
General Terms: Experimentation
Keywords: language modeling, clarity, query expansion

1. INTRODUCTION

In this paper we develop a method for discriminating between queries and deciding when not to use the results of an expansion technique that is likely to hurt retrieval performance for a particular query. We explore our method in a language modeling framework, where ordinary retrieval is done by the query likelihood method [4] and expanded retrieval is done using relevance models [1]. The retrieval parameter settings are given in [3].

2. MODEL COMPARISON METHOD

We seek to predict queries that have highly negative changes in average precision on expansion, using a score that does not depend on relevance information. To do this, we compare a language model of the unexpanded retrieval's ranked list (model A) with a language model of the ranked list produced by the expanded query (model B).
With this comparison, our goal is to determine when the expanded retrieval has strayed from the sense of the original query. Our model comparison scores focus on important terms in the unexpanded query and are high when the documents in the expanded results use those terms much less frequently than do the documents in the unexpanded results. This often indicates a poor expansion outcome (a highly negative change in average precision). In this case the system would show the user the unexpanded retrieval results instead of the expanded retrieval results. We call this strategy selective query expansion. We now define each component of this method.

For the first component, we estimate a ranked list language model (a distribution over terms) as

    P(w \mid Q) = \sum_{D \in R} P(w \mid D) \, P(\text{rank of } D \mid Q),    (1)

where w is any term, D is a document, Q is the query, and R is the set of all documents or, in practice, the retrieved set. We approximate P(rank of D | Q), the probability that a document at a certain rank under Q is relevant, as query independent. For this study we used equal probabilities of relevance for the top 100 documents, and zero for all others.

Now that we have shown how to construct ranked list language models for the two ranked lists (model A and model B), the second component is the comparison. For this, we use the weighted relative entropy [5]

    D(A \,\|\, B; U) = \frac{1}{E(A; U)} \sum_{\text{events } i} u_i \, a_i \log_2 \frac{a_i}{b_i},    (2)

where A and B represent probability distributions and U represents a vector of weights over events. The normalization factor is E(A; U) = \sum_j a_j u_j, where a_i and b_i represent the probability of event i according to the A and B distributions, respectively.

* A full version of this paper is available as [3]. Copyright is held by the author/owner. CIKM'04, November 8-13, 2004, Washington, DC, USA. ACM 1-58113-874-1/04/0011.
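As a concrete illustration, equations (1) and (2) might be computed as in the following sketch. This is not the authors' code: the dict-based document models, the smoothing constant eps, and the function names are assumptions made for the example.

```python
import math
from collections import defaultdict

def ranked_list_language_model(ranked_docs, top_k=100):
    """Eq. (1): P(w|Q) = sum_D P(w|D) P(rank of D|Q), with equal
    rank probabilities for the top_k documents and zero elsewhere.

    ranked_docs: per-document term distributions P(w|D) in rank
    order, each a dict mapping term -> probability.
    """
    docs = ranked_docs[:top_k]
    rank_prob = 1.0 / len(docs)  # uniform P(rank of D|Q) over top_k
    model = defaultdict(float)
    for doc_dist in docs:
        for term, p in doc_dist.items():
            model[term] += p * rank_prob
    return dict(model)

def weighted_relative_entropy(a, b, u, eps=1e-12):
    """Eq. (2): D(A||B; U) = (1/E(A;U)) sum_i u_i a_i log2(a_i/b_i),
    with E(A;U) = sum_j a_j u_j.  a, b, u are dicts over terms;
    missing terms get probability/weight zero."""
    norm = sum(a.get(t, 0.0) * w for t, w in u.items())  # E(A;U)
    score = 0.0
    for t, w in u.items():
        a_i = a.get(t, 0.0)
        if a_i == 0.0 or w == 0.0:
            continue  # zero-probability or zero-weight events add nothing
        b_i = max(b.get(t, 0.0), eps)  # floor b_i to avoid log of zero
        score += w * a_i * math.log2(a_i / b_i)
    return score / norm
```

When model B shifts probability mass away from terms that model A emphasizes, the a_i/b_i ratios grow and the score rises, which is exactly the "straying" signal described above.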
The weighted relative entropy is the expectation value of the quantity \log_2 (A/B) using a weighted version of the A distribution instead of the plain A distribution, as in standard relative entropy (KL divergence). In our case, A and B are the language models for the two ranked lists, P_A(w|Q) and P_B(w|Q), and the events are occurrences of terms from the vocabulary of the collection.

Differences in the usage of all terms are not equally important. To reflect this, we pick the top T terms by contribution to the clarity score [2] of the unexpanded model,

    \mathrm{contrib}(w) = P_A(w \mid Q) \log_2 \left[ P_A(w \mid Q) / P(w) \right],    (3)

where P_A(w|Q) is the probability of a term w in the model and P(w) is the probability of the term in the entire collection. Since these are the terms in the unexpanded model that are most unusual relative to the overall collection statistics, this forms a suitable measure of importance for the model comparison.
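Putting the pieces together, the term weighting of equation (3) and the resulting selective-expansion decision might look like the sketch below. The top-T cutoff value, the decision threshold, and the helper names (clarity_contribution, select_term_weights, choose_results) are illustrative assumptions, not details from the paper.

```python
import math

def clarity_contribution(p_a_w, p_coll_w):
    """Eq. (3): contrib(w) = P_A(w|Q) * log2(P_A(w|Q) / P(w))."""
    return p_a_w * math.log2(p_a_w / p_coll_w)

def select_term_weights(model_a, collection_model, top_t=10):
    """Build the weight vector U: weight 1 for the top_t terms of
    model A by clarity contribution, and (implicitly) 0 for the rest."""
    scored = sorted(
        model_a,
        key=lambda w: clarity_contribution(model_a[w], collection_model[w]),
        reverse=True,
    )
    return {w: 1.0 for w in scored[:top_t]}

def choose_results(unexpanded, expanded, score, threshold):
    """Selective query expansion: fall back to the unexpanded ranked
    list when the comparison score signals that expansion strayed."""
    return unexpanded if score > threshold else expanded
```

Terms that are common in model A but rare in the collection (high clarity contribution) get weight 1, so the comparison score in equation (2) is driven by exactly the query-defining terms the section describes.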