
A relevance-extended multi-dimensional model for a data warehouse contextualized with documents

Juan Manuel Pérez, Rafael Berlanga, María José Aramburu
Universitat Jaume I
{martinej,berlanga,aramburu}@uji.es

Torben Bach Pedersen
Aalborg University
tbp@cs.aau.dk

ABSTRACT

Current data warehouse and OLAP technologies can be applied to analyze the structured data that companies store in their databases. The circumstances that describe the context associated with these data can be found in other internal and external sources of documents. In this paper we propose to combine the traditional corporate data warehouse with a document warehouse, resulting in a contextualized warehouse. Thus, contextualized warehouses keep a historical record of the facts and their contexts as described by the documents. In this framework, the user selects an analysis context which is represented as a novel type of OLAP cube, here called R-cube. R-cubes are characterized by two special dimensions, namely the relevance and the context dimensions. The first dimension measures the relevance of each fact in the selected analysis context, whereas the second one relates each fact with the documents that explain its circumstances. In this work we extend an existing multi-dimensional data model and algebra for representing the R-cubes.

Categories and Subject Descriptors

H.4.2 [Information Systems Applications]: Types of Systems - decision support

General Terms

Design

Keywords

OLAP, text-rich XML documents

1. INTRODUCTION

Current data warehouse [8] and OLAP [3] technologies can be efficiently applied to analyze the huge amounts of structured data that companies produce.
These organizations also produce many documents and use the Web as their largest source of external information. Examples of internal and external sources of information include the following: purchase-trends and market-research reports, demographic and credit reports, popular business journals, industry newsletters, technology reports, etc. Although these documents cannot be analyzed by current OLAP technologies, mainly because they are unstructured and contain a large amount of text, they include highly valuable information that should also be exploited by companies. The current trend is to find these documents available in XML-like formats [21].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DOLAP'05, November 4–5, 2005, Bremen, Germany. Copyright 2005 ACM 1-59593-162-7/05/0011 ... $5.00.

Our proposal is to build XML document warehouses that can be used by companies to store unstructured information coming from their internal and external sources. In [16] we outlined a multi-dimensional implementation of a document model [15] for the analysis of the information stored in a warehouse of text-rich XML documents. In this paper we present an architecture for the integration of a corporate warehouse of structured data with a warehouse of unstructured documents. We call the resulting warehouse a contextualized warehouse. Thus, a contextualized warehouse is a new kind of decision support system that allows users to obtain strategic information by combining all their sources of structured data and unstructured documents, and by analyzing the integrated data under different contexts.
For example, if we have a document warehouse with business news articles, we can analyze the evolution of the sales measures stored in our corporate warehouse in the context of a period of crisis as described by the relevant news. Thus, we could detect which products have been most affected. The same set of facts could be less revealing under a different context (e.g. regions under economic development). Furthermore, for those facts that are not available in the corporate warehouse, some kind of alternative approximate information about them could be extracted from past economic reports (e.g. aggregated measures of historical export-import rates of some countries), and then included in the contextualized warehouse. We note that some important characteristics distinguish typical OLAP facts from the factual information extracted from documents: the extracted facts may be incomplete (since not all the dimensions may be mentioned in the documents' contents) and/or imprecise (if the dimension values found belong to non-base levels).

The applications described above require both the availability of a document warehouse and its cooperation with the corporate data warehouse. The circumstances of the original facts are understood by analyzing their contexts, that is, the information available in the documents related to the facts. In this paper, a context is defined as a set of textual fragments that can provide analysts with strategic information important for decision-making tasks. Contexts are thus unstructured, and cannot be managed by the well-structured corporate warehouse. Since the document warehouse may contain documents about many different topics, we apply well-known Information Retrieval (IR) [1] techniques to select the context of analysis from the document warehouse. Thus, in order to build a contextualized OLAP cube, the analyst will specify the context under analysis by supplying a sequence of keywords.
Each fact in the resulting cube will have a numerical value representing its relevance with respect to the specified context, hence the name R-cube (Relevance cube). Moreover, each fact in the R-cube will be related to its context (i.e. the set of the relevant documents that describe the context of the fact). In this paper we extend an existing multi-dimensional data model to represent these two new dimensions (relevance and context), and we study how the traditional OLAP operations affect them.

The relevance and context dimensions provide further information about facts that can be very useful for analysis tasks. From the user's point of view, the relevance dimension can be used to explore the most relevant portions of an R-cube. For example, it can be used to identify the period of a political crisis, or the regions under economic development. The usefulness of the context dimension is twofold. First, it can be used in the selection operations to restrict the analysis to the facts described in a given subset of the documents (e.g. the most relevant documents). Second, the user will be able to gain insight into the circumstances of a fact by retrieving its related documents. The graphs, charts, and other binary files possibly linked in the documents would also be presented to the user, easing the understanding of the analysis context.

The main contributions of this paper are twofold: (1) an architecture for the integration of a traditional structured corporate warehouse with a document warehouse, resulting in a contextualized warehouse; and (2) the formal multi-dimensional data model and the unary algebra operations to manage R-cubes.

The rest of the paper is organized as follows. Section 2 discusses the related work. In Section 3 we present the architecture of a contextualized warehouse. Section 4 shows how the analysis cubes (R-cubes) are built. The multidimensional data model for R-cubes is presented in Section 5.
In Section 6 we propose an algebra for R-cubes. Finally, Section 7 addresses conclusions and future work.

2. RELATED WORK

In [7] the importance of external contextual information to understand the results of historical analysis was emphasized. Some works like [21] are focused on the construction of repositories of XML documents gathered from the Internet, but they do not provide on-line analysis tools.

In [13], OLAP operations are extended to involve dimensions and/or measures contained in external XML data. From a different point of view, [2] proposes to extend XQuery [20] with grouping constructs to evaluate OLAP-style aggregation queries on XML documents. However, the cited papers only deal with highly structured XML data (e.g. on-line XML product pricing lists), since the measures and dimensions should be selected by using XPath expressions [20]. These approaches are not suitable for analyzing text-rich XML documents where the measure and dimension values are found in the documents' textual contents.

A recent paper [12] provides mechanisms to perform special text aggregations on the contents of XML documents, e.g., getting the number of words of a document section, its most frequent keywords, a summary, etc. Although these text-mining operations are very useful to explore an XML document collection, these techniques cannot be applied to evaluate OLAP operations on the factual information described by the textual contents of the documents. Our approach differs from [12] in that we do not analyze the documents' textual contents themselves; instead, we extract the dimension values from the documents' contents and relate the documents with those corporate facts characterized by the same dimension values. Afterwards, we analyze the corporate data by using the documents as their context.

Information Retrieval (IR) [1] is playing an important role on the Web, since it has enabled the development of useful discovery tools (e.g. web search engines) and digital library services.
These applications deal with huge amounts of text-rich documents and have successfully applied IR techniques to query this type of repository. In an IR system the users describe their information needs by supplying a sequence of keywords. The result is a set of documents ranked by relevance. The relevance is a numerical value which measures how well the document fits the user's information needs. Traditional IR models (e.g. the vector space model [18]) calculate this relevance value by considering the local and global frequency (tf-idf) of the query keywords in the document and the collection, respectively. Intuitively, a document will be relevant to the query if the specified keywords appear frequently in its textual contents and they are not frequent in the collection. Newer proposals in the field of IR include language modeling [17] and relevance modeling [9] techniques. The works on language modeling consider each document as a language model. Thus, documents are ranked according to the probability of obtaining the query keywords when randomly sampling from the respective language model. An extension of the language modeling approach is relevance modeling [9], which estimates the probability of observing a query keyword in the set of documents relevant to a query. The language and relevance modeling approaches still internally apply the keyword frequency to estimate probabilities, and they have been shown to outperform baseline tf-idf models in many cases [17, 9]. As discussed below, our approach relies on relevance modeling techniques. Unlike traditional IR models, language and relevance models provide a formal background based on probability theory which is suitable to be included in the formalization of OLAP operations.

In [15] we presented a document model for text-rich XML documents where information extraction techniques [10, 5, 6] are applied to identify the facts described by the textual contents of the documents.
In particular, [10] and [5] have been shown to be effective for extracting time and location references. Given an IR condition, in [15] we showed how relevance modeling techniques [9] can be applied to estimate the relevance of a fact by the probability of observing this fact in a document which contains the keywords stated in the IR condition. Following this research line, in a short paper [16] we outlined a multi-dimensional implementation of the relevance model discussed in [15]. The focus of the present paper is contextualized warehouses. Here we relate the facts of a traditional corporate warehouse with the documents that describe their circumstances. IR conditions are used for establishing an analysis context, and the relevance model of [15] is applied to calculate the relevance of the facts in the resulting analysis cube (R-cube). The relevance value of a fact measures how well the fact fits into the selected context. At any moment, it is possible to retrieve the documents related to a fact, and to gain insight into the explanation of its circumstances. The architecture of a contextualized warehouse, and the extended multi-dimensional data model and algebra for R-cubes proposed in the current paper are novel.

In [11] a probabilistic multi-dimensional model was presented, but it does not clearly discuss how the probabilities of the facts are re-calculated after an aggregation. Moreover, the selection operation does not modify the probabilities of the facts in the resulting cube. We consider that the conditions established in a selection operation are also a restriction on the context under analysis and therefore the relevance of facts must be updated accordingly.

3. CONTEXTUALIZED WAREHOUSES

Figure 1 shows the proposed architecture for the contextualized warehouse. Its main components are a traditional structured corporate warehouse [8], a document warehouse able to evaluate IR conditions [15], and a fact extractor module.
Building a contextualized warehouse mainly means relating each fact of the corporate warehouse with its context. For this purpose, the fact extractor tool uses the dimensions defined in the corporate warehouse to detect and build the facts described in the documents. Next, we describe this process in detail by means of an example.

[Figure 1: Contextualized warehouse architecture. Labels shown in the figure: Analysts; Q, XPath; Document Warehouse; Corporate Warehouse; Fact Extractor; Dimensions; Contexts & Facts; Corp. Facts; Contextualized Facts; OLAP Cube; R-cube.]

Let us consider the corporate warehouse of an international provider of vegetable oil by-products. The main products of this company include fo1 and fo2 (used as preservatives in the food sector), and he1 and he2 (used in the elaboration of healthcare products). The company keeps in its corporate warehouse a historical record of its sales, the quantity sold (Quantity measure) and its cost (Amount measure), per product and customer. Thus, the dimensions of the corporate warehouse are Time, Products and Customers. The Products are classified into Sectors (food and healthcare). Finally, Customers are organized into Countries and Regions (e.g. Southeast Asia, Central America, etc.).

Our example company also maintains a document warehouse of business newspapers gathered from the Internet in XML format. Figure 2 shows a fragment of an example document of this warehouse that depicts a context for the sales of food sector products to customers of the Southeast Asian region, made during the second half of 1998. Notice that contexts are very useful for analysis tasks, since they can give us detailed information about the facts of the corporate warehouse. For example, the document in Figure 2 would help to understand a sales drop.
    <article date="Dec.1,1998">
      <paragraph>
        The financial crisis in Southeast Asian countries, has mainly
        affected companies in the food market sector. Particularly,
        Chicken SPC Inc. has reduced total exports to $1.3 million during
        this half of the year from $10.1 million in 1997.
      </paragraph>
      ...
    </article>

Figure 2: Example fragment of a business journal

By applying information extraction techniques [10, 5, 6], and considering the predefined analysis dimensions of the example corporate warehouse, the dimension values Southeast Asia, food, and 1998/2nd half can be identified in the document fragment. The fact extractor tool builds all the valid facts with them, in this case, (Products.Sector = food, Customers.Region = Southeast Asia, Time.Half year = 1998/2nd half). As can be noticed, some of these dimension values are not precise enough and belong to non-base dimension categories; for example, the Southeast Asia dimension value belongs to the category Region of the Customers dimension. We may also find documents where some dimensions are not mentioned, resulting in incomplete facts. For each fact, the fact extraction tool also keeps the number of times that its dimension values occur in the document fragment (i.e. the frequency of the fact's dimension values). This frequency determines the importance of the fact in the document, and will later be used to estimate the relevance of a fact in a given context.

Let us now consider the second sentence of the example document of Figure 2. It depicts two facts: (Company = Chicken SPC, Time.Year = 1997, Export = $10,100,000) and (Company = Chicken SPC, Time.Half year = 1998/2nd half, Export = $1,300,000). Chicken SPC Inc. could be a potential customer or competitor of our example oil provider company. In this way, the document warehouse also provides highly valuable strategic information about some facts that are not available in the corporate warehouse nor in external databases.
We note that sometimes it is relatively easy to obtain these facts, for example, when they are presented as tables in the documents. However, most of the time documents contain already aggregated measure values (total exports in the facts of the previous example). The main problem here is to automatically infer the implicit aggregation function that was applied (e.g. average, sum, etc.). Alternatively, the system could ask the user to guess the aggregation function by showing him/her the document contents. In this work we mainly focus on the fact dimensions, leaving for future work the management of measures extracted from texts. Notice that measure values extracted from documents are not essential to construct a contextualized warehouse, since the dimension values found in a document are sufficient to relate it with the corporate facts that are characterized by these dimension values.

Table 1: Example R-cube. Each row represents a fact. The R and the Ctxt columns (dimensions) depict the relevance value and the context of the facts, respectively. Each d_i^r denotes a document fragment of the collection whose relevance with respect to Q is r.

    F    Products.ProductId  Customers.Country  Time.Month  Amount      R     Ctxt
    f1   fo1                 Cuba               1998/03     4,300,000$  0.05  d_3^0.005, d_7^0.005
    f2   fo2                 Japan              1998/02     3,200,000$  0.1   d_5^0.02
    f3   fo2                 Korea              1998/05       900,000$  0.2   d_4^0.04
    f4   fo1                 Japan              1998/10       300,000$  0.4   d_1^0.04, d_2^0.08
    f5   fo2                 Korea              1998/11       400,000$  0.25  d_2^0.08, d_6^0.01

4. BUILDING R-CUBES

In this section we study how the analysis cubes are materialized from the contextualized warehouse. We call them R-cubes since they include two special system-maintained dimensions, namely the relevance and the context dimensions.
From these cubes, users can study the contextualized facts.

In order to create an R-cube the analyst must supply a query of the form (Q, XPath, MDX), which states the following restrictions: Q is an Information Retrieval (IR) condition, consisting of a sequence of keywords that specifies the context under analysis; XPath is a path expression [20] that establishes the document sections under study; finally, MDX are conditions over the dimensions and measures of the analysis [19]. Here, our purpose is not to define a new query language, but to identify the type of conditions needed to build an analysis cube in a contextualized warehouse.

The query process takes place as follows: (1) First, the IR condition Q and the path expression XPath are evaluated in the document warehouse. The result is a set of document fragments satisfying XPath and Q, along with their relevance with respect to Q. (2) Second, the fact extractor component parses the document fragments obtained in step (1) and returns the set of facts described by each document fragment, along with their frequency. Notice that we do not parse entire documents, but only those document fragments selected by the XPath expression. (3) Next, or in parallel to steps (1) and (2), the MDX conditions are evaluated on the corporate warehouse. (4) Then, we assign each document to those facts of the corporate database whose dimension values can be "rolled-up" or "drilled-down" to some (possibly imprecise or incomplete) fact described by this document. (5) Finally, we calculate the relevance of each fact, resulting in an R-cube.

Continuing with the running example, let us consider the analysis of the sales of food products under the context of a financial crisis reported in the business articles of the document warehouse. Thus, given Q = "financial, crisis", XPath = "/db/business/article/paragraph" and MDX = (Products.[food], Customers.Country, Time.[1998].Month, SUM(Measures.Amount) > 0) as query conditions, the contextualized warehouse will return the R-cube presented in Table 1. That is, the set of facts of the corporate warehouse that satisfy the stated MDX conditions, along with their relevance values with respect to the IR condition (relevance dimension, depicted as R), and the set of text fragments where each fact occurs (context dimension, represented by Ctxt).

As Table 1 shows, the relevance (R dimension) is a numeric value that measures the importance of each fact in the context established by the initial query conditions. The most relevant facts of our example R-cube involve the sales made to Japanese and Korean customers during the months of October and November 1998. Notice that we could obtain the details described in the relevant documents by performing a drill-through operation on the context dimension [19]. By studying these documents we can discover that the Southeast Asian financial crisis reported by the document of Figure 2 is a valid explanation of the sales drop. Each document d_i of the context dimension also has an associated relevance value (represented by the r superscript in d_i^r) which measures how well this document describes the selected analysis context.

A detailed discussion on the calculation of the relevance of the facts can be found in [15]; a brief explanation of the formulas involved is included in Appendix A. Intuitively, a fact will be relevant for the selected context if the fact is found in a document which is also relevant for this context (e.g. if the keywords of the IR condition Q occur frequently in the document). In [15] we applied relevance modeling techniques [9] to estimate the relevance of a fact by means of the probability of observing this fact in the set of documents relevant to the IR condition.
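
To make the intuition concrete, the following is a minimal sketch of how a fact's relevance could be derived from document relevance values and from the fact frequencies kept by the fact extractor. It is not the formulation of [15] or Appendix A, which is not reproduced in this text: the weighting P(f|d) = freq(f, d) / total frequency in d, the final normalization, the function name `fact_relevance`, and the frequency values are all assumptions made for illustration. Only the document relevance values are borrowed from the running example.

```python
from collections import defaultdict


def fact_relevance(doc_relevance, fact_freq):
    """Estimate a normalized relevance value per fact (illustrative only).

    doc_relevance: {doc_id: relevance of the document with respect to Q}
    fact_freq:     {doc_id: {fact_id: frequency of the fact in the document}}
    """
    scores = defaultdict(float)
    for doc, rel in doc_relevance.items():
        freqs = fact_freq.get(doc, {})
        total = sum(freqs.values())
        if total == 0:
            continue  # the document describes no fact
        for fact, n in freqs.items():
            # Assumed weighting: share of the fact in the document,
            # scaled by the relevance of the document.
            scores[fact] += rel * n / total
    norm = sum(scores.values())
    if norm == 0:
        return {}
    # Normalize so that the relevance values behave like probabilities.
    return {fact: s / norm for fact, s in scores.items()}


# Document relevance values as in the running example; frequencies are hypothetical.
doc_relevance = {"d1": 0.04, "d2": 0.08, "d6": 0.01}
fact_freq = {"d1": {"f4": 2}, "d2": {"f4": 1, "f5": 1}, "d6": {"f5": 3}}
r_cube_relevance = fact_relevance(doc_relevance, fact_freq)
```

Under these assumptions, a fact mentioned often in highly relevant documents receives a larger share of the total relevance than a fact that appears only in marginal documents.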
The probability of finding a fact in a document is determined by the frequency of the dimension values of this fact in the textual contents of the document.

An interesting property of this approach is that the sum of the relevance values of all the facts in an R-cube is equal to one. However, notice that not all document collections are suitable for every analysis (e.g. an analysis on financial crisis with a document collection about product manufacturing processes), and even in such cases the sum of the facts' relevance values will still be equal to one. Later in the paper, we will denote by Quality the factor that normalizes the relevance values of the facts. This factor indicates the quality of the R-cube for the selected context, since it measures the overall relevance of the documents that satisfy the IR condition Q.

Unlike OLAP-XML federations like those proposed in [13], R-cubes are materialized once, when the query is submitted to the contextualized warehouse, and will be incrementally updated when new relevant documents and data satisfying the original query are added to their respective warehouses. The main advantage of this approach is that pre-aggregations can be performed over R-cubes, thus allowing fast analysis operations over them.

The rest of the paper will focus on the formal definition of R-cube, and the provision of a suitable algebra for OLAP operations.

5. A MULTIDIMENSIONAL DATA MODEL FOR R-CUBES

In this section we define a data model for the R-cubes. We extend an existing multidimensional model [14] with two new special dimensions to represent both the relevance of the facts and their context. For each component of the extended data model, we show its definition and give some examples.

5.1 Dimensions

A dimension D is a two-tuple D = (C_D, ⊑_D), where C_D = {C_j} is a set of categories C_j.

Example 1.
In [14] everything that characterizes a fact is considered to be a dimension, even those attributes modeled as measures in other approaches. Figure 3 shows the dimensions for the running example.

Each category C_j = {e} is a set of dimension values. ⊑_D is a partial order on ∪_j C_j (the union of all dimension values in the individual categories). Given two values e_1, e_2 ∈ ∪_j C_j, then e_1 ⊑_D e_2 if e_1 is logically contained in e_2. The intuition is that each category represents the values of a particular granularity. We will write e ∈ D, e is a dimensional value of D, if e ∈ ∪_j C_j.

There are two special categories present in all dimensions: ⊤_D, ⊥_D ∈ C_D (the top and bottom categories). The category ⊥_D has the values with the finest granularity. These values do not logically contain other category values and are logically contained by the values of other coarser categories. The category ⊤_D = {⊤} represents the coarsest granularity. For all e ∈ D, e ⊑_D ⊤.

The partial order ⊑_D can be generalized to work on categories as follows: given C_1, C_2 ∈ C_D, then C_1 ⊑_D C_2 if ∃ e_1 ∈ C_1, e_2 ∈ C_2, e_1 ⊑_D e_2. We will write ⊑ instead of ⊑_D when it is clear that ⊑ is the partial order of the dimension D.

Example 2. The Customers dimension has the categories ⊥_Customers = Country ⊑ Region ⊑ ⊤_Customers, with the dimension values Country = {Japan, Korea, Cuba, ...} and Region = {Southeast Asia, Central America, ...}. The partial order on category values is: Japan ⊑ Southeast Asia, Korea ⊑ Southeast Asia, Cuba ⊑ Central America, etc.
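
As an illustration of the definitions above, the sketch below encodes the Customers dimension of Example 2 as a set of categories plus a containment relation on values. The class name `Dimension`, its methods, and the `TOP` marker are hypothetical names chosen for this sketch and are not part of the model in [14]. The partial order is stored as direct containment links and extended by reflexive and transitive closure.

```python
from dataclasses import dataclass, field

TOP = "ALL"  # stands for the single value of the top category


@dataclass
class Dimension:
    name: str
    categories: dict  # category name -> set of dimension values
    parents: dict = field(default_factory=dict)  # value -> set of values directly containing it

    def contained_in(self, e1, e2):
        """Return True if e1 is logically contained in e2 (e1 ⊑ e2)."""
        if e2 == TOP:
            return True  # every value of the dimension is contained in the top value
        stack, seen = [e1], set()
        while stack:
            e = stack.pop()
            if e == e2:
                return True
            if e in seen:
                continue
            seen.add(e)
            stack.extend(self.parents.get(e, ()))
        return False

    def category_leq(self, c1, c2):
        """C1 ⊑ C2 if some value of C1 is contained in some value of C2."""
        return any(
            self.contained_in(a, b)
            for a in self.categories[c1]
            for b in self.categories[c2]
        )


customers = Dimension(
    name="Customers",
    categories={
        "Country": {"Japan", "Korea", "Cuba"},
        "Region": {"Southeast Asia", "Central America"},
        "Top": {TOP},
    },
    parents={
        "Japan": {"Southeast Asia"},
        "Korea": {"Southeast Asia"},
        "Cuba": {"Central America"},
    },
)
```

With this encoding, `customers.contained_in("Japan", "Southeast Asia")` holds while `customers.contained_in("Cuba", "Southeast Asia")` does not, and `customers.category_leq("Country", "Region")` reflects the category ordering Country ⊑ Region.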
[Figure 3: Dimensions of the example case of study. R and Ctxt are the relevance and context dimensions of the R-cube. Labels shown in the figure: fact type Sales (F); dimensions Customers (Country, Region), Time (Month, Half-year, Year), Products (ProductId, Sector), Amount, R (Rel. Degree) and Ctxt (Col).]

The Relevance Dimension. The relevance dimension depicts the importance of each fact of the R-cube in the selected context (i.e. the IR condition Q). Therefore, it can be used to identify the portions of an R-cube that are more interesting for the context of analysis.

Different approaches can be followed to state the relevance dimension R. The simplest one is to define it just with the bottom and top categories: ⊥_R = Relevance ⊑_R ⊤_R. Since we model the relevance as a probability value, the values of the Relevance category are real numbers in the interval [0,1]. Like in [11], we propose to introduce an intermediate category to allow users to study relevance values from a higher qualitative abstraction level. In this new category the relevance values will be classified into groups (Relevance Degrees) like irrelevant, relevant or very relevant.

However, the relevance values are normalized to have their sum equal to one. Thus, a relevance index of 0.02 may be irrelevant if the rest of the relevance values are significantly greater, or relevant if the maximum value of relevance obtained was, for example, 0.03.

Thus, we need to define a dynamic partial order ⊑_R^γ to map the values r of the base Relevance category to values of the Relevance Degree category depending on the value of r/γ. In this way, we will use γ as a normalization factor. Note that γ should measure the global relevance of a particular result.
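
The mapping from a relevance value r to a Relevance Degree through the ratio r/γ can be sketched as follows. The intervals are the ones the paper uses in its Example 3, the relevance values are those of Table 1, and γ is taken as the maximum relevance in the cube. The names `RELEVANCE_DEGREES` and `relevance_degree` are hypothetical and chosen only for this illustration.

```python
# Relevance degrees as half-open intervals [low, high) over r / gamma;
# the last interval is closed at 1.0. Intervals follow Example 3 of the paper.
RELEVANCE_DEGREES = [
    ("very irrelevant", 0.00, 0.25),
    ("irrelevant", 0.25, 0.45),
    ("neutral", 0.45, 0.55),
    ("relevant", 0.55, 0.75),
    ("very relevant", 0.75, 1.00),
]


def relevance_degree(r, gamma):
    """Map a relevance value r to its degree, using gamma as normalization factor."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    ratio = r / gamma
    for name, low, high in RELEVANCE_DEGREES:
        is_last = high == 1.00
        if low <= ratio < high or (is_last and ratio == high):
            return name
    raise ValueError(f"r / gamma = {ratio} falls outside [0, 1]")


# Relevance values of the five facts in Table 1.
relevances = [0.05, 0.1, 0.2, 0.4, 0.25]
gamma = max(relevances)  # one possible choice of gamma: the maximum relevance
degrees = {r: relevance_degree(r, gamma) for r in relevances}
```

Because the degree depends on r/γ rather than on r alone, the same relevance value can fall into a different degree in another R-cube whose γ differs.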
Typical measures are γ = MAX(r), γ = AVG(r) or γ = Quality.

Definition 1. The relevance dimension is a two-tuple R = (C_R, ⊑_R^γ) where: C_R = {Relevance, Relevance Degree, ⊤_R} is the set of categories; Relevance = [0,1] is the base category ⊥_R; Relevance Degree ∈ ℘([a,b]) is a partition on the interval of real numbers [a,b]; and ⊑_R^γ is the partial order r ⊑_R^γ rd, if r ∈ Relevance, rd ∈ Relevance Degree and r/γ ∈ rd.

Example 3. Let us consider γ = MAX(r) (the maximum value of relevance obtained in the R-cube), and five different degrees of relevance, Relevance Degree = {very irrelevant = [0, 0.25), irrelevant = [0.25, 0.45), neutral = [0.45, 0.55), relevant = [0.55, 0.75), very relevant = [0.75, 1]}, a partition of [0,1]. As Table 1 shows, MAX(r) = 0.4, then 0.05 ⊑_R^0.4 very irrelevant, 0.1 ⊑_R^0.4 irrelevant, 0.2 ⊑_R^0.4 neutral, 0.4 ⊑_R^0.4 very relevant and 0.25 ⊑_R^0.4 relevant.

The Context Dimension. The context of the facts is detailed by the documents of the warehouse. We represent these documents in the context dimension, along with their relevance to the IR condition Q.

Definition 2. The context dimension is a two-tuple Ctxt = (C_Ctxt, ⊑_Ctxt), where C_Ctxt = {Col, ⊤_Ctxt} is the set of categories. The category ⊥_Ctxt = Col = {d^r} is the set of the documents d of the warehouse; the superscript r denotes the relevance of the document to the context of analysis (the IR condition Q).

Example 4. In our example, d_1, ..., d_7 are documents of the warehouse which describe the context of the facts presented in the R-cube. The relevance value with respect to the established IR condition is 0.04 for d_1, 0.08 for d_2, 0.005 for d_3, 0.04 for d_4, 0.02 for d_5, 0.01 for d_6 and 0.005 for d_7. Then, {d_1^0.04, d_2^0.08, d_3^0.005, d_4^0.04, d_5^0.02, d_6^0.01, d_7^0.005} ⊂ Ctxt.

The context dimension as defined in Definition 2 is plain, i.e., it has no hierarchies. It is possible to define a hierarchy for the context dimension by considering the hierarchical structure of the XML documents, and by classifying