A graphical user interface for structured document retrieval

Jesús Vegas (1), Pablo de la Fuente (1), and Fabio Crestani (2)

(1) Dpto. Informática, Universidad de Valladolid, Valladolid, Spain
jvegas@infor.uva.es, fuente@infor.uva.es
(2) Dept. Computer and Information Sciences, University of Strathclyde, Glasgow, Scotland, UK
fabioc@cs.strath.ac.uk

Abstract

Structured document retrieval requires different graphical user interfaces from standard Information Retrieval. An Information Retrieval system dealing with structured documents has to enable a user to query, browse retrieved documents, and provide query refinement and relevance feedback based not only on full documents, but also on specific structural parts of documents. In this paper, we present a new graphical user interface for structured document retrieval that provides the user with an intuitive and powerful set of tools for structured document searching, retrieved list navigation, and search refinement. We also present the results of a preliminary evaluation of the interface, which highlights strengths and weaknesses of the current implementation and suggests directions for future work.

1 Introduction

Standard Information Retrieval (IR) treats documents as if they were atomic entities, indexing and retrieving them as single objects. However, modern IR needs to be able to deal with more elaborate document representations, for example documents written in SGML, HTML, or XML. These document representation formalisms make it possible to represent and describe documents said to be structured, that is, documents that are organised around a well-defined structure. New standards in structured document representation compel IR to design and implement models and tools to index and retrieve documents according to the given document structure.
This area of research is known as structured document retrieval. This means that documents should no longer be considered as atomic entities, but as aggregates of interrelated objects that need to be indexed, retrieved, and presented both as a whole and separately, in relation to the user’s needs.

IR systems are powerful and effective tools for accessing documents by content. A user specifies the required content using a query, often consisting of a natural language expression. Documents estimated to be relevant to the user need expressed in the query are presented to the user through an interface. A good interface enables the IR process to be interactive, stimulating the user to review a number of retrieved documents and to reformulate the initial query, either manually or, more often, automatically, using relevance feedback. Given the complexity of the IR task and the vagueness and imprecision of the expression of the user information need and document information content, interactivity has been widely recognised as a very effective way of improving information access performance [10].

In structured document retrieval, the need for a good interface between user and system becomes even more pressing. Structured documents are often long and complex, and it has been observed that a certain form of disorientation occurs in situations where the user does not understand why certain documents appear in the retrieved list [5]. The length and structural complexity of such documents make it difficult for the user to capture the relationship between the information need and the semantic content of the document. This user disorientation makes the task of query reformulation much harder, since the user first has to understand the response of the system and then decide whether any of the retrieved documents is a good enough representation of the information need to provide the system with precise relevance feedback.
These added difficulties obviously increase the cognitive load of the user and decrease the quality of the interaction with the system [3].

A possible solution to this problem in structured document retrieval is to provide the IR system interface with explanatory and selective feedback capabilities. In other words, the system should be able to explain to the user, at any moment, why a particular document has been estimated as relevant and where the clues of this estimated relevance lie. In addition, the user should then be able to select for relevance feedback only those parts of the document that are relevant to the information need and not the entire document, as is done in standard IR. When the user can focus on the parts of the document that make it appear relevant, without losing sight of the relationships between these parts and the whole document, the user’s cognitive load is reduced and the interaction is enhanced in quality and effectiveness.

The starting point of the work presented here is the consideration that the design and implementation of improved visualisation interfaces for structured document retrieval contribute effectively to the solution of this problem. In this paper we present the current stage of implementation and evaluation of a graphical user interface for a structured document retrieval system. This work is part of a wider project aimed at the design, implementation, and evaluation of a complete system for structured document retrieval. The system combines a retrieval engine, based on an aggregation-based approach for the estimation of the relevance of the document, with an interface with explanatory and selective feedback capabilities, specifically designed for the structured document retrieval task. In this paper we present only the work concerned with the interface.

The paper is structured as follows.
In section 2 we highlight the importance and the difficulty of the design of graphical user interfaces for information access. In section 3 we explain the characteristics of the specific information access task that we are targeting: structured document retrieval. In section 4 we present a new graphical user interface for hierarchically structured document retrieval. The retrieval model currently used in conjunction with the interface is presented in section 5, even though this part of our work is only at a very early stage. A first evaluation of the graphical user interface is presented in section 6. Finally, section 7 summarises the conclusions of this work and provides an outline of future work.

2 Graphical User Interfaces for Information Access

A great deal of work has been devoted in IR to the design of user interfaces. It is well recognised that information seeking is a vague and imprecise process. In fact, when users approach an information access system they often have only an imprecise idea of what they are looking for and a vague understanding of how they can find it. The interface should aid the users in the understanding and expression of the information need. This implies not simply helping the users to formulate queries, but also helping them to select among available information resources, understand search results, reformulate queries, and keep track of the progress of the whole search process.

However, human-computer interfaces, and in particular graphical user interfaces (GUIs), are less well understood than other aspects of IR, because of the difficulties in understanding, measuring, and characterising the motivations and behaviours of IR users. Nevertheless, established wisdom combined with more recent research results (see [2][pp. 257-322] for an overview of both) has highlighted some very important design principles for GUIs for information access systems.

GUIs for information access should:

 – provide informative feedback;
 – reduce working memory load;
 – provide functionalities for both novice and expert users.

These design principles are particularly important for GUIs for IR systems, since the complexity of the IR task requires very complex and interactive interfaces.

An important aspect of GUIs for information access systems is visualisation. Visualisation takes advantage of the fact that humans are highly attuned to images and visual information. Pictures and graphics can be captivating and appealing, especially if well designed. In addition, visual representation can communicate some information more rapidly and effectively than any other method. The growing availability of fast graphic processors, high resolution colour screens, and large monitors is increasing interest in visual interfaces for information access. However, while information visualisation is rapidly advancing in areas such as scientific visualisation, it is less so for document visualisation. In fact, visualisation of inherently abstract information is difficult, and visualisation of textual information is especially challenging. Language and its written representation, text, is the main human means of communicating abstract ideas, for which there is no obvious visual manifestation [18].

Despite these difficulties, researchers are attempting to represent some aspects of the information access process using visualisation techniques. The main visualisation techniques used for this purpose are icons, colour highlighting, brushing and linking, panning and zooming, and focus-plus-context [2][pp. 257-322]. These techniques support a dynamic, interactive use that is especially important for document visualisation.

In this paper we will not address these techniques in any detail.
We will just use those that are most suitable to our specific objective: structured document retrieval. The distinctive characteristics of this task are described in the next section.

3 Structured Document Retrieval

Many document collections contain documents that have a complex structure, although most IR systems do not use it. The inclusion of the structure of a document in the indexing and retrieval process affects the design and implementation of the IR system in many ways. First of all, the indexing process must consider the structure in the appropriate way, so that the user can search the collection both by content and by structure. Secondly, the retrieval process should use both structure and content in the estimate of the relevance of documents. Finally, the interface and the whole interaction have to enable the user to make full use of the document structure. In the next section we report a brief and abstract taxonomy of IR systems with regard to the use of content and structure for indexing and retrieval. This will help us to identify what kind of operations an interface for structured document retrieval should provide the user with. For a more detailed discussion on the types of operations necessary for structured document retrieval see [19].

3.1 Use of Content and Structure in IR Systems

From a general point of view, we have nine types of IR systems, depending on the use of content and/or structure in the indexing and querying processes. We can identify them with two sets of two letters: two letters for the indexing and two for the querying process. We use C and S to indicate the use of content or structure, respectively. The nine types of IR systems are shown in Table 1.
In this table we indicate what kind of retrieval function the system would need (the function f) and what kind of input the user can provide to the querying (the arguments of the function f, c or s), where the first argument refers to indexing and the second to querying.

              querying
  indexing    C                  CS                   S
  C           q = f_c(c)         q = f_c(c,s)         -
  CS          q = f_{c+s}(c)     q = f_{c+s}(c,s)     q = f_{c+s}(s)
  S           -                  q = f_s(c,s)         q = f_s(s)

Table 1. Taxonomy of IR systems with respect to the use of content/structure

The different types of IR systems are the following:

(C,C) In this type of system, indexing is carried out only with the content of the documents, so either the structure is not taken into account, or the documents are unstructured. Since the structure has not been considered in the indexing, querying can only be related to the content of the documents. This is the most common type of IR system, and all the classic IR models (vector space model, probabilistic, etc.) have been designed to work with this kind of system.

(CS,CS) This type of system indexes document content in relation to document structure, so that the index contains indications of how the content is arranged in the document. The user can then query the system for content with the ability to specify the structural elements in which the content should be found. This enables a higher degree of precision in the search. Some models have been proposed for this kind of IR system, but few implementations exist.

(S,S) This type of system indexes documents only in relation to their structure, not content, and querying can only be related to structure. So, this type of system is useful only when the queries are exclusively about structure. Models for these systems are very simple, and given the very limited and specific use of these systems, only a few implementations exist.

(CS,C) This type of system indexes document content in relation to document structure, but users are not able to specify structural information in the querying.
In other words, document structure is used in an implicit way; the user is not aware of document structure, but it is used in the relevance evaluation to achieve better retrieval performance. In this category of IR systems we can include a number of advanced systems aimed at collections where the structure is specified in the document markup (e.g. SGML, HTML, XML, etc.). In addition, in this category we find most Internet crawlers, which use information about hyperlinks, titles, etc. of the HTML pages to better rank the retrieved set [3].

(CS,S) This type of system indexes document content in relation to document structure, but querying can only use the structure of documents. This kind of system may not seem very useful, but it may be a valuable component of (CS,CS) systems, where there is a need to use document similarity from a structural point of view.

(C,CS) This type of IR system indexes documents by content, but allows queries to specify structure too. Clearly, the system cannot answer the query completely, since the structural information is not contained in the index. The user
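
The difference between querying by content only and querying by content and structure over a (CS) index, as described in this section, can be illustrated with a minimal Python sketch. The toy collection, the element names `title` and `body`, and the functions `build_cs_index` and `query` are assumptions made purely for this illustration; they are not part of the system described in this paper, and the sketch uses plain Boolean term matching rather than any estimation of relevance.

```python
from collections import defaultdict

# Hypothetical toy collection: each document maps structural elements to text.
DOCS = {
    "d1": {"title": "structured retrieval", "body": "interfaces for browsing results"},
    "d2": {"title": "user interfaces", "body": "structured documents and retrieval"},
}


def build_cs_index(docs):
    """(CS) indexing: for each term, record the (document, element) pairs it occurs in."""
    index = defaultdict(set)
    for doc_id, elements in docs.items():
        for element, text in elements.items():
            for term in text.split():
                index[term].add((doc_id, element))
    return index


def query(index, term, element=None):
    """Content-only (C) query when element is None; content+structure (CS) query otherwise."""
    hits = index.get(term, set())
    # Keep a document if the term occurs anywhere (C) or in the requested element (CS).
    return sorted({doc for doc, el in hits if element is None or el == element})


index = build_cs_index(DOCS)
print(query(index, "structured"))                   # content-only query
print(query(index, "structured", element="title"))  # content restricted to titles
```

In this sketch the content-only query matches both documents, while restricting the same term to the `title` element narrows the result to the one document whose title contains it, which is the gain in precision attributed to (CS,CS) systems above.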