A new data model for XML databases

Richard Ho,* Li Bai and David Elliman
School of Computer Science & IT, Nottingham University, UK

ABSTRACT

The widespread activity involving the Internet and the Web causes large amounts of electronic data to be generated every day. This includes, in particular, semi-structured textual data such as electronic documents, computer programs, log files, transaction records, literature citations, and emails. Storing and manipulating the data thus produced has proven difficult. As conventional DBMSs are not suitable for handling semi-structured data, there is a strong demand for systems that are capable of handling large volumes of complex data in an efficient and reliable way. The Extensible Markup Language (XML) provides such a solution. In this paper, we present the concept of a ‘vertical view model’ and its uses as a mapping mechanism for converting complex XML data to relational database tables, and as a standalone data model for storing complex XML data. Copyright © 2003 John Wiley & Sons, Ltd.

INTRODUCTION

With the fast development of Internet technology, large volumes of data in the form of electronic documents have been created and stored in a form accessible from the Web. For the purposes of data integration and data exchange, more and more existing sources, such as relational databases, support public XML export, and an increasing amount of public and private data is described in a semi-structured way. A number of issues need to be addressed when we integrate data from different sources, including heterogeneous and duplicate data, multiple divisions and partners, and changes.

Data heterogeneity results from the use of different information management systems to store data, and each system has its own data structure and access methods.
Relational database management systems benefit from the universal acceptance of Structured Query Language (SQL) as the primary means of getting answers, whilst document and email repositories are generally accessed using text search engines with varying interfaces and capabilities. Because these systems were not designed with interoperability in mind, each must generally be accessed using source-specific applications or application programming interfaces (APIs).

Another difficulty in data integration is data duplication: different systems represent the same piece of data in different ways. For example, customers may be identified by name in one database, but by account number in a second database. A third data source, perhaps an email repository, may identify the same customer by email address. Frequently a required piece of information is derived from multiple data points. Data integration is further complicated when customers do business with multiple divisions within a large company, or with other partners. Similarly, answering questions about the state of a company’s supply chain requires access to vendor and distributor information sources. Doing business electronically across the firewall gives rise to security and data ownership issues. Finally, data integration has to deal with different types of changes: changes in business requirements and strategies, in IT systems, mergers and acquisitions, and new product launches. This demands that a data integration solution be sufficiently flexible and adaptable.

*Correspondence to: Richard Ho, School of Computer Science & IT, Nottingham University, Nottingham NG8 1BB, UK.
International Journal of Intelligent Systems in Accounting, Finance & Management. Published online in Wiley InterScience. Int. J. Intell. Sys. Acc. Fin. Mgmt. 11, 149–157 (2002)

One possible solution for the data integration problems mentioned above is to provide an XML query engine to link the server and the client. XML Web services break down the barriers between different computing platforms, development environments and communications networks, allowing organizations to work together electronically without the expense and delay of agreeing on semantics, schema, interfaces, and other application integration. Figure 1 shows this architecture.

XML provides the flexibility for handling data with differing structures. As XML is becoming the principal medium for data exchange over the Web and for information integration in general, increasing amounts of public and private data are described in XML. XML data is usually defined in a tree or graph based, self-describing object instance model (Boncz and Kersten, 1999). However, semi-structured data is incompatible with the flat structure of relational database tables, and so the growth of XML data requires new and complex query optimization techniques.

In this paper, we propose a new approach for storing XML data, a hybrid model named the vertical view mapping model. In this model we separate data structure from data content in a similar way to the AST/TA model (Florescu and Kossmann, 1999). The idea is to consider XML data as consisting of two parts: (1) the actual data, and (2) information about the actual data, which is analogous to metadata. We adopt relational database techniques to cope with the logical structures of documents. On the other hand, as the data varies in length, it would cause large storage waste if we stored it in a relational database. In order to solve this problem, we propose a mapping mechanism to bridge the structure and the data. This will provide efficient storage for complex data and fast reconstruction, and will improve operation workload.

The remainder of this paper is organized as follows.
In the next section, we give a brief overview of the existing approaches for handling semi-structured data. In the third section we discuss our proposed method in detail. We present our conclusion in the final section.

Figure 1  XML web service architecture
[Figure not reproduced. Labels in the figure: Databases, File Repository, Existing Systems, Legacy Systems, ERP Systems, Business Partner or other systems, Common XML View of All Data, XML Query Engine, Data Model, Web Service, XML Queries, XML Results, Applets, Applications, Web Browser, PDA, Mobile Devices, SQL, HTTP, IIOP, Proprietary Protocol, Web services technologies (SOAP, UDDI, WSDL, ebXML).]

EXISTING APPROACHES

Existing approaches for storing semi-structured data fall into three broad categories: Flat streams, Meta-modeling, and Mixed (Bourret, 2001). In the first case, XML data is stored as serialized byte streams, for example as XML files. This method provides fast access when retrieving the whole document or large parts of the document, since the entire document or fragment can be retrieved with a single index lookup or a single positioning of the disk head. However, problems can occur when retrieving data in any other format. In the second case, XML data is stored in conventional databases. Querying and navigating the structure of the data is fast in this approach; however, it is likely to encounter performance problems when the reconstruction of the whole document is required. Several mapping schemes for handling XML data in relational databases have been proposed, such as Edge (Ceri et al., 1999), Universal Table (Garcia-Molina et al., 2000), XRel (Dodds, 2001) and XParent (Gorla and Liu, 1999), which use a number of tables to map XML data into relational database tables.
In the Mixed approaches, there are several attempts to merge the two methods above, and data is held in two repositories. Because of this, updates incur significant overheads for consistency control.

As one of the applications of the proposed concept in the paper is a mapping scheme for handling XML data in relational databases, it will be beneficial to review some of the existing mapping schemes for comparison purposes. In order to show how XML data is mapped into relational database tables in each approach, we will use the example XML data in Figure 2. Generally, XML data can be represented as a data graph; see, for example, the XML data graph in Figure 3.

    <?xml version="1.0" ?>
    <menu>
      <food category="breakfast">
        <name>Full English Breakfast</name>
        <price>6.95</price>
        <description>Two eggs, bacon or sausage, toast…</description>
        <calories>970</calories>
        <chef id="001">
          <first>Mick</first>
          <family>Burton</family>
          <email></email>
        </chef>
      </food>
      <food>
        <name>Old-Fashioned Breakfast</name>
        <price>7.03</price>
        <description>Eight eggs, complete pig, loaf of bread…</description>
        <calories>1150</calories>
        <chef>
          <first>Jack</first>
          <middle>A.</middle>
          <family>Smith</family>
          <phone>123-4567</phone>
        </chef>
      </food>
      <drink category="Hot">
        <name>Tea</name>
        <price>2.50</price>
        <description>A pot of Tea</description>
        <calories>100</calories>
      </drink>
    </menu>

Figure 2  Example XML data

Edge

The Edge mapping scheme has been studied in Ceri et al. (1999), Garcia-Molina et al. (2000) and Gorla and Liu (1999). In this approach XML data graphs are stored in a single table, namely the Edge table. The edge can be seen as a path, which describes the relationship between two nodes, ‘Source’ and ‘Target’. Associated with each node is a flag, which indicates whether the attribute is a reference or a value. Further, an ‘Ordinal’ keeps track of the sequential order of the attribute. Table 1 is the Edge table corresponding to the example XML data in Figure 2.
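
The Edge mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from any of the cited papers. The function name `edge_rows`, the tuple layout, and the rule that an element becomes a 'Ref' edge when it has attributes or child elements (and a 'Val' edge otherwise) are assumptions chosen to mirror the Source, Target, Ordinal, Flag, Name, and Value columns of the Edge table. Node identifiers are assigned in document order, with 0 standing for the root above the document element, and the sample input is a shortened variant of the menu data.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical variant of the menu example used in the text.
SAMPLE = """<menu>
  <food category="breakfast">
    <name>Full English Breakfast</name>
    <price>6.95</price>
  </food>
  <drink category="Hot">
    <name>Tea</name>
  </drink>
</menu>"""


def edge_rows(xml_text):
    """Return (source, target, ordinal, flag, name, value) tuples, one per edge."""
    rows = []
    next_id = 0  # 0 is reserved for the root above the document element

    def new_id():
        nonlocal next_id
        next_id += 1
        return next_id

    def visit(elem, source, ordinal):
        target = new_id()
        if len(elem) > 0 or elem.attrib:
            # Assumption: elements with attributes or children are references.
            # Any text directly inside such an element is ignored in this sketch.
            rows.append((source, target, ordinal, "Ref", elem.tag, None))
            position = 0
            for attr_name, attr_value in elem.attrib.items():
                position += 1
                rows.append((target, new_id(), position, "Val", attr_name, attr_value))
            for child in elem:
                position += 1
                visit(child, target, position)
        else:
            # Leaf elements carry their text as the value.
            rows.append((source, target, ordinal, "Val", elem.tag, (elem.text or "").strip()))

    visit(ET.fromstring(xml_text), 0, 1)
    return rows


for row in edge_rows(SAMPLE):
    print(row)
```

Each tuple corresponds to one row of an Edge table. Because every parent-child link is stored separately, rebuilding a document from such rows would mean following the source and target identifiers and ordering siblings by their ordinal.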

Universal Table

In this approach a Universal table (Ceri et al., 1999; Garcia-Molina et al., 2000) is generated to store all the elements of an XML document. Each element name occurring in an XML document has a set of corresponding columns. There will be many null values in different columns, which results in update anomalies. Table 2 is an example Universal table.

Figure 3  XML data graph
[Figure not reproduced. It draws the example XML data as a graph below a root node, with nodes numbered #1 to #28, element and attribute names, text values, and a legend distinguishing text, element, and attribute nodes.]

XRel

The XRel (Dodds, 2001) approach divides an XML document into four tables, namely Element, Text, Attribute, and Path. In this approach, XML data falls into three categories: element, attribute, and text. The Element table is mainly for presenting the regions of each subtree. The regions {start, end} depict the actual start and end positions in the XML document. The actual data content is stored in the Attribute and Text tables. The combination of {docID, pathID} is an identifier for each node in the XML data. Table 3 is the XRel representation of the example XML data in Figure 2.

There are several important issues to be taken into consideration when choosing a mapping scheme, such as variable length data, structure complexity, redundancy, document reconstruction, and access efficiency. The Universal Table approach is not good at handling complex data since each XML element should have a set of corresponding columns.
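
The column growth and null values behind this weakness can be seen in a small sketch. The Python below is illustrative only: `universal_table`, `null_fraction`, and the `ordinal_<name>` / `target_<name>` column naming are hypothetical, and the input is a hand-picked subset of edges from the menu example. As a simplification, it assumes each name occurs at most once under a given source node.

```python
# (source, ordinal, name, target) tuples: an illustrative subset of the
# edges of the menu example, not the full data set.
EDGES = [
    (2, 1, "category", "Breakfast"),
    (2, 6, "chef", "#8"),
    (13, 5, "chef", "#18"),
    (23, 1, "category", "Hot"),
    (18, 4, "phone", "123-4567"),
]


def universal_table(edges):
    """Pivot edges into one wide row per source node.

    Every distinct name contributes two columns, so the table widens with
    each new element or attribute name. Cells with no matching edge stay None.
    """
    names = []
    for _, _, name, _ in edges:
        if name not in names:
            names.append(name)

    columns = ["source"]
    for name in names:
        columns += [f"ordinal_{name}", f"target_{name}"]

    rows = {}
    for source, ordinal, name, target in edges:
        row = rows.setdefault(source, dict.fromkeys(columns))
        row["source"] = source
        row[f"ordinal_{name}"] = ordinal
        row[f"target_{name}"] = target
    return columns, list(rows.values())


def null_fraction(columns, rows):
    """Share of cells in the wide table that hold no value."""
    cells = [row[column] for row in rows for column in columns]
    return sum(cell is None for cell in cells) / len(cells)


columns, rows = universal_table(EDGES)
print(columns)
print(null_fraction(columns, rows))
```

With only three distinct names the table already has seven columns, and for this five-edge subset half of the 28 cells are null. Adding a new element name to the document widens every existing row.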
Both Edge and Universal table approaches have redundancy problems in handling the structure of XML data. This makes it difficult to cope with changes in structure and therefore to maintain data accuracy. Moreover, as mentioned in XRel (Dodds, 2001), path expressions are insufficient to restore the topology of XML trees, since more than one node may share the same path expression and precedence relationships among nodes may be lost in path expressions. This gives rise to problems in document reconstruction (Jiang et al., 2001; Scheffner and Conrad, 2001). Ronald Bourret (2001) mentioned that a native XML database is suitable for storing document-centric documents, and a relational database provides better performance for data-centric documents. As a result, the relational database will no longer be feasible for variable-length data, so new data models are needed.

Table 1.  Edge table

Source  Target  Ordinal  Flag  Name         Value
#0      #1      1        Ref   Menu
#1      #2      1        Ref   Food
#2      #3      1        Val   Category     Breakfast
#2      #4      2        Val   Name         Full English Breakfast
#2      #5      3        Val   Price        6.95
#2      #6      4        Val   Description  Two eggs, bacon or sausage . . .
#2      #7      5        Val   Calories     970
#2      #8      6        Ref   Chef
#8      #9      1        Val   Id           001
#8      #10     2        Val   First        Mick
#8      #11     3        Val   Family       Burton
#8      #12     4        Val   Email
#1      #13     2        Ref   Food
#13     #14     1        Val   Name         Old-fashioned Breakfast
#13     #15     2        Val   Price        7.03
#13     #16     3        Val   Description  Eight eggs, complete pig . . .
#13     #17     4        Val   Calories     1150
#13     #18     5        Ref   Chef
#18     #19     1        Val   First        Jack
#18     #20     2        Val   Middle       A.
#18     #21     3        Val   Family       Smith
#18     #22     4        Val   Phone        123–4567
#1      #23     3        Ref   Drink
#23     #24     1        Val   Category     Hot
#23     #25     2        Val   Name         Tea
#23     #26     3        Val   Price        2.50
#23     #27     4        Val   Description  A pot of tea
#23     #28     5        Val   Calories     100

Table 2.  Universal table

Source  Ordinal_category  Target_category  ...  Ordinal_chef  Target_chef  ...  Ordinal_phone  Target_phone
#2      1                 Breakfast        ...  6             #8           ...  Null           Null
#13     Null              Null             ...  5             #18          ...  Null           Null
#23     1                 Hot              ...  Null          Null         ...  Null           Null
#8      Null              Null             ...  Null          Null         ...  Null           Null
#18     Null              Null             ...  Null          Null         ...  4              123–4567

THE VERTICAL VIEW MODEL

The ‘vertical view model’ separates the logical structure of data from its contents to provide reliable access for semi-structured data. The idea of separating structure from data is similar to the AST/TA model (Florescu and Kossmann, 1999a). However, in the AST/TA model data is located by a pair of values, called the Text Surrogate Values, which are the start position and the length of the data. Scheffner and Conrad (2001) pointed out that in their AST/TA model, data length changes require expensive shifting, and in the worst case, re-calculation of all the text surrogate values is required to keep the model correct. We use a conceptual mapping to bridge metadata and