A natural language approach for data mart schema design

Fahmi Bargui, Jamel Feki, Hanene Ben-Abdallah
Mir@cl Laboratory, FSEG, University of Sfax, Tunisia, P.O. Box 1088
{fahmi.bargui, jamel.feki, hanene.benabdallah}

ABSTRACT

In this paper, we are interested in the requirements engineering of DSS. In particular, to address the difficulties that emerge during the specification, validation and automatic processing of OLAP requirements, we propose a data mart schema generation method. Our method defines a Data Warehouse Requirement Model (DWRM) to specify OLAP requirements in natural language. In addition, it includes a semi-automatic process to extract multidimensional concepts and to generate data mart schemas from requirements specified in this DWRM model.

Keywords: DSS, OLAP requirements engineering, natural language

1. INTRODUCTION

This paper addresses two shortcomings in the context of automatic data mart schema generation. First, it proposes a natural language-based formalism to facilitate the elicitation and specification of analytical requirements; the proposed model is called the DW Requirements Model (DWRM for short). Second, it defines a semi-automatic process for the extraction of multidimensional concepts and the generation of Data Mart (DM) schemas.

To define the DWRM, we adopted three essential recommendations: i) define this model at a high level of abstraction in order to guarantee its stability; ii) use simple concepts that are easy to understand not only by decision makers but also by the DSS (Decision Support System) development team, which facilitates communication between the various stakeholders without any prior training; and iii) ensure the traceability of requirements expressed with DWRM. To meet these three recommendations, we opted for natural language as a means for requirements specification. Natural language can be used for requirements specification in two ways: template-based and free syntax-based.
In the first way, decision makers fill in a template with business terms to indicate what they want to analyze, according to which axes, etc. This alternative facilitates the analysis and derivation of the DW schemas, but it limits the application scope of the DWRM to some standard queries. The second way is to use free natural language syntax. Despite the large expressive power of such a language, its processing remains a big challenge. We therefore adopted an intermediate way: our modeling language relies on a structured syntax that includes linguistic patterns formalizing common and frequent styles of queries. The linguistic patterns were defined through an empirical study covering several hundred analytical requirements written in French.

Once the analytical requirements are specified, they must be processed to derive the corresponding DM schemas. To do so, our method defines a semi-automatic process that identifies the multidimensional concepts within the specified requirements and then generates DM schemas. The decision makers intervene to fine-tune the organization of hierarchies, which requires domain knowledge.

In the remainder of this paper, we first overview relevant work on analytical requirements specification. In Sections 3 and 4, we outline our method for elaborating the DWRM and then present its grammar. In Section 5, we propose a process for the semi-automatic generation of data mart schemas relying on DWRM. Finally, Section 6 summarizes the presented work and sketches some future perspectives.

2. RELATED WORK

In this section, we briefly study how the three main categories of DW design approaches, namely data-driven, requirement-driven and mixed, deal with analytical requirements.

Data-driven approaches

To build a DM multidimensional schema, data-driven approaches (cf.
[2], [5], [6], [11]) mainly apply transformation rules to the enterprise operational data model. However, these approaches presume that decision makers have good expertise in operational data models and a perfect knowledge of data source structures. Although these approaches can reach a good level of automation, they marginalize the OLAP requirement analysis phase during the DSS design. Consequently, the DW project may not satisfy all its future users and may, hence, fail [4].

Requirement-driven approaches

These approaches, initiated by [8], take into account the analytical needs of the future DSS users. However, [8] offers no formal method for requirement expression to help decision makers formulate their needs. In a similar way, [10] and [14] proposed a process-driven approach limited to the documentation of the analytical requirements. The latter are analyzed according to three perspectives at different levels of abstraction, each of which is associated with a specific requirement template: business, user and implementation. Although the authors concentrate on the documentation of the analytical requirements, their approach only indicates the components of each template without giving enough illustrative details and examples. Moreover, the authors do not propose guidelines to assist the designer.

In [12], the authors propose an engineering process called DWARF (Data WArehouse Requirements deFinition), composed of three main phases: identification, specification and validation of analytical requirements. To assist the designer during each of these phases, the authors reuse techniques well known in software engineering. Nevertheless, the authors did not propose an appropriate formalism for the expression of OLAP requirements. Furthermore, we consider that recent techniques and approaches (e.g., goal-driven approaches [4]) may lead to better and more effective requirements analysis.
On the other hand, in [3], the authors propose a tabular format for OLAP requirements expression and a semi-automatic approach for the derivation of multidimensional schemas from these tabular requirements. Although the authors presented algorithms for this derivation, they suppose that the decision makers are qualified enough, or could be assisted by a decisional ontology, to identify and express their requirements.

Mixed approaches

Mixed approaches result from a combination of the two previous ones. Within this category, [17] and [18] proposed a process allowing the analysis of requirements and their mapping to data sources. The authors did not offer techniques, guidelines or models for the identification, documentation or validation of OLAP requirements.

The mixed approach proposed in [4] is goal-driven; it derives a multidimensional schema from two models elaborated by the designer according to two perspectives: an organizational model centered on the stakeholders and a decisional model focused on the decision makers. Although the proposed approach facilitates the identification of analytical requirements while considering both functional and organizational requirements, the resulting models are expressed with several DW technical details; these details complicate the validation process by the decision makers, who are often not experts in DW technology. Furthermore, the proposed derivation process of multidimensional schemas from the requirement models cannot be automated. In [19], the authors present a fully automatic approach to derive a DM schema from data sources and requirements expressed in SQL queries. The main limit of this approach is that analytical requirements are expressed in SQL, which complicates the validation process for decision makers.
This overview of the related work highlights the following key points: i) there is an obvious lack of appropriate formalisms for modeling OLAP requirements that are practical for designers and easy for decision makers to understand and validate; ii) the proposed models take requirements traceability into account either weakly or not at all; and iii) there is no complete approach that automatically derives the DW/DM schema starting from specified OLAP requirements.

The remainder of this paper addresses these shortcomings, first by developing a natural language-based formalism for requirements specification (cf. Sections 3 and 4), and second by defining a transformation process to derive DM schemas from OLAP requirements (cf. Section 5).

3. AN ANALYTICAL REQUIREMENTS MODEL

In [9] the authors claim that a model for requirements specification must satisfy the following five characteristics:

1. High level of abstraction: the requirements model must directly capture the business concepts of decision makers.
2. Completeness: all information necessary to the decision making process must be captured.
3. Readability: the natural language used for requirements documentation is easily comprehensible by decision makers.
4. Precision: the use of mathematical formulas for the specification of indicators and of a grammar for expressing analytical requests brings better precision to the specification.
5. Traceability: it indicates to what each element of the requirements model corresponds in the model produced during the development process [15].

Furthermore, each model produced during the development of a DSS must be rooted in the essential nature of the decision making process [13].

In the remainder of this section, we describe the decision process in order to identify the business concepts that we group in a template describing OLAP requirements. Then, we present linguistic patterns collected empirically.

3.1. DECISION PROCESS AND BUSINESS CONCEPTS

A decision is intentional, i.e., each justifiable decision is based on at least one objective and one prediction [16]. According to [7], an objective must be measurable by an estimated target value (e.g., a rate, a quantity, an amount) that a process must reach during a given period of time. The attained value of an objective is measured by a key performance indicator. The analysis of the difference between the attained and target values allows the decision maker to measure the realization level of the objective and hence to judge the performance of the process. In the case of a negative variation, the decision maker notes an anomaly and looks for its origins. To do so, they examine detailed information about the analyzed process, which they retrieve from the DW through analytical requirements they have to formulate.

Given the above business concepts pertinent to a decision making process, we propose the template shown in Figure 1 as a means to formulate analytical requirements. This template can be filled in by a decisional designer with information (e.g., title, summary...) for a better documentation of the requirements.

    Title: ....................
    Summary: ....................
    Creation date: ..........   Update date: ..........   Version N°: ..........   Author: ..........
    Actor: ..........   Process: ..........
    Objective 1: ....................
        Indicator 1   Label: ..........   Formula: ..........   Target: ..........
                      Analytical queries: (1) ..........  (2) ..........  (3) ..........
        Indicator 2   Label: ..........   Formula: ..........   Target: ..........
                      Analytical queries: (1) ..........  (2) ..........  (3) ..........

Figure 1. Template for analytical requirements specification

In the above template:
- Process: indicates the business process to be evaluated.
- Actor: the name of the person who controls the process performance.
- Objective: a measurable objective that a process must reach during a given period. Several objectives can be associated with the same process.
- Indicator: provides a value indicating the level of realization of an objective. For each indicator, the following information must systematically be clarified:
    - Label: the indicator label.
    - Formulas: an indicator has a main formula and secondary formulas useful for the calculation of the main formula operands. In our DWRM model, a secondary formula is introduced by the symbol "\" after the main formula.
    - Target: an estimated value indicating the acceptable level of the indicator.
- Analytical queries: extract relevant information for the analysis and decision making.

Information about an indicator (i.e., label and formulas) will be present in the analytical queries. These queries will be formalized by a grammar we present in Section 4.

3.2. LINGUISTIC PATTERNS FOR ANALYSIS AXES

In accordance with [8] and [12], natural language is the best means of expression for analytical requirements because it facilitates communication with the decision makers. However, the diversity of writing styles often causes semantic ambiguities. To overcome these difficulties, we chose to fix an expression style while benefiting from the advantages of natural language (mainly expressivity and simplicity).

We defined a simplified grammar by collecting and studying a set of analytical queries written in French. The following are sample queries:

1. Analyser le chiffre d’affaires par catégorie de produit par catégorie d’un client par jour et année.
« Analyze the turnover by category of product, by category of customer, per day and year »
2. Afficher le nombre total d’heures supplémentaires par enseignant et par semestre.
« Display the total number of overtime hours by teacher and by semester »
3. Etudier l’évolution du nombre de mortalité des volailles par poulailler d’élevage et par date.
« Study the evolution of the number of poultry mortalities by breeding hen house and by date »

Through our study, we noted that decision makers often use particular expressions in their requirements. For example, to introduce analysis axes, they employ prepositions (e.g., par « by », selon « according to »…). Similarly, to specify the properties describing analysis axes, they use nominal groups having simple recurrent grammatical structures, such as a single noun (e.g., enseignant « teacher »).

In the remainder of the paper, we refer to the grammatical structures employed to describe analysis axes as linguistic patterns, which we detail later.

On the other hand, in the literature of DW design, analysis axes are rooted in the Information System (IS) entities. These entities are described textually in the IS repository. Thus, to define complete linguistic patterns, we collected and examined 4000 nominal groups describing properties, taken from one hundred repositories elaborated in senior-year projects and belonging to nine distinct domains (commercial, e-commerce, accountancy, e-learning, insurance, finance, human resources, production and medical). The examination of these nominal groups led us to identify nine frequently used linguistic patterns.
Table 1 gives these various patterns and the frequencies of their appearances.

Table 1. Grammatical structure statistics (N: noun, adj: qualifying adjective, det: determinant, pp: past participle, prep: preposition).

    No.  Linguistic pattern           Example                                                                           Frequency (%)
    1    [det] N                      un client « a customer »                                                          13.55
    2    [det] N prep det N           nom d’un client « name of a customer »                                            36.95
    3    [det] N prep N prep det N    adresse de livraison d’un client « address of delivery of a customer »            5.03
    4    [det] N adj prep det N       code postale d’un client « postal code of a customer »                            2.78
    5    [det] N prep det N adj       désignation d’un acte médical « designation of a medical act »                    2.38
    6    [det] N prep det N prep N    numéro d'une carte de crédit « number of a credit card »                          3.40
    7    [det] N prep det N pp        ancienneté d’un ouvrier qualifié « seniority of a skilled worker »                3.10
    8    [det] N pp prep det N        quantité livrée d’un article « delivered quantity of an article »                 2.80
    9    [det] N prep N adj prep det N  carte d’identité nationale d’un client « national identity card of a customer »  0.28
         Other complex structures                                                                                       29.75

In the studied sample, approximately 70% of the nominal groups (lines 1 to 9 of Table 1) are concise and precise. Moreover, they adhere to simple grammatical structures composed of:
- Qualifying adjective: in our context, we noticed that these adjectives help remove ambiguities. For example, in the nominal group « carte d’identité nationale d’un client » (« national identity card of a customer »), if we remove the adjective « nationale » (« national »), the semantics becomes ambiguous: the identity card could be a banking card or a school card.
- Past participle: without an auxiliary, the past participle plays the role of an adjective. In our context, we noticed that the use of the past participle was restricted to the role of a qualifying adjective.
For instance, in the nominal group « quantité livrée » (« delivered quantity »), the word « livrée » (« delivered ») plays the role of a qualifying adjective. Removing it introduces an ambiguity.
- Noun: indicates a concrete or abstract entity such as customer or address.
- Determinant: characterizes a noun; it can be a definite or indefinite article (e.g., one, it…).
- Preposition: introduces a complement to a verb, noun, adjective or adverb.

Moreover, we noticed that the nine identified linguistic patterns (lines 1 to 9 of Table 1) embrace multidimensional concepts. In line 1 of Table 1, the grammatical structure det-N describes an entity indicating a dimension. For example, the nominal group « un client » (« a customer ») indicates the customer dimension. In lines 2 to 9 of Table 1, the structure prep-det is used to separate two nominal groups; the first indicates a dimensional attribute and the second indicates a dimension. For instance, the nominal group « code postale d’un client » (« postal code of a customer »), having the structure N-adj-prep-det-N (line 4 of Table 1), generates a dimension « client » (« customer »), having the structure N, and a dimensional attribute « code postale » (« postal code »), having the structure N-adj.

Furthermore, in the nine identified patterns, a dimensional attribute or a dimension is specified by a nominal group according to one of the following five structures: det-N, N-prep-N, adj-N, N-pp and N-prep-adj-N. In these five structures, we notice that a noun is always followed by a prep, adj or pp. The nine identified linguistic patterns can then be merged in order to define a generic nominal group noted GN1.
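The role of the prep-det separator can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the nominal group has already been tagged (by hand or by any part-of-speech tagger) with the tag names used in the caption of Table 1, and it assumes the generic form N (N | adj | pp | prep)* for GN1. The function name `split_nominal_group` and the helper `_words` are hypothetical.

```python
import re

# Tag names follow the caption of Table 1: N, adj, det, pp, prep.
# Assumed generic form of GN1, written as a regular expression over a
# space-separated sequence of tags.
GN1 = re.compile(r"N( (N|adj|pp|prep))*")


def _words(part):
    """Return the words of a tagged part, or None if the part is absent."""
    return None if part is None else [word for word, _ in part]


def split_nominal_group(tagged):
    """Split a tagged nominal group at the first 'prep det' pair.

    `tagged` is a list of (word, tag) pairs. Returns a pair
    (attribute_words, dimension_words); the attribute is None when the
    group only names a dimension (pattern 1, '[det] N').
    """
    # The leading determinant is optional in every pattern, so drop it.
    if tagged and tagged[0][1] == "det":
        tagged = tagged[1:]
    tags = [tag for _, tag in tagged]

    left, right = None, tagged
    for i in range(len(tags) - 1):
        if tags[i] == "prep" and tags[i + 1] == "det":
            # Attribute before the separator, dimension after it.
            left, right = tagged[:i], tagged[i + 2:]
            break

    # Each side must itself be a valid GN1 group.
    for part in (left, right):
        if part is not None:
            if not GN1.fullmatch(" ".join(tag for _, tag in part)):
                raise ValueError("nominal group does not match GN1")
    return _words(left), _words(right)


example = [("code", "N"), ("postale", "adj"), ("d'", "prep"),
           ("un", "det"), ("client", "N")]
attribute, dimension = split_nominal_group(example)
```

For the line 4 example, `attribute` holds the words of the dimensional attribute and `dimension` the words of the dimension. Groups falling under "Other complex structures" are simply rejected by this sketch.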
Note that the remaining 30% of the nominal groups (last line of Table 1) are rather long statements containing superfluous words in four grammatical categories: non-qualifying adjectives (demonstrative, indefinite, interrogative, etc.), adverbs, pronouns (possessive, demonstrative, indefinite, etc.) and named entities. In addition, we noticed that words in these four grammatical categories do not correspond to any multidimensional concept. For example, in the nominal group « les quatre dernières années » (« the last four years »), having the grammatical structure det - cardinal numeral adj - ordinal adj - N, only the word « années » (« years ») corresponds to a dimensional attribute; the other words are superfluous. Thus, the nominal groups can be rewritten in a more concise and simple way by using the nine linguistic patterns described by GN1.

4. GRAMMAR FOR THE DWRM

The fusion of the above nine linguistic patterns produces the following GN1 grammatical structure:

    GN1 ::= N (N | adj | pp | prep)*

where the character * indicates n (n ≥ 0) occurrences of an element.

In fact, GN1 allows only a partial description of a dimension, limited to one attribute at a time. In order to completely describe a dimension, we generalize the GN1 syntax by defining a full format allowing the description of a dimension with many attributes. This leads us to define the linguistic pattern GN:

    GN ::= [det] GN1 [, [det] GN1, … and [det] GN1] [prep det GN1]

where the structure prep-det separates the nominal groups indicating dimensional attributes from the one indicating a dimension. This facilitates the segmentation process of analytical queries. An example describing an analysis dimension according to GN is: « nom d’un client » (« name of a customer »).

In addition, the study of the grammatical structures of the nominal groups used in the specification of the time dimension (e.g.
« jour et année » (« day and year »)) showed the absence of the structure prep-det as a separator between the nominal group indicating the time dimension and the one indicating its dimensional attributes. The absence of a nominal group explicitly naming the time dimension causes problems in the segmentation phase of the query given by the decision maker; this in turn causes errors in the analysis and generation of the multidimensional schema. This is due to the fact that the time dimension is implicitly announced by words such as day, year, etc. In order to overcome this difficulty, we chose to express the time dimension in accordance with GN, i.e., by explicitly writing the keyword date as follows: day and year of a date; this has the grammatical structure N and N prep det N.

Figure 2 depicts our proposed grammar for the expression of analytical queries:

    Query ::= Verbe (verb) [GM] Indicateur (indicator) [GM] (Marqueur_dimensionnel (dimensional marker) GN)* [Marqueur_comparatif (comparative marker) GM]
    GN ::= [Déterminant (det)] GN1 [, [Déterminant (det)] GN1, … et (and) [Déterminant (det)] GN1] [Préposition (prep) Déterminant (det) GN1]
    GN1 ::= N (N | adj | pp | prep)*
    Verbe ::= analyser (analyze) | comparer (compare) | étudier (study) …
    GM ::= chaîne de caractères (string)
    Indicateur ::= mots clés indiquant des concepts métiers (string of keywords)
    Marqueur_dimensionnel ::= durant (during) | en fonction de (according to) | selon (according to) | suivant (according to) | par (by/per)
    Marqueur_comparatif ::= avec (with) | par rapport à (compared to)
    Préposition ::= de (of) | à (at) | dans (in) | chez (at) | depuis (since) ...
    Adjectif ::= adjectif qualificatif (qualifying adjective)
    Déterminant ::= un (a) | des (of) | de la (of the) | de l’ (of the) | le (the) ...

Figure 2. Grammar for the analytical queries.
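As an illustration of how a query written with this grammar could be segmented, here is a minimal Python sketch. It is only a sketch under stated assumptions, not the authors' process: the indicator label is passed in by the caller (in the DWRM it would come from the requirement template), the verb list contains only the three verbs shown in the grammar plus an unaccented spelling of étudier, the optional comparative part is ignored, and the function name `segment_query` is hypothetical.

```python
import re

# Dimensional markers listed in the grammar.
DIMENSIONAL_MARKERS = ["en fonction de", "durant", "selon", "suivant", "par"]

# Verbs shown in the grammar; the source list is open-ended ("…").
# "etudier" (no accent) is an assumption to accept unaccented input.
VERBS = {"analyser", "comparer", "étudier", "etudier"}

# A marker counts only when surrounded by whitespace.
_MARKER_RE = re.compile(
    r"\s+(?:" + "|".join(map(re.escape, DIMENSIONAL_MARKERS)) + r")\s+"
)


def segment_query(query, indicator):
    """Split a query into verb, optional GM, indicator and dimension groups."""
    # The text before the first marker holds the verb, GM and indicator;
    # every later piece is one GN (dimension description).
    head, *groups = _MARKER_RE.split(query.strip().rstrip("."))
    verb, _, rest = head.partition(" ")
    if verb.lower() not in VERBS:
        raise ValueError(f"unknown verb: {verb!r}")
    if indicator not in rest:
        raise ValueError("indicator not found before the first dimensional marker")
    # Whatever surrounds the indicator is treated as the optional GM string.
    gm = rest.replace(indicator, "", 1).strip()
    return {
        "verb": verb,
        "gm": gm,
        "indicator": indicator,
        "dimensions": [group.strip() for group in groups],
    }


result = segment_query(
    "Etudier l'évolution du nombre de mortalité des volailles "
    "par poulailler d'un élevage par date.",
    "nombre de mortalité des volailles",
)
```

Each entry of `result["dimensions"]` is a GN string that would then need tagging and matching against GN to separate dimensional attributes from the dimension itself.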
In Figure 2, the square brackets indicate optional elements, the * indicates possibly multiple occurrences of an element, and GM is an optional string useful to complete the semantics of the query.

Each query, even a complex one, can be written simply according to our grammar. For instance, the third query of Section 3.2, which studies the evolution of the number of poultry mortalities and was written in free natural language, can easily be expressed as follows:

    Etudier (verb) l’évolution du (GM) nombre de mortalité des volailles (indicator) par (dimensional marker) poulailler d’un élevage (GN) par (dimensional marker) date (GN).

The formalization of user requirements according to our DWRM model produces specified queries free of natural language ambiguities. This constitutes a step toward an automatic derivation of data mart schemas. The following section details our approach to reach this goal.

5. DM SCHEMA GENERATION PROCESS

For the generation of multidimensional schemas (i.e., DM schemas) from user requirements, we propose the requirement-driven approach depicted in Figure 4. This approach consists of three major phases, namely analytical requirements acquisition, pretreatment and generation.

In the following subsections, we describe the phases of our approach and illustrate it with the example shown in Figure 3. The latter specifies one analytical requirement expressed by a Sales manager who aims to maximize the sales of products sold by their company. The realization of this objective requires the adoption of strategies (e.g., allowing an extra discount on orders exceeding a certain amount, ease of payment…). To measure the performance of these strategic choices, and consequently of the sales process, the decision maker chooses the indicator « Taux de Croissance Mensuel du Chiffre d’Affaires (TCMCA for short) » (« Monthly Turnover Rate (MTR) »).
Figure 3 shows all information concerning this indicator. Moreover, the decision maker chooses the information necessary for their analysis. The analytical requirement components of the example show the information required by the actor Sales manager.

5.1. ANALYTICAL REQUIREMENTS ACQUISITION

Before their acquisition, the analytical requirements (AR), identified and specified in natural language by the DSS designer according to our DWRM model, are presented to decision makers for validation. This validation requires a strong interaction with decision makers in order to identify and correct mistakes and/or clarify misunderstandings of their needs. This phase allows the DW development team to correct their requirement models. The acquisition accepts and stores these validated analytical requirements so that they are available for the next processing steps.

5.2. PRETREATMENT

This phase extracts the relevant words and nominal groups (i.e., those indicating multidimensional concepts) from the stored AR. It includes three steps: multidimensional (MD) concept extraction, text labeling and syntax checking.

Multidimensional concept extraction

This extraction identifies the fact, the measures, and the dimensions with their attributes, respectively from the three components of our DWRM model: process, formulas and analytical queries.

1. Fact name extraction: A star schema allows detailed analyses of a business process [8] [1]. Generally, this star schema consists of a single fact, measures and dimensions. We can then associate the name of a fact with the name of the process to be evaluated.

2. Measure extraction: Measures are generally numerical attributes used in the calculation of the key performance indicators. Thus, these measures are extracted from the formulas of indicators. A formula contains empty words (i.e., words not matching measures) such as arithmetic operators, aggregation functions, punctuation marks, numerical values and special characters.
However, the operands of the formula are significant (non-empty) words which correspond to candidate measures. A computation formula may contain redundant operands. Hence, the extraction of measures from a formula must undergo a filtering step of