A mereology-based general linearization model for surface realization

Ciprian-Virgil Gerstenberger
Computational Linguistics, University of Saarland, Germany

Abstract

In this paper, we propose a cross-linguistically motivated architecture for surface realization based on mereology, i.e., on the part-whole distinction. First, we present the main ideas that motivated the model. Then we present a general mereological description of the natural language utterance. The utterance is modeled in terms of embedded Linear Order Parts with two mutually exclusive relations holding between them: the Part-Of relation and the Linear Order relation. A General Linearization Model based on these concepts consists of a linearization module, an inflection module, and a text polishing module. The architecture we propose models surface realization phenomena in terms of constraints on grammatically valid configurations of utterance "parts". We illustrate linearization with our model by presenting walk-through examples, and compare our model with other approaches to linearization.

1 Introduction

In the time of multimedia, multilingual, web-based information systems, spoken dialogue systems play an important role in presenting search results in a compact, flexible way. This is a challenge for traditional Natural Language Generation (NLG) systems, which have to provide spoken systems with flexible, context-sensitive output.

Different NLG approaches propose language-independent methods for determining word order (e.g., (Habash et al., 2001), (Bohnet, 2004), (Gerdes and Kahane, 2001)). However, basic questions such as (1) what are the primitive items for linearization, (2) whether to linearize lemmata or already inflected words, (3) how to form complex linear order parts, and (4) what are the subsequent steps of linearization that lead to grammatically and orthographically correct strings, are left open.
In this paper, we propose a cross-linguistically motivated general linearization model that addresses these basic questions.

To emulate the flexibility of natural language with respect to surface realization, the properties that account for this flexibility have to be properly analysed. To this end, we propose a linguistically informed mereological utterance description (MUD). This description is the guideline for the general linearization model (GLM) we propose. We show that this linearization model offers a solid ground for a uniform treatment of various linguistic phenomena related to linearization, inflection, phonological assimilation, and orthography.

This paper is organized as follows: In section 2, we present four basic observations we consider of major importance for language-independent surface realization. In section 3, we propose a mereological description of the natural language utterance as the foundation for a flexible, language-independent, general linearization model. Section 4 deals with three essential questions for a general sentence realization architecture: (1) What to linearize: symbols for words (lemmata) or word forms (inflected words)? (2) What are the primitive items for linearization? (3) How to form complex linear order parts? In section 5, we describe the general linearization model, and briefly describe the inflection and the text polishing submodules of the overall architecture. We exemplify linearization with the GLM in section 6 and compare our model to similar approaches in section 7. We summarize our model in section 8.

2 Observations

We make the following observations relevant to language generation:

Observation 1: Speech is prior to writing, both culturally and historically (see (Greenberg, 1968)).
Designing and developing linguistically motivated NLG systems involves a strict separation of phonological and orthographic knowledge, as well as a heavy use of phonological knowledge for proper orthographic surface realization (but not vice versa).

Observation 2: Generation and analysis are two fundamentally different tasks.

The input for analysis is a single linearized string, and the fundamental problem of analysis is ambiguity: to interpret the input correctly, the analysis modules have to construct syntactic structure and to insert empty elements (traces, empty topological fields, etc.) into that structure. The fundamental problem of generation is choice: the generator knows what to say, but there are many ways to say it. For linearization, there is no need for empty elements, nor to output analysed structures such as NPs and VPs. The result of surface realization is always a string.

Observation 3: The smallest linearizable entity in a language is the phoneme.

Various types of speech errors reveal that, when producing language, we do not necessarily linearize whole constituents, or even words. Phoneme shifts (e.g., mutlimodal for multimodal), phoneme cluster shifts (e.g., flow snurries for snow flurries), or morpheme shifts (e.g., self-instruct destruction for self-destruct instruction) illustrate this fact.

Observation 4: The most general relation between two entities α and β such that α is a substructure of β is the part-of relation.

What is the relation between a phoneme and a syllable containing it? And the relation between a syllable and a morpheme containing it? It is evident that a phoneme is part of the syllable containing it. This also holds for a word and a constituent containing that word, or for a constituent and a clause containing that constituent. Yet what is regarded as a constituent depends on which constituency tests are used, and the usage of traditional constituency tests is controversial (see (Phillips, 1998) or (Miller, 1992)).
In contrast, there is no controversy about – and no need to test – the fact that the phoneme /n/ is part of the syllable /no/, if this syllable has been uttered. It should be noted that the part-of concept is an old philosophical concept, and that regarding language entities as part-of structures is not a novelty either (see (Moravcsik, to appear)). An example of a linguistic theory that employs the concept of part-of for cross-linguistic analysis is Radical Construction Grammar. (Croft, 2001, p. 203) says that "[...] the only syntactic structure in constructions is the part-whole relation between the construction and its elements".

3 Mereological utterance description

Taking into account observations 3 and 4, we present a mereological utterance description (MUD). The model we propose here is an extension of the part-of relation to smaller linguistic items than words, constituents, or constructions (in the sense of (Croft, 2001)). First, we define the unit of mereological description we propose and illustrate it with examples. Then, we present relations and properties of the mereological structures.

Definition 1: A Linear Order Part (LOP) is a language item which is phonologically realized as a contiguous part of a grammatically well-formed utterance.

According to the definition above, the following linguistic entities are LOPs: a phoneme (e.g., the c in cluster), a phoneme cluster – not necessarily a syllable – (e.g., the cl in cluster), a syllable – not necessarily a morpheme – (e.g., the clus in cluster), a morpheme – not necessarily a free morpheme – (e.g., the in in incredible), a word (e.g., the book in a book), parts of adjacent words (e.g., the ig red b in a big red book), or word groups (e.g., the rather boring book).
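To make Definition 1 concrete, the following minimal sketch (our own illustration, not part of the paper) models a LOP as a contiguous span over a tokenized utterance; the half-open index representation is an assumption:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LOP:
    """A Linear Order Part: a contiguous half-open slice [start, end)
    of an utterance's token (or phoneme) sequence."""
    start: int
    end: int

    def is_valid(self, utterance_len: int) -> bool:
        # A LOP must be a non-empty, contiguous part of the utterance.
        return 0 <= self.start < self.end <= utterance_len

    def surface(self, tokens):
        # The phonological/orthographic material the LOP covers.
        return " ".join(tokens[self.start:self.end])

tokens = ["the", "rather", "boring", "book"]
group = LOP(0, 4)  # the word group "the rather boring book"
word = LOP(3, 4)   # the single word "book"
print(group.is_valid(len(tokens)), word.surface(tokens))
```

Contiguity is built into the representation: there is no way to express a discontinuous span with a single (start, end) pair.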
Any contiguous part of a grammatically well-formed utterance can be a LOP, either (1) motivated linguistically – constituents such as noun phrases, adjectival phrases, verb phrases, or partial constituents; (non-empty) topological fields (as in the Topological Field Model (TFM) (Höhle, 1983)) that can consist of more than one constituent; or embedded clauses, matrix clauses, whole sentences, whole paragraphs, etc. – or (2) not motivated linguistically (e.g., the nice book that Angela read is written by Merkel). We restrict the use of MUD to LOPs that are linguistically motivated.[1]

The following two relations hold between LOPs:

Definition 2 (Part-Of relation): Let λ1 and λ2 be two different LOPs: λ1 ⊏ λ2 iff λ1 is a proper part of λ2. The PO-relation is irreflexive, asymmetric, and transitive.

Definition 3 (Linear Order relation): Let λ1 and λ2 be two different LOPs: λ1 ≺ λ2 iff the occurrence of λ1 precedes the occurrence of λ2 in the utterance. The LO-relation is irreflexive, asymmetric, and transitive.

In addition, the LO-relation and the PO-relation are mutually exclusive, i.e., two different LOPs can either PO-relate or LO-relate, but not both.

Definition 4 (Exclusivity): Let λ1 and λ2 be different LOPs. Then:
1. if λ1 ⊏ λ2, then neither λ1 ≺ λ2 nor λ2 ≺ λ1;
2. if λ1 ≺ λ2, then neither λ1 ⊏ λ2 nor λ2 ⊏ λ1.

To illustrate the three definitions above, let us consider the LOP [the book on the table]:

λ1 ≺ λ2:  λ3[ λ1[the book on the]λ1 λ2[table]λ2 ]λ3
λ1 ⊏ λ3:  λ3[ λ1[the book on the]λ1 λ2[table]λ2 ]λ3

[the book on the] cannot be part of [table], but it precedes it, whereas [the book on the] cannot precede [the book on the table], but is definitely part of it.

An important property of the parts of an utterance is that they cannot proper-overlap.

Definition 5 (Non-proper-overlapping): Let λ1, λ2, and λ3 be different LOPs, with λ2 ⊏ λ1 and λ2 ⊏ λ3. Then either λ1 ⊏ λ3 or λ3 ⊏ λ1.
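Definitions 2-5 can be sketched over span-based LOPs. This is our own illustration under the assumption that a LOP is a (start, end) half-open interval over the utterance:

```python
# Illustrative sketch of the PO-relation, LO-relation, exclusivity, and
# non-proper-overlapping, with LOPs represented as (start, end) spans.

def part_of(a, b):
    """PO-relation: a is a proper part of b (strict containment)."""
    return b[0] <= a[0] and a[1] <= b[1] and a != b

def precedes(a, b):
    """LO-relation: all of a occurs before all of b."""
    return a[1] <= b[0]

def proper_overlap(a, b):
    """True iff a and b share material but neither contains the other."""
    return (a[0] < b[1] and b[0] < a[1]
            and not part_of(a, b) and not part_of(b, a) and a != b)

# [the book on the table] over positions 0..5:
# lam1 = [the book on the], lam2 = [table], lam3 = the whole LOP.
lam1, lam2, lam3 = (0, 4), (4, 5), (0, 5)
assert precedes(lam1, lam2) and part_of(lam1, lam3)
# Exclusivity: no pair stands in both relations.
assert not (part_of(lam1, lam2) or precedes(lam1, lam3))
# Non-proper-overlapping: [the red] and [red apple] in [the red apple]
# would proper-overlap, so they cannot both be parts of one utterance.
assert proper_overlap((0, 2), (1, 3))
```

With this encoding, exclusivity is a theorem rather than a stipulation: strict containment and strict precedence of intervals cannot hold simultaneously.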
To illustrate this property, let us consider the string [the red apple]. From a mereological perspective, one can analyse this string as λ1[the red]λ1 λ2[apple]λ2 or as λ1[the]λ1 λ2[red apple]λ2, but definitely not as λ1[the λ2[red]λ1 apple]λ2.

Taking into account observation 1, MUD naturally extends to written language.

[1] It is obvious that MUD can cover all types of speech errors, but this is not part of our task.

4 From analysis to generation

There are various ways to chunk an utterance into parts, at various levels. But even if the partitions that are not linguistically motivated are excluded, some basic questions relevant to linearization arise. What is the most appropriate LOP level to work at in linearization? Shall a general linearization module work at the phoneme/grapheme level?

4.1 What to linearize?

In order to answer the above questions, all cross-lingual phenomena that are relevant to linearization should be accounted for. It is impossible to deal with all these phenomena here, but we want to call attention to this fact.

4.1.1 Inflected or non-inflected items?

For the design of a general surface realization architecture, it is necessary to determine the modules, their precise distribution of tasks, and the overall workflow. For the linearization task, this means knowing whether to linearize lemmata, i.e., non-inflected words, or lexemes, i.e., word forms.

Different linguistic theories take different positions with respect to linearization: syntax comes before inflectional morphology (e.g., Government and Binding, Minimalist Program); syntax comes after inflectional morphology (e.g., LFG, HPSG).

To try to find an answer to this question, let us consider an example. In the Romanian this-NP, the position of the demonstrative can be either prenominal (acest om, 'this man') or postnominal (omul acesta, 'this man') (see (Mallinson, 1986, p. 265), (Constantinescu-Dobridor, 2001, p. 123)). The Romanian this-NP is always definite, but it shows different marking patterns for definiteness, depending on the relative position of the demonstrative to the noun. In prenominal position, neither the demonstrative nor the noun is marked for definiteness (ex. 1), while in postnominal position, both the demonstrative and the noun are marked for definiteness[2] (ex. 5).

(1)  acest om        this man
(2) *acesta omul     this-DEF man-DEF
(3) *acesta om       this-DEF man
(4) *acest omul      this man-DEF
(5)  omul acesta     man-DEF this-DEF
(6) *om acest        man this
(7) *omul acest      man-DEF this
(8) *om acesta       man this-DEF

To obtain only the two grammatically correct variants of the Romanian this-NP (ex. 1 and 5), both the morpho-syntactic specification and the relative position of demonstrative and noun are required. This fact definitely speaks for linearization before inflectional morphology. The conclusion is that theoretical frameworks such as HPSG or LFG might not be able to generate all grammatically correct variants of a Romanian this-NP without explicit coding of linearization-relevant information in other processing modules that are not supposed to handle linearization.

[2] Whether the demonstrative is really marked for definiteness is questionable, but the fact is that it features a different form, depending on its position relative to the noun.

4.1.2 Lexical or sublexical items?

It is not always possible to tell whether an item is an affix, a clitic, or a word, as the vast literature on clitics reveals (see (Miller, 1992)). This is understandable, given that language is an ongoing process. As (Croft, 2001) put it: "[l]anguage is fundamentally DYNAMIC ... Synchronic language states are just snapshots of a dynamic process emerging originally from language use in conversational interaction."

To illustrate sublexical phenomena, let us consider the so-called floating affixes in Polish, a marker for person and number (PN-marker) in the past tense. Preverbally, it behaves like a clitic, attaching to various other words (ex. 10-11); postverbally, it behaves like a suffix, attaching only to the finite verb (ex. 9; for a detailed description, see (Kupść and Tseng, 2005), (Crysmann, 2006)).

To find out the granularity of primitive linearization entities, we propose the following test.

Linearization test: Given two items α and β at the morpho-syntactic level in a specific language, if the language allows both for α ≺ β and β ≺ α, then these items are linearization primitives.

Let us illustrate the application of the linearization test with the following cases: the Polish PN-marker, the Romanian weak pronoun, and the German separable verb particle.

(9)  Nie widzieliśmy tego.
     not see-PST-M-PL-1PL this
     'We didn't see this.'
(10) Tegośmy nie widzieli.
     this-1PL not see-PST-M-PL
(11) Myśmy tego nie widzieli.
     we-1PL this not see-PST-M-PL
(12) Să îl faceți!
     that it do-IMP-PL
     'Do it!'
(13) Să-l faceți!
     that-it do-IMP-PL
(14) Faceți-l!
     do-IMP-PL-it
(15) Sie will das Fenster aufmachen.
     she wants the window open-make
     'She wants to open the window.'
(16) Sie macht das Fenster auf.
     she makes the window open
     'She opens the window.'

A Polish PN-marker in the past tense can occur before (ex. 10-11) or after the verb (ex. 9). A Romanian weak pronoun can occur before (ex. 12-13) or after the verb (ex. 14). Finally, a German separable verb particle can occur before (ex. 15) or after the verb (ex. 16). All these different items pass the linearization test.

Taking into account the phenomena described above, we propose that the set of primitive items for linearization should contain the following types of entities: (1) sublexical items that pass the linearization test at the morpho-syntactic level, such as Polish PN-markers, Romanian weak pronouns, and German separable verb particles; (2) lexical items, provided that there is agreement among linguists about the definition of lexemes.

4.2 How to form complex entities?

In this section, we show how to form complex LOPs, assuming the primitive LOPs described in the previous section.
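The linearization test from section 4.1.2 can be sketched as a simple membership check over attested orderings. The attested-order data below is a toy stand-in for real grammaticality judgements, not from the paper:

```python
# Sketch of the linearization test: two items are linearization primitives
# iff the language attests both relative orders as grammatical.

def linearization_primitives(a, b, attested_orders):
    """True iff both a-before-b and b-before-a occur grammatically."""
    return (a, b) in attested_orders and (b, a) in attested_orders

# Polish PN-marker '-smy' vs. the verb (ex. 9-11): both orders occur.
# German article vs. noun: only one order occurs, so they fail the test
# and are not independent linearization primitives.
attested = {("smy", "widzieli"), ("widzieli", "smy"), ("das", "Buch")}
print(linearization_primitives("smy", "widzieli", attested))  # True
print(linearization_primitives("das", "Buch", attested))      # False
```

In a real system the attested-order set would be derived from a grammar or corpus rather than enumerated by hand.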
Given the well-known phenomena of discontinuous constituents, such as partial fronting and extraposition, it is obvious that forming complex LOPs does not necessarily correspond to forming traditional constituents.

If a language allows complex LOPs to occur in different positions, a general mechanism for forming complex LOPs should take the following two aspects into account:

1. whether two or more primitive LOPs always permute as a unit;
2. whether two or more primitive LOPs sometimes permute as a unit, and if so, under which circumstances.

We call the first the Total Permutation Constraint (TPC) and the second the Partial Permutation Constraint (PPC). If in a specific language two or more primitive LOPs never permute as a unit, no complex LOP can be formed of them.

As an illustration of the TPC, let us consider the German article + noun combination in ex. 17-19. In German, the article and the noun permute as a unit, independent of where they occur in a grammatically correct utterance.

(17) Das Buch ist schön.
     the book is nice
     'The book is nice.'
(18) Schön ist das Buch.
     nice is the book
     'The book is nice.'
(19) Ist das Buch schön?
     is the book nice
     'Is the book nice?'

Now imagine an – admittedly strange – language in which the article of the direct object (if there is one), the subject (if there is one), and the temporal adverbial (if there is one) can permute freely, but always as a unit: this has to be modeled in exactly the same way as the German article + noun combination above, despite the fact that they do not belong to the same constituent but are, in fact, just parts of different constituents.
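The TPC can be sketched as a check over attested word-order variants: a set of primitive LOPs may form a complex LOP only if it shows up as one contiguous unit in every grammatical permutation. The variant data below is a toy illustration of ex. 17-19, not an exhaustive grammar:

```python
# Sketch of the Total Permutation Constraint (TPC) over attested variants.

def always_a_unit(items, utterances):
    """True iff `items` occur contiguously (in some internal order)
    in every attested utterance."""
    k = len(items)
    for utt in utterances:
        windows = (set(utt[i:i + k]) for i in range(len(utt) - k + 1))
        if not any(w == set(items) for w in windows):
            return False
    return True

# German article + noun (ex. 17-19): "das Buch" moves as a unit.
variants = [
    ["das", "Buch", "ist", "schoen"],
    ["schoen", "ist", "das", "Buch"],
    ["ist", "das", "Buch", "schoen"],
]
print(always_a_unit(["das", "Buch"], variants))  # True
print(always_a_unit(["Buch", "ist"], variants))  # False
```

Note that the check deliberately says nothing about constituency: any set of primitives that always travels together qualifies, matching the "strange language" example above.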
Now it is clear that while contiguous constituents always meet the TPC, the TPC applies to all kinds of primitive LOP combinations, not necessarily to those that are semantically related. The strange language example above illustrates this extremely important point explicitly.

Please note that the fact that complex linear order parts in our model are built solely on the basis of TPC and PPC, and not on traditional syntactic constituency, is one of the crucial differences between the model we propose and approaches that, at first sight, are similar to it, such as (Bohnet, 2004) or (Gerdes and Kahane, 2001).

For a general linearization model, we want to stress that TPC/PPC and adjacency are not the same, and that adjacency alone as a linearization constraint is not an appropriate means of abstraction: TPC and PPC impose adjacency automatically if there are only two primitive LOPs to combine. Consider, for instance, the German das rote Buch ('the red book'): these three primitive LOPs meet the TPC, they always permute as a unit, but, in this constellation, the article is never adjacent to the noun. Moreover, just putting two or more LOPs together says nothing about their position relative to each other, as is the case with scrambling in the middle field in German.

To illustrate the PPC, let us consider German extraposed and non-extraposed relative clauses, such as Peter hat gestern ein Buch, das schön ist, gekauft ('Yesterday, Peter bought a nice book').
(20)  Peter hat gestern ein Buch, das schön ist, gekauft.
      Peter has yesterday a book that nice is bought
(21)  Peter hat ein Buch, das schön ist, gestern gekauft.
      Peter has a book that nice is yesterday bought
(22) *Peter hat ein Buch gestern, das schön ist, gekauft.
      Peter has a book yesterday that nice is bought
(23)  Peter hat gestern ein Buch gekauft, das schön ist.
      Peter has yesterday a book bought that nice is
(24)  Peter hat ein Buch gestern gekauft, das schön ist.
      Peter has a book yesterday bought that nice is

As long as both ein Buch and das schön ist occur in the middle field (the material between hat and gekauft in the examples above), they have to form a complex LOP that can be scrambled as a whole (ex. 20-22). However, if the relative clause is extraposed, the adverb gestern ('yesterday') can occur between the noun and the relative clause (ex. 23-24). In the same vein, we model linearization constraints stemming from different description levels: morpho-syntactic, syntactic, and macro-structural. We want to stress here, too, that our model handles discontiguous constituents and topological fields in Germanic languages, but also any kind of partial constraints in other languages, in the same way, namely by means of the PPC.

4.3 Where to linearize?

Different approaches to linearization take different positions with respect to whether linearization is an absolute (1st, 2nd, 3rd, etc.) or a relative positioning process. Taking into account the optionality of element usage in language, absolute positioning leads to the use of empty slots (see, for instance, the traditional TFM). However, this contradicts our aim to comply with observation 2. Therefore, we propose to use relative positioning, expressed in terms of before and after.

5 General Linearization Model

The General Linearization Model (GLM) we propose reflects the primitive structures of MUD, i.e., the primitive LOPs, as well as the relations between them: the PO-relation and the LO-relation, as described in section 3.
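The relative positioning proposed in section 4.3 can be sketched as constraint resolution: instead of filling absolute slots (which forces empty slots for optional elements), the linearizer collects pairwise before/after constraints and orders the items by topological sorting. This is our own sketch of the idea; the constraint format is an assumption:

```python
# Sketch of relative positioning: pairwise precedence constraints
# resolved by topological sort (Python 3.9+ standard library).

from graphlib import TopologicalSorter

def linearize(before):
    """Turn (earlier, later) precedence pairs into one linear order."""
    ts = TopologicalSorter()
    for earlier, later in before:
        ts.add(later, earlier)  # `later` may only appear after `earlier`
    return list(ts.static_order())

# Romanian prenominal demonstrative (ex. 1): 'acest' before 'om'.
print(linearize([("acest", "om")]))
```

Optional elements simply contribute no constraints, so no empty slots arise; contradictory constraints (both α before β and β before α) surface as a `graphlib.CycleError`, signalling an unsatisfiable ordering.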
The granularity of the input symbols is dictated by the considerations in section 4.1, how to build complex linearization structures