A morphological tagger for standard Albanian

Jochen Trommer
Institute of Cognitive Science
Katharinenstrasse 24
D-49074 Osnabrück

Dalina Kallulli
Telecommunications Research Center
Donau-City-Strasse 1
A-1220

Abstract

In this paper, we present a morphological tagger for standard Albanian intended as a component of an annotation tool in the context of the Albanian Corpus Initiative. The analyzer uses off-line components for generating sub-regular and irregular word forms, based on the verb inflector described in Trommer (1997), and simple morphological rules for the main inflectional patterns. The tagger also comprises a tokenizer, a complete tagset for Albanian, and full-form lexica for pronouns and irregular open-class elements.

Keywords: Morphological analysis, part-of-speech tagging, Albanian

1 Introduction

Due to the political situation, there has been little research on Albanian in contemporary linguistic frameworks and virtually no work in corpus linguistics. In this paper, we present a morphological tagger which is intended as a main component of a complete part-of-speech tagger, to become part of a large annotated text corpus for standard Albanian. From a theoretical point of view, tagging Albanian is especially challenging since it has extremely rich inflectional paradigms. Thus, a verb might have up to 100 different forms. A further complication is that lexemes of the same syntactic category follow different inflectional patterns: verbs fall into 53 different conjugational classes (Buchholz et al., 1992), while the assignment of plural affixes to noun stems does not follow from any known systematic principle.

We assume that a morphological tagger assigns to all word tokens in a text a set of morphological tags which encode the morphological features of specific word forms, such as part of speech, case, tense, etc.
In a full-fledged part-of-speech tagger, this is supposed to be complemented by a morphological disambiguator which chooses from each such set of tags a unique tag for each token given its context (figure 1).

Here is an overview of the rest of the paper: In section 2, we give a short survey of Albanian inflection. Section 3 introduces the tokenizer, and section 4 describes the tagset we use in our system. The morphological analyzer is explained in section 5 and the architecture of the lexicon in section 6. Section 7 contains some remarks on the implementation of the tagger, and in section 8, we present preliminary results on the accuracy of the analyzer. Finally, in section 9, we discuss further prospects of the system.

Figure 1: Architecture of PoS Tagger (shaded components are implemented in the system). Components shown: Tokenizer, Morphological Analyzer, Disambiguator.

2 Albanian Inflection

We discuss here only the inflection of open-class elements, which are implemented by rules in our system. Pronominal elements also show interesting inflectional patterns¹, but these are captured by listing in a full-form lexicon.

¹ See e.g. Trommer (2000) on the so-called preposed article and possessive pronouns.

2.1 Adjectives

Apart from a few irregular lexemes, adjectives fall into five different inflectional classes which use the affixes -e (feminine gender), -a (feminine plural), -ë (masculine plural) or zero marking in different, partially overlapping distributions. As shown in Trommer (2001), this complex allomorphy pattern can be derived by rules from the phonological shape and the morphological constituency of adjectival stems.

2.2 Nouns

Nouns are inflected for number (singular, plural), case (nominative, dative, accusative, ablative)² such as in shtëpi-a-ve-t, houses-PL-ABL-DEF, ‘from the houses’. While definiteness and case marking are quite regular, i.e. predictable on the basis of phonology, stem gender and number, the choice of the plural suffix (-ë, -Ø, -e, or -a) is largely unpredictable.
2.3 Verbs

Verbs are the most complex area of Albanian inflection. Apart from three different tenses (present, aorist, imperfect)³ and two different voices (active and non-active), there are five different moods (indicative, subjunctive, optative, imperative and admirative). Allomorphy in verbal inflection is partly phonologically governed. Thus verbs ending in vowels form the 1st person aorist with -va (e.g. puno-va, ‘I worked’) while stems ending in consonants take -a (e.g. hap-a, ‘I opened’). More complex is the division of verbs into different inflectional classes, which results partly in different allomorphs of affixes (e.g. for 1sg, -j in mëso-j, ‘I learn’, and -m in the-m, ‘I say’), partly in modification of the final vowels and/or consonants of the verb stems (e.g. vret, ‘he kills’ vs. vris-ni, ‘you (pl.) kill’). A detailed analysis of Albanian verb inflection can be found in Trommer (1997).

² Traditional Albanian grammars also assume a genitive case, which, however, coincides with the dative in all forms.

³ In addition to these synthetic tenses, there are two analytic tenses: the future (formed with the present subjunctive and the particle do) and the perfect (formed with the participle and finite forms of the auxiliaries kam, ‘have’, and jam, ‘be’).

3 The Tokenizer

The tokenizer is a small Python script which isolates word forms, punctuation marks, numbers, etc. Note that we treat some punctuation marks, such as “.” (dot) and “’” (apostrophe), as a single token in some circumstances and as part of a more complex token in others. Thus s’punon (‘(s)he doesn’t work’) results in three tokens, “s” (‘not’), “’” and “punon” (‘(s)he works’), while the clitic group t’i (‘to you them’) is analyzed as one token, since we store clitic groups, which show many idiosyncrasies, as full forms in the lexicon.

4 The Tagset

Since to our knowledge there is no published tagset for Albanian, we had to develop a complete tagset for the language.
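The apostrophe handling described in section 3 can be illustrated with a minimal sketch. This is not the authors' script: the set `CLITIC_GROUPS`, the pattern `TOKEN_RE` and the function `tokenize` are hypothetical names, and the one-entry clitic list merely stands in for the full-form lexicon of clitic groups.

```python
import re

# Hypothetical stand-in for the full-form lexicon of clitic groups;
# only the example from the text is listed here.
CLITIC_GROUPS = {"t'i"}

# A candidate token is a number (optionally with internal dots or commas),
# a word that may contain internal apostrophes, or any other single
# non-space character such as a punctuation mark.
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)*|[^\W\d_]+(?:'[^\W\d_]+)*|\S")


def tokenize(text):
    """Split text into word forms, numbers and punctuation marks."""
    text = text.replace("\u2019", "'")  # normalise typographic apostrophes
    tokens = []
    for chunk in TOKEN_RE.findall(text):
        if "'" in chunk and chunk.lower() not in CLITIC_GROUPS:
            # Not a listed clitic group: the apostrophe becomes its own token.
            tokens.extend(part for part in re.split(r"(')", chunk) if part)
        else:
            # Plain word, number, punctuation mark, or listed clitic group.
            tokens.append(chunk)
    return tokens
```

Under these assumptions, `tokenize("Ai s'punon.")` returns `['Ai', 's', "'", 'punon', '.']`, while `tokenize("t'i")` keeps the clitic group as the single token `["t'i"]`.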
⁴ See ˜atag/ for a complete list of the tagset.

As in the EAGLES guidelines standard (Leech and Wilson, 1999), tags consist of sets of attribute-value pairs. However, attributes and values are designed to fit optimally the description of Albanian and to allow a perspicuous abbreviatory notation (see below). (1a) shows a representative tag for a feminine definite (i.e., bearing an article suffix) singular common noun. To enhance legibility, we use for most practical purposes the abbreviatory notation exemplified in (1b), where all binary-valued attribute-value pairs are written by prefixing “+” or “-” to the attribute (e.g. “+def” instead of “def:+”) and attributes are omitted for all other pairs (e.g. “n” instead of “cat:n”). This is possible since each (non-binary) value in our tag set corresponds to a single attribute.

(1) Short Notation for Tags
    a. [cat:n case:nom num:sg def:+ gen:fem]
    b. [n nom sg +def fem]

In addition to standard part-of-speech categories, we use “pa” for preposed articles, grammatical morphemes unique to Albanian occurring with most adjectives and possessor phrases, and “ptl” for a specific class of verb-adjacent particles (e.g. future do).

The implementation uses intermediate representations to collapse different tags for syncretic forms of the same lexeme. Thus, the indefinite nominative singular of all nouns is identical to the corresponding accusative form. Instead of writing the two tags (2a,b), we use the tag (2c):

(2) Collapsed Tags
    a. [n nom sg -def fem]
    b. [n acc sg -def fem]
    c. [n {nom,acc} sg -def fem]

5 The Analyzer

The morphological analyzer consists of three components: an operative lexicon stored in a database, a set of morphological rules and a rule interpreter (figure 2).

Figure 2: Structure of the morphological analyzer. Components shown: Input Tokens, Interpreter, Morphological Rules, Operative Lexicon, Output Tags.
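The tag notations of section 4 can be made concrete with a small sketch that expands the short notation of (1b) into attribute-value pairs and collapses two syncretic tags as in (2). The table `ATTRIBUTE_OF` and the functions `expand` and `collapse` are hypothetical; in particular, values such as “dat”, “abl”, “pl” and “masc” are assumed abbreviations that do not appear in the examples above, and the real tagset is larger.

```python
# Assumed, incomplete mapping from each non-binary value to its attribute.
# It relies on the property stated in the text that every non-binary value
# corresponds to a single attribute.
ATTRIBUTE_OF = {
    "n": "cat", "v": "cat",
    "nom": "case", "acc": "case", "dat": "case", "abl": "case",
    "sg": "num", "pl": "num",
    "fem": "gen", "masc": "gen",
}


def expand(short_tag):
    """Turn '[n nom sg +def fem]' into a dict of attribute-value pairs."""
    pairs = {}
    for item in short_tag.strip("[] ").split():
        if item[0] in "+-":
            pairs[item[1:]] = item[0]        # binary attribute, e.g. +def
        else:
            pairs[ATTRIBUTE_OF[item]] = item  # attribute recovered from value
    return pairs


def collapse(tag_a, tag_b):
    """Merge two tags differing in exactly one attribute; otherwise None."""
    a, b = expand(tag_a), expand(tag_b)
    if a.keys() != b.keys():
        return None
    differing = [key for key in a if a[key] != b[key]]
    if len(differing) != 1:
        return None
    merged = dict(a)
    merged[differing[0]] = {a[differing[0]], b[differing[0]]}
    return merged
```

Applied to (2a) and (2b), `collapse` yields a single tag whose `case` value is the set `{"nom", "acc"}`, mirroring (2c).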
The operative lexicon itself is partially precompiled by rules, but this happens off-line (see section 6 for discussion). Here, we will focus on the format of morphological rules and their application.

5.1 Morphological Rules

Following a long tradition in descriptive grammar and generative rule-based approaches to morphology (e.g. Anderson, 1992), the morphological rules we use denote relations between input (lexicon) and output (derived) forms, where forms are ordered pairs of strings (e.g. “punoj”) and tags (e.g. “[v]”). (3) shows as an example the lexeme punoj, ‘work’, and its 2nd/3rd person singular form punon:

(3) Input-Output Pair
    Input:  <punoj, [v]>
    Output: <punon, [v {2 3} sg ind pres]>

Rules are quintuples of the form <left context, remove, add, lexicon category, tag>, where left context and remove are regular expressions and all other components are strings. lexicon category specifies the category tag of the entry in the operative lexicon and tag the resulting tag. add is the suffix which is added to the stem, after removing an expression corresponding to (stem-final) remove, to get the word form. The rule can only be applied if the suffix of the input stem corresponding to remove is preceded by a string matched by left context.

Figure 3 contains a slightly simplified example of a morphological rule. This rule deletes a final j (remove) from an item which has the lexicon category “[v]” if j is preceded by a vowel (left context), and adds n instead, yielding a form with the tag “[v {2 3} sg ind pres]”. Figure 4 shows how the rule applies to the example pair from (3).

The morphological rules we use do not differentiate between phonology and morphology. Thus the fact that the 1st person singular aorist suffix for verb stems ending in a consonant is -a (e.g.
hap-a, ‘I opened’), while it is -va after vowels (e.g. pi-va, ‘I drank’), is not captured by a separate phonological rule, but simply by two different morphological rules:

(4) Morphological Rules for 1sg aorist
    a. [:vok:] 0 va [v] [v 1 sg aor]
    b. [:kons:] 0 a [v] [v 1 sg aor]

Figure 3: Example for a morphological rule

    [:vok:]         j         n      [v]                 [v {2 3} sg ind pres]
    (left context)  (remove)  (add)  (lexicon category)  (tag)

Figure 4: The rule from figure 3 applied to “punoj [v]”

    pun  o               j         [v]
         (left context)  (remove)  (lexicon category)
                         (add)     (tag)
    pun  o               n         [v {2 3} sg ind pres]

In approaches to morphological analysis such as two-level morphology (see Karttunen and Beesley, 2001, and references cited there), which separate phonology and morphology, an alternative to assuming two different affixal items for 1sg aorist would be to assume just one (say -va) and derive the other form by a phonological rule (here: delete v after a consonant). We think that these approaches are well-motivated in languages with rich sandhi phenomena such as Finnish, but lead to unnecessary complexity in a language like Albanian, which shows (at least at the orthographic level) few such processes.

5.2 The Rule Interpreter

Recall that morphological rules, although we have discussed them as devices to derive word forms, are declarative statements on relations between lexicon entries and word forms. In fact, our rule interpreter uses these rules to infer possible lexical entries for a given word form. It transforms the left context and add parts of each rule into one regular expression. For each word form which matches this expression for a rule R with suffixes S, it combines the remaining prefixes P of the word form with the remove parts compatible by R with S to get a set of potential lexicon forms, which are then checked against the lexicon database.
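The two directions in which a rule can be read, generating a word form from a lexicon entry and inferring lexicon entries from a word form, can be sketched as follows. This is an illustration under several assumptions, not the authors' implementation: the names `RULES`, `LEXICON`, `CLASSES`, `generate` and `analyze` are hypothetical, `[:vok:]` is taken to cover the seven Albanian vowel letters, the `0` in (4) is read as an empty remove part, remove is simplified to a literal string rather than a regular expression, and a three-entry set stands in for the operative lexicon database.

```python
import re

VOWELS = "aeëiouy"  # assumed content of the [:vok:] class
CLASSES = {"[:vok:]": f"[{VOWELS}]", "[:kons:]": f"[^{VOWELS}]"}

# Rules as quintuples: (left context, remove, add, lexicon category, tag).
RULES = [
    ("[:vok:]", "j", "n", "[v]", "[v {2 3} sg ind pres]"),  # figure 3
    ("[:vok:]", "", "va", "[v]", "[v 1 sg aor]"),           # (4a)
    ("[:kons:]", "", "a", "[v]", "[v 1 sg aor]"),           # (4b)
]

# Toy stand-in for the operative lexicon: (lexicon form, category) pairs.
LEXICON = {("punoj", "[v]"), ("hap", "[v]"), ("pi", "[v]")}


def generate(stem, category, rule):
    """Apply one rule to a lexicon entry; return (word form, tag) or None."""
    left, remove, add, rule_category, tag = rule
    if category != rule_category:
        return None
    # The stem must end in `remove`, preceded by a match of the left context.
    match = re.fullmatch(f"(.*{CLASSES[left]}){re.escape(remove)}", stem)
    if match is None:
        return None
    return match.group(1) + add, tag


def analyze(word_form):
    """Infer (lexicon form, tag) pairs for a word form by inverting rules."""
    analyses = set()
    for left, remove, add, category, tag in RULES:
        # Left context and add are combined into one regular expression.
        match = re.fullmatch(f"(.*{CLASSES[left]}){re.escape(add)}", word_form)
        if match is None:
            continue
        candidate = match.group(1) + remove  # remaining prefix plus remove
        if (candidate, category) in LEXICON:  # check against the lexicon
            analyses.add((candidate, tag))
    return analyses
```

With this toy data, `generate("punoj", "[v]", RULES[0])` gives `("punon", "[v {2 3} sg ind pres]")`, and `analyze("piva")` returns only `{("pi", "[v 1 sg aor]")}`: rule (4b) also matches the string, but its candidate “piv” is rejected by the lexicon check.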