Standardizing Tweets with Character-level Machine Translation

Nikola Ljubešić 1, Tomaž Erjavec 2, and Darja Fišer 3

1 University of Zagreb, Faculty of Humanities and Social Sciences, Zagreb, Croatia
2 Jožef Stefan Institute, Department of Knowledge Technologies, Ljubljana, Slovenia
3 University of Ljubljana, Faculty of Arts, Ljubljana, Slovenia

Abstract. This paper presents the results of the standardization procedure for Slovene tweets, which are full of colloquial, dialectal and foreign-language elements. With the aim of minimizing the human input required, we produced a manually normalized lexicon of the most salient out-of-vocabulary (OOV) tokens and used it to train a character-level statistical machine translation (CSMT) system. Best results were obtained by combining the manually constructed lexicon with CSMT as fallback, yielding an overall improvement of 9.9% on all tokens and 31.3% on OOV tokens. Manual preparation of the data in the form of a lexicon has proven to be more efficient than normalizing running text for the task at hand. Finally, we performed an extrinsic evaluation in which we automatically lemmatized the test corpus taking as input either original or automatically standardized wordforms, and achieved 75.1% per-token accuracy with the former and 83.6% with the latter, thus demonstrating that standardization has significant benefits for upstream processing.

Keywords: twitterese, standardization, character-level machine translation

1 Introduction

This paper deals with the problem of processing non-standard language for smaller languages that cannot afford to develop new text processing tools for each language variety. Instead, language varieties need to be standardized so that the existing tools can be utilized with as little negative impact from the noisy data as possible. Slovene, the processing of which is already difficult due to its highly inflecting nature, is even harder to process when orthographic, grammatical and punctuation norms are not followed.
This is often the case in non-standard and less formal language use, such as the language of tweets, which is becoming a predominant medium for the dissemination of information, opinions and trends, and as such an increasingly important knowledge source for data mining and text processing tasks. Another important characteristic of twitterese is that it is rich in colloquial, dialectal and foreign-language elements, causing standard text processing tools to underperform.

This is why we propose an approach to standardizing Slovene tweets with the aim of increasing the performance of the existing text processing tools by training a character-level statistical machine translation (CSMT) system. CSMT has recently become a popular method for translating between closely related languages, modernizing historical lexicons, producing cognate candidates, etc. The specificity of CSMT is that the translation and language models are built not from sequences of words, but of characters. In all experiments we use the well-known Moses system with default settings unless specified otherwise. In order to minimize the human input required, we explore the following strategy: we produce a manually validated lexicon of the 1,000 most salient out-of-vocabulary (OOV) tokens with respect to a reference corpus, where the lexicon contains pairs (original wordform, standardized wordform). We also annotate a small corpus of tweets with standardized wordforms and use the lexicon resource for training the CSMT system and the corpus for evaluating different settings. We compare the efficiency of normalizing a lexicon of the most salient OOV tokens to the standard approach of normalizing running text. Finally, we also manually lemmatize our test corpus in order to evaluate how much the standardization helps with the task of lemmatization.
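To make the character-level setup concrete, the following sketch shows one common way of preparing a normalization lexicon as parallel data for a character-level Moses run: each (original, standardized) pair becomes a "sentence" pair whose tokens are single characters. The underscore convention for word-internal spaces and the file layout are our assumptions, not details taken from the paper.

```python
# Sketch: turning a normalization lexicon into character-level
# parallel data for Moses. Each (original, standardized) pair
# becomes one "sentence" pair whose tokens are single characters;
# spaces inside multi-word standardizations are marked with "_"
# (the marking convention is our assumption).

def to_char_tokens(wordform: str) -> str:
    """Represent a wordform as a space-separated character sequence."""
    return " ".join("_" if ch == " " else ch for ch in wordform)

def write_parallel_data(lexicon, src_path, tgt_path):
    """lexicon: iterable of (original, standardized) pairs."""
    with open(src_path, "w", encoding="utf-8") as src, \
         open(tgt_path, "w", encoding="utf-8") as tgt:
        for original, standardized in lexicon:
            src.write(to_char_tokens(original) + "\n")
            tgt.write(to_char_tokens(standardized) + "\n")
```

A pair such as ("jutr", "jutro") would thus be written as "j u t r" on the source side and "j u t r o" on the target side, so that Moses learns character-to-character correspondences instead of word-to-word ones.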
The datasets used in this work are made available together with the paper.

The rest of this paper is structured as follows: Section 2 discusses related work, Section 3 introduces the dataset we used for the experiments, Section 4 gives the experiments and results, while Section 5 concludes and gives some directions for future work.

2 Related work

Text standardization is rapidly gaining in popularity because of the explosion of user-generated text content in which language norms are not followed. SMS messages used to be the main object of text standardization [2,3], while recently Twitter has started taking over as the most prominent source of information encoded in non-standard language [7,6].

There are two main approaches to text standardization. The unsupervised approach mostly relies on phonetic transcription of non-standard words to produce standard candidates and on language modeling over in-vocabulary (IV) data for selecting the most probable candidate [6]. The supervised approach assumes manually standardized data from which standardization models are built.

Apart from using standard machine learning approaches to supervised standardization, such as HMMs over words [3] or CRFs for identifying deletions [8], many state-of-the-art supervised approaches rely on statistical machine translation, which frames the standardization task as a translation problem. There has been a series of papers using phrase-based SMT for text standardization [2,7] and, to the best of our knowledge, just two attempts at using character-level SMT (CSMT) for the task [9,4]. Our work also uses CSMT but with a few important distinctions, the main one being the data annotation procedure. While [9,4] annotate running tweets, we investigate the possibility of extracting a lexicon of out-of-vocabulary (OOV) but highly salient words with respect to a reference corpus.
Furthermore, we apply IV filters to the n-best CSMT hypotheses, which proved to be very efficient in the CSMT approach to modernizing historical texts [11]. Finally, we combine the deterministic lexicon approach with the CSMT approach as fallback for tokens not covered by the lexicon.

3 Dataset

The basis for our dataset was a database of tweets from a now no longer active aggregator containing (mostly) Slovene tweets posted between 2007-01-12 and 2011-02-20. The database contains many tweets in other languages as well, so we first used a simple filter that keeps only those that contain one of the Slovene letters č, š or ž. This does not mean that no foreign-language text remains, as some closely related languages, in particular Croatian, also use these letters. It is also fairly common to mix Slovene and another language, mostly English, in a single tweet. However, standard methods for language identification do not work well with the type of language found in tweets, and are also bad at distinguishing closely related languages, especially if a single text uses more than one language. In this step we also shuffled the tweets in the collection so that taking any slice gives a random selection of tweets, making it easier to construct training and testing datasets.

In the second step we anonymized the tweets by substituting hashtags, mentions and URLs with special symbols (XXX-HST, XXX-MNT, XXX-URL) and substituted emoticons with XXX-EMO. This filter is meant to serve two purposes. On the one hand, we make the experimental dataset freely available, and by using rather old and anonymized tweets we hope to evade problems with the Twitter terms of use. On the other, tweets are difficult to tokenize correctly, and by substituting symbols for the most problematic tokens, i.e. emoticons, we made the collection easier to process.

We then tokenized the collection and stored it in the so-called vertical format, where each line is either an XML tag (in particular, <text> for an individual tweet) or one token. With this we obtained a corpus of about half a million tweets and eight million word tokens, which is the basis for our datasets.

3.1 Support lexicons

As will be discussed in the following sections, we also used several support lexicons to arrive at the final datasets for our experiments. In the first instance, this is Sloleks [1], a CC-BY-NC licensed large lexicon of Slovene containing the complete inflectional paradigms of 100,000 Slovene lemmas together with their morphosyntactic descriptions and frequency of occurrence in the Gigafida reference corpus of Slovene. We used only the wordforms and their frequencies from this lexicon, not making use of the other data it contains. In other words, to apply the method presented here to another language, only a corpus of standard language is needed, from which a frequency lexicon equivalent to the one used here can then be extracted.

As mentioned, Slovene tweets often mix other languages with Slovene and, furthermore, the language identification procedure we used is not exact. As processing non-Slovene words was not the focus of this experiment, it was therefore useful to be able to identify foreign words. To this end, we made a lexicon of words in the most common languages appearing in our collection, in particular English and Croatian. For English we used the SIL English wordlist, and for Croatian the lexicon available with the Apertium MT system. A single lexicon containing all three languages was produced, where each wordform is marked with one or more languages. It is then simple to match tweet wordforms against this lexicon and assign each such word a flag giving the language(s) it belongs to, or marking it as OOV.
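The preprocessing steps described above can be sketched roughly as follows. The placeholder symbols (XXX-HST, XXX-MNT, XXX-URL, XXX-EMO) come from the paper; the concrete regular expressions, the emoticon pattern and the lexicon representation are our simplifying assumptions.

```python
import re

# Sketch of the dataset preprocessing: keep tweets containing a
# Slovene-specific letter, anonymize problematic tokens, and flag
# tokens against a multilingual support lexicon. Regexes and the
# emoticon pattern are illustrative assumptions, not the paper's code.

SLOVENE_LETTERS = set("čšžČŠŽ")

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
EMOTICON_RE = re.compile(r"[:;=][-o]?[)(DPp/\\]")  # toy emoticon pattern

def looks_slovene(tweet: str) -> bool:
    """Keep only tweets containing at least one of č, š, ž."""
    return any(ch in SLOVENE_LETTERS for ch in tweet)

def anonymize(tweet: str) -> str:
    """Replace URLs, mentions, hashtags and emoticons with placeholders."""
    tweet = URL_RE.sub("XXX-URL", tweet)
    tweet = MENTION_RE.sub("XXX-MNT", tweet)
    tweet = HASHTAG_RE.sub("XXX-HST", tweet)
    tweet = EMOTICON_RE.sub("XXX-EMO", tweet)
    return tweet

def flag_token(token, lexicon):
    """lexicon: dict mapping wordform -> set of language codes."""
    return lexicon.get(token.lower(), {"OOV"})
```

Note that the URL substitution runs first, so that @ and # characters inside URLs are not misread as mentions or hashtags.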
3.2 Lexicon of Twitterese

The most straightforward way to obtain standardizations of Twitter-specific wordforms is via a lexicon giving each wordform and its manually specified standardized form. If we choose the most Twitter-specific wordforms, this will cover many tokens in tweets and also take care of some of the more unpredictable forms.

To construct such a lexicon, we first extracted the frequency lexicon from the tweet corpus vertical file. We then used Sloleks to determine the 1,000 most tweet-specific words using the method of frequency profiling [10], which, for each word, compares its frequency in the specialized corpus to that in the reference corpus using log-likelihood. These words were then manually standardized, a process that took about three hours, i.e. on average about 10 s per entry, making it an efficient way of constructing a useful resource for standardization.

This lexicon makes no attempt to model ambiguity, as a tweet wordform can sometimes have more than one standardization. We simply took the most obvious standardization candidate, typically without inspecting the corpus, which would have taken much more time. Sometimes one word is standardized to several standard words, i.e., a word is mapped to a phrase, so the relation between tokens in tweets and standardized ones is not necessarily one-to-one. Along with manual standardization, words were also flagged as being proper nouns (names), foreign words or errors in tokenization. The first are important as they can be OOV words as regards Sloleks, even though they are in fact standard words.
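The frequency-profiling score used to rank tweet-specific words can be sketched as below, following the log-likelihood keyness statistic of [10]: each word's observed frequencies in the two corpora are compared against the frequencies expected if the word were equally distributed. The function and variable names, and the ranking helper, are our own illustration.

```python
import math

# Sketch of frequency profiling [10]: a log-likelihood comparison of a
# word's frequency in the specialized (tweet) corpus against the
# reference corpus. Higher scores mean more corpus-specific words.

def log_likelihood(freq_spec, size_spec, freq_ref, size_ref):
    """Log-likelihood keyness score for one word.

    freq_spec/freq_ref: the word's frequency in each corpus;
    size_spec/size_ref: the total token counts of the two corpora.
    """
    total = size_spec + size_ref
    expected_spec = size_spec * (freq_spec + freq_ref) / total
    expected_ref = size_ref * (freq_spec + freq_ref) / total
    ll = 0.0
    if freq_spec > 0:
        ll += freq_spec * math.log(freq_spec / expected_spec)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

def most_specific(tweet_freqs, ref_freqs, n=1000):
    """Return the n wordforms most specific to the tweet corpus."""
    size_spec = sum(tweet_freqs.values())
    size_ref = sum(ref_freqs.values())
    scored = ((log_likelihood(f, size_spec, ref_freqs.get(w, 0), size_ref), w)
              for w, f in tweet_freqs.items())
    return [w for _, w in sorted(scored, reverse=True)[:n]]
```

A word that is frequent in tweets but absent from the reference corpus (e.g. a colloquial spelling) receives a high score, while a word with proportionally equal frequencies in both corpora scores zero.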