A new life for a dead parrot: Incentive structures in the Phrase Detectives game

Jon Chamberlain, Massimo Poesio, Udo Kruschwitz
School of Computer Science and Electronic Engineering, University of Essex

"He's passed on! This parrot is no more! He has ceased to be! He's expired and gone to meet his maker! He's kicked the bucket, he's shuffled off his mortal coil, run down the curtain and joined the bleedin' choir invisible! THIS IS AN EX-PARROT!"

ABSTRACT

In order for there to be significant improvements in certain areas of natural language processing (such as anaphora resolution), large linguistically annotated resources need to be created which can be used to train, for example, machine learning systems. Annotated corpora of the size needed for modern computational linguistics research cannot, however, be created by small groups of hand-annotators. Simple Web-based games have demonstrated how it might be possible to do this through Web collaboration. This paper reports on the ongoing work of Phrase Detectives, a game developed in the ANAWIKI project and designed for collaborative linguistic annotation on the Web. In this paper we focus on how we recruit and motivate players, incentivise high-quality annotations and assess the quality of the data.

Categories and Subject Descriptors

H.1.2 [User/Machine Systems]: Human factors; Human information processing; I.2.7 [Artificial Intelligence]: Natural Language Processing

Keywords

Web-based games, incentive structures, user motivation, distributed knowledge acquisition, anaphoric annotation

1. INTRODUCTION

The statistical revolution in natural language processing (NLP) has resulted in the first NLP systems and components which are usable on a large scale, from part-of-speech (POS) taggers to parsers [7].
However, it has also raised the problem of creating the large amounts of annotated linguistic data needed for training and evaluating such systems. Potential solutions to this problem include semi-automatic annotation and machine learning methods that make better use of the available data. Unsupervised or semi-supervised techniques hold great promise but, for the foreseeable future at least, the greatest performance improvements are still likely to come from increasing the amount of data to be used by supervised training methods. These crucially rely on hand-annotated data. Traditionally, this requires trained annotators, which is prohibitively expensive, both financially and in terms of person-hours (given the number of trained annotators available), on the scale required.

Recently, however, Web collaboration has emerged as a viable alternative. Wikipedia and similar initiatives have shown that a surprising number of individuals are willing to help with resource creation and scientific experiments. The Open Mind Common Sense project [16] demonstrated that such individuals are also willing to participate in the creation of databases for Artificial Intelligence (AI), and von Ahn showed that simple Web games are an effective way of motivating participants to annotate data for machine learning purposes [23].

The goal of the ANAWIKI project is to experiment with Web collaboration as a solution to the problem of creating large-scale linguistically annotated corpora, both by developing Web-based annotation tools through which members of the scientific community can participate in corpus creation and through the use of game-like interfaces. We will present ongoing work on Phrase Detectives, a game designed to collect judgments about anaphoric annotations. We will also report results which include a substantial corpus of annotations already collected.

Copyright is held by the author/owner(s). WWW2009, April 20-24, 2009, Madrid, Spain.
2. RELATED WORK

Related work comes from a range of relatively distinct research communities including, among others, Computational Linguistics/NLP, the games community and researchers working in the areas of the Semantic Web and knowledge representation.

Large-scale annotation of low-level linguistic information (part-of-speech tags) began with the Brown Corpus, in which very low-tech and time-consuming methods were used. For the creation of the British National Corpus (BNC), the first 100M-word linguistically annotated corpus, a faster methodology was developed consisting of preliminary annotation with automatic methods followed by partial hand-correction [1]. This was made possible by the availability of relatively high-quality automatic part-of-speech taggers (CLAWS). With the development of the first high-quality chunkers, this methodology became applicable to the case of syntactic annotation. It was used for the creation of the Penn Treebank [10], although more substantial hand-checking was required.

Medium- and large-scale semantic annotation projects (for wordsense or coreference) are a recent innovation in Computational Linguistics. The semi-automatic annotation methodology cannot yet be used for this type of annotation, as the quality of, for instance, coreference resolvers is not yet high enough on general text. Nevertheless, the semantic annotation methodology has made great progress with the development, on the one hand, of effective quality control methods [4] and, on the other, of sophisticated annotation tools such as Serengeti [20].

These developments have made it possible to move from the small-scale semantic annotation projects, the aim of which was to create resources of around 100K words in size [14], to the efforts made as part of US initiatives such as Automatic Context Extraction (ACE), Translingual Information Detection, Extraction and Summarization (TIDES), and GALE to create 1 million word corpora.
Such techniques could not be expected to annotate data on the scale of the BNC.

Collaborative resource creation on the Web offers a different solution to this problem. The motivation for this is the observation that a group of individuals can contribute to a collective solution which has a better performance and is more robust than an individual's solution, as demonstrated in simulations of collective behaviours in self-organizing systems [6].

Wikipedia is perhaps the best example of collaborative resource creation, but it is not an isolated case. The gaming approach to data collection, termed games with a purpose, has received increased attention since the success of the ESP game [22]. Subsequent games have attempted to collect data for multimedia tagging (OntoTube, Tag a Tune) and language tagging (Verbosity, OntoGame, Categorilla, Free Association). As Wikipedia has demonstrated, however, there is not necessarily the need to turn every data collection task into a game. Other current efforts in attempting to acquire large-scale world knowledge from Web users include Freebase and True Knowledge.

The games with a purpose concept has now also been adopted by the Semantic Web community in an attempt to collect large-scale ontological knowledge, because currently "the Semantic Web lacks sufficient user involvement almost everywhere" [17].

It is a huge challenge to recruit enough users to make data collection worthwhile and, as we will explore later, it is also important to attract the right kind of player. Previous games have attracted exceptional levels of participation, such as the ESP game (13,500 players in 4 months) [22], Peekaboom (14,000 players in 1 month) [24] and OpenMind (15,000 users) [16], which encourages one to believe mass participation might be possible for similar projects.

Figure 1: A screenshot of the Annotation Mode.
3. THE PHRASE DETECTIVES GAME

Phrase Detectives is a game offering a simple interface for non-expert users to learn how to annotate text and to make annotation decisions [2]. The goal of the game is to identify relationships between words and phrases in a short text. An example of a task would be to highlight an anaphor-antecedent relation between the markables (sections of text) 'This parrot' and 'He' in 'This parrot is no more! He has ceased to be!' Markables are identified in the text by automatic pre-processing. There are two ways to annotate within the game: by selecting a markable that corefers to another one (Annotation Mode, called Name the Culprit in the game); or by validating a decision previously submitted by another player (Validation Mode, called Detectives Conference in the game).

Annotation Mode (see Figure 1) is the simplest way of collecting judgments. The player has to locate the closest antecedent markable of an anaphor markable, i.e. an earlier mention of the object. By moving the cursor over the text, markables are revealed in a bordered box. To select one, the player clicks on the bordered box and the markable becomes highlighted. They can repeat this process if there is more than one antecedent markable (e.g. for plural anaphors such as 'they'). They submit the annotation by clicking the Done! button. The player can also indicate that the highlighted markable has not been mentioned before (i.e. it is not anaphoric), that it is non-referring (for example, 'it' in 'Yeah, well it's not easy to pad these Python files out to 150 lines, you know.') or that it is the property of another markable (for example, 'a lumberjack' being a property of 'I' in 'I wanted to be a lumberjack!').
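As an illustration, the decision options just listed could be modelled as a small data structure. This is not the game's actual schema; all class and field names below are hypothetical.

```python
# Hypothetical sketch of an Annotation Mode decision record
# (illustrative names only, not the Phrase Detectives schema).
from dataclasses import dataclass, field
from enum import Enum, auto

class DecisionType(Enum):
    COREFERENT = auto()      # anaphor refers back to earlier markable(s)
    NOT_MENTIONED = auto()   # discourse-new: no earlier mention
    NON_REFERRING = auto()   # e.g. pleonastic 'it'
    PROPERTY = auto()        # predicative, e.g. 'a lumberjack' in 'I wanted to be a lumberjack!'

@dataclass
class Annotation:
    anaphor_id: int
    decision: DecisionType
    # A list, to allow plural anaphors such as 'they' to take several antecedents.
    antecedent_ids: list = field(default_factory=list)

# Example: 'He' (markable 2) corefers with 'This parrot' (markable 1).
ann = Annotation(anaphor_id=2, decision=DecisionType.COREFERENT, antecedent_ids=[1])
```

A comment or skip action would sit alongside rather than inside this record, since neither produces an annotation.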
Players can also make a comment about the markable (for example, if there is an error in the automatic text processing) or skip the markable and move on to the next one.

In Validation Mode (see Figure 2) the player is presented with an annotation from a previous player. The anaphor markable is shown with the antecedent markable(s) that the previous player chose. The player has to decide if he agrees with this annotation. If not, he is shown the Annotation Mode to enter a new annotation. The Validation Mode not only sorts ambiguous, incorrect and/or malicious decisions but also provides a social training mechanism [9].

Figure 2: A screenshot of the Validation Mode.

When users register they begin with the training phase of the game. Their answers are compared with Gold Standard texts to give them feedback on their decisions and to get a user rating, which is used to determine whether they need more training. Contextual instructions are also available during the game.

The corpus used in the game is created from short texts including: Wikipedia articles selected from the 'Featured Articles' and the page of 'Unusual Articles'; stories from Project Gutenberg including Aesop's Fables, Sherlock Holmes and Grimm's Fairy Tales; and dialogue texts from Textfile.com including Monty Python's Dead Parrot sketch. Selections from the GNOME and ARRAU corpora are also included to analyse the quality of the annotations.

4. THE SCORING SYSTEM

One of the most significant problems when designing a game that collects data is how to reward a player's decision when the correct answer is not known (and in some cases there may not be just one correct answer). Our solution is to motivate players using comparative scoring (awarding points for agreeing with the Gold Standard) and collaborative scoring (increasing the reward the more the players agree with each other). In the game, groups of players work on the same task over a period of time, as this is likely to lead to a collectively intelligent decision [21].
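The collaborative part of the scheme can be sketched as a simple point calculation. This is a deliberately simplified model, assuming one 'decision' point per annotation and one 'agreement' point per pairwise agreement; the function and variable names are ours, not the game's.

```python
# Simplified sketch of group-based collaborative scoring
# (illustrative only; simplifies the scheme described in this section).
from collections import defaultdict

def score_markable(annotations, validations=None):
    """annotations: {player: answer} from the annotating group;
    validations: {player: answer} from the validating group, if needed."""
    points = defaultdict(int)
    for player in annotations:
        points[player] += 1                   # one 'decision' point per annotation
    if len(set(annotations.values())) == 1:   # unanimous: markable is complete
        for player in annotations:
            points[player] += 1               # one 'agreement' point each
        return dict(points)
    # Otherwise each submitted answer is judged by validators: a validator
    # earns an 'agreement' point per first-group player they side with,
    # and that first-group player earns one too.
    for validator, answer in (validations or {}).items():
        for annotator, ann in annotations.items():
            if ann == answer:
                points[validator] += 1
                points[annotator] += 1
    return dict(points)

# Unanimous case: each annotator gets 1 decision + 1 agreement point.
print(score_markable({'A': 'parrot', 'B': 'parrot'}))  # {'A': 2, 'B': 2}
```

Note how the incentive falls out of the arithmetic: an annotator's score keeps growing after the fact whenever later validators side with their answer.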
An initial group of players is asked to annotate a markable. For each decision the player receives a 'decision' point. If all the players agree with each other then they are all awarded an additional 'agreement' point and the markable is considered complete.

However, it is likely that the first group of players will not agree with each other (62% of markables are given more than one relationship). In this case each unique relationship for the markable is validated by another group of players. The validating players receive an 'agreement' point for every player from the first group they agree with (either by agreeing or disagreeing). The players they agree with also receive an 'agreement' point.

This scoring system motivates the initial annotating group of players to choose the best relationship for the markable, because it will lead to more points being added to their score later. The validating players are motivated to agree with these relationships as they will score more agreement points.

Contrary to expectations [3], it took players almost twice as long to validate a relationship as to annotate a markable (14 seconds compared to 8 seconds).

5. INCENTIVE STRUCTURES

The game is designed to use three types of incentive structure: personal, social and financial. All incentives were applied with caution, as rewards have been known to decrease annotation quality [12]. The primary goal is to motivate the players to provide high-quality answers rather than large quantities of answers. The incentives used include:

• Document topic
• Task speed
• User contributed documents
• Leaderboards
• Collaborative scoring
• Weekly and monthly prizes

5.1 Personal incentives

Personal incentives are evident when simply participating is enough of a reward for the user. For example, a Web user submitting information to Wikipedia does not usually receive any reward for what they have done but is content to be involved in the project.
Similarly, the progress of a player through a computer game will usually only be of interest to themselves, with the reward being the enjoyment of the game.

Generally, the most important personal incentive is that the user feels they are contributing to a worthwhile project. News and links to the research were posted on the homepage to reinforce the credibility of the project.

Also important for the players of Phrase Detectives is that they read texts that they find interesting. The choice of documents is important in getting users to participate in the game, to understand the tasks and to keep playing. Players can specify a preference for particular topics; however, only 4% do so. This could be an indication that the corpus as a whole was interesting, but it is more likely that players simply didn't change their default options [11].

It is also important for the players to read the documents at a relatively normal speed whilst still being able to complete the tasks. By default the tasks are generated randomly (although displayed in order) and limited (50 markable tasks selected from each document), which allows a normal reading flow. Players are given bonus points if they change their profile settings to select every markable in each document (which makes reading slower). Only 5% of players chose to sacrifice readability for the extra points.

In early versions of the game the player could see how long they had taken to do an annotation. Although this had no influence on the scoring, players complained that they felt under pressure and that they didn't have enough time to check their answers. This is in contrast to previous suggestions that timed tasks motivate players [23]. The timing of the annotations is now hidden from the players but still recorded with the annotations.

Figure 3: A screenshot of the player's homepage.
The relationship between the time of the annotation, the user rating and the agreement will be crucial in understanding how a timed element in a reading game influences the data that is collected.

The throughput of Phrase Detectives is 450 annotations per human hour (compared to the ESP game at 233 labels per human hour [23]). There is, however, a difference in data input between the two games, the former only requiring clicks on pre-selected phrases and the latter requiring the user to type in a phrase. The design of a game task must consider the speed at which the player can process the input source (e.g. text, images) and deliver their response (e.g. a click, typing) in order to maximise throughput and hence the amount of data that is collected.

We allowed users to submit their own text to the corpus, which would be processed and entered into the game. We anticipated that, much like Wikipedia, this would motivate users to generate content and become much more involved in the game. Unfortunately this was not the case, with only one user submitting text. We have now stopped advertising this incentive; however, the concept may still hold promise for games where the user-submitted content is more naturally created (e.g. collaborative story writing).

5.2 Social incentives

Social incentives reward users by improving their standing amongst their peers (in this case their fellow players). Phrase Detectives features the usual incentives of a computer game, including weekly, monthly and all-time leaderboards, cups for monthly top scores and named levels for reaching a certain number of points (see Figure 3). Interesting phenomena have been reported with these reward mechanisms, namely that players gravitate towards the cut-off points (i.e. they keep playing to reach a level or high score before stopping) [24].
The collaborative agreement scoring in Phrase Detectives prevents us from effectively analysing this (as players continue to score even when they have stopped playing); however, our high-scoring players can regularly be seen outscoring each other on the leaderboards.

In addition to the leaderboards that are visible to all players, each player can also see a leaderboard of other players who agreed with them. Although there is no direct incentive (as you cannot influence your own agreement leaderboard), it reinforces the social aspect of how the scoring system works. The success of games integrated into social networking sites, like Sentiment Quiz on Facebook, indicates that visible social interaction within a game environment motivates the players to contribute more.

5.3 Financial incentives

Financial incentives reward effort with money. We introduced a weekly prize where a player is chosen by randomly selecting an annotation made during that week. This prize motivates low-scoring players because any annotation made during the week has a chance of winning (much like a lottery), and the more annotations you make, the higher your chance of winning.

We also introduced monthly prizes for the 3 highest scorers of the month. The monthly prize motivates the high-scoring players to compete with each other by doing more work, but also motivates some of the low-scoring players in the early parts of the month when the high score is low.

The weekly prize was £15 and the monthly prizes were £75, £50 and £25 for first, second and third places. The prizes were sent as Amazon vouchers by email.

6. QUALITY OF DATA

The psychological impact of incentive structures, especially financial ones, can create a conflict of motivation in players (i.e. how much time they should spend on their decisions). They may decide to focus on ways to maximise rewards rather than provide high-quality answers. The game's scoring system and incentive structures are designed to reduce this to a minimum.
We have identified four aspects that need to be addressed to control annotation quality: ensuring users understand the task; attention slips; malicious behaviour; and genuine ambiguity of data [9].

Further analysis will reveal whether changing the number of players in the annotating and validating groups will affect the quality of the annotations. The game currently uses 8 players in the annotating group and 4 in the validating group, with an average of 18 players looking at each markable. Some types of task can achieve high-quality annotations with as few as 4 annotators [18], but other types of task (e.g. anaphora resolution) may require more [15].

7. ATTRACTING & MOTIVATING USERS

The target audience for the game is English speakers who spend significant amounts of time online, either playing computer games or casually browsing the Internet. In order to attract the number of participants required to make a success of this methodology it is not enough to develop attractive games; successful advertising is also needed. Phrase Detectives was written about in local and national press, on science websites, blogs, bookmarking websites and gaming forums. The developer of the game was also interviewed by the BBC. At the same time a pay-per-click advertising campaign was started on the social networking website Facebook, along with a group connected to the project.

We investigated the sources of traffic since live release using Google Analytics. Incoming site traffic didn't show anything unusual: direct (46%); from a website link (29%); from the Facebook advert (13%); from a search (12%). However, the bounce rate (the percentage of single-page visits, where the user leaves on the page they entered on) revealed how useful the traffic was. This showed a relatively consistent figure for direct (33%), link (29%) and search (44%) traffic. For the Facebook advert, however, it was significantly higher (90%), meaning that 9 out of 10 users that came from this source did not play the game.
This casts doubt over the usefulness of pay-per-click advertising as a way of attracting participants to a game.

The players of Phrase Detectives were encouraged to recruit more players by being given extra points every time they referred a player and whenever that player gained a level. The staggered reward for referring new players was to discourage players from creating new accounts themselves in order to get the reward. The scores of the referred players are displayed to the referring player on the recruits leaderboard. 4% of players have been referred by other players.

Attracting large numbers of players to a game is only part of the problem. It is also necessary to attract players who will make significant contributions. Since its release the game has attracted 750 players, but we found that the top 10 players had 60% of the total points on the system and had made 73% of the annotations. This indicates that only a handful of users are doing the majority of the work, which is consistent with previous findings [18]; however, the contribution of one-time users should not be ignored [8]. Most of the players who have made significant contributions have a language-based background.

Players are invited to report on their experiences either through the feedback page or by commenting on a markable. Both methods send a message to the administrators, who can address the issues raised and reply to the player if required. General feedback included suggestions for improvements to the interface and clarification of instructions and scoring. Frequent comments included reporting markables with errors from the pre-processing and discussing ambiguous or difficult markable relations.

This was intended to be a simple system of communication from player to administrator that avoids players colluding to gain points. However, it is apparent that a more sophisticated community message system would enhance the player experience and encourage the development of a community.
8. IMPLEMENTATION

Phrase Detectives runs on a dedicated Linux server. The pre-processed data is stored in a MySQL database and most of the scripting is done in PHP.

The Gold Standard is created by computational linguists in Serengeti, a Web-based annotation tool developed at the University of Bielefeld [20]. This tool runs on the same server and accesses the same database.

The database stores the textual data in Sekimo Generic Format (SGF) [19], a multi-layer representation of the original documents that can easily be transformed into other common formats such as MAS-XML and PAULA. We apply a pipeline of scripts to get from raw text to SGF format. For English texts this pipeline consists of these main steps:

• A pre-processing step normalises the input, applies a sentence splitter and runs a tokenizer over each sentence. We use the openNLP toolkit to perform this process.
• Each sentence is analysed by the Berkeley Parser.
• The parser output is interpreted to identify markables in the sentence. As a result we create an XML representation which preserves the syntactic structure of the markables (including nested markables, e.g. noun phrases within a larger noun phrase).
• A heuristic processor identifies a number of additional features associated with markables, such as person, case and number. The output format is MAS-XML.

The last two steps are based on previous work within the research group at Essex University [15]. Finally, MAS-XML is converted into SGF. Both MAS-XML and SGF are also the formats used to export the annotated data.
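The steps above can be sketched as a runnable flow. The real system invokes the openNLP toolkit and the Berkeley Parser; here those stages are replaced with naive stand-ins so the overall shape is self-contained, and every function name is ours rather than the project's.

```python
# Rough sketch of the raw-text-to-markables pipeline (illustrative stubs:
# the real system uses the openNLP toolkit and the Berkeley Parser).
import re

def split_sentences(text):
    # Stand-in for the openNLP sentence splitter.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def tokenize(sentence):
    # Stand-in for the openNLP tokenizer.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

def find_markables(tokens):
    # Stand-in for parsing plus markable extraction: here we simply treat
    # pronouns and capitalised tokens as markables.
    pronouns = {'he', 'she', 'it', 'they', 'this', 'i'}
    return [t for t in tokens if t.lower() in pronouns or t[0].isupper()]

def pipeline(text):
    # Sentence splitting -> tokenization -> markable identification,
    # mirroring the first three bullet points above.
    return [{'sentence': s, 'markables': find_markables(tokenize(s))}
            for s in split_sentences(text)]

for record in pipeline("This parrot is no more! He has ceased to be!"):
    print(record)
```

A real implementation would carry the parser's syntactic structure through to nested XML markables and then add the heuristic features (person, case, number) before serialisation; the sketch stops at flat markable lists.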
9. RESULTS

Before going live we evaluated a prototype of the game interface informally using a group of randomly selected volunteers from the University of Essex [2]. The beta version of Phrase Detectives went online in May 2008, with the first live release in December 2008. Over 1 million words of text have been added to the live game.

In the first 3 months of live release the game collected over 200,000 annotations and validations of anaphoric relations. To put this in perspective, the GNOME corpus, produced by traditional methods, included around 3,000 annotations of anaphoric relations [13], whereas OntoNotes 3.0, with 1 million words, contains around 140,000 annotations.

The analysis of the results is ongoing. However, by manually analysing 10 random documents we could not find a single case in which a misconceived annotation was validated by other players. This confirms the assumptions we made about quality control; it will need to be further investigated with more thorough analysis methods as part of future work.

10. CONCLUSIONS

The incentive structures used in Phrase Detectives were successful in motivating the users to provide high-quality data. In particular, the collaborative and social elements (agreement scoring and leaderboards) seem to offer the most promise if they can be linked with existing social networks.

The methodology behind collaborative game playing has become increasingly widespread. Whilst the goodwill of Web volunteers exists at the moment, there may be a point of saturation, where it becomes significantly more difficult to attract users and more novel incentive structures will need to be developed.