A Multidimensional Lexicon for Interpersonal Stancetaking

Umashanthi Pavalanathan, Georgia Institute of Technology, Atlanta, GA
Jim Fitzpatrick, University of Pittsburgh, Pittsburgh, PA
Scott F. Kiesling, University of Pittsburgh, Pittsburgh, PA
Jacob Eisenstein, Georgia Institute of Technology, Atlanta, GA

Abstract

The sociolinguistic construct of stancetaking describes the activities through which discourse participants create and signal relationships to their interlocutors, to the topic of discussion, and to the talk itself. Stancetaking underlies a wide range of interactional phenomena, relating to formality, politeness, affect, and subjectivity. We present a computational approach to stancetaking, in which we build a theoretically-motivated lexicon of stance markers, and then use multidimensional analysis to identify a set of underlying stance dimensions. We validate these dimensions intrinsically and extrinsically, showing that they are internally coherent, match pre-registered hypotheses, and correlate with social phenomena.

1 Introduction

What does it mean to be welcoming or standoffish, light-hearted or cynical? Such interactional styles are performed primarily with language, yet little is known about how linguistic resources are arrayed to create these social impressions. The sociolinguistic concept of interpersonal stancetaking attempts to answer this question, by providing a conceptual framework that accounts for a range of interpersonal phenomena, subsuming formality, politeness, and subjectivity (Du Bois, 2007).[1]

[1] Stancetaking is distinct from the notion of stance, which corresponds to a position in a debate (Walker et al., 2012). Similarly, Freeman et al. (2014) correlate phonetic features with the strength of such argumentative stances.
This framework has been applied almost exclusively through qualitative methods, using close readings of individual texts or dialogs to uncover how language is used to position individuals with respect to their interlocutors and readers.

We attempt the first large-scale operationalization of stancetaking through computational methods. Du Bois (2007) formalizes stancetaking as a multi-dimensional construct, reflecting the relationship of discourse participants to (a) the audience or interlocutor; (b) the topic of discourse; and (c) the talk or text itself. However, the multi-dimensional nature of stancetaking poses problems for traditional computational approaches, in which labeled data is obtained by relying on annotator intuitions about scalar concepts such as politeness (Danescu-Niculescu-Mizil et al., 2013) and formality (Pavlick and Tetreault, 2016).

Instead, our approach is based on a theoretically-guided application of unsupervised learning, in the form of factor analysis, applied to lexical features. Stancetaking is characterized in large part by an array of linguistic features, ranging from discourse markers such as "actually" to backchannels such as "yep" (Kiesling, 2009). We therefore first compile a lexicon of stance markers, combining prior lexicons from Biber and Finegan (1989) and the Switchboard Dialogue Act Corpus (Jurafsky et al., 1998). We then extend this lexicon to the social media domain using word embeddings. Finally, we apply multidimensional analysis of co-occurrence patterns to identify a small set of stance dimensions.

To measure the internal coherence (construct validity) of the stance dimensions, we use a word intrusion task (Chang et al., 2009) and a set of pre-registered hypotheses. To measure the utility of the stance dimensions, we perform a series of extrinsic evaluations. A predictive evaluation shows that the membership of online communities is determined in part by the interactional stances that predominate in those communities.
Furthermore, the induced stance dimensions are shown to align with annotations of politeness and formality.

Contributions. We operationalize the sociolinguistic concept of stancetaking as a multi-dimensional framework, making it possible to measure at scale. Specifically,

- we contribute a lexicon of stance markers based on prior work and adapted to the genre of online interpersonal discourse;
- we group stance markers into latent dimensions;
- we show that these stance dimensions are internally coherent;
- we demonstrate that the stance dimensions predict and correlate with social phenomena.[2]

[2] Lexicons and stance dimensions are available at

2 Related Work

From a theoretical perspective, we build on prior work on interactional meaning in language. Methodologically, our paper relates to prior work on lexicon-based analysis and contrastive studies of social media communities.

2.1 Linguistic Variation and Social Meaning

In computational sociolinguistics (Nguyen et al., 2016), language variation has been studied primarily in connection with macro-scale social variables, such as age (Argamon et al., 2007; Nguyen et al., 2013), gender (Burger et al., 2011; Bamman et al., 2014), race (Eisenstein et al., 2011; Blodgett et al., 2016), and geography (Eisenstein et al., 2010). This parallels what Eckert (2012) has called the "first wave" of language variation studies in sociolinguistics, which also focused on macro-scale variables.

More recently, sociolinguists have dedicated increased attention to situational and stylistic variation, and the interactional meaning that such variation can convey (Eckert and Rickford, 2001). This linguistic research can be aligned with computational efforts to quantify phenomena such as subjectivity (Riloff and Wiebe, 2003), sentiment (Wiebe et al., 2005), politeness (Danescu-Niculescu-Mizil et al., 2013), formality (Pavlick and Tetreault, 2016), and power dynamics (Prabhakaran et al., 2012).
While linguistic research on interactional meaning has focused largely on qualitative methodologies such as discourse analysis (e.g., Bucholtz and Hall, 2005), these computational efforts have made use of crowdsourced annotations to build large datasets of, for example, polite and impolite text. These annotation efforts draw on the annotators' intuitions about the meaning of these sociolinguistic constructs.

Interpersonal stancetaking represents an attempt to unify concepts such as sentiment, politeness, formality, and subjectivity under a single theoretical framework (Jaffe, 2009; Kiesling, 2009). The key idea, as articulated by Du Bois (2007), is that stancetaking captures the speaker's relationship to (a) the topic of discussion, (b) the interlocutor or audience, and (c) the talk (or writing) itself. Various configurations of these three legs of the "stance triangle" can account for a range of phenomena. For example, epistemic stance relates to the speaker's certainty about what is being expressed, while affective stance indicates the speaker's emotional position with respect to the content (Ochs, 1993).

The framework of stancetaking has been widely adopted in linguistics, particularly in the discourse analytic tradition, which involves close reading of individual texts or conversations (Kärkkäinen, 2006; Keisanen, 2007; Precht, 2003; White, 2003). But despite its strong theoretical foundation, we are aware of no prior efforts to operationalize stancetaking at scale. Since annotators may not have strong intuitions about stance, in the way that they do about formality and politeness, we cannot rely on the annotation methodologies employed in prior work. We take a different approach, performing a multidimensional analysis of the distribution of likely stance markers.

2.2 Lexicon-based Analysis

Our operationalization of stancetaking is based on the induction of lexicons of stance markers.
The lexicon-based methodology is related to earlier work from social psychology, such as the General Inquirer (Stone, 1966) and LIWC (Tausczik and Pennebaker, 2010). In LIWC, the basic categories were identified first, based on psychological constructs (e.g., positive emotion, cognitive processes, drive to power) and syntactic groupings of words and phrases (e.g., pronouns, prepositions, quantifiers). The lexicon designers then manually constructed lexicons for each category, augmenting their intuitions by using distributional statistics to suggest words that may have been missed (Pennebaker et al., 2015). In contrast, we follow the approach of Biber (1991), using multidimensional analysis to identify latent groupings of markers based on co-occurrence statistics. We then use crowdsourcing and extrinsic comparisons to validate the coherence of these dimensions.

2.3 Multicommunity Studies

Social media platforms such as Reddit, Stack Exchange, and Wikia can be considered multicommunity environments, in that they host multiple subcommunities with distinct social and linguistic properties. Such subcommunities can be contrasted in terms of topics (Adamic et al., 2008; Hessel et al., 2014) and social networks (Backstrom et al., 2006). Our work focuses on Reddit, emphasizing community-wide differences in norms for interpersonal interaction. In the same vein, Tan and Lee (2015) attempt to characterize stylistic differences across subreddits by focusing on very common words and parts-of-speech; Tran and Ostendorf (2016) use language models and topic models to measure similarity across threads within a subreddit. One distinction of our approach is that the use of multidimensional analysis gives us interpretable dimensions of variation. This makes it possible to identify the specific interpersonal features that vary across communities.
3 Data

Reddit, one of the internet's largest social media platforms, is a collection of subreddits organized around various topics of interest. As of January 2017, there were more than one million subreddits and nearly 250 million users, discussing topics ranging from politics (r/politics) to horror stories (r/nosleep).[3] Although Reddit was originally designed for sharing hyperlinks, it also provides the ability to post original textual content, submit comments, and vote on content quality (Gilbert, 2013). Reddit's conversation-like threads are therefore well suited for the study of interpersonal social and linguistic phenomena.

Subreddits       126,789
Authors        6,401,699
Threads       52,888,024
Comments     531,804,658

Table 1: Dataset size

For example, the following are two comments from the subreddit r/malefashionadvice, posted in response to a picture posted by a user asking for fashion advice.

U1: "I think the beard *looks pretty good*. *Definitely* not the goatee. Clean shaven is always the safe option."
U2: "*Definitely* the beard. But keep it trimmed."

The phrases in bold face are markers of stance, indicating an evaluative stance. The following example is part of a thread in the subreddit r/photoshopbattles, where users discuss an edited image posted by the original poster (OP). The phrases in bold face are markers of stance, indicating an involved and interactional stance.

U3: "*Ha ha* awesome!"
U4: "are those..... furries?"
OP: "*yes*, sir. They are!"
U4: "*Oh cool*. That makes sense!"

We used an archive of 530 million comments posted on Reddit in 2014, retrieved from the public archive of Reddit comments.[4] This dataset consists of each post's textual content, along with metadata that identifies the subreddit, thread, author, and post creation time. More statistics about the full dataset are shown in Table 1.
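The statistics in Table 1 can be gathered in a single pass over the comment archive. The sketch below assumes the archive's JSON-lines layout, with one comment object per line and `subreddit`, `author`, and `link_id` fields identifying the community, user, and thread; the function name is illustrative.

```python
import json

def dataset_stats(lines):
    """Count distinct subreddits, authors, and threads, plus total
    comments, in a JSON-lines Reddit comment dump."""
    subreddits, authors, threads = set(), set(), set()
    n_comments = 0
    for line in lines:
        post = json.loads(line)
        subreddits.add(post["subreddit"])
        authors.add(post["author"])
        threads.add(post["link_id"])  # link_id ties a comment to its thread
        n_comments += 1
    return {"subreddits": len(subreddits), "authors": len(authors),
            "threads": len(threads), "comments": n_comments}
```

Because only set membership and a counter are kept, the pass is memory-bounded by the number of distinct communities, users, and threads rather than by the number of comments.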
[4] reddit_comments_corpus

4 Stance Lexicon

Interpersonal stancetaking can be characterized in part by an array of linguistic features such as hedges (e.g., might, kind of), discourse markers (e.g., actually, I mean), and backchannels (e.g., yep, um). Our analysis focuses on these markers, which we collect into a lexicon.

4.1 Seed lexicon

We began with a seed lexicon of stance markers from Biber and Finegan (1989), who compiled an extensive list by surveying dictionaries, previous studies on stance, and texts in several genres of English. This list includes certainty adverbs (e.g., actually, of course, in fact), affect markers (e.g., amazing, thankful, sadly), and hedges (e.g., kind of, maybe, something like), among other adverbial, adjectival, verbal, and modal markers of stance. In total, this list consists of 448 stance markers.

The Biber and Finegan (1989) lexicon is primarily based on written genres from the pre-social media era. Our dataset, like much of the recent work in this domain, consists of online discussions, which differ significantly from printed texts (Eisenstein, 2013). One difference is that online discussions contain a number of dialog act markers that are characteristic of spoken language, such as oh yeah, nah, and wow. We accounted for this by adding 74 dialog act markers from the Switchboard Dialog Act Corpus (Jurafsky et al., 1998). The final seed lexicon consists of 517 unique markers from these two sources. Note that the seed lexicon also includes markers that contain multiple tokens (e.g., kind of, I know).

4.2 Lexicon expansion

Online discussions differ not only from written texts, but also from spoken discussions, due to their use of non-standard vocabulary and spellings. To measure stance accurately, these genre differences must be accounted for.
We therefore expanded the seed lexicon using automated techniques based on distributional statistics. This is similar to prior work on the expansion of sentiment lexicons (Hatzivassiloglou and McKeown, 1997; Hamilton et al., 2016).

Our lexicon expansion approach used word embeddings to find words that are distributionally similar to those in the seed set. We trained word embeddings on a corpus of 25 million Reddit comments, with a vocabulary of the 100K most frequent words on Reddit, using the structured skip-gram models of both WORD2VEC (Mikolov et al., 2013) and WANG2VEC (Ling et al., 2015) with default parameters. The WANG2VEC method augments WORD2VEC by accounting for word order information. We found the similarity judgments obtained from WANG2VEC to be qualitatively more meaningful, so we used these embeddings to construct the expanded lexicon.[5]

[5] We used the following default parameters: 100 dimensions, a window size of five, a negative sampling size of ten, five epochs, and a sub-sampling rate of 10^-4.

Seed term       Expanded terms
(Example seeds from Biber and Finegan (1989))
significantly   considerably, substantially, dramatically
certainly       surely, frankly, definitely
incredibly      extremely, unbelievably, exceptionally
(Example seeds from Jurafsky et al. (1998))
nope            nah, yup, nevermind
great           fantastic, terrific, excellent

Table 2: Stance lexicon: seed and expanded terms.

To perform lexicon expansion, we constructed a dictionary of candidate terms, consisting of all unigrams that occur with a frequency rate of at least 10^-7 in the Reddit comment corpus. Then, for each single-token marker in the seed lexicon, we identified all terms from the candidate set whose embedding has a cosine similarity of at least 0.75 with respect to the seed marker.
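The expansion step can be sketched with plain numpy, independently of how the embeddings were trained. The function name and toy vectors below are illustrative; the 0.75 cosine threshold and the restriction to single-token seeds follow the procedure described above.

```python
import numpy as np

def expand_lexicon(seed_terms, embeddings, candidates, threshold=0.75):
    """Return candidate terms whose embedding has cosine similarity of at
    least `threshold` with some single-token seed marker.
    `embeddings` maps words to numpy vectors."""
    def unit(v):
        return v / np.linalg.norm(v)
    expanded = set()
    for seed in seed_terms:
        # Multi-token markers (e.g. "kind of") have no single embedding.
        if " " in seed or seed not in embeddings:
            continue
        sv = unit(embeddings[seed])
        for cand in candidates:
            if cand in seed_terms or cand not in embeddings:
                continue
            if float(np.dot(sv, unit(embeddings[cand]))) >= threshold:
                expanded.add(cand)
    return expanded
```

In practice the candidate set would already be filtered to unigrams above the 10^-7 corpus frequency rate, so the inner loop runs over a vocabulary of modest size.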
Table 2 shows examples of seed markers and related terms we extracted from word embeddings.[6] Through this procedure, we identified 228 additional markers based on similarity to items in the seed list from Biber and Finegan (1989), and 112 additional markers based on the seed list of dialog acts. In total, our stance lexicon contains 812 unique markers.

[6] We tried different thresholds on the similarity value and the corpus frequency, and the reported values were chosen based on the quality of the resulting related terms. This was done prior to any of the validations or extrinsic analyses described later in the paper.

5 Linguistic Dimensions of Stancetaking

To summarize the main axes of variation across the lexicon of stance markers, we apply a multidimensional analysis (Biber, 1992) to the distributional statistics of stance markers across subreddit communities. Each dimension of variation can then be viewed as a spectrum, characterized by the stance markers and subreddits that are associated with the positive and negative extremes. Multidimensional analysis is based on singular value decomposition, which has been applied successfully to a wide range of problems in natural language processing and information retrieval (e.g., Landauer et al., 1998). While Bayesian topic models are an appealing alternative, singular value decomposition is fast and deterministic, with a minimal number of tuning parameters.

5.1 Extracting Stance Dimensions

Our analysis is based on the co-occurrence of stance markers and subreddits.
This is motivated by our interest in comparisons of the interactional styles of online communities within Reddit, and by the premise that these distributional differences reflect socially meaningful communicative norms. A pilot study applied the same technique to the co-occurrence of stance markers and individual authors, and the resulting dimensions appeared to be less stylistically coherent.

Singular value decomposition is often used in combination with a transformation of the co-occurrence counts by pointwise mutual information (Bullinaria and Levy, 2007). This transformation ensures that each cell in the matrix indicates how much more likely a stance marker is to co-occur with a given subreddit than would happen by chance under an independence assumption. Because negative PMI values tend to be unreliable, we use positive PMI (PPMI), which involves replacing all negative PMI values with zeros (Niwa and Nitta, 1994). Therefore, we obtain stance dimensions by applying singular value decomposition to the matrix constructed as follows:

    X_{m,s} = \left[ \log \frac{\Pr(\text{marker} = m, \text{subreddit} = s)}{\Pr(\text{marker} = m) \Pr(\text{subreddit} = s)} \right]_{+}

Truncated singular value decomposition performs the approximate factorization X ≈ U Σ V^T, where each row of the matrix U is a k-dimensional description of each stance marker, and each row of V is a k-dimensional description of each subreddit. We included the 7,589 subreddits that received at least 1,000 comments in 2014.

5.2 Results: Stance Dimensions

From the SVD analysis, we extracted the six principal latent dimensions that explain the most variation in our dataset.[7] The decision to include only the first six dimensions was based on the strength of the singular values corresponding to the dimensions. Table 3 shows the top five stance markers for each extreme of the six dimensions.
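The PPMI transform and truncated factorization of Section 5.1 can be sketched with numpy as follows; the function name and toy count matrix are illustrative, and a production pipeline would use a sparse SVD rather than the dense one shown here.

```python
import numpy as np

def ppmi_svd(counts, k):
    """PPMI transform of a marker-by-subreddit count matrix, followed by
    truncated SVD. Rows of U describe stance markers; rows of V describe
    subreddits."""
    total = counts.sum()
    p_joint = counts / total                            # Pr(m, s)
    p_marker = p_joint.sum(axis=1, keepdims=True)       # Pr(m)
    p_subreddit = p_joint.sum(axis=0, keepdims=True)    # Pr(s)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_joint / (p_marker * p_subreddit))
    # Positive PMI: zero out negative (and undefined) cells.
    ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)
    U, S, Vt = np.linalg.svd(ppmi, full_matrices=False)
    return U[:, :k], S[:k], Vt[:k].T
```

Each stance marker's coordinates in `U`, scaled by the singular values `S`, give its position along the latent dimensions; the same holds for subreddits via `V`.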
The stance dimensions convey a range of concepts, such as involved versus informational language, narrative versus dialogue-oriented writing, standard versus non-standard variation, and positive versus negative affect. Figure 1 shows the distribution of subreddits along two of these dimensions.

[7] Similar to factor analysis, the top few dimensions of SVD explain the most variation, and tend to be the most interpretable. A scree plot (Cattell, 1966) showed that the amount of variation explained dropped after the top six dimensions, and qualitative interpretation showed that the remaining dimensions were less interpretable.

Figure 1: Mapping of subreddits on dimension two and dimension three, highlighting especially popular subreddits. Picture-oriented subreddits r/gonewild and r/aww map high on dimension two and low on dimension three, indicating an involved and informal style of discourse. Subreddits dedicated to knowledge-sharing discussions, such as r/askscience and r/space, map low on dimension two and high on dimension three, indicating an informational and formal style.

6 Construct Validity

Evaluating model output against gold-standard annotations is appropriate when there is some notion of a correct answer. As stancetaking is a multidimensional concept, we have taken an unsupervised approach. Therefore, we use evaluation techniques based on the notion of validity, which is the extent to which the operationalization of a construct truly captures the intended quantity or concept. Validation techniques for unsupervised content analysis are widely found in the social science literature (Weber, 1990; Quinn et al., 2010) and have also been recently used in the NLP and machine learning communities (e.g., Chang et al., 2009; Murphy et al., 2012; Sim et al., 2013). We used several methods to validate the stance dimensions extracted from the corpus of Reddit comments.
This section describes intrinsic evaluations, which test whether the extracted stance dimensions are linguistically coherent and meaningful.