Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, pages 19–27, Los Angeles, California, June 2010. © 2010 Association for Computational Linguistics

Products of Random Latent Variable Grammars

Slav Petrov
Google Research
New York, NY 10011
slav@google.com

Abstract

We show that the automatically induced latent variable grammars of Petrov et al. (2006) vary widely in their underlying representations, depending on their EM initialization point. We use this to our advantage, combining multiple automatically learned grammars into an unweighted product model, which gives significantly improved performance over state-of-the-art individual grammars. In our model, the probability of a constituent is estimated as a product of posteriors obtained from multiple grammars that differ only in the random seed used for initialization, without any learning or tuning of combination weights. Despite its simplicity, a product of eight automatically learned grammars improves parsing accuracy from 90.2% to 91.8% on English, and from 80.3% to 84.5% on German.

1 Introduction

Learning a context-free grammar for parsing requires the estimation of a more highly articulated model than the one embodied by the observed treebank. This is because the naive treebank grammar (Charniak, 1996) is too permissive, making unrealistic context-freedom assumptions. For example, it postulates that there is only one type of noun phrase (NP), which can appear in all positions (subject, object, etc.), regardless of case, number or gender. As a result, the grammar can generate millions of (incorrect) parse trees for a given sentence, and has a flat posterior distribution. High accuracy grammars therefore add soft constraints on the way categories can be combined, and enrich the label set with additional information. These constraints can be lexicalized (Collins, 1999; Charniak, 2000), unlexicalized (Johnson, 1998; Klein and Manning, 2003b) or automatically learned (Matsuzaki et al., 2005; Petrov et al., 2006). The constraints serve the purpose of weakening the independence assumptions, and reduce the number of possible (but incorrect) parses.

Here, we focus on the latent variable approach of Petrov et al. (2006), where an Expectation Maximization (EM) algorithm is used to induce a hierarchy of increasingly more refined grammars. Each round of refinement introduces new constraints on how constituents can be combined, which in turn leads to a higher parsing accuracy. However, EM is a local method, and there are no guarantees that it will find the same grammars when initialized from different starting points. In fact, it turns out that even though the final performance of these grammars is consistently high, there are significant variations in the learned refinements.

We use these variations to our advantage, and treat grammars learned from different random seeds as independent and equipotent experts. We use a product distribution for joint prediction, which gives more peaked posteriors than a sum, and enforces all constraints of the individual grammars, without the need to tune mixing weights. It should be noted here that our focus is on improving parsing performance using a single underlying grammar class, which is somewhat orthogonal to the issue of parser combination, which has been studied elsewhere in the literature (Sagae and Lavie, 2006; Fossum and Knight, 2009; Zhang et al., 2009).
In contrast to that line of work, we also do not restrict ourselves to working with k-best output, but work directly with a packed forest representation of the posteriors, much in the spirit of Huang (2008), except that we work with several forests rather than rescoring a single one.

In our experimental section we give empirical answers to some of the remaining theoretical questions. We address the question of averaging versus multiplying classifier predictions, we investigate different ways of introducing more diversity into the underlying grammars, and also compare combining partial (constituent-level) and complete (tree-level) predictions. Quite serendipitously, the simplest approaches work best in our experiments. A product of eight latent variable grammars, learned on the same data, and only differing in the seed used in the random number generator that initialized EM, improves parsing accuracy from 90.2% to 91.8% on English, and from 80.3% to 84.5% on German. These parsing results are even better than those obtained by discriminative systems which have access to additional non-local features (Charniak and Johnson, 2005; Huang, 2008).

2 Latent Variable Grammars

Before giving the details of our model, we briefly review the basic properties of latent variable grammars. Learning latent variable grammars consists of two tasks: (1) determining the data representation (the set of context-free productions to be used in the grammar), and (2) estimating the parameters of the model (the production probabilities). We focus on the randomness introduced by the EM algorithm and refer the reader to Matsuzaki et al. (2005) and Petrov et al. (2006) for a more general introduction.

2.1 Split & Merge Learning

Latent variable grammars split the coarse (but observed) grammar categories of a treebank into more fine-grained (but hidden) subcategories, which are better suited for modeling the syntax of natural languages (e.g. NP becomes NP_1 through NP_k). Accordingly, each grammar production A → BC over observed categories A, B, C is split into a set of productions A_x → B_y C_z over hidden categories A_x, B_y, C_z. Computing the joint likelihood of the observed parse trees T and sentences w requires summing over all derivations t over split subcategories:

    \prod_i P(w_i, T_i) = \prod_i \sum_{t : T_i} P(w_i, t)    (1)

Matsuzaki et al. (2005) derive an EM algorithm for maximizing the joint likelihood, and Petrov et al. (2006) extend this algorithm to use a split&merge procedure to adaptively determine the optimal number of subcategories for each observed category. Starting from a completely markovized X-Bar grammar, each category is split in two, generating eight new productions for each original binary production. To break symmetries, the production probabilities are perturbed by 1% of random noise. EM is then initialized with this starting point and used to climb the highly non-convex objective function given in Eq. 1. Each splitting step is followed by a merging step, which uses a likelihood ratio test to reverse the least useful half of the splits. Learning proceeds by iterating between those two steps for six rounds. To prevent overfitting, the production probabilities are linearly smoothed by shrinking them towards their common base category.
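To make the role of the random seed concrete, here is a minimal sketch of one seed-dependent split step for binary rules, assuming the grammar is stored as a dictionary mapping (parent, left child, right child) to a probability. The function name and data layout are illustrative only and are not taken from the Berkeley parser implementation; unary and lexical rules, the EM iterations, merging and smoothing are all omitted.

```python
import random
from collections import defaultdict

def split_binary_rules(rules, seed, noise=0.01):
    """One seed-dependent split step (a sketch of Section 2.1).

    `rules` maps (A, B, C) -> P(A -> B C) for an unsplit binary grammar.
    Every category is split in two, so each rule yields 2*2*2 = 8 split
    rules; their probabilities are perturbed by +/- `noise` (1% in the
    paper) to break symmetries before EM is run.  Only `seed` differs
    between the grammars that are later combined in the product model.
    """
    rng = random.Random(seed)
    split = {}
    for (A, B, C), p in rules.items():
        for x in (0, 1):
            for y in (0, 1):
                for z in (0, 1):
                    jitter = 1.0 + noise * (2.0 * rng.random() - 1.0)
                    split[(A, x), (B, y), (C, z)] = p / 4.0 * jitter

    # Renormalize so the rules of each split parent sum to one again.
    totals = defaultdict(float)
    for (parent, _, _), q in split.items():
        totals[parent] += q
    return {rule: q / totals[rule[0]] for rule, q in split.items()}

# Toy example: different seeds give (slightly) different starting points,
# from which EM can climb to very different local maxima.
toy = {("NP", "DT", "NN"): 0.7, ("NP", "NP", "PP"): 0.3}
g_seed1 = split_binary_rules(toy, seed=1)
g_seed2 = split_binary_rules(toy, seed=2)
```

In the full procedure this initialization is followed by EM training, a merge step that reverses the least useful half of the splits, and smoothing; only the part that depends on the random seed is shown here.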
2.2 EM induced Randomness

While the split&merge procedure described above is shown in Petrov et al. (2006) to reduce the variance in final performance, we found after closer examination that there are substantial differences in the patterns learned by the grammars. Since the initialization is not systematically biased in any way, one can obtain different grammars by simply changing the seed of the random number generator. We trained 16 different grammars by initializing the random number generator with seed values 1 through 16, but without biasing the initialization in any other way. Figure 1 shows that the number of subcategories allocated to each observed category varies significantly between the different initialization points, especially for the phrasal categories. Figure 2 shows posteriors over the most frequent subcategories given their base category for the first four grammars. Clearly, EM is allocating the latent variables in very different ways in each case.

[Figure 1: There is large variance in the number of subcategories (error bars correspond to one standard deviation). The plot shows the automatically determined number of subcategories for each phrasal and part-of-speech category.]

[Figure 2: Posterior probabilities of the eight most frequent hidden subcategories given their observed base categories (panels for NP, PP, IN and DT). The four grammars (indicated by shading) are populating the subcategories in very different ways.]

As a more quantitative measure of difference,[1] we evaluated all 16 grammars on sections 22 and 24 of the Penn Treebank. Figure 3 shows the performance on those two sets, and reveals that there is no single grammar that achieves the best score on both. While the parsing accuracies are consistently high,[2] there is only a weak correlation between the accuracies on the two evaluation sets (Pearson coefficient 0.34). This suggests that no single grammar should be preferred over the others. In previous work (Petrov et al., 2006; Petrov and Klein, 2007) the final grammar was chosen based on its performance on a held-out set (section 22), and corresponds to the second best grammar in Figure 3 (because only 8 different grammars were trained).

[1] While cherry-picking similarities is fairly straightforward, it is less obvious how to quantify differences.
[2] Note that despite their variance, the performance is always higher than that of the lexicalized parser of Charniak (2000).

[Figure 3: Parsing accuracies for grammars learned from different random seeds (F1 score on section 24 plotted against F1 score on section 22). The large variance and weak correlation suggest that no single grammar is to be preferred.]

A more detailed error analysis is given in Figure 4, where we show a breakdown of F1 scores for selected phrasal categories in addition to the overall F1 score and exact match (on the WSJ development set). While grammar G2 has the highest overall F1 score, its exact match is not particularly high, and it turns out to be the weakest at predicting quantifier phrases (QP). Similarly, the performance of the other grammars varies between the different error measures, indicating again that no single grammar dominates the others.

[Figure 4: Breakdown of different accuracy measures (F1 score, exact match, and F1 on NP, VP, PP and QP) for four randomly selected grammars (G1-G4), as well as a product model (P) that uses those four grammars. Note that no single grammar does well on all measures, while the product model does significantly better on all.]
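The diversity check behind Figure 3 boils down to computing a Pearson correlation between per-grammar accuracies on two evaluation sets. The sketch below is runnable but uses randomly generated stand-in scores; they are placeholders, not the measurements behind Figure 3 (with the real scores the paper reports a coefficient of 0.34).

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Stand-in F1 scores for 16 grammars (seeds 1-16) on sections 22 and 24.
# These are randomly generated placeholders, NOT the paper's measurements.
rng = random.Random(0)
f1_section22 = [90.6 + 0.8 * rng.random() for _ in range(16)]
f1_section24 = [89.5 + 0.7 * rng.random() for _ in range(16)]

# A coefficient well below 1 means that doing well on one evaluation set
# does not imply doing well on the other, i.e. no single grammar wins.
print(round(pearson(f1_section22, f1_section24), 2))
```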
3 A Simple Product Model

It should be clear by now that simply varying the random seed used for initialization causes EM to discover very different latent variable grammars. While this behavior is worrisome in general, it turns out that we can use it to our advantage in this particular case. Recall that we are using EM to learn both the data representation and the parameters of the model. Our analysis showed that changing the initialization point results in learning grammars that vary quite significantly in the errors they make, but have comparable overall accuracies. This suggests that the different local maxima found by EM correspond to different data representations rather than to suboptimal parameter estimates.

To leverage the strengths of the individual grammars, we combine them in a product model. Product models have the nice property that their Kullback-Leibler divergence from the true distribution will always be smaller than the average of the KL divergences of the individual distributions (Hinton, 2001). Therefore, as long as no individual grammar G_i is significantly worse than the others, we can only benefit from combining multiple latent variable grammars and searching for the tree that maximizes

    P(T | w) \propto \prod_i P(T | w, G_i)    (2)

Here, we are making the assumption that the individual grammars are conditionally independent, which is of course not true in theory, but holds surprisingly well in practice. To avoid this assumption, we could use a sum model, but we will show in Section 4.1 that the product formulation performs significantly better. Intuitively speaking, products have the advantage that the final prediction has a high posterior under all models, giving each model veto power. This is exactly the behavior that we need in the case of parsing, where each grammar has learned different constraints for ruling out improbable parses.
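The contrast between the product and the sum combination can be illustrated with a small numerical sketch. The posterior values below are invented for illustration and the helper names are not from the paper; the point is only that a single grammar assigning a near-zero posterior drives the product score down (a veto), while an unweighted average can still look respectable.

```python
import math

def product_score(posteriors):
    """Unweighted product of posteriors (Eq. 2), computed in log space."""
    return sum(math.log(p) for p in posteriors)

def sum_score(posteriors):
    """Unweighted mixture (average) of posteriors, for comparison."""
    return sum(posteriors) / len(posteriors)

# Invented posteriors of one candidate constituent under four grammars.
# The third grammar considers the constituent very unlikely.
posteriors = [0.9, 0.8, 0.001, 0.85]

print(product_score(posteriors))  # about -7.4: effectively vetoed
print(sum_score(posteriors))      # about 0.64: still looks plausible
```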
3.1 Learning

Joint training of our product model would couple the parameters of the individual grammars, necessitating the computation of an intractable global partition function (Brown and Hinton, 2001). Instead, we use EM to train each grammar independently, but from a different, randomly chosen starting point. To emphasize, we do not introduce any systematic bias (but see Section 4.3 for some experiments), or attempt to train the models to be maximally different (Hinton, 2002); we simply train a random collection of grammars by varying the random seed used for initialization. We found in our experiments that the randomness provided by EM is sufficient to achieve diversity among the individual grammars, and gives results that are as good as more involved training procedures. Xu and Jelinek (2004) made a similar observation when learning random forests for language modeling.

Our model is reminiscent of Logarithmic Opinion Pools (Bordley, 1982) and Products of Experts (Hinton, 2001).[3] However, because we believe that none of the underlying grammars should be favored, we deliberately do not use any combination weights.

[3] As a matter of fact, Hinton (2001) mentions syntactic parsing as one of the motivating examples for Products of Experts.

3.2 Inference

Computing the most likely parse tree is intractable for latent variable grammars (Sima'an, 2002), and therefore also for our product model. This is because there are exponentially many derivations over split subcategories that correspond to a single parse tree over unsplit categories, and there is no dynamic program to efficiently marginalize out the latent variables. Previous work on parse risk minimization has addressed this problem in two different ways: by changing the objective function, or by constraining the search space (Goodman, 1996; Titov and Henderson, 2006; Petrov and Klein, 2007).

The simplest approach is to stick to likelihood as the objective function, but to limit the search space to a set of high quality candidates \mathcal{T}':

    T^* = \arg\max_{T \in \mathcal{T}'} P(T | w)    (3)

Because the likelihood of a given parse tree can be computed exactly for our product model (Eq. 2), the quality of this approximation is only limited by the quality of the candidate list. To generate the candidate list, we produce k-best lists of Viterbi derivations with the efficient algorithm of Huang and Chiang (2005), and erase the subcategory information to obtain parse trees over unsplit categories. We refer to this approximation as TREE-LEVEL inference, because it considers a list of complete trees from the underlying grammars, and selects the tree that has the highest likelihood under the product model. While the k-best lists are of very high quality, this is a fairly crude and unsatisfactory way of approximating the posterior distribution of the product model, as it does not allow the synthesis of new trees based on tree fragments from different grammars.

An alternative is to use a tractable objective function that allows the efficient exploration of the entire
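As a rough sketch of the TREE-LEVEL approximation described above: candidate trees from each grammar's k-best list are pooled, each candidate is scored by the sum of its log posteriors under all grammars (the unweighted product of Eq. 2), and the highest-scoring tree is returned (Eq. 3). The two callables are hypothetical interfaces standing in for the k-best extraction of Huang and Chiang (2005) and for the exact per-grammar tree likelihood; they are not the paper's implementation.

```python
def tree_level_parse(sentence, grammars, kbest, tree_log_posterior, k):
    """TREE-LEVEL inference sketch (Eq. 3): restrict the search to a pooled
    candidate list and return the tree with the highest product score.

    kbest(grammar, sentence, k)                 -> k candidate trees over
                                                   unsplit categories (trees
                                                   assumed hashable, e.g.
                                                   bracketed strings)
    tree_log_posterior(grammar, tree, sentence) -> log P(T | w, G), which can
                                                   be computed exactly by
                                                   summing over derivations
    """
    # Pool candidates from all grammars, removing duplicates.
    candidates = set()
    for grammar in grammars:
        candidates.update(kbest(grammar, sentence, k))

    # Unweighted product of posteriors (Eq. 2) = sum of log posteriors.
    def product_score(tree):
        return sum(tree_log_posterior(g, tree, sentence) for g in grammars)

    return max(candidates, key=product_score)
```

Note that the candidate pool only ever contains complete trees proposed by some individual grammar, which is exactly the limitation the text points out before turning to alternatives that work below the tree level.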