A relevance feedback approach for the author name disambiguation problem

Thiago A. Godoi, Ricardo da S. Torres, and Ariadne M. B. R. Carvalho
Institute of Computing, University of Campinas
{thiago.godoi,rtorres,ariadne}@ic.unicamp.br

Marcos André Gonçalves
Dept. of Computer Science, Federal University of Minas Gerais
mgoncalv@dcc.ufmg.br

Anderson A. Ferreira
Dept. of Computer Science, Federal University of Ouro Preto
ferreira@iceb.ufop.br

Weiguo Fan and Edward A. Fox
Dept. of Computer Science, Virginia Tech
{wfan,fox}

ABSTRACT

This paper presents a new name disambiguation method that exploits user feedback on ambiguous references across iterations. An unsupervised step is used to define pure training samples, and a hybrid supervised step is employed to learn a classification model for assigning references to authors. Our classification scheme combines the Optimum-Path Forest (OPF) classifier with complex reference similarity functions generated by a Genetic Programming framework. Experiments demonstrate that the proposed method yields better results than state-of-the-art disambiguation methods on two traditional datasets.

Categories and Subject Descriptors

H.3.3 [Information Search and Retrieval]: Information Retrieval; I.5.2 [Pattern Recognition]: Classifier design and evaluation

General Terms

Algorithms, Experimentation

Keywords

Name Disambiguation, Relevance Feedback, Genetic Programming, Optimum-Path Forest Classifier

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from
JCDL'13, July 22–26, 2013, Indianapolis, Indiana, USA.
Copyright 2013 ACM 978-1-4503-2077-1/13/07 ...$15.00.

1. INTRODUCTION

Scholarly digital libraries (DLs) such as DBLP¹, Citeseer², BDBComp³, and MEDLINE⁴ are fundamental tools for the advance of science and technology. However, one fundamental issue that concerns all of them is the quality of their data, which also affects the effectiveness of the services provided. Among the many potential data quality problems faced by these DLs, the author name ambiguity problem is one of the hardest to solve. This problem occurs when the same author publishes under similar but distinct names (synonyms) or when distinct authors share the same name or close variations of this name (homonyms).

¹ (as of Jan. 2013). ² (as of Jan. 2013).

The task of disambiguating author names may be formulated as follows. Consider a collection of citation records, each of which has a list of attributes. That list necessarily includes author names, work title, and publication venue title. Each attribute associates a specific value with a citation. An attribute may be composed of several elements. In the case of the attribute "author names", an element corresponds to the name of a single unique author. Each author name element is a reference to an author. The objective of the disambiguation task is to produce a disambiguation function that is used to partition the set of references to authors, so that each partition block ideally contains all the references to the same author and no references to other authors.

We assume that each reference to an author extracted from a citation record contains at least the following attributes: author name, coauthor names, work title, and publication venue title. To disambiguate the bibliographic citations of a digital library (DL), we first split the set of references to authors into groups of references called ambiguous groups.
In each ambiguous group, the author name values are similar. The ambiguous groups may be obtained by using blocking methods [1], which address scalability issues by avoiding the need for comparisons among all references.

A large number of methods have been proposed in the literature to deal with the author name ambiguity problem [2]. However, there is still room for substantial improvement, mainly in the case where only the minimum amount of information exists in the citations, i.e., each citation contains only author and coauthor names, work title, and publication venue title. Moreover, the best methods in the literature are usually supervised, requiring very costly training data.

³ (as of Jan. 2013). ⁴ (as of Jan. 2013).

In this paper we propose a novel disambiguation method which requires only a small manual intervention from the DL administrator. Indeed, the feedback provided by the administrator is given only in a few iterations, applied to the most ambiguous cases. The method detects these cases by exploiting a combination of automatically generated similarity functions. These functions are discovered using genetic programming (GP) and a state-of-the-art classifier (optimum-path forest (OPF) [3,4]). The classifier uses these similarity functions to decide whether or not citations belong to the same author. As we shall see in our experiments, our proposed method drastically reduces the amount of manual intervention. Furthermore, it outperforms some state-of-the-art disambiguation methods, including one that also exploits some feedback from the user.

The rest of this article is organized as follows. Section 2 discusses related work. Section 3 describes our proposed approach. Section 4 presents the results of our experimental evaluation. Finally, Section 5 presents our conclusions and offers possible directions for future work.

2. RELATED WORK

This section presents an overview of related work on author name disambiguation methods and relevance feedback.
2.1 Author Name Disambiguation Methods

Most of the automatic name disambiguation methods proposed in the literature adopt solutions that may be classified according to the main type of approach they exploit. In [2], the authors classify these approaches as performing either author grouping or author assignment.

Author grouping methods try to group the references to the same author using some type of similarity among reference attributes [5–20]. They do so by combining a clustering technique with a similarity function. The similarity function is applied to the attributes of the references (or groups of references), and the clustering technique uses it to decide whether or not to place references in the same group. This similarity function may be predefined (based on existing ones and depending on the type of the attribute) [5,8,11,12,17,20], learned using a supervised machine learning technique [6,9,15,16,21], or extracted from the relationships among authors and coauthors, usually represented as a graph [7,19,22].

Author assignment methods aim at directly assigning the references to their respective authors [18,23–28]. These methods use either a supervised classification technique [18,23,27,28] or a model-based clustering technique [24–26]. They infer models that represent the authors (for instance, the probabilities of an author publishing an article with other (co-)authors, in a given publication venue, and using a list of specific terms in the work title).

2.2 Relevance Feedback

Relevance Feedback (RF) has been used mainly in search tasks in Information Retrieval [29]. In such tasks, the results obtained from a given search are evaluated as relevant or not by the user who issued the query. This information is given as feedback to the system, which modifies the original query with terms belonging to the indicated documents.
The new modified query is then used to retrieve new documents, and this process continues until the user is satisfied or gives up.

Some types of user feedback in author name disambiguation tasks were proposed in [28,30,31]. Wang et al. [30] proposed ADANA (Active Name Disambiguation), which works in an interactive mode. It asks the user for corrections instead of passively waiting for user inputs. For this, it chooses a few potentially erroneous disambiguated results after a disambiguation process has been run. The active selection aims at minimizing the number of interactions needed to maximize effectiveness. Notice that, in order to obtain good results, the authors make use of a lot of additional information, such as affiliation and bibliographic references, that is usually not available. We, on the other hand, focus only on the minimum amount of information in a citation (i.e., author and coauthor names, and publication and venue titles).

Li et al. [31] proposed the use of a Perceptron as a classifier. They incorporated the user feedback of a scientific cooperation network as features and constraints in order to improve the disambiguation performance. Initially, they use only features extracted from the publication and from the Web, such as homepages. After the first disambiguation, the users look for possible problems. The feedback features and constraints are then used by a classification system, jointly with the other features, to revise the classification model. That work differs from ours because it also uses other attributes beyond those commonly found in all publications. Moreover, it attempts to solve only the homonym problem and does not suggest possible errors (e.g., ambiguous references) to guide user feedback.

Ferreira et al. [28] extended a self-training author name disambiguator [18,27] by incorporating user feedback. It works in two steps. In step 1, the references are grouped in clusters using the coauthorship graph.
Some of these clusters are automatically selected and their references are used as initial training examples in the next step. The remaining references compose the test set. Step 2 uses an associative classifier that is capable of detecting new authors. This classifier works in two phases. In the first phase, it selects a percentage of the most doubtful predictions and asks the administrator to assign them to the correct authors (user feedback). In the second phase, the doubtful predictions whose labels were defined by the user are inserted into the training set and the authors of the references in the test set are predicted. Our method instead works iteratively, i.e., users may provide feedback on ambiguous references across several iterations. Furthermore, we propose to learn reference similarity functions that are employed in the assignment of references to authors by the OPF classifier. The state-of-the-art methods described in [18,28] are used as baselines in our evaluation protocol.

Relevance feedback (RF) approaches have been proposed based on the OPF classifier [32–34]. Those initiatives, however, focus on content-based image retrieval [32,33] and classification [34] tasks. In the approaches presented in [32,34], GP is also used to determine edge weights in OPF graphs. Those RF methods, however, adopt a binary classification model in which each image is classified as either relevant or non-relevant. In our case, we have a multiple-class problem, because references may be assigned to different authors. Multiclass classification is intrinsically a much more complex problem. First, the definition of suitable training sets, in terms of coverage and size, is much harder. Furthermore, many classifiers that are successful in binary classification problems fail when used or extended for dealing with multiple classes. In practical situations, the common strategy employed is costly, as it relies on the combination of multiple binary classifiers.

3. PROPOSED METHOD

The proposed disambiguation approach exploits some of the strategies used by both author grouping and author assignment methods. For instance, we use a machine learning framework to learn reference similarity functions and a supervised classification system to assign references to authors. The novelty here lies in the machine learning methods employed to implement each strategy and in the combination of those strategies with learning samples defined through user feedback. We use genetic programming (GP) for learning reference similarity functions and the Optimum-Path Forest (OPF) classifier for labeling references. The combination of OPF and GP has been very effective in relevance feedback approaches [32], yielding better results than methods based on using only GP-based similarity functions.

The disambiguation method consists in partitioning a set of references to authors $R = \{r_1, r_2, \ldots, r_m\}$ into a set $S = \{s_1, s_2, \ldots, s_n\}$, so that each partition block $s_i$ has all the references to the same author.

We present an interactive approach for the name disambiguation problem. Our strategy relies on the use of a relevance feedback scheme, which enables the correct assignment of ambiguous references to their corresponding authors. We propose a two-step approach to perform this partition, as illustrated in Figure 1. The first step is unsupervised and aims at automatically creating training examples to be used in the following step (module labeled A in Figure 1). The second step follows a relevance feedback scheme, and is supervised and interactive (dashed box B). Users interact with the disambiguation system by providing feedback on ambiguous references (arrow labeled D). The disambiguation system, in turn, trains a classifier based on the user's feedback and the training examples obtained in the first step. The system then uses the generated classification model to relate each input reference to an author.
The training and classification phases are performed in module C. This second step is executed iteratively until a stopping criterion is reached (e.g., a pre-defined number of iterations). The following subsections present both steps in more detail.

Figure 1: Proposed two-step name disambiguation method.

3.1 Unsupervised Step

This step is inspired by the method proposed in [18,28] to automatically generate "pure" clusters of references to be used as a training set. Two references within the same cluster should be very similar to each other while dissimilar to references in other clusters. A cluster is pure if it contains mostly (ideally only) references to the same author.

Ferreira et al. [18,28] proposed the use of coauthorship graphs to create pure clusters: two references belong to the same cluster if they have at least one coauthor in common. This is based on the observation that two authors with ambiguous names will very rarely have two different coauthors who also have ambiguous names. That strategy, however, may lead to fragmented clusters, i.e., distinct references to the same author may be placed into different clusters. The authors proposed a strategy based on both the size of clusters and their similarities to discard fragmented clusters [18,28] and thus compose the training data for the next step of the method.

3.2 Supervised Step

Figure 2 summarizes the supervised step. The elements of this figure are described in this section. Initially, the disambiguation system learns appropriate similarity functions based on (I) the pure clusters produced by the unsupervised step described before (arrow 1 in Figure 2), and (II) the user feedback (arrow 6), i.e., cluster assignments of ambiguous references.
A classifier then uses these similarity functions to determine the authorship of the remaining ambiguous references (arrow 5).

The process of learning similarity functions is implemented using a Genetic Programming (GP) framework (module II in Figure 2). GP is an evolutionary mechanism widely used in optimization problems. Solutions of a target problem are represented as individuals of a population that evolve along generations by means of genetic operations (e.g., crossover, mutation, and reproduction). That process continues until a stopping criterion is reached. At the end, the best solutions (individuals) are selected. In our problem, an individual encodes a reference similarity function, which can be used to compute the similarity between two references (arrow 4). We chose GP for several reasons, including (i) high effectiveness in previous studies on similarity function discovery [35–41], and (ii) ability to find near-optimal solutions in large search spaces.

In order to use the disambiguation system in practical situations, we have to employ a fast and robust learning approach. We adopt the Optimum-Path Forest (OPF) classifier [3,4] (module I), a graph-based technique, which has some advantages with respect to traditional classifiers: (i) it is parameter-free, (ii) it does not assume any a priori separability of objects in the feature space, and (iii) it natively handles multiclass classification. OPF models the classification task as a graph partition problem. The graph is partitioned into optimal discrete regions called Optimum-Path Trees (OPTs), which are rooted at some representative class objects (prototypes). The optimality criterion is given by the path-cost function, which guides the competition process among the prototypes. In our interactive disambiguation system, each OPF class represents a cluster of references to the same author. The similarity score defined by GP individuals weights the arc linking two references.
Figure 2: Supervised step of the disambiguation system.

3.2.1 Discovery of the Reference Similarity Function based on GP

We propose the use of GP to discover appropriate reference similarity functions. A GP individual is represented as a binary tree. Every internal node of the tree is an operator and every leaf node, known as a terminal, represents either a variable or a constant. An example of a GP individual is presented in Figure 3. In this example, an individual combines the values of three distinct features ($d_1$, $d_2$, and $d_3$), each a similarity value or distance (e.g., cosine similarity) between two references $r_1$ and $r_2$, into a single score value $d(r_1, r_2)$. This individual corresponds to the function:

$$f(d_1(r_1,r_2), d_2(r_1,r_2), d_3(r_1,r_2)) = \frac{d_1(r_1,r_2) \times d_2(r_1,r_2)}{d_3(r_1,r_2)} - d_1(r_1,r_2) + d_2(r_1,r_2).$$

The higher the score $d(r_1, r_2)$, the higher the probability of references $r_1$ and $r_2$ belonging to the same author.

Figure 3: Example of GP individual.

Using this representation model, an initial population of individuals is created in a random fashion. Once an initial population is generated, the evolutionary process starts. The first step is to assess the quality of the solution produced by each individual. For that, a fitness function is used to assign a fitness score to each individual. Next, individuals associated with the highest fitness values are selected to construct the subsequent generation. After that, a new population is created by applying to each of those individuals genetic transformation operations, such as reproduction, crossover, and mutation.

The reproduction operator selects the best individuals and copies them to the next generation.
The crossover operator combines the genetic material of two parents by swapping a subtree of one parent with a part of the other. The mutation operator selects a point at random in the tree of an individual and replaces the existing subtree at that point with a new randomly generated subtree.

In this paper, we adopt the GP framework proposed in [38] for combining reference similarity functions. The overall framework is presented in Algorithm 1. Starting with a set of training data containing labeled references, GP first operates on a large population of random combination functions. These combination functions are then evaluated. If the stopping criterion is not met, genetic transformations are executed in order to create and evaluate the next generation iteratively. Finally, after a predefined number of generations, the individual with the highest fitness value is selected.

Algorithm 1 GP Framework
  Generate an initial population of random similarity trees
  while number of generations ≤ N_gen do
    Calculate the fitness of each similarity tree
    Record the top N_top similarity trees
    for all the N_top individuals do
      Create the new population, using the operations of reproduction, crossover, and mutation
    end for
  end while
  Apply the best similarity tree (i.e., the best tree of the last generation) on a set of testing data

The input of the GP framework is a set of reference clusters. This input is converted into a list of pairs of references, considering all possible combinations. A GP individual is used to compute the similarity between two references and this score is used to rank all pairs. A good individual should assign a high similarity score to references that belong to the same author. Suppose that references $r_1$ and $r_2$ belong to the same author, say $a_1$, while references $r_3$ and $r_4$ belong to authors $a_2$ and $a_3$, respectively.
Let $F$ be a function that receives a pair of references $(r_i, r_j)$ as input and returns 1 if $r_i$ and $r_j$ belong to the same cluster (e.g., $F((r_1, r_2)) = 1$) and 0 otherwise (e.g., $F((r_1, r_3)) = 0$). Table 1 presents an example of a ranking of reference pairs according to a GP individual and their respective $F$ function values.

The fitness of an individual is computed as a function of the quality of the ranking of reference pairs. An ideal ranking would have all pairs whose $F$ function returns 1 at the first positions. In that case, the accuracy is maximum, i.e., there is a position in the ranking that separates all pairs that belong to the same author ($F = 1$) from the others. In other cases, the position that maximizes the accuracy can be determined (e.g., the second position in the ranked list in our example).

Table 1: Individual Fitness

  Pairs of references   GP similarity score   F output
  (r2, r3)              0.92                  0
  (r1, r2)              0.89                  1
  ====================================================
  (r1, r3)              0.76                  0
  (r3, r4)              0.69                  0
  (r2, r4)              0.65                  0
  (r1, r4)              0.45                  0

The fitness is defined as the number of pairs correctly classified divided by the number of pairs. In our example, the fitness is equal to 5/6 = 0.83 (where 5 is the number of pairs above the double line that have $F_{output} = 1$ plus the number of pairs below the double line that have $F_{output} = 0$, and 6 is the total number of pairs).

3.2.2 Classification of ambiguous references using the OPF classifier

We use a classification system based on the optimum-path forest (OPF) classifier [3,4] to assign new references to appropriate authors.

Algorithm 2 outlines the OPF training process. The first step refers to the creation of a graph-based representation of training samples. Let $T$ be a training set defined in terms of the pure clusters of references, i.e., each cluster contains references to the same author.
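As a brief aside on the fitness measure of Section 3.2.1, the computation illustrated by Table 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `ranking_fitness` is hypothetical, and allowing the cut to sit above the first pair (so that every pair is predicted as "different authors") is an assumption not stated in the text.

```python
def ranking_fitness(scored_pairs):
    """Best accuracy over all cut positions of the score-ranked pair list.

    scored_pairs: iterable of (similarity_score, f_value), f_value in {0, 1}.
    Pairs ranked above the cut are predicted "same author" (1), the rest 0.
    Assumes at least one pair is given.
    """
    ranked = sorted(scored_pairs, key=lambda p: p[0], reverse=True)
    f_values = [f for _, f in ranked]
    total = len(f_values)
    # Cut above the first pair (an assumption): every pair is predicted 0,
    # so exactly the pairs with F = 0 are classified correctly.
    correct = f_values.count(0)
    best = correct
    for f in f_values:
        # Moving the cut below one more pair flips its prediction to 1.
        correct += 1 if f == 1 else -1
        best = max(best, correct)
    return best / total


# Rows of Table 1: (GP similarity score, F output).
table_1 = [(0.92, 0), (0.89, 1), (0.76, 0), (0.69, 0), (0.65, 0), (0.45, 0)]
fitness = ranking_fitness(table_1)
print(round(fitness, 2))  # 0.83
```

On the Table 1 data, the cut after the second position gives 5 correctly classified pairs out of 6, which matches the value 5/6 reported in the text.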
The set $T$ includes clusters defined in the unsupervised step of the disambiguation algorithm and also references labeled by users along feedback iterations.

Let $G = (T, E)$ be a weighted complete graph, where the set of vertices of $G$ is the set $T$ of references and $E$ is the set of edges. The arcs $(r_i, r_j)$ linking references $r_i$ and $r_j$ are weighted by $d(r_i, r_j)$, a function inversely proportional to the similarity of $r_i$ and $r_j$. The weight $d(r_i, r_j)$ is defined by a GP individual (see Section 3.2.1) that combines reference similarity scores based on the author names, work title, and publication title of references $r_i$ and $r_j$. Figure 4(a) illustrates a training set with 11 references divided into three clusters (three authors). No edges of the complete graph are shown (for improved visibility).

In the next step, a set $S$ of class representatives (prototypes) is obtained by computing a minimum spanning tree (MST) [42] in the complete graph. Let $\lambda(r_i)$ be a function that returns the class label of a reference $r_i$ (e.g., author name). For each arc $(r_i, r_j)$ in the MST, if $\lambda(r_i) \neq \lambda(r_j)$ then $r_i$ and $r_j$ are selected as prototypes and added to the set $S$. Figure 4(b) illustrates the definition of prototypes.

Algorithm 2 OPF Classifier: Training Phase
  Generate the complete graph G = (V, E) using the training set T
  Compute the prototype set S ⊆ V using the MST
  Split the graph into trees (whose roots are the prototypes) by computing the optimum-path forest

For every node $r_i \in T$, we wish to compute a minimum-cost path with terminus $r_i$ and origin in $S$. The idea is to obtain an optimum partition of $T$ such that the different classes represented by the prototypes in $S$ will compete with each other. The use of prototypes, therefore, avoids misclassification caused by an optimum path coming from a different class. Every training node $r_i \in T$ will be assigned to the class of its most strongly connected prototype. This partition is computed as an optimum-path forest $P$, an acyclic graph which stores all optimum paths in a predecessor map $P$. More precisely, for each node $r_i \in T \setminus S$, $P(r_i)$ denotes its predecessor node in the optimum path from $S$, whereas $P(r_i) = nil$ for each node $r_i \in S$. Figure 4(c) illustrates the optimum-path forest generated from the prototypes defined in Figure 4(b).

Figure 4: (a) Example of a training set in a graph with three classes represented by black nodes, white nodes, and white nodes delimited by heavy circles, (b) prototypes (bounded by dashed circles) obtained from the MST, (c) optimum-path forest generated from the prototypes, (d) classification of reference $r_{12}$ through its predecessor node $r_2$ in the optimum-path forest.

Algorithm 3 outlines the OPF classification phase. Let $D = \{C_1, C_2, \ldots, C_k\}$ be a set of $k$ clusters whose references need to be labeled. In the classification phase, for every $r_i \in C_j$ ($C_j \in D$), the optimum path with terminus $r_i$ can be easily identified by finding which training node $r \in T$ provides the minimum-cost path from $r_i$ to any node in $S$. Node $r$ is the predecessor $P(r_i)$ in the optimum path with terminus $r_i$. The reference $r_i$ is then classified as $\lambda(r)$. Figure 4(d) illustrates the classification phase.
In this example, the shortest path from $r_{12}$ to any node in $S$ is $r_{12} \rightarrow r_2 \rightarrow r_{11}$, where the path cost is 0.48; reference $r_{12}$ is therefore assigned to the same class as $r_2$, its predecessor in this path.

3.2.3 Identification of novel authors

The training set may not cover all the possible authors in the ambiguous groups. In order to handle this problem in the classification phase of our approach, we define a cost threshold $\tau$ to guide the creation of new clusters. For each reference $r_i$, we denote by $d(r_i)$ the optimum-path cost from $r_i$ to a prototype $r$. If $d(r_i) \geq \tau$, then a new cluster is created to represent a new author, and reference $r_i$ is assigned to this new cluster.
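The training and classification phases of Section 3.2.2, together with the threshold rule above, can be illustrated with a small Python sketch. It is not the authors' implementation: the function names, the toy one-dimensional "references", and the absolute-difference distance are all hypothetical. The path-cost function is not defined in this excerpt, so it is passed in as the `combine` parameter; the default `max` (the maximum arc weight along the path, a common choice in the OPF literature) is an assumption.

```python
def mst_edges(nodes, dist):
    """Prim's algorithm on the complete graph; returns the MST arcs."""
    in_tree, edges = {nodes[0]}, []
    while len(in_tree) < len(nodes):
        u, v = min(((a, b) for a in in_tree for b in nodes if b not in in_tree),
                   key=lambda e: dist(*e))
        edges.append((u, v))
        in_tree.add(v)
    return edges


def train_opf(labels, dist, combine=max):
    """labels: dict reference -> author. Returns (cost, predecessor) maps."""
    nodes = list(labels)
    # Prototypes: endpoints of MST arcs joining references of different authors.
    prototypes = {n for u, v in mst_edges(nodes, dist)
                  if labels[u] != labels[v] for n in (u, v)}
    if not prototypes:  # single-author training set: fallback (assumption)
        prototypes = {nodes[0]}
    cost = {n: (0 if n in prototypes else float("inf")) for n in nodes}
    pred = {n: None for n in nodes}
    done = set()
    while len(done) < len(nodes):
        # Dijkstra-like propagation of optimum paths from the prototypes.
        u = min((n for n in nodes if n not in done), key=cost.get)
        done.add(u)
        for v in nodes:
            if v not in done:
                c = combine(cost[u], dist(u, v))
                if c < cost[v]:
                    cost[v], pred[v] = c, u
    return cost, pred


def classify(ref, labels, cost, dist, tau, combine=max):
    """Author of the best predecessor, or None for a novel author."""
    best = min(labels, key=lambda r: combine(cost[r], dist(r, ref)))
    path_cost = combine(cost[best], dist(best, ref))
    return None if path_cost >= tau else labels[best]


# Toy data (hypothetical): references are points on a line.
labels = {0: "A", 1: "A", 2: "A", 10: "B", 11: "B"}
dist = lambda a, b: abs(a - b)
cost, pred = train_opf(labels, dist)
print(classify(3, labels, cost, dist, tau=5))   # A
print(classify(50, labels, cost, dist, tau=5))  # None
```

In this toy run the prototypes are the references 2 and 10, the only pair joined by an MST arc across authors. Passing `combine=lambda a, b: a + b` switches the sketch to an additive path cost.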