A Unified Approach for Schema Matching, Coreference, and Canonicalization

Michael Wick (mwick@cs.umass.edu), Khashayar Rohanimanesh (khash@cs.umass.edu), Karl Schultz (kschultz@cs.umass.edu), Andrew McCallum
Department of Computer Science, University of Massachusetts, Amherst, MA 01003

ABSTRACT

The automatic consolidation of database records from many heterogeneous sources into a single repository requires solving several information integration tasks. Although tasks such as coreference, schema matching, and canonicalization are closely related, they are most commonly studied in isolation. Systems that do tackle multiple integration problems traditionally solve each independently, allowing errors to propagate from one task to another. In this paper, we describe a discriminatively-trained model that reasons about schema matching, coreference, and canonicalization jointly. We evaluate our model on a real-world data set of people and demonstrate that simultaneously solving these tasks reduces errors over a cascaded or isolated approach. Our experiments show that a joint model is able to improve substantially over systems that either solve each task in isolation or with the conventional cascade. We demonstrate nearly a 50% error reduction for coreference and a 40% error reduction for schema matching.
Categories and Subject Descriptors
H.2 [Information Systems]: Database Management; H.2.8

General Terms
Algorithms

Keywords
Data Integration, Coreference, Schema Matching, Canonicalization, Conditional Random Field, Weighted Logic

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
KDD'08, August 24–27, 2008, Las Vegas, Nevada, USA. Copyright 2008 ACM 978-1-60558-193-4/08/08 ...$5.00.

1. INTRODUCTION

As the amount of electronically available information continues to grow, automatic knowledge discovery is becoming increasingly important. Unfortunately, electronic information is typically spread across multiple heterogeneous resources (databases with different schemas, or web documents with different structures), making it necessary to consolidate the data into a single repository or representation before data mining can be successfully applied. However, data integration is a challenging problem. Even the task of merging two databases with similar schemas about the same real-world entities is non-trivial. An automatic system must be able to perform coreference (to identify duplicate records), canonicalization (to pick the best string representation of the duplicate record), and schema matching (to align the fields across schemas).

Coreference and other integration tasks have been studied almost exclusively in isolation, yet the individual problems are highly correlated. As an example, consider the two different data records of a person named John Smith in Table 1. Each data record is represented using a different schema.
In this example, knowing that the Contact attribute from schema A maps to the Phone attribute from schema B provides strong evidence that the two mentions are coreferent, indicating that schema matching is a valuable precursor to coreference. However, knowing that the two John Smith mentions are coreferent provides strong evidence about which fields should be matched across the schemas (for example, the FirstName and LastName attributes of schema A should be mapped to the Name attribute of schema B). The high correlation of these two tasks indicates that a cascaded approach, where one task must be solved before the other, is likely to lead to gratuitous error propagation.

To motivate the idea further, consider the task of canonicalization, the process of creating a single standardized representation of a record from several different alternatives. The result of canonicalization on a set of records is a single record containing a high density of information about a real-world entity. Intuitively, these canonical representations of entities contain valuable evidence for coreference. We would like to exploit this entity-level information, yet canonicalization assumes coreference has already been performed.

  Schema A                       Schema B
  FirstName   John               Name    John Smith
  MiddleName  R.                 Phone   1-(222)-222-2222
  LastName    Smith
  Contact     222-222-2222

Table 1: Two records from different schemas representing the same John Smith.

In this paper we investigate a unified approach to data integration that jointly models several tasks and their dependencies. More precisely, we propose a conditional random field for simultaneously solving coreference resolution, record canonicalization, and schema matching. As described in Section 3.1, one particular feature of our model is that it automatically discovers the top-level canonical schema. We use first-order logic clauses for parameter tying, effectively combining logic and probability in a manner similar to [24, 7, 19].
Exact inference and learning in these models are intractable, thus we present approximate solutions to both of these problems. Our approximations prove to be effective, allowing us to achieve almost a 50% reduction in error for coreference and a 40% error reduction in schema matching over non-joint baselines.

2. RELATED WORK

2.1 Coreference Resolution

Coreference is a pervasive problem in data integration that has been studied in several different domains. The ACE and MUC corpora have helped initiate a line of research on newswire coreference, beginning with approaches that examine mention pairs [25, 18, 12] to more complicated models that reason over entire sets of mentions [7]. Person disambiguation, another form of coreference, has also been studied in detail in [21, 11, 10, 26, 14, 4, 1]. However, these works only resolve coreference between objects of the same representation (e.g., database schema). The coreference problem we tackle involves objects that have different representations, making direct comparisons between these objects difficult (for example, we may not know in advance that Phone from schema A maps to Contact in schema B). The coreference portion of our model factorizes over sets of mentions and incorporates first-order logic features, making it most similar to Culotta et al. [7].

2.2 Canonicalization

Since records can refer to the same underlying entity in multiple ways (common aliases, acronyms and abbreviations), it is often necessary to choose a single and standardized representation when displaying the result to a user, or storing it compactly in a database. Additionally, because the canonical record is constructed from multiple records, it contains a high density of information about the entity, making it a convenient source of evidence for coreference resolution.

Canonicalization has played an important role in systems that perform coreference and database merging.
Traditionally, it is performed post hoc and often relies on metrics for evaluating distances between strings. An example of canonicalization in a database merging task is Zhu and Unger [28], who obtain positive results by learning string edit parameters with a genetic algorithm. McCallum et al. [17] extend usual edit distance models with a conditional random field, demonstrating more accurate distance evaluations on several corpora; however, they do not apply their string distance model to the problem of canonicalization. Other approaches include Ristad and Yianilos [23], who use expectation maximization to learn the parameters of a generative model that defines a string in terms of the string edit operations required to create it. This work is extended and applied successfully to record deduplication by Bilenko and Mooney [2]. Recently, Culotta et al. [6] describe several methods for canonicalization of database records that are robust to noisy data and customizable to user preferences (e.g., a preference for acronyms versus full words).

2.3 Schema Matching

Schema and ontology mapping are fundamental problems in many database application domains such as data integration, E-business, data warehousing, and semantic query processing. In general we can identify two major challenges with the schema matching (and ontology mapping) problem: (1) structural heterogeneity and (2) semantic heterogeneity. The former concerns the different representations of information, where the same information can be represented in different ways; this is a common problem with heterogeneous databases. The latter deals with the intended meaning of the described information.
More specifically, we can identify the following differences between schemas [27]: (1) structural conflicts concerned with different semantic structures; (2) naming conflicts, where different attribute names may be used for the same type of information, or the same name for slightly different types of information; (3) conflicts where different formats may be used to represent the values of attributes (for example, different units, different precision, or different abbreviation styles). This problem has been extensively studied, primarily by the database and machine learning communities [20, 13, 15, 8, 9, 27] (for a survey of the traditional approaches refer to [22]).

Our model is able to reason about all three kinds of conflicts mentioned above, based on a set of first-order logic features employed by the CRF modeling the task. Our approach also differs from previous systems in that schema matching is performed jointly along with coreference and canonicalization, resulting in a significant error reduction as we will see in Section 5. One important aspect of our model is that it will automatically discover the top-level canonical schema for the integrated data, as will be demonstrated in Section 3.1.

3. PROBLEM DEFINITION

We seek a general representation that allows joint reasoning over a set of tasks defined on a set of possibly heterogeneous objects in the world. In our unified data integration approach we aim for a representation that enables us to perform inference and learning over two different types of objects: (1) data records (in relation to the coreference resolution and canonicalization tasks); (2) schema attributes (in relation to the schema matching task).
In abstract terms, our model finds solutions to this problem in terms of a set of partitions (clusters) of data records, where all the records within a particular partition are coreferent and canonicalized; and also a set of partitions (clusters) of the schema attributes across different databases, where all the attributes within a schema partition are mapped together. As we will see shortly, although data record clusters and schema clusters are kept disjoint in terms of the type of objects they contain, they are tied together through a set of factors that compute different features associated with each task. Thus, inference is performed jointly over all three tasks. In the next section we describe a graphical model representation of this approach in more detail.

3.1 Model

We use a conditional random field (CRF) [16] for jointly representing the schema matching, coreference resolution, and canonicalization tasks as follows:

• Let D = {D_1, ..., D_k} denote a set of k databases. Each database D_i is represented as a 2-tuple D_i = <S_i, R_i>, where S_i is a database schema, and R_i = {r_ij} (j = 1, ..., n_{D_i}) is a set of data records generated using the schema S_i. Each schema S_i is represented by a set of attributes {a_ij} (j = 1, ..., n_{S_i}), where a_ij is the j-th attribute of the schema S_i. We use S = {S_1, ..., S_k} to represent the complete set of schemas, and A = {a_ij} to denote the complete set of schema attributes across all schemas.

• Let X = {x_1, ..., x_n} ∪ A denote a set of observed variables, where x_i is a data record drawn from some database D_l, and A is the set of all schema attributes {a_ij}.

• Let X_i denote a particular grouping of the observed variables X, constrained so that all the variables in a cluster X_i denote the same type of object (all the observed variables in a cluster X_i are either data record objects, or schema attribute objects).
For clarity we use X^r_i to denote a cluster of data records, and X^s_j to denote a cluster of schema attributes.

• Let Y = {y_1, ..., y_m} denote a set of unobserved variables that we wish to predict. In this paper we focus only on a particular class of clustering models (in general, the inference and learning methods could be applied to a variety of model structures [5]) where the variables y_i indicate some compatibility among clusters of X variables (i.e., clusters X^r_i and X^s_j). We employ three types of Y variables: (1) variables that indicate the compatibility among instances within a cluster of X variables; (2) variables that indicate the compatibility between a pair of clusters of the same type (compatibility between two clusters of data records, or two clusters of schema attributes); (3) variables that indicate the compatibility between a pair of clusters of different types (compatibility between a data record cluster and a schema attribute cluster). Note that this representation renders an exponential complexity when instantiating the Y, X^r_i, and X^s_j variables (e.g., |Y| = O(2^|X|), |X^r_i| = O(2^n), and |X^s_j| = O(2^|A|)).
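To make this notation concrete, the two types of observed objects and the two disjoint clusterings over them can be sketched with simple containers. This is purely an illustrative data layout (all names here are our own, not from the authors' implementation):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Attribute:
    schema_id: str   # which schema S_i this attribute a_ij belongs to
    name: str        # e.g. "FirstName"

@dataclass(frozen=True)
class Record:
    db_id: str       # which database D_l the record x was drawn from
    values: tuple    # (attribute name, value) pairs under that schema

# The observed variables X are all records plus all schema attributes A.
# A clustering partitions records into coreference clusters X^r_i and
# attributes into schema-matching clusters X^s_j, kept disjoint by type.
@dataclass
class Clustering:
    record_clusters: list = field(default_factory=list)     # list of sets of Record
    attribute_clusters: list = field(default_factory=list)  # list of sets of Attribute

a1 = Attribute("A", "FirstName")
a2 = Attribute("B", "Name")
r1 = Record("A", (("FirstName", "John"), ("LastName", "Smith")))
r2 = Record("B", (("Name", "John Smith"),))

# One coreference cluster and one schema-attribute cluster, as in Table 1.
c = Clustering(record_clusters=[{r1, r2}], attribute_clusters=[{a1, a2}])
```

The frozen dataclasses make records and attributes hashable, so clusters can be ordinary sets; any real system would of course attach the Y compatibility variables and factor scores to this structure.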
Our goal is to learn a clustering X = {X^r_i} ∪ {X^s_j} such that all the data records in each X^r_i are coreferent, and all the schema attributes in each X^s_j bear the correct schema matching.

Next, the conditional probability distribution P(Y | X) is computed as:

  P(Y | X) = (1 / Z_X) ∏_{y_i ∈ Y} f_w(y_i, X_i) ∏_{y_i, y_j ∈ Y} f_b(y_ij, X_ij)    (1)

where Z_X is the input-dependent normalizer, factor f_w parameterizes the within-cluster compatibility, and factor f_b parameterizes the between-cluster compatibility (we use the notation X_ij to denote a pair of clusters X_i and X_j). Note that in our model there are two types of within-cluster factors f_w: those measuring the compatibility within a data record cluster (e.g., X^r_i), and those measuring the compatibility within a schema attribute cluster (e.g., X^s_i). Similarly, there are two types of between-cluster factors f_b: those measuring the compatibility between a pair of homogeneous clusters (two data record clusters, or two schema attribute clusters), and those measuring the compatibility between a pair of heterogeneous clusters (a data record cluster and a schema attribute cluster). We also employ a log-linear model of potential functions (i.e., f_w and f_b):

  f(y_i, x_i) = exp( Σ_k λ_k g_k(y_i, x_i) )

This model can be intuitively described as follows: every possible clustering of the data induces a different set of instantiations of Y variables and possibly gives different assignments to them. The conditional distribution P(Y | X) gives the probability of a configuration Y, measured in terms of a normalized score of how likely that configuration is. We parameterize this score with a set of potential functions that evaluate the compatibility of both within-cluster attributes and between-cluster attributes.

Figure 1: Three example databases: each database uses a different schema to generate a set of data records. Each schema is visually represented as a collection of color-coded objects of the same shape (solid rectangles in schema A, dotted lines in schema B, and hollow rectangles in schema C).

A desirable facet of our model is that it factorizes into clusters of data rather than pairs (Equation 1). This enables us to define features of entire clusters using first-order logic features: features that can universally and existentially aggregate properties of a set of objects [5].

To further illustrate the model, consider the simple example task demonstrated in Figure 1. There are three databases, where each database uses a different schema to generate a set of data records. Within each schema, different attributes are color-coded, and similar color across different schemas may refer to the same attribute concept. Each database consists of a number of data records generated using its own schema (for example, database A contains n_A data records generated using schema A). The goal is to perform joint inference among coreference resolution, canonicalization, and schema matching.

Figure 2 displays a factor graph of the conditional random field modeling the above joint task. There are two levels of clustering processes:

• Schema attribute clusters (top level): Each cluster in this level consists of a subset of the complete set of schema attributes. Note that two or more attributes of the same schema may be placed within the same cluster together with the attributes of other schemas.
For example, one database may use a schema that has an attribute Name to represent the full name of a person, while a second database may use a schema with two different attributes, First Name and Last Name, for representing the same concept. Intuitively, we would like to place all three in the same cluster. Some clusters, such as X_7, may contain a single schema attribute, meaning that it does not match any other schema attribute in the other databases. Note that the set of schema attribute clusters establishes the top-level canonical schema for the integrated data (lightly shaded clusters).

• Data record clusters (bottom level): Each cluster represents a set of coreferent data records. Note that every data record is visually represented as an encapsulation of the schema from which it was generated. For example, cluster X_1 consists of a single data record from database A, a single data record from database B, and two data records from database C. There may also exist clusters that contain a set of data records from the same database (for example, the cluster X_3) due to duplication errors in that database.

• Factors: Although each level of clustering is defined over a single type of object (data records or schema attributes), the two levels are tied using a set of factors. We can identify three types of factors in general: (1) factors that measure compatibility among instances within a cluster (e.g., f_1 or f_4); (2) factors that measure compatibility between pairs of clusters of the same type (compatibility between two clusters of data records such as f_12, or two clusters of schema attributes such as f_67); (3) factors that measure compatibility between pairs of clusters of different types (compatibility between a data record cluster and a schema attribute cluster such as f_34).

Although omitted from Figure 2 for clarity, there are additional canonicalization variables for each attribute in each coreference cluster.
Even though we lack labeled data for canonicalization, we set these variables using a centroid-based approach with default settings for string edit parameters (insert, delete, and substitute incur a penalty of one, and no penalty is given for copy). This method is shown in Culotta et al. [6] to perform reasonably well and to capture many of the desirable properties of a canonical string.

Figure 2: Factor graph representation of the model. There are two clustering processes, one at the level of schema attributes (top level), and one at the level of data records (bottom level). Different factors tie these two processes, which allows for joint inference among different data integration tasks. Note that for clarity of the figure not all the factor names are represented. Note that the top-level clustering also automatically discovers the top-level canonical schema in the integrated data (lightly shaded clusters).

Algorithm 1 Joint Inference
1: Input: coreference clustering C, schema clustering S
2: while not converged do
3:   C ← GreedyAgglomerative(make-singletons(C), S)
4:   S ← GreedyAgglomerative(make-singletons(S), C)
5: end while

Even though we are able to achieve greater expressiveness in our model with cluster-wise first-order features and high connectivity, we sacrifice the ability to apply exact inference and learning methods, since we cannot instantiate all of the Y variables. In Culotta and McCallum [5], approximate inference and parameter estimation methods operate with partial instantiations, where only the difference between two configurations is sufficient to perform learning. Building on these techniques, we briefly demonstrate how learning is performed in this model in the next section.
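The centroid-based canonicalization described above can be sketched directly: with unit costs for insert, delete, and substitute (and zero for copy), the canonical string for a field is the cluster member with the smallest total edit distance to the other members. This is a minimal sketch of that idea, not the authors' implementation:

```python
def edit_distance(s, t):
    # Standard Levenshtein distance: insert, delete, and substitute
    # each cost 1; copying a matching character costs 0.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # delete from s
                           cur[j - 1] + 1,               # insert into s
                           prev[j - 1] + (cs != ct)))    # substitute / copy
        prev = cur
    return prev[-1]

def canonical_string(strings):
    # Centroid: the member minimizing total edit distance to all others.
    return min(strings, key=lambda s: sum(edit_distance(s, t) for t in strings))

print(canonical_string(["John R. Smith", "J. Smith", "John Smith"]))  # -> John Smith
```

Applied per field over a coreference cluster, this selects one existing value rather than synthesizing a new string, which matches the centroid view of canonicalization used here.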
3.2 Inference and Parameter Estimation

Both the joint model and the individual conditional random fields for each subtask are too large to be fully instantiated, making exact training and inference intractable. In this section we describe our approximate training and inference methods in detail.

3.2.1 Training

To learn the parameters for coreference resolution we fix the schema matching to ground truth and fix the canonicalization to a reasonable default. Next, we sample pairs of clusters C_i and C_j and define the binary random variable y_ij = 1 if and only if all the mentions in C_i and C_j are coreferent. Given the fixed schema matching and canonicalization, and a whole set of cluster pairs with their corresponding labels, we set the coreference parameters to maximize the likelihood of the training set by performing gradient descent (regularizing with the usual Gaussian prior). A similar procedure is used to set the parameters for schema matching. However, in this case the coreference ground truth is held fixed, and the label indicates whether or not all the instances in two schema clusters match each other. Canonicalization is not used for the schema-matching task. This training method can be viewed as a piecewise pseudo-likelihood approximation.

3.2.2 Testing

For inference we use a standard greedy agglomerative approximation to each subtask. The algorithm begins with a singleton clustering (each instance is in its own cluster) and greedily merges clusters until no merge scores are above a stopping threshold τ. Joint inference works in rounds, performing greedy agglomerative inference first on coreference, and then on schema matching. The coreference prediction in round i is used to help schema matching in round i, whereas a schema matching prediction from round i is used to help coreference in round i+1. This process can be repeated for a fixed number of iterations.
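The greedy agglomerative procedure can be sketched as follows. Here the merge score is a log-linear compatibility squashed to a probability, so the stopping threshold τ = 0.5 coincides with a binary maximum-entropy decision boundary as described in Section 4.3; the toy feature function and weights are our own illustrative placeholders, not the authors' trained model:

```python
import math

def merge_score(c1, c2, weights, features):
    # Log-linear compatibility of the merged cluster, squashed to (0, 1).
    z = sum(w * g(c1 | c2) for w, g in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def greedy_agglomerative(items, weights, features, tau=0.5):
    # Start from singleton clusters and repeatedly take the best-scoring
    # merge until no pair of clusters scores above the stopping threshold.
    clusters = [frozenset([x]) for x in items]
    while len(clusters) > 1:
        score, a, b = max(((merge_score(a, b, weights, features), a, b)
                           for i, a in enumerate(clusters)
                           for b in clusters[i + 1:]),
                          key=lambda t: t[0])
        if score <= tau:
            break
        clusters.remove(a); clusters.remove(b); clusters.append(a | b)
    return clusters

# Toy cluster-wise feature: all strings in the cluster share a first token.
same_first = lambda c: 1.0 if len({s.split()[0] for s in c}) == 1 else -1.0
out = greedy_agglomerative(["John Smith", "John R. Smith", "Mary Jones"],
                           weights=[2.0], features=[same_first])
```

In the joint setting, one such pass is run over record clusters (using the current schema matching in the features) and another over attribute clusters (using the current coreference clustering), alternating for a fixed number of rounds as in Algorithm 1.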
4. EXPERIMENTS

4.1 Data

Synthetic data: The synthetic data is generated from a small number (10) of user records collected manually from the web. These records use a canonical schema containing attributes such as first name, phone number, email address, job title, institution, etc. Next, we created three new schemas derived from the canonical schema by randomly splitting, merging, or noisifying the attributes of the canonical schema. For example, one schema would contain a Name field whereas another would contain two fields, FirstName and LastName, for the same block of information (perhaps dropping the middle name if it existed). In the training data we used the first two schemas, and in the testing data we used one of the schemas from training and also the third schema. This way we train a model on one schema but test it on another schema. For training we used a small number of user records that were similar in ways such that random permutations could make coreference, schema matching, and canonicalization decisions difficult. We first conformed the records to both schemas, and then made 25-30 copies of each data record for each schema while randomly permuting some of the fields to introduce noise. The testing data was created similarly, but for a different set of data records. The random permutations included abbreviating fields, deleting an entire field, or removing a random number of tokens from the front and/or the end of a field. The result of this was a large number of coreferent records, but with the possibility that the disambiguating fields between different records have been altered or removed.

Real-world data: For our real-world data we manually extracted faculty and alumni listings from 7 university websites, of which we selected 8 lists that had different schemas.
The information in each schema contains basic user data such as first name, phone number, email address, job title, institution, etc., as well as fields unique to the academic domain such as advisor, thesis title, and alma mater. For each name that had an e-mail address we used our DEX [3] system to search the Internet for that person's homepage, and if found we would extract another record. The DEX schema is similar to the university listings data, bringing the total number of different schemas to 9. Of the nearly 1400 mentions extracted we found 294 coreferent entities. Table 2 shows the DEX schema and two of the faculty listing schemas. There are several schema matching problems evident in Table 2; for example, Job Department from the UPenn schema is a superset of both of the Job Title fields. Another example, which occurs numerous times between other schemas, is where the pair of attributes <First Name, Last Name> from one schema is mapped to the singleton attribute Name that denotes the full name.

For each of the experiments we took all of the 294 coreferent clusters, and randomly selected 200 additional mentions that had no coreferent entities. This data was split into training and testing sets, where the only schema shared between training and testing was the DEX schema. This ensures that the possible schema matchings in the training set are disjoint from the possible schema matchings in the test set. The data provides us with a variety of schemas that map to each other in different ways, thus making an appropriate testbed for performing joint schema matching, coreference, and canonicalization.

4.2 Features

The features are first-order logic clauses that aggregate pairwise feature extractions. The types of aggregation depend on whether the extractor is real-valued or boolean. For real-valued features we compute the minimum, the maximum, and the average value over all pairwise combinations of records.
For boolean-valued features we compute the following over all pairwise combinations of records: feature does not exist; feature exists; feature exists for the majority; feature exists for the minority; feature exists for all. Table 3 lists the set of features used in our experiments. In cases where we compute features between records in different schemas, sometimes we only compute the feature between fields that are aligned in the current schema matching. This is indicated by the Matched-fields only column in Table 3. All of these features are used for coreference decisions, but only substring matching is used for schema matching.

4.3 Systems

We evaluate the following three systems with and without canonicalization, for a total of six systems. Canonicalization is integrated with coreference inference by interleaving canonicalization with greedy agglomerative merges (every time a new cluster is created, a new canonical record is computed for it). Canonicalization has no effect on schema matching in isolation and is only relevant when coreference is involved.

Whenever we use greedy agglomerative inference, we set the stopping threshold to τ = 0.5. This is a natural choice as it corresponds to the decision boundary for a binary maximum entropy classifier. Additionally, the joint inference method described earlier is run for four rounds for the joint
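The aggregation scheme of Section 4.2 can be sketched directly: a pairwise extractor is applied to every pair of records in a cluster, and its outputs are summarized with min/max/average (real-valued) or with none/exists/minority/majority/all quantifiers (boolean). The pairwise extractors below are our own made-up examples, and the exact minority/majority cutoffs are an assumption on our part:

```python
from itertools import combinations

def aggregate_real(pairwise, cluster):
    # Summarize a real-valued pairwise extractor over all record pairs.
    vals = [pairwise(a, b) for a, b in combinations(cluster, 2)]
    return {"min": min(vals), "max": max(vals), "avg": sum(vals) / len(vals)}

def aggregate_bool(pairwise, cluster):
    # Quantify a boolean pairwise extractor over all record pairs.
    vals = [pairwise(a, b) for a, b in combinations(cluster, 2)]
    k, n = sum(vals), len(vals)
    return {"none": k == 0, "exists": k > 0,
            "minority": 0 < k <= n // 2, "majority": k > n // 2,
            "all": k == n}

# Illustrative pairwise extractors over simple string records.
token_overlap = lambda a, b: len(set(a.split()) & set(b.split()))
share_token = lambda a, b: bool(set(a.split()) & set(b.split()))

cluster = ["John Smith", "John R. Smith", "J. Smith"]
real_feats = aggregate_real(token_overlap, cluster)
bool_feats = aggregate_bool(share_token, cluster)
```

Each aggregate becomes one first-order feature g_k on the cluster, feeding the log-linear factors of Section 3.1.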