A RECORD LINKAGE PROCEDURE FOR THE MANAGEMENT AND THE ANALYSIS OF THE ITALIAN STATISTICAL BUSINESS REGISTER

Giuseppe Garofalo and Caterina Viviano, ISTAT
Adriano Paggiaro and Nicola Torelli, University of Padua
Nicola Torelli, Dipartimento di Statistica, Via S. Francesco 33, Padova, Italy

ABSTRACT

We consider the application of record linkage to the maintenance of the Italian statistical business register (ASIA), which has been built by integrating different administrative sources. The main goal of the record linkage procedure we developed is to avoid duplicated units and false demographic flows of enterprises. The procedure is quite general: it allows us to control the quality of the individual characteristics that help identify the same business, and to estimate linkage weights under fairly unrestrictive assumptions. Results of a preliminary application of the linkage procedure to an administrative data set are presented and discussed.

Key Words: Unduplication, Continuity, Enterprise demography.

1. INTRODUCTION

Planning and setting up a complete and up-to-date statistical business register must, in order to be economically feasible, exploit all the information on enterprises stored by administrative bodies. The Italian statistical business register (ASIA) has been built by integrating data from various administrative archives and large-scale surveys or censuses. The use of administrative sources offers obvious advantages in terms of costs, time, data availability and burden on enterprises, but raises some definitional and methodological problems. More specifically, the legal population represented by administrative sources typically does not correspond to a meaningful statistical population. Unit definitions differ between administrative and statistical archives. More importantly, the inclusion or exclusion of a unit in administrative archives is frequently due to reasons (e.g.
tax evasion and avoidance) not strictly connected to real changes in the economic activity of the enterprise. Linking legal units to statistical ones, in order to avoid duplications and false demographic flows of enterprises, requires a complex strategy involving: (i) a clear definition of continuity criteria, to establish when changes in statistical units are statistically relevant; (ii) the use of exact matching procedures to decide when data from two different records actually pertain to the same unit.

In this paper, after a concise presentation of the Italian statistical business register, we discuss the relevance of a continuity definition for enterprises with reference to the Italian case (section 2). Section 3 gives a short review of the problems connected with computerised record linkage techniques, and section 4 reports a first application of a fairly general record linkage procedure to the reconstruction of enterprise evolution and the identification of non-demographic flows (spurious demography). The application is limited, at this stage, to data from the Italian fiscal register for a single municipality.

2. CRITERIA FOR ENTERPRISE IDENTIFICATION AND CONTINUITY ANALYSIS

2.1. The Italian Business Register

Since 1995 the Italian National Statistical Institute has been developing a complex project, called ASIA, to set up a statistical business register harmonised with European Community regulations. The first register was completed at the end of 1997, and its quality was checked in 1998 using data from the 1997 intermediate Census, designed as a sample survey. The Italian statistical business register has been built by integrating data from administrative sources.
The main archives are: the fiscal register managed by the Ministry of Finance ([...] records), the registers of enterprises managed by the Chambers of Commerce ([...] records), the social security register ([...] records), the register of insurance against accidents at work ([...] records), the electricity users register and the business telephone numbers register. Administrative data are integrated with statistical data taken from surveys carried out by Istat, usually limited to medium and large enterprises. EU regulations and Eurostat recommendations provide a clear and exhaustive set of concepts and definitions useful in this specific context (for details, see Garofalo and Viviano 1998). The statistical definition of enterprise as the smallest combination of legal units that is an organisational unit, together with the concept of statistical continuity, clearly indicates that the statistical population corresponds to a subset of the legal one.

2.2. Data and Techniques to Identify Unit Relationships

In archives that are at least partially fed by administrative sources, like ASIA, the definition of statistical continuity presupposes the identification of dynamic relationships between administrative entities that are apparently different. These relationships can be identified either through direct surveys or through linkage techniques based on observable attributes of legal units. Surveys have the usual problems of cost and burden and, when information on connections between legal units is taken directly from administrative files, coverage may be partial. To be useful, linkage techniques should be designed around the data actually available, and should be based on a set of rules according to which units sharing identification characteristics or relevant attributes are linked to one another. With reference to Italian data, three strategies can be used:

1. Identification of links between legal units through employee flows.
This technique is based on the analysis of employee flows between two (or more) units and can be used in the analysis of mergers and demergers. It rests on the fact that if enterprise b takes over enterprise a, a flow of (almost) all employees from a to b will be observed, with the implicit assumption that it is possible to discriminate between physiological movements, produced by employees' choices, and spurious ones caused by transitions between enterprises. To use this technique, individual longitudinal data on employees and cross-sectional data on the relationships between employees and enterprises are needed. In Italy this information is available only in the social security register (Pacelli and Revelli 1995).

2. Identification of links between legal units through the analysis of enterprise ownership. If the same person appears as owner of more than one legal unit, this could imply that two legal units correspond to the same statistical unit, or that a spurious opening or closure of activities has occurred. For Italian data, this solution could be used with information from the Chambers of Commerce registers.

3. Identification of links between legal units through the analysis of attribute similarity. With this technique, links between legal units are reconstructed on the basis of the similarity of attributes such as enterprise name, location, economic activity, size and juridical status.

The choice among these techniques depends on purposes and data availability, as well as on the cost and time of data processing. Whereas the technique based on employee flows has given good results in the analysis of spurious demography for medium and large enterprises, it cannot be used for smaller enterprises or for enterprises without employees, which make up the largest part of the ASIA population.
Exact matching techniques can be particularly useful in case 3), and this is the route pursued in the remainder of the paper.

2.3. The Continuity Criterion

The analysis of links and relationships among units over time is an important issue both in a static context, where updates to enterprise data arrive with delays, and in a dynamic context, for micro-economic research. Enterprise demography traditionally plays a central role in competition and production theories, in the analysis of entrepreneurship supply, and as a tool in job creation and job turnover studies. Data used to explore these themes are greatly affected by false flows of units (spurious demography). Defining a reasonable continuity criterion, that is, the condition under which different units are deemed to be the same over a given time period, is crucial. An enterprise is recognised by a specific set of resources, functions and products, and possible changes in their combinations should be considered when defining a continuity concept.

The continuity concept proposed by Struijs and Willeboordse (1995) has been widely accepted at an international level. Eurostat suggests some practical criteria based on combinations of changes occurring in some characteristics recorded in business registers. According to this concept, an enterprise is considered to be the same over time if it changes without any significant change in its identity in terms of its set of production factors (employment, machines, raw material, capital management, buildings, etc.). Measuring the continuity of all production factors and weighting them can be quite difficult and costly. For this reason Eurostat suggests, as a practical criterion to identify the enterprise, using those characteristics available in the register that can be assumed to be correlated with the most important production factors.
The suggested empirical rule is that an enterprise is not considered to be the same if at least two of the following three characteristics change:

a) Legal unit controlling the enterprise: continuity of the management of the enterprise may be assumed to be positively correlated with continuity in the control of the legal unit.

b) Economic activities carried out by the enterprise: continuity of the four-digit NACE Rev.1 code of the main activity may be assumed to be positively correlated with the continuity of production factors such as employment, machines and equipment.

c) Locations where activities are carried out: continuity of locations is of course closely linked to the continuity of the land and buildings used by the enterprise.

Under the suggested rules, an element of discontinuity is introduced when changes are large and rapid. The concrete applicability of such rules must be evaluated against the economic structure in which they are applied, because of the peculiarities of each country. For instance, in some domains of study, such as the demography of very small enterprises, it does not make sense to separate the juridical subject (the entrepreneur) from the statistical subject (the enterprise). In such cases a new controlling legal unit is a factor of discontinuity even if it is the only characteristic to change (Garofalo and Viviano 1999).

3. RECORD LINKAGE PROCEDURE FOR INTEGRATING DATA FROM BUSINESS REGISTERS

Business data often require statistical methods to decide whether two records contain data on the same unit; the decision is based on information from common identifiers such as name, address or individual codes. The most general formulation of a record linkage procedure dates back to the seminal paper by Fellegi and Sunter (1969). An excellent review of more recent developments in this field, with reference to their possible use in managing business registers, is in Winkler (1995).
For our purposes, a short review of the topic is given here, so that the main solutions and criteria adopted in our empirical application can be described more efficiently.

Let A and B be two files containing records a and b respectively. The set $A \times B = \{(a, b) : a \in A, b \in B\}$ can be partitioned into a set M of pairs representing the same business entity and a set U of pairs representing different entities. If $A = B$, we have a problem of unduplication, and the set M contains all the duplicates in the original file. The size of the files usually considered seldom allows an explicit comparison of all pairs of records, so usually only pairs with some common characteristics are actually compared, by using blocking criteria. A record linkage procedure is then characterised by a decision rule that assigns each compared pair either to M or to U (for some pairs no decision is taken, and a set of possible links, usually left to clerical review, is defined).

The link/non-link decision is based on a matching weight, which is assigned to each compared pair according to the result of a comparison of some matching variables present in both records. A crucial choice is the definition of agreement in those comparisons, which ranges from a simple agreement-disagreement dichotomy to a complex definition taking into account the specific values of the variables. The results of the comparison can be collected in a vector $\gamma_j$ defining the agreement pattern on the i-th variable for the j-th of the N pairs deriving from the blocking criterion:

$$\gamma_j = [\gamma_j^1, \gamma_j^2, \ldots, \gamma_j^i, \ldots, \gamma_j^I], \qquad j = 1, \ldots, N.$$

A weight $w_j$ is then associated with every possible outcome $\gamma_j$, leading to a decision rule that depends on two thresholds $K_l < K_u$: if $w_j \ge K_u$ the pair is declared matched, if $w_j \le K_l$ the pair is declared non-matched, while if $K_l < w_j < K_u$ the decision is deferred to further analysis.
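
As a minimal illustration of the two-threshold rule, the following Python sketch assigns a pair to one of the three regions from its weight. The function names, the region labels, and the treatment of the boundary cases as inclusive are assumptions made for this example, not part of the procedure described in the paper.

```python
from typing import List


def classify_pair(weight: float, k_lower: float, k_upper: float) -> str:
    """Assign one compared pair to a decision region from its matching weight.

    Hypothetical helper: labels and inclusive boundaries are assumptions.
    """
    if k_lower >= k_upper:
        raise ValueError("k_lower must be strictly smaller than k_upper")
    if weight >= k_upper:
        return "match"            # pair assigned to M
    if weight <= k_lower:
        return "non-match"        # pair assigned to U
    return "clerical review"      # decision deferred to further analysis


def classify_pairs(weights: List[float], k_lower: float, k_upper: float) -> List[str]:
    """Apply the same rule to every compared pair."""
    return [classify_pair(w, k_lower, k_upper) for w in weights]
```

Moving the two thresholds apart widens the clerical review region and reduces both kinds of misclassification; moving them together does the opposite.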
According to this rule, two kinds of error can occur: (a) false matches, i.e. non-matched pairs erroneously assigned to M; (b) false non-matches, i.e. matched pairs which are assigned to U (or left outside the defined comparison blocks). Estimating false match and false non-match rates is important for making the specific choices in the procedure, such as the blocking criteria and the thresholds in the decision rule.

The crucial step in implementing a record linkage procedure is the estimation of the matching weights. A probabilistic procedure can be used to estimate the value of the latent variable G (G = 1 if a pair is in M, G = 0 if the pair is in U; $g_j$ denotes its value for the j-th pair) given the information coming from the comparison of the matching variables. The estimation of the weights is usually related to the original formulation by Fellegi and Sunter (1969), with a ratio of probabilities of the form:

$$w_j = \ln \frac{P(\gamma_j \mid M)}{P(\gamma_j \mid U)} = \ln \frac{m_j}{u_j}.$$

Fellegi and Sunter showed that these weights lead to an optimal rule, in that for any pair of fixed thresholds the clerical review region is minimised.

Estimation is difficult because we usually cannot know which pairs belong to M and which to U. A possible solution comes from the use of iterative methods like the EM algorithm, by which we can use imputed samples to estimate the parameters of interest. Let p be the unknown proportion of matched pairs; the likelihood function is:

$$L(m, u; p) = \prod_{j=1}^{N} \left[ P(M)\, P(\gamma_j \mid M) \right]^{g_j} \left[ P(U)\, P(\gamma_j \mid U) \right]^{1 - g_j} = \prod_{j=1}^{N} \left[ p\, m_j \right]^{g_j} \left[ (1 - p)\, u_j \right]^{1 - g_j}.$$

Given the values of m, u and p, the E step of the algorithm consists in estimating the latent variable $g_j$ for each pair by its expected value (with a direct logit link to the weights $w_j$ defined by Fellegi and Sunter):

$$E(g_j \mid m, u, p) = \frac{p\, m_j}{p\, m_j + (1 - p)\, u_j} = \frac{m_j / u_j}{m_j / u_j + (1 - p)/p} = \frac{e^{w_j}}{e^{w_j} + (1 - p)/p}.$$

In the M step, the likelihood is maximised over m and p given G. Following Jaro (1989), the probabilities u are estimated outside the iterative algorithm on a randomly chosen sample of pairs.
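
The E and M steps above can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in the application: it assumes dichotomous (agree/disagree) comparisons and conditional independence among the matching variables, keeps the u probabilities fixed as in Jaro (1989), and clips the estimated m probabilities away from 0 and 1 so that the weights stay finite. All function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Pattern = Sequence[int]  # one 0/1 agreement indicator per matching variable


def pattern_prob(gamma: Pattern, probs: Sequence[float]) -> float:
    """P(gamma | class), assuming independent dichotomous comparisons."""
    out = 1.0
    for agree, q in zip(gamma, probs):
        out *= q if agree else 1.0 - q
    return out


def fs_weight(gamma: Pattern, m: Sequence[float], u: Sequence[float]) -> float:
    """Fellegi-Sunter weight w = ln(m / u) for one agreement pattern."""
    return math.log(pattern_prob(gamma, m) / pattern_prob(gamma, u))


def em_estimate(
    patterns: Sequence[Pattern],
    u: Sequence[float],        # fixed, each value strictly between 0 and 1
    m_start: Sequence[float],  # starting values for the m probabilities
    p_start: float,            # starting value for the proportion of matches
    n_iter: int = 200,
    eps: float = 1e-6,         # clipping bound (an assumption of this sketch)
) -> Tuple[List[float], float, List[float]]:
    """Estimate m and p by EM with u held fixed; returns (m, p, g)."""
    m, p = list(m_start), p_start
    g: List[float] = []
    for _ in range(n_iter):
        # E step: expected value of the latent indicator for each pair.
        g = []
        for gamma in patterns:
            pm = p * pattern_prob(gamma, m)
            pu = (1.0 - p) * pattern_prob(gamma, u)
            g.append(pm / (pm + pu))
        # M step: update p and m given the expected indicators.
        total = sum(g)
        p = total / len(patterns)
        m = [
            min(1.0 - eps, max(eps, sum(gj * gamma[i] for gj, gamma in zip(g, patterns)) / total))
            for i in range(len(u))
        ]
    # g holds the E-step values computed in the final iteration.
    return m, p, g
```

Running `em_estimate` from several starting values for `m_start` and `p_start` gives a simple check on how sensitive the estimates are to initialisation.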
In the estimation of the m and u probabilities, independence is often assumed among the probabilities $m^i$ and $u^i$ of observing the single outcome $\gamma^i$ in comparisons between pairs in M and in U respectively. Note that in many applications this assumption cannot be considered realistic.

4. DATA AND RESULTS

4.1. Data

An exploratory application of the record linkage procedure has been carried out on a small data set referring to the municipality of Pesaro. The data set contains 9420 records collecting administrative information on enterprises from the fiscal register in [...]. The archive contains some identifying information that can be used for an unduplication analysis aimed at identifying records that pertain to the same units, thus allowing a proper analysis of enterprise demography. The identifying variables are:

1. An alphanumeric code (CF: codice fiscale) which uniquely identifies enterprises. The format of the code depends on whether it is associated with individual enterprises (alphanumeric 16-character code) or with partnerships and companies (numeric 11-digit code).

2. Full NAME of the enterprise, as a string of up to 40 characters (ragione sociale). The name may consist of one first name and surname (individual enterprise), of several names and surnames, or of other denominations often related to the actual economic activity of the enterprise. Moreover, the names of non-individual enterprises contain other words specifying, for instance, the type of company.

3. Address, in a 5-digit code (ADD).

4. Economic activity of the enterprise (ATECO), in a 5-digit code (the NACE Rev.1 code plus a fifth digit).

4.2. Agreement Definition and Estimation Strategy

Even with a relatively small data set, a direct comparison of all possible pairs of records is unfeasible. In this application we used a blocking strategy that reduces the number of comparisons and is related to the definition of continuity already outlined in the paper.
The only pairs of records considered are those with full agreement on at least one of three variables: name of the enterprise, address, or economic activity. Moreover, pairs of records which could never be associated with one another are excluded from the analysis: to avoid the wrong linkage of spurious homonyms, no pair is chosen if both records pertain to individual enterprises, with the only exception of pairs with perfect agreement on the CF code. The number of compared pairs following this complex blocking structure is about [...].

A first exploratory step was carried out to identify the best way to define different levels of agreement for every variable. A key variable for linkage is NAME, and in order to use it we took into account the results of a first application of record linkage techniques to the same data set (Garofalo and Viviano 1998). Different levels of agreement are defined for the full string of NAME, taking into account the number of words in the string itself and the number of agreements between the single words. After some sensitivity analysis, the final choice was a [...]-level definition, also depending on the kind of enterprise. The definition of agreement for the remaining variables is as follows:

- An indicator of agreement on CF with three levels (perfect agreement, disagreement between two 11-digit codes, disagreement between an 11-digit and a 16-character code).
- A dichotomous indicator of agreement-disagreement for ADD.
- Three levels of agreement for ATECO (perfect agreement, different but similar economic activities, completely different economic activities).

The choice of which pairs are to be considered matched is finally based on the probabilistic procedure described in the preceding section. Specifically, the estimation strategy we adopted is the following:

- The u probabilities are estimated outside the iterative algorithm, by building a data set of randomly matched pairs of records.
- The m probabilities are estimated by the EM algorithm, with a sensitivity analysis using different starting values.
- Using a priori information, some probabilities are forced to 1 or 0 (for example, perfect agreement on CF leads to a linked pair with probability 1).

4.3. Some Empirical Results

In the previous application by Garofalo and Viviano (1998), the linkage procedure adopted there led to the identification of a high number of large clusters containing more than 10 enterprises. An analytical review of the single enterprises in the clusters suggested that records referring to the same enterprise (which had changed name or street) did fall into the same cluster, but it was impractical (if not unfeasible) to identify them by a clerical review of all large clusters. In their procedure the matching weights were estimated by assuming independence among the components $u^i$ and $m^i$, but this assumption does not seem very realistic for our proble