Applying Social Network Analysis to the Information in CVS Repositories

Luis Lopez-Fernandez, Gregorio Robles, Jesus M. Gonzalez-Barahona
GSyC, Universidad Rey Juan Carlos
{llopez,grex,jgb}@gsyc.escet.urjc.es

Abstract

The huge quantities of data available in the CVS repositories of large, long-lived libre (free, open source) software projects, and the many interrelationships among those data, offer opportunities for extracting large amounts of valuable information about their structure, evolution and internal processes. Unfortunately, the sheer volume of that information renders it almost unusable without applying methodologies which highlight the relevant information for a given aspect of the project. In this paper, we propose the use of a well-known set of methodologies (social network analysis) for characterizing libre software projects, their evolution over time and their internal structure. In addition, we show how we have applied such methodologies to real cases, and extract some preliminary conclusions from that experience.

Keywords: source code repositories, visualization techniques, complex networks, libre software engineering

1 Introduction

The study and characterization of complex systems is an active research area, with many interesting open problems. Special attention has been paid recently to techniques based on network analysis, thanks to their power to capture some important characteristics and relationships. Network characterization is widely used in many scientific and technological disciplines, ranging from neurobiology [14] to computer networks [1] [3] or linguistics [9] (to mention just some examples). In this paper we apply this kind of analysis to software projects, using as a base the data available in their source code versioning repository (usually CVS).
Fortunately, most large (both in code size and number of developers) libre (free, open source) software projects maintain such repositories, and grant public access to them.

The information in the CVS repositories of libre software projects has been gathered and analyzed using several methodologies [12] [5], but many other approaches are still possible. Among them, we explore here how to apply some techniques already common in traditional (social) network analysis. The proposed approach is based on considering either modules (usually CVS directories) or developers (committers to the CVS) as vertices, and the number of common commits as the weight of the link between any two vertices (see section 3 for a more detailed definition). This way, we end up with a weighted graph which captures some relationships between developers or modules, in which characteristics such as information flow or communities can be studied.

There have been some other works analyzing social networks in the libre software world. [7] hypothesizes that the organization of libre software projects can be modeled as self-organizing social networks and shows that this seems to be true at least when studying SourceForge projects. [6] also proposes a sort of network analysis for libre software projects, but considering source dependencies between modules. Our approach explores how to apply those network analysis techniques in a more comprehensive and complete way. To present it, we start by introducing some basic concepts of social network analysis which are used later (section 2), and the definition of the networks we consider (section 3). In section 4 we introduce the characterization we propose for those networks, and later, in section 5, we show some examples of the application of that characterization to Apache, GNOME and KDE. To finish, we offer some conclusions and discuss some future work.

2 Basic concepts on Social Network Analysis

The Theory of Complex Networks is based on representing complex systems as graphs.
There are many examples in the literature where this approach has been successfully used in very different scientific and technological disciplines, identifying vertices and links as relevant for each specific domain. For example, in ecological networks each vertex may represent a particular species, with a link between two species if one of them "eats" the other. When dealing with social networks, we may identify vertices with persons or groups of people, considering a link when there is some kind of relationship between them.

Among the different kinds of networks that can be considered, in this paper we use affiliation networks. In affiliation networks there are two types of vertices: actors and groups. When we represent the network in terms of actors, each vertex is associated with a particular person and two vertices are linked together when they belong to the same group of people. When we represent the network in terms of groups, each vertex is associated with a group and two groups are linked through an edge when there is at least one person belonging to both at the same time.

Social networks can be directed (when the relationship between any two vertices is one way, like "is a boss of") or undirected (when it is bidirectional, like "live together"). In addition, they can be weighted (each edge has an associated numeric value) or unweighted (each edge exists or not).

3 Definition of the networks of developers and modules

In the approach we propose, for each project we build two networks using the commit information of the CVS system. They correspond to the two sides of an affiliation network obtained when we consider committers and modules in libre software projects. In both cases we consider weighted undirected networks as follows:

• Committer network. Each vertex corresponds to a particular committer (usually, a developer of the project).
Two committers are linked when they have contributed to at least one common module, and the weight of the corresponding edge is the number of commits performed by both developers to all common modules.

• Module network. Each vertex represents a software module of the project. Two modules are linked when there is at least one committer who has contributed to both of them. Edges are weighted by the total number of commits performed by common committers to both modules.

The definition of what a module is will differ from project to project, but it will usually correspond to the top level directories in the CVS repository. In both networks, the weight of each edge (degree of relationship) reflects the closeness of two vertices. The higher it is, the stronger the relationship between the given two vertices. We may also define the cost of relationship between any two vertices as the inverse of the degree of relationship. That cost of relationship is a measure of the "distance" between them, in the sense that the higher this parameter, the more difficult it is to reach one vertex from the other. For this reason we use the cost of relationship as the base for defining a distance in our networks. Given a pair of vertices $i$ and $j$, we define the distance between them as

$$d_{ij} = \sum_{e \in P_{i,j}} c_e,$$

where $P_{i,j}$ is the set of all the edges in the shortest path from $i$ to $j$, and $c_e$ is the cost of relationship of edge $e$ of such path.

4 Characterization of the networks considered for each project

For our analysis, we have considered a number of parameters characterizing the topology of the networks. In particular, we use the following definitions (which are common in the analysis of affiliation networks):

• Degree of a vertex ($k$): number of edges connected to that vertex. In the case of committer networks, for each committer it represents the number of companion committers, contributing to the same modules as the given one.
In the case of module networks, it is the total number of modules with which the given one shares committers.

• Weighted degree of a vertex: sum of the weights of all edges connected to that particular vertex. This can be interpreted as the degree of relationship of a given vertex with its direct neighborhood.

• Distance centrality of a vertex [13] ($D_c$): proximity to the rest of vertices in the network. It is also called closeness centrality: the higher its value, the closer that vertex is to the others (on average). Given a vertex $v$ and a graph $G$, it can be defined as:

$$D_c(v) = \frac{1}{\sum_{t \in G} d_G(v, t)}, \qquad (1)$$

where $d_G(v, t)$ is the minimum distance from vertex $v$ to vertex $t$ (the sum of the costs of relationship of all edges in the shortest path from $v$ to $t$). The distance centrality can be interpreted as a measurement of the influence of a vertex in a graph: the higher its value, the easier it is for that vertex to spread information into that network. Observe that when a given vertex is "far" from the others, it has a low degree of relationship (i.e. a high cost of relationship) with the rest. In that case the term $\sum_{t \in G} d_G(v, t)$ will be high, meaning that the vertex is not placed in a central position in the network, and its distance centrality is low. This parameter can be used to identify modules or committers which are well related in a project.

[Figure 1. Distribution of the degrees of committers in Apache, circa February 2004. Horizontal axis: degree.]

• Betweenness centrality of a vertex [4, 2]: The betweenness centrality of a vertex, $B_c$, is a measurement of the number of shortest paths traversing that particular vertex. Given a vertex $v$ and a graph $G$, it can be defined as:

$$B_c(v) = \sum_{s \neq v \neq t \in G} \frac{\sigma_{st}(v)}{\sigma_{st}}, \qquad (2)$$

where $\sigma_{st}(v)$ is the number of shortest paths from $s$ to $t$ going through $v$, and $\sigma_{st}$ is the total number of shortest paths between $s$ and $t$.
The betweenness centrality of a vertex can be interpreted as a measurement of the importance of a vertex in a given graph, in the sense that vertices with a high value of this parameter are intermediate nodes for the communication of the rest. In the case of weighted networks, multiple shortest paths between any pair of vertices are highly improbable. So, the term $\sigma_{st}(v) / \sigma_{st}$ usually takes only two values: 1 if the shortest path between $s$ and $t$ goes through $v$, or 0 otherwise. Therefore, the betweenness centrality is just a measurement of the number of shortest paths traversing a given vertex.

• Clustering coefficient of a vertex [14]: The clustering coefficient $c$ of a vertex measures the connectivity of its direct neighborhood. Given a vertex $v$ in a graph $G$, it can be defined as the probability that any two neighbors of $v$ are connected. Hence

$$c(v) = \frac{E(v)}{k_v (k_v - 1)}, \qquad (3)$$

where $k_v$ is the number of neighbors of $v$ and $E(v)$ is the number of edges between those neighbors. A high clustering coefficient in a network indicates that this network has a tendency to form cliques. Observe that the clustering coefficient does not consider the weight of edges.

• Weighted clustering coefficient of a vertex [10]: The weighted clustering coefficient $c_w$ of a vertex is an attempt to generalize the concept of clustering coefficient to weighted networks. Given a vertex $v$ in a weighted graph $G$, it can be defined as:

$$c_w(v) = \frac{1}{k_v (k_v - 1)} \sum_{i \neq j \in N_G(v)} w_{ij}, \qquad (4)$$

where $N_G(v)$ is the neighborhood of $v$ in $G$ (the subgraph of all vertices connected to $v$), $w_{ij}$ is the degree of relationship of the link between neighbor $i$ and neighbor $j$ ($w_{ij} = 0$ if there is no link), and $k_v$ is the number of neighbors. The weighted clustering coefficient can be interpreted as a measurement of the local efficiency of the network around a particular vertex. For our networks, note that the term $\sum_{i \neq j \in N_G(v)} w_{ij}$ can be seen as the total degree of relationship in the neighborhood of vertex $v$, while the factor $\frac{1}{k_v (k_v - 1)}$ normalizes it by $k_v (k_v - 1)$, the total number of relationships that could exist in that neighborhood.

[Figure 2. Clustering coefficient of modules in Apache (top) and GNOME (bottom), circa February 2004 (distribution). Horizontal axis: cc (clustering coefficient).]

5 Case studies: Apache, GNOME and KDE modules

Apache, GNOME and KDE are all well-known libre software projects, large in size (each well above a million lines of code), in which several subprojects (modules) can be identified. They have already been studied (for instance in [11] and [8]) from several points of view. We have used them to apply our methodology, and in this section some results of that application are shown (just an example of how a project can be characterized from several points of view).

[Figure 3. Weighted clustering coefficient of modules in Apache (top), GNOME (middle), and KDE (bottom), circa February 2004 (distribution). Horizontal axis: weighted clustering coefficient.]

In figure 1 the distribution of the degree of relationship for each committer in the Apache project is shown as an example of how developers can be characterized by how they relate to each other. It is easy to see that the distribution shows two peaks, one between 20-40 and another around 70-90.
Only a handful of developers have a direct relationship with more than 200 companions.

In figure 2 the distribution of the clustering coefficient of modules in Apache and GNOME is compared. Although in both cases there is a peak at 1 (meaning that in many cases the direct neighborhood of a module is completely linked together), there is an interesting peak in GNOME around 0.77, which should be studied but probably corresponds to a sparsely connected cluster.

Figure 3 shows how, despite differences in the distribution of the clustering coefficient, the distributions of the weighted clustering coefficient have more similar shapes, with a quick rise from zero to a maximum, and a slower, asymptotic decline later. This would mean that in the three projects most nodes (those near the peak) are in clusters with a similar interconnection structure.

As a final example, on the evolution of a project, figure 4 shows the distribution of the connection degree in four snapshots of the Apache project. It can be seen that there is a tremendous growth in the connection degree of the most connected module (from 34 in 2001 to more than 100 in 2004), while the shape of the distribution changes over time: from 2001 to 2002 a two-peak structure develops, which slowly changes into a one-peak distribution through 2003 and 2004.

For lack of space we do not offer it here, but the analysis of the top modules and developers for each parameter considered gives a lot of insight into which ones are helping to hold the projects together, to deal with information flows, or are the aggregators of clusters.

6 Conclusions and further work

In this paper we have shown a methodology which applies affiliation network analysis to data gathered from CVS repositories. We also offer some examples of how it can be applied to characterize libre software projects.
From a more general point of view, we have learned (demonstration not shown in this paper) that in the three analyzed cases (Apache, GNOME and KDE), both the committer and the module networks are small-world networks, which means that all the theory developed for them applies here.

Our group is just starting to explore the many paths opened by this methodology. Currently, we are interested in analyzing a large number of projects, looking for correlations which can help us to make estimations and predictions of the future evolution of projects. We are also looking for characterizations of projects based on the parameters of the curves that interpolate the distributions of the parameters we are studying. And of course, we are applying other techniques usual in small-world and other social networks.

We feel that these research paths will allow a more complete understanding of how libre software projects differ from each other, and will also help to identify common patterns and invariants.

[Figure 4. Connection degree of modules in Apache, circa February, from 2001 (top) to 2004 (bottom) (distribution). Horizontal axis: degree.]

References

[1] R. Albert, A. L. Barabási, H. Jeong, and G. Bianconi. Power-law distribution of the world wide web. Science, 287, 2000.
[2] J. Anthonisse. The rush in a directed graph. Technical report, Stichting Mathematisch Centrum, Amsterdam, The Netherlands, 1971.
[3] R. Ferrer i Cancho and R. Solé. The small world of human language. Proceedings of the Royal Society of London. Series B, Biological Sciences, 268:2261–2265, Nov. 2001.
[4] L. C. Freeman. A set of measures of centrality based on betweenness. Sociometry, 40:35–41, 1977.
[5] D. Germán and A. Mockus. Automating the measurement of open source projects.
In Proceedings of the 3rd Workshop on Open Source Software Engineering, 25th International Conference on Software Engineering, Portland, Oregon, 2003.
[6] R. A. Ghosh. Clustering and dependencies in free/open source software development: Methodology and tools. First Monday, 2003. http://www.firstmonday.dk/issues/issue8_4/ghosh/index.html
[7] G. Madey, V. Freeh, and R. Tynan. The open source development phenomenon: An analysis based on social network theory. In Americas Conference on Information Systems (AMCIS2002), pages 1806–1813, Dallas, TX, USA, 2002. http://www.nd.edu/~oss/Papers/amcis_oss.pdf
[8] S. Koch and G. Schneider. Effort, cooperation and coordination in an open source software project: Gnome. Information Systems Journal, 12(1):27–42, 2002.
[9] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. The web and social networks. IEEE Computer, 35(11):32–36, 2002.
[10] V. Latora and M. Marchiori. Economic small-world behavior in weighted networks. European Physical Journal B, 32:249–263, 2003.
[11] A. Mockus, R. Fielding, and J. Herbsleb. A case study of open source software development: The Apache server. In Proceedings of the 22nd International Conference on Software Engineering (ICSE 2000), pages 263–272, Limerick, Ireland, 2000.
[12] G. Robles-Martinez, J. M. Gonzalez-Barahona, J. Centeno-Gonzalez, V. Matellan-Olivera, and L. Rodero-Merino. Studying the evolution of libre software projects using publicly available data. In Proceedings of the 3rd Workshop on Open Source Software Engineering, 25th International Conference on Software Engineering, pages 111–115, Portland, Oregon, 2003.
[13] G. Sabidussi. The centrality index of a graph. Psychometrika, 31:581–606, 1966.
[14] D. Watts and S. Strogatz. Collective dynamics of small-world networks. Nature, 393:440–442, 1998.
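
Illustrative sketch. The network definitions of section 3 and several of the measures of section 4 can be sketched in Python as shown here. This is not the authors' tooling, only a minimal illustration under stated assumptions: the input is a hypothetical list of (committer, module) pairs, one per commit; the edge weight between two committers is read as the sum of the commits both made to their common modules (and symmetrically for modules); vertices that cannot be reached are left out of the distance sum; and the unweighted clustering coefficient counts each unordered pair of neighbors once, so that a fully linked neighborhood yields 1. All function names and the sample data are invented for the example. Betweenness centrality is not covered.

```python
import heapq
from collections import defaultdict
from itertools import combinations


def build_networks(commits):
    """Build the committer and module networks (section 3) as dicts of dicts."""
    count = defaultdict(int)              # (committer, module) -> commits
    for committer, module in commits:
        count[(committer, module)] += 1
    by_committer = defaultdict(set)       # committer -> modules touched
    by_module = defaultdict(set)          # module -> committers involved
    for committer, module in count:
        by_committer[committer].add(module)
        by_module[module].add(committer)

    committer_net = {c: {} for c in by_committer}
    for a, b in combinations(sorted(by_committer), 2):
        common = by_committer[a] & by_committer[b]
        if common:                        # assumed reading of the edge weight
            w = sum(count[(a, m)] + count[(b, m)] for m in common)
            committer_net[a][b] = committer_net[b][a] = w

    module_net = {m: {} for m in by_module}
    for m, n in combinations(sorted(by_module), 2):
        common = by_module[m] & by_module[n]
        if common:
            w = sum(count[(c, m)] + count[(c, n)] for c in common)
            module_net[m][n] = module_net[n][m] = w
    return committer_net, module_net


def degree(net, v):
    return len(net[v])


def weighted_degree(net, v):
    return sum(net[v].values())


def distances_from(net, source):
    """Dijkstra's algorithm with edge cost = 1 / weight (cost of relationship)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist[v]:
            continue                      # stale heap entry
        for u, w in net[v].items():
            nd = d + 1.0 / w
            if nd < dist.get(u, float("inf")):
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return dist


def distance_centrality(net, v):
    """Equation (1), summing over the vertices reachable from v."""
    total = sum(d for t, d in distances_from(net, v).items() if t != v)
    return 1.0 / total if total > 0 else 0.0


def clustering(net, v):
    """Fraction of unordered neighbor pairs of v that are linked."""
    nbrs = list(net[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i, j in combinations(nbrs, 2) if j in net[i])
    return 2.0 * links / (k * (k - 1))


def weighted_clustering(net, v):
    """Equation (4): weights over ordered neighbor pairs, divided by k(k-1)."""
    nbrs = list(net[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    total = sum(net[i].get(j, 0) for i in nbrs for j in nbrs if i != j)
    return total / (k * (k - 1))


# Hypothetical commit log: one (committer, module) pair per commit.
commits = [("alice", "core"), ("alice", "core"), ("alice", "docs"),
           ("bob", "core"), ("bob", "docs"),
           ("carol", "docs"), ("carol", "site")]
committer_net, module_net = build_networks(commits)
for module in sorted(module_net):
    print(module, degree(module_net, module),
          weighted_degree(module_net, module),
          distance_centrality(module_net, module))
```

Storing each network as a dictionary of dictionaries keeps the sketch free of dependencies. For repositories of the size discussed in section 5, a graph library would likely be a better fit.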