Applying Social Network Analysis to the Information in CVS Repositories

Luis Lopez-Fernandez, Gregorio Robles, Jesus M. Gonzalez-Barahona

GSyC, Universidad Rey Juan Carlos

{llopez,grex,jgb}@gsyc.escet.urjc.es

Abstract

The huge quantities of data available in the CVS repositories of large, long-lived libre (free, open source) software projects, and the many interrelationships among those data, offer opportunities for extracting large amounts of valuable information about their structure, evolution and internal processes. Unfortunately, the sheer volume of that information renders it almost unusable without applying methodologies which highlight the relevant information for a given aspect of the project. In this paper, we propose the use of a well-known set of methodologies (social network analysis) for characterizing libre software projects, their evolution over time and their internal structure. In addition, we show how we have applied such methodologies to real cases, and extract some preliminary conclusions from that experience.

Keywords: source code repositories, visualization techniques, complex networks, libre software engineering

1 Introduction

The study and characterization of complex systems is an active research area, with many interesting open problems. Special attention has been paid recently to techniques based on network analysis, thanks to their power to capture some important characteristics and relationships. Network characterization is widely used in many scientific and technological disciplines, ranging from neurobiology [14] to computer networks [1] [3] or linguistics [9] (to mention just some examples). In this paper we apply this kind of analysis to software projects, using as a base the data available in their source code versioning repository (usually CVS). Fortunately, most large (both in code size and number of developers) libre (free, open source) software projects maintain such repositories, and grant public access to them.

The information in the CVS repositories of libre software projects has been gathered and analyzed using several methodologies [12] [5], but many other approaches are still possible. Among them, we explore here how to apply some techniques already common in traditional (social) network analysis. The proposed approach is based on considering either modules (usually CVS directories) or developers (committers to the CVS) as vertices, and the number of common commits as the weight of the link between any two vertices (see section 3 for a more detailed definition). This way, we end up with a weighted graph which captures some relationships between developers or modules, in which characteristics such as information flow or communities can be studied.

There have been some other works analyzing social networks in the libre software world. [7] hypothesizes that the organization of libre software projects can be modeled as self-organizing social networks and shows that this seems to be true at least when studying SourceForge projects. [6] also proposes a sort of network analysis for libre software projects, but considering source dependencies between modules. Our approach explores how to apply those network analysis techniques in a more comprehensive and complete way. To present it, we start by introducing some basic concepts of social network analysis which are used later (section 2), and the definition of the networks we consider (section 3). In section 4 we introduce the characterization we propose for those networks, and later, in section 5, we show some examples of the application of that characterization to Apache, GNOME and KDE. To finish, we offer some conclusions and discuss some future work.

2 Basic concepts on Social Network Analysis

The Theory of Complex Networks is based on representing complex systems as graphs. There are many examples in the literature where this approach has been successfully used in very different scientific and technological disciplines, identifying vertices and links as relevant for each specific domain. For example, in ecological networks each vertex may represent a particular species, with a link between two species if one of them “eats” the other. When dealing with social networks, we may identify vertices with persons or groups of people, considering a link when there is some kind of relationship between them.

Among the different kinds of networks that can be considered, in this paper we use affiliation networks. In affiliation networks there are two types of vertices: actors and groups. When we represent the network in terms of actors, each vertex is associated with a particular person and two vertices are linked together when they belong to the same group of people. When we represent the network in terms of groups, each vertex is associated with a group and two groups are linked through an edge when there is at least one person belonging to both at the same time.

Social networks can be directed (when the relationship between any two vertices is one way, like “is a boss of”) or undirected (when it is bidirectional, like “live together”). In addition, they can be weighted (each edge has an associated numeric value) or unweighted (each edge exists or not).
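The two views of an affiliation network described above can be illustrated with a minimal sketch. The membership data and the function names `actor_projection` and `group_projection` are hypothetical, chosen only for illustration; the sketch builds unweighted, undirected edges.

```python
from itertools import combinations

# Hypothetical affiliation data: actor -> set of groups the actor belongs to.
membership = {
    "ana": {"g1", "g2"},
    "ben": {"g2"},
    "eva": {"g3"},
    "tom": {"g1", "g3"},
}

def actor_projection(membership):
    """Link two actors when they belong to at least one common group."""
    edges = set()
    for a, b in combinations(sorted(membership), 2):
        if membership[a] & membership[b]:
            edges.add((a, b))
    return edges

def group_projection(membership):
    """Link two groups when at least one actor belongs to both."""
    edges = set()
    for groups in membership.values():
        for g, h in combinations(sorted(groups), 2):
            edges.add((g, h))
    return edges

actor_edges = actor_projection(membership)
group_edges = group_projection(membership)
print(sorted(actor_edges))
print(sorted(group_edges))
```

Both edge sets are derived from the same membership table, which is what makes them two sides of one affiliation network.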

3 Definition of the networks of developers and modules

In the approach we propose, for each project we build two networks using the commit information of the CVS system. They correspond to the two sides of an affiliation network obtained when we consider committers and modules in libre software projects. In both cases we consider weighted undirected networks as follows:

• Committer network. Each vertex corresponds to a particular committer (usually, a developer of the project). Two committers are linked when they have contributed to at least one common module, and the weight of the corresponding edge is the number of commits performed by both developers to all common modules.

• Module network. Each vertex represents a software module of the project. Two modules are linked when there is at least one committer who has contributed to both of them. Edges are weighted by the total number of commits performed by common committers to both modules.

The definition of what a module is will differ from project to project, but it will usually correspond to top-level directories in the CVS repository. In both networks, the weight of each edge (degree of relationship) reflects the closeness of two vertices: the higher it is, the stronger the relationship between the two vertices. We may also define the cost of relationship between any two vertices as the inverse of the degree of relationship. That cost of relationship is a measure of the “distance” between them, in the sense that the higher this parameter, the more difficult it is to reach one vertex from the other. For this reason we use the cost of relationship as the base for defining a distance in our networks. Given a pair of vertices $i$ and $j$, we define the distance between them as

$$d_{ij} = \sum_{e \in P_{i,j}} c_e,$$

where $P_{i,j}$ is the set of all the edges in the shortest path from $i$ to $j$, and $c_e$ is the cost of relationship of edge $e$ of that path.
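The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the commit log is invented, the helper names (`count_commits`, `transpose`, `project`, `distance`) are hypothetical, and the edge weight is read as the sum of the commits that both endpoints have in common (both committers' commits to their shared modules, or shared committers' commits to both modules). The distance is computed with Dijkstra's algorithm over the cost of relationship, taken as 1 / weight.

```python
import heapq
from collections import defaultdict
from itertools import combinations

# Hypothetical commit log: one (committer, module) pair per commit.
commits = [
    ("alice", "core"), ("alice", "core"), ("alice", "docs"),
    ("bob", "core"), ("bob", "docs"), ("bob", "docs"),
    ("carol", "docs"), ("carol", "web"),
]

def count_commits(commits):
    """Return counts[committer][module] = number of commits."""
    counts = defaultdict(dict)
    for committer, module in commits:
        counts[committer][module] = counts[committer].get(module, 0) + 1
    return counts

def transpose(counts):
    """Turn counts[committer][module] into counts[module][committer]."""
    out = defaultdict(dict)
    for row, cols in counts.items():
        for col, n in cols.items():
            out[col][row] = n
    return out

def project(counts):
    """Link two rows sharing a column; weight = their commits to shared columns."""
    graph = {v: {} for v in counts}
    for a, b in combinations(sorted(counts), 2):
        common = counts[a].keys() & counts[b].keys()
        if common:
            weight = sum(counts[a][k] + counts[b][k] for k in common)
            graph[a][b] = graph[b][a] = weight
    return graph

def distance(graph, source, target):
    """Sum of costs (1 / weight) along the cheapest path; inf if unreachable."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, v = heapq.heappop(queue)
        if v == target:
            return d
        if d > best[v]:
            continue  # stale queue entry
        for u, w in graph[v].items():
            nd = d + 1.0 / w
            if nd < best.get(u, float("inf")):
                best[u] = nd
                heapq.heappush(queue, (nd, u))
    return float("inf")

counts = count_commits(commits)
committer_net = project(counts)          # vertices are committers
module_net = project(transpose(counts))  # vertices are modules
print(committer_net)
print(module_net)
print(distance(module_net, "core", "web"))
```

Using one `project` function for both networks reflects the fact that they are the two sides of the same affiliation data; only the roles of committers and modules are swapped.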

4 Characterization of the networks considered for each project

For our analysis, we have considered a number of parameters characterizing the topology of the networks. In particular, we use the following definitions (which are common in the analysis of affiliation networks):

• Degree of a vertex ($k$): number of edges connected to that vertex. In the case of committer networks, for each committer it represents the number of companion committers contributing to the same modules as the given one. In the case of module networks, it is the total number of modules with which the given one shares committers.

• Weighted degree of a vertex: sum of the weights of all edges connected to that particular vertex. This can be interpreted as the degree of relationship of a given vertex with its direct neighborhood.

• Distance centrality of a vertex [13] ($D_c$): proximity to the rest of the vertices in the network. It is also called closeness centrality: the higher its value, the closer that vertex is to the others (on average). Given a vertex $v$ and a graph $G$, it can be defined as:

$$D_c(v) = \frac{1}{\sum_{t \in G} d_G(v,t)}, \qquad (1)$$

where $d_G(v,t)$ is the minimum distance from vertex $v$ to vertex $t$ (the sum of the costs of relationship of all edges in the shortest path from $v$ to $t$). The distance centrality can be interpreted as a measurement of the influence of a vertex in a graph: the higher its value, the easier it is for that vertex to spread information into that network. Observe that when a given vertex is “far” from the others, it has a low degree of relationship (i.e. a high cost of relationship) with the rest. In that case the term $\sum_{t \in G} d_G(v,t)$ will be high, meaning that the vertex is not placed in a central position in the network, and its distance centrality is low. This parameter can be used to identify modules or committers which are well related in a project.

• Betweenness centrality of a vertex [4, 2]: The betweenness centrality of a vertex, $B_c$, is a measurement of the number of shortest paths traversing that particular vertex. Given a vertex $v$ and a graph $G$, it can be defined as:

$$B_c(v) = \sum_{s \neq v \neq t \in G} \frac{\sigma_{st}(v)}{\sigma_{st}}, \qquad (2)$$

where $\sigma_{st}(v)$ is the number of shortest paths from $s$ to $t$ going through $v$, and $\sigma_{st}$ is the total number of shortest paths between $s$ and $t$. The betweenness centrality of a vertex can be interpreted as a measurement of the importance of a vertex in a given graph, in the sense that vertices with a high value of this parameter are intermediate nodes for the communication of the rest. In the case of weighted networks, multiple shortest paths between any pair of vertices are highly improbable. So, the term $\frac{\sigma_{st}(v)}{\sigma_{st}}$ usually takes only two values: 1 if the shortest path between $s$ and $t$ goes through $v$, or 0 otherwise. Therefore, the betweenness centrality is just a measurement of the number of shortest paths traversing a given vertex.

Figure 1. Distribution of the degrees of committers in Apache, circa February 2004 (horizontal axis: degree)

• Clustering coefficient of a vertex [14]: The clustering coefficient $c$ of a vertex measures the connectivity of its direct neighborhood. Given a vertex $v$ in a graph $G$, it can be defined as the probability that any two neighbors of $v$ are connected. Hence

$$c(v) = \frac{E(v)}{k_v(k_v-1)}, \qquad (3)$$

where $k_v$ is the number of neighbors of $v$ and $E(v)$ is the number of edges between those neighbors. A high clustering coefficient in a network indicates that this network has a tendency to form cliques. Observe that the clustering coefficient does not consider the weight of edges.

• Weighted clustering coefficient of a vertex [10]: The weighted clustering coefficient $c_w$ of a vertex is an attempt to generalize the concept of clustering coefficient to weighted networks. Given a vertex $v$ in a weighted graph $G$, it can be defined as:

$$c_w(v) = \sum_{i \neq j \in N_G(v)} w_{ij} \, \frac{1}{k_v(k_v-1)}, \qquad (4)$$

where $N_G(v)$ is the neighborhood of $v$ in $G$ (the subgraph of all vertices connected to $v$), $w_{ij}$ is the degree of relationship of the link between neighbor $i$ and neighbor $j$ ($w_{ij} = 0$ if there is no link), and $k_v$ is the number of neighbors. The weighted clustering coefficient can be interpreted as a measurement of the local efficiency of the network around a particular vertex. For our networks, note that the term $\sum_{i \neq j \in N_G(v)} w_{ij}$ can be seen as the total degree of relationship in the neighborhood of vertex $v$, while the factor $\frac{1}{k_v(k_v-1)}$ divides it by the total number of relationships that could exist in that neighborhood.

Figure 2. Clustering coefficient of modules in Apache (top) and GNOME (bottom), circa February 2004 (distribution; horizontal axis: cc, clustering coefficient)
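The parameters defined in this section can be sketched as follows. This is a minimal illustration, not the tooling used for the case studies. It assumes a hypothetical toy graph stored as `graph[u][v] = weight`, a connected network, a cost of relationship of 1 / weight, and unique shortest paths (as argued above for weighted networks). Betweenness is counted over unordered pairs $s, t$, and the sums in equations (3) and (4) are taken over ordered pairs $i \neq j$, so that a fully linked neighborhood yields a clustering coefficient of 1; both choices are assumptions of this sketch.

```python
import heapq
from itertools import permutations

# Hypothetical weighted undirected graph: graph[u][v] = degree of relationship.
graph = {
    "a": {"b": 4, "c": 2},
    "b": {"a": 4, "c": 1, "d": 5},
    "c": {"a": 2, "b": 1},
    "d": {"b": 5},
}

def degree(graph, v):
    """Number of edges connected to v."""
    return len(graph[v])

def weighted_degree(graph, v):
    """Sum of the weights of all edges connected to v."""
    return sum(graph[v].values())

def dijkstra(graph, source):
    """Distances from source (edge cost = 1 / weight) and predecessor map."""
    dist, prev = {source: 0.0}, {}
    queue = [(0.0, source)]
    while queue:
        d, v = heapq.heappop(queue)
        if d > dist[v]:
            continue  # stale queue entry
        for u, w in graph[v].items():
            nd = d + 1.0 / w
            if nd < dist.get(u, float("inf")):
                dist[u], prev[u] = nd, v
                heapq.heappush(queue, (nd, u))
    return dist, prev

def distance_centrality(graph, v):
    """Equation (1): inverse of the summed distances from v to the rest."""
    dist, _ = dijkstra(graph, v)
    total = sum(d for t, d in dist.items() if t != v)
    return 1.0 / total if total > 0 else 0.0

def betweenness(graph):
    """Equation (2), assuming one shortest path per unordered pair (s, t)."""
    score = {v: 0 for v in graph}
    vertices = sorted(graph)
    for i, s in enumerate(vertices):
        _, prev = dijkstra(graph, s)
        for t in vertices[i + 1:]:
            v = prev.get(t)
            while v is not None and v != s:  # walk back over interior vertices
                score[v] += 1
                v = prev[v]
    return score

def clustering(graph, v, weighted=False):
    """Equations (3) and (4), summing over ordered pairs of neighbors."""
    neighbors = list(graph[v])
    k = len(neighbors)
    if k < 2:
        return 0.0
    total = 0.0
    for i, j in permutations(neighbors, 2):
        if j in graph[i]:
            total += graph[i][j] if weighted else 1.0
    return total / (k * (k - 1))

print(degree(graph, "b"), weighted_degree(graph, "b"))
print(distance_centrality(graph, "b"))
print(betweenness(graph))
print(clustering(graph, "b"), clustering(graph, "b", weighted=True))
```

In this toy graph the direct edge between `b` and `c` has a low weight, so the cheapest path between them goes through `a`; this is why `a` collects betweenness even though it has fewer neighbors than `b`.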

5 Case studies: Apache, GNOME and KDE modules

Apache, GNOME and KDE are all well-known libre software projects, large in size (each well above a million lines of code), in which several subprojects (modules) can be identified. They have already been studied (for instance in [11] and [8]) from several points of view. We have used them to apply our methodology, and in this section some results of that application are shown (just an example of how a project can be characterized from several points of view).

In figure 1 the distribution of the degree of relationship for each committer in the Apache project is shown as an example of how developers can be characterized by how they relate to each other. It is easy to appreciate how that distribution shows two peaks, one between 20-40 and another around 70-90. Only a handful of developers have a direct relationship with more than 200 companions.

In figure 2 the distribution of the clustering coefficient of modules in Apache and GNOME is compared. Although in both cases there is a peak at 1 (meaning that in many cases the direct neighborhood of a module is completely linked together), there is an interesting peak in GNOME around 0.77, which should be studied but probably corresponds to a sparsely connected cluster.

Figure 3 shows how, despite differences in the distribution of the clustering coefficient, the distributions of the weighted clustering coefficient have more similar shapes, with a quick rise from zero to a maximum, and a slower, asymptotic decline later. This would mean that in the three projects most nodes (those near the peak) are in clusters with a similar interconnection structure.

As a final example, on the evolution of a project, figure 4 shows the distribution of the connection degree in four snapshots of the Apache project. It can be seen how there is a tremendous growth in the connection degree of the most connected module (from 34 in 2001 to more than 100 in 2004), while the shape of the distribution changes over time: from 2001 to 2002 a two-peak structure develops, which slowly changes into a one-peak distribution through 2003 and 2004.

For lack of space we do not offer it here, but the analysis of the top modules and developers for each parameter considered gives considerable insight into which ones help to hold the projects together, deal with information flows, or act as aggregators of clusters.

Figure 3. Weighted clustering coefficient of modules in Apache (top), GNOME (middle), and KDE (bottom), circa February 2004 (distribution; horizontal axis: weighted clustering coefficient)
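The distributions discussed in this section are histograms of a per-vertex parameter. A minimal sketch of how a degree distribution could be tabulated is shown below; the toy graph, the function name `degree_distribution` and the bin width are assumptions made for illustration, not the data or tooling behind the figures.

```python
from collections import Counter

# Hypothetical network: vertex -> {neighbor: weight}.
toy_graph = {
    "a": {"b": 1, "c": 1, "d": 1},
    "b": {"a": 1},
    "c": {"a": 1},
    "d": {"a": 1},
}

def degree_distribution(graph, bin_width=10):
    """Count vertices per degree bin; each key is the lower edge of a bin."""
    bins = Counter((len(nbrs) // bin_width) * bin_width
                   for nbrs in graph.values())
    return dict(sorted(bins.items()))

histogram = degree_distribution(toy_graph, bin_width=2)
print(histogram)
```

Applying the same function to snapshots of a project taken at different dates would give one histogram per snapshot, which is the kind of comparison made for the evolution of Apache.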

6 Conclusions and further work

In this paper we have shown a methodology which applies affiliation network analysis to data gathered from CVS repositories. We also offer some examples of how it can be applied to characterize libre software projects. From a more general point of view, we have learned (demonstration not shown in this paper) that in the three analyzed cases (Apache, GNOME and KDE), both the committer and the module networks are small-world networks, which means that all the theory developed for them applies here.

Our group is only starting to explore the many paths opened by this methodology. Currently, we are interested in analyzing a large number of projects, looking for correlations which can help us to make estimations and predictions of the future evolution of projects. We are also looking for characterizations of projects based on the parameters of the curves that interpolate the distributions of the parameters we are studying. And, of course, we intend to apply other techniques usual in small-world and other social networks.

We feel that these research paths will allow for a more complete understanding of how libre software projects differ from each other, and will also help to identify common patterns and invariants.

Figure 4. Connection degree of modules in Apache, circa February, from 2001 (top) to 2004 (bottom) (distribution; horizontal axis: degree)

References

[1] R. Albert, A. L. Barabási, H. Jeong, and G. Bianconi. Power-law distribution of the world wide web. Science, 287, 2000.

[2] J. Anthonisse. The rush in a directed graph. Technical report, Stichting Mathematisch Centrum, Amsterdam, The Netherlands, 1971.

[3] Cancho and R. Sole. The small world of human language. Proceedings of the Royal Society of London. Series B, Biological Sciences, 268:2261–2265, Nov. 2001.

[4] C. Freeman. A set of measures of centrality based on betweenness. Sociometry 40, 35-41, 1977.

[5] D. Germán and A. Mockus. Automating the measurement of open source projects. In Proceedings of the 3rd Workshop on Open Source Software Engineering, 25th International Conference on Software Engineering, Portland, Oregon, 2003.

[6] R. A. Ghosh. Clustering and dependencies in free/open source software development: Methodology and tools. First Monday, 2003. http://www.firstmonday.dk/issues/issue8_4/ghosh/index.html

[7] V. F. Greg Madey and R. Tynan. The open source development phenomenon: An analysis based on social network theory. In Americas Conference on Information Systems (AMCIS2002), pages 1806–1813, Dallas, TX, USA, 2002. http://www.nd.edu/~oss/Papers/amcis_oss.pdf

[8] S. Koch and G. Schneider. Effort, cooperation and coordination in an open source software project: Gnome. Information Systems Journal, 12(1):27–42, 2002.

[9] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. The web and social networks. IEEE Computer, 35(11):32–36, 2002.

[10] V. Latora and M. Marchiori. Economic small-world behavior in weighted networks. European Physical Journal B 32, 249-263, 2003.

[11] A. Mockus, R. Fielding, and J. Herbsleb. A case study of open source software development: The Apache server. In Proceedings of the 22nd International Conference on Software Engineering (ICSE 2000), pages 263–272, Limerick, Ireland, 2000.

[12] G. Robles-Martinez, J. M. Gonzalez-Barahona, J. Centeno-Gonzalez, V. Matellan-Olivera, and L. Rodero-Merino. Studying the evolution of libre software projects using publicly available data. In Proceedings of the 3rd Workshop on Open Source Software Engineering, 25th International Conference on Software Engineering, pages 111–115, Portland, Oregon, 2003.

[13] G. Sabidussi. The centrality index of a graph. Psychometrika 31, 581-606, 1966.

[14] D. Watts and S. Strogatz. Collective dynamics of small-world networks. Nature 393, 440-442, 1998.
