Nonparametric Tests for Homogeneity of Species Assemblages: A Data Depth Approach

Biometrics 67, 1481–1488, December 2011
DOI: 10.1111/j.1541-0420.2011.01573.x

Jun Li,1,* Jifei Ban,1 and Louis S. Santiago2
1 Department of Statistics, University of California, Riverside, California 92521, U.S.A.
2 Department of Botany & Plant Sciences, University of California, Riverside, California 92521, U.S.A.
* email: jun.li@ucr.edu

Summary. Testing homogeneity of species assemblages has important applications in ecology. Due to the unique structure of abundance data often collected in ecological studies, most classical statistical tests cannot be applied directly. In this article, we propose two novel nonparametric tests for comparing species assemblages based on the concept of data depth. They can be considered as a natural generalization of the Kolmogorov–Smirnov and the Cramér–von Mises tests (KS and CM) in this species assemblage comparison context. Our simulation studies show that the proposed test is more powerful than other existing methods under various settings. A real example is used to demonstrate how the proposed method is applied to compare species assemblages using plant community data from a highly diverse tropical forest at Barro Colorado Island, Panama.

Key words: Data depth; DD-plot; Nonparametric tests; Permutation tests; Species richness.

1. Introduction

Testing homogeneity across different species assemblages is important in ecology because it provides crucial information about the spatial and temporal stability of ecosystems. One typical type of data collected in ecological studies is abundance data, which consists of counts of abundances of individual species in each sampling unit. For example, as part of the Barro Colorado Island forest dynamics research project, a study was carried out to investigate spatial differences between two highly diverse tropical forest census plots from Barro Colorado Island, Panama.
Each of the two plots, which were 1 hectare in size, was divided into twenty-five 20 m × 20 m quadrats. Counts of each individual species were then recorded in all of the 25 quadrats. Based on those species abundance data, one fundamental ecological question is whether the two species assemblages differ significantly. In this study, a total of 159 tree species was observed in the two plots. Therefore, if we treat the vector of the counts of all 159 tree species in each of the quadrats as an observation in the sample, the data we have consists of two 159-dimensional samples with both sample sizes being 25. Our task is essentially to compare the distributions of the species abundance data from the two plots based on these two samples.

Typically for abundance data, dimensionality, which is equal to the number of species, is often high (in our case it is 159), and zeros are common due to the rarity of some species, making it difficult to find a satisfactory parametric model for such data. Thus, a nonparametric testing procedure is more desirable when comparing species assemblages given abundance data. Furthermore, for abundance data, measures such as Bray–Curtis distance (Bray and Curtis, 1957) are usually preferred to Euclidean distance for describing the dissimilarity between observations (Faith, Minchin, and Belbin, 1987; Clarke, 1993). Therefore, a nonparametric testing procedure that can incorporate such measures would be the most appropriate to carry out the comparison between species assemblages.

In the literature there have been some approaches which can incorporate distance measures into the comparison procedure for multivariate outcomes (e.g., Gower and Krzanowski, 1999; McArdle and Anderson, 2001; Reiss et al., 2010). Most of them are based on so-called "analysis of distance," which partitions the variation inherent in distance matrices, analogous to the well-known multivariate analysis of variance.
Similar to multivariate analysis of variance, those approaches were motivated by testing equal means among distributions, and therefore are only sensitive to the location differences among distributions. In practice, the distributions of abundance data from different species assemblages may differ in other characteristics. In this article, we propose two novel nonparametric tests, both of which have the flexibility to incorporate any desired distance measure and are also capable of detecting any distributional differences between species assemblages. More specifically, the two tests are derived based on the concept of data depth. Because the data depth we use is based on any distance measure between observations, it can be directly applied to abundance data and at the same time is capable of incorporating any desired distance measure for abundance data. Based on this distance-based depth, we also employ the so-called two-dimensional DD-plot (Liu, Parelius, and Singh, 1999) to visualize the difference between species assemblages. This graphical tool serves as further motivation for our two proposed tests for species assemblage comparisons. The two tests can be considered as the analogues of the classical KS and CM tests in a species assemblage comparison context. The analogue of the CM test is shown to have more power than other existing nonparametric tests for a variety of alternative hypotheses.

© 2011, The International Biometric Society

The rest of this article is organized as follows. In Section 2, we briefly review the general concept of data depth, and then introduce the special notion of data depth that we use in this article, distance-based depth. In Section 3, we demonstrate the use of DD-plot for graphical comparison of two species assemblages. In Section 4, we describe the two proposed nonparametric testing procedures. Simulation studies are carried out to evaluate the performance of the proposed tests in Section 5.
In Section 6, we demonstrate the application of the proposed procedures by revisiting the species abundance data from the two tropical forest census plots in Barro Colorado Island, Panama. Finally, we provide concluding remarks in Section 7.

2. A Distance-Based Data Depth

A data depth is a measure of how central or how outlying a given point is with respect to a multivariate data cloud or its underlying distribution. The word "depth" was first used by Tukey (1975) for picturing data. Since then, many different notions of data depth have been proposed for capturing different probabilistic features of multivariate data. Among the most popular choices of data depths are Mahalanobis depth (Mahalanobis, 1936; Hu et al., 2009), half-space depth (Hodges, 1955; Tukey, 1975), simplicial depth (Liu, 1990), projection depth (Stahel, 1981; Donoho, 1982; Donoho and Gasko, 1992; Zuo, 2003), etc. More discussion on different notions of data depth can be found in Liu et al. (1999), Zuo and Serfling (2000), and Mizera (2002).

In the last two decades, data depth has provided many new and powerful nonparametric tools for multivariate data (see, e.g., Liu et al., 1999; Li and Liu, 2004, 2008). However, due to the discrete nature of the abundance data and the special distance measure required between the observations, most existing depths in the literature cannot be directly applied to abundance data. This motivates us to explore a distance-based depth, the idea of which was briefly mentioned in Bartoszynski, Pearl, and Lawrence (1997). The definition of the distance-based depth is given below.

Definition (Distance-based depth). Let $\mathbf{X} = \{X_1, \ldots, X_n\}$ be a random sample from $F$, where $F$ is a distribution of any type. The distance-based depth at $x$ w.r.t.
$F$ is defined as

$$D_F(x) = \Pr\{d(X_1, X_2) > \max[d(X_1, x), d(X_2, x)]\} + \tfrac{1}{2}\Pr\{d(X_1, X_2) = d(X_1, x) > d(X_2, x)\} + \tfrac{1}{2}\Pr\{d(X_1, X_2) = d(X_2, x) > d(X_1, x)\} + \tfrac{1}{3}\Pr\{d(X_1, X_2) = d(X_1, x) = d(X_2, x)\},$$

and the sample version is

$$D_{F_n}(x) = \binom{n}{2}^{-1}\Big[\sum_{i<j} I\{d(X_i, X_j) > \max[d(X_i, x), d(X_j, x)]\} + \tfrac{1}{2}\sum_{i<j} I\{d(X_i, X_j) = d(X_i, x) > d(X_j, x)\} + \tfrac{1}{2}\sum_{i<j} I\{d(X_i, X_j) = d(X_j, x) > d(X_i, x)\} + \tfrac{1}{3}\sum_{i<j} I\{d(X_i, X_j) = d(X_i, x) = d(X_j, x)\}\Big],$$

where $d(x, y)$ is any suitably chosen distance measure between $x$ and $y$, and $I\{A\}$ is the indicator function which takes 1 if $A$ is true and 0 otherwise.

Figure 1. $B(X_i, X_j)$ in two-dimensional case.

In the above definition, $\Pr\{d(X_1, X_2) > \max[d(X_1, x), d(X_2, x)]\}$ ($\equiv p_1$) represents the probability that the side joining $X_1$ and $X_2$ is the longest in a triangle with vertices $X_1$, $X_2$, and $x$. Similarly, we can define

$$p_2 = \Pr\{d(X_1, X_2) < \min[d(X_1, x), d(X_2, x)]\}$$

and

$$p_3 = \Pr\{\min[d(X_1, x), d(X_2, x)] < d(X_1, X_2) < \max[d(X_1, x), d(X_2, x)]\},$$

which represent the probabilities that the side joining $X_1$ and $X_2$ is the shortest or middle in the triangle with vertices $X_1$, $X_2$, and $x$. If we consider the case in $\mathbb{R}^2$ and Euclidean distance as the distance measure, given $X_1$ and $X_2$, we can form two circles, each having one of the points as the center and the other on the circle, as shown in Figure 1. The radiuses of both circles are equal to the Euclidean distance between $X_1$ and $X_2$, $d(X_1, X_2)$. We denote region $k$ ($k = 1, 2, 3$) in Figure 1 by $B_k(X_1, X_2)$.
Then the probability $p_k$ ($k = 1, 2, 3$) is equivalent to the probability of $x$ falling into $B_k(X_1, X_2)$. Similarly,

$$\Pr\{d(X_1, X_2) = d(X_1, x) > d(X_2, x)\} + \Pr\{d(X_1, X_2) = d(X_2, x) > d(X_1, x)\}$$

calculates the probability of $x$ falling on the boundary between $B_1(X_1, X_2)$ and $B_3(X_1, X_2)$, and

$$\Pr\{d(X_1, X_2) = d(X_1, x) = d(X_2, x)\}$$

calculates the probability of $x$ falling on the boundary between $B_1(X_1, X_2)$, $B_2(X_1, X_2)$, and $B_3(X_1, X_2)$. Splitting these probabilities evenly among their adjacent regions has led to the fractions 1/2 and 1/3 in the definition of the above distance-based depth. As a result, the distance-based depth $D_F(x)$ can be considered as the probability of $x$ falling into $B_1(X_1, X_2)$ and its boundary.

Given a sample $\mathbf{X} = \{X_1, \ldots, X_n\}$ in $\mathbb{R}^2$, the sample distance-based depth, $D_{F_n}(x)$, has a similar interpretation and it calculates the proportion of $B_1(X_i, X_j)$ ($i, j = 1, \ldots, n$, $i \neq j$) and its boundary containing $x$. For any point $x$ in $\mathbb{R}^2$, if $x$ is near the center of the data cloud, $x$ should be contained in many of $B_1(X_i, X_j)$ and its boundary generated from the sample. On the other hand, if $x$ is relatively near the outskirts, we would expect that $x$ is contained by only a few of $B_1(X_i, X_j)$ and its boundary. In higher dimensions or with other distance measures being used, the value of the above depth has similar interpretations. Therefore, the above notion of depth provides a reasonable measure of "depth" of $x$ w.r.t. the data cloud $\{X_1, \ldots, X_n\}$.

Because any distance measure can be used in the above definition of distance-based depth, it can be directly applied to our species abundance data using any desired distance measures between observations. Based on this distance-based depth, for any given abundance data sample $\{X_1, \ldots, X_n\}$, we can calculate the depth values $D_{F_n}(X_i)$, and then order the $X_i$'s according to their descending depth values. This gives rise to a natural center-outward ordering of the sample points. As an example and for demonstration purposes, we assume that there are only two species in the species assemblage. The counts of the two species from 100 sampling units are generated from a bivariate Poisson-lognormal distribution (Aitchison and Ho, 1989), where the sample is drawn from a bivariate Poisson with mean $(\lambda_1, \lambda_2)$ being random draws from bivariate lognormal distribution. To facilitate the exposition, we denote the general multivariate Poisson-lognormal distribution as $PL(\mu, \Sigma)$, where $\mu$ and $\Sigma$ are the parameters of the multivariate lognormal distribution. In ecology, for this type of data, Euclidean distance is generally not considered appropriate. Instead, measures such as Bray–Curtis distance (Bray and Curtis, 1957) are preferred. The Bray–Curtis distance for sample points $X_l = (X_{l1}, X_{l2}, \ldots, X_{lp})^{\top}$ and $X_{l'} = (X_{l'1}, X_{l'2}, \ldots, X_{l'p})^{\top}$ is defined as

$$d_{ll'} = \frac{\sum_{k=1}^{p} |X_{lk} - X_{l'k}|}{\sum_{k=1}^{p} (X_{lk} + X_{l'k})},$$

and $d_{ll'} = 0$ if both $X_l$ and $X_{l'}$ equal $\mathbf{0}_p$, where $\mathbf{0}_p$ is the vector of $p$ zeros. Figure 2 shows the simulated data ordering based on the distance-based depth when Bray–Curtis distance is used. In the plot, "+" marks the deepest 20% of the observations.

Figure 2. A bivariate Poisson-lognormal sample with the 20% deepest points.

3. DD-plot: A Graphical Comparison of Species Assemblages

In this section, we demonstrate how the so-called DD-plot (depth versus depth plot) can be used to provide a graphical tool for comparisons of species assemblages. The DD-plot was first introduced by Liu et al. (1999) for graphical comparisons of two continuous multivariate distributions. Based on our newly adopted distance-based depth in Section 2, the DD-plot can now be directly applied to our species abundance data.

Figure 3. DD-plots (horizontal axis $D_{F_m}(z)$, vertical axis $D_{G_n}(z)$): (a) $F = G = PL(\mathbf{1}_{10}, I_{10})$; (b) $F = PL(\mathbf{1}_{10}, I_{10})$ and $G = PL(2 \cdot \mathbf{1}_{10}, I_{10})$; (c) $F = PL(\mathbf{1}_{10}, I_{10})$ and $G = PL(\mathbf{1}_{10}, 2I_{10})$; and (d) $F = PL(\mathbf{1}_{10}, I_{10})$ and $G = PL(\mathbf{1}_{10}, 0.8\,\mathbf{1}_{10}\mathbf{1}_{10}^{\top} + 0.2\,I_{10})$. In all the plots, the circles represent the observations from $F$ and the pluses represent the observations from $G$.

Let $\{X_1, \ldots, X_m\}$ ($\equiv \mathbf{X}$) and $\{Y_1, \ldots, Y_n\}$ ($\equiv \mathbf{Y}$) be the abundance data from two species assemblages, respectively. The DD-plot is constructed by

$$DD(F_m, G_n) = \{(D_{F_m}(z), D_{G_n}(z)),\ z \in \mathbf{X} \cup \mathbf{Y}\}, \qquad (1)$$

where $D_{F_m}(z)$ and $D_{G_n}(z)$ are the sample distance-based depths w.r.t.
samples $\mathbf{X}$ and $\mathbf{Y}$, respectively.

From the construction of the above DD-plot, we can see that if the distributions of the abundance data from the two species assemblages are the same, all the data points in the DD-plot should be concentrated along the 1:1 correspondence line as shown in Figure 3a. Here the abundance data $\mathbf{X}$ and $\mathbf{Y}$ from the two species assemblages are generated from the same distribution $PL(\mathbf{1}_{10}, I_{10})$, where $\mathbf{1}_d$ is a vector of $d$ ones, and $I_d$ is the $d$-dimensional identity matrix. If the two species assemblages are different, the DD-plot would exhibit a noticeable departure from the 1:1 correspondence line as shown in Figure 3b–d. Here the abundance data $\mathbf{X}$ and $\mathbf{Y}$ from the two species assemblages are generated from two different distributions. More specifically, $\mathbf{X}$ is generated from $PL(\mathbf{1}_{10}, I_{10})$ in all the plots, whereas $\mathbf{Y}$ is generated from $PL(2 \cdot \mathbf{1}_{10}, I_{10})$, $PL(\mathbf{1}_{10}, 2I_{10})$, and $PL(\mathbf{1}_{10}, 0.8\,\mathbf{1}_{10}\mathbf{1}_{10}^{\top} + 0.2\,I_{10})$, respectively. To make the difference between the two samples more visible, unlike the DD-plot originally used in Liu et al. (1999), where the observations from different samples were not distinguished, we use different symbols to indicate different memberships of the observations in the DD-plot. For example, in all the plots in Figure 3, the circles represent the observations from $\mathbf{X}$, and the pluses represent the observations from $\mathbf{Y}$. In all the plots, Bray–Curtis distance is used in calculating the distance-based depths, and $m$ and $n$ are set as 100.

In general, if the distributions of abundance data from the two species assemblages mainly differ in location, the DD-plot would have a leaf-shaped figure as the one in Figure 3b, because the deepest point with respect to one sample will not be the deepest point with respect to the other sample and therefore will have relatively smaller depth value with respect to that sample.
If the two distributions mainly have different scales, for example, $G$ is more spread out than $F$, then the depth of any point with respect to $G$ would be no less than its depth with respect to $F$. In such a case, the DD-plot would have an early-half-moon-shaped figure arching above the diagonal line as the one in Figure 3c. How other distributional differences are associated with particular patterns of deviation from the 1:1 correspondence line in the DD-plot can be interpreted in a similar way.

As we can see from the above plots, the DD-plot based on the distance-based depth provides a simple diagnostic tool for visual comparison of two species assemblages.

4. Tests of Homogeneity of Species Assemblages

We again denote the abundance data from two species assemblages by $\{X_1, \ldots, X_m\}$ and $\{Y_1, \ldots, Y_n\}$. We assume that they are random samples from the underlying distributions $F$ and $G$, respectively. The comparison of the two species assemblages can be formulated as the following hypothesis testing problem,

$$H_0: F = G \quad \text{vs.} \quad H_1: F \neq G. \qquad (2)$$

As noted in the previous section, when the two species assemblages are identical, i.e., $F = G$, we would expect all the points in the DD-plot clustered along the 1:1 correspondence line. In other words, $D_{F_m}(z)$ and $D_{G_n}(z)$ should be approximately the same for all the observations from the pooled sample $\mathbf{X} \cup \mathbf{Y}$. If there is a difference between the two species assemblages, $D_{F_m}(z)$ and $D_{G_n}(z)$ would be different from each other. Therefore, the difference between $D_{F_m}(z)$ and $D_{G_n}(z)$ from all of the observations can be used as an indicator of heterogeneity of the two species assemblages.
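The quantities used so far (the Bray–Curtis distance, the sample distance-based depth of Section 2, and the depth pairs that make up the DD-plot in (1)) can be sketched in Python as follows. This is a minimal, unoptimized illustration and not the authors' implementation: the function names `bray_curtis`, `sample_depth`, and `dd_pairs` are hypothetical, ties are detected by exact equality of the computed distances (an assumption that may need a tolerance for some distance measures), and each depth evaluation loops over all pairs in the reference sample.

```python
from itertools import combinations


def bray_curtis(x, y):
    """Bray-Curtis distance between two count vectors (0 if both are all zeros)."""
    total = sum(a + b for a, b in zip(x, y))
    if total == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(x, y)) / total


def sample_depth(x, sample, dist=bray_curtis):
    """Sample distance-based depth of x w.r.t. `sample`.

    A pair (X_i, X_j) counts 1 if the side X_i X_j is strictly the longest
    side of the triangle (X_i, X_j, x), 1/2 if it ties with exactly one
    other side for longest, and 1/3 if all three sides are equal.
    """
    score = 0.0
    n_pairs = 0
    for xi, xj in combinations(sample, 2):
        n_pairs += 1
        d_ij, d_i, d_j = dist(xi, xj), dist(xi, x), dist(xj, x)
        if d_ij > max(d_i, d_j):
            score += 1.0
        elif d_ij == d_i == d_j:
            score += 1.0 / 3.0
        elif d_ij == max(d_i, d_j):
            score += 0.5
    return score / n_pairs


def dd_pairs(X, Y, dist=bray_curtis):
    """Depth pairs (D_Fm(z), D_Gn(z)) for every z in the pooled sample."""
    pooled = list(X) + list(Y)
    return [(sample_depth(z, X, dist), sample_depth(z, Y, dist)) for z in pooled]
```

Plotting the first coordinate of each pair against the second, with one symbol for the points that came from `X` and another for those from `Y`, gives a DD-plot of the kind described in Section 3.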
Motivated by this observation, we propose the following two test statistics for hypothesis testing problem (2), which can be considered as a natural generalization of KS and CM tests in this species assemblage comparison context:

• KS type test statistic:

$$T_{KS} = \sup_{z \in \mathbf{X} \cup \mathbf{Y}} |D_{F_m}(z) - D_{G_n}(z)| \qquad (3)$$

• CM type test statistic:

$$T_{CM} = \sum_{z \in \mathbf{X} \cup \mathbf{Y}} [D_{F_m}(z) - D_{G_n}(z)]^2 \qquad (4)$$

Define

$$p_{KS} = P_{H_0}(T_{KS} \geq T_{KS}^{obs}) \quad \text{and} \quad p_{CM} = P_{H_0}(T_{CM} \geq T_{CM}^{obs}),$$

where $T_{KS}^{obs}$ and $T_{CM}^{obs}$ are the observed values of $T_{KS}$ and $T_{CM}$, respectively, based on the given sample $\mathbf{X} \cup \mathbf{Y}$. Then $p_{KS}$ and $p_{CM}$ are the $p$-values of the proposed two tests. To determine their values directly from the null distributions of $T_{KS}$ and $T_{CM}$ is not trivial. Instead, we use the permutation method to approximate $p_{KS}$ and $p_{CM}$. More specifically, we randomly permute the pooled sample $\mathbf{X} \cup \mathbf{Y}$ $B$ times. Here $B$ is sufficiently large. For each permutation, we treat the first $m$ elements as the $X$-sample and the remaining elements as the $Y$-sample. We denote the outcome of the $i$th permutation by $\mathbf{X}_i^* = \{X_{i1}^*, \ldots, X_{im}^*\}$ and $\mathbf{Y}_i^* = \{Y_{i1}^*, \ldots, Y_{in}^*\}$, for $i = 1, \ldots, B$. For each $\mathbf{X}_i^* \cup \mathbf{Y}_i^*$, we evaluate the corresponding $T_{KS}$ and $T_{CM}$ values (following (3) and (4)), denoted, respectively, by $T_{i,KS}^*$ and $T_{i,CM}^*$, $i = 1, \ldots, B$. Then $p_{KS}$ and $p_{CM}$ can be approximated, respectively, by

$$\hat{p}_{KS} = \frac{1 + \sum_{i=1}^{B} I\left(T_{i,KS}^* \geq T_{KS}^{obs}\right)}{B + 1}$$

and

$$\hat{p}_{CM} = \frac{1 + \sum_{i=1}^{B} I\left(T_{i,CM}^* \geq T_{CM}^{obs}\right)}{B + 1}$$

(see, e.g., Fay, Kim, and Hachey, 2007). In the following, we refer to our permutation tests based on $T_{KS}$ and $T_{CM}$ as a depth-based KS test and a depth-based CM test, respectively.

5. Simulation Study

In this section, we conduct several simulation studies to evaluate the performance of our proposed two tests.
In particular, we compare our tests with two tests available in the literature, which can also be applied to the species assemblage comparison context.

The first one is the test proposed by Nettleton and Banerjee (2001) (NB hereafter), which applied the testing procedure of Friedman and Rafsky (1979) to compare distributions of random vectors with categorical components. Let $\mathbf{Z} = \{Z_1, \ldots, Z_{m+n}\}$ denote the pooled sample $\mathbf{X} \cup \mathbf{Y}$. The NB test statistic is defined as

$$T_{NB} = \sum_{i=1}^{m+n} I\{\text{the nearest neighbor of } Z_i \text{ belongs to different sample}\},$$

where the nearest neighbor of $Z_i$ is the one which minimizes $\delta(Z_i, Z_k)$, $k = 1, \ldots, i - 1, i + 1, \ldots, m + n$, and $\delta(\cdot, \cdot)$ is any distance measure which is appropriate for the application. The test rejects $H_0: F = G$ if $T_{NB}$ is too small.

The second test we will consider was proposed by Hall and Tajvidi (2002) (HT hereafter). Again we consider the pooled sample $\mathbf{Z}$. We define $M_i(j)$ as the number of observations being from sample $\mathbf{Y}$ in the neighborhood of $X_i$, where the neighborhood is bounded by a circle with center at $X_i$ and radius as the distance between $X_i$ and its $j$th nearest neighbor. Similarly, we define $N_i(j)$ as the number of observations being from sample $\mathbf{X}$ in the neighborhood of $Y_i$, where the neighborhood is bounded by a circle with center at $Y_i$ and radius as the distance between $Y_i$ and its $j$th nearest neighbor. Under $H_0$, it can be shown that

$$E_0(M_i(j)) = \frac{nj}{m + n - 1} \quad \text{and} \quad E_0(N_i(j)) = \frac{mj}{m + n - 1}.$$

Define the deviations of $M$ and $N$ from their expected values under $H_0$ as

$$DM_i(j) = \left| M_i(j) - \frac{nj}{m + n - 1} \right| \quad \text{and} \quad DN_i(j) = \left| N_i(j) - \frac{mj}{m + n - 1} \right|.$$
The HT test statistic is then defined as

$$T_{HT} = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} DM_i(j)^{\gamma}\, w_1(j) + \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} DN_i(j)^{\gamma}\, w_2(j),$$

where $w_1(j)$ and $w_2(j)$ denote nonnegative weights and $\gamma$ is some positive value. Like the NB test, the HT test can be based on any distance measure. The test rejects $H_0: F = G$ if $T_{HT}$ is too large. Based on the simulation studies reported