Object clustering and recognition using multi-finite mixtures for semantic classes and hierarchy modeling

Taoufik Bdiri (a), Nizar Bouguila (b,*), Djemel Ziou (c)

(a) Department of Electrical and Computer Engineering, Concordia University, Montreal, QC H3G 1T7, Canada
(b) The Concordia Institute for Information Systems Engineering (CIISE), Concordia University, Montreal, QC H3G 1T7, Canada
(c) DI, Faculté des Sciences, Université de Sherbrooke, Sherbrooke, QC J1K 2R1, Canada
(*) Corresponding author. E-mail addresses: t_bdiri@encs.concordia.ca (T. Bdiri), nizar.bouguila@concordia.ca (N. Bouguila), djemel.ziou@usherbrooke.ca (D. Ziou).

Expert Systems with Applications 41 (2014) 1218–1235. http://dx.doi.org/10.1016/j.eswa.2013.08.005

Keywords: Data clustering; Mixture models; Hierarchical models; Semantic clustering; Inverted Dirichlet distribution; Maximum likelihood; Visual objects

Abstract

Model-based approaches have become important tools to model data and infer knowledge. Such approaches are often used for clustering and object recognition, which are crucial steps in many applications, including but not limited to recommendation systems, search engines, cyber security, surveillance and object tracking. Many of these applications have an urgent need to reduce the semantic gap of data representation between the system level and the level understandable by human beings. Indeed, the low-level features extracted to represent a given object can be confusing to machines, which cannot differentiate between very similar objects that are trivially distinguishable by human beings (e.g. apple vs. tomato). In this paper, we propose a novel hierarchical methodology for data representation using a hierarchical mixture model. The proposed approach allows a given object class to be modeled by a set of modes deduced by the system and grouped according to labeled training data representing the human-level semantics. We have used the inverted Dirichlet distribution to build our statistical framework. The proposed approach has been validated using both synthetic data and a challenging application, namely visual object clustering and recognition. The presented model is shown to have a flexible hierarchy that can be changed on the fly at negligible computational cost.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

Available digital data has increased significantly in recent years with the intensive use of technological devices and the Internet. Due to the huge amount of such heterogeneous data, an urgent need has arisen to automate its analysis and modeling for different purposes and applications. One challenging and crucial aspect of data analysis is clustering, a form of unsupervised learning, which is defined as the process of assigning observations sharing similar characteristics to subgroups, such that their heterogeneity is minimized within a given subgroup and maximized between the subgroups. Such an assignment is not trivial, especially when we deal with high-dimensional data. Indeed, clustering is considered one of the most important aspects of artificial intelligence and data mining (Jain, Murty, & Flynn, 1999; Fisher, 1996). Given a data set from which we need to extract knowledge, the ultimate goal is to construct consistent, high-quality clusters in a computationally inexpensive way. Statistical approaches to data clustering have recently become an interesting and attractive research domain with the advancement of computational power that enables researchers to implement complex algorithms and deploy them in real-time applications. One major statistical approach is model-based clustering using finite mixture models. A finite mixture model can be defined as a weighted sum of probability distributions where each distribution represents the population of a given subgroup.
The authors in Fraley and Raftery (2002) traced the use of finite mixture models back to the 1960s and 1970s, citing, amongst others, works in Edwards and Cavalli-Sforza (1965), Day (1969) and Binder (1978). Although their use goes back at least as far as 1963, it is only in recent decades that mixture model applications started to cover many fields including, but not limited to, digital image processing and computer vision (Sefidpour & Bouguila, 2012; Stauffer & Grimson, 2000; Allili, Ziou, Bouguila, & Boutemedjet, 2010), social networks (Couronne, Stoica, & Beuscart, 2010; Handcock, Raftery, & Tantrum, 2007; Morinaga & Yamanishi, 2004), medicine (Koestler et al., 2010; Tao, Cheng, & Basu, 2010; Neelon, Swamy, Burgette, & Miranda, 2011; Schlattmann, 2009; Rattanasiri, Böhning, Rojanavipart, & Athipanyakom, 2004), and bioinformatics (Kim, Cho, & Kim, 2010; Meinicke, Ahauer, & Lingner, 2011; Ji, Wu, Liu, Wang, & Coombes, 2005).

The consideration of mixture models is practical for many applications. In many cases, however, the complexity of the observed data may render the use of one single distribution to represent a given class insufficient for inference. Many techniques have been proposed to select the number of mixture components that best represents the data. Examples include the Bayesian inference criterion (BIC) (Schwarz, 1978), minimum description length (MDL) (Rissanen, 1999) and minimum message length (MML) (Wallace, 2005) criteria. These criteria are mainly used in unsupervised algorithms where the system handles data modeling without any intervention during the learning process. This approach has a serious drawback in many applications, as the semantic meaning of the mixture modes in the selected model does not necessarily fit a human-comprehensible semantic. Consider, for instance, an object recognition application where the system has to recognize different objects according to the user's needs. An MML, BIC or MDL criterion would in most cases consider classes with important visual similarities (e.g., apple and tomato) as classes that should be represented by a single mode in the mixture. This is not always the case in real applications, where a human being or an application may need to differentiate between classes even when they have close visual properties and similarities; in that sense we talk here about a semantic meaning of the mixture modes. Therefore, the gap between the system representation and the human representation of data remains high when using these methods.
An eventual solution is to form a hierarchical model based on some ontology, as in Ziou, Hamri, and Boutemedjet (2009), Bakhtiari and Bouguila (2010), Bakhtiari and Bouguila (2011) and Bakhtiari and Bouguila (2012), where the data is grouped into clusters and subclusters (i.e., tree-structured clustering). Yet this approach still forms the model using visual similarities and groups data according to the system's choice. Furthermore, since the model is based on estimating the model's parameters as the algorithm goes deeper in the hierarchy (the distributions' parameters of the children clusters depend on the parameters of the parent clusters), a whole new estimation must take place whenever the hierarchical model changes, which increases the computational cost. User intervention to build a hierarchical mixture was introduced in Bishop and Tipping (1998), which developed the concept of hierarchical visualization, where the construction of the hierarchical tree proceeds top-down and, for each level, the user decides on the suitable number of models to fit at the next level down. This interaction may indeed yield an optimal number of clusters for each level according to the user, but it does not permit the user to define any semantic meaning for the clusters or to group the clusters as he/she needs. Moreover, the user cannot define any ontological model for the data, and a new estimation of the parameters has to be calculated at each level as the model goes deeper in the tree.

In this work, we present a novel way to model data and assign a semantic meaning to clusters according to the user's needs, which can significantly reduce the gap between the system representation and the user-level representation. We tackle the challenging problem of object clustering, and the recognition of new unseen data in terms of assignment to the appropriate clusters forming the object class. Naturally, the choice of the distribution forming the mixture model is crucial for clustering efficiency and for the accuracy of the classification of unseen data. Indeed, many works have focused on Gaussian mixture models (GMM) to build their applications, such as Permuter, Francos, and Jermyn (2003), Zivkovic (2004), Yang and Ahuja (1999) and Weiling, Lei, and Yang (2010), but recent research has shown that it is not always appropriate to assume that data follows a normal distribution. For instance, the works in Boutemedjet, Bouguila, and Ziou (2009), Bouguila, Ziou, and Vaillancourt (2004), Bouguila and Ziou (2006) and Bouguila, Ziou, and Hammoud (2009) have considered the Dirichlet and generalized Dirichlet mixture models to model proportional data, and these have been shown to outperform the GMM. In our previous work we developed the inverted Dirichlet mixture model (IDMM), which has better capabilities than the GMM when modeling positive data that occurs naturally in many real applications (Bdiri & Bouguila, 2012; Bdiri & Bouguila, 2011). Hence, we propose our new methodology using the IDMM, although it is worth bearing in mind that any other distribution can be used, as the presented framework is general.

The rest of this paper is organized as follows. In Section 2, we present our statistical framework by considering a two-level hierarchy for ease of representation and understanding of the general methodology. In Section 3, we propose a detailed approach to learn the proposed statistical model. In Section 4, we propose a generalization of our modeling framework to cover many hierarchical levels.
Section 5 is devoted to presenting the experimental results using both synthetic data and a real-life application concerning object recognition. Finally, Section 6 gives a conclusion and future perspectives for research.

2. Statistical framework: the model

We propose to develop a statistical framework that can model data in a hierarchical fashion. The attribution of a semantic meaning to the model is discussed in Section 5.2.1. In this section, we consider a two-level hierarchy where we have a set of super classes, each composed of a set of classes. The generalization of the model is discussed in Section 4. Let us consider a set $\mathcal{X}$ of N D-dimensional vectors, such that $\mathcal{X} = (\vec{X}_1, \vec{X}_2, \ldots, \vec{X}_N)$. Let M denote the number of different super classes and K_j the number of classes forming the super class j. We assume that $\mathcal{X}$ is controlled by a mixture of mixtures, such that each super class j is represented by a mixture of K_j components and the parent mixture is composed of M mixtures representing the super classes. Thus, we consider two views, or levels, for the statistical model. The first view focuses on the super classes and the second one zooms in on the classes (see Fig. 1). We suppose that the vectors follow a common but unknown probability density function $p(\vec{X}_n \mid \Xi)$, where $\Xi$ is the set of its parameters. Let $\mathcal{Z} = \{\vec{Z}_1, \vec{Z}_2, \ldots, \vec{Z}_N\}$ denote the missing group indicators, where $\vec{Z}_n = (z_{n1}, z_{n2}, \ldots, z_{nM})$ is the label of $\vec{X}_n$, such that $z_{nj} \in \{0, 1\}$, $\sum_{j=1}^{M} z_{nj} = 1$, and $z_{nj}$ is equal to one if $\vec{X}_n$ belongs to super class j and zero otherwise. Then, the distribution of $\vec{X}_n$ given the super class label $\vec{Z}_n$ is:

$$p(\vec{X}_n \mid \vec{Z}_n, \Theta) = \prod_{j=1}^{M} p(\vec{X}_n \mid \theta_j)^{z_{nj}} \quad (1)$$

where $\Theta = \{\theta_1, \theta_2, \ldots, \theta_M\}$ and $\theta_j$ is the set of parameters of the super class j. In practice, $p(\vec{X}_n \mid \Theta)$ can be obtained by marginalizing the complete likelihood $p(\vec{X}_n, \vec{Z}_n \mid \Theta)$ over the hidden variables. We define the prior distribution of $\vec{Z}_n$ as follows:

$$p(\vec{Z}_n \mid \vec{\pi}) = \prod_{j=1}^{M} \pi_j^{z_{nj}} \quad (2)$$

where $\vec{\pi} = (\pi_1, \ldots, \pi_M)$, $\pi_j > 0$ and $\sum_{j=1}^{M} \pi_j = 1$. Then we have:

$$p(\vec{X}_n, \vec{Z}_n \mid \Theta, \vec{\pi}) = p(\vec{X}_n \mid \vec{Z}_n, \Theta)\, p(\vec{Z}_n \mid \vec{\pi}) = \prod_{j=1}^{M} \left( p(\vec{X}_n \mid \theta_j)\, \pi_j \right)^{z_{nj}} \quad (3)$$

We proceed by marginalizing Eq. (3) over the hidden variable (see Appendix A), so the first level of our mixture for a given vector $\vec{X}_n$ can be written as follows:

$$p(\vec{X}_n \mid \Theta, \vec{\pi}) = \sum_{j=1}^{M} p(\vec{X}_n \mid \theta_j)\, \pi_j \quad (4)$$

Thus, according to the previous equation, the set of parameters $\Xi$ corresponding to the first level is $\Xi = (\Theta, \vec{\pi})$.
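For illustration only (this is not part of the paper, and the names such as super_class_pdfs are ours), the first-level density of Eq. (4) amounts to a weighted sum of the super-class densities:

```python
import numpy as np

def first_level_density(x, pi, super_class_pdfs):
    """Eq. (4): p(x | Theta, pi) = sum_j pi_j * p(x | theta_j).

    x                : one D-dimensional observation
    pi               : length-M array of super-class weights (sums to one)
    super_class_pdfs : list of M callables, each returning p(x | theta_j)
    """
    return float(np.dot(pi, [pdf(x) for pdf in super_class_pdfs]))
```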
When we examine the second level, which considers the classes, given that $\vec{X}_n$ is generated from the mixture j, we suppose that it is also generated from one of the K_j modes of the mixture j. Thus, we consider $\mathcal{Y}_j = \{\vec{Y}_{1j}, \vec{Y}_{2j}, \ldots, \vec{Y}_{Nj}\}$, which denotes the missing group indicators, where the i-th element $y_{nji}$ of $\vec{Y}_{nj}$ is equal to one if $\vec{X}_n$ belongs to the class i of the super class j and zero otherwise. Let $\{\vec{Y}_{nj}\} = \{\vec{Y}_{n1}, \vec{Y}_{n2}, \ldots, \vec{Y}_{nM}\}$ denote the class labels of $\vec{X}_n$. Then, the distribution of $\vec{X}_n$ given the super class label $\vec{Z}_n$ and the class labels $\{\vec{Y}_{nj}\}$ is:

$$p(\vec{X}_n \mid \vec{Z}_n, \{\vec{Y}_{nj}\}, \Theta, \alpha) = p(\vec{X}_n \mid \vec{Z}_n, \{\vec{Y}_{nj}\}, \alpha) = \prod_{j=1}^{M} \left( \prod_{i=1}^{K_j} p(\vec{X}_n \mid \vec{\alpha}_{ji})^{y_{nji}} \right)^{z_{nj}} \quad (5)$$

where $\alpha = \{\vec{\alpha}_{ji}\}$, $j = 1, \ldots, M$, $i = 1, \ldots, K_j$, is the set of parameters of the modes representing the different classes. It is noteworthy that K_j depends on the number of classes that the super class j contains. We define the prior distributions of $\vec{Y}_{nj}$ and $\{\vec{Y}_{nj}\}$ as follows:

$$p(\vec{Y}_{nj} \mid \vec{Z}_n, \{w_{ji}\}, \vec{\pi}) = p(\vec{Y}_{nj} \mid \vec{Z}_n, \{w_{ji}\}) = \prod_{i=1}^{K_j} w_{ji}^{y_{nji}} \quad (6)$$

$$p(\{\vec{Y}_{nj}\} \mid \vec{Z}_n, \{w_{ji}\}, \vec{\pi}) = p(\{\vec{Y}_{nj}\} \mid \vec{Z}_n, \{w_{ji}\}) = \prod_{j=1}^{M} \left( \prod_{i=1}^{K_j} w_{ji}^{y_{nji}} \right)^{z_{nj}} \quad (7)$$

where $w_{ji} > 0$, $\sum_{j=1}^{M} \sum_{i=1}^{K_j} w_{ji} = 1$, and $\{w_{ji}\}$ is the set of the modes' mixing weights with $j = 1, \ldots, M$ and $i = 1, \ldots, K_j$. Then we have:

$$p(\vec{X}_n, \vec{Z}_n, \{\vec{Y}_{nj}\} \mid \alpha, \{w_{ji}\}, \vec{\pi}) = p(\vec{X}_n \mid \vec{Z}_n, \{\vec{Y}_{nj}\}, \alpha)\, p(\{\vec{Y}_{nj}\} \mid \vec{Z}_n, \{w_{ji}\})\, p(\vec{Z}_n \mid \vec{\pi}) = \prod_{j=1}^{M} \pi_j^{z_{nj}} \left( \prod_{i=1}^{K_j} \left( w_{ji}\, p(\vec{X}_n \mid \vec{\alpha}_{ji}) \right)^{y_{nji}} \right)^{z_{nj}} \quad (8)$$

We proceed by marginalizing Eq. (8) over the hidden variables (see Appendix A), so the second-level mixture can be written as:

$$p(\vec{X}_n \mid \alpha, \{w_{ji}\}, \vec{\pi}) = \sum_{j=1}^{M} \pi_j \left( \sum_{i=1}^{K_j} p(\vec{X}_n \mid \vec{\alpha}_{ji})\, w_{ji} \right) \quad (9)$$

Thus, according to the previous equation, the set of parameters $\Xi$ corresponding to the second level is $\Xi = (\alpha, \{w_{ji}\}, \vec{\pi})$.
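To make the mixture-of-mixtures structure of Eq. (9) concrete, here is a minimal sketch (illustrative names, not the authors' code) that nests the per-mode densities under their super classes:

```python
def second_level_density(x, pi, w, mode_pdfs):
    """Eq. (9): p(x) = sum_j pi_j * sum_i w_ji * p(x | alpha_ji).

    pi        : length-M sequence of super-class weights
    w         : list of M sequences; w[j][i] is the weight of mode (j, i)
                (all w[j][i] together sum to one)
    mode_pdfs : list of M lists of callables; mode_pdfs[j][i](x) = p(x | alpha_ji)
    """
    total = 0.0
    for j in range(len(pi)):
        inner = sum(w[j][i] * mode_pdfs[j][i](x) for i in range(len(w[j])))
        total += pi[j] * inner
    return total
```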
3. Model learning

3.1. Log-likelihood of the complete data

The model for the complete data $\langle \mathcal{X}, \mathcal{Z}, \{\mathcal{Y}_j\} \rangle$ is given by:

$$p(\mathcal{X}, \mathcal{Z} \mid \Theta, \vec{\pi}) = \underbrace{\prod_{n=1}^{N} \prod_{j=1}^{M} p(\vec{X}_n \mid \theta_j)^{z_{nj}}\, \pi_j^{z_{nj}}}_{\text{first level}} = \underbrace{\prod_{n=1}^{N} \prod_{j=1}^{M} \left( \prod_{i=1}^{K_j} \left( p(\vec{X}_n \mid \vec{\alpha}_{ji})\, w_{ji} \right)^{y_{nji}} \right)^{z_{nj}} \pi_j^{z_{nj}}}_{\text{second level}} = p(\mathcal{X}, \mathcal{Z}, \{\mathcal{Y}_j\} \mid \alpha, \{w_{ji}\}, \vec{\pi}) \quad (10)$$

We maximize the log-likelihood instead of the likelihood. The complete-data log-likelihood is given by:

$$\Phi_c(\mathcal{X}, \mathcal{Z}, \{\mathcal{Y}_j\} \mid \alpha, \{w_{ji}\}, \vec{\pi}) = \log p(\mathcal{X}, \mathcal{Z}, \{\mathcal{Y}_j\} \mid \alpha, \{w_{ji}\}, \vec{\pi}) = \sum_{n=1}^{N} \sum_{j=1}^{M} \sum_{i=1}^{K_j} \left[ z_{nj} \log \pi_j + y_{nji} \left( \log w_{ji} + \log p(\vec{X}_n \mid \vec{\alpha}_{ji}) \right) \right] \quad (11)$$

Let us recall that $\alpha = \{\vec{\alpha}_{ji}\}$, $j = 1, \ldots, M$, $i = 1, \ldots, K_j$, such that $\vec{\alpha}_{ji}$ is the set of parameters of the inverted Dirichlet distribution for the class i of the super class j. In order to estimate the parameters, we use the expectation-maximization (EM) algorithm, which proceeds iteratively in two steps: the expectation (E) step and the maximization (M) step. In the E-step, we compute the conditional expectation of $\Phi_c(\mathcal{X}, \mathcal{Z}, \{\mathcal{Y}_j\} \mid \alpha, \{w_{ji}\}, \vec{\pi})$, which reduces to the computation of the posterior probabilities (i.e. the probability that a vector $\vec{X}_n$ is assigned to a mixture j, and the probability that $\vec{X}_n$ is assigned to the mode i of j), such that (see Appendix B):

$$p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi}) = \frac{w_{ji}\, \pi_j\, p(\vec{X}_n \mid \vec{\alpha}_{ji})}{\sum_{j=1}^{M} \sum_{i=1}^{K_j} w_{ji}\, \pi_j\, p(\vec{X}_n \mid \vec{\alpha}_{ji})} \quad (12)$$

and

$$p(j \mid \vec{X}_n, \Theta, \vec{\pi}) = \frac{\pi_j\, p(\vec{X}_n \mid \theta_j)}{\sum_{j=1}^{M} \pi_j\, p(\vec{X}_n \mid \theta_j)} = \frac{\pi_j \sum_{i=1}^{K_j} w_{ji}\, p(\vec{X}_n \mid \vec{\alpha}_{ji})}{\sum_{j=1}^{M} \pi_j \sum_{i=1}^{K_j} w_{ji}\, p(\vec{X}_n \mid \vec{\alpha}_{ji})} = p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi}) \quad (13)$$

Thus, we have:

$$\log p(\mathcal{X} \mid \alpha, \{w_{ji}\}, \vec{\pi}) = \sum_{n=1}^{N} \sum_{j=1}^{M} \sum_{i=1}^{K_j} \left[ p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi}) \log \pi_j + p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi}) \left( \log w_{ji} + \log p(\vec{X}_n \mid \vec{\alpha}_{ji}) \right) \right] \quad (14)$$

Then, the conditional expectation of the complete-data log-likelihood, using Lagrange multipliers to introduce the constraints on the mixture proportions {π_j} and {w_ji}, is given by:

$$Q(\mathcal{X}, \alpha, \{w_{ji}\}, \vec{\pi}, \Lambda) = \log p(\mathcal{X} \mid \alpha, \{w_{ji}\}, \vec{\pi}) + \lambda_1 \left( 1 - \sum_{j=1}^{M} \pi_j \right) + \lambda_2 \left( 1 - \sum_{j=1}^{M} \sum_{i=1}^{K_j} w_{ji} \right) \quad (15)$$

where $\Lambda = \{\lambda_1, \lambda_2\}$ is the set of Lagrange multipliers.

Fig. 1. Two-level hierarchical model.

3.2. Parameter estimation

Although our model is general and can therefore adopt any possible probability density function, we use here the inverted Dirichlet distribution as the density model. This distribution permits multiple symmetric and asymmetric modes, and it may be skewed to the right, skewed to the left or symmetric, which gives it suitable properties for modeling different forms of positive data. If a D-dimensional positive vector $\vec{X} = (X_1, X_2, \ldots, X_D)$ follows an inverted Dirichlet distribution, the joint density function is given by Bdiri and Bouguila (2012), Bdiri and Bouguila (2011) and Tiao and Guttman (1965):

$$p(\vec{X} \mid \vec{\alpha}) = \frac{\Gamma(|\vec{\alpha}|)}{\prod_{d=1}^{D+1} \Gamma(\alpha(d))} \prod_{d=1}^{D} X_d^{\alpha(d) - 1} \left( 1 + \sum_{d=1}^{D} X_d \right)^{-|\vec{\alpha}|} \quad (16)$$

where Γ(.) is the gamma function, $X_d > 0$, $d = 1, 2, \ldots, D$, $\vec{\alpha} = (\alpha(1), \ldots, \alpha(D+1))$ is the vector of parameters, and $|\vec{\alpha}| = \sum_{d=1}^{D+1} \alpha(d)$ with $\alpha(d) > 0$, $d = 1, 2, \ldots, D+1$. The parameter estimation is based on the maximization of the log-likelihood (Eq. (15)). The maximization gives the following for the mixture weights (see Appendix B):

$$\pi_j = \frac{\sum_{n=1}^{N} p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})}{N} \quad (17)$$

$$w_{ji} = \frac{\sum_{n=1}^{N} p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})}{N} \quad (18)$$
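As an illustrative sketch of Eqs. (12) and (16) (not the authors' implementation; the function and variable names are ours), the inverted Dirichlet log-density and the E-step responsibilities can be written with standard SciPy primitives:

```python
import numpy as np
from scipy.special import gammaln

def inverted_dirichlet_logpdf(x, alpha):
    """Log of the inverted Dirichlet density of Eq. (16).

    x     : D-dimensional positive vector
    alpha : (D+1)-dimensional parameter vector, all entries > 0
    """
    x, alpha = np.asarray(x, float), np.asarray(alpha, float)
    log_norm = gammaln(alpha.sum()) - gammaln(alpha).sum()
    return (log_norm + np.dot(alpha[:-1] - 1.0, np.log(x))
            - alpha.sum() * np.log1p(x.sum()))

def e_step(X, pi, w, alphas):
    """Posterior responsibilities of Eq. (12), with the modes flattened.

    X      : (N, D) array of positive observations
    pi     : length-T array holding pi_j repeated for each mode (j, i)
    w      : length-T array of mode weights w_ji, with T = sum_j K_j
    alphas : length-T list of inverted Dirichlet parameter vectors
    Returns an (N, T) array whose rows sum to one.
    """
    log_r = np.array([[np.log(pi[t]) + np.log(w[t])
                       + inverted_dirichlet_logpdf(x, alphas[t])
                       for t in range(len(w))] for x in X])
    log_r -= log_r.max(axis=1, keepdims=True)   # numerical stability
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)
```

Column averages of these responsibilities then give the weight updates of Eq. (18), and summing the weights of the modes belonging to a super class gives the corresponding π_j of Eq. (17).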
Calculating the derivative with respect to $\vec{\alpha}_{ji}(d)$, $d = 1, \ldots, D$, and using the inverted Dirichlet distribution, we obtain:

$$\frac{\partial Q(\mathcal{X}, \alpha, \{w_{ji}\}, \vec{\pi}, \Lambda)}{\partial \vec{\alpha}_{ji}(d)} = \sum_{n=1}^{N} p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, \frac{\partial \log p(\vec{X}_n \mid \vec{\alpha}_{ji})}{\partial \vec{\alpha}_{ji}(d)} = \sum_{n=1}^{N} p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi}) \left( \Psi(|\vec{\alpha}_{ji}|) - \Psi(\vec{\alpha}_{ji}(d)) + \log \frac{X_{nd}}{1 + \sum_{d=1}^{D} X_{nd}} \right) \quad (19)$$

where Ψ(.) is the digamma function. The derivative with respect to $\vec{\alpha}_{ji}(D+1)$ is given by:

$$\frac{\partial Q(\mathcal{X}, \alpha, \{w_{ji}\}, \vec{\pi}, \Lambda)}{\partial \vec{\alpha}_{ji}(D+1)} = \sum_{n=1}^{N} p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi}) \left( \Psi(|\vec{\alpha}_{ji}|) - \Psi(\vec{\alpha}_{ji}(D+1)) + \log \frac{1}{1 + \sum_{d=1}^{D} X_{nd}} \right) \quad (20)$$

Looking at the previous two equations, it is clear that an explicit form of the solution to estimate $\vec{\alpha}_{ji}$ does not exist. Thus, we resort to the Newton-Raphson method, expressed as:

$$\vec{\alpha}_{ji}^{\,\text{new}} = \vec{\alpha}_{ji}^{\,\text{old}} - H_{ji}^{-1} G_{ji}, \quad j = 1, \ldots, M, \quad i = 1, \ldots, K_j \quad (21)$$

where $H_{ji}$ is the Hessian matrix associated with $Q(\mathcal{X}, \alpha, \{w_{ji}\}, \vec{\pi}, \Lambda)$ and $G_{ji}$ is the vector of first derivatives, $G_{ji} = \left( \frac{\partial Q(\mathcal{X}, \alpha, \{w_{ji}\}, \vec{\pi}, \Lambda)}{\partial \vec{\alpha}_{ji}(1)}, \ldots, \frac{\partial Q(\mathcal{X}, \alpha, \{w_{ji}\}, \vec{\pi}, \Lambda)}{\partial \vec{\alpha}_{ji}(D+1)} \right)^{T}$. To calculate the Hessian of $Q(\mathcal{X}, \alpha, \{w_{ji}\}, \vec{\pi}, \Lambda)$ we have to compute the second and mixed derivatives:

$$\frac{\partial^2 Q(\mathcal{X}, \alpha, \{w_{ji}\}, \vec{\pi}, \Lambda)}{\partial \vec{\alpha}_{ji}(d)^2} = \left( \Psi'(|\vec{\alpha}_{ji}|) - \Psi'(\vec{\alpha}_{ji}(d)) \right) \sum_{n=1}^{N} p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi}) \quad (22)$$

$$\frac{\partial^2 Q(\mathcal{X}, \alpha, \{w_{ji}\}, \vec{\pi}, \Lambda)}{\partial \vec{\alpha}_{ji}(d_1)\, \partial \vec{\alpha}_{ji}(d_2)} = \Psi'(|\vec{\alpha}_{ji}|) \sum_{n=1}^{N} p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi}) \quad (23)$$

where Ψ'(.) is the trigamma function. Thus,

$$H_{ji} = \sum_{n=1}^{N} p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi}) \begin{pmatrix} \Psi'(|\vec{\alpha}_{ji}|) - \Psi'(\vec{\alpha}_{ji}(1)) & \cdots & \Psi'(|\vec{\alpha}_{ji}|) \\ \Psi'(|\vec{\alpha}_{ji}|) & \Psi'(|\vec{\alpha}_{ji}|) - \Psi'(\vec{\alpha}_{ji}(2)) & \cdots \\ \vdots & \ddots & \vdots \\ \Psi'(|\vec{\alpha}_{ji}|) & \cdots & \Psi'(|\vec{\alpha}_{ji}|) - \Psi'(\vec{\alpha}_{ji}(D+1)) \end{pmatrix} \quad (24)$$

We can write:

$$H_{ji} = D_{ji} + a_{ji}\, A_{ji}^{T} A_{ji} \quad (25)$$

where

$$D_{ji} = \mathrm{diag}\!\left[ -\sum_{n=1}^{N} p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, \Psi'(\vec{\alpha}_{ji}(1)), \ldots, -\sum_{n=1}^{N} p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, \Psi'(\vec{\alpha}_{ji}(D+1)) \right] \quad (26)$$

$D_{ji}$ is a diagonal matrix, $a_{ji} = \sum_{n=1}^{N} p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, \Psi'(|\vec{\alpha}_{ji}|)$, and $A_{ji}^{T} = (a_1, \ldots, a_{D+1})$ with $a_d = 1$, $d = 1, \ldots, D+1$. Then, using Theorem 8.3.3 in Graybill (1983), the inverse matrix is given by:

$$H_{ji}^{-1} = D_{ji}^{-1} + a_{ji}^{*}\, A_{ji}^{*T} A_{ji}^{*} \quad (27)$$

$D_{ji}^{-1}$ can be easily computed, and $A_{ji}^{*}$ is given by:

$$A_{ji}^{*} = \frac{1}{\sum_{n=1}^{N} p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})} \left( \frac{1}{\Psi'(\vec{\alpha}_{ji}(1))}, \ldots, \frac{1}{\Psi'(\vec{\alpha}_{ji}(D+1))} \right) \quad (28)$$

and

$$a_{ji}^{*} = \left[ \Psi'(|\vec{\alpha}_{ji}|) \sum_{d=1}^{D+1} \frac{1}{\Psi'(\vec{\alpha}_{ji}(d))} - 1 \right]^{-1} \Psi'(|\vec{\alpha}_{ji}|) \sum_{n=1}^{N} p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi}) \quad (29)$$
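For intuition, a minimal sketch of one Newton-Raphson step (Eq. (21)) is given below; it builds the gradient of Eqs. (19) and (20) and the Hessian of Eqs. (22)-(24) explicitly and solves the linear system directly, rather than using the closed-form inverse of Eqs. (25)-(29). The array gamma stands for the per-sample posterior factors appearing in those equations; all names are ours, not the paper's.

```python
import numpy as np
from scipy.special import psi, polygamma   # digamma and its derivatives

def newton_raphson_update(alpha, X, gamma):
    """One Newton-Raphson step of Eq. (21) for the parameters of mode (j, i).

    alpha : current (D+1)-dimensional inverted Dirichlet parameters
    X     : (N, D) array of positive observations
    gamma : (N,) array of per-sample weights built from the posteriors
            of Eqs. (12) and (13)
    """
    alpha = np.asarray(alpha, dtype=float)
    D = X.shape[1]
    denom = 1.0 + X.sum(axis=1)                       # 1 + sum_d X_nd
    # log of X_nd/denom for d = 1..D, and of 1/denom for d = D+1
    log_t = np.column_stack([np.log(X / denom[:, None]), -np.log(denom)])
    s = gamma.sum()
    grad = s * (psi(alpha.sum()) - psi(alpha)) + gamma @ log_t      # Eqs. (19)-(20)
    hess = s * (polygamma(1, alpha.sum()) * np.ones((D + 1, D + 1))
                - np.diag(polygamma(1, alpha)))                     # Eqs. (22)-(24)
    return alpha - np.linalg.solve(hess, grad)                      # Eq. (21)
```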
4. Generalization of the model

4.1. A flexible hierarchical model

In the previous sections, two levels of the hierarchy were modeled with prior knowledge about how to group the classes within the super classes. In many real-life applications, however, we do not have this prior knowledge. Thus, we propose to overcome this problem by generalizing our model to several hierarchical levels. Indeed, when we change the strategy of how we look at the model, we can avoid the model selection problem at each super class level (i.e. the need for prior knowledge about the number of components that each super class must have).¹ Indeed, the second level can be viewed as a mixture of $\sum_{j=1}^{M} K_j$ components with mixture weights $w_{ji}$ such that $\sum_{j=1}^{M} \sum_{i=1}^{K_j} w_{ji} = 1$. According to Eqs. (12) and (13), we have:

$$p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi}) = \sum_{i=1}^{K_j} p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi}) \quad (30)$$

Thus, Eqs. (17), (18) and (30) give us:

$$\pi_j = \sum_{i=1}^{K_j} w_{ji} \quad (31)$$

When considering a hierarchical model with several levels, we propose to estimate the probability density parameters and the weights of the bottom level using the algorithm that we proposed in Bdiri and Bouguila (2012), and then to estimate the weight of each upper level by summing the weights of its children. This approach can be practical in real-life applications when we do not have prior knowledge about the hierarchy, as we will show in the experimental part. For instance, an object class can be modeled by a K_j-component mixture without prior knowledge about K_j. To define a super class of object classes, we simply sum their corresponding weights and move to an upper level in the hierarchy. Thus, we mainly develop two approaches: the first when we have prior knowledge about the number of clusters/classes forming a super class, and the second when we do not have such knowledge, so that we can group the clusters/classes according to a given hierarchy that can change on the fly, which is the purpose of this work.

¹ Many applications might find the prior specification of the number of components forming each super class useful. In our case, we want the system to form the classes and then group them dynamically according to the user's needs. In this paper, both approaches are proposed.
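Eqs. (30) and (31) are what make regrouping cheap: once the bottom-level modes are learned, any super class weight is simply the sum of its children's weights, so the hierarchy can be rearranged without re-estimating the densities. A tiny sketch (the class names and grouping below are hypothetical):

```python
import numpy as np

def super_class_weights(w, children):
    """Eq. (31): the weight of a super class is the sum of its children's weights.

    w        : array of bottom-level mode weights (sums to one)
    children : dict mapping each super-class name to the indices of its modes
    """
    return {name: float(np.sum(w[idx])) for name, idx in children.items()}

# Example: regrouping 5 modes under 2 super classes
w = np.array([0.25, 0.15, 0.20, 0.30, 0.10])
print(super_class_weights(w, {"fruit": [0, 1, 2], "vegetable": [3, 4]}))
```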
4.2. Learning algorithm

To initialize the EM algorithm we use the approach proposed in Bdiri and Bouguila (2012), based on the K-Means algorithm and the method of moments. As we have two approaches for building the hierarchical model, we present two algorithms: the first when we have the required prior knowledge about the super classes, and the second when we do not have such knowledge.

4.2.1. Learning in the presence of prior knowledge

When we know in advance the number of components forming each super class, we have two ways to initialize the parameters. The first way is to let the system group the clusters based on similarities: we use the K-Means algorithm to obtain M clusters representing the super classes and then reuse K-Means to determine the appropriate classes K_j of each super class. The second way is to use K-Means to obtain $\sum_{j=1}^{M} K_j$ clusters and then group them within each super class j.

Initialization algorithm
1. Either:
   – Grouping by the system: apply K-Means on the N D-dimensional vectors to obtain M initial super classes, and reapply it on each super class j to obtain K_j clusters.
   – Grouping by the user: apply K-Means on the N D-dimensional vectors to obtain $\sum_{j=1}^{M} K_j$ initial clusters, and then group the chosen clusters K_j for each super class j.
2. Weights initialization:
   – Calculate $w_{ji}$ = (number of elements in cluster ji) / N.
   – Calculate $\pi_j$ = (number of elements in super class j) / N.
3. Apply the method of moments described in Bdiri and Bouguila (2012) for each component ji to obtain a vector of parameters corresponding to the given cluster ji.

Then, the estimation algorithm can be summarized as follows.

Estimation algorithm
1. INPUT: D-dimensional data $\vec{X}_n$, $n = 1, \ldots, N$, a specified number of super classes M, and the number of classes K_j in each super class j.
2. Run the initialization algorithm.
3. E-step: compute the posterior probabilities using Eqs. (12) and (13).
4. M-step:
   – Update $\vec{\alpha}_{ji}$ using Eq. (21), j = 1, ..., M, i = 1, ..., K_j.
   – Update $w_{ji}$ using Eq. (18) and $\pi_j$ using Eq. (31), j = 1, ..., M, i = 1, ..., K_j.
5. If the convergence test $\Delta p(\mathcal{X} \mid \alpha, \{w_{ji}\}, \vec{\pi}) < \epsilon$ is passed, terminate; else go to step 3.

Here $\Delta p(\mathcal{X} \mid \alpha, \{w_{ji}\}, \vec{\pi})$ is the difference between the likelihoods calculated in two consecutive iterations.

4.2.2. Dynamic hierarchical grouping without prior knowledge about the super classes

For many applications, the hierarchical model may have to be built on the fly according to the user's needs or depending on some circumstances. Here, we present the algorithm proposed for this case.

Initialization algorithm
1. Apply K-Means on the N D-dimensional vectors to obtain $\sum_{j=1}^{M} K_j$ initial clusters.
2. Calculate $w_{ji}$ = (number of elements in cluster ji) / N.
3. Apply the method of moments for each component ji to obtain the initial parameters corresponding to each cluster.

Then, the estimation algorithm can be summarized as follows.

Estimation algorithm
1. INPUT: D-dimensional data $\vec{X}_n$, $n = 1, \ldots, N$, and a specified total number of clusters $\sum_{j=1}^{M} K_j$.
2. Run the initialization algorithm.
3. E-step: compute the posterior probabilities for the $\sum_{j=1}^{M} K_j$ components of an IDMM using the algorithm in Bdiri and Bouguila (2012).
4. M-step:
   – Update the parameters of the mixture at the lowest level according to Bdiri and Bouguila (2012), considering a finite inverted Dirichlet mixture with $\sum_{j=1}^{M} K_j$ components.
   – Update $w_t = w_{ji}$ using Eq. (18), considering the posterior probabilities calculated in step 3, j = 1, ..., M, i = 1, ..., K_j, t = 1, ..., $\sum_{j=1}^{M} K_j$.
5. If the convergence test $\Delta p(\mathcal{X} \mid \alpha, \{w_t\}) < \epsilon$ is passed, go to step 6; else go to step 3. Here $p(\mathcal{X} \mid \alpha, \{w_t\})$ is the likelihood of the considered IDMM having $\sum_{j=1}^{M} K_j$ components, with the set of parameters $\{\alpha, \{w_t\}\}$, t = 1, ..., $\sum_{j=1}^{M} K_j$.
6. For each level of the hierarchy, compute $\pi_j = \sum_{i=1}^{K_j} w_{ji}$ according to a given ontological model.

Using this algorithm, we do not have to specify K_j for each super class j; we rather specify the total number $\sum_{j=1}^{M} K_j$ of modeled classes at the lowest level of the hierarchy. We then specify K_j for each super class j based on a given ontological model and the obtained results. It is noteworthy that any number of levels in the hierarchy can be built by simply constructing the weights of each level as the sums of their respective children's weights. In our application and experiments we use this algorithm.
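The outer loop of this second algorithm can be summarized by the following sketch (illustrative only; fit_idmm_em stands for the flat IDMM estimation of Bdiri and Bouguila (2012), which is not reproduced here, and the ontology dictionary is a hypothetical example):

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_dynamic_hierarchy(X, total_modes, ontology, fit_idmm_em):
    """Dynamic hierarchical grouping without prior knowledge (Section 4.2.2).

    X           : (N, D) array of positive observations
    total_modes : total number of bottom-level classes, sum_j K_j
    ontology    : dict mapping each super class to the indices of its modes
    fit_idmm_em : callable running a flat IDMM EM (cf. Bdiri & Bouguila, 2012);
                  assumed to return (alphas, w) for the bottom-level mixture
    """
    # Steps 1-2: K-Means clusters give the initial mode weights.
    labels = KMeans(n_clusters=total_modes, n_init=10).fit_predict(X)
    w_init = np.bincount(labels, minlength=total_modes) / len(X)

    # Steps 3-5: EM on the flat mixture of total_modes inverted Dirichlet modes.
    alphas, w = fit_idmm_em(X, labels, w_init)

    # Step 6: build any level of the hierarchy by summing children weights (Eq. (31)).
    pi = {name: float(np.sum(w[idx])) for name, idx in ontology.items()}
    return alphas, w, pi
```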
5. Experimental results

5.1. Synthetic data

In this section, an evaluation of our proposed algorithm is performed using synthetic data. We report results on one-dimensional and multi-dimensional synthetic data sets. We also analyze the performance of our estimation approach in the presence of outliers, and finally we analyze the capabilities of our approach in modeling overlapping classes. It is noteworthy that we suppose here that we have a certain knowledge about the grouping of the different classes into super classes, based on a given semantic model. The generation of inverted Dirichlet distribution densities according to some…
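One common way to generate synthetic inverted Dirichlet data, shown here only as an illustrative sketch (the gamma-ratio construction is our assumption, not necessarily the authors' generation procedure), is to draw independent Gamma variates and divide by the last one:

```python
import numpy as np

def sample_inverted_dirichlet(alpha, size, rng=None):
    """Draw `size` samples from an inverted Dirichlet with parameters alpha.

    Gamma-ratio construction: if G_d ~ Gamma(alpha_d, 1) independently for
    d = 1..D+1, then X_d = G_d / G_{D+1}, d = 1..D, is inverted Dirichlet.
    """
    rng = np.random.default_rng(rng)
    g = rng.gamma(shape=np.asarray(alpha, dtype=float), size=(size, len(alpha)))
    return g[:, :-1] / g[:, -1:]

# Example: 500 two-dimensional samples from one synthetic class
X = sample_inverted_dirichlet([12.0, 7.0, 15.0], size=500)
```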