A multi-stage approach for news video segmentation based on automatic anchorperson number detection

Leandro D’Anna (1), Gennaro Percannella (1), Carlo Sansone (2), Mario Vento (1)
(1) DIIIE, Università di Salerno, Via Ponte don Melillo, 1, I-84084 Fisciano (SA), Italy
(2) DIS, Università di Napoli “Federico II”, Via Claudio, 21, I-80125 Napoli, Italy
ldanna@unisa.it, pergen@unisa.it, carlo.sansone@unina.it, mvento@unisa.it

Abstract
In this paper we present an algorithm for anchor shot detection, a fundamental step for segmenting news videos into stories and a key issue for the efficient management of news-based digital libraries. The proposed algorithm creates a set of audio/video templates of anchorperson shots in an unsupervised way, then classifies shots by comparing them to all the templates when there is one anchor and to a single best template when there are two anchors. We also propose an automatic selector, based only on the audio track, that classifies a news video as presented by one or two anchorpersons. The method has been tested on a wide database, demonstrating its effectiveness.

1. Introduction
Story segmentation is a basic step towards effective news video indexing. Solutions to story segmentation can essentially be ascribed to two approaches. In the first, segmentation is accomplished by directly finding story boundaries. Such boundaries are typically obtained by looking for occurrences of a specific event (a sequence of black frames, the co-occurrence of a silence in the audio track and a shot boundary in the video track, etc.), or for an abrupt change of some feature at a high semantic level, such as a topic switch.
The main limitation of this approach is that, in the first case, the overall performance depends on the validity of the hypothesis that a story boundary is associated with a specific event in the audio or video stream, while in the second case it depends on the possibility of reliably deriving high-level semantic features.

The other approach performs story segmentation under the following news program model: given that each shot of the news video can be classified as an anchor shot or a news report shot, a story is obtained by linking each anchor shot with all successive shots until another anchor shot, or the end of the news video, occurs. Under this model, story boundaries correspond to a transition from a news report shot to an anchor shot, or from one anchor shot to another. Automatic anchor shot detection (ASD) therefore becomes the most challenging problem in partitioning a news video into stories. The main limitation of this approach is its dependence on the validity of the news story model. However, some papers [1,2] have shown that the model is valid for most TV networks. Consequently, we preferred to follow this approach, directing our efforts towards an effective solution to the ASD problem.

In the scientific literature many papers propose ASD algorithms. The majority exploit only video information [1,3,4], but in recent years the use of audio as an additional source of information for video segmentation has grown rapidly. Several proposals [5,6] use audio features to detect news boundaries directly, by means of a silence or speaker change detector, in order to strengthen or weaken the boundaries provided by the analysis based on video ASD techniques.
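The news program model described above (each story starts at an anchor shot and includes all successive shots up to the next anchor shot) can be illustrated with a minimal sketch. It assumes that the shot labels are already available as a list of booleans; the function name `segment_stories` and the choice to leave shots that precede the first anchor shot unassigned are illustrative assumptions, not part of the paper.

```python
from typing import List, Tuple


def segment_stories(is_anchor: List[bool]) -> List[Tuple[int, int]]:
    """Group shot indices into stories under the news program model.

    is_anchor[i] is True when shot i has been classified as an anchor shot.
    A story starts at an anchor shot and runs until the shot before the next
    anchor shot, or until the end of the video. Returns inclusive
    (first_shot, last_shot) index pairs.

    Assumption: shots before the first anchor shot belong to no story.
    """
    stories: List[Tuple[int, int]] = []
    start = None  # index of the anchor shot that opened the current story
    for i, anchor in enumerate(is_anchor):
        if anchor:
            if start is not None:
                # A new anchor shot closes the story that is currently open.
                stories.append((start, i - 1))
            start = i
    if start is not None:
        # The last story extends to the end of the news video.
        stories.append((start, len(is_anchor) - 1))
    return stories


labels = [True, False, False, True, True, False]
print(segment_stories(labels))
```

With these labels the boundaries fall at shots 3 and 4, that is, at a report-to-anchor transition and at an anchor-to-anchor transition, the two cases named in the text.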
As discussed in [7], common drawbacks of such approaches are the use of supervised model-based techniques, which are not general enough since they require the a priori definition and construction of an anchor shot model, and the unsystematic use of audio information, which is not effective because of its incoherence with video and thus leads to misleading shot classification.

To overcome these limitations, a two-stage audio/video ASD method was proposed in [7]. In the first stage a set of templates is built in an unsupervised way. Each template represents a different anchor shot model within a video. In the second stage, a video similarity metric is used to retrieve a set of candidate anchor shots, which might have been missed by the first stage, and these are classified by evaluating their audio similarity with respect to the templates. Unlike the majority of other approaches, the method does not use audio information as-is, but performs audio-based classification only on the set of candidate shots, employing a suitably defined similarity metric.

International Conference on Mobile Ubiquitous Computing, Systems, Services and Technologies, 0-7695-2993-3/07 $25.00 © 2007 IEEE, DOI 10.1109/UBICOMM.2007.35

In particular, from the audio point of view, each template is characterized in [7] by the union of the audio tracks of all the shots summarized by that template, represented in a suitable feature space. This is the best choice if the news video is presented by only one anchorperson: in this case the maximum possible information is used, so the audio similarity is evaluated in the most reliable manner. A different criterion for building audio templates should instead be used when a news video is presented by two anchorpersons.
In this case, the audio similarity between a candidate anchor shot and each template should be evaluated using only the audio track of the closest shot (from the audio point of view) among all those represented by the template. The reason is that a template can also summarize shots in which both anchorpersons appear in the video. For such shots, part of the audio track can refer to one anchorperson and another part to the other. Therefore, if the union of all the audio portions of such shots is used to build an audio template, the second stage of the method proposed in [7] is likely to give rise to misclassifications.

If the number of anchorpersons is known, the correct criterion can be chosen a priori. This happens, for example, when a broadcaster employs the proposed method to analyze all the editions of its own news videos. In the general case, however, archiving companies work with large quantities of videos from different sources; moreover, as stated before, we are interested in developing a completely unsupervised method. So, in this paper we also propose an automatic selector, based only on the audio track, that classifies a news video as presented by one or two anchorpersons. After this classification, the more suitable criterion can be employed for building the audio templates used within the approach presented in [7].

We tested the overall approach on a significant database made up of several news video editions from one of the main Italian broadcasters. We also compared our algorithm with the one proposed in [7], achieving a significant performance improvement.
The paper is organized as follows: Section 2 describes the approach; Section 4 presents the database used and the tests carried out to assess the performance of the proposed algorithm; finally, Section 5 draws some conclusions.

2. The proposed approach
As anticipated in the Introduction, the proposed approach is based on the algorithm in [7]. In this section we first provide a concise description of the original algorithm, then give the details of the proposed modifications.

The method in [7] performs a two-stage analysis that achieves ASD in an unsupervised way. The first stage extracts a set of audio/video templates from the news video under analysis, thus avoiding any training procedure or manual definition of the templates. Figure 1 summarizes the complete template extraction algorithm.

1) Construct a complete graph $G$ such that:
   a) its nodes $Kf_i$ correspond to the key-frames of each shot;
   b) each of its edges $e_{ij} = (Kf_i, Kf_j)$ is characterized by a weight $w_{ij} = d(Kf_i, Kf_j)$, where $d(Kf_i, Kf_j)$ is calculated on the basis of the color histogram differences between $Kf_i$ and $Kf_j$;
2) determine the minimum spanning tree (MST) of $G$;
3) remove from the MST those edges with large weights by using the Fuzzy C-Means algorithm, in order to create shot clusters;
4) remove clusters with only one node;
5) remove all clusters with lifetime lower than a threshold $\delta$;
6) extract 3 frames from each shot $S_j$ belonging to the set of remaining clusters;
7) remove from clusters those shots which have fewer than 2 frames containing a face;
8) apply steps 4) and 5) to the remaining clusters;
9) for each remaining cluster build the set $N_{As}$ by extracting all anchor shots from that cluster;
10) extract a unique key-frame from each cluster and the whole audio track from the set $N_{As}$, giving rise to the set of audio/video anchor shot templates.
Figure 1. The proposed template extraction algorithm.

In the second stage a set of candidate anchor shots is selected by means of a metric that measures the video similarity with respect to the templates; the candidate anchor shots are then validated by exploiting both audio similarity and the presence of faces.

The proposed method differs from the one described above in the way it performs the audio shot classification (Fig. 2, step 7). The original method computes, for each shot to be classified, the value of a similarity index, the D-index, which expresses the similarity between a shot and a single audio template.

Candidate selection
1) Select a template;
2) compare all the discarded shots with the template;
3) build the candidates list $L_c$ of the three most similar shots (according to a suitably defined similarity metric), sorted in descending order;
4) repeat steps 1), 2) and 3) for every template and refresh $L_c$;
Classification
5) extract audio feature vectors for all shots in $N_{As}$ and $L_c$;
6) remove unvoiced and silent segments by means of a masking module;
7) calculate $D_i$ for each candidate shot $S_i$ by comparison with the audio templates and make the audio classification decision for $S_i$;
8) apply face detection to the candidates;
9) apply the AND rule between face detection and audio classification to reach the final decision.
Figure 2. The shot classification algorithm.

Audio classification is based on a set of 48 features (20 MFCC, 14 LPCC, 14 PF, see [9] for further details) extracted from each frame in the audio track of the shots. Here a frame is a segment of 1024 audio samples. Each shot $S_i$ is considered as a cluster of audio feature vectors, so it can be represented by its centroid $C_i$.
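The centroid representation of a shot can be sketched as follows. This is only an illustration under stated assumptions: frames are taken as non-overlapping and a trailing partial frame is dropped (the text only gives the frame length of 1024 samples), and `toy_features` is a hypothetical two-dimensional stand-in for the 48-dimensional MFCC/LPCC/PF vector, which is not reproduced here.

```python
from typing import Callable, List, Sequence

FRAME_LEN = 1024  # audio samples per frame, as stated in the text


def split_frames(samples: Sequence[float], frame_len: int = FRAME_LEN) -> List[Sequence[float]]:
    """Split an audio track into consecutive frames of frame_len samples.

    Assumption: frames do not overlap and a trailing partial frame is dropped.
    """
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def centroid(vectors: List[List[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length feature vectors."""
    if not vectors:
        raise ValueError("at least one feature vector is required")
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]


def shot_centroid(samples: Sequence[float],
                  extract_features: Callable[[Sequence[float]], List[float]]) -> List[float]:
    """Represent a shot by the centroid C_i of its per-frame feature vectors."""
    return centroid([extract_features(frame) for frame in split_frames(samples)])


def toy_features(frame: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for the real feature extractor: mean and mean magnitude."""
    n = len(frame)
    return [sum(frame) / n, sum(abs(x) for x in frame) / n]


# Two full frames plus 10 leftover samples that are dropped.
samples = [1.0] * 1024 + [3.0] * 1024 + [5.0] * 10
print(shot_centroid(samples, toy_features))
```

Any real feature extractor with a fixed output dimension can be passed in place of `toy_features`; the centroid computation itself does not depend on the feature type.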
For a generic pair of shots $(S_m, S_k)$ belonging to the final set of anchor shot clusters $N_{As}$, we define:

$$D_{m,k} = 1 - \frac{d(C_m, C_k)}{d_{max}}$$

where $d(C_m, C_k)$ is the Euclidean distance between $C_m$ and $C_k$, while $d_{max}$ is the diameter of the cluster of centroids of the shots belonging to $N_{As}$. Consequently, $d_{max}$ can be regarded as the diameter of the set of templates. Given a generic shot $S_i$ to be classified, we calculate its D-index, $D_i$, as the average of all the $D_{i,k}$ obtained by considering all the shots $S_k$ in the set $N_{As}$. If $D_i > 0$, the average distance between the shot $S_i$ and the set of templates is lower than the diameter $d_{max}$ of the set of templates, so $S_i$ is classified as an anchor shot. Conversely, if $D_i < 0$ the shot $S_i$ is sufficiently far from the cluster of templates, so it is discarded as a news report shot.

Since $D_i$ is calculated on the whole set of anchor shot clusters, this approach provides good results only on news editions with a single anchor. When two or more anchorpersons are present in the news video, it can give rise to misclassifications, as explained in the previous section. To overcome this limitation we propose to modify the definition of the D-index. Our idea is to associate the candidate shot under analysis with the template that is most similar from the audio point of view. According to this idea and with reference to Fig. 3, we calculate the D-index of the $m$-th candidate shot as:

$$D_m = \max_r D_m^r$$

where $D_m^r$ is the similarity of the $m$-th candidate shot to the $r$-th audio template:

$$D_m^r = \mathrm{avg}_p \, D_{S_m S_p^r}$$

The similarity of the $m$-th candidate shot to the $p$-th anchor shot within the $r$-th audio template is expressed by:

$$D_{S_m S_p^r} = 1 - \frac{d(C_m, C_p^r)}{d^r_{max}}$$

where $d(C_m, C_p^r)$ is the Euclidean distance between the centroids $C_m$ and $C_p^r$, while $d^r_{max}$ is the diameter of the $r$-th audio template.

Figure 3. Audio similarity index (D-index) in case of multiple templates.

Note that the proposed way of calculating the D-index is better suited to the case of multiple speakers; on the other hand, lower performance may be obtained when a single anchor is present in the news video. In the latter case the original definition of the D-index provides a more robust estimate of the audio similarity of the candidate shot with respect to the template, since it uses all the available audio information.

If the number of anchorpersons is known, the correct criterion can be chosen a priori. Since we are interested in developing a completely unsupervised method, in this paper we also propose an automatic selector, based only on the audio track, that classifies a news video as presented by one or two anchorpersons. After this classification, the more suitable criterion can be employed for building the audio templates.

The proposed method is based on the observation that each speaker is typically characterized by a specific distribution of the fundamental frequency ($f_0$): the idea is to verify whether the $f_0$ distribution calculated on all the audio samples belonging to $N_{As}$ can be attributed to a single speaker.
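As an illustration of the two D-index definitions given earlier in this section, the sketch below compares the original index (average over all anchor shots in $N_{As}$) with the proposed one (maximum over the per-template averages). It assumes that the diameter of a set of centroids is its largest pairwise Euclidean distance and that every template contains at least two distinct centroids; the function names and the toy 2-D centroids are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence

Point = Sequence[float]


def diameter(centroids: List[Point]) -> float:
    """Largest pairwise Euclidean distance (assumed meaning of 'diameter')."""
    return max((math.dist(a, b) for a, b in combinations(centroids, 2)), default=0.0)


def avg_similarity(candidate: Point, centroids: List[Point]) -> float:
    """Average of 1 - d(C_m, C_k) / d_max over the given centroids."""
    d_max = diameter(centroids)
    if d_max == 0.0:
        raise ValueError("need at least two distinct centroids")
    return sum(1.0 - math.dist(candidate, c) / d_max for c in centroids) / len(centroids)


def d_index_single(candidate: Point, anchor_centroids: List[Point]) -> float:
    """Original D-index: computed on the whole set of anchor shots."""
    return avg_similarity(candidate, anchor_centroids)


def d_index_multi(candidate: Point, templates: List[List[Point]]) -> float:
    """Proposed D-index: best per-template average similarity."""
    return max(avg_similarity(candidate, shots) for shots in templates)


# Toy data: two templates (one per anchorperson) that are far apart.
template_a = [(0.0, 0.0), (1.0, 0.0)]
template_b = [(10.0, 0.0), (11.0, 0.0)]
all_anchor_shots = template_a + template_b

report_shot = (5.0, 0.0)  # lies between the two templates
print(d_index_single(report_shot, all_anchor_shots) > 0)          # accepted as anchor shot
print(d_index_multi(report_shot, [template_a, template_b]) > 0)   # rejected
```

In this toy configuration the original index accepts a shot that is far from both anchorpersons, because the diameter of the whole set is inflated by the distance between the two templates, whereas the per-template index rejects it. This mirrors the misclassification risk discussed for videos with two anchorpersons.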
We have modelled the pdf of a single speaker as a log-logistic:

$$pdf_{loglogistic}(T) = \frac{e^{\frac{\ln T - \mu}{\sigma}}}{\sigma \, T \left(1 + e^{\frac{\ln T - \mu}{\sigma}}\right)^{2}}$$

in order to take into account also the halving and doubling errors introduced by the $f_0$ extraction algorithm. When more speakers are present within the $N_{As}$ set, the overall distribution of $f_0$ can be expected to be the sum of the distributions of the single speakers. A simple hypothesis test then allows us to verify whether the obtained distribution can be attributed to a single anchorperson or to more speakers.

Figure 4. Main steps of the proposed technique for automatic discrimination of single/multiple anchorperson news video editions.

Fig. 4 sketches the main steps of the proposed technique. We first extract $f_0$ using the autocorrelation. Then we consider only voiced pieces and use a parametric compress stage to remove the most relevant doubling and halving errors, so as to preserve the central part of the $f_0$ distribution. This stage updates the values of $f_0$ according to the following rule:

$$f_{0\_new} = \begin{cases} 2 \cdot f_{0\_old} & \text{if } f_{0\_old} \le 0.6\,\mu \\ f_{0\_old} & \text{if } 0.6\,\mu \le f_{0\_old} \le 1.4\,\mu \\ f_{0\_old} / 2 & \text{if } f_{0\_old} \ge 1.4\,\mu \end{cases}$$

where $\mu$ is the mean of the original $f_0$ distribution. Then we carry out a Maximum Likelihood Estimation to find the parameters of the assumed (log-logistic) distribution. Finally, we compare this theoretical distribution with the experimental distribution using the Kolmogorov-Smirnov hypothesis test. We use this test because it applies to all continuous distributions and gives good results even when few samples are available. If the hypothesis is accepted, we assume that there is only one speaker in the audio track; otherwise we assume that there are more speakers.

4. Experimental Results
In order to assess the performance of our approach, we collected a database composed of several videos from one of the main Italian broadcasters, namely CANALE 5.
It includes 12 news videos, 9 of which are presented by a single anchorperson. The anchorpersons are 6 different males and 3 different females. The overall duration of the videos is 6 h 27 min. The total number of anchor shots is 168, while the total number of shots is 2802.

The performance of our method was evaluated in terms of Precision and Recall [8]. The F-measure has also been used, which combines the two indexes in a single figure of merit according to the following formula:

$$F = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$

For each video of the database, Table 1 reports the performance obtained with audio templates based on the closest shot and on multiple shots. The best results in terms of F value are shown in bold. As can be seen, the method presented in [7] achieves the best performance in terms of F value for news editions with only one anchorperson, while audio templates based on the closest shot give the best Precision, Recall and F when two anchorpersons are present. In particular, as expected, for these editions the use of the closest shot for building audio templates always improves the Precision value.

[Figure 4 blocks: f0 extraction; unvoiced pieces removal; compress stage; distribution parameters estimation (MLE); hypothesis test]

Table 1. Performance of the method with audio templates based on multiple shots and on the closest shot. Best results are reported in bold.

                                    Audio templates based      Audio templates based
                                    on multiple shots          on the closest shot
# anchor-   # anchor shots
persons     / # shots               Prec.   Recall  F          Prec.   Recall  F
1           8 / 155                 1.00    0.86    0.93       1.00    0.86    0.93
1           9 / 171                 1.00    0.78    0.86       0.86    0.67    0.75
1           10 / 161                1.00    1.00    1.00       1.00    1.00    1.00
1           11 / 268                1.00    1.00    1.00       0.90    0.82    0.86
1           13 / 184                1.00    1.00    1.00       1.00    1.00    1.00
1           13 / 285                1.00    1.00    1.00       1.00    0.85    0.92
1           14 / 255                1.00    1.00    1.00       1.00    0.93    0.96
1           14 / 268                0.93    1.00    0.97       1.00    0.86    0.92
1           15 / 302                1.00    0.87    0.93       1.00    0.80    0.89
2           18 / 250                0.86    1.00    0.92       0.95    1.00    0.97
2           21 / 221                0.96    1.00    0.98       1.00    1.00    1.00
2           22 / 282                0.91    0.96    0.93       0.96    0.96    0.96

The performance on the whole dataset of the method using audio templates based on multiple shots and on the closest shot is summarized in Table 2. The same table also reports the results of the proposed method for automatically detecting the number of anchorpersons, together with those obtainable in the ideal case, i.e. when the number of anchorpersons for each video is known a priori.

The proposed method for anchorperson number detection achieves very good performance: in only one case (the seventh news video of Table 1) a video presented by a single anchorperson is incorrectly recognized as presented by two anchorpersons. The results obtained (third row of Table 2) are therefore very close to the ideal case and improve on those provided by the method proposed in [7] (first row of Table 2). In particular, the automatic anchorperson number detection method achieved 75% of the maximum obtainable improvement in F, i.e. (0.970 - 0.961) / (0.973 - 0.961).

Table 2. Overall performance obtained by using audio templates based on multiple shots and on the closest shot, compared with those obtainable with automatic and ideal anchorperson number detection.

                                                      Precision   Recall   F
Audio templates based on multiple shots [7]           0.959       0.964    0.961
Audio templates based on the closest shot             0.975       0.911    0.942
Proposed method for anchorperson number detection     0.982       0.958    0.970
Ideal anchorperson number detection                   0.982       0.964    0.973

5. Conclusions
In this paper we proposed an improved version of a state-of-the-art anchor shot classification algorithm. The contribution is twofold: on the one hand, we proposed a new way of evaluating the audio similarity that is better suited to news video editions with two speakers; on the other hand, we developed an automatic system that detects whether the news video has a single or multiple anchorpersons, which allows a single or multiple templates to be used in the audio classification phase. The method has been tested on a news video database of about 6 hours, providing good improvements with respect to the original algorithm, which was already characterized by very high performance.

References
[1] X. Gao, X. Tang, “Unsupervised Video-Shot Segmentation and Model-Free Anchorperson Detection for News Video Story Parsing”, IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 9, pp. 765-776, 2002.
[2] M. De Santo, G. Percannella, C. Sansone, M. Vento, “A Comparison of Unsupervised Shot Classification Algorithms for News Video Segmentation”, Lecture Notes in Computer Science, vol. 3138, Springer, Berlin, pp. 233-241, 2004.
[3] A. Hanjalic, R. L. Lagendijk, J. Biemond, “Semi-Automatic News Analysis, Indexing, and Classification System Based on Topics Preselection”, Proc. of SPIE, Electronic Imaging: Storage and Retrieval of Image and Video Databases, San Jose (CA), 1999.
[4] M. Bertini, A. Del Bimbo, P. Pala, “Content-Based Indexing and Retrieval of TV News”, Pattern Recognition Letters, vol. 22, pp. 503-516, 2001.
[5] S. Eickeler, S. Muller, “Content-based video indexing of TV broadcast news using Hidden Markov Models”, ICASSP ’99, pp. 2997-3000, 1999.
[6] W. Qi, L. Gu, H. Jiang, X. R. Chen, H. J. Zhang, “Integrating Visual, Audio and Text Analysis for News Video”, 7th IEEE International Conference on Image Processing, Vancouver, British Columbia, Canada, 2000.
[7] L. D’Anna, G. Marrazzo, G. Percannella, C. Sansone, M. Vento, “A Multi-stage Approach for Anchor Shot Detection”, D.-Y. Yeung et al. (Eds.), Lecture Notes in