A Robust SIFT-Based Descriptor for Video Classification

Raziyeh Salarifard, Mahshid Alsadat Hosseini, Mahmood Karimian and Shohreh Kasaei
Department of Computer Engineering, Sharif University of Technology, Tehran, Iran

ABSTRACT

The voluminous amount of video in today's world has made objective (or semi-objective) classification of videos a very popular subject. Among the various descriptors used for video classification, SIFT and LIFT can lead to highly accurate classifiers, but the SIFT descriptor does not consider video motion and LIFT is time-consuming. In this paper, a robust descriptor for semi-supervised classification based on video content is proposed. It retains the benefits of the LIFT and SIFT descriptors and overcomes their shortcomings to some extent. To extract this descriptor, the SIFT descriptor is first computed and the motion of the extracted keypoints is then employed to improve the accuracy of the subsequent classification stage. As the SIFT descriptor is scale invariant, the proposed method is also robust toward zooming. Moreover, using the global motion of keypoints helps to neglect the local motions introduced by the cameraman during capture. In comparison with other works that consider the motion content of videos, the proposed descriptor requires fewer computations. Results obtained on the TRECVID 2006 dataset show that the proposed method is about 15 percent more accurate than SIFT in content-based video classification.

Keywords: Robust Video Descriptor, SIFT, Video Classification, LIFT.

1. INTRODUCTION

With recent technological advances and the huge boost in video capturing devices, video data has grown exponentially. This calls for fast and accurate classification methods for indexing and retrieval of the data. Human-based video classification increases time and cost, which is why automatic video classification has attracted many researchers. For video classification, appropriate descriptors are extracted first and the video class is then determined based on these descriptors. The better the extracted descriptors capture the differences among various types of videos, the more accurate the classification will be. Since a video is a sequence of frames, all descriptors extractable from its frames can also be extracted from the video. In most video classification methods, however, only frame-level descriptors are used independently, so the motion trajectory is ignored. Employing the motion trajectory can therefore lead to a more accurate classification.

Visual features of video can generally be classified into static descriptors extracted from key frames, descriptors extracted from video objects, and motion descriptors [1]. Static descriptors of key frames include color-based, texture-based, and shape-based descriptors [2, 3]. These static descriptors only describe the visual appearance of the video and are weak at describing other aspects such as objects and motion. Researchers use various methods to extract objects from video; for example, Visser [4] uses the Kalman filter and Zhang [5] uses spatio-temporal independent component analysis (stICA) and multi-scale analysis. Motion descriptors of video are used in [6, 7, 8, 9], each exploiting different information: [6] uses motion vectors embedded in the MPEG bitstream as a video descriptor, while [7] extracts a motion descriptor from the motion vector field.
Also, [8] extracts a video motion descriptor based on local and global motion information, and [9] exploits the spatio-temporal distribution within a shot for video indexing and retrieval. Besides motion descriptors, [10] combines static descriptors and the SIFT descriptor to generate a new descriptor. However, using many features for video retrieval is time-consuming and can be inapplicable in time-sensitive tools.

One of the most important descriptors in content-based video classification is the SIFT descriptor, which is scale invariant [11]. Image and video classification using the SIFT descriptor achieves high accuracy. However, the SIFT descriptor is applied to independent frames and thus ignores the motion in videos. In [12], a descriptor called local invariant feature tracks (LIFT) is presented, which tracks the SIFT descriptor across the consecutive frames of each shot. It considers the dynamism of video and consequently leads to better results. However, to equalize the length of its descriptor vectors, LIFT uses complicated and time-consuming calculations that are not appropriate for online video classification.

In this work, a descriptor similar to LIFT is extracted using the SIFT descriptor: SIFT keypoints are tracked across consecutive frames and the final descriptors are obtained by making these tracks equal in length. The proposed descriptor is as accurate as LIFT, while it uses a very simple method to equalize the length of the descriptor vector, which reduces the time complexity.

2. NOTATIONS AND FORMULATIONS

Before explaining the proposed descriptor extraction algorithm, the notations and formulations used in the rest of the paper are described in this section. A brief description of the notations is listed in Table I.

TABLE I: NOTATIONS USED IN IMPLEMENTATION.
  F_i          : the i-th sampled frame
  (X_ij, Y_ij) : coordinates of the j-th keypoint in the i-th frame
  S_ij         : SIFT descriptor of the j-th keypoint in the i-th frame
  σ            : half of the square spatial window side
  A_xk         : transformation matrix to the X curve of the k-th track
  A_yk         : transformation matrix to the Y curve of the k-th track
  X_k          : matrix containing the coefficients of the X curve
  Y_k          : matrix containing the coefficients of the Y curve
  Z            : matrix containing the indices of the points

Each video contains a number of shots and every shot consists of frames that are presented within a short time interval. In this paper, F_i is the i-th frame, selected out of every 25 successive frames. Each F_i has many keypoints whose coordinates are denoted by (X_ij, Y_ij). The SIFT descriptor of each point is denoted by S_ij. Several tracks are extracted for each shot. To extract a track, a set of consecutive points is found, where σ is half of the side of the square spatial window used to search for the points. As shown in (1), the A_xk matrix is built from the X coordinates of the points of the k-th track; the A_yk matrix is built in the same way from the Y coordinates. The A_xk and A_yk matrices map the k-th track to a twentieth-degree polynomial curve, where X_k and Y_k contain the coefficients of the curve:

    A_xk = [ 1         1         ...  1
             x_1k      x_2k      ...  x_nk
             ...       ...            ...
             x_1k^20   x_2k^20   ...  x_nk^20 ]    (1)
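As a concrete illustration of these notations, the following numpy sketch builds the power matrix of (1) for one track and solves, in the least-squares sense, for the coefficient vectors X_k and Y_k that appear later in Eqs. (6) and (7). The paper's implementation is in C; this sketch, its function and variable names, and the sample track values are assumptions for illustration only. The text leaves the exact orientation of the fit somewhat open, so the sketch simply solves the linear system implied by (1) together with (6)/(7), relating each coordinate sequence to the point-index vector Z.

```python
import numpy as np

def curve_coefficients(coords, degree=20):
    """Least-squares fit for one coordinate sequence of a track, following
    one reading of Eqs. (1) and (6)/(7): build the power matrix from the
    coordinates and solve for the coefficients that map it to the
    point-index vector Z."""
    coords = np.asarray(coords, dtype=float)
    z = np.arange(len(coords), dtype=float)          # index of each tracked point (Z)
    # One row [1, c, c^2, ..., c^degree] per tracked point (A_xk or A_yk, transposed).
    A = np.vander(coords, N=degree + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(A, z, rcond=None)     # minimum-norm least-squares solution
    return coef

# Hypothetical track: x and y coordinates of one keypoint over five sampled frames.
x_k = curve_coefficients([12.0, 14.5, 17.0, 21.5, 26.0])   # coefficients of the X curve
y_k = curve_coefficients([40.0, 41.0, 43.5, 44.0, 46.0])   # coefficients of the Y curve
```

Note that a full degree-20 fit yields 21 coefficients per coordinate; the 168-element descriptor reported in Section 3.3.2 is consistent with keeping 20 coefficients per coordinate alongside the 128-element averaged SIFT descriptor.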
3. PROPOSED ROBUST VIDEO DESCRIPTOR

In this section, the proposed robust descriptor for video classification is described. The descriptor is extracted in order to classify shots. As shown in Figure 1, some frames are sampled from each shot. Then, the keypoints of these sampled frames are extracted and a SIFT descriptor is computed for each keypoint. Among the points in the neighborhood of each keypoint in the next frame, the point with the most similar SIFT descriptor is selected. By continuing this procedure, a sequence of points with similar locations and descriptors is generated. These tracks of points have different lengths; thus, by mapping each track to a polynomial curve of constant degree and saving the curve coefficients, a vector of constant length is formed. This vector, together with the average of the SIFT descriptors, forms the semi-final descriptor. Among the extracted descriptors, only some are selected to represent the others by using the bag-of-words method. These are the final descriptors used in the video classification stage. In the following, the descriptor extraction and shot classification stages are described in detail.

3.1 Frame Sampling

In 25 frames-per-second videos, assuming that the probability of the video content changing in less than a second is very low, only one frame out of every 25 is selected as a representative. Thus, as shown in (2), a shot is represented by a set of n consecutive sampled frames:

    Sh = {F_1, F_2, ..., F_j, ..., F_n}    (2)

Figure 1. Proposed robust video descriptor.

3.2 SIFT Descriptor

To extract the SIFT descriptors, keypoints are first extracted from each frame. To find keypoints, among all pixels in the frame, those that are invariant toward scale and rotation changes across all scales are considered. As shown in Figure 2, for each keypoint located in the middle of the small blocks, the surrounding pixels are divided into 4 parts. Then, using a Gaussian weighting function, shown by a circle in the figure, a weight is assigned to each gradient vector of these 4 parts. Finally, a histogram with 8 orientation bins is formed for the vectors in each part [13]. After SIFT extraction for each sampled frame F_i, a set of keypoints is obtained, where each keypoint has a location (X_ij, Y_ij) and a SIFT descriptor S_ij.

Figure 2. SIFT descriptor extraction [13].

3.3 Robust Video Descriptor

The details of each stage of the proposed descriptor extraction are given next.

3.3.1 Motion Estimation

To extract a track, for each keypoint in a frame, a similar keypoint in the next frame is found. A similar keypoint satisfies the following conditions:

    |X_ij - X_(i+1)j| ≤ σ    (3)
    |Y_ij - Y_(i+1)j| ≤ σ    (4)
    |S_ij - S_(i+1)j| = min_k |S_ij - S_(i+1)k|    (5)

Equations (3) and (4) state that a similar keypoint should be located in a square spatial window of side 2σ, and equation (5) states that this keypoint has the most similar SIFT descriptor among the keypoints in that neighborhood. If such a point is found, it is added to the track. This search continues until the last point of the track cannot find a similar point in the next frame. For each track, the average of the SIFT descriptors of its points is also saved. Since there are many keypoints in each frame, and the last point of each track can be located anywhere in the following frames, many tracks with different lengths are generated.
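The per-frame SIFT extraction of Section 3.2 and the greedy window-and-descriptor matching of Eqs. (3)-(5) can be sketched as follows. This is an illustration only (the paper's implementation is in C); it uses OpenCV's SIFT, and all function and variable names, as well as the track bookkeeping, are assumptions rather than the authors' code.

```python
import cv2
import numpy as np

SIGMA = 10  # half side of the square search window in pixels (the value chosen in Section 4)

def extract_keypoints(frame):
    """Detect SIFT keypoints and their 128-D descriptors in one sampled frame."""
    sift = cv2.SIFT_create()
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    if descriptors is None:  # frame with no keypoints
        return np.zeros((0, 2), np.float32), np.zeros((0, 128), np.float32)
    pts = np.array([kp.pt for kp in keypoints], dtype=np.float32)  # (x, y) per keypoint
    return pts, descriptors

def find_similar(point, desc, next_pts, next_descs, sigma=SIGMA):
    """Eqs. (3)-(5): among next-frame keypoints inside a (2*sigma)-sided window
    around `point`, return the index of the one with the most similar SIFT
    descriptor, or None if the window is empty."""
    if len(next_pts) == 0:
        return None
    inside = (np.abs(next_pts[:, 0] - point[0]) <= sigma) & \
             (np.abs(next_pts[:, 1] - point[1]) <= sigma)
    candidates = np.flatnonzero(inside)
    if candidates.size == 0:
        return None
    dists = np.linalg.norm(next_descs[candidates] - desc, axis=1)
    return int(candidates[np.argmin(dists)])

def build_tracks(frames):
    """Greedily chain keypoints across consecutive sampled frames; each track is a
    list of ((x, y), descriptor) tuples and ends when no similar keypoint exists.
    For brevity this starts a candidate track at every keypoint and does not mark
    points already covered by an earlier track."""
    per_frame = [extract_keypoints(f) for f in frames]
    tracks = []
    for i, (pts, descs) in enumerate(per_frame[:-1]):
        for p, d in zip(pts, descs):
            track = [(p, d)]
            j = i
            while j + 1 < len(per_frame):
                npts, ndescs = per_frame[j + 1]
                idx = find_similar(track[-1][0], track[-1][1], npts, ndescs)
                if idx is None:
                    break
                track.append((npts[idx], ndescs[idx]))
                j += 1
            if len(track) > 1:
                tracks.append(track)
    return tracks
```

The average of each track's descriptors, combined with the curve coefficients described next in Section 3.3.2, would then give the per-track vector.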
3.3.2 Curve Estimation

As shown in the previous subsection, the lengths of the tracks are different. Thus, in order to turn these tracks into feature vectors, a vector with a constant length should be extracted from each track. For each track, the sequences of X and Y coordinates along the time dimension are mapped to a curve, and each of these curves is then transformed into a twentieth-degree polynomial curve. The X_k and Y_k matrices that denote the coefficients of the curves are calculated as

    Z = X_k A_xk    (6)
    Z = Y_k A_yk    (7)

As described in Table I, Z is a matrix that contains the indices of the tracked keypoints, and A_xk and A_yk are the matrices that map X_k and Y_k to a polynomial curve. The coefficients of these two polynomial curves, together with the average of the SIFT descriptors of all points in a track, form a 168-element vector; these vectors constitute the semi-final descriptors. Therefore, with a very simple method and few calculations, a descriptor is extracted from the tracks of a shot that represents the motion features of the video well.

3.3.3 Bag-of-Words

A number of 168-element vectors are extracted for each shot. In order to classify a shot, a constant number of vectors should be selected to represent it. The bag-of-words method is used to choose a specific number of vectors among all input vectors. In this method, all the 168-element vectors are mapped to a 168-dimensional space and a clustering method is applied in that space. K-means clustering is used in this paper. It groups the vectors into K clusters and selects one vector from each cluster to represent that cluster. Thus, K 168-element vectors are selected to represent the shot, and for each shot a vector with the same number of elements is extracted as its descriptor.

3.4 Shot Classification

To classify shots, supervised classification is used. To do so, 10-fold cross-validation is employed, with a Support Vector Machine with an RBF kernel as the classifier.
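The bag-of-words selection of Section 3.3.3 and the shot classification of Section 3.4 could be sketched with scikit-learn as below. This is illustrative only and not the authors' C implementation; the value of K, the choice of concatenating the K representatives into one shot vector, and all names are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def shot_descriptor(track_vectors, k=32, seed=0):
    """Bag-of-words step: cluster a shot's 168-element track vectors into k groups
    and keep one representative per cluster (the member closest to the cluster
    centre).  k is an assumption; the paper only speaks of K clusters."""
    track_vectors = np.asarray(track_vectors, dtype=np.float64)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(track_vectors)
    reps = []
    for c in range(k):
        members = np.flatnonzero(km.labels_ == c)
        dists = np.linalg.norm(track_vectors[members] - km.cluster_centers_[c], axis=1)
        reps.append(track_vectors[members[np.argmin(dists)]])
    # One reading of the text: concatenate the K representatives into a single
    # fixed-length vector per shot.
    return np.concatenate(reps)

def evaluate(shot_features, labels):
    """Shot classification: RBF-kernel SVM scored with 10-fold cross-validation.
    Scoring is left at the default (accuracy); the paper reports per-label precision."""
    clf = SVC(kernel="rbf", gamma="scale")
    return cross_val_score(clf, shot_features, labels, cv=10).mean()
```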
4. EXPERIMENTAL RESULTS

In this section, the proposed descriptor is compared with SIFT in terms of precision and complexity. To evaluate the proposed method, the TRECVID 2006 dataset is used. The proposed method is implemented in the C programming language, and the program has been run on a personal laptop with an Intel Core 2 Duo processor at 2.40 GHz.

One of the most prevalent criteria for assessing video classification is precision. Thus, the criterion used to evaluate the proposed descriptor is the average video classification precision for each label. Figure 3 shows the effect of σ on the descriptor extraction precision. As σ increases, the number of adjacent keypoints rises and consequently the probability of finding them increases as well. Thus, as shown in Figure 3, growing σ increases the average classification precision. However, increasing σ beyond 10 pixels adds irrelevant keypoints to the tracks and the average precision decreases. Therefore, σ = 10 is chosen in the experimental setup.

Figure 4 shows the average classification precision of the proposed and SIFT descriptors for various contents. According to this figure, for videos with mid and high motion the precision of the proposed method is about 15 percent higher than that of SIFT; for videos with low motion (such as airplane and explosion) the precisions are the same; and for videos with no motion (such as building exterior, waterscape, and smoke) the precision of the proposed method is about 10 percent lower than SIFT.

The analysis of descriptor extraction execution time, performed on 2000 shots, shows that the SIFT extraction time is 200 milliseconds while that of the proposed descriptor is 215 milliseconds; this increase in execution time is negligible.

Figure 3. Effect of σ on average precision.

Figure 4. Average precision of proposed and SIFT descriptors.

5. CONCLUSION

In this paper, a robust motion-based descriptor built on SIFT is proposed. It is simple and fast, and it uses the motion trajectories in videos to improve the accuracy of the subsequent classification stage. The experimental results show that the proposed method is efficient for content-based video classification with negligible time overhead. In order to achieve effective classification for all video contents, SIFT can be selected as the descriptor for non-motion video contents; this is left as future work.

REFERENCES

[1] W. Hu, et al., "A survey on visual content-based video indexing and retrieval," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 41, no. 6, pp. 797-819, 2011.
[2] R. Yan and A. G. Hauptmann, "A review of text and image retrieval approaches for broadcast news video," Information Retrieval, vol. 10, no. 4-5, pp. 445-484, 2007.
[3] A. Amir, et al., "IBM research TRECVID-2003 video retrieval system," NIST TRECVID-2003, 2003.
[4] R. Visser, N. Sebe, and E. M. Bakker, "Object recognition for video retrieval," in Proc. Int. Conf. Image and Video Retrieval, London, U.K., Jul. 2002, pp. 262-270.
[5] X. P. Zhang and Z. Chen, "An automated video object extraction system based on spatiotemporal independent component analysis and multiscale segmentation," EURASIP Journal on Applied Signal Processing, vol. 2006, article 184, 2006.
[6] M.-S. Dao, F. G. B. DeNatale, and A. Massa, "Video retrieval using video object-trajectory and edge potential function," in Proc. Int. Symposium on Intelligent Multimedia, Video and Speech Processing, IEEE, 2004.
[7] C.-W. Su, et al., "Motion flow-based video retrieval," IEEE Transactions on Multimedia, vol. 9, no. 6, pp. 1193-1201, 2007.
[8] Y.-F. Ma and H.-J. Zhang, "Motion texture: a new motion based video representation," in Proc. 16th Int. Conf. on Pattern Recognition, vol. 2, IEEE, 2002.
[9] R. Fablet, P. Bouthemy, and P. Pérez, "Nonparametric motion characterization using causal probabilistic models for video indexing and retrieval," IEEE Transactions on Image Processing, vol. 11, no. 4, pp. 393-407, 2002.
[10] A. Basharat, Y. Zhai, and M. Shah, "Content based video matching using spatiotemporal volumes," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 360-377, 2008.
[11] D. G. Lowe, "Object recognition from local scale-invariant features," in Proc. Seventh IEEE Int. Conf. on Computer Vision, vol. 2, IEEE, 1999.
[12] V. Mezaris, A. Dimou, and I. Kompatsiaris, "Local invariant feature tracks for high-level video feature extraction," in Analysis, Retrieval and Delivery of Multimedia Content, Springer New York, 2013, pp. 165-180.
[13] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.