CONTENT BASED VIDEO RETRIEVAL BASED ON HDWT AND SPARSE REPRESENTATION

Video retrieval has recently attracted a lot of research attention due to the exponential growth of video datasets and the internet. Content based video retrieval (CBVR) systems are very useful for a wide range of applications with several types of data such as visual, audio and metadata. In this paper, we use only the visual information from the video. Shot boundary detection, key frame extraction, and video retrieval are the three important parts of CBVR systems. We modify existing methods and propose new ones for each of these three parts. Meanwhile, the local and global color, texture, and motion features of the video are extracted as features of key frames. To evaluate the applicability of the proposed technique against various methods, the P(1) metric and the CC_WEB_VIDEO dataset are used. The experimental results show that the proposed method provides better performance and less processing time compared to the other methods.


INTRODUCTION
Video data contains several types of information such as images, sounds, motions, and metadata. These characteristics make research in video processing difficult and time consuming. Video retrieval is used in numerous multimedia systems and applications, and it assists people in finding the videos, images and sounds related to their interests. In early video retrieval systems, videos were manually annotated using text descriptors. However, these systems have several shortcomings. For example, the concept of a video is more than a series of words, and manual indexing is a costly and difficult process. Due to the increasing number of video datasets as well as the mentioned shortcomings, text based video retrieval is known to be inefficient, while the demand for CBVR increases. The content based techniques use vision features for the interpretation of the videos (Lew et al., 2006).
CBVR systems are useful for a wide range of applications such as seeking an object in a video, digital museums, video surveillance, video tracing, and management of video datasets, as well as remote controlling, and education.
Video involves visual information, audio information and metadata (Chung et al., 2007). Visual information contains numerous frames and objects, and their feature vectors are extracted by content based methods. Audio information can be obtained by speech recognition methods, and the video can be indexed by using the extracted texts. Metadata contains the title, date, summary, producer, actors, file size, and so on. These types of data are often used for video retrieval. CBVR systems contain three important parts: 1) shot boundary detection, 2) key frame extraction, and 3) video retrieval. There is plenty of research in all of these areas, and novel methods have been well motivated in recent years (Weiming et al., 2011). In the following, we review the processes and recent developments in each area.

Shots are fundamental units of videos. A shot contains a consecutive sequence of frames captured by a camera where the frames have strong content correlations. The shot boundaries are used to organize the contents of videos for indexing and retrieving applications (Yuan et al., 2007). Transitions between the shots are classified into abrupt (cut) transitions and gradual transitions, which include fade in, fade out, and wipe. The detection of an abrupt transition is easier than the detection of a gradual one. In recent years, many methods have been proposed to detect abrupt and gradual transitions (Smeaton et al., 2010). The shot boundary detection methods usually have three main steps. In the first step, visual features of each frame are extracted by using special methods. These features include histogram, edge, color, texture, motion features, and scale invariant feature transform (SIFT) (Porter, 2004; Chang et al., 2008). Then, the similarity is measured between these extracted features. Many similarity measurements have been proposed by researchers, such as the Euclidean distance, the cosine dissimilarity, the earth mover's distance, and the histogram intersection (Hoi et al., 2006; Camara et al., 2007). The similarity is measured between consecutive frames or between a limited number of frames located in a window (Hoi et al., 2006). Finally, the shot boundaries between dissimilar frames are detected. The shot boundary detection methods can be classified into two approaches: 1) statistical learning-based approaches, such as support vector machine (SVM) (Matsumoto et al., 2006), Adaboost, k nearest neighbor (kNN), Hidden Markov Models (HMM), and clustering algorithms such as K-means and fuzzy K-means (Damnjanovic et al., 2007), and 2) threshold-based approaches, which detect the boundaries by comparing the measured pair-wise similarities between frames with a predefined threshold (Cernekova et al., 2006; Weiming et al., 2011).
After shot boundary detection, key frame extraction is the second important part of video retrieval systems. The frames within a shot contain great redundancies and similar contents. Therefore, the key frames should be extracted to summarize the video and succinctly represent the shot (Mukherjee et al., 2007). In the last decade, several features for key frame extraction have been proposed, such as colors (e.g. histogram), textures (e.g. discrete wavelet transform, discrete cosine coefficient), shapes, motions (e.g. motion vectors, image variations), and optical flow (Narasimha et al., 2003; Guironnet et al., 2007; Wang et al., 2007). Truong and Venkatesh (2007) have classified the key frame extraction methods into six categories: clustering-based (Yu et al., 2004), sequential comparison-based (Zhang et al., 2003), global comparison-based (Liu et al., 2004), reference frame-based (Ferman and Tekalp, 2003), object/event-based (Song and Fan, 2006) and curve simplification-based (Calic and Izquierdo, 2002).
Finally, the video retrieval part is applied to show the retrieval results. According to Weiming et al. (2011), six types of query have been proposed: 1) query by example, 2) query by sketch, 3) query by object, 4) query by keywords, 5) query by natural language, and 6) combined based query. In this paper, the query by example has been used. This query extracts low-level features from given example videos or images, and similar videos or key frames are found by measuring the feature similarities. The static features of key frames are suitable for query by example, as the key frames extracted from the example videos or exemplar images can be matched with the stored key frames (Weiming et al., 2011). The stored key frames complement video retrieval (Xiong et al., 2006) by making browsing of the retrieved videos faster, especially when the total size of the retrieved videos is large. The user can browse through the abstract representations to locate the desired videos. A detailed review of video browsing interfaces and applications can be found in Schoeffmann et al. (2010). There are two basic strategies to show the retrieval results. 1) Static video abstracts: each of which consists of a collection of the key frames extracted from the source video. 2) Dynamic video skims: each of which consists of a collection of video segments (and corresponding audio segments) that are extracted from the original video and then concatenated to form a video clip which is much shorter than the original video (Weiming et al., 2011).
In this paper, given the mentioned shortcomings of text and auditory features, we only use visual features in video indexing and retrieval. The content based image retrieval (CBIR) methods can be applied to the key frames to achieve CBVR, and the static key frame features are used for video retrieval (Yan and Hauptmann, 2007). The feature extraction method plays a critical role in CBIR and CBVR systems. The feature vector of each image should represent the content of the image accurately (Kekre and Thepade, 2009). Meanwhile, the size of the feature vector has to be smaller than the image size. This results in a shorter search time, a simpler search process, the retrieval of the same image as fast as possible, and a reduction of storage memory. The color-based (Yan and Hauptmann, 2007), the texture-based (Hauptmann et al., 2004), and the shape-based (Cooke et al., 2004) features are used for the CBIR and CBVR systems. In the following, we review some recent CBVR systems that have used CBIR methods. Many CBVR systems have used color features such as the global color histogram and color moment features (Amir et al., 2003), the local color histogram and color moment features obtained by splitting the image into 5×5 blocks (Yan and Hauptmann, 2007), and color correlograms (Adcock et al., 2004). Low computational complexity and simple, accurate extraction are the advantages of the color features; in contrast, their limitation lies in describing the texture of the images. There are many texture features such as co-occurrence texture and Tamura features (Amir et al., 2003), global and local Gabor wavelet filters (Hauptmann et al., 2004), and wavelet transformation. The advantages of the texture features are their independence from color and intensity, and the extraction of the intrinsic visual features as well as their correlations with the surrounding environment. In contrast, the limitation of texture features is that they are unavailable in non-texture video.
The feature database of key frames is constructed by extracting one of the mentioned features. On receiving a query, the same feature extraction method is applied to the query. Then, one of the mentioned similarity measures is calculated, and the retrieval results are shown according to the query (Snoek et al., 2007). This paper is organized as follows. In the following section, the proposed video retrieval method is described step by step. The shot boundary detection, key frame extraction and video retrieval via sparse representation and Hadamard discrete wavelet transform (HDWT) are explained. In the next section, the evaluation measures, dataset and indexing results are explained in detail. Finally, the conclusions are drawn.

MATERIAL AND METHODS
In this section, we explain the video retrieval framework and its steps within the following subsections. The flowchart of the proposed video retrieval method and its steps is shown in Fig. 1. First, every video is converted into frames. Second, the shot boundaries of the frames are detected. Third, the key frames of the shots are extracted. Finally, the accuracy of the proposed video retrieval system is obtained by using a query by example.

CONVERSION OF VIDEO
In this step, the videos are converted into frames and saved in a folder. This pre-processing reduces complexity and increases the speed of the proposed method in the subsequent steps (Weiming et al., 2011).

SHOT BOUNDARY DETECTION
Shots are the basic unit of every video, where a sequence of successive frames creates a video shot. All the frames in each shot usually have similar visual features such as color, texture and motion. Videos have two basic types of transitions between the shots: cut transitions and gradual transitions. The process of identifying cut and gradual transitions is called shot boundary detection. For a cut transition, the dissimilarity between the last frame belonging to the current shot and the first frame of the next shot is significant. Therefore, the cut transition appears immediately between the current and next shots. On the other hand, a gradual transition involves fade in, fade out, erase, object motions, camera operations and other effects. Therefore, the neighboring frames in the current and the next shot have extra visual similarities, and gradual transition detection is more complex and confusing than cut transition detection. Methods of gradual transition detection should distinguish the diversity of the mentioned effects (Cotsaces et al., 2006).
In recent years, various algorithms for shot boundary detection have been proposed, based on joint entropy, edge information, characteristics of a gradual transition, a linear transition detection (LTD) algorithm, and singular value decomposition (SVD) (Grana and Cucchiara, 2007; Cernekova et al., 2007; Lu and Shi, 2013).
In the proposed shot boundary detection method, as seen in Fig. 2, we have adopted and modified the method used by Lu and Shi (2013).The proposed method is explained as follows:

STEP1: CANDIDATE SEGMENT SELECTION (CSS)
A video consists of many boundary and non-boundary frames. The main purpose of CSS is to decrease the computational complexity in the subsequent steps by removing the non-boundary frames. We divide all the video frames into segments with a length of 21 frames, and calculate the Euclidean distance between the pixel intensities of the first and the last frames in each segment. The intensity feature of each pixel is used because it is a common feature in video frames and is simple to calculate. Every calculated distance is compared to an adaptive threshold (Lu and Shi, 2013). If it is greater than the adaptive threshold, the segment is classified as a candidate segment; otherwise, that segment is removed. In the proposed method, the second condition reported by Lu and Shi (2013) is not used because it is always satisfied in our dataset.
Otherwise, it is considered that a gradual transition may exist in the segment.
This candidate segment selection method eliminates about half of the frames in a video and many non-boundary frames. In this way, many spurious shot boundaries are not considered, and the rest of the shot boundaries are used in the subsequent steps. The candidate segments with 6 frames are used to detect candidate cut transitions (CT) and candidate gradual transitions (GT), which are introduced in steps 2 and 3, respectively. The detection methods concentrate on the reduction of the required time.
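The selection rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: frames are assumed to be flat lists of pixel intensities, segments are assumed to be non-overlapping, trailing frames that do not fill a segment are ignored, and the fixed `threshold` argument is a hypothetical stand-in for the adaptive threshold of Lu and Shi (2013), which is not specified in this text.

```python
import math


def frame_distance(frame_a, frame_b):
    """Euclidean distance between two frames given as flat lists of pixel intensities."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)))


def select_candidate_segments(frames, threshold, segment_length=21):
    """Return (start, end) index pairs of segments kept as candidates.

    A segment is kept when the distance between its first and last frames
    exceeds `threshold`. The fixed threshold is an assumption standing in
    for the adaptive threshold used in the paper.
    """
    candidates = []
    for start in range(0, len(frames) - segment_length + 1, segment_length):
        end = start + segment_length - 1
        if frame_distance(frames[start], frames[end]) > threshold:
            candidates.append((start, end))
    return candidates
```

Segments whose first and last frames are nearly identical are dropped before any histogram or SVD computation, which is where the reduction in computation comes from.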

STEP 2: CUT TRANSITION DETECTION
In this step, the CT detection method is introduced. Features of each candidate CT segment are extracted by using the normalized hue-saturation-value (HSV) color histograms and adopted as frame features. The color feature is one of the most common and determinant features used in image and video retrieval systems, and it is stable against direction variations, the size of the image, and background complexity (Montagna and Finlayson, 2012). The histogram accurately captures global features of frames and reduces the computational cost (Gargi et al., 2000; Lu and Tan, 2005). According to recent research, the HSV color space leads to an acceptable performance for the color histogram feature (Lu and Tan, 2005). The values of H, S and V are in the intervals [0, 180], [0, 255] and [0, 255], respectively. The 3D color histogram is obtained by the quantization of the H, S and V components into 18, 12, and 8 bins, respectively. Therefore, the dimension of the feature for every frame is 18 × 12 × 8 = 1728 bins, and the extracted features of a candidate CT segment construct a matrix, called A, with a size of 1728×6. The number of columns indicates the number of consecutive frames in the candidate CT segment (6 in this paper), and the number of rows indicates the size of the feature vectors.
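The quantization into 18 × 12 × 8 = 1728 bins can be illustrated with a short sketch. It assumes uniform bin widths and pixels given as (H, S, V) tuples in the ranges stated above; the function names are hypothetical and the exact binning used by the authors may differ.

```python
def hsv_bin_index(h, s, v, bins=(18, 12, 8)):
    """Map an HSV pixel (H in [0, 180], S and V in [0, 255]) to a flat bin index.

    Uniform quantization is assumed. The index runs from 0 to 1727 for the
    default 18 x 12 x 8 layout.
    """
    h_bins, s_bins, v_bins = bins
    h_idx = min(int(h * h_bins / 181), h_bins - 1)
    s_idx = min(int(s * s_bins / 256), s_bins - 1)
    v_idx = min(int(v * v_bins / 256), v_bins - 1)
    return (h_idx * s_bins + s_idx) * v_bins + v_idx


def hsv_histogram(pixels, bins=(18, 12, 8)):
    """Normalized 3D HSV histogram flattened to a list; entries sum to 1 for non-empty input."""
    hist = [0.0] * (bins[0] * bins[1] * bins[2])
    for h, s, v in pixels:
        hist[hsv_bin_index(h, s, v, bins)] += 1.0
    total = len(pixels)
    return [count / total for count in hist] if total else hist
```

Stacking the histograms of the six frames of a candidate segment as columns gives a matrix of the size 1728×6 described in the text.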
Next, SVD is applied to matrix A. The SVD returns the matrices U, S and V, where S is a diagonal matrix of singular values with the same dimension as A and nonnegative diagonal elements in decreasing order, and U and V are unitary matrices such that $A = U S V^{T}$. The 1728-dimensional feature vector of each frame is reduced and mapped to 6 dimensions by Eq. 1, which yields a matrix β with a size of 6×6. We then use the cosine distance to calculate the similarity between every two consecutive frames $f_t$ and $f_{t-1}$. We choose the cosine distance because its computational cost is lower than that of other methods, which require a normalization operation. The cosine distance is quite small even for two frames having many differences. The range of the cosine distance falls in the interval 0 to 1, which is suitable to show the similarity between two frames. The cosine distance is obtained by:
\[ \Phi(f_t, f_{t-1}) = \frac{\beta_t \cdot \beta_{t-1}}{\lVert \beta_t \rVert \, \lVert \beta_{t-1} \rVert} \tag{2} \]
where β is calculated by Eq. 1 and $\beta_t$ denotes its column for frame $f_t$. Meanwhile, we obtain the distance between the first and last frames in a segment with a length of 6, which is named $G = \Phi(f_0, f_5)$.
A cut transition in the t th frame will be detected if the following two criteria are satisfied.
G < 0.95, (3) where t = 1, ..., 5 and p is 0.48 (Lu and Shi, 2013). The segment is removed from the candidate segments if the first criterion cannot be satisfied. If the first criterion in Eq. 3 is satisfied and the second criterion in Eq. 4 is not, a GT detection is required, because during a GT the similarity between two consecutive frames is always much higher. In this case, the segment with a length of 6 frames is treated as a candidate GT segment with a length of 11 frames, because a GT segment is considered to be longer than a CT segment.
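The similarity computation and the first criterion (Eq. 3) can be sketched as below. The sketch assumes that each frame of the candidate segment is already represented by a reduced feature vector (one column of β); the second criterion (Eq. 4) is not reproduced here because its form is not given in this text. Function names are illustrative only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def segment_similarities(features):
    """Return (G, consecutive) for a list of per-frame feature vectors.

    G compares the first and last frames of the segment; `consecutive`
    holds the similarity of every frame t >= 1 with frame t - 1.
    """
    g = cosine_similarity(features[0], features[-1])
    consecutive = [
        cosine_similarity(features[t], features[t - 1])
        for t in range(1, len(features))
    ]
    return g, consecutive


def passes_first_cut_criterion(features, g_threshold=0.95):
    """First criterion only: the segment stays a candidate when G < 0.95."""
    g, _ = segment_similarities(features)
    return g < g_threshold
```

For a segment whose first three frames share one feature vector and whose last three share an orthogonal one, G is 0 and the only low consecutive similarity sits at the position of the cut.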

STEP 3: GRADUAL TRANSITION DETECTION
In this step, we use a novel method to detect the GT. The proposed GT detection method modifies the CT detection method mentioned in the previous step. This method is explained as follows:
a) The candidate GT segments which are extracted from step 1 and added from step 2 contain 11 or typically 21 frames. So that the two frames from different shots differ clearly, we add one frame before and one frame after the candidate GT segment.
b) The HSV 3D histogram is calculated with 1728 bins for every frame in each GT candidate segment. Thus, the size of the extracted feature matrix is 1728×13 or 1728×23.
c) The SVD is applied to the feature matrix to reduce its size to 10×13 or 10×23. The 1728 bins are reduced to 10 bins, because a larger number of bins for long segments would be more sensitive and would introduce more noise (Cernekova et al., 2007).
d) The distance between the first and last frames in the segment is calculated by using $G = \Phi(f_0, f_{N-1})$, and the method goes to the next sub step (sub step e) if G < 0.9. Otherwise, this segment is discarded. The GT segment has a higher difference than the CT segment because more frames exist within the GT segment; thus 0.9 is experimentally chosen as the threshold.
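e) The absolute distance difference, $d(t) = |\Phi(f_t, f_s) - \Phi(f_t, f_e)|$ (t = 0, …, 10 or 20), is calculated, where $f_s$ and $f_e$ stand for the last frame of the previous shot and the first frame of the next shot, respectively. If the criterion max(d(t)) − min(d(t)) > 0.33 is satisfied, go to the next sub step (sub step f). Otherwise, this segment is discarded and the procedure returns to the first sub step (sub step a).
f) The criterion $|(t_m - (N + 1)/2)/N| \le 0.25$ is checked, where $t_m$ is the point with the minimum value of the absolute distance difference, d, and N = 11 or 21. If it is satisfied, the criterion $(K_L + K_R)/N \le 0.3$ is checked, where $K_L$ and $K_R$ are the number of ascending points before $t_m$ and the number of descending points after $t_m$, respectively. If it is satisfied, the candidate GT segment is considered as a GT segment.
In the proposed scheme, the candidate segment selection and SVD are employed, so that the proposed algorithm detects both cut and gradual transitions with low computational complexity, as shown in the experimental results. The experimental results on the video dataset show that the proposed scheme can provide high accuracy and speed in detecting both abrupt and gradual transitions.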
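The three numerical checks of sub steps e) and f) (a range above 0.33, a minimum located near the segment centre, and few points that break the V-shaped pattern) can be sketched as a single predicate. The sketch takes the sequence d(t) as given and makes two assumptions that are not confirmed by the text: t is counted from 0, and an "ascending point before t_m" (respectively "descending point after t_m") is a position where d increases (respectively decreases) relative to its predecessor.

```python
def is_gradual_transition(d, range_threshold=0.33, center_threshold=0.25,
                          monotonic_threshold=0.3):
    """Apply the three GT criteria to the absolute distance differences d(0..N-1)."""
    n = len(d)
    # Sub step e): the curve must vary enough over the segment.
    if max(d) - min(d) <= range_threshold:
        return False
    # Sub step f), first check: the minimum must lie close to the centre.
    t_m = d.index(min(d))
    if abs((t_m - (n + 1) / 2) / n) > center_threshold:
        return False
    # Sub step f), second check: count points that break the expected
    # "descend to the minimum, then ascend" shape (interpretation assumed).
    k_left = sum(1 for t in range(1, t_m + 1) if d[t] > d[t - 1])
    k_right = sum(1 for t in range(t_m + 1, n) if d[t] < d[t - 1])
    return (k_left + k_right) / n <= monotonic_threshold
```

A perfectly V-shaped sequence of 11 values passes all three checks, while a flat sequence fails the first one and a sequence whose minimum sits at the first frame fails the second.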

KEY FRAME EXTRACTION
The shot boundaries of the video are detected by the algorithm described in the shot boundary detection section. In this section, a key frame extraction method is described. These frames are extracted by using an unsupervised clustering method. The color features of the videos have been used in the shot boundary detection step, but the video motion has not been considered as a feature. Motion is a special feature of a video that distinguishes a video from an image. Video motion is classified into foreground motion and background motion, which are created by a moving object and a moving camera, respectively. The camera movements include tilting up or down, zooming in or out and panning left or right. Due to camera movements, moving objects and lighting changes, consecutive frames within the same shot differ in visual content. These differences are obtained and extracted by using motion compensation methods (Wolf, 1996). In this paper, the motion compensation procedure is employed by using a block-matching method. The key frame extraction method is shown in Fig. 3 and performed as follows.
In the key frame extraction method, the YUV color space is used instead of the RGB color space because it provides better results in our experiment. Therefore, first, the RGB images are converted to YUV images. Second, each frame is divided into non-overlapping 4×4 blocks by using the motion feature. The average value of each 4×4 block is obtained and saved as a pixel of a new image (frame); thus the size of the original image is reduced. This process is performed for each of the Y, U and V planes of the frames. The average of distances (AD) between the new Y, U, and V planes of consecutive frames is obtained by Eq. 5. Third, we cluster the video frames by using the AD values which are obtained for all consecutive frames within the shot. The cluster boundary is obtained by comparing the normalized AD values to a threshold T, where T is set to 0.95. This value provided the best performance in our experiments.
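The block averaging and the threshold-based clustering can be sketched as follows. Since the exact form of Eq. 5 is not available in this text, the sketch assumes that AD is the mean absolute difference between block-averaged planes, averaged over Y, U and V. It further assumes that AD values are normalized by their maximum within the shot and that a cluster boundary is placed wherever the normalized value exceeds T. All function names are hypothetical.

```python
def block_average(plane, block=4):
    """Replace each non-overlapping block x block tile of a 2D plane by its mean."""
    rows, cols = len(plane) // block, len(plane[0]) // block
    return [
        [
            sum(plane[r * block + i][c * block + j]
                for i in range(block) for j in range(block)) / (block * block)
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def average_distance(frame_a, frame_b):
    """Assumed AD: mean absolute difference of reduced planes, averaged over Y, U, V.

    Each frame is a dict mapping "Y", "U", "V" to a 2D list of values.
    """
    total = 0.0
    for name in ("Y", "U", "V"):
        a = block_average(frame_a[name])
        b = block_average(frame_b[name])
        diffs = [abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
        total += sum(diffs) / len(diffs)
    return total / 3


def cluster_boundaries(ad_values, threshold=0.95):
    """Indices i where the max-normalized AD between frames i and i + 1 exceeds T."""
    peak = max(ad_values) if ad_values else 0.0
    if peak == 0:
        return []
    return [i for i, ad in enumerate(ad_values) if ad / peak > threshold]
```

Under these assumptions a lower T marks more positions as boundaries, which matches the statement that a low threshold increases the number of clusters and key frames.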
Clustering algorithms use a threshold parameter to control the density of the clustering. A low value for the threshold T increases the number of clusters and key frames, so the threshold should be adjusted to be suitable for various videos. After clustering, we use the histogram feature to extract a key frame within a cluster. The histogram counts the number of pixels within each of the defined bins, which reduces the complexity and calculation time. The histogram with 64 bins of the Y plane of the prior image (frame) is considered as a feature of the frame. 64 bins are usually adequate for the histogram feature to give accurate results (Shuping and Xinggang, 2005). The histogram features of all frames within a cluster are extracted. Then, the distances between the histogram feature of each frame and the average of the histogram features of the frames within the cluster are obtained. The frame with the minimum distance is considered as the key frame because it is closest to the cluster centroid.
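The selection of one key frame per cluster can be sketched as below. The sketch assumes 8-bit Y values, 64 uniform bins (four gray levels per bin) and the Euclidean distance to the cluster's mean histogram; the distance measure is not named in the text, so that choice is an assumption.

```python
import math


def histogram_64(y_plane):
    """64-bin histogram of an 8-bit luminance plane given as a 2D list."""
    hist = [0] * 64
    for row in y_plane:
        for value in row:
            hist[min(int(value) // 4, 63)] += 1
    return hist


def select_key_frame(cluster_y_planes):
    """Index of the frame whose histogram is closest to the cluster's mean histogram."""
    hists = [histogram_64(plane) for plane in cluster_y_planes]
    count = len(hists)
    centroid = [sum(h[b] for h in hists) / count for b in range(64)]
    distances = [
        math.sqrt(sum((h[b] - centroid[b]) ** 2 for b in range(64)))
        for h in hists
    ]
    return distances.index(min(distances))
```

In a cluster that contains a dark frame, a bright frame and a frame that is half dark and half bright, the mixed frame is returned because its histogram coincides with the centroid.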

RETRIEVAL ALGORITHM
In this subsection, the proposed retrieval methods are described. As mentioned above, we use CBIR methods to achieve the CBVR system. The key frames of a video are extracted by the method of the previous subsection and stored in a folder. The main purpose of the proposed video retrieval system is to retrieve relevant key frames. The S and I planes of the HSI color space are used to extract texture features of the frames. The discrete wavelet transform (DWT) is used as the texture feature in the proposed method (Mohamadzadeh and Farsi, 2014). We use the approximation components in the proposed method because the wavelet transform analyses the signal at various frequency bands. The low-low frequency component provides a coarse scale approximation of the image, while the other frequency components fill in the detail and extract edges. In the previous steps, we have modified existing algorithms and proposed new ones for the shot boundary detection and the key frame extraction. In the following, we propose and compare the retrieval methods via sparse representation and Intensity-HDWT (Farsi and Mohamadzadeh, 2013).

The sparse representation
Most coefficients of the DWT are small when we compute a wavelet transform of a typical natural image. Hence, we can obtain an accurate approximation of the image by setting the small coefficients to zero, or thresholding the coefficients, to obtain a sparse representation (Mohamadzadeh and Farsi, 2014). In this paper, the DWT is applied to the S plane of the HSI color space and then to the approximation component of each DWT output, and this process is repeated five times. The feature extracted by this procedure is called the Iterative DWT (IDWT) feature.
We review some fundamentals of sparse representations and then explain our proposed method for the video retrieval application using sparse representation. The concatenation of two vectors is . Given a signal vector $b \in \mathbb{R}^m$, signal (or atomic) decomposition is the linear combination of n basic atoms $a_i \in \mathbb{R}^m$ ($1 \le i \le n$) which constructs the signal vector b as:
\[ b = \sum_{i=1}^{n} x_i a_i = A x \tag{6} \]
where $A = [a_1, a_2, \ldots, a_n]$ and $x = (x_1, x_2, \ldots, x_n)^T$. The dictionary A comprises n signals $[a_1, a_2, \ldots, a_n]$ called atoms. In the Discrete Fourier Transform (DFT) or the classical signal decomposition, the number of atoms (n) is equal to the length of the signals (m), and a unique solution exists for this problem. However, when these two parameters are not equal, or in other words when n > m, the decomposition is not unique. Sparse decomposition seeks a solution in which as few atoms as possible contribute to the decomposition. This is equivalent to seeking the sparsest solution of the underdetermined system of linear equations b = Ax. We seek the sparsest solution of this system by solving the optimization problem of Eq. 7 (Elad, 2012). In recent years, several algorithms have been reported to solve Eq. 7, such as Smoothed L0 (SL0), the Dual Augmented Lagrangian Method (DALM), the Primal Augmented Lagrangian Method (PALM) and the Homotopy method (Elad, 2012; Yang et al., 2012). In this paper, we use the DALM, SL0, Homotopy, and PALM algorithms to solve Eq. 7 because these algorithms provide better performance and lower processing time than other algorithms (Yang et al., 2012). We use these algorithms to investigate the sparse representation and to assess its usefulness in the video retrieval application. Therefore, we apply the following algorithm via sparse representation to achieve the desired video retrieval.
1. In the video retrieval literature, we construct the dictionary A by using sufficient training samples of the i-th image, where $\nu_{i,j}$ represents the j-th feature of the i-th image extracted by applying the IDWT method to the image.
2. Extract the feature vector of the query image by applying the IDWT method, $b \in \mathbb{R}^m$.
3. Seek the sparse representation, $x_0 \in \mathbb{R}^n$, by solving Eq. 7 using the DALM, SL0, Homotopy, and PALM techniques. Therefore, the elements of $x_0$ are zero except those associated with the k-th class.
6. Compute the Euclidean distance (ED) between the feature vector of the query image (b) and $C_k$.
7. Compute the weighting of the elements of $x_0$ by considering the ED and Eq. 10, which gives $x_{weighted}$.
8. Finally, the most relevant key frames are retrieved by using the sorted elements of $x_{weighted}$.
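To make the idea of a sparse decomposition b = Ax concrete, the following sketch uses plain matching pursuit, a simple greedy method. It is a swapped-in illustration only: it is not DALM, SL0, Homotopy or PALM, and it does not reproduce the weighting steps of the algorithm above. Atoms are assumed to have unit norm, and all names are hypothetical.

```python
import math


def matching_pursuit(atoms, b, max_atoms=3, tol=1e-9):
    """Greedy sparse coding of b over unit-norm atoms.

    atoms: list of n atoms, each a list of length m.
    Returns a coefficient list x of length n with few non-zero entries.
    """
    x = [0.0] * len(atoms)
    residual = list(b)
    for _ in range(max_atoms):
        # Correlation of every atom with what is still unexplained.
        correlations = [
            sum(a_k * r_k for a_k, r_k in zip(atom, residual)) for atom in atoms
        ]
        best = max(range(len(atoms)), key=lambda i: abs(correlations[i]))
        if abs(correlations[best]) < tol:
            break
        # Add the best atom's contribution and remove it from the residual.
        x[best] += correlations[best]
        residual = [r - correlations[best] * a for r, a in zip(residual, atoms[best])]
    return x


if __name__ == "__main__":
    s = 1 / math.sqrt(2)
    dictionary = [[1.0, 0.0], [0.0, 1.0], [s, s]]  # n = 3 atoms, m = 2
    print(matching_pursuit(dictionary, [3.0, 0.0]))
```

With more atoms than signal dimensions (n > m), many coefficient vectors reproduce b; the greedy search returns one that uses a single atom for this query, which is the behaviour the retrieval step relies on when the non-zero coefficients concentrate on the matching class.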

Intensity-HDWT method
In this paper, we use the Hadamard matrix and Discrete Wavelet Transform (HDWT) method to achieve the CBVR (Farsi and Mohamadzadeh, 2013). The Intensity plane provides an acceptable performance in the proposed method, and the size of its feature vector is satisfactory. The features of the key frames and of a query are extracted by using the HDWT. Then, the Euclidean distance between the key frame features and the query feature is calculated, and the related key frames or videos are shown according to the user's request. The size of the feature vector of the Hue-Maximum-Minimum-Difference (HMMD) color space is three times bigger than that of the Intensity plane; therefore, we use the Intensity plane instead of the HMMD planes, because the size of the feature vector plays an important role in the proposed method. The HDWT method is briefly explained below (Farsi and Mohamadzadeh, 2013).
1. Apply the DWT on the Intensity plane with a size of N×N to generate the approximation (Low-Low), the horizontal (Low-High), the vertical (High-Low) and the diagonal (High-High) components.
2. Construct the modified approximation components by multiplying the actual approximation components and the Hadamard matrix with the size of the approximation component.The Hadamard matrices are the square matrices whose entries are either +1 or −1, and their rows are mutually orthogonal.
3. Construct the modified plane from step 2 by applying the inverse wavelet transform to the modified approximation components together with zeroed horizontal, vertical and diagonal components. The new image is used in the next level to construct the new approximation components.
4. Take alternate rows and columns by down-sampling the output of step 3 to a size of N/2×N/2. The down-sampling reduces the size of the feature vector, which is important for increasing the speed of the retrieval.
5. Construct the HDWT feature of level p by repeating steps 2 to 4 p times on each plane.
6. Use the approximation components of level p that result from step 2 as the HDWT feature of level p.
We generated the feature vectors of the dataset images by applying the HDWT at level 5 and stored the approximation component as the feature vector of each image.
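One possible reading of steps 1 to 6 is sketched below on a small square image. Several choices are assumptions rather than facts from the text: the Haar wavelet is used, the "multiplication" in step 2 is taken as a matrix product of the approximation with a Hadamard matrix, and the Hadamard matrix comes from the Sylvester construction, which only exists for power-of-two sizes (so the sketch does not cover 320×240 frames directly). For the orthonormal Haar wavelet, the inverse transform with zeroed detail components followed by taking alternate rows and columns equals the modified approximation divided by two, and the sketch uses that shortcut for steps 3 and 4.

```python
def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("this sketch only builds Hadamard matrices of power-of-two order")
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def haar_approximation(image):
    """Approximation (Low-Low) component of a one-level 2D orthonormal Haar DWT."""
    return [
        [
            (image[2 * r][2 * c] + image[2 * r][2 * c + 1]
             + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 2.0
            for c in range(len(image[0]) // 2)
        ]
        for r in range(len(image) // 2)
    ]


def matmul(a, b):
    """Plain matrix product of two 2D lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def hdwt_feature(image, levels):
    """Flattened level-`levels` feature of a square, power-of-two sized image."""
    if levels < 1:
        raise ValueError("levels must be at least 1")
    current = image
    modified = []
    for _ in range(levels):
        approx = haar_approximation(current)              # step 1
        modified = matmul(approx, hadamard(len(approx)))  # step 2 (assumed product)
        # Steps 3-4 shortcut: inverse Haar with zero details, then alternate
        # rows and columns, yields the modified approximation divided by 2.
        current = [[value / 2.0 for value in row] for row in modified]
    return [value for row in modified for value in row]
```

For an 8×8 image and two levels, the feature has 2×2 = 4 entries, which illustrates how each level halves both dimensions of the representation.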

EVALUATION MEASURES
In order to evaluate the performance of the proposed retrieval systems, we use two evaluation metrics. Farsi and Mohamadzadeh (2013) proposed a method using the combination of precision and recall criteria as performance measures for the CBIR and CBVR systems. The precision and recall criteria are given by Eq. 13 and Eq. 14, respectively.

\[ \text{Precision} = \frac{\text{Number of Relevant Images Retrieved}}{\text{Total Number of Images Retrieved}} \tag{13} \]

\[ \text{Recall} = \frac{\text{Number of Relevant Images Retrieved}}{\text{Total Number of Relevant Images in Database}} \tag{14} \]

According to Farsi and Mohamadzadeh (2013), P(1) has been adopted, which is the precision at 100% recall (i.e., the precision after retrieving all of the relevant documents). P(1) is the number of relevant images divided by the total number of images that are retrieved. This is the fraction of retrieved images that are relevant to the query image. We use this value because precision and recall are related to each other and are meaningless if taken separately.
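The two criteria and the P(1) value can be computed from a ranked result list as in the following sketch. It assumes that the ranked list contains no duplicates and that relevance is known as a set of item identifiers; the function names are illustrative.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved list against the set of relevant items."""
    relevant = set(relevant)
    hits = sum(1 for item in retrieved if item in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def precision_at_full_recall(ranked, relevant):
    """P(1): precision of the shortest ranked prefix that contains every relevant item.

    Returns 0.0 if the ranked list never reaches 100% recall.
    """
    relevant = set(relevant)
    found = 0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            found += 1
            if found == len(relevant):
                return found / rank
    return 0.0
```

For example, if three relevant key frames appear at ranks 1, 3 and 5, recall reaches 100% after five retrieved items, so P(1) = 3/5.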

DATASETS
In order to evaluate the proposed method in video retrieval, we have used a common dataset. A diverse dataset was chosen to better compare the methods. We used CC_WEB_VIDEO, the Near-Duplicate Web Video Dataset, to evaluate the methods (Wu et al., 2009). This dataset is collected from the top favorite videos from YouTube, Google Video, and Yahoo. In the dataset, videos are classified into 24 categories. The names of these categories are The lion sleeps tonight, Evolution of dance, Fold shirt, Cat massage, Ok go here it goes again, Urban ninja, Real life Simpsons, Free hugs, Where the hell is matt, U2 and green day, Little superstar, Napoleon dynamite dance, I will survive Jesus, Ronaldinho ping pong, White and nerdy, Korean karaoke, Panic at the disco I write, Bus uncle, Sony Bravia, Changes Tupac, Afternoon delight, Numa Gary, Shakira hips don't lie, and India driving. We have selected 24 videos to perform the video retrieval and to compare the methods (Wu et al., 2009). These videos consist of RGB frames with a size of 320×240. Therefore, the size of the feature vector at level 5 of HDWT and IDWT is 80 features. As described in the previous sections, first, the shot boundaries are detected, second, the key frames are extracted, and finally, the retrieval methods are applied to the dataset. We have modified existing methods and proposed new ones for each part. The results of the retrieval system are shown in the next section.
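The figure of 80 features follows from halving the 320×240 frame size five times. The short calculation below assumes that odd sizes are rounded up at each level, which is the rounding that reproduces the stated value.

```python
import math


def feature_size(width, height, levels):
    """Size of the approximation component after `levels` halvings (rounding up)."""
    for _ in range(levels):
        width = math.ceil(width / 2)
        height = math.ceil(height / 2)
    return width, height, width * height


if __name__ == "__main__":
    # 320 -> 160 -> 80 -> 40 -> 20 -> 10 and 240 -> 120 -> 60 -> 30 -> 15 -> 8
    print(feature_size(320, 240, 5))
```

The level-5 approximation therefore has 10 × 8 = 80 entries, compared with 76,800 pixels in the original plane.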

INDEXING RESULTS
In this section, we present the results of the proposed method and compare them to other methods. The P(1) metric is obtained to compare and evaluate the proposed method using different acceleration algorithms. The best scores for each video category are bolded. The name of each video category, the number of frames and the number of key frames are shown in Table 1. These key frames are extracted by the proposed method described in the key frame extraction section. Table 2 shows the experimental results of the proposed methods via Intensity-HDWT, sparse DALM, sparse SL0, sparse Homotopy, and sparse PALM, and of the HER (Li et al., 2015), Heesch et al. (2004) and SCFV (Araujo et al., 2015) methods on the CC_WEB_VIDEO dataset. In this experiment, the best performance rate for the proposed method via Intensity-HDWT with the center or random queries is tested. In comparing the performances, the P(1) of the proposed method via Intensity-HDWT, sparse SL0, sparse DALM, sparse Homotopy, sparse PALM, HER, and Heesch et al. (2004) is considered for each video. The average is calculated by taking the mean value of the P(1) over the video categories. The average P(1) of the proposed method via Intensity-HDWT with the center query is 84.82%. After the proposed method via Intensity-HDWT with the center query, the proposed method via Intensity-HDWT with the random query provides better performance than the other methods.
The processing times of the proposed methods in the CC_WEB_VIDEO dataset are obtained and shown in Table 3. Once again, the proposed methods via Intensity-HDWT with the center and random queries provide the best processing times.Meanwhile, the average processing time of the proposed methods via Intensity-HDWT with center query is less than the other methods.
In the Changes Tupac video (video 20), the sparse Homotopy method with the random query provides a better performance than the other methods (P(1) = 60.87%), but its processing time is 9.561 seconds, whereas the processing time of the proposed method via Intensity-HDWT with the random query is 4.346 seconds with a P(1) of 58.7%. Therefore, considering both the P(1) and the processing time, the proposed method via Intensity-HDWT provides better performance than the other methods.
All numerical experiments are performed on a personal computer with a 2.6 GHz Core i5 and 3.8 GB of RAM. This computer runs Windows 7, with MATLAB 7.01 and a VC++ 6.0 compiler installed.
The obtained results are presented, and the methods are compared, as bar charts for clear visual representation. The P(1) metric and the processing times of the methods for each category are shown in Figs. 4 and 5, respectively. As shown in Figs. 4 and 5, the proposed method via Intensity-HDWT provides higher performance in P(1) and less processing time than the other methods. Consequently, as shown in Tables 1 and 2, and Figs. 4 and 5, the proposed method via Intensity-HDWT provides better performance than the other methods. The obtained results show that with respect to the performance rate and the size of the feature vectors, the proposed method scores extremely well.
Thus, the proposed algorithm can be considered more powerful than the existing methods.Moreover, the proposed system not only reduces the size of the feature vector and storage space but also improves the performance and reduces the video retrieval system's processing time.

Fig. 2. The block diagram of the shot boundary detection method.

Fig. 3. The block diagram of the key frame extraction method.
Given a signal vector $b \in \mathbb{R}^m$, signal (or atomic) decomposition is the linear combination of n basic atoms $a_i \in \mathbb{R}^m$ ($1 \le i \le n$) which constructs the signal vector b as b = Ax.

Fig. 4. Performance Comparison on various test video sequences between the proposed method and other methods.

Fig. 5. Comparison of processing time on various test video sequences between the proposed method and other methods.
e) The absolute distance difference d(t) (t = 0, …, 10 or 20) is calculated, where $f_s$ and $f_e$ stand for the last frame of the previous shot and the first frame of the next shot, respectively. If the criterion max(d(t)) − min(d(t)) > 0.33 is satisfied, go to the next sub step (sub step f). Otherwise, this segment is discarded and the procedure returns to the first sub step (sub step a).
f) The criterion $|(t_m - (N + 1)/2)/N| \le 0.25$ is checked, where $t_m$ is the point with the minimum value of the absolute distance difference, d, and N = 11 or 21. If it is satisfied, the criterion $(K_L + K_R)/N \le 0.3$ is checked, where $K_L$ and $K_R$ are the number of ascending points before $t_m$ and the number of descending points after $t_m$, respectively. If it is satisfied, the candidate GT segment is considered as a GT segment.

Table 1. The name of videos, the number of frames and key frames.

Table 2. The P(1)% of the proposed methods in each video.

Table 3. The processing times (second) of the proposed methods in each video.