Disclaimer: This is an example of a student written essay.

Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of UKEssays.com.

Content-Based Video Retrieval Method

Paper Type: Free Essay Subject: Computer Science
Wordcount: 2759 words Published: 22nd Jun 2018


An Approach for Analyzing keyframes based on Self Adaptive Threshold and Scene Descriptors

  • Suruthi.K, Tamil Selvan.T, Velu.S, Maheswaran.R, Kumaresan.A


Abstract

In this paper, we propose a content-based video retrieval (CBVR) method for retrieving a desired object from an abstracted video dataset. Recording and storing large volumes of surveillance video so that the main content can be retrieved later is costly in both time and space. Methods exist that retrieve the main content of a video based on regions of interest (ROI) and that use threshold values to recover background information from key frames, but setting those threshold values manually is difficult. We therefore propose a method that uses a self-adaptive threshold to determine the background information, together with several descriptors that make determining the contents of the key frames more efficient. CBVR can also be used to retrieve information about a desired object from the abstracted dataset.

Keywords: Self adaptive threshold, Keyframes, Descriptors, CBVR


1. Introduction

Security plays a major role in all organizations these days. It can be provided in many ways, depending on how critical the information being secured is. These security methodologies include posting guards around the perimeter, installing an electric fence around the infrastructure, or using any other effective technology available. In spite of these methodologies, effective 24×7 security can be provided by installing cameras at the crucial areas of an organization, placed out of human reach. The optimal number of cameras to install in an environment can be calculated as described in [1]. Since these cameras record around the clock, the recorded videos have to be stored and analyzed. Storing them requires an enormous database, and analyzing them requires a person to play through the entire video to find the incidents that occurred. The biggest drawback is that the video cannot be skipped, since skipping risks missing important actions. We therefore need a method that extracts the essential events from prolonged surveillance videos and stores these events alone in a separate database, which minimizes both the memory used for storage and the human effort of looking through entire videos. The first step in analyzing a video is to convert it into individual frames or images, since a video is formed by a sequence of moving visual images. This can be termed image retrieval.

Image retrieval is the process of retrieving images from an enormous database based on the metadata added to the images, that is, their annotations. These annotations have some drawbacks. Annotating images manually is time-consuming, and if images are annotated ambiguously, users will never get the required results no matter how many times they search the image database. Several methods for automatic image annotation have been under research due to advances in semantic web and social web applications. Alongside these advances, there is an effective methodology termed CBIR (content-based image retrieval), in which feature extraction is the basis. Text-based features are keywords and annotations, whereas visual features correspond to color, texture, faces and shapes [2]. Since features play a major role here, when the user supplies an input image, its pixel values are compared with those of all the images in the database, and the results returned to the user contain all the images that include a part of the queried image. This is an effective way of avoiding annotations and the ambiguity they bring. Since we are dealing with videos here, we need an approach that goes beyond CBIR.

2. Related Work:

Speech recognition is an important conc

3. Fast Clustering Method Based on ROI

Since online videos are easy for users to access these days, we need an effective way to store and maintain an enormous number of video files while allowing easy and quick access for multiple users. To support research in this area, Guang-Hua-Song et al. have proposed fast clustering based on the region of interest (ROI). The authors employ the average histogram algorithm to extract key frames from each shot. A shot can be defined as the depiction of a particular scene or action: a single shot is the action covered by a camera between the start and stop of a recording, normally from the same angle. The extracted key frames are used to generate edge maps, which form the next step in video abstraction. From these, the authors determine the key points. Threshold values are then calculated from the respective key frames in order to expand and identify the area surrounding the key points [9]. The authors observe the main content of each key frame based on the defined threshold values and the key points. As the final step of their method, they take the ROIs of the key frames and apply the fast clustering method to them. The steps that precede fast clustering, along with the fast clustering method itself, are explained in the following sections.

A. Key frame Extraction

A video sequence is represented as a hierarchical structure in which the scene, the shot and the frame form different levels of the hierarchy [10]. Research on video sequences works at whichever level of this hierarchy carries the information it needs. For key frame extraction, the shot is considered first. The shot level is chosen over the other levels for several reasons. A shot is the sequence of video frames captured continuously by a camera, and it includes moving objects as well as panning and zooming of the recording camera. Its greatest merit is that two adjacent shots do not have the same content, which eliminates redundancy. The authors use the algorithm proposed in [11] to extract key frames. The key frame extraction process also involves the average histogram method. A shot $S = \{f_1, f_2, \ldots, f_n\}$ of length $n$ is assumed, where $f_k$ denotes the $k$th frame of the shot. Let $H_k$ be the gray level histogram with $L$ bins generated from frame $f_k$. The average histogram $H$ is then calculated with the following formula:

$$H(i) = \frac{1}{n} \sum_{k=1}^{n} H_k(i), \qquad i = 1, \ldots, L$$

where $H_k(i)$ represents the value of the $i$th bin of the histogram of frame $k$. After key frame extraction, ROIs are generated through a series of key frame analysis steps. This process is followed by saliency map generation and edge map generation.
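The average histogram step can be illustrated with a minimal Python sketch. It assumes frames are small 2-D grids of gray levels in 0..255. The rule used to pick the key frame (the frame whose histogram is closest to the average) is an assumption made for illustration, as are the names `gray_histogram`, `average_histogram` and `pick_key_frame`; the algorithm of [11] may differ.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[int]]  # 2-D grid of gray levels in 0..255


def gray_histogram(frame: Frame, bins: int = 16) -> List[int]:
    """Count the pixels of one frame that fall into each gray-level bin."""
    hist = [0] * bins
    for row in frame:
        for pixel in row:
            hist[pixel * bins // 256] += 1
    return hist


def average_histogram(shot: Sequence[Frame], bins: int = 16) -> List[float]:
    """Bin-wise mean of the histograms of all n frames in the shot."""
    n = len(shot)
    total = [0] * bins
    for frame in shot:
        for i, count in enumerate(gray_histogram(frame, bins)):
            total[i] += count
    return [t / n for t in total]


def pick_key_frame(shot: Sequence[Frame], bins: int = 16) -> int:
    """Assumed rule: index of the frame whose histogram is closest
    (L1 distance) to the average histogram of the shot."""
    avg = average_histogram(shot, bins)

    def distance(frame: Frame) -> float:
        return sum(abs(h - a) for h, a in zip(gray_histogram(frame, bins), avg))

    return min(range(len(shot)), key=lambda k: distance(shot[k]))
```

With a shot of three frames that go from fully dark to fully bright, the half-dark middle frame is the one selected under this rule.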

B. Edge Map Detection

We generally focus on objects that have a complete shape in the video, and such objects contain edges. We need to determine the key points located inside the objects, so detecting edges makes the tracking process easier. The authors use Canny edge detection as described in [12]. This process is followed by locating the key points and generating the ROI.
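Canny edge detection builds on an image gradient. The sketch below shows only that gradient step, using Sobel kernels and a single fixed threshold. It is a deliberately simpler stand-in and not the Canny detector of [12]: it has no smoothing, no non-maximum suppression and no hysteresis. The function name `sobel_edge_map` and the default threshold are assumptions.

```python
from typing import List, Sequence


def sobel_edge_map(frame: Sequence[Sequence[int]], threshold: float = 128.0) -> List[List[int]]:
    """Return a binary edge map (1 = edge) for a 2-D gray-level frame.

    Border pixels are left at 0 because the 3x3 kernels do not fit there.
    """
    height, width = len(frame), len(frame[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Horizontal gradient: right column minus left column.
            gx = (frame[y - 1][x + 1] + 2 * frame[y][x + 1] + frame[y + 1][x + 1]
                  - frame[y - 1][x - 1] - 2 * frame[y][x - 1] - frame[y + 1][x - 1])
            # Vertical gradient: bottom row minus top row.
            gy = (frame[y + 1][x - 1] + 2 * frame[y + 1][x] + frame[y + 1][x + 1]
                  - frame[y - 1][x - 1] - 2 * frame[y - 1][x] - frame[y - 1][x + 1])
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[y][x] = 1
    return edges
```

On a frame whose left half is black and right half is white, the interior pixels on both sides of the boundary are marked as edges, while a uniformly colored frame produces an empty edge map.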

C. Fast Clustering

In a video sequence, although each shot portrays different content, some shots may look similar to one another in camera angle, in the facial expressions of the people involved, or in other ways. Sometimes a shot is manually segmented into several shots that are used at different places in a video sequence. The authors aim to make the video sequence compact, so they cluster the key frames in order to remove the redundant frames.

Normally, clustering before the whole key frame extraction process is complete is of no use, since new frames cannot be taken into account. To overcome this limitation of the traditional approach, the authors use fast clustering, in which the clustering process starts once key frame extraction and ROI identification are done. Although this approach is good to an extent, the authors do not use more effective descriptors to extract more features from the frames for better observation. In addition, manually setting the threshold to obtain the background information is not very effective.
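The idea of grouping similar key frames to remove redundancy can be sketched with a simple one-pass clustering over key frame histograms. This is not the fast clustering method of the cited authors, whose details are not given here. It assumes each key frame (or its ROI) has already been reduced to a histogram, and the names `histogram_distance`, `cluster_key_frames` and the `max_distance` parameter are hypothetical.

```python
from typing import List, Sequence


def histogram_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance between two histograms, scaled by their total mass to lie in 0..1."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def cluster_key_frames(histograms: Sequence[Sequence[float]],
                       max_distance: float = 0.25) -> List[List[int]]:
    """Assign each key frame to the first cluster whose representative
    (its first member) is within max_distance, else start a new cluster."""
    clusters: List[List[int]] = []
    for index, hist in enumerate(histograms):
        for cluster in clusters:
            if histogram_distance(histograms[cluster[0]], hist) <= max_distance:
                cluster.append(index)
                break
        else:
            clusters.append([index])
    return clusters
```

Keeping only the first member of each cluster gives a compact set of key frames without the near-duplicates.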

4. Application of Self Adaptive Threshold and Descriptors

Although a manually assigned threshold can work well, setting it by hand is a difficult task. We therefore need an alternative way of setting the threshold, which is the adaptive threshold methodology. We propose the use of an adaptive threshold in our video abstraction method in order to gain more knowledge about the objects in the background. In addition, we make use of several descriptors such as FCTH (Fuzzy Color and Texture Histogram) and SCD (Scalable Color Descriptor). A descriptor is generally used to extract different kinds of features from an image, depending on the functionality of the descriptor. Features are the different kinds of information that can be extracted from an image, such as color, intensity and pixels. The functionality of FCTH and SCD is discussed as follows:


A. FCTH (Fuzzy Color and Texture Histogram)

In this type of descriptor, fuzzy logic is used for gathering information about colors which lie between pure black and pure white. Fuzzy logic is used here because it deals with all the possible scenarios (partially true / partially false) that lie between the True (1) and False (0) values.
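The fuzzy idea of partial truth can be illustrated on gray levels alone. The sketch below is not the FCTH descriptor itself, which also covers color and texture. It only shows how a gray level between black and white can belong partly to several fuzzy sets, and how a histogram can accumulate those partial memberships. The three sets and their breakpoints are assumptions chosen for illustration.

```python
from typing import Dict, Sequence


def triangular(x: float, left: float, peak: float, right: float) -> float:
    """Degree (0..1) to which x belongs to a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def gray_memberships(level: int) -> Dict[str, float]:
    """Fuzzy memberships of a gray level 0..255 in 'black', 'gray' and 'white'."""
    return {
        "black": max(0.0, 1.0 - level / 128.0),
        "gray": triangular(level, 0.0, 127.5, 255.0),
        "white": max(0.0, (level - 127) / 128.0),
    }


def fuzzy_histogram(frame: Sequence[Sequence[int]]) -> Dict[str, float]:
    """Sum the partial memberships of every pixel instead of counting hard bins."""
    hist = {"black": 0.0, "gray": 0.0, "white": 0.0}
    for row in frame:
        for pixel in row:
            for name, degree in gray_memberships(pixel).items():
                hist[name] += degree
    return hist
```

A pixel with gray level 64 is half "black" and roughly half "gray" under these sets, instead of being forced into a single bin.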

B. SCD (Scalable Color Descriptor)

SCD is used here to extract information about the colors which are scalable. These scalable colors are colors that extend to the nearby boundaries and are available in a different form within that boundary.
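To our understanding, the MPEG-7 Scalable Color Descriptor is an HSV color histogram encoded with a Haar transform, so that the same descriptor can be read at coarser or finer resolutions. The sketch below is a simplified illustration in that spirit and not a standard-conformant implementation. The 16×4×4 bin layout and the helper names `hsv_histogram`, `haar_step` and `coarsen` are assumptions.

```python
import colorsys
from typing import List, Sequence, Tuple


def hsv_histogram(pixels: Sequence[Tuple[int, int, int]],
                  h_bins: int = 16, s_bins: int = 4, v_bins: int = 4) -> List[int]:
    """Histogram of RGB pixels (0..255 per channel) over quantised HSV bins."""
    hist = [0] * (h_bins * s_bins * v_bins)
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        hi = min(int(h * h_bins), h_bins - 1)
        si = min(int(s * s_bins), s_bins - 1)
        vi = min(int(v * v_bins), v_bins - 1)
        hist[(hi * s_bins + si) * v_bins + vi] += 1
    return hist


def haar_step(values: Sequence[int]) -> Tuple[List[int], List[int]]:
    """One Haar level: pairwise sums (coarse part) and pairwise differences (detail)."""
    sums = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    diffs = [values[i] - values[i + 1] for i in range(0, len(values), 2)]
    return sums, diffs


def coarsen(hist: Sequence[int], levels: int) -> List[int]:
    """Keep only the coarse sums after `levels` Haar steps (length halves each step)."""
    current = list(hist)
    for _level in range(levels):
        current, _details = haar_step(current)
    return current
```

Four Haar steps reduce the 256-bin histogram to 16 bins, one per hue range in this layout, which is the sense in which the representation scales.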

C. Algorithm: Distance Vector

We use the Distance Vector algorithm in this video abstraction process to observe the distance travelled by an object between two subsequent frames and thereby determine the motion of the object. It involves the following steps:

  1. Detecting and identifying the boundaries of the moving objects.
  2. Extracting ROI (region of interest) of the object within the frame.
  3. Searching for the same object in the next frame.
  4. Detecting boundaries and location of the object.
  5. Comparing the location of the object and finding the distance it has moved from the previous frame to the current frame.
  6. Repeating the above steps for all the video frames to find the distance covered by the moving object in each frame.
  7. Updating the distance vector matrix.
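The steps above can be sketched as follows. The sketch assumes that object detection has already produced one binary foreground mask per frame containing a single object (1 = object pixel). It measures motion as the Euclidean distance between the centres of the bounding boxes in consecutive frames. The function names and the use of the bounding-box centre are assumptions made for illustration.

```python
from math import hypot
from typing import List, Optional, Sequence, Tuple

Mask = Sequence[Sequence[int]]


def bounding_box(mask: Mask) -> Optional[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) of the object pixels, or None if absent."""
    coords = [(y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None
    ys = [y for y, _ in coords]
    xs = [x for _, x in coords]
    return min(ys), min(xs), max(ys), max(xs)


def box_centre(box: Tuple[int, int, int, int]) -> Tuple[float, float]:
    """Centre (y, x) of a bounding box."""
    top, left, bottom, right = box
    return (top + bottom) / 2, (left + right) / 2


def distance_vector(masks: Sequence[Mask]) -> List[Optional[float]]:
    """Distance moved by the object between each pair of consecutive frames.

    An entry is None when the object is missing from either frame of the pair.
    """
    centres = []
    for mask in masks:
        box = bounding_box(mask)
        centres.append(box_centre(box) if box is not None else None)
    distances: List[Optional[float]] = []
    for prev, curr in zip(centres, centres[1:]):
        if prev is None or curr is None:
            distances.append(None)
        else:
            distances.append(hypot(curr[0] - prev[0], curr[1] - prev[1]))
    return distances
```

The returned list plays the role of the distance vector here: entry k holds the distance covered between frame k and frame k+1.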

The overall flow of the proposed methodology is shown in Figure 1.

Figure 1. Block Diagram of the Proposed Methodology

This scenario is applied to minimize the memory complexity of storing and retrieving enormous 24×7 surveillance videos. Recording and storing the entire video increases the demand for memory, and looking through the entire video to verify a crime scene is even more complex. To overcome this complexity, our method extracts the key frames from the entire video and stores them in a chosen database in which only the distinct images are available, minimizing the user's work of looking through a full-length video. In addition, saving images demands much less memory than saving videos. Since we use descriptors, more detailed information can be extracted from the images. The self-adaptive threshold enables the user to get more details about the objects available in the background, which is an added advantage of this methodology. Any frame can be given as a query to the system, and the user gets the relevant video containing that key frame. If the frame is not available in any video of the dataset, the user is shown an error prompt. This process is termed CBVR. CBVR is similar to CBIR, but differs in that the result is a frame (image) in the case of CBIR, whereas the result is the entire video in the case of CBVR. In both cases, data is compared and retrieved based on the contents of the frames.
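The query step of CBVR can be sketched as a lookup over stored key frame features. The sketch assumes every key frame and the query frame have already been reduced to a feature histogram, and that the index maps each video name to the histograms of its key frames. The video names, the `max_distance` cut-off and the function names are hypothetical.

```python
from typing import Dict, List, Sequence


def normalised_l1(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance between two histograms, scaled by their total mass to lie in 0..1."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def retrieve_video(query: Sequence[float],
                   index: Dict[str, List[Sequence[float]]],
                   max_distance: float = 0.1) -> str:
    """Return the name of the video whose key frame best matches the query.

    Raises LookupError (the 'error prompt' case) when no key frame is
    within max_distance of the query.
    """
    best_name = None
    best_distance = None
    for name, key_frames in index.items():
        for hist in key_frames:
            d = normalised_l1(query, hist)
            if best_distance is None or d < best_distance:
                best_name, best_distance = name, d
    if best_name is None or best_distance > max_distance:
        raise LookupError("requested frame not found in any video")
    return best_name
```

Because every stored key frame is compared in turn, the lookup time grows with the size of the dataset.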

5. Experimental results

We have conducted our experiment with videos available in the MATLAB dataset. The first step is the extraction of key frames based on the self-adaptive threshold value, as shown in Figure 2.

Figure 2. Window for Key frame Extraction
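Key frame extraction driven by a self-adaptive threshold could look like the sketch below. The paper does not state how the threshold adapts, so the rule used here is an assumption: the threshold is the mean plus k standard deviations of the differences between consecutive frames of the same video, and a frame becomes a key frame when its difference from the previous frame exceeds that value. The function names and the parameter `k` are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence

Frame = Sequence[Sequence[int]]


def frame_difference(a: Frame, b: Frame) -> float:
    """Mean absolute gray-level difference between two equally sized frames."""
    total = 0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count


def adaptive_threshold(differences: Sequence[float], k: float = 1.0) -> float:
    """Assumed rule: threshold derived from the video's own difference statistics."""
    return mean(differences) + k * pstdev(differences)


def extract_key_frames(frames: Sequence[Frame], k: float = 1.0) -> List[int]:
    """Indices of key frames: the first frame plus every frame whose
    difference from its predecessor exceeds the adaptive threshold."""
    if not frames:
        return []
    if len(frames) == 1:
        return [0]
    diffs = [frame_difference(frames[i - 1], frames[i]) for i in range(1, len(frames))]
    threshold = adaptive_threshold(diffs, k)
    return [0] + [i for i, d in enumerate(diffs, start=1) if d > threshold]
```

Because the threshold is computed from each video's own frame differences, no value has to be set by hand for a given camera or scene.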

Key frames are extracted and stored in a destination folder, as shown in Figure 3.

Figure 3. Key frames Stored in the Destination Folder

After key frame extraction, the user can input a key frame of their choice. The contents of all the videos available in the dataset are compared, and the video containing the requested key frame is found using CBVR and retrieved, as shown in Figure 4a. The user can click the play button at the bottom right to play the entire video containing the requested key frame. If the requested frame is not found, the user is prompted with an error message, as shown in Figure 4b.

Figure 4a. Video is retrieved based on the queried key frame using CBVR

Figure 4b. User is prompted with an error message since the requested frame is not found

Our experiment has shown a promising result, with more than 80% accuracy. As explained above, this methodology can decrease the memory space required and the time the user spends looking through entire videos.

6. Conclusion

In this paper, we have proposed a methodology for video abstraction based on several descriptors and a self-adaptive threshold. This methodology helps the user minimize the memory and time needed for looking through the videos. It also makes use of CBVR to retrieve a video based on its contents, given a key frame requested by the user. The only problem our methodology faces is the time taken for comparison when the key frame being searched for belongs to the last video in the dataset. Our future work will concentrate on reducing the comparison time in a large video dataset.


References

[1] Tatsuya Hirahara

Figure Captions

  • Fig.1.Optimal Position fo

