Disclaimer: This is an example of a student written essay.

Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of UKEssays.com.

Video summarization techniques

Paper Type: Free Essay Subject: Computer Science
Wordcount: 5410 words Published: 1st Jan 2015


Recently, the number of videos has increased, and so has the ability of individuals to capture and create digital video. As a result, there is a growing need for video summarization. Video summarization is the creation of a summary of a video, and it has to satisfy three main principles. First, the summary should contain the most important scenes and events from the video while being as short as possible. For example, the summary of a soccer game must contain goals, fouls, scuffles, and other important scenes. Second, the summary should maintain good continuity between scenes; it should not simply concatenate video segments blindly. Finally, the summary should be free of redundancy, that is, it should contain no repetition. This is difficult to achieve. For instance, goals in a soccer game are commonly replayed many times during the match, which makes it hard to distinguish a goal from its replay. As a result, redundancy appears each time the goal is replayed.

Video summarization is considered one of the most important features for making video search far easier and more useful than before. It is also an important tool that lets people get the main idea and the important scenes without watching the full original video. For example, companies that use surveillance video to secure their buildings want to see only the important events that happened there, and video summarization is the way to achieve that. Large movie databases such as the Internet Movie Database (IMDb) and movie sellers also want software that can summarize movies automatically. As a result, companies can save time and effort by using video summarization.

Many techniques have been used recently for video summarization. In this paper, I address some of the most interesting techniques and methods. These techniques summarize video based on camera motion, meeting recordings, sports video, surveillance, and presentations.



Video summarization can be based on many methods, one of which is camera motion. In [1], two contributions are addressed. The first is a method for creating a summary of a given video, and the second is a method for evaluating that summary by comparing it with other video summaries. In other words, the authors try to create a good video summary based on camera motion and to evaluate the result. Also in [1], previous work on video summarization based on camera motion is grouped into three families.

In [1], the first family uses camera motion to segment the video rather than to select keyframes. In [2], moving objects are detected using camera motion, and the video summary is built from this detection. An illustration of moving-object detection is shown in Figure 1. The images on the left are two shots in which the object has been marked, and on the right is the constructed background image. Object recognition uses two modules: a segmentation module and a classification module. The object recognition process is shown in Figure 2.

In [3], the authors use camera motion to divide shots into segments, and keyframes are selected according to four different measures. The system is called Video Snapshot, and its architecture is shown in Figure 3. A video snapshot is created in six steps, as follows:

1) Cluster the connected sub-shots into which the video has been decomposed.

2) Detect the scenes according to the clustering relationships obtained in the first step.

3) Filter out the unimportant sub-shots so that the number remaining is less than or equal to the given number of grid cells.

4) Allocate the grid cells to the different scenes based on the number of key events in each scene detected in the second step.

5) Arrange the keyframes of the sub-shots in each scene in order, from top to bottom and from left to right, within the area assigned to that scene.

6) “Besides the black background and the rectangle border box, to guide browsing and enhance visual orderliness, the vertical and the horizontal splitting lines are added to the boundaries of scenes” [3].

In [4], shots are segmented based on camera motion, and MPEG motion vectors, which carry both object and camera motion, are then used to characterize the motion of each frame and to select keyframes. However, this technique is used more for segmenting video than for summarizing it. The authors propose InsightVideo, a system that analyzes and retrieves video. The system flow of InsightVideo is shown in Figure 4. InsightVideo has three components: video feature extraction, a hierarchical video content table, and progressive video content access.

In [1], the second family focuses on whether motion is present or not. In [5], shots are selected first, and each is then checked for camera motion. If there is camera motion, the shot is represented by three keyframes; if there is none, the shot is represented by one keyframe. In [6], the summarization method uses both camera motion and object motion: to capture the dynamic features of the video, the authors select the segments that contain large motions. “The segments with a camera motion provide keyframes which are added to the summary; nevertheless these approaches are based on simple consideration which exploit little information contributed by camera motion” [7].
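The rule described for [5] can be sketched in a few lines of Python. This is only an illustration: the essay does not say which three frames are chosen, so picking the first, middle, and last frame (and the middle frame for a static shot) is an assumption, and the function name `keyframes_for_shot` is hypothetical.

```python
def keyframes_for_shot(start, end, has_camera_motion):
    """Return frame indices that represent the shot [start, end] (inclusive).

    Assumption: a moving shot is represented by its first, middle and last
    frame, and a static shot by its middle frame. The source only states
    "three keyframes" versus "one keyframe".
    """
    middle = (start + end) // 2
    if has_camera_motion:
        # A set removes duplicates when the shot is shorter than three frames.
        return sorted({start, middle, end})
    return [middle]


print(keyframes_for_shot(0, 10, True))
print(keyframes_for_shot(0, 10, False))
```

Whether a shot contains camera motion would come from a separate motion detector, which is outside the scope of this sketch.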

In [1], the last family measures the similarity between frames and chooses keyframes according to that similarity. In [8], the similarity is computed as the distance between two frames. If the distance between two frames is small, they are close in content, and only a few keyframes are selected. The computation of the similarity between two frames is shown in Figure 5.
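As a rough illustration of similarity-based selection, the sketch below keeps a new keyframe only when a frame is far enough from the last keyframe. It is not the method of [8]: representing each frame by a histogram, using the L1 distance, and the `threshold` parameter are all assumptions made for the example.

```python
def l1_distance(hist_a, hist_b):
    """Sum of absolute bin differences between two equally sized histograms."""
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))


def select_keyframes(histograms, threshold):
    """Pick frame indices whose content differs enough from the last keyframe.

    histograms: one histogram (list of numbers) per frame; an assumed frame
    representation. threshold: hypothetical distance above which two frames
    are treated as different in content.
    """
    if not histograms:
        return []
    keyframes = [0]  # the first frame always starts the summary
    for index in range(1, len(histograms)):
        last_key = histograms[keyframes[-1]]
        if l1_distance(last_key, histograms[index]) > threshold:
            keyframes.append(index)
    return keyframes


frames = [[1, 0], [1, 0], [0, 1], [0, 1]]
print(select_keyframes(frames, threshold=0.5))
```

When consecutive frames stay close to the last keyframe, no new keyframe is added, which matches the observation that similar content yields few keyframes.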

The approach in [9] follows the same idea as [8]: Fauvet et al. “determine from the estimation of the dominant motion, the areas between two successive frames which are lost or appear. Then, a cumulative function of surfaces which appear between the first frame of the shot and the current frame is used to determine the keyframes”.

In [1], a new video summarization method is presented that is based on camera motion and/or a static camera. The authors consider camera motion important because it carries a lot of interesting information. For instance, a zoom in is one way to draw the viewer's attention, and a translation can indicate a change of location. The features of camera motion are therefore used to select keyframes. An advantage of the method is that it avoids direct comparison between frames and relies only on the classification of camera motion.

The summarization method based on camera motion rests on two principles. The first is the recognition of camera motion. In [1], camera motion detection recognizes translation, zoom, and static camera. The system design, illustrated in Figure 6, has three phases. The first phase is motion parameter extraction, which uses an affine parametric model to estimate the dominant motion between two consecutive frames. The second phase is camera motion classification, which consists of three stages.

In [1], the first stage of the second phase converts the motion model parameters into symbolic values. The second stage separates static frames from dynamic frames. The last stage integrates the dynamic frames temporally. Finally, the third phase extracts features from each segment of the video.
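The classification phase can be pictured with a simplified sketch. It assumes the common six-parameter affine form, in which the motion at pixel (x, y) is (a1 + a2*x + a3*y, a4 + a5*x + a6*y), reads translation from (a1, a4) and zoom from the divergence term (a2 + a6) / 2, and uses made-up thresholds. The actual symbolic conversion and temporal integration in [1] are not described in this essay, so none of this should be read as the authors' procedure.

```python
import math


def classify_motion(params, trans_thresh=1.0, zoom_thresh=0.01):
    """Label one frame pair as 'static', 'translation' or 'zoom'.

    params: (a1, a2, a3, a4, a5, a6) of an assumed affine motion model.
    trans_thresh and zoom_thresh are hypothetical values for illustration.
    """
    a1, a2, _a3, a4, _a5, a6 = params
    translation = math.hypot(a1, a4)   # magnitude of the constant term
    divergence = (a2 + a6) / 2.0       # uniform scaling component
    if abs(divergence) >= zoom_thresh:
        return "zoom"
    if translation >= trans_thresh:
        return "translation"
    return "static"


def merge_segments(labels):
    """Group consecutive identical labels into (label, first, last) segments."""
    segments = []
    for index, label in enumerate(labels):
        if segments and segments[-1][0] == label:
            segments[-1] = (label, segments[-1][1], index)
        else:
            segments.append((label, index, index))
    return segments


pairs = [(0, 0, 0, 0, 0, 0), (0.1, 0, 0, 0, 0, 0),
         (3, 0, 0, 4, 0, 0), (0, 0.05, 0, 0, 0, 0.05)]
labels = [classify_motion(p) for p in pairs]
print(merge_segments(labels))
```

The resulting segments are the units from which features would then be extracted in the third phase.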

In [1], the second principle of the summarization method based on camera motion is the selection of keyframes according to camera motion. There are three ways to select keyframes. The first is according to the succession of camera motions, as shown in Figure 7. For example, if there are two segments and neither is static, the first frame of each segment is selected. However, if one segment is static, then the first and last frames of the motion segment are selected. The second way is according to camera motion magnitude, as illustrated in Figure 8. For example, if a translation segment has low magnitude, the last frame of the segment is selected. If the segment has high magnitude and the translation is rectilinear, the first and last frames are selected. If the segment has high magnitude and the translation is not rectilinear, the first, middle, and last frames are selected. Likewise, for a zoom segment with low magnitude, the last frame is selected, while for a zoom segment with high magnitude, the first and last frames are selected. Finally, the third way selects keyframes according to both the succession and the magnitude of camera motion, as shown in Figure 9; it simply combines the first two ways.
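The magnitude-based rules for translation and zoom segments map directly onto a small decision function. The sketch below encodes only the cases stated in the text; how a segment is judged "high magnitude" or "rectilinear" is left to the caller, and the function and parameter names are hypothetical.

```python
def keyframes_by_magnitude(kind, start, end, high_magnitude, rectilinear=True):
    """Select keyframe indices for one motion segment [start, end].

    kind: 'translation' or 'zoom'. high_magnitude and rectilinear are
    booleans assumed to be decided elsewhere (the essay gives no thresholds).
    """
    middle = (start + end) // 2
    if kind == "translation":
        if not high_magnitude:
            return [end]                   # low magnitude: last frame only
        if rectilinear:
            return [start, end]            # high, rectilinear: first and last
        return [start, middle, end]        # high, not rectilinear
    if kind == "zoom":
        return [start, end] if high_magnitude else [end]
    # Static segments are not covered by the magnitude rules in the text.
    raise ValueError("unsupported segment kind: " + kind)


print(keyframes_by_magnitude("translation", 0, 100, True, rectilinear=False))
print(keyframes_by_magnitude("zoom", 0, 100, False))
```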


In [12], the author argues that recording a meeting on video is preferable to just writing it down in a document. Information is lost in many ways when a meeting is turned into a written document. Writing the document takes a long time and loses accuracy. Because it is only a written document, it also loses the reactions of those present. Moreover, it may not cover everything that occurred, so the record may be incomplete. Video summarization is therefore very useful for meeting recordings, and many methods have been proposed for it. In [10], a multimodal meeting summarization method is described that combines audio, visual, and speech-transcription information. The authors also propose a measure based on sound localization and the scale of the audio signal. Figure 10 illustrates the method in terms of sound localization, audio activity output, and speech transcription. The method can also detect the relations between those present and the contributor, based on loud speech.

In [10], visual activity is also analyzed in order to find specific events in the video sequence. Meeting video sequences generally contain little motion, so high motion marks notable moments, for example when a contributor gives a presentation or when people join the meeting. Figure 11 shows some high-motion events that occur in a meeting video.

In [10], the text is analyzed using language analysis techniques; the method computes the Term Frequency-Inverse Document Frequency (TF-IDF). As explained in [16], TF-IDF is a well-known measure that identifies the words that are relatively important in a document. In [11], the proposed method is useful only for video browsing and searching through a keyframe-based representation.
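For readers unfamiliar with TF-IDF, the following sketch computes one common variant: term frequency as the count divided by document length, and inverse document frequency as the natural log of the number of documents over the number of documents containing the term. The exact weighting used in [10] is not given in this essay, and the toy transcripts are invented for the example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    documents: list of token lists, for example words of a speech transcript.
    Uses tf = count / document length and idf = ln(N / document frequency),
    which is one common TF-IDF variant among several.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


transcripts = [["budget", "meeting", "budget"], ["meeting", "agenda"]]
for row in tf_idf(transcripts):
    print(row)
```

A word that appears in every transcript, such as "meeting" here, scores zero, while a word concentrated in one transcript scores higher.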

In [12], the author proposes a technique that summarizes meeting content by skimming the video to a user-determined length. The meeting processing method is illustrated in Figure 12. The recorder program sends the audio files to speech detection and to the meeting browser, which includes the speaker identification module. Three modules, summarization, emotion, and discourse, then exchange their data with the meeting browser front end, where the result is displayed. Finally, the meeting archive is accessed.

In [13], an automatic method is presented that creates video skims for different types of video, including presentation video, which is similar to meeting recordings. The process is shown in Figure 13. Summarization takes two steps. In the first step, the video is divided into segments, and visual, audio, and textual features extracted from the video stream are combined to assign a score to each segment. In the second step, segments are collected and the summary is created.

In [14], instead of playing the content faster, the authors create techniques that save time by deleting parts of the content. As a result, summaries produced this way can be far shorter than those of other techniques.

In [15], the authors find that skims produced by combining audio analysis with visual analysis are much better than skims produced from audio analysis alone or from uniform sampling. This result came from comparing video skims produced by three techniques. The first uses audio analysis based on audio amplitude together with term frequency-inverse document frequency (TF-IDF) analysis. The second combines audio analysis with image analysis based on face detection, text detection, and camera motion. The last is uniform sampling of the video sequence. The video skim process is illustrated in Figure 14.


In [17], the authors note that the number of sports videos has been increasing and that video summarization can help people cope with this growth. Many researchers have proposed techniques for condensing sports video through summarization. In [17], the authors propose a technique intended to suit all kinds of users and applications, which they call “complete sports videos summarization”. Figure 15 illustrates the hierarchy of sports video in terms of plays, breaks, and highlights. The framework combines plays, breaks, collections of highlights, and highlights, as shown in Figure 15. Each part of the framework is defined as follows:

1) A play is a collection of shots during which play does not stop.

2) A break is also a collection of shots, but one during which play is not running.

3) A collection of highlights is a group of highlights.

4) A highlight is a collection of shots that represents an event.

In [17], the play-break sequence model is shown in Figure 16; it is considered a fixed model for any sport.

In [17], the authors describe a sports video summarization technique that integrates highlights into play-breaks. To achieve that, we need to know whether each highlight belongs to a play or to a break. For example, in a soccer game, a foul stops play, and an earlier highlight may be replayed during the stoppage before play resumes. Figure 17 illustrates the integration of highlights into play-breaks for a soccer game. The model is easy to modify for other sports, so it is not only suitable for soccer.

In [17], there are three detection models. The first is play-break detection, in which camera view classification is used to detect play-break transitions. The second is highlight detection, which is based on slow-motion replay scenes. The advantage of this approach is that slow-motion replays are used to show interesting scenes. The disadvantage is that an interesting scene is sometimes not followed by a slow-motion replay, in which case the event is missed. The last model is text detection. In sports broadcasts, text is usually displayed on the screen after an interesting event, and this model is specialized to detect that text. In most sports videos the displayed text is horizontal, and the authors exploit this to detect it. However, if the text is not displayed horizontally, their technique will not work.

In [18], the authors propose a content-based video summarization technique for large sports video archives. The authors use metadata, which describes the content, quality, condition, and semantics of the data. The composition of the metadata is illustrated in Figure 18. Each scene carries five pieces of information: unit type, classification, players, events, and media time. The summary is created from the metadata by ranking play scenes. The method has two main parts. The first part is play scene selection, which consists of two sections: play scene significance and highlight selection. Play scene significance is based on three components: the play rank, the time at which the play occurs, and the number of replays. In highlight selection, the authors explain how a video summary is created according to play scene significance. The second part of the method is visualization, of which there are two types. The first is the video clip, where the user can choose the length of the video summary. The authors propose two methods for choosing the significant play scenes: a greedy method and a play-cut method. The second type of visualization is the video poster, a spatial visual layout that presents keyframe images, where each keyframe represents a scene in the summary. The authors illustrate their system interface in Figure 19. In Figure 19.a, each row shows important scenes represented by keyframes. In Figure 19.b, each row represents an inning of the game. In Figure 19.c, each row represents at-bats. In Figure 19.d, each row represents plays.
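The essay does not describe how the greedy method of [18] works internally. As a generic illustration of greedy selection under a user-chosen summary length, the sketch below takes scenes in decreasing order of significance while they still fit, then restores temporal order. The tuple layout and the sample significance values are assumptions.

```python
def greedy_summary(scenes, max_duration):
    """Choose play scenes for a summary no longer than max_duration.

    scenes: list of (start_time, duration, significance) tuples; an assumed
    layout for illustration. Scenes are considered from most to least
    significant and kept if they still fit in the remaining time.
    """
    chosen = []
    remaining = max_duration
    for scene in sorted(scenes, key=lambda s: s[2], reverse=True):
        if scene[1] <= remaining:
            chosen.append(scene)
            remaining -= scene[1]
    # Present the selected scenes in their original temporal order.
    return sorted(chosen, key=lambda s: s[0])


scenes = [(0, 10, 0.2), (30, 20, 0.9), (60, 15, 0.5), (90, 10, 0.7)]
print(greedy_summary(scenes, max_duration=35))
```

A greedy choice like this is simple and fast but does not guarantee the highest total significance for a given length.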

In [19], the author proposes a method for extracting highlights from sports TV broadcasts. The domain-specific and generic approaches are illustrated in Figure 20. The author models excitement following his approach in [20], where he models the effect of three low-level features on the excitement of the user. The first feature is “the overall motion activity measured at frame transition” [20]. The second is “the density of cuts” [20]. The last is “the energy contained in the audio track of a video” [20]. The time curves of the three features are shown in Figure 21.
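To make the idea of combining the three low-level features concrete, here is a deliberately simplified sketch. It normalizes each feature to [0, 1], smooths it with a moving average, and averages the three curves. The normalization, the smoothing window, and the equal weighting are assumptions chosen for illustration and are not the model of [20].

```python
def normalize(values):
    """Scale a sequence linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def smooth(values, window=3):
    """Centered moving average; windows are truncated at the sequence ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def excitement_curve(motion, cut_density, audio_energy, window=3):
    """Average the three smoothed, normalized feature curves per time step."""
    features = [smooth(normalize(f), window)
                for f in (motion, cut_density, audio_energy)]
    return [sum(vals) / len(vals) for vals in zip(*features)]


curve = excitement_curve([0, 0, 5, 0, 0], [0, 0, 2, 0, 0], [1, 1, 9, 1, 1])
print(curve)
```

Segments where such a curve stays above a chosen level could then be treated as highlight candidates.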

In [21], the authors propose a sports video technique that combines highlights with play-break scenes. “Researchers have identified that each type of sports have a typical and predictable temporal structure, recurrent events, consistent features and fixed number of views” [22].

In [21], the authors state that one approach to creating highlights is to optimize the visual features used. For example, in [23], highlights for a soccer game are generated according to penalty, midfield, in-between midfield, corner kick, and shot at goal. In [24], by contrast, highlights of a basketball game are generated according to left and right fast breaks, dunks, and close-up shots.

In [21], the authors note that, on one hand, summarizing sports video without break scenes is efficient because viewers focus only on the important events. On the other hand, break scenes sometimes contain important events that will not appear in highlights that exclude them. For instance, during a free kick, the teams are setting up their offensive or defensive arrangement. Also, the most important scenes sometimes occur between play scenes and break scenes. The framework of sports video summarization is illustrated in Figure 22.

The authors propose three types of detection: whistle detection, excitement detection, and displayed-text detection.

In [21], tracking audio during a sports game is difficult, especially with noise from the crowd and the background. Nevertheless, the authors find that the whistle sound is distinct from all other sounds. Figure 23 shows the whistle spectrogram, in which the authors identify the difference in sound according to frequency (Hz). Based on the whistle sound, the authors identify three situations in which the referee uses the whistle during a soccer match, as follows:

1. The start of the match, the end of the match, and the playing period.

2. When the referee stops play.

3. When the referee resumes play.

In [25], the authors find that the referee's whistle has a very high frequency and a strong spectrum that can be distinguished from any other sound. They also find that the whistle sound lies in the range 3500-4500 Hz.
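One simple way to test for energy in that band is to compute the fraction of spectral power falling between 3500 and 4500, assumed here to be in Hz. The sketch below uses a direct discrete Fourier transform on a short audio frame so that it needs only the standard library; the frame length, the sample rate, and any decision threshold on the ratio are assumptions, and this is not the detector described in [21] or [25].

```python
import cmath
import math


def band_energy_ratio(samples, sample_rate, low_hz=3500.0, high_hz=4500.0):
    """Fraction of spectral power of one frame that lies in [low_hz, high_hz].

    Uses a naive O(n^2) DFT, which is adequate for short frames. The band
    limits default to the whistle range quoted in the text (assumed Hz).
    """
    n = len(samples)
    total = 0.0
    band = 0.0
    for k in range(1, n // 2):  # skip DC, stop before the Nyquist bin
        coeff = sum(s * cmath.exp(-2j * math.pi * k * t / n)
                    for t, s in enumerate(samples))
        power = abs(coeff) ** 2
        freq = k * sample_rate / n
        total += power
        if low_hz <= freq <= high_hz:
            band += power
    return band / total if total > 0 else 0.0


rate = 16000
tone = [math.sin(2 * math.pi * 4000 * t / rate) for t in range(256)]
print(band_energy_ratio(tone, rate))
```

A frame could be flagged as a whistle candidate when this ratio exceeds a threshold tuned on real match audio.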

In [21], the second type of detection is excitement detection, for which the authors identify eight candidate indicators of excitement in a game, including the following:

1. The loudness of the crowd and the commentator.

2. When the commentator has a high pitch rate.

3. When the commentator pauses less, which means he has become more talkative.

The framework of the excitement detection is illustrated in Figure 24. The last type of detection is text display detection, for which the authors propose a technique. In [26], the authors “used the gradient of color image to calculate the complex-values from edge orientation image which is defined to map all edge orientation between 0 and 90 degree and thus distinguishing horizontal, diagonal and vertical lines”. Also, in [27], the authors “localized character regions by extracting strong still edges and pixels with a stable intensity for two seconds”. However, in [21], the authors detect only horizontal text, reporting that this works in 99% of cases across the different types of sport. Figure 25 illustrates the use of horizontal text. The names of the team players are displayed in Figure 25.a, and the score line of the match is displayed in Figure 25.b. The name of a substituted player is shown in Figure 25.c. Static text displayed throughout the whole game is shown in 25.c.

In [21], the authors propose a method for predicting the text displayed in the video, illustrated in Figure 26. However, displayed text does not cover every possibility; some events cannot be detected this way.

In [28], the authors present a framework for sports video summarization based on semantic annotation of text. Many sports video summarization techniques have been questioned [29, 30] because they depend on low-level features. But “semantic level events were generally inferred if special sequences of production level events occur” [31].

In [28], the authors identify three distinguishing characteristics of their framework, as follows:

1. The framework is not tied to one type of sport and can be adapted to other sports.

2. The text analysis does not require well-structured text webcasts.

3. Events can have different importance depending on the time and the score.

In [32], the authors propose a sports video summarization technique based on audio pitch and energy in order to recognize excitement in games. In [28], by contrast, the authors present a framework that detects events from play-by-play text webcasts using a logic-based technique. Five modules are connected to the center of the system architecture, as shown in Figure 27. The first is the graphical user interface, which makes it easy for the user to interact with the center. The second is the web parser, which is in charge of analysing HTML and producing plain text, which it sends to the third module. The third is the text analyzer, which is responsible for detecting semantic events from the plain text, as shown in Figure 28. The fourth is video processing, which takes video or audio data and retrieves keyframes. The last is the logic engine, which is responsible for choosing the important events and putting them in a list ready for summarization.

In [33], the author proposes a system that detects events and text from the output of the HTML parser, as shown in Figure 29.


In [34], two reasons are given for using video surveillance. The first is that it serves to detect events live as they happen. The second is that the video can be analyzed offline to retrieve any data required. Going through an entire video is tedious and wastes time. For example, if a crime occurs, the operator has to spend a great deal of time viewing the entire video to find all the important scenes. The authors therefore propose a summarization technique for surveillance video based on skimming, which they call adaptive video skimming. The scheme of this adaptive summarization is illustrated in Figure 30, and its general structure is also illustrated in [34]. Optical flow estimation is a very important factor in the adaptive summarization technique: it calculates the displacement of pixels between consecutive video frames [35, 36].

In [37], the authors propose a technique for surveillance video based on recognition and clustering that creates both static and dynamic summaries. It differs from the technique in [34] in that it combines video skimming with clustered keyframes in order to allow faster browsing of the entire video. In [38], two styles of surveillance video summarization are explained: the first provides summarization services over a home network, and the second provides them over the internet. In [39], the authors propose a technique for surveillance video summarization based on viewing time optimization, frame skipping, and bit rate limitation.

In [37], the authors develop their algorithm for a stationary surveillance camera that is fixed at one point and never moves. Event detection has three steps, which can be grouped into two main stages. In the first stage, the difference frame and its energy are computed for each frame. In the second stage, the reference frame is refreshed after an event frame is found. The three steps of the algorithm are as follows:

1- Calculation of the difference frame.

2- Calculation of the energy of the difference frame.

3- Frame selection based on a threshold.

Figure 32 illustrates the estimation of the frame difference energy.
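The three steps can be sketched as follows. Frames are modeled as small grids of grayscale values, the energy is taken to be the sum of squared pixel differences, and the reference frame is replaced by each detected event frame. These choices, along with the threshold value, are assumptions made for illustration rather than details taken from [37].

```python
def difference_energy(frame, reference):
    """Sum of squared pixel differences between two equally sized frames."""
    return sum((p - q) ** 2
               for row_f, row_r in zip(frame, reference)
               for p, q in zip(row_f, row_r))


def detect_event_frames(frames, threshold):
    """Return indices of frames whose difference energy exceeds threshold.

    frames: list of 2D lists of grayscale values (assumed representation).
    The first frame serves as the initial reference; the reference is
    refreshed to the current frame each time an event frame is found.
    """
    if not frames:
        return []
    reference = frames[0]
    events = []
    for index, frame in enumerate(frames[1:], start=1):
        if difference_energy(frame, reference) > threshold:
            events.append(index)
            reference = frame  # refresh the reference after an event
    return events


background = [[0, 0], [0, 0]]
with_object = [[0, 9], [0, 0]]
video = [background, background, with_object, with_object, background]
print(detect_event_frames(video, threshold=10))
```

With this refresh policy, a frame is flagged when the scene changes relative to the most recent event frame, so both the arrival and the departure of an object are reported.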

In [37], an overview of the event detection and summarization process is illustrated in Figure 33. The first row (I) represents a set of frames of the video before summarization. The second row (II) represents the interesting events found in (I) after analysis. The third row (III) represents sets of frames grouped into different events. The fourth row (IV) represents the clusters of keyframes. The last row (V) represents the final summary after clustering.

In [40], the authors propose OVISS (Omnidirectional Video Visualization and Summarization), a system that visualizes the contents of surveillance video and creates a summary. The proposed system has four features. First, video and audio can be analyzed together to detect events. Second, it builds an event-based spatio-temporal index. Third, by capturing the temporal and spatial relations between events, it makes at-a-glance visualization easy. Last, video summarization is not limited to one area or one event. The OVISS system can be divided into the following processing modules:

1. Sensing.

2. The analysis of the video.

3. The transformation of an image.

4. The models of visualization and summarization.

Figure 34 shows the real environment that the authors work with. There are four doors for entering and exiting. The OVS signal is in the center of the room, and the room is divided into four areas of equal size. Figure 35 illustrates the connection between these four areas in the omnidirectional view.

In [40], there are four types of events, In, Move, Stay, and Out, explained as follows:

1- In represents the time from when the object appears until it closes the door.

2- Move represents the time during which the object is moving in the area.

3- Stay represents the time during which the object stays in the same place.

4- Out represents the time from when the object opens the door until it is gone.

Figure 36 shows the event transition diagram. The authors include one additional state, but it is only the initial state before any event starts in the area.

In [40], the OVISS interface is illustrated in Figure 37. The timeline and the spatial map are at the top, and the image sequence is displayed at the bottom of the interface.


The number of videos has increased recently, both at home, such as personal video, and at work, such as meetings and presentations. As a result, researchers are trying to find techniques for summarizing these kinds of videos. Meeting summarization techniques were described in section 4; this section turns to summarizing presentation videos. In [41], two types of analysis are considered: the structure and the content of the video. Keyframe detection and scene-break detection have been the major focus of researchers [42, 43, 44].

In [41], the authors propose a method for summarizing video-taped presentations. They constrain the video sequences and the setting of their experiments: the camera is focused on one location, though it may zoom in or out depending on the situation. Figure 38 shows a sample web-based interface for video browsing [45]. A change that has no related semantic interpretation is called a nuisance change, as illustrated in Figure 39; for example, when the presenter moves a slide with his or her hands or body. In [41], the authors exclude nuisance changes from their technique and analysis. The other type does have seman

