INT J COMPUT COMMUN, ISSN 1841-9836 Vol.7 (2012), No. 3 (September), pp. 565-573

Packet-Layer Quality Assessment for Networked Video

H. Su, F. Yang, J. Song

Honglei Su, Fuzheng Yang, Jiarun Song
State Key Laboratory of Integrated Service Networks
Xidian University, Xi'an, Shaanxi 710071, China
E-mail: {hlsu,fzhyang}@mail.xidian.edu.cn, sjrxidian@hotmail.com

Abstract: To realize real-time and non-intrusive quality monitoring for networked video, a content-adaptive packet-layer model for quality assessment is proposed. Considering the fact that the coding distortion of a video depends not only on the bit-rate but also on the motion characteristics of the video content, the temporal complexity is evaluated and incorporated into quality assessment in the proposed model. Since very limited information is available to a packet-layer model, an adaptive method for frame type detection is first applied. Then the temporal complexity, which reflects the motion characteristics of the video content, is estimated using the ratio of the bit-rates for coding I frames and P frames. The estimated temporal complexity is incorporated into the proposed model, making it adaptive to different video content. Experimental results show that the proposed model outperforms the ITU-T G.1070 model.

Keywords: Packet-Layer Model, Networked Video, Video Quality Assessment, Coding Distortion, Temporal Complexity.

1 Introduction

Recently, with the development of advanced multimedia processing technologies [1], multimedia services such as videophone, mobile conferencing and Internet Protocol Television (IPTV) have gained significant popularity in daily life. However, the quality of these applications cannot be guaranteed in an IP network due to its best-effort delivery. It is therefore crucial to establish an objective model for video quality assessment targeting system design, QoS (Quality of Service) planning and quality monitoring [2], [3].
Objective video quality assessment can be categorized into media-layer models, bitstream-layer models, packet-layer models, parametric models and hybrid models, according to the input information. To estimate the perceptual quality of service (QoS) for users, the media-layer models use media signals [4], where characteristics of the video content and decoder strategies such as error concealment are usually taken into account. The bitstream-layer models, on the other hand, analyze the bitstream without resorting to a complete decoding [5], and can therefore be used where the decoded video sequences are not accessible. The packet-layer models exploit the packet headers to obtain information about the service quality [6], making them well suited for in-service non-intrusive monitoring. The parametric models employ parameters from the network or the application [7], [8]. Parameters from the network may include the packet loss rate and delay information, while those from the application usually cover the coding bit-rate, frame rate, and so on. The hybrid models use a combination of information from the bitstream and the media data, and therefore achieve better performance as well as combining features of the other models [9].

Since the packet-layer model only utilizes information from packet headers, it is very efficient for quality monitoring due to its low complexity, and is especially suitable for quality monitoring at network inter-nodes. Another advantage is that the packet-layer model needs neither decryption nor decoding, making it favorable when packet payloads are encrypted.

Copyright © 2006-2012 by CCC Publications

In this paper, a packet-layer model is proposed for efficient quality assessment of networked video. Utilizing the limited information that packet headers can provide, the frame type and temporal complexity are estimated based on the coding bit-rate.
The temporal complexity is incorporated into the proposed model to make it content-adaptive. The remainder of this paper is organized as follows. The framework of the proposed packet-layer video quality assessment is introduced in Section 2. Section 3 discusses the relationship between the coding distortion and the bit-rate. The proposed packet-layer model for video quality assessment is described in Section 4. Experimental results are presented in Section 5. The paper closes with conclusions in Section 6.

2 Packet-Layer Model for Video Quality Assessment

The packet-layer model for video quality assessment is especially suitable for application scenarios such as in-service video quality monitoring and network service planning. It predicts the networked video quality from packet-header information, without resorting to any media-related payload information. Since only the packet header is exploited, the packet-layer model is very useful at network inter-nodes due to its low complexity, where it can monitor thousands of video streams at the same time. Another advantage is that the packet-layer model needs neither decryption nor video decoding, making it favorable when packet payloads are encrypted. As an example, Figure 1 shows the structure of a packet in the RTP/UDP/IP protocol stack. In this case, the IP (Internet Protocol) header, the UDP (User Datagram Protocol) header, and the RTP (Real-time Transport Protocol) header can be accessed by a packet-layer model. The length of the payload is easily obtained, since the UDP length field indicates the combined length of the UDP header [10], the RTP header and the payload [11]. The marker bit in the RTP header indicates the end of a video frame, and all packets belonging to one video frame carry the same RTP timestamp. Using this information, the packets can be assembled into frames.
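As a concrete illustration (not part of the original paper), the header fields a packet-layer model relies on can be read directly from the 12-byte fixed RTP header defined in RFC 3550. The following Python sketch parses the marker bit and the timestamp:

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Parse the 12-byte fixed RTP header (RFC 3550).

    Returns the fields a packet-layer model needs: the marker bit,
    which flags the last packet of a video frame, and the timestamp,
    which is shared by all packets of the same frame."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed RTP header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,                # RTP version, 2 in practice
        "marker": (b1 >> 7) & 0x1,         # 1 on the last packet of a frame
        "payload_type": b1 & 0x7F,
        "sequence_number": seq,
        "timestamp": ts,                   # identical for all packets of a frame
        "ssrc": ssrc,
    }
```

Packets with CSRC lists or header extensions carry extra bytes after these twelve, but the fields used here always sit at the same fixed offsets.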
Figure 1: The structure of a packet under RTP/UDP/IP protocol stacks

By analyzing packet headers, the information needed by the parametric model can be obtained and used to estimate the video quality [12]. Apart from the bit-rate, frame rate and packet loss rate, which are usually employed in parametric models [13], [14], other information can also be exploited by a packet-layer model, such as coding parameters (e.g., the frame type and the bit-rate of each frame), information about the video content characteristics (e.g., the ratio of the bit-rates for coding I frames and P frames), and the exact positions of lost packets in a video.

The framework of the proposed packet-layer model is shown in Figure 2. Firstly, after packet header analysis, the bit-rate for coding each frame is obtained. It is then used to detect the frame type and to calculate the ratio of the bit-rates for coding I frames and P frames. This ratio is employed in the proposed model to estimate the temporal complexity, which reflects the motion characteristics of the video content. Finally, the coding distortion of the networked video is evaluated using the bit-rate information and the estimated temporal complexity.
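The first stage of this framework, turning a packet stream into per-frame bit counts, can be sketched as follows. This is an illustrative fragment with a hypothetical input format, not the paper's implementation: each packet is reduced to an (RTP timestamp, payload length) pair, and packets sharing a timestamp are merged into one frame.

```python
def frames_from_packets(packets):
    """Group packets into video frames and total the bits per frame.

    `packets` is a sequence of (rtp_timestamp, payload_bytes) pairs in
    arrival order; all packets of one video frame share the same RTP
    timestamp. Returns a list of (timestamp, frame_bits) tuples."""
    frames = []
    for ts, payload_bytes in packets:
        if frames and frames[-1][0] == ts:
            frames[-1][1] += 8 * payload_bytes     # same frame: accumulate bits
        else:
            frames.append([ts, 8 * payload_bytes])  # new timestamp: new frame
    return [tuple(f) for f in frames]
```

Dividing each frame's bit count by the frame size in pixels gives the per-frame bit-rate in bits/pixel used throughout the paper.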
Figure 2: Framework of the proposed packet-layer model (stages: packet header analysis, frame type detection, temporal complexity estimation, coding distortion evaluation)

Figure 3: Relationship between the MOS and the bit-rate (bits/pixel) for each sequence

Figure 4: The ratio of the bit-rates for coding I frames and P frames

3 Coding Distortion and Bit-Rate

The bit-rate is a key parameter for estimating the coding distortion. It is well recognized that there is a relationship between the bit-rate and the average Mean Opinion Score (MOS) over different video sequences. Several functions have therefore been proposed to approximate this relationship and predict the average coding distortion from the bit-rate, such as the computational model in ITU-T Recommendation G.1070 [15] and its enhancement [16]. Other model forms can also be found, such as the "m-n model" [17] and the exponential model [12], [18]. A detailed performance comparison of these models is provided in [19], where the enhanced G.1070 model shows superior performance. Although the average MOS can be predicted by such models, the subjective quality of an individual video cannot be well formulated from the coding bit-rate alone. Considering the video quality of each sequence, the relationship between the bit-rate and the MOS is shown in Figure 3. It can be observed that there are obvious differences in video quality at the same bit-rate for different sequences. Therefore, the bit-rate alone is not sufficient for estimating the quality of a particular video service.
It is now widely acknowledged that content features must be taken into account for an accurate prediction of the perceived video quality [20]. Building on this argument, video clips have been classified into three classes according to the subjective movement content (high, medium and low), with the model parameters calculated for each class [18], [19]. However, it is not described how to obtain the information about the movement content of each video clip from objective parameters [18]. Although the average SAD (sum of absolute differences) is employed in [19] to reflect the motion characteristics of the video content, this value is not available to a packet-layer model. According to Figure 3, a video clip with higher motion complexity, such as "Football", has a comparatively lower quality at the same bit-rate. Correspondingly, "Container", with lower motion, has a higher quality than the others at the same coding bit-rate. Therefore, the temporal complexity estimated from the packet headers is expected to reflect the motion extent of video clips. How to measure this quantity and then establish a content-adaptive packet-layer model is described in the next section.

Figure 5: Per-frame bit-rate (bits/pixel) for I and P frames at 0.202 bits/pixel: (a) Football, (b) News

4 Packet-Layer Model Based on Video Content

Figure 4 shows the ratio of the bit-rates for coding I frames and P frames for different video clips. This ratio is consistently lower for the "Football" sequence due to its high temporal complexity, which requires more bits for coding the P frames.
On the other hand, the values of this ratio for the "Container" sequence are consistently larger because of its lower temporal complexity. From this observation, low values of this ratio correspond to a high temporal complexity. Therefore, the temporal complexity can be roughly estimated using the ratio of the bit-rates for coding I frames and P frames. For a packet-layer model, however, the frame type information is not readily available. Consequently, a frame type detection method based on the bit-rate distribution of each frame is introduced as follows.

4.1 Frame Type Detection

As a general principle, video coding exploits spatial redundancy using intra coding and temporal redundancy using inter coding, where the inter coding modes are usually more efficient in removing redundancy. Accordingly, the bit-rate for coding an I frame is usually much higher than that for a P frame, as shown in Figure 5. Consequently, a threshold-based method is proposed to detect the frame type using the coding bit-rate. However, Figure 5 also shows that the bit-rate associated with a certain frame type varies with the video content. Because of the high motion complexity of "Football", more bits are allocated to its P frames. As a result, at the same overall bit-rate, the I frames of "Football" are coded with fewer bits than those of "News", which has a lower motion complexity. Therefore, to make the detection more effective, the threshold for video clips of higher temporal complexity should be lower than that for clips of lower temporal complexity, and fixed thresholds may fail across different video clips. Dynamic thresholds which are adaptively adjusted [21] are therefore applied in this paper.

4.2 Temporal Complexity Estimation

After frame type detection, the ratio of the bit-rates for coding I frames and P frames can be calculated using the detected frame type and the bit-rate of each frame.
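A minimal sketch of such an adaptive threshold is given below. The decision rule (declare a frame I when its size exceeds a multiple of the running average of recent P-frame sizes) and the parameter values are illustrative assumptions, not the exact dynamic-threshold scheme of [21]:

```python
def detect_frame_types(frame_bits, ratio=3.0, window=8):
    """Label each frame 'I' or 'P' from its coded size in bits.

    A frame is declared I when its size exceeds `ratio` times the
    running average of the last `window` P-frame sizes, so the
    threshold adapts to the content: sequences whose P frames are
    large (high motion) automatically get a higher absolute threshold.
    `ratio` and `window` are illustrative values."""
    types = ['I']                    # a GOP starts with an I frame
    recent_p = []
    last_i = frame_bits[0]
    for bits in frame_bits[1:]:
        if recent_p:
            threshold = ratio * (sum(recent_p) / len(recent_p))
        else:
            threshold = last_i / 2.0  # before any P is seen, compare to the last I
        if bits > threshold:
            types.append('I')
            last_i = bits
        else:
            types.append('P')
            recent_p = (recent_p + [bits])[-window:]
    return types
```

For a typical trace such as [10000, 2000, 2100, 1900, 9500, 2050] bits, the function labels the first and fifth frames as I and the rest as P, matching the "IPPP" GOP structure used in the experiments.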
This ratio is defined as:

r = R_I / R_P,    (1)

where R_I is the average bit-rate for coding I frames in a certain duration and R_P is the average bit-rate for coding P frames in the same duration. However, it can be seen from Figure 4 that there are obvious differences in the values of r at different bit-rates for each sequence.

Figure 6: Relationship between the estimated temporal complexity and the bit-rate

Figure 7: Relationship between v4 and the temporal complexity

Since the temporal complexity of a sequence should be characterized by a single value, r cannot be directly applied as the measure of temporal complexity. Consequently, a mathematical mapping is used to unify the values of r over the whole range of bit-rates for each sequence. Firstly, the natural logarithm is applied to the values of r to eliminate the enormous differences caused by their spanning different orders of magnitude. Then, adaptively adjusted factors related to the average bit-rate are introduced to bring the values produced by the first step for each sequence as close together as possible. Both steps are implemented without influencing the relative ranking of the curves of the different sequences. Finally, to give the temporal complexity a practical significance (a higher value corresponds to a sequence with a higher motion extent), the inverse is taken. The temporal complexity is thus formulated as:

σ_T = (a1 · ln(R) + b1) / (ln(R_I) − ln(R_P)),    (2)

where R is the average bit-rate, and a1 and b1 are constants obtained by experiment. As shown in Figure 6, the values of the estimated temporal complexity calculated by Formula (2) for each sequence are roughly consistent over all bit-rates.
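Formula (2) translates directly into code. The sketch below uses the values of a1 and b1 reported in Table 2 of the paper as defaults; bit-rates are in bits/pixel, as in the figures:

```python
import math

def temporal_complexity(R, R_I, R_P, a1=-0.334, b1=1.137):
    """Estimate the temporal complexity sigma_T of Formula (2):

        sigma_T = (a1 * ln(R) + b1) / (ln(R_I) - ln(R_P))

    R is the average bit-rate, and R_I and R_P are the average
    bit-rates of I and P frames (all in bits/pixel). The denominator
    is ln(r) with r = R_I / R_P, so a smaller I/P ratio (high-motion
    content) yields a larger sigma_T."""
    return (a1 * math.log(R) + b1) / (math.log(R_I) - math.log(R_P))
```

With the Table 2 constants, sequences plotted in Figure 6 fall roughly in the 0.6 to 1.6 range, low-motion clips like "Container" at the bottom and "Football" at the top.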
Moreover, different sequences have different average values, which reflect the motion characteristics of the video content. The estimated temporal complexity can therefore be employed in a content-adaptive packet-layer model.

4.3 Proposed Model for Quality Assessment

The average MOS over different video sequences increases with the bit-rate and saturates at the maximum MOS, which has been formulated in ITU-T Recommendation G.1070 as:

Vq = 1 + v3 · (1 − 1 / (1 + (R / v4)^v5)),    (3)

where Vq is the video quality, R is the bit-rate, and v3, v4 and v5 are empirical parameters. This model can estimate the average video quality over different contents at each bit-rate. However, video quality strongly depends on the video content. The best values of v3, v4 and v5 for the test video sequences are presented in Table 1. It can be seen from Figure 6 and Table 1 that a higher value of v4 corresponds to a video sequence with higher temporal complexity (e.g., the "Football" sequence). Conversely, a video sequence with low temporal complexity usually has a low value of v4 (e.g., the "Container" sequence). Figure 7 shows the relationship between v4 and the temporal complexity, which is well approximated by the linear model

v4 = a2 · σ_T + b2,    (4)

where a2 and b2 are obtained by experiment; v4 is thus not a constant but a variable that varies with σ_T.

Table 1: The values of v3, v4 and v5 for each video sequence

Video sequence   v3      v4      v5
Container        3.546   0.027   1.237
News             3.371   0.040   2.232
Coastguard       3.464   0.063   1.733
Tempete          3.456   0.072   1.687
Sign-Irene       3.546   0.091   1.884
Football         3.481   0.134   2.230

In addition, Table 1 shows that the differences between the values of v3, which is the maximum MOS of a sequence, are relatively small across sequences. Although there is a relatively large difference between the values of v5 for different sequences, it influences the value of Vq only slightly.
Therefore, v3 and v5 are set as constants for all video clips in the proposed model, and both are obtained by experiment. Consequently, substituting Formula (4) into Formula (3), the proposed model is established as:

Vq = 1 + v3 · (1 − 1 / (1 + (R / (a2 · σ_T + b2))^v5)).    (5)

Apart from the bit-rate, the temporal complexity, which reflects the motion characteristics of the video content, is considered in this model to make the evaluation more accurate.

5 Experimental Results

The video sequences chosen for the experiments covered a wide range of scenes from high-motion to low-motion events. Specifically, the standard video sequences "Carphone", "Coastguard", "Container", "Football", "Foreman", "Hall_Monitor", "Mother&Daughter", "News", "Paris", "Sign_Irene", "Silent", "Soccer" and "Tempete" were used for performance evaluation. All sequences were in the Common Intermediate Format (CIF) at 25 frames per second (fps), and encoded using the x264 coder [22] with an "IPPP" GOP (Group of Pictures) structure of size 30. For each sequence, the first 8 seconds were used for evaluation. Subjective scores were collected for comparison. The guidelines specified by the Video Quality Experts Group (VQEG) in [23] were followed for the subjective tests. Twenty-five non-expert viewers took part, using the Absolute Category Rating (ACR) with a 5-point scale to obtain the MOSs of the reconstructed sequences [24], [25]. The model parameters were obtained from the experiments, and their values are shown in Table 2. These parameters were fixed for all the experiments carried out. However, if the model is applied to videos generated by other codecs, they may need to be adjusted.

Table 2: Parameter values

v3      v5      a1      b1      a2      b2
3.477   1.834   -0.334  1.137   0.142   -0.065

The Pearson correlation coefficient (PCC) and the root-mean-squared error (RMSE) were used to evaluate the performance of the proposed model.
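Putting Formulas (4) and (5) together, the proposed model and the two performance measures can be sketched in a few lines of Python. The parameter defaults are the Table 2 values; the helper functions are illustrative, not the paper's code:

```python
def predict_quality(R, sigma_T, v3=3.477, v5=1.834, a2=0.142, b2=-0.065):
    """Predicted video quality Vq of Formula (5), with v4 = a2*sigma_T + b2
    from Formula (4). R is the coding bit-rate in bits/pixel."""
    v4 = a2 * sigma_T + b2
    return 1.0 + v3 * (1.0 - 1.0 / (1.0 + (R / v4) ** v5))

def pcc(xs, ys):
    """Pearson correlation coefficient between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def rmse(xs, ys):
    """Root-mean-squared error between predicted and subjective scores."""
    return (sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5
```

At a fixed bit-rate, a larger sigma_T raises v4 and therefore lowers Vq, reproducing the observation that high-motion content scores worse at the same bit-rate.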
Compared with the G.1070 model, the proposed model improves the PCC by about 0.024 and reduces the RMSE by about 0.082, as shown in Table 3. The scatter plots of the objective scores versus the subjective scores are shown in Figure 8, from which the same conclusion can be drawn: with the proposed model, the perceived coding distortion is measured more accurately.

Figure 8: Scatter plot of MOSs versus objective scores: (a) proposed model, (b) G.1070 model

Table 3: Performance comparison of the proposed model and the G.1070 model

Video quality assessment model   PCC      RMSE
Proposed model                   0.9582   0.3250
G.1070 model                     0.9338   0.4067

6 Conclusions

A packet-layer model based on characteristics of the video content is proposed in this paper to measure the perceived coding distortion of networked video. Without resorting to payload information, the temporal complexity is estimated using the ratio of the bit-rates for coding I frames and P frames to reflect the motion characteristics of the video content. Based on an analysis of the parameters of the original G.1070 model, this measure of temporal complexity is integrated into the proposed model. Extensive experimental results demonstrate that the proposed model outperforms the G.1070 model. Further work may include applying the proposed model in practice by considering both coding distortion and packet loss.
Acknowledgement

This work was supported by the National Science Foundation of China (60902081, 60902052), the Fundamental Research Funds for the Central Universities (72004885), the International Science and Technology Cooperation Program of China (2010DFB10570), and the 111 Project (B08038).

Bibliography

[1] C. Grava, A. Gacsádi, I. Buciu, "A homogeneous algorithm for motion estimation and compensation by using cellular neural networks", International Journal of Computers Communications & Control, ISSN 1841-9836, Vol. 5, No. 5, pp. 719-726, 2010.

[2] H. R. Wu, K. R. Rao, Eds., Digital Video Image Quality and Perceptual Coding, CRC Press, 2005.

[3] A. Marchand, M. Chetto, "Quality of service scheduling in real-time systems", International Journal of Computers Communications & Control, ISSN 1841-9836, Vol. 3, No. 4, pp. 353-365, 2008.

[4] ITU-T Recommendation J.148, "Requirements for an objective perceptual multimedia quality model", 2003.

[5] O. Verscheure, X. Garcia, "User-oriented QoS in packet video delivery", IEEE Network, pp. 12-21, Nov. 1998.

[6] A. Clark, "Modeling the effects of burst packet loss and recency on subjective voice quality", IP Telephony Workshop, 2001.

[7] K. Yamagishi, T. Hayashi, "Analysis of psychological factors for quality assessment of interactive multimodal service", Electronic Imaging 2005, pp. 130-138, Jan. 2005.

[8] K. Yamagishi, T. Hayashi, "Opinion model using psychological factors for interactive multimodal services", IEICE Trans. Commun., Vol. E89-B, No. 2, pp. 281-288, Feb. 2006.

[9] A. Takahashi, A. Kurashima, H. Yoshino, "Objective assessment methodology for estimating conversational quality in VoIP", IEEE Trans. on Audio, Speech, and Language Processing, Nov. 2006.

[10] RFC 768, "User Datagram Protocol", 1980.

[11] RFC 3550, "RTP: A Transport Protocol for Real-Time Applications", 2003.

[12] A. Raake, M. Garcia, J. Berger, F. Kling, P. List, J. Johann, C. Heidemann, "T-V-Model: parameter-based prediction of IPTV quality", Proc.
International Conference on Acoustics, Speech, and Signal Processing, pp. 1149-1152, Mar. 2008.

[13] M. N. Garcia, A. Raake, "Parametric packet-layer video quality model for IPTV", Proc. Information Sciences, Signal Processing and their Applications, Kuala Lumpur, Malaysia, May 2010.

[14] K. Yamagishi, T. Hayashi, "Parametric packet-layer model for monitoring video quality of IPTV services", Proc. IEEE International Conference on Communications, Beijing, China, May 2008.

[15] ITU-T Recommendation G.1070, "Opinion model for video-telephony applications", Apr. 2007.

[16] J. Joskowicz, J. C. Lopez-Ardao, "Enhancements to the opinion model for video-telephony applications", Proc. International Latin American Networking Conference, Pelotas, Brazil, Sep. 2009.

[17] J. Joskowicz, J. C. Lopez-Ardao, M. A. G. Ortega, C. L. Garcia, "A mathematical model for evaluating the perceptual quality of video", Proc. International Workshop on Future Multimedia Networking, Coimbra, Portugal, June 2009.

[18] H. Koumaras, A. Kourtis, D. Martakos, J. Lauterjung, "Quantified PQoS assessment based on fast estimation of the spatial and temporal activity level", Multimedia Tools and Applications, Vol. 34, No. 3, Sep. 2007.

[19] J. Joskowicz, J. C. Lopez-Ardao, "A general parametric model for perceptual video quality estimation", Proc. Communications Quality and Reliability, Vancouver, BC, June 2010.

[20] M. N. Garcia, A. Raake, P. List, "Towards content-related features for parametric video quality prediction of IPTV services", Proc. Acoustics, Speech and Signal Processing, Las Vegas, USA, pp. 757-760, April 2008.

[21] N. Liao, Z. Chen, "A packet-layer video quality assessment model based on spatiotemporal complexity estimation", Proc. Visual Communications and Image Processing, Huangshan, China, July 2010.

[22] VideoLAN, x264 codec, http://www.videolan.org/x264.html.
[23] VQEG, "Hybrid perceptual/bitstream group TEST PLAN 1.1", http://www.its.bldrdoc.gov/vqeg/, Sep. 2007.

[24] ITU-T Recommendation P.910, "Subjective video quality assessment methods for multimedia applications", April 2008.

[25] ITU-R Recommendation BT.500-11, "Methodology for the subjective assessment of the quality of television pictures", 2002.