2010 IEEE International Workshop Technical Committee on Communications Quality and Reliability (CQR) 1
Abstract
In this paper, a general parametric model is proposed that estimates the perceived quality of video coded with different codecs, at any bit rate and display format. The proposed model takes video content into account, using an objective estimation of the spatial-temporal activity based on the average SAD (Sum of Absolute Differences) for the clip. More than 2000 processed video clips were studied, coded in MPEG-2 and H.264/AVC, at bit rates from 25 kb/s to 12 Mb/s, in SD, VGA, CIF and QCIF display formats. The results show that the proposed model fits the perceived video quality very well, in any combination of codec, bit rate and display format.
Index Terms
Video perceptual quality, video codecs, video signal processing, VoIP network design
I. INTRODUCTION
Video and multimedia applications are growing fast. In the mass market, different providers are offering video and multimedia applications to end users, including cable television operators, Internet service providers, and traditional and emerging telephony carriers, among others. In the corporate market, telephony applications are well established, and different video applications are emerging (videophones, video conferencing, etc.). In these challenging scenarios, it is critical to guarantee an appropriate QoE (Quality of Experience) for the end user, according to the application to be developed. QoE can be defined as the overall performance of a system from the user's perspective. Many factors can affect the QoE, depending on the application and on user expectations. Video quality is one of the most important aspects to consider in the user QoE. Digital video coding and distribution introduce new artifacts, which affect the perceived video quality and the final QoE.

Different evaluations and standardization efforts have been made, and are currently ongoing, in order to derive objective models and algorithms that predict the perceived video quality in different scenarios.

Picture metrics, or media-layer models, are based on the analysis of the video content. These metrics can be classified into FR (Full Reference), RR (Reduced Reference) and NR (No Reference) models. In FR models, the original and the degraded video sequences are directly compared. In RR models, some reduced information about the original video is needed, and is used along with the degraded video in order to estimate the perceived video quality. NR models rely only on the degraded video in order to estimate the perceived video quality.

Data metrics, or packet-layer models, are based on network information (i.e., IP packets). These metrics can be classified into packet-header models, bitstream-layer models and hybrid models. Packet-header models use only general information about the network (e.g., packet loss rates), and do not take packet contents into account. Bitstream-layer models can access the payload of the IP packets and extract some media-related information. Hybrid models use a combination of the other methods.

Parametric models predict the perceived video quality based on a reduced set of parameters, related to the encoding process, video content and/or network information. These models typically present a mathematical formula that expresses the estimation of the perceived video quality as a function of different parameters. Parametric models can be applied to packet-layer models, media-layer models or a combination of both.

One of the fundamental factors affecting the perceived video quality is the degradation introduced by the encoding process. Different parametric models have been proposed in order to predict the perceived video quality based on some encoding parameters. However, most of them are applied to specific applications, display formats or codecs, and are not valid (or were not tested) in other environments. In this work, we compare different parametric models for predicting the perceived video quality due to coding degradations and, based on the results, we propose a general parametric model applicable to a wide range of applications, display formats, codecs and video content.
A General Parametric Model for Perceptual Video Quality Estimation

Jose Joskowicz, Member, IEEE
Facultad de Ingeniería, Universidad de la República, Montevideo, URUGUAY
Phone: +598 2 7110974
josej@fing.edu.uy

J. Carlos López Ardao, Member, IEEE
ETSE Telecomunicación, Campus Universitario, 36310 Vigo, SPAIN
Phone: +34 986 8212176
jardao@det.uvigo.es
The paper is organized as follows: Section 2 summarizes different video quality metrics. Section 3 describes how the perceived video quality varies as a function of the bit rate. Section 4 presents different existing parametric video quality estimation models. In Section 5, the proposed perceptual video quality estimation model is presented, and its parameters are calculated for the MPEG-2 and H.264/AVC codecs, in SD, VGA, CIF and QCIF display formats. Section 6 summarizes the results and main contributions.

II. VIDEO QUALITY METRICS
The most reliable way to measure the perceived quality of a video clip is through subjective tests, in which video sequences are typically presented to different viewers and their opinions are averaged. The MOS (Mean Opinion Score) and the DMOS (Difference Mean Opinion Score) are the metrics typically used in these tests. Different kinds of subjective tests can be performed, based on Recommendations ITU-R BT.500-11 [1] and ITU-T P.910 [2].

Subjective tests are difficult to implement and take considerable time. For these reasons, different objective video quality metrics have been used for video quality evaluation and estimation. Historically, the PSNR (Peak Signal-to-Noise Ratio) picture metric has been used for the evaluation of video quality models. It is now accepted that such quality measures do not match the quality "perceived" by human viewers [3].

Based on the work of VQEG (Video Quality Experts Group), the ITU has standardized Recommendations ITU-T J.144 [4] and ITU-R BT.1683 [5] for estimating the perceived video quality in digital TV applications when the original signal reference is available (FR models). VQEG is working on the evaluation of models for estimating the perceived video quality in multimedia [6] and HDTV (High Definition TV) [7] applications, and is also working on the evaluation of bitstream and hybrid models [8].

The models proposed in Recommendation ITU-T J.144 perform quality comparisons between the degraded and the original video signals, and can be classified as FR models. For each video clip pair (original and degraded), the algorithms provide a VQM (Video Quality Metric) with values between 0 and 1 (0 when there are no perceived differences and 1 for maximum degradation). Multiplying this value by 100 yields a metric that corresponds to the DSCQS (Double Stimulus Continuous Quality Scale) [1] and can be directly related to the DMOS. The statistical error between the average subjective DMOS and the DMOS predicted using the Recommendation ITU-T J.144 models can be estimated as +/- 0.1 on the 0-1 scale [9][10].

Instead of performing subjective tests, in this work we used the model proposed by NTIA and standardized in Recommendation ITU-T J.144, available at [11]. The DMOS values returned by the NTIA model can be related to the MOS using Equation (1). The interpretation of the MOS values is presented in Table 1. MOS errors, using this model, can be estimated as +/- 0.4 on the 1-5 scale (4 times the DMOS error).
$$ \mathrm{MOS} = 5 - 4 \cdot \mathrm{DMOS} \qquad (1) $$
TABLE I. MOS TO PERCEIVED QUALITY RELATION

Quality   Bad   Poor   Fair   Good   Excellent
MOS       1     2      3      4      5
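The mapping of Equation (1) and the labels of Table I can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and choosing the label nearest to a fractional MOS is an assumption made here, since Table I only defines the integer points of the scale.

```python
# Quality labels from Table I, indexed by integer MOS value.
QUALITY_LABELS = {1: "Bad", 2: "Poor", 3: "Fair", 4: "Good", 5: "Excellent"}


def dmos_to_mos(dmos: float) -> float:
    """Map a DMOS value on the 0-1 scale to the 1-5 MOS scale, Equation (1)."""
    if not 0.0 <= dmos <= 1.0:
        raise ValueError("DMOS is expected in the range [0, 1]")
    return 5.0 - 4.0 * dmos


def nearest_quality_label(mos: float) -> str:
    """Return the Table I label closest to a MOS value (nearest-label rule is an assumption)."""
    nearest = min(QUALITY_LABELS, key=lambda level: abs(level - mos))
    return QUALITY_LABELS[nearest]


if __name__ == "__main__":
    for dmos in (0.0, 0.25, 0.5, 1.0):
        mos = dmos_to_mos(dmos)
        print(f"DMOS={dmos:.2f} -> MOS={mos:.2f} ({nearest_quality_label(mos)})")
```

Because the mapping is linear with slope 4, a DMOS error of +/- 0.1 translates into a MOS error of +/- 0.4, which is the margin quoted in the text.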
III. PERCEIVED VIDEO QUALITY AS A FUNCTION OF THE BIT RATE
Sixteen video clips, available on the VQEG web page [12], were used. These video clips span a wide range of contents, including sports, landscapes, “head and shoulders”, etc. The original and the coded video clips were converted to uncompressed AVI format in order to be compared using the NTIA model.

Figure 1 shows the relation between MOS and bit rate for all the clips used, coded in MPEG-2 (using the coding parameters detailed in Table 2), in SD display format. MOS values were derived from DMOS using Equation (1). DMOS values were calculated using the NTIA model.
TABLE II. MPEG-2 AND H.264 CODING PARAMETERS

MPEG-2
  Profile/Level: MP@ML
  Max GOP size: 15
  GOP Structure: Automatic
  Picture Structure: Always Frame
  Intra DC Precision: 9
  Bit rate type: Constant Bit Rate
  Interlacing: Non-Interlaced
  Frame Rate: 25 fps

H.264
  Profile/Level: High/3.2
  Max GOP size: 33
  Number of B Pictures between I and P: 2
  Entropy Coding: CABAC
  Motion Estimated Subpixel mode: Quarter Pixel
  Bit rate type: Constant Bit Rate
  Interlacing: Non-Interlaced
  Frame Rate: 25 fps
[Figure: perceived quality for all clips, SD, MPEG-2. Horizontal axis: bit rate (Mb/s), 0 to 12. Vertical axis: MOS, 1 to 5. One curve per clip: src2, src3, src4, src5, src7, src9, src10, src13, src14, src16, src17, src18, src19, src20, src21, src22.]
Fig. 1. Perceived Quality as a function of the Bit Rate
As can be seen, all the clips have better perceived quality at higher bit rates, as expected. In MPEG-2, in SD, for bit rates higher than 6 Mb/s all the clips have an almost “perfect” perceived quality (MOS higher than 4.5). At 3 Mb/s all the clips are in the range between “Good” and “Excellent”. However, below 3 Mb/s the perceived quality strongly depends on the clip content. For example, at 2 Mb/s MOS varies between 3.6 and 4.8, and at 0.9 Mb/s MOS varies between 1.9 (between “Bad” and “Poor”) and 4.2 (between “Good” and “Excellent”). It is common to use MPEG-2 at 3.8 Mb/s in commercial SD IPTV applications, where the perceptual quality is near “Excellent” for all video clips. However, at low bit rates there are large differences in the perceived quality for identical coding conditions, depending on video content. Similar considerations can be made for other codecs (e.g., H.264/AVC) and display formats (e.g., VGA, CIF and QCIF).

IV. EXISTING PARAMETRIC VIDEO QUALITY MODELS
The curves in Figure 1 represent the variation of the perceived quality as a function of the bit rate, for clips coded in MPEG-2, due to coding degradations only. Similar curves are obtained with H.264/AVC. These curves can be modeled by different types of relations between the bit rate and the MOS. By definition, MOS values lie between 1 and 5. We will define $I_c$ as the video quality determined by the encoding parameters, and $V_q$ as the estimation of the MOS, with the relation presented in Equation (2). $I_c$ varies between 0 and 4, and $V_q$ between 1 and 5.

$$ V_q = 1 + I_c \qquad (2) $$

Different functions for $I_c$ have been published, each one applicable to some specific conditions (e.g., display formats, codecs, applications), as described in the following paragraphs.

In [13] an exponential model is presented for IPTV applications coded in MPEG-2 and H.264, in SD and HD display formats. The model proposed in that paper, called the “TV Model”, is presented in Equation (3), using the parameter names provided in the referred paper.
$$ I_c = a_3 - a_1 e^{-a_2 b} \qquad (3) $$

where $I_c$ represents the video quality determined by the codec distortion, $b$ is the bit rate, and $a_1$, $a_2$ and $a_3$ are the three model parameters. The influence of video content on the “TV Model” is qualitatively described in [14], which shows that some spatial and temporal features must be taken into account in order to predict the perceived quality, but no specific quantitative model is proposed in that paper.

In [15] the exponential model shown in Equation (4) is presented for videos coded in MPEG-4 in CIF and QCIF display formats, for multimedia applications that distribute audiovisual content over 3G/4G (3rd/4th generation) networks.
$$ I_c = (PQ_H - PQ_L)\left[1 - e^{-\alpha (b - BR_L)}\right] + PQ_L - 1 \qquad (4) $$

where $I_c$ represents the video quality determined by the codec distortion, $b$ is the bit rate, and $PQ_H$, $PQ_L$, $BR_L$ and $\alpha$ are the four model parameters. In that work, video quality was evaluated using the MPQoS (Mean Perceived Quality of Service) VQM, a metric based on a Picture Quality Measurement [16]. The model parameters depend on the spatial and temporal activity level of the video clip.

It can be seen that the “TV Model” represented in Equation (3) and the exponential model represented in Equation (4) are the same, with the parameter relations detailed in Equations (5). These two equivalent models will be called the “Exponential Model” for reference in the present paper.
$$ a_1 = (PQ_H - PQ_L)\, e^{\alpha BR_L}, \qquad a_2 = \alpha, \qquad a_3 = PQ_H - 1 \qquad (5) $$
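The equivalence can be checked by expanding Equation (4) and grouping the constant terms and the exponential term. The short derivation below is a sketch that uses only the symbols already defined for Equations (3) to (5):

```latex
\begin{aligned}
I_c &= (PQ_H - PQ_L)\left[1 - e^{-\alpha (b - BR_L)}\right] + PQ_L - 1 \\
    &= (PQ_H - 1) - (PQ_H - PQ_L)\, e^{\alpha BR_L}\, e^{-\alpha b} \\
    &= a_3 - a_1 e^{-a_2 b}
\end{aligned}
```

Matching the last line term by term against Equation (3) gives the three parameter relations of Equations (5).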
In [17], the relation between the bit rate and the DMOS is modeled with Equation (6), for MPEG-2 and H.264, in different display formats.
$$ \mathrm{DMOS} = \frac{m}{(k \cdot a \cdot b)^{n}} \qquad (6) $$

where $b$ is the bit rate, $a$ is a constant that depends on the display format (SD, VGA, CIF or QCIF), $k$ is related to the codec ($k = 1$ for MPEG-2, and $k$ is a function of $a \cdot b$ for H.264), and $m$ and $n$ are the other model parameters. In that paper, it is shown that the model parameters $m$ and $n$ are related to the subjective movement content of the clip. Video clips are classified into three classes according to their subjective movement content: High, Medium and Low. Using the relation between MOS and DMOS from Equation (1), $I_c$ can be defined as described in Equation (7). We will call this the “mn Model” for reference in this paper.
$$ I_c = 4\left(1 - \frac{m}{(k \cdot a \cdot b)^{n}}\right) \qquad (7) $$

Another model can be found in Recommendation ITU-T G.1070 [18], which describes a computational model for point-to-point interactive videophone applications over IP networks. This Recommendation is based on the work performed by NTT (Nippon Telegraph and Telephone) Service Integration Laboratories [19][20]. The model has 3 parameters, which depend on codec type, video display format, key frame interval, and video display size. Provisional values are provided only for MPEG-4 in small display formats (QVGA and QQVGA). In [21], the same model presented in Recommendation ITU-T G.1070 is proposed for video quality estimation in IPTV services, in HD (High Definition, 1440 x 1080 pixels) display format. In this case, parameter values are provided for the H.264 codec.

Enhancements to Recommendation G.1070 were proposed in [22], with the model presented in Equation (8).
$$ I_c = 4\left(1 - \frac{1}{1 + \left(\frac{k \cdot a \cdot b}{v_4}\right)^{v_5}}\right) \qquad (8) $$

where $I_c$ represents the video quality determined by the codec distortion, $b$ is the bit rate, $a$ is a constant that depends on the display format, and $v_4$ and $v_5$ are the other model parameters. The coefficient $k$ is related to the codec, with $k = 1$ for MPEG-2, while $k$ is a function of $a \cdot b$ for H.264. We will call this the “Enhanced G.1070 Model” for reference in the present paper. In [22] it is shown that the relation between $I_c$ and the bit rate depends not only on the codec used and the display format, but also strongly on video content, especially at low bit rates. In that paper, using ideas similar to those presented for the “mn Model” in [17], video clips are classified into three classes according to their subjective movement content, and the model parameters $v_4$ and $v_5$ are calculated for each class (High, Medium and Low movement content). The parameters $a$ and $k$ do not depend on movement content according to the mentioned paper.

V. PROPOSED PERCEPTUAL VIDEO QUALITY MODEL
As shown in Section 4, different models for the relation between the perceived video quality and the bit rate have been proposed, each one applicable to some specific codec, application or display format. Nevertheless, the same or equivalent models have been presented in different works for different multimedia applications. This is the case of the “Exponential Model”, applied to MPEG-2 and H.264 with high definition and large screens (IPTV applications in SD and HD display formats) in [13], and to MPEG-4 with low definition and small screens (3G/4G applications in CIF and QCIF display formats) in [15]. This is also the case for the “G.1070 Model”, applied to video telephony applications with small display formats (MPEG-4, QVGA and QQVGA) in [18] and to IPTV (H.264, HD) in [21]. On the other hand, the “mn Model” and the “Enhanced G.1070 Model” are applicable to different kinds of applications, from small to large display formats, and to different codecs. In these two models, and also in the “Exponential Model”, it is shown that some of the model parameters depend on video content, but none of them presents a direct way to objectively derive the model parameters from the video content. In the rest of this section, a comparison between the different models is presented and, for the selected model, a direct relation between the model parameters and video content is proposed.

The best parameter values for the “Exponential Model”, the “Enhanced G.1070 Model” and the “mn Model”, as well as the RMSE (Root Mean Square Error), were calculated for each curve in Figure 1. The maximum RMSE values were found for the video clip “Rugby” (src9), which has very high movement content, and are presented in Table 3, with the corresponding parameter values for each model. Figure 2 shows, for this clip, the perceived video quality derived from one of the ITU-T standardized models (NTIA) and the quality estimated using the “Exponential Model”, the “Enhanced G.1070 Model” and the “mn Model”, with the values presented in Table 3.

The three models fit the actual values well, but the “Enhanced G.1070 Model” has some advantages over the other models. First, the “Enhanced G.1070 Model” has lower RMSE values than the other two models for all the clips. Second, at low bit rates, when $b \to 0$, in the “Exponential Model” $V_q \to (1 + a_3 - a_1)$, in the “mn Model” $V_q \to -\infty$, while in the “Enhanced G.1070 Model” $V_q \to 1$ (for any parameter values). The minimum MOS value is, by definition, 1, as derived from the “Enhanced G.1070 Model” for any parameter values. For these reasons, we have selected a model based on the “Enhanced G.1070 Model” in this work, in order to derive the model parameters from video content. The proposed model is presented in Equation (9), keeping the parameter names provided in the “Enhanced G.1070 Model”.
$$ V_q = 1 + 4\left(1 - \frac{1}{1 + \left(\frac{a \cdot b}{v_4}\right)^{v_5}}\right) \qquad (9) $$
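As a worked example of Equation (9), the following evaluates the model at $b = 2$ Mb/s with the “Enhanced G.1070 Model” parameters listed in Table 3 for the clip “Rugby” ($v_4 = 1.24$, $v_5 = 1.6$, $a = 1$). It assumes that $b$ is expressed in Mb/s, as on the axis of Figure 1, and the intermediate values are rounded.

```latex
\left(\frac{a \cdot b}{v_4}\right)^{v_5} = \left(\frac{2}{1.24}\right)^{1.6} \approx 2.15,
\qquad
V_q \approx 1 + 4\left(1 - \frac{1}{1 + 2.15}\right) \approx 3.73
```

The result lies within the 3.6 to 4.8 MOS range reported in Section 3 for SD MPEG-2 clips at 2 Mb/s.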
TABLE III. MODELS COMPARISON

Model               Parameters                                      RMSE
Enhanced G.1070     $v_4$ = 1.24, $v_5$ = 1.6, $a$ = 1, $k$ = 1     0.65
Exponential Model   $a_1$ = 4.50, $a_2$ = 0.77, $a_3$ = 3.75        0.79
mn Model            $m$ = 0.56, $n$ = 0.99, $a$ = 1, $k$ = 1        0.77
[Figure: Src 9 Rugby (SD, MPEG-2). Horizontal axis: bit rate (Mb/s), 0 to 12. Vertical axis: MOS, 1 to 5. Series: perceptual values, Exponential Model, Enhanced G.1070 Model, mn Model.]
Fig. 2. Perceived quality for the clip Rugby (src 9), coded in SD in MPEG-2, and values derived from different models
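The low bit rate behavior of the three models can be reproduced numerically with the Table 3 parameters. The sketch below is illustrative only: the function names are hypothetical, $b$ is assumed to be in Mb/s, and each function combines Equation (2) with Equation (3), (7) or (8) respectively.

```python
import math


def vq_exponential(b, a1, a2, a3):
    """Equations (2) and (3): V_q = 1 + a3 - a1 * exp(-a2 * b)."""
    return 1.0 + a3 - a1 * math.exp(-a2 * b)


def vq_mn(b, m, n, a=1.0, k=1.0):
    """Equations (2) and (7): V_q = 1 + 4 * (1 - m / (k*a*b)**n), for b > 0."""
    return 1.0 + 4.0 * (1.0 - m / (k * a * b) ** n)


def vq_enhanced_g1070(b, v4, v5, a=1.0, k=1.0):
    """Equations (2) and (8): V_q = 1 + 4 * (1 - 1 / (1 + (k*a*b/v4)**v5))."""
    return 1.0 + 4.0 * (1.0 - 1.0 / (1.0 + (k * a * b / v4) ** v5))


if __name__ == "__main__":
    # Parameter values from Table 3 (clip "Rugby", SD, MPEG-2).
    for b in (0.01, 0.5, 2.0, 6.0, 12.0):
        print(
            f"b={b:5.2f} Mb/s  "
            f"exponential={vq_exponential(b, 4.50, 0.77, 3.75):8.2f}  "
            f"mn={vq_mn(b, 0.56, 0.99):8.2f}  "
            f"enhanced={vq_enhanced_g1070(b, 1.24, 1.6):5.2f}"
        )
```

With these parameters the exponential model approaches 1 + 3.75 - 4.50 = 0.25 as the bit rate goes to zero, and the mn model becomes strongly negative, both below the minimum MOS of 1. The enhanced G.1070 form stays at or above 1.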
As has been shown in [14] and [15] (“Exponential Model”), [17] (“mn Model”) and [22] (“Enhanced G.1070 Model”), there is a strong relation between MOS and video content, for a given codec, display format and bit rate. This can be confirmed by looking at Figure 1. In MPEG-2, in SD display format, below 3 Mb/s there are large variations in MOS values for the same bit rate. For example, at 2 Mb/s MOS varies between 3.6 and 4.8, and at 0.9 Mb/s MOS varies between 1.9 (between “Bad” and “Poor”) and 4.2 (between “Good” and “Excellent”). Similar behaviors can be found for different codecs and for different display formats. For this reason, video content must be taken into account in order to estimate the perceived MOS for a given codec, display format, bit rate and frame rate.

In [14], for the “Exponential Model”, only a qualitative analysis is presented regarding the relation between video content and perceived quality. In [15] (“Exponential Model”), it is shown that the model parameters can be derived from only one parameter, related to the spatial-temporal activity of the clip. However, that paper did not use a standardized video quality metric, and did not present how to derive this parameter directly from the video clip. An indirect method is presented, based on the quality evaluation of the clip coded at a high bit rate.

In [17] (“mn Model”) and [22] (“Enhanced G.1070 Model”), video clips are classified according to their subjective movement content into three classes (High, Medium and Low movement content), but it is not described how to derive the movement content of a video clip from objective parameters.

Using the “Enhanced G.1070 Model”, the values of $a$, $v_4$ and $v_5$ that best fit Equation (9) to the perceived quality were calculated for all the clips coded in MPEG-2 and H.264/AVC, in SD, VGA, CIF and QCIF display formats. Different estimations of the video spatial-temporal activity were evaluated, and a strong correlation was found between the $v_4$ and $v_5$ parameters and the average SAD (Sum of Absolute Differences). SAD is a simple video metric used for block comparison and for motion vector calculations. Each frame is divided into small blocks (e.g., 8x8 pixels), and for every block in one frame the most similar (minimum SAD) block in the next frame is found. This minimum sum of absolute differences is assigned as the SAD for each block in each frame (up to frame n-1). Then all the SAD values are averaged for each frame and for all the frames in the clip, and divided by the block area for normalization. This value (average SAD/pixel) provides an overall estimation of the spatial-temporal activity of the entire video clip.

The relations between the $v_4$ and $v_5$ parameters and the average SAD per pixel are depicted in Figure 3 and Figure 4. In Figure 3, the subjective movement content is shown graphically, confirming that low values of SAD/pixel are related to low spatial-temporal activity and high values are related to high spatial-temporal activity or movement content. An estimation of $v_4$ and $v_5$ for MPEG-2 and H.264, as a function of the average SAD/pixel, can be performed as
$$ v_4 = c_1 s^{c_2} + c_3, \qquad v_5 = c_4 s^{c_5} + c_6 \qquad (10) $$

where $s$ is the average SAD/pixel of the video. The best values for $c_1$..$c_6$ and for $a$ are presented in Table 4 and Table 5. With these relations, and for a given codec and display format, the video quality estimation presented in Equation (9) depends only on the encoded bit rate and the spatial-temporal activity of the video clip, measured as the average SAD/pixel, calculated for each frame in 8x8 blocks and averaged over all the frames in the clip (for clip durations on the order of 10 seconds, as the ones used in this work).
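The average SAD/pixel computation described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: frames are assumed to be lists of rows of luma values, the function names are hypothetical, and the search window of +/- 2 pixels around each block is an assumption, since the text does not state the search range.

```python
def block_sad(cur, nxt, y0, x0, y1, x1, size):
    """Sum of absolute differences between one block of `cur` and one block of `nxt`."""
    return sum(
        abs(cur[y0 + dy][x0 + dx] - nxt[y1 + dy][x1 + dx])
        for dy in range(size)
        for dx in range(size)
    )


def frame_pair_sad_per_pixel(cur, nxt, size=8, search=2):
    """Average of the minimum SAD per block of `cur`, matched in `nxt`, per pixel."""
    height, width = len(cur), len(cur[0])
    if height < size or width < size:
        raise ValueError("frame is smaller than one block")
    block_minima = []
    for y0 in range(0, height - size + 1, size):
        for x0 in range(0, width - size + 1, size):
            best = None
            # Candidate positions in the next frame (assumed search window).
            for y1 in range(max(0, y0 - search), min(height - size, y0 + search) + 1):
                for x1 in range(max(0, x0 - search), min(width - size, x0 + search) + 1):
                    sad = block_sad(cur, nxt, y0, x0, y1, x1, size)
                    if best is None or sad < best:
                        best = sad
            block_minima.append(best)
    # Normalize by the number of blocks and by the block area.
    return sum(block_minima) / (len(block_minima) * size * size)


def clip_sad_per_pixel(frames, size=8, search=2):
    """Average SAD/pixel over consecutive frame pairs (frames 1 to n-1) of a clip."""
    pairs = list(zip(frames[:-1], frames[1:]))
    if not pairs:
        raise ValueError("at least two frames are required")
    return sum(frame_pair_sad_per_pixel(c, n, size, search) for c, n in pairs) / len(pairs)


if __name__ == "__main__":
    flat = [[10] * 16 for _ in range(16)]
    brighter = [[13] * 16 for _ in range(16)]
    print(clip_sad_per_pixel([flat, flat]))      # static content
    print(clip_sad_per_pixel([flat, brighter]))  # uniform change of 3 levels per pixel
```

A wider search window lowers the measured SAD for clips with large motion, so any value of $s$ obtained this way is only comparable with the coefficients of Table 4 if the search settings match those used to derive them.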
TABLE IV. $c_i$ VALUES FOR EACH CODEC

Codec       $c_1$   $c_2$   $c_3$   $c_4$   $c_5$   $c_6$
MPEG-2      0.208   0.95    0.036   0.036   1.52    1.17
H.264/AVC   0.150   0.95    0       0.030   0.68    1.20

TABLE V. $a$ VALUES FOR EACH DISPLAY FORMAT

Display Format   SD   VGA   CIF   QCIF
$a$              1    1.4   3.2   10.8
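Putting Equations (9) and (10) together with the values of Table IV and Table V gives a complete estimator. The following Python sketch is an illustration under stated assumptions: the function and dictionary names are hypothetical, the bit rate is assumed to be in Mb/s, and the SAD/pixel values in the usage lines are made-up inputs rather than measurements from this work.

```python
# Table IV: coefficients c1..c6 of Equation (10), per codec.
C_COEFFS = {
    "MPEG-2": (0.208, 0.95, 0.036, 0.036, 1.52, 1.17),
    "H.264/AVC": (0.150, 0.95, 0.0, 0.030, 0.68, 1.20),
}

# Table V: display-format constant a of Equation (9).
A_DISPLAY = {"SD": 1.0, "VGA": 1.4, "CIF": 3.2, "QCIF": 10.8}


def v4_v5_from_sad(s, codec):
    """Equation (10): derive v4 and v5 from the average SAD/pixel s (s > 0)."""
    c1, c2, c3, c4, c5, c6 = C_COEFFS[codec]
    return c1 * s ** c2 + c3, c4 * s ** c5 + c6


def estimate_mos(bit_rate_mbps, sad_per_pixel, codec, display_format):
    """Equation (9): estimated MOS for a bit rate (Mb/s, assumed unit) and SAD/pixel."""
    a = A_DISPLAY[display_format]
    v4, v5 = v4_v5_from_sad(sad_per_pixel, codec)
    return 1.0 + 4.0 * (1.0 - 1.0 / (1.0 + (a * bit_rate_mbps / v4) ** v5))


if __name__ == "__main__":
    # Illustrative inputs only: a low-activity clip (s = 2) and a high-activity clip (s = 8).
    for s in (2.0, 8.0):
        for codec in C_COEFFS:
            mos = estimate_mos(2.0, s, codec, "SD")
            print(f"s={s:.1f} codec={codec:9s} SD @ 2 Mb/s -> estimated MOS {mos:.2f}")
```

Keeping the tables as plain dictionaries makes it straightforward to add another codec or display format once its coefficients have been fitted.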
The dispersion between the MOS values derived using Equations (9) and (10) and the perceived MOS values (using the NTIA VQM standardized in Recommendation ITU-T J.144), for the sixteen video clips used, coded in MPEG-2 and H.264, in SD, VGA, CIF and QCIF display formats, at bit rates from 25 kb/s to 12 Mb/s, is plotted in Figure 5. In this figure, each point represents a video clip coded with a specific combination of codec, bit rate and display format. It is worth noting that subjective rating scales have steps of 1 unit on the 1-5 MOS scale. On the other hand, the NTIA algorithm standardized by the ITU has errors on the order of +/- 0.4 with respect to MOS measures of subjective quality. In Figure 5, the dotted lines represent the estimated +/- 0.4 error margin of the NTIA model. Only 27 of the 2064 points are outside the dotted lines, meaning that the predicted MOS values (using the proposed model) have the same degree of precision as the video quality metric used.
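The margin check described above amounts to counting the points whose predicted and reference MOS differ by more than 0.4. A small sketch follows; the function name and the sample values are hypothetical and are not data from this work.

```python
def fraction_outside_margin(predicted, reference, margin=0.4):
    """Share of points whose absolute MOS difference exceeds the error margin."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    outside = sum(1 for p, r in zip(predicted, reference) if abs(p - r) > margin)
    return outside / len(predicted)


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    print(fraction_outside_margin([3.0, 4.0, 2.0, 4.5], [3.2, 4.5, 2.1, 4.4]))
    # Share corresponding to the 27 of 2064 points reported in the text.
    print(f"{27 / 2064:.1%}")
```

The reported 27 of 2064 points correspond to about 1.3% of the evaluated combinations.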
[Figure: $v_4$ vs. SAD, MPEG-2 and H.264/AVC. Horizontal axis: average SAD/pixel (8x8), 0 to 10. Vertical axis: $v_4$, 0 to 1.8. Series: H.264 and MPEG-2, each for Low, Medium and High movement content.]
Fig. 3. Relation between $v_4$ and SAD

[Figure: $v_5$ vs. SAD, MPEG-2 and H.264/AVC. Horizontal axis: average SAD/pixel (8x8), 0 to 10. Vertical axis: $v_5$, 1 to 2.2. Series: MPEG-2, H.264/AVC.]
Fig. 4. Relation between $v_5$ and SAD