
From: seaman (翩翩少年), Board: Graphics
Subject: Re: MPEG compression format
Posted at: 哈工大紫丁香 (HIT Lilac BBS) (Sat Jul  1 13:03:37 2000), local mail

From: seaman (翩翩少年), Board: Graphics
Subject: Re: MPEG compression format
Posted at: 哈工大紫丁香 (HIT Lilac BBS) (Mon Sep 13 09:18:11 1999), forwarded

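==> ace (懂懂) wrote:
>     Experts, I would like to learn about the MPEG compression format and
>     roughly how VCD works. Would anyone be willing to explain?
>                                      Respectfully, 懂懂, who knows nothing about graphics
Take a look at the following article:

Q: What is MPEG, exactly?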

A: MPEG is the "Moving Picture Experts Group", working under the
   joint direction of the International Organization for
   Standardization (ISO) and the International Electrotechnical
   Commission (IEC). This group works on standards for the coding of
   moving pictures and associated audio.

Q: What is the status of MPEG's work, then? What about MPEG-1, -2,
   and so on?

A: MPEG approaches the growing need for multimedia standards step-by-
   step. Today, three "phases" are defined:

   MPEG-1: "Coding of Moving Pictures and Associated Audio for
           Digital Storage Media at up to about 1.5 MBit/s"

   Status: International Standard IS-11172, completed in 10.92

   MPEG-2: "Generic Coding of Moving Pictures and Associated Audio"

   Status: Committee Draft CD 13818 as found in documents MPEG93 /
           N601, N602, N603 (11.93)

   MPEG-3: no longer exists (has been merged into MPEG-2)

   MPEG-4: "Very Low Bitrate Audio-Visual Coding"

   Status: Call for Proposals 11.94, Working Draft in 11.96

Q: MPEG-1 is ready-for-use. What does the standard look like?

A: MPEG-1 consists of 4 parts:

   IS 11172-1: System
   describes synchronization and multiplexing of video and audio

   IS 11172-2: Video
   describes compression of non-interlaced video signals

   IS 11172-3: Audio
   describes compression of audio signals

   CD 11172-4: Compliance Testing
   describes procedures for determining the characteristics of coded
   bitstreams and the decoding process and for testing compliance
   with the requirements stated in the other parts

Q. Does MPEG have anything to do with JPEG?

A. Well, it sounds the same, and they are part of the same
   subcommittee of ISO along with JBIG and MHEG, and they usually meet
   at the same place at the same time.  However, they are different
   sets of people with few or no common individual members, and they
   have different charters and requirements.  JPEG is for still image
   compression.

Q. Then what's JBIG and MHEG?

A. Sorry I mentioned them. Ok, I'll simply say that JBIG is for binary
   image compression (like faxes), and MHEG is for multi-media data
   standards (like integrating stills, video, audio, text, etc.).
   For an introduction to JBIG, see question 74 below.

Q. So how does MPEG-1 work? Tell me about video coding!

A. First off, it starts with a relatively low resolution video
   sequence (possibly decimated from the original) of about 352 by
   240 pixels by 30 frames/s (US--different numbers for Europe),
   but original high (CD) quality audio.  The images are in color,
   but converted to YUV space, and the two chrominance channels
   (U and V) are decimated further to 176 by 120 pixels.  It turns
   out that you can get away with a lot less resolution in those
   channels and not notice it, at least in "natural" (not computer
   generated) images.
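
   The chrominance decimation can be sketched in a few lines of
   Python.  This is only an illustration: the 2x2 averaging filter and
   the name decimate_2x2 are assumptions, since the text above does not
   say which filter an encoder uses.

```python
def decimate_2x2(plane):
    """Halve a plane in both directions by averaging each 2x2 group.

    `plane` is a list of rows of integer samples.  Averaging is an
    assumed filter choice, used here only for illustration.
    """
    height, width = len(plane), len(plane[0])
    return [
        [
            (plane[y][x] + plane[y][x + 1]
             + plane[y + 1][x] + plane[y + 1][x + 1] + 2) // 4
            for x in range(0, width - 1, 2)
        ]
        for y in range(0, height - 1, 2)
    ]


# A flat 352x240 chrominance plane shrinks to 176x120.
u_plane = [[128] * 352 for _ in range(240)]
u_small = decimate_2x2(u_plane)
print(len(u_small[0]), len(u_small))  # 176 120
```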

   The basic scheme is to predict motion from frame to frame in the
   temporal direction, and then to use DCT's (discrete cosine
   transforms) to organize the redundancy in the spatial directions.
   The DCT's are done on 8x8 blocks, and the motion prediction is
   done in the luminance (Y) channel on 16x16 blocks.  In other words,
   given the 16x16 block in the current frame that you are trying to
   code, you look for a close match to that block in a previous or
   future frame (there are backward prediction modes where later
   frames are sent first to allow interpolating between frames).
   The DCT coefficients (of either the actual data, or the difference
   between this block and the close match) are "quantized", which
   means that you divide them by some value to drop bits off the
   bottom end.  Hopefully, many of the coefficients will then end up
   being zero.  The quantization can change for every "macroblock"
   (a macroblock is 16x16 of Y and the corresponding 8x8's in both
   U and V).  The results of all of this, which include the DCT
   coefficients, the motion vectors, and the quantization parameters
   (and other stuff), are Huffman coded using fixed tables.  The DCT
   coefficients have a special Huffman table that is "two-dimensional"
   in that one code specifies a run-length of zeros and the non-zero
   value that ended the run.  Also, the motion vectors and the DC
   DCT components are DPCM (subtracted from the last one) coded.
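
   A rough Python sketch of three of the steps just described:
   quantization, the (run, value) pairs that feed the "two-dimensional"
   Huffman table, and DPCM.  It is deliberately simplified.  The single
   step size, the rounding toward zero, and all function names are
   assumptions for illustration; a real coder also weights each
   coefficient separately and scans the block in zigzag order, both of
   which are left out here.

```python
def quantize(coeffs, step):
    """Divide each coefficient by the step size, rounding toward zero."""
    return [int(c / step) for c in coeffs]


def run_level_pairs(quantized):
    """Turn a coefficient list into (zero_run, nonzero_value) pairs."""
    pairs, run = [], 0
    for value in quantized:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    # Trailing zeros produce no pair; a real stream signals end-of-block.
    return pairs


def dpcm(values, start=0):
    """Replace each value by its difference from the previous one."""
    diffs, previous = [], start
    for v in values:
        diffs.append(v - previous)
        previous = v
    return diffs


ac = [45, -22, 3, 0, 7, 0, 0, -9, 1, 0, 0, 0]
q = quantize(ac, 8)
print(q)                            # [5, -2, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0]
print(run_level_pairs(q))           # [(0, 5), (0, -2), (5, -1)]
print(dpcm([100, 104, 101, 101]))   # [100, 4, -3, 0]
```

   Note how the coarse step turns most of the small coefficients into
   zeros, so twelve values collapse into three pairs.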

Q. So is each frame predicted from the last frame?

A. No.  The scheme is a little more complicated than that.  There are
   three types of coded frames.  There are "I" or intra frames.  They
   are simply a frame coded as a still image, not using any past
   history.  You have to start somewhere.  Then there are "P" or
   predicted frames.  They are predicted from the most recently
   reconstructed I or P frame.  (I'm describing this from the point
   of view of the decompressor.)  Each macroblock in a P frame can
   either come with a vector and difference DCT coefficients for a
   close match in the last I or P, or it can just be "intra" coded
   (like in the I frames) if there was no good match.

   Lastly, there are "B" or bidirectional frames.  They are predicted
   from the closest two I or P frames, one in the past and one in the
   future.  You search for matching blocks in those frames, and try
   three different things to see which works best.  (Now I have the
   point of view of the compressor, just to confuse you.)  You try
   using the forward vector, the backward vector, and you try
   averaging the two blocks from the future and past frames, and
   subtracting that from the block being coded.  If none of those work
   well, you can intracode the block.
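
   The compressor's choice for a B-frame block might be sketched like
   this in Python.  The sum-of-absolute-differences cost, the
   intra_threshold parameter and the function names are all
   assumptions; the standard leaves the encoder's decision rule open.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


def choose_b_mode(current, past_match, future_match, intra_threshold):
    """Pick the cheapest of forward, backward and averaged prediction.

    Blocks are flat lists of samples.  Falls back to intra coding when
    even the best prediction costs more than `intra_threshold`.
    """
    average = [(p + f + 1) // 2 for p, f in zip(past_match, future_match)]
    costs = {
        "forward": sad(current, past_match),
        "backward": sad(current, future_match),
        "interpolated": sad(current, average),
    }
    mode = min(costs, key=costs.get)
    if costs[mode] > intra_threshold:
        return "intra", None
    return mode, costs[mode]


# The block lies halfway between its past and future matches,
# so the averaged prediction is exact.
print(choose_b_mode([10, 20, 30, 40], [8, 18, 28, 38],
                    [12, 22, 32, 42], intra_threshold=16))
# ('interpolated', 0)
```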

   The sequence of decoded frames usually goes like:

   IBBPBBPBBPBBIBBPBBPB...

   Where there are 12 frames from I to I (for US and Japan anyway.)
   This is based on a random access requirement that you need a
   starting point at least once every 0.4 seconds or so.  The ratio
   of P's to B's is based on experience.

   Of course, for the decoder to work, you have to send that first
   P *before* the first two B's, so the compressed data stream ends
   up looking like:

   0xx312645...

   where those are frame numbers.  xx might be nothing (if this is
   the true starting point), or it might be the B's of frames -2 and
   -1 if we're in the middle of the stream somewhere.

   You have to decode the I, then decode the P, keep both of those
   in memory, and then decode the two B's.  You probably display the
   I while you're decoding the P, and display the B's as you're
   decoding them, and then display the P as you're decoding the next
   P, and so on.
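
   The reordering from display order to stream order can be written
   down as a short Python sketch.  The function name coded_order is
   made up for this example, and it assumes every B frame depends only
   on the nearest I or P frame on each side, as described above.

```python
def coded_order(display_types):
    """Return display indices in the order they are sent.

    Each B frame is held back until the I or P frame that follows it
    in display order has been sent.
    """
    order, pending_b = [], []
    for index, frame_type in enumerate(display_types):
        if frame_type == "B":
            pending_b.append(index)
        else:
            order.append(index)       # send the I or P frame first
            order.extend(pending_b)   # then the B frames that needed it
            pending_b = []
    order.extend(pending_b)           # leftover B's if the input is cut off
    return order


print(coded_order("IBBPBBP"))  # [0, 3, 1, 2, 6, 4, 5]
```

   This matches the 0xx312645 pattern with xx empty, i.e. the case
   where frame 0 is the true starting point.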

Q. You've got to be kidding.

A. No, really!

Q. Hmm.  Where did they get 352x240?

A. That derives from the CCIR-601 digital television standard which
   is used by professional digital video equipment.  It is (in the US)
   720 by 243 by 60 fields (not frames) per second, where the fields
   are interlaced when displayed.  (It is important to note though
   that fields are actually acquired and displayed a 60th of a second
   apart.)  The chrominance channels are 360 by 243 by 60 fields a
   second, again interlaced.  This degree of chrominance decimation
   (2:1 in the horizontal direction) is called 4:2:2.  The source
   input format for MPEG-1, called SIF, is CCIR-601 decimated by 2:1
   in the horizontal direction, 2:1 in the time direction, and an
   additional 2:1 in the chrominance vertical direction.  And some
   lines are cut off to make sure things divide by 8 or 16 where
   needed.
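
   The arithmetic behind these numbers, as a small Python sketch.  The
   rule used for cutting lines off (round luminance down to a multiple
   of 16 and chrominance down to a multiple of 8) is an assumption that
   happens to reproduce the figures quoted above; the function names
   are made up.

```python
def round_down(value, multiple):
    """Largest multiple of `multiple` that does not exceed `value`."""
    return (value // multiple) * multiple


def sif_from_ccir601(width, lines_per_field, fields_per_second):
    """Apply the decimations described above, then crop."""
    luma_width = round_down(width // 2, 16)             # 2:1 horizontally
    luma_lines = round_down(lines_per_field, 16)        # one field kept, cropped
    chroma_width = round_down(width // 2 // 2, 8)       # 4:2:2, then 2:1 again
    chroma_lines = round_down(lines_per_field // 2, 8)  # extra 2:1 vertically
    frames_per_second = fields_per_second // 2          # 2:1 in time
    return (luma_width, luma_lines,
            chroma_width, chroma_lines, frames_per_second)


print(sif_from_ccir601(720, 243, 60))  # (352, 240, 176, 120, 30)
```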

Q. What if I'm in Europe?

A. For 50 Hz display standards (PAL, SECAM) change the number of lines
   in a field from 243 or 240 to 288, and change the display rate to
   50 fields/s or 25 frames/s.  Similarly, change the 120 lines in
   the decimated chrominance channels to 144 lines.  Since 288*50 is
   exactly equal to 240*60, the two formats have the same source data
   rate.
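
   A quick Python check of that claim.  The helper name is made up,
   and the raw bit rate at the end assumes 8 bits per sample and
   chrominance planes at a quarter of the luminance size each.

```python
def luma_samples_per_second(width, lines, frames_per_second):
    """Luminance samples per second for a progressive SIF stream."""
    return width * lines * frames_per_second


us_rate = luma_samples_per_second(352, 240, 30)
europe_rate = luma_samples_per_second(352, 288, 25)
print(us_rate, europe_rate)  # 2534400 2534400

# Adding two quarter-size chrominance planes gives 1.5 samples per
# luminance sample; 8 bits per sample is an assumption.
raw_bits_per_second = us_rate * 3 // 2 * 8
print(raw_bits_per_second)  # 30412800
```

   So the uncompressed SIF video is roughly 30 Mbit/s, which gives a
   feel for how much the coder has to remove to fit in about
   1.5 Mbit/s.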

Q. What will MPEG-2 do for video coding?

A. As I said, there is a considerable loss of quality in going from
   CCIR-601 to SIF resolution.  For entertainment video, it's simply
   not acceptable.  You want to use more bits and code all or almost
   all the CCIR-601 data.  From subjective testing at the Japan
   meeting in November 1991, it seems that 4 MBits/s can give very
   good quality compared to the original CCIR-601 material.  The
   objective of MPEG-2 is to define a bit stream optimized for
   these resolutions and bit rates.

Q. Why not just scale up what you're doing with MPEG-1?

A. The main difficulty is the interlacing.  The simplest way to extend
   MPEG-1 to interlaced material is to put the fields together into
   frames (720x486x30/s).  This results in bad motion artifacts that
   stem from the fact that moving objects are in different places
   in the two fields, and so don't line up in the frames.  Compressing
   and decompressing without taking that into account somehow tends to
   muddle the objects in the two different fields.

   The other thing you might try is to code the even and odd field
   streams separately.  This avoids the motion artifacts, but as you
   might imagine, doesn't get very good compression since you are not
   using the redundancy between the even and odd fields where there
   is not much motion (which is typically most of the image).

   Or you can code it as a single stream of fields.  Or you can
   interpolate lines.  Or, etc. etc.  There are many things you can
   try, and the point of MPEG-2 is to figure out what works well.
   MPEG-2 is not limited to considering only derivations of MPEG-1.
   There were several non-MPEG-1-like schemes in the competition in
   November, and some aspects of those algorithms may or may not
   make it into the final standard for entertainment video
   compression.

Q. So what works?

A. Basically, derivations of MPEG-1 worked quite well.  One scheme
   that used wavelet subband coding instead of DCT's also worked very
   well.  Also among the worked-very-well's was a scheme that did not
   use B frames at all, just I and P's.  All of them, except maybe
   one, did some sort of adaptive frame/field coding, where a decision
   is made on a macroblock basis as to whether to code that one as
   one frame macroblock or as two field macroblocks.  Some other
   aspects are how to code I-frames--some suggest predicting the even
   field from the odd field.  Or you can predict evens from evens and
   odds or odds from evens and odds or any field from any other field,
   etc.
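
   One way a per-macroblock frame/field decision could be made is
   sketched below in Python.  This is purely an illustrative heuristic:
   the proposals are not described in enough detail here to say how any
   of them decided, and the line-difference cost and function names are
   assumptions.

```python
def vertical_activity(rows):
    """Sum of absolute differences between vertically adjacent rows."""
    return sum(
        abs(a - b)
        for upper, lower in zip(rows, rows[1:])
        for a, b in zip(upper, lower)
    )


def choose_frame_or_field(macroblock_rows):
    """Code as two field macroblocks if the fields are smoother apart.

    Even-numbered rows form one field and odd-numbered rows the other.
    """
    top_field = macroblock_rows[0::2]
    bottom_field = macroblock_rows[1::2]
    frame_cost = vertical_activity(macroblock_rows)
    field_cost = vertical_activity(top_field) + vertical_activity(bottom_field)
    return "field" if field_cost < frame_cost else "frame"


# An object that moved between the two fields gives a comb pattern.
moving = [[100, 100], [0, 0], [100, 100], [0, 0]]
# A smooth vertical gradient with no motion between fields.
still = [[0, 0], [10, 10], [20, 20], [30, 30]]
print(choose_frame_or_field(moving))  # field
print(choose_frame_or_field(still))   # frame
```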

Q. So what works?

A. Ok, we're not really sure what works best yet.  The next step is
   to define a "test model" to start from, that incorporates most of
   the salient features of the worked-very-well proposals in a
   simple way.  Then experiments will be done on that test model,
   making a mod at a time, and seeing what makes it better and what
   makes it worse.  Example experiments are, B's or no B's, DCT vs.
   wavelets, various field prediction modes, etc.  The requirements,
   such as implementation cost, quality, random access, etc. will all
   feed into this process as well.

Q. When will all this be finished?

A. I don't know.  I'd have to hope in about a year or less.

Q: Talking about MPEG audio coding, I heard a lot about "Layer 1, 2
   and 3". What does it mean, exactly?

A: MPEG-1, IS 11172-3, describes the compression of audio signals
   using high performance perceptual coding schemes. It specifies a
   family of three audio coding schemes, simply called Layer-1,-2,-3,
   with increasing encoder complexity and performance (sound quality
   per bitrate). The three codecs are compatible in a hierarchical
   way, i.e. a Layer-N decoder is able to decode bitstream data
   encoded in Layer-N and all Layers below N (e.g., a Layer-3
   decoder may accept Layer-1,-2 and -3, whereas a Layer-2 decoder
   may accept only Layer-1 and -2.)

Q: So we have a family of three audio coding schemes. What does the
   MPEG standard define, exactly?

A: For each Layer, the standard specifies the bitstream format and
   the decoder. To allow for future improvements, it does *not*
   specify the encoder, but an informative chapter gives an example
   of an encoder for each Layer.

Q: What do the three audio Layers have in common?

A: All Layers use the same basic structure. The coding scheme can be
   described as "perceptual noise shaping" or "perceptual subband /
   transform coding".

   The encoder analyzes the spectral components of the audio signal
   by calculating a filterbank or transform and applies a
   psychoacoustic model to estimate the just noticeable noise-
   level. In its quantization and coding stage, the encoder tries
   to allocate the available number of data bits in a way to meet
   both the bitrate and masking requirements.
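
   Since the encoder is not specified, the following Python sketch is
   only one plausible way to do the allocation: a greedy loop that
   keeps giving a bit to the subband whose quantization noise sticks
   out furthest above the masking threshold.  The 6 dB-per-bit rule of
   thumb, the max_bits limit and the function name are assumptions.

```python
def allocate_bits(signal_to_mask_db, total_bits, db_per_bit=6.0, max_bits=15):
    """Greedily hand out bits to the subband with the worst noise.

    `signal_to_mask_db` holds one signal-to-mask ratio per subband, as
    a psychoacoustic model might supply.  Each extra bit is assumed to
    lower that subband's quantization noise by `db_per_bit`.
    """
    bits = [0] * len(signal_to_mask_db)
    for _ in range(total_bits):
        # Noise-to-mask ratio left in each subband; > 0 means audible.
        nmr = [smr - db_per_bit * b for smr, b in zip(signal_to_mask_db, bits)]
        candidates = [i for i, b in enumerate(bits) if b < max_bits]
        if not candidates:
            break
        worst = max(candidates, key=lambda i: nmr[i])
        if nmr[worst] <= 0:
            break  # all remaining noise is already masked
        bits[worst] += 1
    return bits


smr = [20.0, 5.0, -3.0, 12.0]
print(allocate_bits(smr, total_bits=6))   # [3, 1, 0, 2]
print(allocate_bits(smr, total_bits=20))  # [4, 1, 0, 2]
```

   In the second call the loop stops early: once every subband's noise
   is at or below its mask, spending more bits would not be heard.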

   The decoder is much less complex. Its only task is to synthesize
   an audio signal out of the coded spectral components.

   All Layers use the same analysis filterbank (polyphase with 32
   subbands). Layer-3 adds an MDCT transform to increase the frequency
   resolution.

   All Layers use the same "header information" in their bitstream,
   to support the hierarchical structure of the standard.

   All Layers use a bitstream structure that contains parts that are
   more sensitive to bit errors ("header", "bit allocation",
   "scalefactors", "side information") and parts that are less
   sensitive ("data of spectral components").

   All Layers may use 32, 44.1 or 48 kHz sampling frequency.

   All Layers are allowed to work with similar bitrates:
   Layer-1: from 32 kbps to 448 kbps
   Layer-2: from 32 kbps to 384 kbps
   Layer-3: from 32 kbps to 320 kbps

Q: What are the main differences between the three Layers, from a
   global view?

A: From Layer-1 to Layer-3,
   complexity increases (mainly true for the encoder),
   overall codec delay increases, and
   performance increases (sound quality per bitrate).

Q: Which Layer should I use for my application?

A: Good Question. Of course, it depends on all your requirements. But
   as a first approach, you should consider the available bitrate of
   your application as the Layers have been designed to support
   certain areas of bitrates most efficiently, i.e. with a minimum
   drop of sound quality.

   Let us look a little closer at the strong domains of each Layer.

   Layer-1: Its ISO target bitrate is 192 kbps per audio channel.

   Layer-1 is a simplified version of Layer-2. It is most useful at
   "high" bitrates around or above 192 kbps. A version of Layer-1 is
   used as "PASC" with the DCC recorder.

   Layer-2: Its ISO target bitrate is 128 kbps per audio channel.

   Layer-2 is identical with MUSICAM. It has been designed as a
   trade-off between sound quality per bitrate and encoder
   complexity. It is most useful at "medium" bitrates of 128 or
   even 96 kbps per audio channel. The DAB (Eureka 147) proponents
   have decided to use Layer-2 in the future Digital Audio
   Broadcasting network.

   Layer-3: Its ISO target bitrate is 64 kbps per audio channel.

   Layer-3 merges the best ideas of MUSICAM and ASPEC. It has been
   designed for best performance at "low" bitrates around 64 kbps or
   even below. The Layer-3 format specifies a set of advanced
   features that all address one goal: to preserve as much sound
   quality as possible even at rather low bitrates. Today, Layer-3 is
   already in use in various telecommunication networks (ISDN,
   satellite links, and so on) and speech announcement systems.
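
   To put the three target bitrates in perspective, here is a small
   Python calculation of the compression ratio against uncompressed CD
   audio.  The CD figures (44.1 kHz, 16 bits per sample, per channel)
   and the function name are assumptions brought in for the
   comparison.

```python
CD_SAMPLE_RATE_HZ = 44_100
CD_BITS_PER_SAMPLE = 16


def compression_ratio(coded_kbps_per_channel):
    """Ratio of CD PCM bitrate to coded bitrate, per channel."""
    pcm_kbps = CD_SAMPLE_RATE_HZ * CD_BITS_PER_SAMPLE / 1000  # 705.6 kbps
    return pcm_kbps / coded_kbps_per_channel


for layer, kbps in (("Layer-1", 192), ("Layer-2", 128), ("Layer-3", 64)):
    print(layer, kbps, "kbps ->", f"{compression_ratio(kbps):.1f}:1")
```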

Q: Tell me more about sound quality. How do you assess that?

A: Today, there is no alternative to expensive listening tests.
   During the ISO-MPEG-1 process, 3 international listening tests
   have been performed, with a lot of trained listeners, supervised
   by Swedish Radio. They took place in 7.90, 3.91 and 11.91. Another
   international listening test was performed by CCIR, now ITU-R, in
   92.

   All these tests used the "triple stimulus, hidden reference"
   method and the CCIR impairment scale to assess the audio quality.
   The listening sequence is "ABC", with A = original, BC = pair of
   original / coded signal with random sequence, and the listener has
   to evaluate both B and C with a number between 1.0 and 5.0. The
   meaning of these values is:

   5.0 = transparent (this should be the original signal)
   4.0 = perceptible, but not annoying (first differences noticeable)
   3.0 = slightly annoying
   2.0 = annoying
   1.0 = very annoying

   With perceptual codecs (like MPEG audio), all traditional
   parameters (like SNR, THD+N, bandwidth) are essentially useless.
   Fraunhofer-IIS also works on objective quality assessment tools,
   like the NMR meter (Noise-to-Mask-Ratio). BTW: If you need more
   information about NMR, please contact nmr@iis.fhg.de.

Q: Now that I know how to assess quality, come on, tell me the
   results of these tests.

A: Well, for low bitrates, the main result is that at 60 or 64 kbps
   per channel, Layer-2 always scored between 2.1 and 2.6, whereas
   Layer-3 scored between 3.6 and 3.8. This is a significant increase
   in sound quality, indeed! Furthermore, the selection process for
   critical sound material showed that it was rather difficult to
   find worst-case material for Layer-3 whereas it was not so hard to
   find such items for Layer-2.

Q: OK, a Layer-2 codec at low bitrates may sound poor today, but
   couldn't that be improved in the future? I guess you just told me
   before that the encoder is not fixed in the standard.

A: Good thinking! As the sound quality mainly depends on the encoder
   implementation, it is true that there is no such thing as a
   "Layer-N" quality. So we definitely only know the performance of the
   reference codecs during the international tests. Who knows what
   will happen in the future? What we do know now, is:

   Today, Layer-3 already provides a sound quality that comes very
   near to CD quality at 64 kbps per channel. Layer-2 is far away
   from that.

   Tomorrow, both Layers may improve. Layer-2 has been designed as a
   trade-off between quality and complexity, so the bitstream format
   allows only limited innovations. In contrast, even the current
   reference Layer-3-codec exploits only a small part of the powerful
   mechanisms inside the Layer-3 bitstream format.

Q: All in all, you sound as if anybody should use Layer-3 for low
   bitrates. Why on earth do some vendors still offer only Layer-2
   equipment for these applications?

A: Well, maybe because they started to design and develop their
   system rather early, e.g. in 1990. As Layer-2 is identical with
   MUSICAM, it has been available since the summer of 1990 at the
   latest. In that year, Layer-3 development started, and it was
   successfully finished in spring 1992. So, for a certain time,
   vendors could only exploit the existing part of the new MPEG
   standard.

   Now the situation has changed. All Layers are available, the
   standard is completed, and new systems need not limit themselves,
   but may capitalize on the full features of MPEG audio.

Q: How do I get the MPEG documents?

A: You may order them from your national standards body.

   E.g., in Germany, please contact:
   DIN-Beuth Verlag, Auslandsnormen
   Mrs. Niehoff, Burggrafenstr. 6, D-10772 Berlin, Germany
   Phone: 030-2601-2757, Fax: 030-2601-1231

   E.g., in the USA, you may order them from ANSI [phone (212) 642-4900]
   or buy them from companies like OMNICOM: phone +44 438 742424
                                            FAX   +44 438 740154

Q. How do I join MPEG?

A. You don't join MPEG.  You have to participate in ISO as part of a
   national delegation.  How you get to be part of the national
   delegation is up to each nation.  I only know the U.S., where you
   have to attend the corresponding ANSI meetings to be able to
   attend the ISO meetings.  Your company or institution has to be
   willing to sink some bucks into travel since, naturally, these
   meetings are held all over the world.  (For example, Paris,
   Santa Clara, Kurihama Japan, Singapore, Haifa Israel, Rio de
   Janeiro, London, etc.)

--

－－－－－－－－－－
Great learning makes a teacher; high virtue makes a role model.

--
☆ Source: 哈工大紫丁香 (HIT Lilac BBS) bbs.hit.edu.cn [FROM: sunsoft.bbs@bbs.net.]
