Behind the Screen: Codecs and Formats Unveiled

Reading time: 30 min

Codecs and formats form the bulk of our source material, from both proprietary and standard files.
Our Video Formats and Conversion blog series continues with a full breakdown of all the common types encountered in forensic video analysis.


Welcome back to this blog series on video formats and conversion. In this post, we will cover codecs and formats, as it is important to understand these before we look at video conversion.

There is a lot of information here, so you may wish to bookmark certain sections to come back to later.

In “Video Formats and Conversion – Behind the Screen: Codecs and Formats Unveiled”, we will cover:

  1. Codecs vs. Formats: Codecs are used to encode and decode (compress and decompress) video data, while formats are containers that organize the video and audio streams for playback. Both are essential for video processing.
  2. Compression Types: Video compression can be either lossless (retaining all data) or lossy (reducing data with potential quality loss). Compression reduces file size but can affect visual fidelity depending on its application.
  3. Spatial and Temporal Compression: Spatial compression reduces data within individual frames (such as by discarding redundant color values), while temporal compression reduces data by referencing previous or future frames and encoding only changes.
  4. Impact of Compression on Forensic Analysis: Compression artifacts, particularly in low-quality or poorly compressed video, can lead to errors in interpreting visual data. This is critical in forensic video analysis, where the integrity of the data is paramount.
  5. The Forensic Line: Forensic video analysis is required when questions arise about the reliability of what is seen in the video. Analysts must assess compression, artifacts, and other factors to determine the accuracy of the footage.
  6. Block Types in Compression: Modern video compression divides images into blocks or units. Intra encoding references only the current frame, whereas inter encoding references previous or following frames and encodes only the differences.
  7. GOP (Group of Pictures): GOP structures organize video frames into groups. Each GOP must start with a frame consisting of new intra-blocks, followed by P- and B-frames that can contain inter encoding and rely on other frames to be presented correctly.
  8. Forensic Challenges with Modern Codecs: Advanced codecs like H.264 and H.265 offer high compression efficiency. However, artifacts and errors, especially in poorly lit or low-resolution areas, can propagate through frames, complicating forensic analysis.
  9. Container Formats: Common container formats include AVI, MP4, and MKV. Each has specific features and limitations for handling video and audio streams. Forensic analysts must assess how these formats store and synchronize data.
  10. Importance of Codec and Container Integrity: Ensuring that a video’s codec and container are properly aligned is critical in forensic investigations. Improper processing or transcoding can lead to data loss or inaccurate frame rendering, affecting evidence presentation.

Previously, we examined proprietary data and gained an appreciation for the tasks required of the Amped Engine.

As a reminder, let us look at those again here:

  • Identification of the format
  • Extraction of the multimedia and data
  • Formatting of the streams
  • Decoding of the footage

We learned that we cannot expect a multimedia framework to interpret data correctly if that framework was never designed to understand it. The data must be identified correctly, and extracted in a manner that retains the integrity of the original encoding. What follows is the formatting of the streams into a container that the framework can accurately decode. Some streams can be decoded without formatting, but may not play correctly if not fully understood by the decoder.

Unfortunately, we regularly receive files that have been incorrectly processed by applications based solely on standards. Mixed video streams, damage to the pixels, incorrect video timing, and missing data timestamps are some of the most common errors. Most of the time, though, those tools simply won’t play anything. They are looking for data that conforms to standards and cannot adapt to proprietary video. That data may also be incorrectly reported, with inaccurate durations or frame rates.

In the series so far, we have learned that most video surveillance formats are formed originally from standard encoding. Therefore, let us start with the codecs, and then look at the container formats that they reside in.

Video Codecs

The purpose of a codec is to code and then decode data. Or, in other words, to compress and then decompress. Due to the term “compression”, there is a common misconception that compression is bad. Well, it can be, but not if done correctly. In fact, compression is often a necessity to reduce vast amounts of video evidence to a manageable file size. There is a balancing act at play between file size and image quality.

Compression can be either lossless or lossy. Even when there is a loss of information, there may be no loss in visual fidelity, so some loss may be entirely justifiable. The amount of loss or change can be quantified or visually assessed, depending on the purpose of the visual information. For instance, if the purpose of the video were to present a person in a black and white checked shirt, it would not be acceptable if the compression turned the shirt grey.

Every codec needs an encoder and a decoder, and there can be many different implementations of each. As long as they follow the standards, a file written by one encoder can be decoded by another vendor’s decoder. For instance, the x264 and x265 encoders can write data that is compatible with Microsoft DirectShow or Media Foundation decoders. There was a time, not long ago, when codecs could be listed by status within the evidential processing chain.

We may have had “Motion JPEG” or “MPEG4 Part 2” as our source codecs.
Our intermediary working codecs were perhaps “Uncompressed RGB” or “Apple ProRes”.
We would then output for delivery and playback with “Windows Media Video” or “MPEG2” for DVD.

How Times Have Changed

For several years now, applications such as Amped FIVE have allowed the processing of video and surveillance footage to be handled natively, without the requirement to “multi-tool”.

At the same time, there has been a shift from uncontrolled CCTV processing to forensic video analysis. It is now a requirement in many legal systems that video data is not blindly transcoded and played back. There is an obligation to understand a video’s integrity, authenticity, and reliability before any version is presented as fact to a judge or jury. This is paramount now due to the ease of accidental or malicious manipulation and the generation of synthetic media.

Rather than look at codecs for their use, we will concentrate on their level of technicality.

It is beyond the scope of this post to go into all the intricacies of video compression. Still, we must highlight some of the most important points that are relevant when using video within a legal setting.

The starting consideration is that you are viewing a digital representation of a scene, object, or event. It is not a person, a vehicle, or an item, but a collection of small pieces of color that your brain interprets as something. You must then consider that one of the purposes of compression is to remove as much data as possible without being too noticeable. One of the methods to achieve this is to fool the brain and remove information that our brain can fill in itself!

Add to this the many visual defects that happen during the video generation process and you can easily see why some people make mistakes when viewing and interpreting footage.

The Forensic Line

This brings us nicely to “The Forensic Line”. This is where viewing footage must turn to forensic video analysis. How do you know if you need to cross that line? It is the moment you ask a question about what you are seeing.

“Is it…?”
“It could be…”
“It looks like…”

At that point, being aware of video compression will help you make the right decisions, and the video can be sent for further analysis if required. This ensures facts are presented within a legal environment and not subjective opinions based on what the viewer believes.

When forensic analysts present imagery or video in a courtroom, they must do so with an understanding of how that data was originally constructed. Analysis of the compression and all subsequent processes to that visual image forms the science of Forensic Video Analysis. Just saying that you can see a line of light pixels is worthless if you cannot explain that the line of pixels is reliable.

Video Compression

We will look at two types of video compression, spatial and temporal.

[Image: compressed image]

Spatial Compression

In the above CCTV frame capturing a vehicle in the distance, we can see the pixels, blocks, and macroblocks.

The pixels (green) are the smallest component of a digital image. They are the little colored squares that form the image. However, not every pixel’s color value is stored originally. In most compression schemes, there is no need to include a color value for every pixel. During encoding, these values can be discarded and then added back later using interpolation when the image is decompressed. Why take up valuable data space when the eye doesn’t need it? We will look at this more when we get to the decoding stage later in the series.
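As a rough illustration of the savings, here is a minimal sketch of 4:2:0 chroma subsampling, one common scheme (the exact scheme varies by codec and profile); the frame dimensions are example values only. The two chroma planes are stored at half resolution in each dimension, halving the total data before any other compression is applied.

```python
import numpy as np

# Hypothetical 1280x720 frame in YCbCr: one luma plane and two chroma planes.
h, w = 720, 1280
luma = np.random.randint(0, 256, (h, w), dtype=np.uint8)
cb = np.random.randint(0, 256, (h, w), dtype=np.uint8)
cr = np.random.randint(0, 256, (h, w), dtype=np.uint8)

# 4:2:0 subsampling: keep every second chroma sample in both directions.
cb420 = cb[::2, ::2]
cr420 = cr[::2, ::2]

full = luma.size + cb.size + cr.size       # 4:4:4 -> 3 values per pixel
sub = luma.size + cb420.size + cr420.size  # 4:2:0 -> 1.5 values per pixel
print(f"4:4:4: {full:,} samples, 4:2:0: {sub:,} samples ({sub/full:.0%})")

# On decode, the discarded chroma is interpolated back, e.g. nearest neighbour:
cb_restored = cb420.repeat(2, axis=0).repeat(2, axis=1)
```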

Next is the block of 8×8 pixels (blue). Notice how the blocking cuts off the bottom of the vehicle tires. These blocks play a very important role in image compression as they allow color conversion and subsampling, along with a conversion of the values into a frequency domain.
Finally, we have the macroblocks (red). These are 16×16 pixel blocks. In modern video encoding, these blocks can be larger and are named differently. We will learn about those shortly.

For now, though, everything you see in that image has been compressed using information from that “space”, using spatial compression.
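To make the “frequency domain” idea more concrete, here is a hedged sketch of what happens inside a single 8×8 block in JPEG-style spatial compression. Real encoders use carefully tuned quantization matrices; a single constant step is used here purely for illustration.

```python
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(0)
block = rng.integers(0, 256, (8, 8)).astype(float)  # one 8x8 pixel block

coeffs = dctn(block - 128, norm="ortho")  # shift to a signed range, then 2D DCT
q = 24                                    # illustrative quantization step
quantized = np.round(coeffs / q)          # most high-frequency coeffs become 0
restored = idctn(quantized * q, norm="ortho") + 128

print("non-zero coefficients:", np.count_nonzero(quantized), "of 64")
print("max pixel error:", np.abs(block - restored).max())
```

The quantization step is where data, mostly high-frequency detail, is discarded; the decoder can only invert the transform, not recover what was thrown away.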

Before we learn a little about temporal compression, look back at the vehicle. Consider how many pixels or blocks form that vehicle. It is often objects and marks of that size, in pixels, that cause questions and doubt. These small groups of pixels suffer the most when being compressed. Furthermore, when further compression is applied, whether through an incorrect acquisition or poor handling within a processing stage, objects of this size will change the most. This causes interpretation and reliability challenges.

Therefore, if this is the size of an object you are viewing, how reliable is your interpretation of it? Statements of fact at a pixel or block level require understanding how that pixel or block was formed. The origins of that block are where we head next.

Temporal Compression

Here we have compression using time, rather than space, to make decisions on what to retain, what to change, and what is not required.

You may remember the previous post on proprietary data, where we looked at a selection of frames that were not in the correct order for decoding.

[Image: table of frames]

We explained that although the data could be understood by a standard decoder, it would be displayed incorrectly. Without the streams being sorted, the frames would not reference the correct preceding frames.

This is the key to temporal compression. If a frame can reference a previous frame to decide what needs to be coded, why bother coding something that has not changed?

And what if an object moves? If the compression can move that block and apply some changes, it will take up less space than encoding it all as new information.

[Image: macroblocks]

Here we have a CCTV video frame of a vehicle travelling towards the camera. We can see all the blocks, and the block types are differentiated using a colored overlay.

Let’s start with the macroblocks shown here in purple. This is new data encoded at that moment in time. The visual information within that block is low frequency and has little detail: the road, a wall, etc. Each block was compared against the corresponding block in the previous frame, and the encoder identified that the change required the data to be newly encoded.

Next, we have the macroblocks shown in red. These are also new but have more high-frequency details, such as patterns or edges. Both of these block types are known as intra-blocks (I). The blocks marked in red get split up and encoded as four 8×8 blocks. The purple blocks, as they have less detail, get encoded as a single 16×16 block.

A frame that can only contain intra-blocks is an I-frame (I). In video compression, “Intra” defines a block of pixels compressed entirely within itself. That is to say, the block is only spatially compressed and does not reference data from another frame or block.

Now we have the blocks that are shown here in green. These blocks have been compared against those in the previous frame and the encoder found that the difference only requires the pixel values to change. The storing of the difference takes up much less space than encoding it all as new.

You will also notice that the blocks over the moving vehicle have small arrows associated with them. These are motion vectors. During encoding, the new frame is compared against the previous (reference) frame. Blocks can be moved while making a slight change to the pixel values. Again, storing the motion vector parameters and the difference between the reference frame and the new frame takes up much less space than encoding it all as new. It is also easier to decode, as only the changes need to be presented.

This block type is known as predicted. A frame that references the previous frame and itself is known as a Predicted frame (P).
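For illustration, here is a minimal sketch of how an encoder might find such a motion vector, using an exhaustive block-matching search that minimizes the sum of absolute differences (SAD). Production encoders use far faster search strategies, but the principle is the same: find where the block came from, then encode only the vector and the residual.

```python
import numpy as np

def find_motion_vector(ref, cur, top, left, size=16, search=8):
    """Exhaustive block matching: find where a block from the current
    frame best matches the reference (previous) frame."""
    block = cur[top:top+size, left:left+size].astype(int)
    best, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y+size > ref.shape[0] or x+size > ref.shape[1]:
                continue  # candidate block falls outside the reference frame
            sad = np.abs(ref[y:y+size, x:x+size].astype(int) - block).sum()
            if best is None or sad < best:
                best, best_mv = sad, (dy, dx)
    return best_mv, best  # the vector, plus the residual cost left to encode
```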

Prediction

This use of the word “predicted” often causes confusion. It is also used in error by many who do not understand video compression. Many question the reliability of a frame or block if prediction is used, believing, or attempting to portray, that the information is formed through guesswork.

To help us understand it a bit better, the Latin origin is “praedictus”, meaning “made known beforehand”. This is true in our encoding context, as a reference frame (the frame prior) is being used to evaluate the changes in the frame being encoded.

Now let us look at the difference in dictionary definitions (dictionary.cambridge.org) between the US and the UK.

  • US – To say that an event or action will happen in the future.
  • UK – To say what you think will happen in the future.

These are very different, and this difference is likely to have caused a lot of misunderstandings.
A meteorologist can predict the weather, and an astronomer can predict a solar eclipse.

Predictions can be accurate if the data to form that prediction is accurate.

That last sentence is very important. Within our CCTV world, small errors often occur. Due to sub-optimal lighting or low-quality cameras, information can be lost or added during this compression stage. As frames can use previous frames to form the data, the errors get copied into consecutive frames. Therefore, they can appear reliable to the unknowing viewer.
In the last year, an Amped FIVE user analyzed two camera views of the same scene. A person could be seen in one but was not captured by the other, all due to compression errors.

Blocks

Let us get back to the blocks for a moment.
In the image above, we also see grey blocks. These have no encoded difference from the corresponding blocks in the previous frame, yet there is a very slight difference after decoding. How is that?

Do you remember earlier when we said that color can be added back during the decoding stage using interpolation? Well, that is what is happening here. The encoder did not change any data during the encoding, but the spatial compression caused very small changes during the decoding. This is why it is important to have control of decoding parameters. We will be looking into these controls later in the series.

Lastly, we have the black blocks, and you may notice that these are both 16×16 and 8×8.
These blocks show that there is no difference between the previous frame and the one displayed.

There is one other set of blocks, not shown in the image above, and they relate to blocks formed through bi-directional prediction. These blocks can reference the previous frame and also the one that follows. Frames that can reference both directions are known as B-frames. These are often found in systems that have a small buffer within the encoder, allowing frame information to be stored before encoding. This data buffer could be in the camera, or within the NVR/DVR itself. B-frames are also commonly found in transcodes: the software performing the transcode can look ahead, as the frames already exist in the original file.

Finally, we mentioned earlier that modern video encoding uses larger blocks.
We have really only just scratched the surface of video compression and have concentrated on some of the core concepts. However, it is worth considering that some modern codecs utilize slightly different technology and terminology. They still use spatial and temporal compression, but the blocks can be up to 64×64 pixels. In one type we will look at soon, they are termed Coding Tree Units (CTU), rather than macroblocks.

GOP

This stands for Group of Pictures. We have learned that there are three picture types. I, P, and B. A video GOP can be all I-frames, but it cannot be all P-frames. If the codec uses P or B block compression, there must be at least one I-frame to start the GOP.

GOPs can look like this:
I B B P B B P B B P B B,
and this could be referred to as a 12-frame GOP.

You may also see GOPs referred to this way:
M=3:N=12, with M stating the number of frames between anchor (I or P) frames, and N stating the total number of frames in that GOP.

GOPs can have a fixed length, so in our example, every GOP will be 12. They can also be flexible. This allows an encoder to create an I-frame only when necessary. GOPs are also referred to as being open or closed. An open GOP enables frames to reference a frame outside of its own GOP. In our example above, the last B-frame will be able to reference the next I-frame if the GOP is “open”.
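To make the M/N notation concrete, here is a small sketch that expands an M=3, N=12 description into the frame pattern shown above. It assumes the basic closed-GOP case with no irregularities.

```python
def gop_pattern(m: int, n: int) -> str:
    """Expand M (anchor spacing) / N (GOP length) into a frame pattern."""
    frames = []
    for i in range(n):
        if i == 0:
            frames.append("I")   # every GOP starts with an I-frame
        elif i % m == 0:
            frames.append("P")   # anchor frames every M positions
        else:
            frames.append("B")   # bi-directional frames in between
    return " ".join(frames)

print(gop_pattern(3, 12))  # I B B P B B P B B P B B
```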

We have learned so far that it is not the frame type that ensures the reliability of the image, but the blocks that construct it. Again, those who do not fully understand compression may suggest that a frame cannot be relied upon because it is not an I-frame. There may be a misunderstanding that because a video file consists of only 4% I-frames, the other 96% is somehow unreliable. These types of metrics can be misleading if used incorrectly. Even within authenticity analysis, it is the blocks that are analyzed and not the frame type.

Before we move on to the codecs and what compression they use, it is vital to remember a key point. If you are referring to an item at the block level, you must be able to prove the reliability of the data being used. This proof should extend beyond just the decoded image.

Think back to the words of caution surrounding spatially compressed blocks and pixels. Now consider how those blocks are being used within modern compression. Since spatial compression alone can cause artifacts, it follows that even I-frames can misrepresent the details or shape of a subject. However, temporal compression can propagate an artifact to surrounding frames. The propagation of errors and artifacts can easily then be interpreted as reliable data.

  • Marks on a person’s face caused by compression blocking may be incorrectly interpreted as creases or shapes.
  • The shape of a hand or object being held can be interpreted incorrectly due to the formation of the blocks.
  • Observations on an item of clothing can be mistaken as being reliable when the formation of the observed point is not reliable.

Read more about some forms of compression artifacts in our blog post from the Video Evidence Pitfalls series.

Source Codecs

MJPEG

Motion JPEG, as the name suggests, is simply a series of JPEG images that should be presented one after another, thereby giving the viewer the perception of motion. Although an older codec, it is still seen regularly in dashcams, motion-activated wildlife cameras, and systems using Video Motion Detection (VMD). Standard Motion JPEG files only use spatial compression. Therefore, they will often be much larger in file size than videos also using temporal compression.

As mentioned in the previous post, there are several proprietary variants of MJPEG. The JPEG images may not be fully decodable without the proprietary player or the Amped Engine to reconstruct and decode the JPEG blocks.

One of the challenges found with some early JPEG-based systems is that they forced the analog signal into a computer-based resolution. An example would be a PAL 704×576 interlaced frame being squeezed into a 640×480 full-frame image. Yes, it has a 4:3 aspect ratio, but how it got there was all wrong. Consequently, even when using restoration and enhancement techniques, the damage cannot be reversed.

We now move on to codecs that utilize both spatial and temporal compression.

MPEG1 and MPEG2

You will see these letters a lot: MPEG stands for the Moving Picture Experts Group, the body behind the standardization of many of the codecs we will examine.

These are here for completeness. MPEG1 and its early iterations were probably why many video surveillance manufacturers went their own way. Licensing issues, along with either expensive or low-powered chips, caused some problems in the early years of digital CCTV.

MPEG2 is still used in some legacy surveillance systems. Only one CCTV DVR manufacturer utilized the codec, though, as it was quite computationally heavy at the time. However, it found its home as the codec for Video DVDs and, for some time, was used to replace the VHS recorder.

[Image: dvd and vhs recorder]

Here we have an early generation DVR, with a video DVD recorder on top, being used to re-record anything required as evidence.

So, the DVR was taking in an analog video signal, deinterlacing and compressing the image, and then converting it back into another analog video signal for output. That output was then ingested into the DVD recorder, where it was converted into yet another digital format: MPEG2.
That is what Law Enforcement then received as evidence. That footage was often referred to as garbled rubbish! Considering the number of changes made to the data, it was amazing that we ever got anything from systems like this.

Video DVDs (MPEG2) would also be used for several years as a presentation medium for CCTV evidence. It’s only now, with further understanding and hindsight, that we realize how dangerous that was. Transcoding compressed video evidence using a codec that was, in many cases, worse than the original was perhaps a mistake.

MPEG 4 Part 2 / M4V

This is still a fairly commonly seen codec here at Amped Software. Used extensively in the early days of digital CCTV, the hardware utilizing it is still found in many small stores. The codec itself is broken down into profiles and then levels.

One of the challenges encountered with systems using the Simple Profile was that although it supported PAL and NTSC resolutions, it did not support interlaced video. The challenge was then how to deal with the analog signal coming in. We see two methods very regularly, and the tricks that the manufacturers learned here are still used today, even in more modern systems.

The first is only saving a single field. Rather than two separate fields to make up the frame, just save a single field and then manage that on playback.
So, a single field gives an image of 704×240 if NTSC was being used or 704×288 with PAL.

There were some regular problems with this when it came to the proprietary players. Many would not interpolate the lines correctly to give a full frame. They would, instead, reduce the width to retain a 4:3 aspect ratio. This would result in an NTSC field being reduced to 320×240 by the playback software, which is also what it would export.

The next method seen to deal with interlacing was a separation of the fields. So rather than the fields being interlaced, they would be interleaved. The player would then be responsible for joining everything back up during the decode stage.

[Image: an interleaved test video, showing the top field on the top and the bottom field on the bottom]

MPEG 4 Part 10 / AVC / H264

This is the most common codec used within the surveillance industry today. Advanced Video Coding (AVC) is very well documented, researched, and studied, which makes it easier for us within the forensic video world to deal with. It is used in most video recording devices and supported by virtually every video playback method; from an application on your cellphone to your TV, you cannot escape its presence.

The efficiency of the encoding increases considerably from MPEG 4 Part 2, with much more flexibility in terms of how it references data from previous frames. This brings a higher responsibility to those examining the compression within a legal setting.

One of the many additions to the H264 codec was a deblocking filter at both the encoder and decoder end. Remember our image of the blocks earlier when we looked at spatial compression? This filter smooths them out.

So, take into consideration all that we have learned so far about what is reliable at a pixel and block level. Now consider that during the encoding of H264, the edges of those blocks are softened. Further to this, the advances in the transmission of the encoded data have allowed for more control. This, too, comes with some warnings: error resilience is built in to ensure full-frame presentation, even when part of the data is lost or corrupted.

H264/AVC again utilizes profiles and levels to manage thousands of different configurations and use cases for this highly flexible codec. As the codec supports interlacing and resolutions up to 8K, there are very few limitations. However, as is always the case with surveillance video, the biggest challenges are storage and transmission.

Again, decoding tricks can be utilized in an attempt to cut down on data size. One of the common tricks is to record at half-width. So, rather than encode a FullHD frame (1920×1080), just encode 960×1080 and then set a display aspect ratio of 16:9. We looked at that in our acquisition blog series.
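The arithmetic behind the half-width trick, as a small sketch: the coded width is halved, and a sample aspect ratio (SAR) of 2:1 flagged in the stream tells the player to stretch each pixel horizontally back to the full display width.

```python
from fractions import Fraction

coded_w, coded_h = 960, 1080        # what is actually encoded and stored
sar = Fraction(2, 1)                # sample (pixel) aspect ratio flagged in the stream

display_w = int(coded_w * sar)      # 960 * 2 = 1920 on playback
dar = Fraction(display_w, coded_h)  # display aspect ratio
print(display_w, coded_h, dar)      # 1920 1080 16/9
```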

[Image: dvr player]

HEVC / H265 / MPEG-H Part 2

We are now entering the modern era of video coding. As you will have guessed, along with it comes a whole new level of technicality. Also, up until this point, there has been a clear evidential increase in the imagery. The amount of information that can be obtained has increased at each stage. However, things get a little more complicated now.

Let us start with the blocks, which are now called Coding Tree Units (CTUs). In a similar way to how H264 and previous MPEG codecs worked, there are still regions. However, these have now increased from a maximum of 16×16 to 64×64 pixels, with much more advanced prediction within the same frame. Several other improvements ensure that encoding and decoding are much more computationally efficient at higher resolutions.

This is where some issues begin to arise. There is a belief that newer technology and higher resolution bring higher fidelity. However, this is not always the case. At a scene level, taking in the camera’s entire field of view, there is often a quantitative and visual increase in quality. However, objects in motion at a CTU level have been found to possess less information. Over the past few years, we have been monitoring the increase in H265 submissions to Amped Support and have found several challenges for the FVA community.

4K footage, usually 3840×2160, or 8,294,400 pixels, is a lot of data. This is why the CTU structure of HEVC is required. However, it is only as good as the information it receives.

Lighting plays a huge part in image generation. In photography and film, correct lighting is key, and the same is true for compression to work correctly. It is lighting that allows the system to interpret changes in the scene correctly. When objects move against a background that has been built up over consecutive frames, the movement may not be correctly replicated. This is especially true if there is limited lighting in that area.

An example is that a person’s legs may be captured walking along a sidewalk. However, the upper body and head are not captured at all, with the trees and shrubs being prioritized. The legs have a higher contrast against the lighter concrete, whereas the upper body is darker against the trees.

Next, we have the lower shutter speed of many CCTV cameras. This allows more light in, and as we know, light is king! However, a lower shutter speed means motion blur.
An object that suffers from motion blur can be restored when an image has low compression. However, this tends not to be the case with many surveillance systems.

For example, the minimum number of vertical pixels needed to visualize the letter E is 7.
We have seen examples of license plates captured with 4K HEVC that are over 50 pixels in height, but where the spatial compression used has removed all the high-frequency detail. The result is CTUs formed of a single color. It is impossible to restore detail when there is no detail to restore.

Finally, we have an increase in high-resolution cameras being installed to cover much larger areas than may have been considered before. Previously, a camera may have been 1280×720 and only covered the entrance to a shop. Now, there is a single 4K dome camera covering the entire premises. This may have benefits at the installation and data management stage for the shop owner.

However, whereas the smaller cameras had a narrower field of view to ensure detail, the larger, single camera loses it. A mark or a scar that was replicated correctly before is gone, and now even a person’s face is difficult to recognize. All of these issues (lighting, shutter speed, field of view, and compression) happen with other codecs, but they seem more prevalent with HEVC.

[Image: blurred image of a license plate]

In this example, you can see the loss of high-frequency detail in a static license plate compared to using H264 with the same camera and lighting. This small loss can make a huge difference when attempting restoration and enhancement to establish facts.

There are many other codecs that you may encounter within source files: wavelet variants, JPEG2000, H263, and WMV (the codec, not the file extension!). However, they are mainly what we would refer to as legacy codecs, and even WMV is now rarely used as an output codec and format.

We will look at delivery, or output codecs soon, but first, we have the ones in between.

Intermediate Codecs

An intermediate codec is one used to continue processing the video.
They are used in a variety of scenarios, the first being the transcoding of proprietary video. We will look more at the transcoding process later in the series. However, if it is required, and the video requires restoration and/or enhancement, then that transcoding must be lossless. You can imagine that if you transcoded with a lossy codec, the process might destroy the very data you need!

The next reason may be to assist in the processing workflow. It may be necessary to conduct several restoration processes and some enhancement, apply various annotations to an evidential video, and then have this in a frame size suitable for the presentation requirements. Following on from this base video, there may be several further annotations to highlight and track persons of interest. It is often advantageous to create the base file as a separate source before continuing with the many other requests and annotations.

Within Amped FIVE there are two completely lossless codecs.

Uncompressed RGB

As the name implies, this is every Red, Green, and Blue value, for every pixel, with no data compression, sub-sampling, luminance changes, or spatial or temporal compression.
If you started with a compressed format and then decoded it… you would be writing the decoded output. This means that the resulting data size is going to be huge.

Using the 10-second 1280×720 Amped Testcard, the resulting file is 659 MB.
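The arithmetic behind that number is straightforward, assuming the Testcard plays at 25 frames per second (an assumption on our part) and 3 bytes per pixel:

```python
width, height, bytes_per_pixel = 1280, 720, 3  # uncompressed RGB: R, G, B per pixel
fps, seconds = 25, 10                          # assumed frame rate and duration

total = width * height * bytes_per_pixel * fps * seconds
print(f"{total:,} bytes = {total / 2**20:.0f} MiB")  # 691,200,000 bytes = 659 MiB
```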

FFV1

This codec is also completely lossless, but it uses data compression rather than image compression to reduce the space needed to store all the values. It is also optimized for decoding.

Using the same Testcard, the file size is only 19 MB.

How can we tell that there is no difference? By using Video Mixer and observing the Similarity Metrics.

[Image: video mixer]

You could also use the Blend Mode > Absolute Difference.
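The same check can also be scripted outside of Amped FIVE. Here is a hedged sketch using FFmpeg's psnr filter (the file names are placeholders): a truly lossless round trip reports an infinite PSNR, meaning not a single pixel value differs.

```python
import subprocess

# Compare a lossless output against its source; "psnr_avg:inf" means identical.
result = subprocess.run(
    ["ffmpeg", "-i", "original.avi", "-i", "lossless_copy.mkv",
     "-lavfi", "psnr", "-f", "null", "-"],
    capture_output=True, text=True,
)
# FFmpeg prints the PSNR summary on stderr.
print([line for line in result.stderr.splitlines() if "PSNR" in line])
```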

There is one other codec that must be mentioned here, as it may be used accidentally.

Rawvideo

This uses the YUV color space. What is decoded and presented within the viewer is in RGB. There will naturally be a loss during the switch to YUV, along with some chroma subsampling.

Here we have the Testcard on the top and the difference between the original and the Rawvideo output on the bottom.

[Image: testcard]

If our output had been the same, thereby being completely lossless, the bottom image would be 100% black.

As a comparison, the file size using this “slightly lossy” codec is 330 MB.
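As a sketch of where that loss comes from, the following uses the BT.601 full-range conversion equations: converting RGB to YCbCr, rounding to 8-bit integer storage, and converting back does not return the exact original values, even before any chroma subsampling is applied.

```python
import numpy as np

rng = np.random.default_rng(1)
rgb = rng.integers(0, 256, (720, 1280, 3)).astype(float)
r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]

# BT.601 full-range RGB -> YCbCr, rounded to 8-bit integers for storage
y  = np.round( 0.299 * r + 0.587 * g + 0.114 * b)
cb = np.round(-0.1687 * r - 0.3313 * g + 0.5 * b + 128)
cr = np.round( 0.5 * r - 0.4187 * g - 0.0813 * b + 128)

# ... and back to RGB
r2 = y + 1.402 * (cr - 128)
g2 = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
b2 = y + 1.772 * (cb - 128)
restored = np.clip(np.round(np.stack([r2, g2, b2], axis=-1)), 0, 255)

print("max channel difference:", np.abs(rgb - restored).max())  # non-zero
```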

Delivery Codecs

The selection of a delivery codec will be dictated by the request and the viewing criteria. Along with the various decisions to be made on the size and layout of the imagery, we must also decide what codec to output to. This must be linked with any specific requirement on container format.

When using Video Writer in Amped FIVE, there are several options under Video Codec.

[Image: video writer]

MSMPEG4 v2 and v3

During the early days of MPEG4 Part 2, there seemed to have been a few issues with Microsoft and its requirements within the DirectShow multimedia framework. Consequently, they went their own way and created variants of the standard. These were designed to be formatted within their Advanced Streaming Format (ASF), using the WMV file extension. The quality and codec efficiency can be considerably lower than more modern compression schemes due to their age.

They are here for legacy compatibility purposes. There are still some jurisdictions that are stuck with older Microsoft DirectShow limitations and are unable to easily view anything not designed specifically for this framework.

MJPEG

This is also mainly here for testing and review purposes. The JPEG quality usually results in a considerable loss in fidelity, but it is useful if an MJPEG source video is required for validation or testing purposes.

H264 and H265

We have placed these together as they both have the same control options. The Quality options are:

  • Visually Lossless (1)
  • High (12)
  • Default (18)

We use the Constant Rate Factor (CRF) for quality within the x264 and x265 encoders, and the values are marked in brackets above. These values ensure very high quality for evidential video. The text below, taken from the FFmpeg H.264 encoding guide, details the ranges very well.

“The range of the CRF scale is 0–51, where 0 is lossless (for 8-bit only, for 10-bit use -qp 0), 23 is the default, and 51 is the worst quality possible. A lower value generally leads to higher quality, and a subjectively sane range is 17–28. Consider 17 or 18 to be visually lossless or nearly so; it should look the same or nearly the same as the input but it isn’t technically lossless.”

In our testing, these values provide a good balance between maintaining quality, encoding speed, and file size.
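For reference, the equivalent quality setting when driving FFmpeg's x264 wrapper directly would look something like this sketch (the file names are placeholders, and this is not how Amped FIVE invokes its encoder internally):

```python
import subprocess

# Encode with a Constant Rate Factor of 17: visually lossless or nearly so.
subprocess.run([
    "ffmpeg", "-i", "input.mkv",
    "-c:v", "libx264", "-crf", "17",
    "output.mp4",
], check=True)
```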

Hardware acceleration options:

  • None
  • CUDA
  • Quicksync

With no acceleration, your system will simply use the CPU’s standard capability to encode the frames.
With CUDA or Quicksync, however, this process will be passed to a compatible NVIDIA Graphics Processing Unit (GPU) or Intel CPU function, if available. As this will utilize a different version of the encoder, the quality settings will change slightly.

During testing, we identified the settings that closely matched the CRF values detailed above.

It is important to understand how the final image gets created before it is sent to the encoder. Here we have two chains, starting with the same video and ending in quite similar results.

[Image: history tab]

The first chain has two very powerful but computationally intensive filters at the end.
The second has an extra filter, but none puts too much of a strain on the processor.
There is a time and a place for all filters: the “Smart” filters are extremely useful but can impact further processing.

The first chain can only decode at approximately 1 fps, whereas the chain without the intensive filters can decode at over 25 fps. This is important to be aware of: no amount of hardware acceleration can speed up the processing required before the frames reach the encoder. However, if you only cropped the video, placed some text annotation, and selected the range required for the incident, then hardware acceleration will significantly speed up the video writing task.

FourCC

We cannot write an article about codecs without commenting on the powerful four-character code (FourCC). Mainly used within the AVI container, the FourCC identifies the specific codec required to decode the video.

Many of you will probably have asked what type of video an investigator has, and have been met with the response, “It’s just an AVI”. As you know, this is merely the container, and you need to know what the codec is. Some examples:

  • IMM4
  • SN40
  • HEV1
  • PLV1

There are hundreds, and again, these do not have to be registered, so manufacturers can do their own thing.
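For the curious, here is a hedged sketch of how the video FourCC can be read straight from an AVI's stream header: in the standard RIFF layout, the 'strh' chunk holds the stream type ('vids' for video) immediately followed by the handler FourCC. Proprietary variants that break the layout will defeat this, and the file name is a placeholder.

```python
def read_video_fourcc(path: str) -> str | None:
    """Scan an AVI for the 'strh' chunk of the video stream and return
    the codec FourCC (fccHandler) that follows the 'vids' stream type."""
    with open(path, "rb") as f:
        data = f.read(64 * 1024)  # the headers live near the start of the file
    i = data.find(b"strh")
    while i != -1:
        if data[i + 8 : i + 12] == b"vids":  # fccType == video stream
            return data[i + 12 : i + 16].decode("ascii", "replace")
        i = data.find(b"strh", i + 4)
    return None

print(read_video_fourcc("evidence.avi"))  # e.g. 'M4S2' or 'IMM4'
```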

The Rawvideo file that we looked at earlier displays the following FourCC when analyzed with MediaInfo.

[Image: codec]

We will look more at these shortly.

Evaluation and Validation

The ability to perform visual and quantitative assessments within Amped FIVE after the writing process enables users to conduct their own tests and retain the projects for any future requirements. There will always be a slight change in the visual data when lossy compression is used. However, as stated at the beginning of this article, the ability to evaluate the level of loss and prove that the facts are retained without affecting evidential integrity is paramount.

Audio Codecs

It is only fair that audio gets a mention, as the rules regarding compatibility between codecs and containers also come into play here. It may be possible for a video encoded with a specific codec to reside inside a certain container format, such as MP4. However, if the video stream has an associated audio stream, the selection of a compatible audio codec must also be considered.

When selecting Video Writer in Amped FIVE, within a chain with audio being decoded, you have several options regarding the audio codec to be used.

[Image: audio codec]

We will not go into these here, but we must highlight some important functionality.
If you do not wish to include the audio, simply select “None”.
This enables you to write one video with the audio and another without, for example.

Next, if you select an incompatible audio codec for the chosen container, then this will be detected.

[Image: amped five notification]

If that audio codec is important, you can obviously then change the container to suit.

Container Formats

In our introduction to this series, we learned how the multimedia container format is the controller of the video and audio streams inside. Without having some form of control over those streams, there would be challenges in ensuring synchronization.

We briefly looked at the AVI format in the first post, so where better to start?

AVI (Audio Video Interleave)

It says something about the strength of the AVI format that it is still going strong after all these years. Although perhaps not as efficient as the other containers we will look at, its strength comes down to its very simple structure. There are, though, some issues that must be addressed.

Based on the Resource Interchange File Format (RIFF) from the early 1990s, the container splits the data inside into chunks, with a header at the top and then an index at the end. A huge amount of information is available about this format, with breakdowns of the header and how the timing and indexes work. To avoid duplication and not reinvent the wheel for this and the other formats, we will concentrate on some issues affecting forensic analysis.

The alarm for us, when receiving an AVI video file as evidence, is the age of this format. We must, therefore, ask some questions.

  • Does the format contain video encoded with an old legacy or proprietary codec?
  • Does the format contain a video encoded with a modern codec that it was not designed for?

Here we have two AVI files as seen in Windows Explorer:

[Image: two avi files]

Windows has identified both as video, but the first one does not have a thumbnail. This is an immediate flag to warn you that Microsoft Media Foundation (the newer multimedia framework within Windows 10 and 11) has not been able to decode the video and create a thumbnail.

Let us then look at that one first. It is possible to read a lot of the format data by loading the file into the Advanced File Info within Amped FIVE, including the codec FourCC.

[Image: codec]

From here we can attempt several things. Firstly, we can attempt to load it, but as the FJHT codec is proprietary, we would need to source it and then install it. We DO NOT recommend installing surveillance codecs onto a host computer, though. These can conflict with and contaminate a working environment, so a virtual machine or sandbox is recommended.

Codec packs will often come with an installer, such as the example below.

installer

This will ensure that the necessary DLLs are placed into the correct directory and registered for use within the Microsoft multimedia framework.

It must be remembered, though, that you will then be decoding the video in the way the manufacturer wants you to. They may apply further decoding changes such as deblurring, sharpening, or contrast correction.

Next, we could attempt conversion through the Amped Engine, and we will look at that in the next post in this series.

That was the first AVI, which was using an old legacy proprietary codec. How about the next one?

Here we have the video, which loaded into FIVE with no codec required. For this example, we are using the Microsoft DirectShow multimedia framework. We will look specifically at the decoding of files later in the series.

We can see from the basic file info that the codec being used has a FourCC of M4S2.

[Image: amped five parameters]

DirectShow, FFprobe, and MediaInfo all report 29,967 frames with a playback rate of 25 fps. However, with the AVI container, these numbers refer to the total number of chunks in the stream and the fixed playback rate of those chunks.

There are, in fact, only 5,994 video frames in the stream. To achieve a variable frame duration, empty chunks have been inserted where appropriate. Where empty chunks are encountered during playback, the duration of the last decoded frame is extended by the fixed chunk duration.

Conducting a Frame analysis reveals what is happening.

[Image: frame analysis]

The footage was actually recorded at around 5 fps, with a codec that supports variable frame durations. When the investigator selected the AVI “Open Format” option during the acquisition, the DVR did its best. A single fixed chunk duration has been set in the stream header. Therefore, we have a frame of video, and it is held there for several “empty” chunks until the next image is presented.

A common mistake when analyzing files like this is interpreting the pkt_duration as the frame duration. As you can tell, that would assume there are 29,967 frames. The real frame durations are highlighted in green and are computed from the Presentation Time Stamps.
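As a hedged sketch of that computation (field names in ffprobe's JSON output can vary slightly between FFmpeg versions, and the file name is a placeholder), the real frame durations are the deltas between consecutive presentation timestamps, not the per-chunk pkt_duration:

```python
import json
import subprocess

out = subprocess.run(
    ["ffprobe", "-v", "quiet", "-print_format", "json",
     "-show_frames", "-select_streams", "v:0", "evidence.avi"],
    capture_output=True, text=True, check=True,
)
frames = json.loads(out.stdout)["frames"]

# Only decodable frames carry a presentation timestamp; empty padding
# chunks never reach the decoder.
pts = [float(f["pts_time"]) for f in frames if "pts_time" in f]
durations = [b - a for a, b in zip(pts, pts[1:])]
print("mean real frame duration:", sum(durations) / len(durations), "s")
```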

Let’s have a look at a second in time, and see how these break down.

[Image: frames breakdown]

For this second, we have five consecutive frames shown vertically. The first 40 ms chunk is the set duration of each frame; however, inserting extra empty 40 ms chunks extends that duration. The first chunk of each frame contains the decodable frame data, and the following ones are empty until the next decodable frame.

Dealing with AVIs can cause a lot of unnecessary challenges because of the limitations of the format. We see all modern codecs being forced into it, as it may appear to be an easy option for a quick preview. However, conducting a native acquisition first, and then allowing tools such as Convert DVR in Amped FIVE to deal with it correctly, avoids problems and questions regarding timing. AVIs also often lack the data timestamp, or have been transcoded with the timestamp burned in.
We will look more at these during the next posts in the series.

MP4 (MPEG4 Part 14)

This is probably the most common source video format. Based on Apple’s MOV format, the MP4 container also splits everything up, but this time into atoms.

In AVI, if the header is damaged or corrupted, it may be possible to decode some of the video inside. With MP4, however, if some of the atoms are corrupted, the process becomes a little trickier.

Here we have the end of an MP4 video, and the MOOV atom is highlighted in the hex.

[Image: hex viewer]

If the power was lost to the recording device before it got to writing the important atoms at the end, it may not be possible to reconstruct that data without another working file from the same system.
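A minimal sketch of checking whether the critical atoms survived, by walking the file's top-level boxes: each box starts with a 32-bit big-endian size and a four-character type, with a size of 1 indicating that a 64-bit size follows (the file name is a placeholder).

```python
import struct

def top_level_atoms(path: str):
    """Yield the type of each top-level MP4 box; a playable file
    should contain at least 'ftyp', 'mdat', and 'moov'."""
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            size, kind = struct.unpack(">I4s", header)
            consumed = 8
            if size == 1:                    # 64-bit size in the next 8 bytes
                size = struct.unpack(">Q", f.read(8))[0]
                consumed = 16
            elif size == 0:                  # box extends to the end of the file
                yield kind.decode("latin-1")
                break
            yield kind.decode("latin-1")
            f.seek(size - consumed, 1)       # skip the box body

print(list(top_level_atoms("evidence.mp4")))  # e.g. ['ftyp', 'mdat', 'moov']
```

A truncated recording will typically show 'ftyp' and 'mdat' but no 'moov', which explains why standard players refuse to open it.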

How the data is stored is very reliant on the formatting and the final processing of the container file.

Another common issue with the MP4 extension is that the file sometimes does not conform to the MP4 standard.
Manufacturers, as previously discussed, can use any extension they like. They can also remove or add different atoms. This can make format analysis a challenge. However, it can also assist in ensuring integrity: files written in that way are difficult to manipulate while retaining the proprietary formatting.

MKV (Matroska)

It is not often seen as a source file container format (which is a shame). Also, unfortunately, some manufacturers that have used it have not complied with the standards. The result is video streams within the open container that cannot be easily extracted for forensic use.

We are huge fans here at Amped Software of using MKV as a container for the extracted video streams from proprietary formats.

One of the biggest benefits is evidential archiving.
It is a requirement in many jurisdictions to retain evidence for a considerable time, and with video evidence, this can be a problem. Even now, many have issues playing back proprietary data that requires a player. It may mean using Windows XP, 7, or, even worse, 8!
Having all the footage in an open format, with no visual loss, is a huge advantage to evidence archivers.

We will look more at this during the Video Conversion article.

MKV stores its data in blocks, with metadata and indexing of all the included streams.

WMV (Windows Media Video) or ASF (Advanced Streaming Format)

Files with the WMV file extension use the Microsoft Advanced Streaming Format.
It can be a little troublesome in our video surveillance world, so let us break it down into Source and Delivery.

Source:
If you receive a .asf file containing video encoded with, perhaps, H264 or H265, then it is likely a surveillance stream that has been placed into this Windows container rather than left in the proprietary format. You may wish to question whether this process has resulted in any timing differences or data loss.

Delivery:
If you need to stay within the Windows Media framework, use either of the MS codecs, MSMPEG4 v2 or v3, that we looked at earlier, but with the .wmv file extension.

The format is the same, but the file extensions are different! We said it was a little complicated!
In summary:

  • .asf = will accept non-Microsoft codecs inside
  • .wmv = will only accept Microsoft codecs inside

Finally

Have you made it here? Well done, although we appreciate it may have taken some time. It has been a long post, and maybe one of the largest we have written. However, it is impossible to separate the two keystones of digital multimedia: the codec and the container.

We hope it has given you some history behind the codecs used by the systems we encounter and an understanding of the challenges we face within the forensic video community. For those new to the world of CCTV investigation, this should be a good stepping stone into the many rabbit holes of codecs and containers. We have just scratched the surface. It’s quite fascinating how much you can find when you dig a little deeper.

We have now covered proprietary data, along with standard codecs and formats. In our next post, we will head into the conversion of both these sources.

Until then, stay safe.
