Mute the Sound: Chaining Vulnerabilities to Achieve RCE on Outlook: Pt 2

Written by

Ben Barnea

December 18, 2023

Ben Barnea is a Security Researcher at Akamai with interest and experience in conducting low-level security research and vulnerability research across various architectures, including Windows, Linux, IoT, and mobile. He enjoys learning how complex mechanisms work and, more important, how they fail.

Introduction

Using the vulnerability we described in part 1 of this blog series, we once again have the ability to play a custom sound file on the target, abusing Outlook’s reminder sound feature. To leverage this ability and transform it into a full remote code execution (RCE), we started searching for vulnerabilities in the parsing of sound files on Windows.

Attack surfaces

The sound file played by Outlook is in Waveform Audio File Format (WAV). It is played through the PlaySound function, which receives the sound file path. PlaySound loads the file, parses it, and then calls soundOpen, which in turn calls the various wave functions, such as waveOutOpen.

WAV files act as a container (or wrapper) for multiple audio codecs. A codec is a program or code that encodes or decodes a data stream (such as image, video, or audio). Usually, the codec is pulse-code modulation (PCM), which is a simple way to represent sampled analog signals.

There are three major attack surfaces where we can search for vulnerabilities:

  1. WAV format parsing

  2. Audio Compression Manager 

  3. Different audio codecs

WAV format parsing

WAV format parsing is implemented inside the function soundInitWavHdr in winmm.dll (the library that implements Windows Multimedia API). The attack surface presented there is not large and seems to have been reviewed; we did not find any vulnerabilities there.

What is Audio Compression Manager?

The Audio Compression Manager (ACM) is the code responsible for handling cases in which the codec used in the WAV file is not simple PCM, and the audio therefore needs to be decoded by a custom decoder. Those decoders are implemented in files with the .acm extension. One popular example is the MP3 codec, which is implemented in l3codeca.acm. Each codec is handled by a driver (not a kernel-mode driver, although similar in function) that is registered through the ACM.

Whenever a transformation is needed, such as transforming MP3 to PCM or vice versa, the ACM comes into action and manages this transformation. When we use a WAV file where PCM is not used, the ACM will be queried to see if the codec specified in the file itself exists and can handle the transformation (Figure 1). 

Fig. 1: Special audio codecs are loaded using the ACM

Each ACM driver must implement functions such as acmdStreamSize and acmdStreamOpen. The first returns the size (in bytes) needed for the output of the transformation; the second one creates a stream struct and sets the appropriate fields, such as the decoding callback function.

ACM’s attack surface is not that large. Yet, we managed to find a vulnerability in that code, as we’ll show later in this blog post.

Different audio codecs

For the last attack surface, we have the different audio codecs that are installed by default. The codec is specified in the WAV file in one of two ways:

  1. The wFormatTag in the FORMAT chunk

  2. If the wFormatTag is WAVE_FORMAT_EXTENSIBLE, then SubFormat will hold the GUID for the audio codec.

The list of available codecs is described in the Appendix.
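The two ways of naming the codec can be illustrated with a small parser. This is a minimal sketch, assuming the standard RIFF/WAVE layout in which the FORMAT chunk carries the ID "fmt " and the SubFormat GUID sits at byte offset 24 of a WAVEFORMATEXTENSIBLE body; read_codec_id and build_wav are hypothetical helper names, not part of winmm.dll or any Windows API.

```python
import struct
import uuid

WAVE_FORMAT_EXTENSIBLE = 0xFFFE


def read_codec_id(wav_bytes):
    """Return wFormatTag, or the SubFormat GUID for WAVE_FORMAT_EXTENSIBLE."""
    if wav_bytes[0:4] != b"RIFF" or wav_bytes[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    offset = 12
    while offset + 8 <= len(wav_bytes):
        chunk_id, chunk_size = struct.unpack_from("<4sI", wav_bytes, offset)
        body = wav_bytes[offset + 8 : offset + 8 + chunk_size]
        if chunk_id == b"fmt ":
            (format_tag,) = struct.unpack_from("<H", body, 0)
            if format_tag == WAVE_FORMAT_EXTENSIBLE:
                # 18-byte WAVEFORMATEX + 2-byte Samples + 4-byte dwChannelMask,
                # then the 16-byte SubFormat GUID at offset 24.
                return uuid.UUID(bytes_le=body[24:40])
            return format_tag
        # Chunks are word-aligned: an odd size is followed by one pad byte.
        offset += 8 + chunk_size + (chunk_size & 1)
    raise ValueError("no fmt chunk found")


def build_wav(fmt_body):
    """Hypothetical helper: wrap a fmt chunk body in a minimal RIFF/WAVE header."""
    fmt_chunk = b"fmt " + struct.pack("<I", len(fmt_body)) + fmt_body
    riff_body = b"WAVE" + fmt_chunk
    return b"RIFF" + struct.pack("<I", len(riff_body)) + riff_body


# Mono 16-bit PCM at 44.1 kHz: wFormatTag 1 identifies the codec directly.
pcm_wav = build_wav(struct.pack("<HHIIHH", 1, 1, 44100, 88200, 2, 16))
print(read_codec_id(pcm_wav))
```

For a WAVE_FORMAT_EXTENSIBLE file the same function returns a uuid.UUID, which can be compared against the GUIDs listed in the Appendix.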

Audio signal processing basics

Before diving into the code, let's familiarize ourselves with some of the basics of audio signal processing. If you’re already familiar with these concepts, feel free to skip to the next section.

When we hear sound, what we actually hear is vibration that propagates through a transmission medium. Sound is the reception of those vibration waves as picked up by our ears and perceived by our brain.

An audio signal is a continuous waveform. To process it digitally, we need to convert it from an analog signal to a digital signal (e.g., using an analog-to-digital converter, or ADC). This conversion is the discretization of the analog signal. To do it, we measure the analog signal at evenly spaced points in time; the measured values are called samples. The sampling rate (also known as sampling frequency) determines how many samples are taken per second. Higher sampling rates capture more detail, but they also require more storage and processing.

Other than the sampling rate, we have the sample size, which is how many bits are used for each sample. Once again, the higher the sample size is, the better quality (or closer to the original sound) the sample is. Sample sizes are usually 16 bits or 24 bits per sample.
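To make sampling rate and sample size concrete, here is a minimal sketch that samples a pure sine tone and quantizes each sample to a signed integer. The function name sample_sine and the chosen tone, rate, and duration are illustrative assumptions, not values taken from the research.

```python
import math


def sample_sine(freq_hz, sample_rate_hz, duration_s, bits_per_sample=16):
    """Sample a sine wave and quantize each sample to a signed integer."""
    num_samples = round(sample_rate_hz * duration_s)
    # Largest positive value representable with the given sample size.
    max_amplitude = 2 ** (bits_per_sample - 1) - 1
    return [
        round(max_amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate_hz))
        for n in range(num_samples)
    ]


# 10 ms of a 440 Hz tone sampled 8,000 times per second at 16 bits per sample.
samples = sample_sine(440, 8000, 0.01)
storage_bytes = len(samples) * 16 // 8
print(len(samples), storage_bytes)
```

Doubling either the sampling rate or the sample size doubles the storage needed for the same duration, which is the trade-off described above.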

Audio codecs are based on psycho-acoustic models. Psycho-acoustics is the scientific study of sound perception and audiology: how the human auditory system perceives various sounds. For example, the human hearing range is from 20 Hz to 20,000 Hz. Thus, to further reduce the file size, audio codecs may get rid of frequencies that cannot be heard by humans. Additionally, the audio codec can get rid of signals whose volume is not strong enough for the human ear to hear. For example, a 20 Hz sound won’t be heard if it’s less than 60 decibels.

There are many more examples of psycho-acoustic models; for example, the masking of quiet signals when they are close in frequency or time to a loud signal.

The analysis, interpretation, and modification of signals are performed using filter banks, which divide the signal into subbands — distinct bands of components that allow for a more detailed examination and manipulation of specific portions of the signal. Commonly used filter banks are the DCT and polyphase filter banks.

With this basic knowledge in mind, we can dive into researching different codecs.

First attempt: Out-of-bounds write to the samples buffer

We first tried looking at MP3, since compared with the other codecs, MP3 is much more complex. Most of the other codecs only perform somewhat simple conversions, while MP3 has multiple steps during its decoding process (Figure 2).

Fig. 2: MP3’s decoding process [source: https://www.researchgate.net/publication/289674716_A_Robust_Data_Embedding_Method_for_MPEG_Layer_III_Audio_Steganography]

MP3 audio data is organized into a series of frames, with each frame representing a small segment of audio. A frame consists of a header and the audio data. The audio data is compressed using Huffman coding. Each frame represents exactly 1,152 frequency-domain samples per channel (mono/stereo). This is partitioned into two chunks called granules, each of 576 samples. Each frame also holds information related to its decoding, called side info (Figure 3).

Fig. 3: Audio mono frame structure

Most of the operations performed as part of the MP3 decoding process are complex, and while theoretically they could be an interesting (and cool) place to look for subtle vulnerabilities, in practice many of the operations (such as the modified DCT, polyphase filter bank, and alias reduction) are carried out on a buffer that always holds exactly 576 samples. Thus, finding an out-of-bounds write vulnerability there is not plausible. One interesting place is the Huffman decoding, as it naturally works on more dynamic data (as opposed to the fixed 576-sample buffer).

MP3 Huffman decoding

Huffman coding of a granule (576 samples) is implemented using code tables (instead of a binary tree). The total frequency range from 0 Hz to 22,050 Hz (the Nyquist frequency for a 44.1 kHz sampling rate) is partitioned into five regions (Figure 4):

1-3. Three regions of “big values”: samples whose value is between -8,206 and 8,206

4. “count1 region”: quadruples of values -1, 0, or 1

5. “rzero region”: higher frequency values are assumed to have low amplitudes, and therefore don’t need to be coded; these values are equal to 0.

Fig. 4: The partition of the 576 samples’ frequency lines into the different regions [http://www.mp3-tech.org/programmer/docs/mp3_theory.pdf]

Each region has its own Huffman tables (except for the rzero region, which is not coded), and thus samples of different frequencies are coded differently.

The classification of samples into regions relies on the following variables:

  • big_values: specifies how many total samples are in the big values region

  • region0_count and region1_count: partition the big values region into subregions; subtracting their sum from big_values yields the number of samples in region2

  • part2_3length: specifies the total number of bits used for the scalefactors (part 2) and the Huffman-coded data (part 3)

The process of decoding the 576 samples happens as follows:

  • Decode the samples in the big_values region

  • Decode the samples in count1 region

  • If the number of bits processed is greater than part2_3length, then we’ve consumed all of the input data and even overread it; therefore, subtract 4 from total_samples_read (i.e., discard the last four decoded samples)

  • If there are samples left, fill them with zeros (this forms the rzero region)

Table 1 shows this logic as pseudo-code.

  total_samples_read = 0;

  // Decode big_values region
  [redacted for brevity]

  // [1] Decode count1 region
  for (int i = 0; i < count1; i++) {
      samples[total_samples_read++] = huff_decode(bitstream, count1_huff_table);
  }

  // [2] Overread check; nothing verifies that total_samples_read >= 4
  if (bits_processed > part2_3length) {
      // Overread. Throw last 4 samples
      total_samples_read -= 4;
  }

  // [3] Fill rzero region with zeros
  for (int i = total_samples_read; i < 576; i++) {
      samples[i] = 0;
  }

Table 1: Pseudo-code of Huffman decoding logic

Unfortunately, the code misses one specific edge case, which leads to an integer underflow. The edge case occurs when all of the following hold:

  • The big values region size is 0

  • The count1 region size is 0

  • part2_3length is 0

  • bits_processed is greater than 0

In this specific case, bits_processed is larger than part2_3length (it reaches a nonzero value during scalefactor decoding, before the Huffman decoding starts). Thus, the code “throws” the last four samples. As the code did not process any samples, total_samples_read is 0, so the subtraction underflows and the code thinks we processed -4 samples. Now, it fills the rzero region as follows:

  • Set the buffer pointer to &samples[total_samples_read]. This points to 16 bytes before the samples buffer.

  • Set the write size to 576 - total_samples_read = 576 - (-4) = 580 integers.

Thus, we have an out-of-bounds write directly before the samples buffer with the value zero. Neat!
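The edge case can be replayed with a small model of the bookkeeping. This is only a sketch of the pseudo-code's index arithmetic, not the decoder itself: rzero_fill_indices is a hypothetical name, the Huffman decoding is elided, and the 4-byte sample width is inferred from the "16 bytes before" figure above.

```python
GRANULE_SAMPLES = 576
BYTES_PER_SAMPLE = 4  # assumption: samples are 4-byte integers


def rzero_fill_indices(big_values_count, count1_count, bits_processed, part2_3length):
    """Return the sample indices that the rzero fill loop would write to."""
    total_samples_read = 0
    total_samples_read += big_values_count  # big_values region (decoding elided)
    total_samples_read += count1_count      # count1 region (decoding elided)
    if bits_processed > part2_3length:
        # Overread: drop the last 4 samples, with no lower-bound check.
        total_samples_read -= 4
    return list(range(total_samples_read, GRANULE_SAMPLES))


# Edge case: both regions empty, part2_3length is 0, scalefactors consumed bits.
indices = rzero_fill_indices(0, 0, bits_processed=8, part2_3length=0)
print(indices[0], len(indices), -indices[0] * BYTES_PER_SAMPLE)
```

With well-formed input, for example 100 big values and 8 count1 samples with no overread, the first index written is 108 and every write stays inside the buffer.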

Fig. 5: The out-of-bounds write from the samples buffer

So why didn’t this vulnerability receive a CVE? The samples buffer is part of a struct, and the field right before the samples buffer is the scalefactor array. This is an array with fields we already control, and thus we don’t really have an interesting impact here.

The same vulnerability also occurs after the code has dequantized the samples. It once again fills the rzero region, this time in the dequantized buffer. Want to guess what’s before the dequantized buffer? The Huffman decoded sample buffer. This is the same samples buffer we saw earlier, which we also have control over. So, once again, we don’t have real impact.

Those out-of-bounds writes still exist in the MP3 decoder (reachable through both WAV and .mp3 files), and according to Microsoft they may be fixed in the future. Although no impactful vulnerability was found during the reversing of the codec, we believe there may be vulnerabilities hiding in the different complex operations carried out by the decoder.

Second attempt: Integer overflow in the IMA ADPCM codec

Our next attempt is the IMA ADPCM codec, which is implemented in imaadp32.acm. As we now know, the ACM manages transformations between different codecs. To register a codec, its driver must implement the ACM driver functions. One such function is acmdStreamSize, which is reached through acmStreamSize and returns the number of bytes needed for the destination buffer.

The IMA ADPCM codec calculates the destination buffer size based on the size of the input payload (cbSrcLength), alignment (nBlockAlign), and samples per block (wSamplesPerBlock; Table 2).

  (cbSrcLength / pwfxSrc->nBlockAlign) * (pwfxSrc->wSamplesPerBlock * pwfxDst->nBlockAlign)

Table 2: Buffer size calculation

Before multiplying, the code makes sure the computation will not result in an integer overflow (Table 3).

  SrcNumberOfBlocks = cbSrcLength / pwfxSrc->WaveFormat.nBlockAlign;
  v14 = pwfxSrc->wSamplesPerBlock * pwfxDst->nBlockAlign;

  if ( 0xFFFFFFFF / v14 < SrcNumberOfBlocks )
    return ERROR_OVERFLOW;

  IsThereRemainder = cbSrcLength % pwfxSrc->WaveFormat.nBlockAlign;

  if ( IsThereRemainder )
    ++SrcNumberOfBlocks;

  DstBufferLengthInBytes = v14 * SrcNumberOfBlocks;

Table 3: Computation check intending to prevent integer overflow

This check is not enough to prevent an overflow. If the division cbSrcLength / pwfxSrc->nBlockAlign leaves a remainder, the code increments the quotient, which is then used in the multiplication. The overflow check runs before this increment and therefore doesn't cover it. As a result, we can still overflow the destination buffer length by specifying custom values.

We need to provide a cbSrcLength that leaves a remainder when divided by pwfxSrc->nBlockAlign.

Table 4 shows an example of values leading to an integer overflow.

  cbSrcLength = 0x71c71c72
  pwfxSrc->nBlockAlign = 8
  pwfxSrc->wSamplesPerBlock = 9
  pwfxDst->nBlockAlign = 2

Table 4: Example of values leading to integer overflow

This results in a destination buffer of 0xE bytes, while the destination buffer should be much larger.
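The arithmetic can be checked by replaying the decompiled logic with explicit 32-bit wraparound. This is a sketch that mirrors Table 3, not the codec's actual code; ima_adpcm_dst_size is a hypothetical name, and returning None stands in for ERROR_OVERFLOW.

```python
UINT32_MAX = 0xFFFFFFFF


def ima_adpcm_dst_size(cb_src_length, src_block_align, samples_per_block, dst_block_align):
    """Mirror the Table 3 size computation using 32-bit unsigned arithmetic."""
    blocks = cb_src_length // src_block_align
    bytes_per_block = (samples_per_block * dst_block_align) & UINT32_MAX
    if UINT32_MAX // bytes_per_block < blocks:
        return None  # stands in for ERROR_OVERFLOW
    if cb_src_length % src_block_align:
        blocks += 1  # this increment happens after the overflow check
    return (bytes_per_block * blocks) & UINT32_MAX


# Values from Table 4: the check passes, the increment pushes the product past 2**32.
print(hex(ima_adpcm_dst_size(0x71C71C72, 8, 9, 2)))
```

Without wraparound the product is 238,609,295 * 18 = 4,294,967,310 bytes, which is 14 more than 2^32; only those low 14 (0xE) bytes survive the truncation.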

Although this seems like an integer overflow that can lead to an out-of-bounds write, the decode function correctly makes sure that no writes happen past the end of the allocated buffer, and does not assume that the buffer was allocated with the correct size.

So, although we provide many samples, in practice the code stops once the destination buffer is full. The same behavior occurs in the ADPCM codec (implemented in msadp32.acm).

Third attempt: Integer overflow in ACM (CVE-2023-36710)

Finally, we found a nice vulnerability in the ACM code. As part of the process of playing a WAV file, the function mapWavePrepareHeader in the ACM manager (implemented in msacm32.drv) is called.

This function has an integer overflow vulnerability. It calls acmStreamSize, which calls the driver's callback. Recall that this function returns the size needed for the destination buffer. After receiving this size, mapWavePrepareHeader adds 176 bytes (the size of the stream header that will precede the destination buffer) without checking for overflow. The result of this addition is passed to GlobalAlloc (Figure 6).

Fig. 6: The vulnerable code in mapWavePrepareHeader

This is an exploitable issue. We can cause GlobalAlloc to allocate a very small buffer instead of a large one by causing acmStreamSize to return a value between 0xffffff50 and 0xffffffff. After this allocation, we can cause two out-of-bounds writes:

  1. The stream header values, such as the struct size, source, and destination buffer pointers and sizes. These values are partially controllable.

  2. The codec’s decoded values. These values are fully controlled.
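The range of triggering values follows directly from the unchecked addition. A minimal sketch, assuming the sum is truncated to 32 bits as the stated range implies; alloc_size is a hypothetical name standing in for the size that reaches GlobalAlloc.

```python
STREAM_HEADER_SIZE = 176  # bytes added by mapWavePrepareHeader
UINT32_MAX = 0xFFFFFFFF


def alloc_size(stream_size):
    """Size passed to the allocator after the unchecked 32-bit addition."""
    return (stream_size + STREAM_HEADER_SIZE) & UINT32_MAX


# The smallest stream size that wraps is 2**32 - 176.
smallest_wrapping = (UINT32_MAX + 1) - STREAM_HEADER_SIZE
print(hex(smallest_wrapping), alloc_size(smallest_wrapping), alloc_size(UINT32_MAX))
```

Any stream size in that range yields an allocation of at most 175 bytes, which is smaller than the stream header alone.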

To trigger the vulnerability, we need to provide a WAV sample whose decoded size is greater than or equal to 0xffffff50. Although this sounds easy to accomplish, we found throughout our attempts that it might not be possible to achieve in some codecs. For example, with the MP3 codec, the calculation includes a multiplication by either 1,152 or 576 (the number of samples per frame or per granule). The result of that calculation will never be in the range we want.

Finally, we managed to trigger the vulnerability using the IMA ADPCM codec. The file size is approximately 1.8 GB. By taking the mathematical limit of the size calculation, we can conclude that the smallest possible file size with the IMA ADPCM codec is 1 GB.
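One way to see where the 1 GB bound could come from is to look at how much the codec expands its input. The sketch below is an assumption-heavy illustration rather than the actual calculation: it assumes 4-bit mono IMA ADPCM decoded to 16-bit PCM, and the commonly documented relation between nBlockAlign and wSamplesPerBlock (which is consistent with the Table 4 values of 8 and 9). All function names are hypothetical.

```python
TARGET_DECODED_SIZE = 0xFFFFFF50  # smallest decoded size that triggers the wrap


def samples_per_block(block_align, channels=1, bits_per_sample=4):
    """Assumed IMA ADPCM relation: 4-byte header per channel, plus one header sample."""
    return ((block_align - 4 * channels) * 8) // (bits_per_sample * channels) + 1


def expansion_ratio(block_align, channels=1, dst_bytes_per_sample=2):
    """Decoded bytes produced per source byte for a given source block size."""
    dst_block_align = channels * dst_bytes_per_sample
    return samples_per_block(block_align, channels) * dst_block_align / block_align


def min_source_size(block_align):
    """Approximate source payload size needed to reach the target decoded size."""
    return TARGET_DECODED_SIZE / expansion_ratio(block_align)


for block_align in (8, 256, 2048, 65535):
    print(block_align, expansion_ratio(block_align), round(min_source_size(block_align)))
```

Under these assumptions the ratio is (4B - 14) / B for a block size B, which approaches but never reaches 4, so the source payload cannot drop below roughly 0xffffff50 / 4 bytes, or about 1 GB. With the small block size of Table 4 the ratio is 2.25, which gives a payload close to the 1.8 GB mentioned above.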

Exploitation of such a vulnerability is made easier when there’s a scripting engine available for dynamically crafting an exploit. Since Windows Media Player doesn’t have one, exploitation can be more challenging. It might still be possible (as demonstrated by Chris Evans in his “Advancing exploitation: a scriptless 0day exploit against Linux desktops” blog post). Yet, this vulnerability is more likely to be exploited successfully in the context of the Outlook application (or of instant messaging applications).

Summary

This blog series covered research that began with a vulnerability that was exploited in the wild. (Read Part 1, if you haven’t already.) The research journey then continued by looking for bypasses, and finally found a companion vulnerability to chain it to in order to achieve a zero-click RCE chain. Although those vulnerabilities are fixed, attackers continue to look for similar attack surfaces and vulnerabilities that can be remotely exploited.

As of now, the attack surface in Outlook that we researched still exists, and new vulnerabilities can be found and exploited. Although Microsoft patched Exchange to drop mails containing the PidLidReminderFileParameter property, we cannot rule out the possibility of bypassing this mitigation.

Appendix

This site lists all of the media types and codecs available in Microsoft’s Media Foundation.

Our testing indicates that only the following codecs are available through WAV in practice (wFormatTag values are in hexadecimal; the longer entries are SubFormat GUIDs):

0x01: PCM

0x02: ADPCM

0x06: A-LAW

0x07: U-LAW

0x11: IMA ADPCM

0x31: GSM 6.10

0x55: MPEG-1 Audio Layer III (MP3)

00000003-0000-0010-8000-00aa00389b71: IEEE Float

00000008-0000-0010-8000-00aa00389b71: DTS Audio

00000092-0000-0010-8000-00aa00389b71: Dolby Digital

00000164-0000-0010-8000-00aa00389b71: Microsoft WMA
