This blog is part of a series that takes a deep dive into the science behind Microphone Mist™ technology. The series also includes A groundbreaking approach to gain control that cuts conference call annoyance, Unified coverage map: a radically better technology for hybrid spaces and Why thousands of virtual microphones matter — acoustic spatial resolution in a 3D space. Each piece originally appeared in audioXpress magazine.
Conference call quality has been a concern since the earliest days of remote communication between individuals and teams. Calls can be plagued by numerous technical issues: poor microphone pickup and placement, difficulty hearing participants, inadequate noise control, and connection and bandwidth problems. With these issues in mind, establishing and maintaining high sound quality and a reliable conference call experience has proven to be very challenging. Compounding the problem, product designers need to develop products that work in as many spaces as possible, because they never know how a product is going to be installed and used.
A common issue with conference calling is a phenomenon called acoustic return echo. Nureva has developed a unique and innovative approach to effectively deal with acoustic return echo by adapting and calibrating the acoustic echo canceler in a manner that other systems are unable to achieve.
Acoustic return echo is the effect of callers hearing their voices in their headset speaker delayed in time after they have spoken, which sounds just like the echo you may hear in a canyon or very large room. Say you are at the Grand Canyon and you yell out the words, “hello out there!” Sometime later you will hear the same words echoed back at you, say maybe half a second to a second later. If the echo is really bad, you might hear the same phrase echoed back at you more than once ... maybe two or three times. Now, if this phenomenon happens to you during a conference call, it does not take much imagination to realize how bothersome this will quickly become, resulting most likely in listening confusion and poor intelligibility of the voice from the far end. The person at the far end of the call does not hear this echo as you do, so if the person starts talking before the return echo has dissipated, you will hear both your echo and the other person’s voice at the same time.
When a sound is produced by conference system speakers, it bounces around the room and is picked up by the system microphones, as illustrated in Figure 1a. If this effect is not dealt with, an acoustic return echo signal is generated. If there is no echo canceler circuit to eliminate this acoustic return echo signal, the remote participants hear their voices delayed in time in their earpiece — kind of like what happens when you yell into the Grand Canyon.
Figure 1a: When a sound is produced by conference system speakers, it bounces around the room and is picked up by the system microphones.
Figure 1b: Each speaker and microphone combination has its own unique set of acoustic echo path delays that are determined by the acoustic properties of the room.
Each speaker and microphone combination has its own unique set of acoustic echo path delays that are determined by the acoustic properties of the room. The sound emitted from each speaker takes a different path (room bounce) to each individual microphone, creating a unique acoustic echo path delay for each microphone. The more speakers and microphones you have installed, the more acoustic delay combinations your system creates, resulting in numerous acoustic return echoes that must be canceled. With so many variables and possible combinations, this is not a trivial problem to solve (see Figure 1b and Figure 1c).
Figure 1c: The sound emitted from each speaker takes a different path (room bounce) to each individual microphone, creating a unique acoustic echo path delay for each microphone.
Typically, the larger the room, the longer the acoustic delay will be. The room acoustic delay, also known as echo path delay, is the time in milliseconds that the signal takes to leave the system loudspeaker(s), bounce around the room and reach the system microphone(s). On top of that, a room can be dead sounding (minimally reverberant) or lively sounding (highly reverberant). The echo canceler needs to be able to handle any scenario, which adds to the processor complexity and system memory requirements.
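To put rough numbers on this, here is a small back-of-the-envelope sketch. It is illustrative only: the speed of sound, the path lengths, the 200 ms echo tail and the 16 kHz sample rate are assumed values, not figures from any particular product, and the function names are made up for this example.

```python
# Rough sizing of echo path delay and echo canceler filter length.
# All constants and example values are assumptions for illustration.

SPEED_OF_SOUND_M_PER_S = 343.0  # approximate value for air at room temperature


def path_delay_ms(path_length_m: float) -> float:
    """Time in milliseconds for sound to travel a path of the given length."""
    return 1000.0 * path_length_m / SPEED_OF_SOUND_M_PER_S


def filter_taps(tail_ms: float, sample_rate_hz: int) -> int:
    """Number of filter taps needed to span an echo tail of tail_ms."""
    return int(round(tail_ms * sample_rate_hz / 1000.0))


def echo_path_count(num_speakers: int, num_microphones: int) -> int:
    """Each speaker and microphone combination has its own echo path."""
    return num_speakers * num_microphones


if __name__ == "__main__":
    for label, path_m in [("short bounce", 3.43), ("long bounce", 17.15)]:
        print(label, round(path_delay_ms(path_m), 1), "ms")
    print("taps per path:", filter_taps(200.0, 16000))
    print("paths to cancel:", echo_path_count(2, 4))
```

A longer echo tail or a higher sample rate means more taps per path, and every extra speaker or microphone multiplies the number of paths, which is where the processing and memory cost comes from.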
A common approach to deal with return echo issues is adaptive echo canceling, as illustrated in Figure 2. Adaptive echo canceling has well-known constraints on when it can adapt, which can result in poor call quality. At a basic level, adaptive echo canceling compares the speaker output signal, which carries the remote talker’s conversation and is referred to as remote sound x(n), with the microphone input signal d(n), which contains that remote sound after it has been reflected by the room along the echo path h(n). The adaptive filter measures the residual e(n) and adjusts the echo canceler parameters that produce the echo estimate y(n) over time, with the goal of reducing the level of the acoustic return echo signal (remote sound) before it is sent out by the conference system to the remote participant.
What may not be obvious is that for the adaptive echo canceler to measure and adapt its filter parameters, it requires an active far-end talker signal x(n) and an in-room microphone signal d(n) that contains only the acoustic return echo of x(n). This significantly limits the adaptive echo canceler’s ability to manage real-time room changes and provide high-quality conference call performance.
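As a concrete illustration of the adaptive approach, the sketch below uses the normalized least mean squares (NLMS) algorithm, one widely used adaptive filter. It is a minimal single-speaker, single-microphone example under idealized conditions (only the far-end talker is active and the room adds no noise). The tap count, step size and simulated echo path are assumptions chosen for the demo, and nothing here should be read as the filter used in any specific product.

```python
import random


def simulate_echo(x, h):
    """Microphone signal d(n): far-end signal x(n) passed through echo path h(n)."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n >= k)
            for n in range(len(x))]


def nlms_echo_canceler(x, d, num_taps=8, mu=0.5, eps=1e-8):
    """Adapt an FIR estimate of the echo path and return (residual, weights)."""
    w = [0.0] * num_taps      # current estimate of h(n)
    buf = [0.0] * num_taps    # most recent far-end samples, newest first
    residual = []
    for x_n, d_n in zip(x, d):
        buf = [x_n] + buf[:-1]
        y_n = sum(w_k * b_k for w_k, b_k in zip(w, buf))  # echo estimate y(n)
        e_n = d_n - y_n                                   # residual e(n)
        step = mu * e_n / (sum(b * b for b in buf) + eps)
        w = [w_k + step * b_k for w_k, b_k in zip(w, buf)]
        residual.append(e_n)
    return residual, w


if __name__ == "__main__":
    rng = random.Random(0)
    far_end = [rng.uniform(-1.0, 1.0) for _ in range(3000)]  # stand-in for x(n)
    true_path = [0.0, 0.6, 0.0, -0.3, 0.1]                   # hypothetical h(n)
    mic = simulate_echo(far_end, true_path)
    residual, weights = nlms_echo_canceler(far_end, mic)
    print([round(w_k, 3) for w_k in weights])
```

Note that the update is driven entirely by x(n): when the far-end talker is silent the filter has nothing to learn from, and when someone in the room talks, e(n) no longer reflects only the echo.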
Let’s take a slight detour to understand when an adaptive echo canceler can and cannot dynamically adapt its filter parameters. Figure 3 outlines the four possible states the system can be in at any time during the call. The “remote sound” state occurs when the only person talking is the remote talker. If only near-end talkers are talking, the call state is defined as the “in room sound” state. If the remote talker and the near-end talker happen to talk at the same time, which can often happen, the call state is defined as the “remote and in room sound” state, also known as double talk. The final state is referred to as the “idle” state, because no one is talking or the room is empty.
As it turns out, the adaptive filter (shown in Figure 2) can calibrate and adapt the echo canceler circuit only during the “remote sound” state. Ideally, the adaptive filter detects the non-valid states and pauses the calibration routine. This is difficult to do in practice. When the echo canceler calibrates in a call state in which it should not, its parameters drift away from optimal and it needs extra time to recover. All this can, and often does, lead to compromised call quality.
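The four call states and the adaptation gate can be sketched as follows. The detector is a simplified, frame-based take on a Geigel-style double-talk test, which assumes the echo reaches the microphone at no more than a fixed fraction of the far-end level. The thresholds and names are assumptions for illustration; real detectors are considerably more involved, which is part of why this step is hard in practice.

```python
from enum import Enum


class CallState(Enum):
    IDLE = "idle"
    REMOTE = "remote sound"
    IN_ROOM = "in room sound"
    DOUBLE_TALK = "remote and in room sound"


def peak(frame):
    """Largest absolute sample in a frame (0.0 for an empty frame)."""
    return max((abs(s) for s in frame), default=0.0)


def classify_state(far_frame, mic_frame, activity_floor=1e-3, echo_ratio=0.5):
    """Classify one frame. Thresholds are assumed values, not tuned ones."""
    far_peak, mic_peak = peak(far_frame), peak(mic_frame)
    far_active = far_peak > activity_floor
    if far_active:
        # Mic louder than the expected echo level suggests a talker in the room.
        near_active = mic_peak > echo_ratio * far_peak
    else:
        near_active = mic_peak > activity_floor
    if far_active and near_active:
        return CallState.DOUBLE_TALK
    if far_active:
        return CallState.REMOTE
    if near_active:
        return CallState.IN_ROOM
    return CallState.IDLE


def may_adapt(state):
    """A conventional adaptive echo canceler should adapt only in this state."""
    return state is CallState.REMOTE


if __name__ == "__main__":
    state = classify_state([0.5, -0.5], [0.1, -0.1])
    print(state.value, may_adapt(state))
```

Any frame that is misclassified as “remote sound” lets the filter adapt on the wrong signal, which is the failure mode described above.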
Figure 2: A common approach to deal with return echo issues is adaptive echo canceling.
Complex system installations may need to have a technician configure and calibrate the echo canceler system through a manual calibration process that attempts to account for the room’s acoustic properties. Any changes to the room (e.g., number of people, furniture arrangement, or microphone and/or speaker locations) after the manual calibration is completed could reduce its effectiveness and degrade echo canceler performance. When that happens, far-end talkers once again hear their acoustic return echo, and a new manual calibration is required.
To undertake a manual calibration, an impulse (a short-duration pulse; think of a hand clap or a balloon burst) is played through the speaker system and picked up by the microphone system. From the resulting impulse response, the acoustic return echo delay can be measured and used to calculate echo canceler parameters for the system. The process is intrusive and time-consuming and is not practical while a meeting room is in use, so the meeting room has to be taken offline to be calibrated.
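The measurement step can be sketched in a few lines. This example simulates a click played into a hypothetical echo path and finds the lag at which the recording best matches the stimulus. The echo path, the 16 kHz sample rate and the function names are assumptions for the demo; a real calibration works on recorded audio and extracts far more than a single delay.

```python
def convolve_full(signal, h):
    """Full convolution of signal with echo path h (simulated recording)."""
    out = [0.0] * (len(signal) + len(h) - 1)
    for i, s in enumerate(signal):
        for k, h_k in enumerate(h):
            out[i + k] += s * h_k
    return out


def estimate_delay_samples(stimulus, recording):
    """Lag (in samples) where the recording correlates most with the stimulus."""
    best_lag, best_score = 0, 0.0
    for lag in range(len(recording) - len(stimulus) + 1):
        score = abs(sum(s * recording[lag + i] for i, s in enumerate(stimulus)))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def samples_to_ms(samples, sample_rate_hz):
    """Convert a delay in samples to milliseconds."""
    return 1000.0 * samples / sample_rate_hz


if __name__ == "__main__":
    click = [1.0]                                   # idealized impulse
    echo_path = [0.0] * 12 + [0.8, 0.0, 0.0, 0.3]   # hypothetical room response
    recording = convolve_full(click, echo_path)
    lag = estimate_delay_samples(click, recording)
    print(lag, "samples =", samples_to_ms(lag, 16000), "ms")
```

Because the stimulus has to stand out clearly from everything else in the room, this kind of measurement is usually loud and done in an empty room.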
Figure 3: There are four possible states the system can be in at any time during the call.
A novel approach to echo canceler calibration and performance is achieved by Nureva’s patented Microphone Mist technology. Current approaches and technology seemed insufficient, so we set ourselves a challenge: improve the far-end talker’s conference call experience by continuously accounting for changes in the situation and environment, while avoiding the limitations of adaptive echo cancelers, which can adapt only in the “remote sound” state and, in more complex systems, need a manual calibration step. We’re excited to be able to report that we met the challenge!
Using a concept called sound masking, which is commonly deployed in many commercial environments, we can calibrate the echo canceler in situations not normally suited to calibration. The novelty is in using a sound mask-like signal to measure the acoustic environment in real time for each microphone and speaker combination. Not only is a loud calibration signal not required (which would be impractical when the meeting room is in use), but the echo canceler is also no longer limited to adapting during the remote sound state. It also adapts during the idle call state and is ready to perform during all call states. This means that the echo canceler will have consistent performance from the beginning of the call to the end.
A sound mask signal, which is usually some form of shaped acoustic pink noise, is transmitted in commercial spaces to raise the ambient background noise level of the environment. Because the sound mask is perceived as background ambient noise, it is easily ignored. By using the sound mask in this novel way, Microphone Mist technology can leverage a common concept to achieve significantly greater benefit than typical approaches to echo canceler calibration.
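For readers who have not worked with pink noise, the sketch below generates a rough approximation using a Voss-style method: several random sources are summed, each refreshed half as often as the one before, so slower sources contribute the extra low-frequency energy that distinguishes pink noise from white noise. This is a generic textbook-style technique chosen for brevity, with assumed parameter values; it is not the shaped signal used by any particular sound masking product.

```python
import random


def pink_noise(num_samples, num_rows=8, seed=0):
    """Approximate pink noise in [-1, 1] by summing rows refreshed at halved rates."""
    rng = random.Random(seed)
    rows = [0.0] * num_rows
    out = []
    for i in range(num_samples):
        for k in range(num_rows):
            if i % (2 ** k) == 0:          # row k is refreshed every 2**k samples
                rows[k] = rng.uniform(-1.0, 1.0)
        out.append(sum(rows) / num_rows)
    return out


def lag1_autocorrelation(signal):
    """Correlation between neighbouring samples; near 0 for white noise."""
    num = sum(a * b for a, b in zip(signal, signal[1:]))
    den = sum(a * a for a in signal)
    return num / den


if __name__ == "__main__":
    rng = random.Random(1)
    white = [rng.uniform(-1.0, 1.0) for _ in range(4096)]
    pink = pink_noise(4096)
    print("white:", round(lag1_autocorrelation(white), 2))
    print("pink: ", round(lag1_autocorrelation(pink), 2))
```

The higher sample-to-sample correlation of the second signal reflects its stronger low-frequency content, which is what makes pink noise sound softer and easier to ignore than white noise.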
Figure 4: The common adaptive echo canceler stage is replaced with Nureva’s unique sound mask processor.
Figure 4 illustrates replacing the common adaptive echo canceler stage with Nureva’s unique sound mask processor. The sound mask processor generates a sound mask-like signal SM(n) that is injected into the speaker audio output path along with any far-end talker signal x(n) and sent to the speaker system. After it has been reflected in the acoustic environment, the sound mask signal SM(n) is picked up by the microphone system as part of d(n) and fed back into the sound mask processor.
An echo path h(n) measurement for each speaker and microphone combination can be completed. This is important, because in a distributed microphone and speaker system, each microphone requires its own echo canceler function. The acoustic return echo path for each microphone and speaker combination is unique and cannot be treated as a summed microphone signal for the purposes of acoustic return echo cancellation. The sound mask processor outputs a correction signal RTF(n) to the output path for each microphone, canceling out the return echo signal present in the microphone channel.
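The general idea of measuring an echo path with a known, low-level noise signal can be illustrated with a simple cross-correlation estimate. To be clear, this is a generic system identification sketch and not Nureva’s algorithm: it assumes a white (unshaped) probe, a single speaker, two microphones with made-up echo paths, and an in-room sound that is uncorrelated with the probe.

```python
import random


def convolve(signal, h):
    """Echo of `signal` through path `h`, truncated to len(signal)."""
    return [sum(h[k] * signal[n - k] for k in range(len(h)) if n >= k)
            for n in range(len(signal))]


def estimate_echo_path(probe, mic, num_taps):
    """Estimate h(n) by correlating the microphone signal with the known probe.

    Assumes the probe is white and uncorrelated with other sound in the room.
    """
    power = sum(p * p for p in probe)
    return [sum(mic[n] * probe[n - k] for n in range(k, len(mic))) / power
            for k in range(num_taps)]


def remove_echo(mic, speaker_out, h_est):
    """Subtract the predicted echo of the speaker output from the mic signal."""
    return [m - y for m, y in zip(mic, convolve(speaker_out, h_est))]


if __name__ == "__main__":
    rng = random.Random(1)
    n = 20000
    probe = [rng.gauss(0.0, 0.05) for _ in range(n)]       # stand-in for SM(n)
    in_room = [rng.gauss(0.0, 0.05) for _ in range(n)]     # unrelated room sound
    true_paths = {"mic_1": [0.0, 0.5, 0.0, -0.25],         # hypothetical paths
                  "mic_2": [0.0, 0.0, 0.3, 0.1]}
    for name, h in true_paths.items():
        mic = [e + v for e, v in zip(convolve(probe, h), in_room)]
        h_est = estimate_echo_path(probe, mic, num_taps=4)
        print(name, [round(v, 2) for v in h_est])
```

Each microphone gets its own estimate because each one sees a different path from the speaker. Since the probe is always present, no far-end talker is needed for the measurement, and unrelated sound in the room averages out over time instead of corrupting it.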
Figure 5 illustrates how the sound mask processor meets the challenge. Because the sound mask processor does not require a far-end talker, the conference system is able to adapt and optimize echo canceler parameters when other systems cannot. For example, during the idle state, which may be an empty or silent room, the sound mask processor can measure the acoustic environment and determine return echo path values to calibrate the echo canceler. This ensures the system will be ready to perform the instant it is required for a conference call.
Figure 5: Because the sound mask processor does not require a far-end talker, the conference system is able to adapt and optimize echo canceler parameters when other systems cannot.
While other systems are sitting idle doing nothing, Microphone Mist technology is preparing for your next call. To top it off, a manual calibration process is no longer required. For example, as people enter the room and/or arrange the furniture and supporting equipment, the sound mask processor can measure this change and adapt the echo canceler parameters in real time even if there is no one talking in the meeting. Remember, adaptive echo cancelers require a speaking far-end talker to calibrate and adapt. The sound mask processor is able to adapt in real time, so the participants at the far end (remote sound) are able to stay engaged and focused on the conversations in the conference room without having to deal with their own or other remote participants’ acoustic return echo sounds during the call.
The implementation is future-proof and scales with an increase in system complexity. The costs associated with having an acoustician or IT technician maintain the system calibration are eliminated.
The continuous echo canceler calibration functionality builds on Nureva’s patented Microphone Mist technology, which places thousands of virtual microphones throughout a space to pick up sound from any location to ensure that everyone is clearly heard regardless of where they are in the room or the direction they are facing. Users also have the option to limit audio pickup to a specific zone within a space, such as the front of the room, making it ideal for corporate presentation scenarios and lecture capture in higher education.
The technology uses sophisticated algorithms to simultaneously process sound from all virtual microphones to provide remote participants with a high-quality listening experience, enabled by continuous autocalibration, simultaneous echo cancellation, position-based automatic gain control and sound masking. Microphone Mist technology adjusts the sound source level to work with Microsoft Teams, Skype for Business, Zoom, Cisco Spark, Cisco WebEx, GoToMeeting, Pexip Infinity Connect and other common UC&C applications.