Directional hear-through techniques for acoustic transparency to deliver augmented/mixed reality audio experiences over hearables

Bibliographic Details
Main Author: Gupta, Rishabh
Other Authors: Gan, Woon Seng
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2023
Online Access:https://hdl.handle.net/10356/169123
Institution: Nanyang Technological University
Description
Summary: Auditory information is an essential sensory input for constructing an Augmented/Mixed Reality (AR/MR) experience. The main goal of AR/MR audio is to provide a feeling of presence to the user through the seamless fusion of reproduced virtual audio with real-world sounds, which can be altered as desired. AR/MR audio can generate the binaural sound pressure required to provide the desired experience over headphones or loudspeakers. Smart headphones, called hearables, are wearable devices that offer critical advantages for AR/MR audio over loudspeakers, such as privacy and freedom of movement across different physical locations. However, occluding-type hearables, which have an earcup or outer shell as part of their physical construction, modify the spectrum of real-world sounds. A fundamental AR/MR audio experience, which has become crucial in today's hearables, is the unaltered perception of real sounds, referred to as acoustic transparency. Acoustic transparency is critical for several AR/MR audio applications, such as navigating the real world, real and virtual social interactions, and engaging with multimedia content such as movies or games. The techniques used to achieve acoustic transparency, involving the capture, processing, and playback of real sounds, are called Hear-Through (HT) techniques.

This thesis focuses on HT techniques for achieving acoustic transparency of real sound for AR/MR audio experiences delivered over occluding-type hearables, based on a closed-back circumaural (over-ear) design with one external microphone placed on each side of the hearable's ear cup. Past studies on acoustic transparency have computed HT filters either by averaging the measured responses across different source directions before filter computation, or by averaging HT filters derived from responses measured for a few source directions. HT filters derived using such averaging methods (avgHT) can mismatch the target open-ear response, leading to localization errors, timbre differences, and a poor acoustic transparency experience.

The first part of the thesis focuses on computing directional HT (dirHT) filters that closely match open-ear responses by incorporating spatial information about the sound source directions, along with the individualized spectral cues humans use to accurately localize sound sources. Adaptive filtering based on the Filtered-x Normalized Least Mean Square (FxNLMS) algorithm is used to compute the dirHT filters from the responses measured for each sound source direction. Results show that the computed dirHT filters match the open-ear response more closely than the avgHT filters. The proposed method of computing HT filters incorporating spatial information outperformed the overall averaging scheme even when the directional resolution at which the HT filters were computed was reduced to 60° (referred to as groupedHT filters). Perceptual test results showed that listeners found the estimated open-ear signals synthesized using dirHT filters to closely match the target open-ear reference signals.
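To make the adaptive computation concrete, the following is a minimal offline single-channel FxNLMS sketch in Python/NumPy for one source direction. The signal names, filter length, and step size are illustrative assumptions, not the thesis implementation; in a real system the error would be measured acoustically through the true secondary path rather than simulated with its estimate.

```python
import numpy as np

def fxnlms_dirht(x, d, s_hat, L=256, mu=0.5, eps=1e-8):
    """Adapt an L-tap dirHT filter w so that its output, played through
    the secondary-path estimate s_hat (driver to eardrum), tracks the
    open-ear target d for one source direction. Assumes len(s_hat) <= L.
    x: external-mic signal; d: measured open-ear response to the same source."""
    w = np.zeros(L)                  # dirHT filter under adaptation
    x_buf = np.zeros(L)              # recent mic samples, newest first
    y_buf = np.zeros(len(s_hat))     # recent HT outputs, newest first
    fx_buf = np.zeros(L)             # filtered-x regressor, newest first
    e = np.zeros(len(x))
    for n in range(len(x)):
        x_buf = np.roll(x_buf, 1); x_buf[0] = x[n]
        y = w @ x_buf                                # HT playback sample
        y_buf = np.roll(y_buf, 1); y_buf[0] = y
        e[n] = d[n] - s_hat @ y_buf                  # error vs. open-ear target
        fx_buf = np.roll(fx_buf, 1)
        fx_buf[0] = s_hat @ x_buf[:len(s_hat)]       # x filtered by s_hat
        w = w + mu * e[n] * fx_buf / (fx_buf @ fx_buf + eps)  # NLMS update
    return w, e
```

Normalizing the update by the filtered-x power keeps the step size stable across input levels, which is why the normalized variant is the usual choice for this kind of response matching.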
In the first part of the thesis, the dirHT filters are computed under the following assumptions:
- each filter is computed from the responses measured for a single sound source, so each dirHT filter provides acoustic transparency for only one source;
- the direction of the sound source is known a priori;
- all responses required to derive the dirHT filters are measured accurately for each user, hearable fitting, and source position;
- the ANC mode of the prototype hearables can attenuate the leaked real sound by more than 15-20 dB, avoiding comb-filtering artifacts.
In the subsequent parts of the thesis, we examine each of these assumptions and propose techniques to improve HT performance with dirHT filters when they are relaxed.

AvgHT filters computed in past works can lead to poor acoustic transparency in multiple-source scenarios. The dirHT filters are computed under the assumption that only a single sound source is present and thus cannot be applied directly to multiple-source scenarios. Furthermore, implementing dirHT filters in an HT system requires estimating the direction of the sound source. The second part of this thesis therefore presents a multiple-source dirHT framework based on a parametric approach to achieve acoustic transparency in a more complex acoustic environment containing multiple sound sources. We propose a parametric dirHT equalization approach in the time-frequency domain that estimates a sub-band Direction of Arrival (DoA) using Neural Networks (NN) and selects the corresponding dirHT filter for each sound source direction from a precomputed database. An objective analysis using the Spectral Difference (SD) measure is conducted to evaluate the proposed parametric approach, with the open-ear scenario as the reference. Using dummy-head measurements with band-limited pink noise and real source signals, the estimated open-ear signals derived using parametric dirHT filtering closely matched the target open-ear signals in multiple-source scenarios.
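A minimal sketch of the parametric selection step, with an illustrative SD measure, is given below. The filter-database layout, the per-tile DoA input, and the exact SD definition are assumptions for illustration; the neural-network DoA estimator itself is not shown.

```python
import numpy as np
from scipy.signal import stft, istft, welch

def parametric_dirht(mic, fs, ht_db, doa_tf, nperseg=512):
    """Parametric dirHT equalization in the STFT domain: for each
    time-frequency tile, select the dirHT filter matching the tile's
    estimated DoA and equalize the tile.
    ht_db  : dict {direction_deg: complex frequency response sampled on
             the STFT frequency grid}, precomputed per direction.
    doa_tf : (n_freq, n_frames) array of sub-band DoA estimates (deg),
             e.g., produced by a neural-network estimator."""
    _, _, X = stft(mic, fs=fs, nperseg=nperseg)
    Y = np.empty_like(X)
    for k in range(X.shape[0]):           # frequency sub-bands
        for m in range(X.shape[1]):       # time frames
            h = ht_db[int(doa_tf[k, m])]  # dirHT response for this tile's DoA
            Y[k, m] = h[k] * X[k, m]      # equalize the tile
    _, y = istft(Y, fs=fs, nperseg=nperseg)
    return y

def spectral_difference(est, ref, fs, nperseg=512):
    """Mean absolute log-spectral difference in dB between the estimated
    and reference open-ear signals (one illustrative SD definition)."""
    _, p_est = welch(est, fs=fs, nperseg=nperseg)
    _, p_ref = welch(ref, fs=fs, nperseg=nperseg)
    return np.mean(np.abs(10.0 * np.log10(p_est / p_ref)))
```

In this scheme the per-tile lookup lets sources from different directions within the same mixture each receive their own equalization, which is what a single avgHT or dirHT filter cannot do.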
Individualized (Ind) dirHT filters are the most accurate, but computing them precisely requires all of the responses used in their derivation to be measured in situ for each user, since the filters depend on the individual's anthropometry, the device characteristics, the device coupling, and the source position. Most past studies and commercial products use Non-Individualized (Non-Ind) HT filters, since in-situ measurement is tedious and infeasible in practice. It is hypothesized that major differences exist between Ind and Non-Ind dirHT filters due to individualized pinna cues. The third part of the thesis examines the perceptual differences between Ind and Non-Ind dirHT filters. We investigated the differences in the pinna-dependent Directional Transfer Functions (DTFs) used to compute the dirHT filters. The objective and subjective results showed perceptually distinguishable timbre differences and large localization errors for Non-Ind dirHT filters compared with Ind dirHT filters.

Comb-filtering artifacts, a major issue in the practical implementation of all HT systems including dirHT systems, occur when the passively attenuated real sound that leaks into the hearable acoustically adds to the delayed HT-equalized signal played back through the hearable's transducers. Mitigating comb-filtering artifacts in HT systems usually involves reducing the processing delay or further attenuating the leaked real sound using techniques such as Active Noise Control (ANC). While digital or analog low-latency HT implementations can reduce the processing delay, a digital implementation can be expensive to design, whereas an analog implementation is not programmable and thus cannot easily be modified to account for changes in the direction of sound sources. We have investigated the mitigation of comb-filtering artifacts for the implementation of dirHT filters in hearables. Previous studies have shown that directional ANC (dirANC) techniques can improve noise cancellation for real sound sources located at different spatial positions. We present two filter design methods that both closely match the open-ear reference and cancel the leaked real sound to reduce comb-filtering artifacts: separately trained dirHT and dirANC filters, and a jointly trained single filter. The filters are computed using the FxNLMS algorithm. Detailed simulation experiments compare the two configurations, a parallel implementation of the separately trained filters and a single-filter implementation of the jointly trained filter, within an HT system evaluation framework. The jointly trained filter is efficient when the causality constraints for cancelling the leaked real sound are satisfied. When causality is violated for dirANC, implementing the separately trained dirHT and dirANC control filters in parallel reduces comb-filtering artifacts while allowing a larger latency margin for the HT processing. The additional latency margin can be used to incorporate features that improve HT performance, such as accurately estimating the direction of the incident sound source. To summarize, this thesis proposes dirHT techniques to achieve acoustic transparency in hearables targeted toward AR/MR audio applications.
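As a closing illustration of the comb-filtering mechanism and the two playback configurations described in the final part, here is a simple FIR path model of the eardrum signal; the path impulse responses and filter names are placeholders, not measured data.

```python
import numpy as np

def ear_signal(x, p, s, filters):
    """Illustrative eardrum-signal model: the real sound x leaks through
    the passive path p and sums acoustically with the transducer playback,
    i.e., the filter outputs passed through the secondary path s.
    Pass filters=[w_ht] for HT alone (the delayed playback combs with the
    leak), filters=[w_ht, w_anc] for the parallel separately trained
    scheme, or filters=[w_joint] for the jointly trained single filter."""
    trim = lambda v: v[:len(x)]
    leak = trim(np.convolve(x, p))                          # passive leak
    drive = sum(trim(np.convolve(x, w)) for w in filters)   # transducer input
    return leak + trim(np.convolve(drive, s))               # signal at eardrum
```

In this model, an ideal jointly trained filter shapes the total response toward the open-ear target while cancelling the leak, but only if the combined electro-acoustic path is causal relative to the leakage path; when it is not, the parallel dirHT plus dirANC arrangement is the workable alternative.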