Invented by Adrian Murtaza, Harald Fuchs, Bernd Czelhan, Jan Plogsties, Matteo Agnelli, Ingo Hofmann, Fraunhofer Gesellschaft zur Forderung der Angewandten Forschung eV

The market for optimizing audio delivery for virtual reality (VR) applications is rapidly expanding as the demand for immersive experiences continues to grow. As VR technology becomes more accessible and affordable, developers are focusing on enhancing the audio aspect of these virtual environments to provide users with a truly immersive and realistic experience.

Audio plays a crucial role in creating a sense of presence and immersion in VR applications. It helps to transport users into a virtual world by providing spatial cues and realistic sound effects that mimic real-life scenarios. Whether it’s the sound of footsteps approaching from behind or the distant echo of a waterfall, high-quality audio can greatly enhance the overall VR experience.

One of the key challenges in optimizing audio delivery for VR applications is achieving accurate spatial audio. Spatial audio refers to the ability to perceive sound coming from different directions and distances, just like in the real world. This is achieved through the use of advanced audio algorithms and techniques that simulate the way sound waves interact with the environment and the listener’s ears.

To achieve accurate spatial audio, developers often rely on binaural audio techniques. Binaural audio involves capturing sound using specialized microphones that mimic the human ear’s shape and placement. These recordings are then processed and played back through headphones, creating a 3D audio experience that accurately represents the direction and distance of sound sources.

Another important aspect of optimizing audio delivery for VR applications is ensuring low latency. Latency refers to the delay between the user’s actions and the corresponding audio response. In VR, even the slightest delay can break the sense of immersion and cause discomfort or motion sickness. Therefore, developers need to minimize latency to ensure a seamless and realistic audio experience.

To address these challenges, companies specializing in audio technology are developing innovative solutions specifically tailored for VR applications. These solutions include advanced audio engines, software development kits (SDKs), and plugins that enable developers to create immersive audio experiences with ease. These tools often provide features such as real-time audio rendering, spatial audio simulation, and low-latency processing.

The market for optimizing audio delivery for VR applications is not limited to gaming. Various industries, including education, healthcare, and entertainment, are exploring the potential of VR to enhance their products and services. For example, VR simulations can be used in medical training to provide realistic scenarios for students to practice their skills. In such applications, accurate and immersive audio is crucial for creating a lifelike environment and enhancing the learning experience.

As the demand for VR applications continues to rise, the market for optimizing audio delivery is expected to grow significantly. According to a report by MarketsandMarkets, the global market for VR in gaming and entertainment alone is projected to reach $45.09 billion by 2027. This growth is likely to drive the demand for advanced audio technologies that can deliver high-quality and immersive audio experiences.

In conclusion, the market for optimizing audio delivery for VR applications is expanding rapidly as developers strive to create immersive and realistic virtual environments.
Accurate spatial audio and low latency are key challenges that companies specializing in audio technology are addressing through innovative solutions. As VR becomes more prevalent across various industries, the demand for high-quality audio experiences is expected to grow, driving the market for audio optimization in VR applications.
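The binaural cues described above can be made concrete with a toy model. The sketch below is a minimal illustration, assuming a Woodworth-style interaural time difference and a simple equal-power level difference in place of full HRTF processing; the constants and the function name are ours, not from any particular SDK.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s
HEAD_RADIUS = 0.0875     # m; an average human head radius (assumption)

def spatial_cues(azimuth_deg: float):
    """Return (left_gain, right_gain, itd_seconds) for a source at the
    given azimuth (0 = straight ahead, +90 = hard right). A Woodworth-
    style ITD plus a sine-law ILD -- a toy stand-in for HRTF processing."""
    az = math.radians(azimuth_deg)
    # Interaural time difference (Woodworth approximation).
    itd = (HEAD_RADIUS / SPEED_OF_SOUND) * (az + math.sin(az))
    # Crude interaural level difference via equal-power panning.
    pan = math.sin(az)                     # -1 (left) .. +1 (right)
    left_gain = math.sqrt(0.5 * (1.0 - pan))
    right_gain = math.sqrt(0.5 * (1.0 + pan))
    return left_gain, right_gain, itd

print(spatial_cues(30.0))   # a source 30 degrees to the listener's right
```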

The Fraunhofer Gesellschaft zur Forderung der Angewandten Forschung eV invention works as follows

In one example, a system is equipped with at least one video decoder that decodes video signals to display a VR, AR, or 360-degree environment to the user. The system also includes at least one audio decoder to decode audio streams. The system is configured to request at least one audio stream, an audio element of an audio stream, and/or an adaptation set from a server, based at least on the user's current viewport and/or head orientation, and/or movement, and/or interaction metadata, and/or virtual positional data.
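As a rough sketch of this request logic, the client below reports orientation and position data and asks a server for matching streams; the endpoint URL, query parameters, and response format are assumptions for illustration only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SERVER = "https://example.com/vr-audio"   # hypothetical endpoint

def request_audio_streams(yaw, pitch, roll, position):
    """Ask the server which audio streams / adaptation sets match the
    user's current head orientation and virtual position."""
    query = urlencode({
        "yaw": yaw, "pitch": pitch, "roll": roll,
        "x": position[0], "y": position[1], "z": position[2],
    })
    with urlopen(f"{SERVER}/streams?{query}") as resp:
        return json.load(resp)   # e.g. a list of adaptation-set URLs

# Example (requires a server at the hypothetical endpoint):
# streams = request_audio_streams(30.0, 0.0, 0.0, (1.0, 0.0, 2.0))
```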

Background for Optimizing Audio Delivery for Virtual Reality Applications

In a Virtual Reality (VR) environment, or similarly in an Augmented Reality (AR), Mixed Reality (MR), or 360-degree video environment, the user can usually visualise 360-degree content using, for example, a Head-Mounted Display, and listen to it over headphones (or over loudspeakers with rendering that is correct for the user's position).

In a simple case, content is created such that only one audio/video scene (a 360-degree video, for example) is reproduced at a given moment. The audio/video scene has a fixed position (e.g. a sphere with the user at its centre), and the user cannot move within the scene; he can only rotate his head in different directions (yaw, pitch, roll). The user's head orientation then determines which video and audio are presented (different viewports).
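A minimal sketch of this rotation-only (3DoF) case: with no translation, rendering reduces to counter-rotating the scene around the listener. The function below handles only the yaw component; pitch and roll are analogous, and the function name is ours.

```python
def head_relative_azimuth(source_az_deg: float, head_yaw_deg: float) -> float:
    """Azimuth of a scene-fixed sound source relative to the listener's
    head. With rotation-only (3DoF) tracking, turning the head simply
    counter-rotates the scene; pitch and roll are handled analogously."""
    return (source_az_deg - head_yaw_deg + 180.0) % 360.0 - 180.0

# A source straight ahead (0 deg) appears 40 deg to the left after the
# listener turns their head 40 deg to the right:
print(head_relative_azimuth(0.0, 40.0))   # -> -40.0
```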

While the video is delivered in 360 degrees, together with metadata describing the rendering process (such as stitching information and projection mapping), the audio content is not selected according to the user's current viewport; it is delivered for the entire scene. The audio content is then adapted to the user's current viewport based on metadata: for example, an audio object may be rendered differently depending on information about the user's orientation or viewport. It should be noted that 360-degree media refers to content offering more than one viewing angle at the same moment in time, from which the user may choose (e.g. by head orientation, or by using a remote-control device).
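As an illustrative sketch of such metadata-driven adaptation, the snippet below picks an object's playback gain from viewport-dependent rules; the metadata keys are invented for illustration and are not the actual MPEG-H field names.

```python
def render_gain(obj_meta: dict, viewport_az_deg: float) -> float:
    """Pick an audio object's playback gain from viewport-dependent
    metadata (keys are illustrative, not standardized field names)."""
    for rule in obj_meta.get("viewport_rules", []):
        lo, hi = rule["azimuth_range"]      # degrees, inclusive
        if lo <= viewport_az_deg <= hi:
            return rule["gain"]
    return obj_meta.get("default_gain", 1.0)

commentary = {
    "viewport_rules": [{"azimuth_range": (-45, 45), "gain": 1.0}],
    "default_gain": 0.3,   # quieter when the user looks away
}
print(render_gain(commentary, 10.0))    # 1.0 (object in view)
print(render_gain(commentary, 120.0))   # 0.3 (object out of view)
```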

In a more complicated scenario, the user can move around in the VR scene, or "jump" from one scene to the next. The audio content may also change from one scene to the next (e.g. audio sources that are inaudible in one scene could become audible in the next, as when "a door is opened"). With existing systems, complete audio scenes can be encoded in one stream and, if necessary, in additional streams (dependent on the main stream); such systems are known as Next Generation Audio systems (e.g. MPEG-H 3D Audio). A sketch of this kind of viewpoint-dependent stream switching is given after the Discrete Viewpoints definition below. Examples of such use cases include:

For the purposes of describing this situation, the notion of Discrete Viewpoints is introduced: a discrete location in space (or in the VR environment) for which audio/video content is available.
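As a rough illustration of both ideas, the sketch below snaps the user's position to the nearest Discrete Viewpoint and switches the active main/auxiliary streams accordingly; the catalogue layout and all identifiers are invented for illustration.

```python
import math

# Hypothetical catalogue: each discrete viewpoint has a main stream and
# auxiliary streams that only become audible in that scene.
VIEWPOINTS = {
    (0.0, 0.0, 0.0): {"main": "lobby_main", "aux": []},
    (5.0, 0.0, 0.0): {"main": "stage_main", "aux": ["crowd_ambience"]},
    (5.0, 0.0, 4.0): {"main": "balcony_main", "aux": ["wind", "crowd_ambience"]},
}

def streams_for_position(pos):
    """Snap the user's virtual position to the nearest discrete viewpoint
    and return the streams to decode there (main plus auxiliaries)."""
    best = min(VIEWPOINTS, key=lambda vp: math.dist(vp, pos))
    entry = VIEWPOINTS[best]
    return [entry["main"], *entry["aux"]]

print(streams_for_position((4.2, 0.0, 0.5)))
# -> ['stage_main', 'crowd_ambience']
```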

The "straight-forward" solution is to use a real-time encoder that changes the encoding (the number of audio elements and the spatial information) based on feedback from the playback device about the user's position/orientation. In a streaming environment, for example, this would require complex communication between client and server.

The complexity of such a system is beyond the capabilities and features available in equipment and systems today, or in those that will be developed within the next decade.

Alternatively, the content representing the complete VR environment ("the entire world") could be delivered continuously. This would solve the problem, but the required bandwidth could be so high that it exceeds the capacity of existing communication links.
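To make the bandwidth concern concrete, here is a back-of-the-envelope estimate; all figures are illustrative assumptions, not numbers from the patent.

```python
# Back-of-the-envelope bandwidth for delivering "the entire world".
num_scenes = 20            # discrete viewpoints in the VR world (assumed)
objects_per_scene = 16     # audio objects per scene (assumed)
object_bitrate_kbps = 64   # per-object coded audio bitrate (assumed)

total_kbps = num_scenes * objects_per_scene * object_bitrate_kbps
print(f"{total_kbps / 1000:.1f} Mbit/s for audio alone")   # -> 20.5 Mbit/s
```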

This is therefore a complex use case for a real-time environment, and low-complexity alternative solutions are needed to enable this functionality.

2. Terminology and Definitions

The following terms are used in technical fields:

In this context, the notion of Adaptation Sets is used generically and sometimes refers to Representations. The media streams (audio/video) are encapsulated into Media Segments, which are the actual files played by the client. Media Segments can use a variety of formats, such as the ISO Base Media File Format (similar to the MPEG-4 container format) or the MPEG-2 Transport Stream (TS). The encapsulation into Media Segments and into different Representations/Adaptation Sets is independent of the methods described here; the methods apply to all these options.
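As a small concrete example of how Media Segments are addressed in DASH, the sketch below expands a SegmentTemplate into segment URLs; the template string itself is hypothetical, while $RepresentationID$ and $Number$ are standard DASH template identifiers.

```python
def segment_url(template: str, rep_id: str, number: int) -> str:
    """Expand a DASH SegmentTemplate into a concrete Media Segment URL.
    $RepresentationID$ and $Number$ are standard DASH identifiers."""
    return (template
            .replace("$RepresentationID$", rep_id)
            .replace("$Number$", str(number)))

# Hypothetical template as it might appear in an MPD:
tpl = "audio/$RepresentationID$/seg-$Number$.m4s"
print(segment_url(tpl, "scene1_en_128k", 42))
# -> audio/scene1_en_128k/seg-42.m4s
```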

The methods described in this document assume DASH server-client communication, but they can be used with any other delivery environment, such as MMT, MPEG-2 TS, DASH-ROUTE, or plain file-format playback.

In general, an Adaptation Set is a layer above a stream and may contain metadata (e.g. associated with positions). A stream can contain a number of audio elements, and an audio scene may be associated with a number of streams delivered as part of multiple Adaptation Sets.
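The hierarchy just described can be sketched as a simple data model; the class and field names below are illustrative, not taken from the patent or any standard.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class AudioElement:
    name: str                     # e.g. one audio object or a channel bed

@dataclass
class Stream:
    elements: List[AudioElement]  # a stream carries several audio elements

@dataclass
class AdaptationSet:              # the layer above the stream(s)
    streams: List[Stream]
    position: Optional[Tuple[float, float, float]] = None  # optional metadata

@dataclass
class AudioScene:                 # a scene can span several adaptation sets
    adaptation_sets: List[AdaptationSet]

scene = AudioScene([AdaptationSet([Stream([AudioElement("narrator")])],
                                  position=(0.0, 0.0, 0.0))])
```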

3. Current Solutions

Current solutions include:

Current solutions only allow the user to rotate within the VR environment, not to move through it.

According to one embodiment, the system for a 360-degree video, virtual reality, mixed reality or augmented reality environment may include:

According to a second embodiment, the system may include:

One embodiment may include a server that delivers audio and video streams for a virtual reality (VR), augmented reality (AR), mixed reality (MR), or 360-degree video environment, the video and audio streams being reproduced on a media consumption device.

Another embodiment may include a server that delivers audio and video streams for a VR, AR, MR, or 360-degree video environment, with the video and audio streams then reproduced on a media consumption device.
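A server-side counterpart to these embodiments might look like the sketch below, which selects the video stream and the audible audio streams for a reported position and head orientation; the catalogue layout, field names, and the 120-degree field-of-view default are all assumptions.

```python
import math

CATALOGUE = {   # hypothetical server-side content catalogue
    (0.0, 0.0, 0.0): {
        "video": "lobby_360.mp4",
        "audio": [
            {"id": "lobby_bed", "always_on": True, "azimuth": None},
            {"id": "tv_news", "always_on": False, "azimuth": 90.0},
        ],
    },
}

def select_streams(position, head_yaw_deg, fov_deg=120.0):
    """Pick the video stream and the audio streams to deliver for the
    client's reported virtual position and head orientation."""
    entry = CATALOGUE[min(CATALOGUE, key=lambda vp: math.dist(vp, position))]
    half_fov = fov_deg / 2.0
    audible = [s["id"] for s in entry["audio"]
               if s["always_on"]
               or abs((s["azimuth"] - head_yaw_deg + 180) % 360 - 180) <= half_fov]
    return {"video": entry["video"], "audio": audible}

print(select_streams((0.1, 0.0, 0.2), head_yaw_deg=80.0))
# -> {'video': 'lobby_360.mp4', 'audio': ['lobby_bed', 'tv_news']}
```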
