IR // VR · CMSC731

CMSC731 · Final Project · Spring 2026

Interactive Image-Guided Acoustic Authoring in VR.

From what you see to what you hear — synthesizing room acoustics on the fly from a single in-game screenshot. We pair a generative model with a real-time Unity audio plugin so the world sounds like the space the player is standing in.


Team: Anushk · Dahong · Parth
Institution: University of Maryland
Stack: PyTorch · Flask · Unity · C++ DSP
Status: Working end-to-end prototype

01 · The problem

Realistic VR audio is the half nobody finishes.

Immersion in VR depends on more than visuals, but every existing path to good room acoustics forces a tradeoff: creators tune for hours, the headset's GPU burns through frames doing physics, or the model is too heavy to run live.

Manual

Plugin-based reverb.

Tools like Steam Audio handle reverb and transmission, but every parameter — decay, absorption, room size — must be hand-tuned per scene.

Compute-heavy

Physics-based 3D audio.

Geometric and ray-traced acoustics produce realistic results, but their compute cost is high and scales poorly with scene complexity.

Hardware-bound

End-to-end ML inference.

Models can predict realistic acoustics directly, but running them every frame requires a powerful GPU on the headset side.

02 · The approach

Predict the room, then apply it.

An impulse response is a short audio recording of how a space responds to a single broadband stimulus — a clap, a balloon pop, a sine sweep. It encodes everything: the direct path, the early reflections, and the long reverberant tail.

Once you have it, applying it is cheap: a per-bin spectral multiply for each audio source, which is convolution carried out in the STFT domain. The expensive part is getting the right IR for the room, and that is exactly where the ML model earns its keep.

Our system uses Image2Reverb to generate an IR for whatever the user is looking at, on demand. Heavy work happens once. Light work happens always.

Time-domain convolution

y(t) = x(t) ∗ h(t)

x(t): dry source
h(t): impulse response
y(t): reverberant output

In practice we move into the STFT domain: multiplying spectra bin by bin is much cheaper than full-length time-domain convolution, and the IR can be swapped between frames without audible clicks.
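
The equivalence behind that move can be checked numerically. The sketch below is an illustration, not the plugin's code: it assumes NumPy, uses one full-length FFT instead of a framed STFT, and the signals `x` and `h` are made-up stand-ins for a dry source and an impulse response.

```python
import numpy as np

def fft_convolve(x, h):
    """Linear convolution of x and h via multiplication of their spectra."""
    n = len(x) + len(h) - 1               # length of the full linear convolution
    n_fft = 1 << (n - 1).bit_length()     # next power of two, avoids circular wrap-around
    X = np.fft.rfft(x, n_fft)
    H = np.fft.rfft(h, n_fft)
    return np.fft.irfft(X * H, n_fft)[:n]

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)                                   # stand-in dry source
h = rng.standard_normal(256) * np.exp(-np.arange(256) / 40.0)   # toy decaying IR

y_time = np.convolve(x, h)     # direct time-domain convolution
y_freq = fft_convolve(x, h)    # same result via spectral multiplication
max_err = float(np.max(np.abs(y_time - y_freq)))
```

The two outputs agree to within floating-point rounding, while the spectral route costs on the order of N log N operations rather than N times the IR length.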

03 · System

End-to-end pipeline.

Screenshot → preprocessing → Image2Reverb on a Flask backend → IR → custom convolution plugin → spatialised audio in Unity.

Step 01 →

VR Capture

Press 'A' on the controller. The current frame is captured directly from the headset's render target.

Step 02 →

Preprocess

Resize, normalize, and encode the frame as an RGB tensor matching the model's expected input.

Step 03 →

Flask Backend

Python service on an RTX 4070 Ti Super accepts the frame and runs inference.

Step 04 →

Image2Reverb

Cross-modal conditional GAN emits a log-mel spectrogram of the predicted IR.

Step 05 →

Custom Plugin

STFT convolution applies the IR; a hot-swap mechanism switches without artifacts.

Step 06 ●

Spatial Output

Directional sources stay spatialized; a global bus carries the room's reverberant signature.
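
Step 02 of the pipeline above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's preprocessing code: it assumes NumPy, a nearest-neighbour resize, a 224 × 224 target (the input size quoted for the model on this page), and scaling to [0, 1]. The real pipeline may use different interpolation and normalization constants, and `preprocess` is a hypothetical name.

```python
import numpy as np

def preprocess(frame, size=224):
    """frame: H x W x 3 uint8 screenshot -> 3 x size x size float32 in [0, 1]."""
    h, w, _ = frame.shape
    rows = np.arange(size) * h // size     # nearest source row for each output row
    cols = np.arange(size) * w // size     # nearest source column for each output column
    resized = frame[rows][:, cols]         # size x size x 3
    tensor = resized.astype(np.float32) / 255.0
    return np.transpose(tensor, (2, 0, 1))  # channels first

# A random 720p frame stands in for the captured render target.
frame = np.random.default_rng(0).integers(0, 256, size=(720, 1280, 3), dtype=np.uint8)
rgb = preprocess(frame)
```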

~3 s

End-to-end IR generation

From screenshot to a ready-to-convolve impulse response.

Real-time

Convolution + IR hot-swap

Right-trigger toggles convolution on/off mid-scene.

2 streams

Directional + global audio

Object-local sources plus a shared room-reverb bus.

04 · Inside Image2Reverb

How a picture becomes a room.

Image2Reverb model architecture: input image goes through Monodepth2 to produce an RGB+depth tensor, then through an encoder/generator pair to produce an IR spectrogram judged by a discriminator.

Figure · Image2Reverb architecture from Singh et al. (2021). The RGB image is paired with a Monodepth2 depth map to form a 4-channel input; the GAN learns to produce a matching IR spectrogram.
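
Forming the 4-channel input described in the caption amounts to a channel-wise concatenation. The sketch below assumes NumPy and a depth map already at the RGB resolution; the min-max rescaling of depth is our assumption for illustration, not something the Image2Reverb description here specifies, and `stack_rgbd` is a hypothetical helper.

```python
import numpy as np

def stack_rgbd(rgb, depth):
    """rgb: 3 x H x W, depth: H x W -> 4 x H x W float32."""
    if rgb.shape[1:] != depth.shape:
        raise ValueError("depth map must match the RGB resolution")
    # Assumed normalization: rescale depth to [0, 1] to match the colour range.
    span = depth.max() - depth.min()
    depth_norm = (depth - depth.min()) / span if span > 0 else np.zeros_like(depth)
    return np.concatenate([rgb, depth_norm[None]], axis=0).astype(np.float32)

rgb = np.random.default_rng(1).random((3, 224, 224))
# Placeholder depth values; in the real pipeline these come from Monodepth2.
depth = np.linspace(0.5, 20.0, 224 * 224).reshape(224, 224)
rgbd = stack_rgbd(rgb, depth)
```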

Depth estimation

Monodepth2 produces a per-pixel depth map, which is stacked with RGB to form a 4 × 224 × 224 input.

Scene encoder

ResNet50 pre-trained on Places365, fine-tuned end-to-end, emits a 365-d scene embedding.

Conditional GAN

Generator decodes (embedding ⊕ noise) into a 512 × 512 log-mel spectrogram.

Discriminator + T60 loss

LSGAN with ℓ1 + a differentiable T60 term (≈ JND for reverberation time).

Spectrogram → waveform

Griffin-Lim reconstructs a 5.94 s, 22.05 kHz waveform we then feed straight into the plugin.
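
T60, the quantity the loss term above targets, can be measured on any generated waveform. The sketch below uses Schroeder backward integration with a line fit over the -5 dB to -25 dB range, a standard measurement technique and not the differentiable term used in training. It assumes NumPy; the synthetic IR and the 1.5 s target are made up so the estimate can be checked.

```python
import numpy as np

def estimate_t60(ir, sr):
    """Estimate T60 in seconds from an impulse response sampled at sr Hz."""
    energy = ir.astype(np.float64) ** 2
    edc = np.cumsum(energy[::-1])[::-1]             # energy remaining after each sample
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-30)  # energy decay curve in dB
    # Fit a straight line to the -5 dB .. -25 dB part of the decay.
    idx = np.where((edc_db <= -5.0) & (edc_db >= -25.0))[0]
    times = idx / sr
    slope, _ = np.polyfit(times, edc_db[idx], 1)    # dB per second, negative
    return -60.0 / slope                            # time to fall by 60 dB

sr = 22050
t = np.arange(int(5.94 * sr)) / sr
target_t60 = 1.5
# Synthetic IR: noise under an envelope that drops 60 dB in target_t60 seconds.
rng = np.random.default_rng(0)
ir = rng.standard_normal(t.size) * 10.0 ** (-3.0 * t / target_t60)
t60 = estimate_t60(ir, sr)
```

Running the same function on the IRs returned for each demo zone would be one way to sanity-check them against the expected decay times.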

05 · Audio routing

Two paths: directional + global.

We split every sound the user hears into two simultaneous paths — one anchored to its source in space, one carrying the room's reverberant signature.

Directional audio

Anchored to the source.

HRTF-style spatialisation — the listener hears where the sound is coming from. The source itself stays dry or minimally processed so left/right and near/far cues survive.

Footsteps, NPC speech, object interactions — anything that belongs to a point in space.

// Example · a clap or footstep keeps its directional cues.

Global audio

Carries the room.

A shared "wet" bus — every source feeds the same room IR. The path carries the echo and the reverberant tail predicted by Image2Reverb, swapped instantly when the user enters a new acoustic zone.

// Example · the same clap acquires the cathedral's reverberant tail.

Final mix = HRTF-spatialised dry source + global ( source ∗ IR_room )
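
The mix formula can be sketched in a few lines. This is an illustration only: it assumes NumPy, replaces HRTF spatialisation with a simple constant-power pan as a stand-in, and the `pan` and `wet_gain` parameters are hypothetical.

```python
import numpy as np

def mix_two_paths(source, ir_room, pan=0.0, wet_gain=0.5):
    """Return a stereo (2 x N) mix of a panned dry path and a shared wet path.

    pan in [-1, 1]: -1 is hard left, +1 is hard right (constant-power law).
    """
    angle = (pan + 1.0) * np.pi / 4.0
    dry = np.stack([np.cos(angle) * source, np.sin(angle) * source])
    wet = np.convolve(source, ir_room)       # source * IR_room
    out = np.zeros((2, wet.size))
    out[:, :source.size] += dry              # directional path keeps its left/right cue
    out += wet_gain * wet                    # identical room signal in both channels
    return out

sr = 22050
source = np.zeros(sr // 10)
source[0] = 1.0                                       # a unit click
ir_room = np.exp(-np.arange(sr // 2) / (0.1 * sr))    # toy exponentially decaying IR
stereo = mix_two_paths(source, ir_room, pan=-1.0)
```

With the click panned hard left, the onset is louder in the left channel while the tail is the same in both, which is the split the two paths are meant to produce.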

06 · DSP plugin

The convolution engine.

Dry source → STFT → × H(f) → ISTFT → Wet

The plugin holds the active IR in memory and convolves it with each source frame in the STFT domain — a single complex-multiply per bin, much cheaper than running an ML model every frame.
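
Block-wise frequency-domain convolution of this kind is commonly done with overlap-add. The sketch below is in Python with NumPy for readability, whereas the actual plugin is C++; it uses a single FFT partition, and a production engine handling a multi-second IR would more likely use partitioned convolution. `OverlapAddConvolver` is a hypothetical name.

```python
import numpy as np

class OverlapAddConvolver:
    """Block-wise FFT convolution with a fixed IR (overlap-add)."""

    def __init__(self, ir, block_size):
        self.block_size = block_size
        # FFT size covers block_size + len(ir) - 1 samples, rounded up to a power of two.
        self.n_fft = 1 << (block_size + len(ir) - 2).bit_length()
        self.H = np.fft.rfft(ir, self.n_fft)          # IR spectrum, computed once
        self.tail = np.zeros(self.n_fft - block_size)

    def process(self, block):
        Y = np.fft.rfft(block, self.n_fft) * self.H   # one complex multiply per bin
        y = np.fft.irfft(Y, self.n_fft)
        y[: self.tail.size] += self.tail              # add the overlap from earlier blocks
        self.tail = y[self.block_size:].copy()        # carry the remainder forward
        return y[: self.block_size]

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
ir = rng.standard_normal(300) * np.exp(-np.arange(300) / 60.0)
conv = OverlapAddConvolver(ir, block_size=256)
y_stream = np.concatenate([conv.process(x[i:i + 256]) for i in range(0, x.size, 256)])
y_ref = np.convolve(x, ir)[: x.size]                  # offline reference
max_err = float(np.max(np.abs(y_stream - y_ref)))
```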

Zero-glitch hot-swap. When the backend returns a new IR, the buffer is replaced atomically on a frame boundary, so the user never hears a click during transitions. The right trigger toggles convolution on and off, which makes the difference audible on demand during the demo.
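
The swap-on-a-boundary idea can be sketched as a two-slot holder. This is a simplified Python illustration with hypothetical names; it uses a lock for clarity, whereas a real-time C++ audio thread would more likely use a lock-free pointer exchange.

```python
import threading

class IRSlot:
    """Holds the active IR; a new one takes effect only at the next block boundary."""

    def __init__(self, ir):
        self._active = ir
        self._pending = None
        self._lock = threading.Lock()

    def submit(self, ir):
        """Called from the network thread when the backend returns a new IR."""
        with self._lock:
            self._pending = ir

    def begin_block(self):
        """Called by the audio thread once per block, before any processing."""
        with self._lock:
            if self._pending is not None:
                self._active, self._pending = self._pending, None
        return self._active

slot = IRSlot(ir=[1.0, 0.5])
first = slot.begin_block()       # block 1 uses the original IR
slot.submit([1.0, 0.2, 0.1])     # arrives mid-block; block 1 is unaffected
second = slot.begin_block()      # block 2 picks up the new IR
third = slot.begin_block()       # no pending IR, so the active one is kept
```

Because the active IR only changes inside `begin_block`, a block is never processed with a half-updated buffer.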

Backend server

Framework: Flask · Python
GPU: RTX 4070 Ti Super
Latency: ~3 s / request
Trigger: 'A' button on controller
Returns: WAV impulse response
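
Packing an IR as a WAV payload needs only the Python standard library. The sketch below assumes mono 16-bit PCM at 22.05 kHz; the bit depth is our assumption, since the page only states that a WAV impulse response is returned, and both function names are hypothetical.

```python
import io
import struct
import wave

def ir_to_wav_bytes(samples, sample_rate=22050):
    """Encode float samples in [-1, 1] as a mono 16-bit PCM WAV file in memory."""
    pcm = struct.pack(
        "<%dh" % len(samples),
        *(int(max(-1.0, min(1.0, s)) * 32767) for s in samples),
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)            # 16-bit samples (assumed)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()

def wav_bytes_to_ir(data):
    """Decode the bytes produced above back into floats plus the sample rate."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        n = wav.getnframes()
        raw = wav.readframes(n)
        rate = wav.getframerate()
    return [v / 32767 for v in struct.unpack("<%dh" % n, raw)], rate

payload = ir_to_wav_bytes([1.0, -0.5, 0.25, 0.0])
decoded, rate = wav_bytes_to_ir(payload)
```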

[ Backend server screenshot ]

07 · Demo

Three rooms, one model.

Walking through a Unity scene built with three acoustically distinct zones — the same sources sound dramatically different in each one as the IR re-renders on demand.

// Walkthrough · YouTube · 3 scenes, one model

Cathedral T60 ≈ 3–4 s

Large stone interior — long, bright reverberant tail, dense early reflections.

Graveyard T60 ≈ 0.5–1 s

Semi-open outdoor — short tail with low-frequency emphasis, sparse reflections.

Forest T60 ≈ 0.2 s

Damped, diffuse — minimal tail, dry directional sources stay crisp.

08 · Looking forward

Where this goes next.

Cut the latency.

3 s is workable but not invisible. Targets: model distillation, a smaller GAN decoder, and server-side caching of IRs by scene embedding so revisits are instant.

→ Performance
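
One possible shape for the caching idea, as a sketch of future work and not existing code: store each generated IR with its scene embedding and reuse it when a new embedding is close enough. The cosine-similarity measure, the 0.95 threshold, and the `IRCache` class are all assumptions for illustration. The encoder still has to run to produce the embedding, so the saving would come from skipping the generator and waveform reconstruction.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class IRCache:
    """Reuse a stored IR when a new scene embedding is close to a cached one."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold     # assumed value; would need tuning
        self.entries = []              # list of (embedding, ir) pairs

    def lookup(self, embedding):
        best = max(self.entries, key=lambda e: cosine(e[0], embedding), default=None)
        if best is not None and cosine(best[0], embedding) >= self.threshold:
            return best[1]
        return None

    def store(self, embedding, ir):
        self.entries.append((embedding, ir))

cache = IRCache()
cache.store([1.0, 0.0, 0.0], "cathedral_ir")   # placeholder for real IR data
hit = cache.lookup([0.99, 0.05, 0.0])          # nearly the same scene
miss = cache.lookup([0.0, 1.0, 0.0])           # a different scene
```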

360° screenshots.

A single rectilinear view biases the IR toward what's on-camera. Sampling a panoramic capture and aggregating predictions should give a much more faithful estimate — needs paired 360°/IR data and fine-tuning.

→ Accuracy

Automatic scene detection.

Right now the user presses 'A' to re-sample. We'd like periodic captures plus an image-embedding similarity check, so the IR re-renders only when the scene actually changes.

→ Autonomy
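
The similarity check could look roughly like the sketch below. It is a sketch of the proposed behaviour, not something the prototype does today: the embedding source, the cosine measure, the 0.8 threshold, and the `SceneChangeDetector` name are all assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SceneChangeDetector:
    """Request a new IR only when a periodic capture differs from the reference."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold   # assumed value; would need tuning
        self.reference = None        # embedding of the capture behind the current IR

    def should_regenerate(self, embedding):
        if (self.reference is None
                or cosine_similarity(self.reference, embedding) < self.threshold):
            self.reference = embedding
            return True
        return False

detector = SceneChangeDetector()
decisions = [detector.should_regenerate(e) for e in (
    [1.0, 0.0],    # first capture: always generate
    [0.98, 0.1],   # same room, slightly different view
    [0.1, 1.0],    # new zone
    [0.0, 1.0],    # still the new zone
)]
```

Only the first and third captures trigger a backend request; the other two reuse the IR already loaded.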

09 · Team

Individual contributions.

Four work-streams: ML backend, Unity plugin, audio system, demo scene. Each member led one area and contributed to the others.

Anushk

Team member

ML Backend — Image2Reverb pipeline, Flask server
Unity Plugin — STFT convolution, IR hot-swap
Audio System — directional + global routing
Demo Scene — Unity authoring, controller input

Dahong

Team member

ML Backend — Image2Reverb pipeline, Flask server
Unity Plugin — STFT convolution, IR hot-swap
Audio System — directional + global routing
Demo Scene — Unity authoring, controller input

Parth

Team member

ML Backend — Image2Reverb pipeline, Flask server
Unity Plugin — STFT convolution, IR hot-swap
Audio System — directional + global routing
Demo Scene — Unity authoring, controller input
