CMSC731 · Final Project · Spring 2026

Interactive Image-Guided Acoustic Authoring in VR.

From what you see to what you hear — synthesizing room acoustics on the fly from a single in-game screenshot. We pair a generative model with a real-time Unity audio plugin so the world sounds like the space the player is standing in.

Team: Anushk · Dahong · Parth
Institution: University of Maryland
Stack: PyTorch · Flask · Unity · C++ DSP
Status: Working end-to-end prototype

01 · The problem

Realistic VR audio is the half nobody finishes.

Immersion in VR depends on more than visuals — but every existing path to good room acoustics forces a tradeoff. Either creators tune for hours, or the headset's GPU burns through frames doing physics, or the model's too heavy to run live.

Manual

Plugin-based reverb.

Tools like Steam Audio handle reverb and transmission, but every parameter — decay, absorption, room size — must be hand-tuned per scene.

Compute-heavy

Physics-based 3D audio.

Geometric and ray-traced acoustics produce realistic results, but cost is high and scales poorly with scene complexity.

Hardware-bound

End-to-end ML inference.

Models can predict realistic acoustics directly, but running them every frame requires a powerful GPU on the headset side.

02 · The approach

Predict the room, then apply it.

An impulse response (IR) is a short audio recording of how a space responds to a single broadband stimulus: a clap, a balloon pop, a sine sweep. For a given source and listener position it captures how the room shapes sound: the direct path, the early reflections, and the long reverberant tail.

Once you have it, applying it is cheap: one convolution per audio source, computed as a multiplication in the STFT domain. The expensive part is getting the right IR for the room, and that's exactly where the ML model earns its keep.

Our system uses Image2Reverb to generate an IR for whatever the user is looking at, on demand. Heavy work happens once. Light work happens always.

Time-domain convolution

y(t) = x(t) ∗ h(t)

x(t): dry source
h(t): impulse response
y(t): reverberant output

In practice we move into the STFT domain — multiplication is much cheaper than full-length convolution, and the IR can be swapped between frames without audible clicks.
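
The equivalence behind that shortcut can be checked numerically. The sketch below is illustrative only and is not the plugin's code: it uses one whole-signal FFT rather than a framed STFT, the signals are synthetic stand-ins, and the function names are hypothetical.

```python
import numpy as np

def convolve_direct(x, h):
    """Time-domain convolution: y[n] = sum_k x[k] * h[n - k]."""
    y = np.zeros(len(x) + len(h) - 1)
    for k, xk in enumerate(x):
        y[k:k + len(h)] += xk * h   # each input sample launches a scaled copy of h
    return y

def convolve_fft(x, h):
    """Same result via pointwise multiplication of the two spectra."""
    n = len(x) + len(h) - 1              # length of the full linear convolution
    n_fft = 1 << (n - 1).bit_length()    # next power of two, avoids circular wrap-around
    X = np.fft.rfft(x, n_fft)
    H = np.fft.rfft(h, n_fft)
    return np.fft.irfft(X * H, n_fft)[:n]

rng = np.random.default_rng(0)
x = rng.standard_normal(256)                                   # stand-in for a dry source
h = np.exp(-np.arange(64) / 16.0) * rng.standard_normal(64)    # toy decaying IR

y_direct = convolve_direct(x, h)
y_fft = convolve_fft(x, h)
```

The two outputs agree to floating-point precision. The direct loop costs on the order of len(x) × len(h) multiplies, while the FFT route grows roughly as N log N, which is why long IRs favour the frequency domain.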

03 · System

End-to-end pipeline.

Screenshot → preprocessing → Image2Reverb on a Flask backend → IR → custom convolution plugin → spatialised audio in Unity.

Step 01

VR Capture

Press 'A' on the controller. The current frame is captured directly from the headset's render target.

Step 02

Preprocess

Resize, normalize, encode to an RGB tensor matching the model's expected input.

Step 03

Flask Backend

Python service on an RTX 4070 Ti Super accepts the frame and runs inference.

Step 04

Image2Reverb

Cross-modal conditional GAN emits a log-mel spectrogram of the predicted IR.

Step 05

Custom Plugin

STFT convolution applies the IR; a hot-swap mechanism switches without artifacts.

Step 06

Spatial Output

Directional sources stay spatialised; a global bus carries the room's reverberant signature.
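
As a rough illustration of Step 02, the sketch below resizes a captured frame, scales it, and reorders it into a channels-first tensor. It is not the project's preprocessing code. The 224 × 224 target size comes from the model input described in section 04; the nearest-neighbour resize, the `mean`/`std` values, and the function name are placeholder assumptions.

```python
import numpy as np

def preprocess(frame, size=224, mean=0.5, std=0.5):
    """frame: H x W x 3 uint8 RGB screenshot -> 3 x size x size float32 array.

    mean and std are placeholder values; the real model may expect others.
    """
    h, w, _ = frame.shape
    # Nearest-neighbour resize: pick one source row and column per output pixel.
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = frame[rows][:, cols]
    scaled = resized.astype(np.float32) / 255.0   # bytes -> [0, 1]
    normalized = (scaled - mean) / std            # [0, 1] -> [-1, 1] with the defaults
    return normalized.transpose(2, 0, 1)          # HWC -> CHW

# Synthetic 1280 x 720 frame standing in for a headset capture.
frame = np.random.default_rng(0).integers(0, 256, size=(720, 1280, 3), dtype=np.uint8)
tensor = preprocess(frame)
```

A production path would more likely use a filtered resize (bilinear or area) and whatever normalisation the pretrained encoder was trained with.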

~3 s · End-to-end IR generation
From screenshot to a ready-to-convolve impulse response.

Real-time · Convolution + IR hot-swap
Right-trigger toggles convolution on/off mid-scene.

2 streams · Directional + global audio
Object-local sources plus a shared room-reverb bus.

04 · Inside Image2Reverb

How a picture becomes a room.

Image2Reverb model architecture: input image goes through Monodepth2 to produce an RGB+depth tensor, then through an encoder/generator pair to produce an IR spectrogram judged by a discriminator.
Figure · Image2Reverb architecture from Singh et al. (2021). The RGB image is paired with a Monodepth2 depth map to form a 4-channel input; the GAN learns to produce a matching IR spectrogram.
  1. Depth estimation

    Monodepth2 produces a per-pixel depth map. Stacked with RGB to form a 4 × 224 × 224 input.

  2. Scene encoder

    ResNet50 pre-trained on Places365, fine-tuned end-to-end, emits a 365-d scene embedding.

  3. Conditional GAN

    Generator decodes (embedding ⊕ noise) into a 512 × 512 log-mel spectrogram.

  4. Discriminator + T60 loss

    LSGAN with ℓ1 + a differentiable T60 term (≈ JND for reverberation time).

  5. Spectrogram → waveform

    Griffin-Lim reconstructs a 5.94 s, 22.05 kHz waveform we then feed straight into the plugin.
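
T60 is the time a room's sound energy takes to decay by 60 dB. One common way to estimate it from a waveform like the one produced in step 5 is Schroeder backward integration, sketched below. This is a plain offline estimator, not the differentiable T60 term used in training; the synthetic IR and the −5 dB to −25 dB fit range are assumptions for illustration. The duration and sample rate match the figures above.

```python
import numpy as np

def estimate_t60(ir, sample_rate):
    """Estimate T60 in seconds from an impulse response via Schroeder integration."""
    energy = ir.astype(np.float64) ** 2
    # Energy decay curve: energy remaining after each sample, in dB relative to the total.
    edc = np.cumsum(energy[::-1])[::-1]
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)
    # Fit a line to the -5 dB .. -25 dB span and extrapolate to -60 dB.
    idx = np.where((edc_db <= -5.0) & (edc_db >= -25.0))[0]
    times = idx / sample_rate
    slope, _intercept = np.polyfit(times, edc_db[idx], 1)   # dB per second, negative
    return -60.0 / slope

sample_rate = 22050
t = np.arange(int(5.94 * sample_rate)) / sample_rate
target_t60 = 1.5
# Synthetic IR: noise whose amplitude envelope falls 60 dB in target_t60 seconds.
envelope = 10.0 ** (-3.0 * t / target_t60)
ir = np.random.default_rng(0).standard_normal(t.size) * envelope

estimated = estimate_t60(ir, sample_rate)
```

Running the estimator on a generated IR gives a quick sanity check against the room it was predicted for.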

05 · Audio routing

Two paths: directional + global.

We split every sound the user hears into two simultaneous paths — one anchored to its source in space, one carrying the room's reverberant signature.

Directional audio

Anchored to the source.

HRTF-style spatialisation — the listener hears where the sound is coming from. The source itself stays dry or minimally processed so left/right and near/far cues survive.

Footsteps, NPC speech, object interactions — anything that belongs to a point in space.

// Example · a clap or footstep keeps its directional cues.
Global audio

Carries the room.

A shared "wet" bus — every source feeds the same room IR. The path carries the echo and the reverberant tail predicted by Image2Reverb, swapped instantly when the user enters a new acoustic zone.

// Example · the same clap acquires the cathedral's reverberant tail.
Final mix = HRTF-spatialised dry source + global ( source ∗ IR_room )
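
The mix above can be sketched in a few lines. This is an illustration under assumptions, not the project's routing code: a constant-power stereo pan stands in for HRTF spatialisation, `wet_gain` is a hypothetical parameter, and all sources are assumed to be the same length.

```python
import numpy as np

def pan(mono, azimuth):
    """Constant-power stereo pan (crude stand-in for HRTF). azimuth in [-1, 1]."""
    theta = (azimuth + 1.0) * np.pi / 4.0          # -1 -> hard left, +1 -> hard right
    return np.stack([np.cos(theta) * mono, np.sin(theta) * mono])

def final_mix(sources, azimuths, room_ir, wet_gain=0.3):
    """Directional dry path plus one shared wet bus. Returns a 2 x N array."""
    n = len(sources[0]) + len(room_ir) - 1
    dry = np.zeros((2, n))
    for src, az in zip(sources, azimuths):
        dry[:, :len(src)] += pan(src, az)          # each source keeps its own direction
    bus = np.sum(sources, axis=0)                  # every source feeds the same bus
    wet = np.convolve(bus, room_ir)                # one convolution for the whole bus
    return dry + wet_gain * wet                    # wet is added equally to both channels

rng = np.random.default_rng(0)
clap = rng.standard_normal(100)
footstep = rng.standard_normal(100)
room_ir = np.exp(-np.arange(50) / 10.0)            # toy decaying IR
mix = final_mix([clap, footstep], [-0.5, 0.8], room_ir)
```

Because convolution is linear, convolving the summed bus once gives the same wet signal as convolving each source separately and adding the results. That is what keeps the global path cheap as the number of sources grows.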
06 · DSP plugin

The convolution engine.

Dry source → STFT → × H(f) → ISTFT → Wet

The plugin holds the active IR in memory and convolves it with each source frame in the STFT domain — a single complex-multiply per bin, much cheaper than running an ML model every frame.
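
A streaming version of that idea is overlap-add block convolution, sketched below in Python for readability. It is not the C++ plugin: the class name and block size are hypothetical, and it transforms the IR in a single piece, whereas an engine handling multi-second IRs would more likely split the IR into partitions to bound latency.

```python
import numpy as np

class OverlapAddConvolver:
    """Convolves fixed-size audio blocks with one IR, carrying the tail between blocks."""

    def __init__(self, ir, block_size):
        self.block_size = block_size
        n = block_size + len(ir) - 1                     # linear-convolution length per block
        self.n_fft = 1 << (n - 1).bit_length()
        self.H = np.fft.rfft(ir, self.n_fft)             # IR spectrum, computed once
        self.tail = np.zeros(self.n_fft - block_size)    # output spilling past each block

    def process(self, block):
        # One complex multiply per frequency bin.
        y = np.fft.irfft(np.fft.rfft(block, self.n_fft) * self.H, self.n_fft)
        y[:len(self.tail)] += self.tail                  # add what earlier blocks left behind
        self.tail = y[self.block_size:].copy()           # save what spills into later blocks
        return y[:self.block_size]

rng = np.random.default_rng(0)
ir = np.exp(-np.arange(300) / 60.0) * rng.standard_normal(300)   # toy IR
x = rng.standard_normal(1024)                                    # toy dry signal

conv = OverlapAddConvolver(ir, block_size=128)
blocks = [conv.process(x[i:i + 128]) for i in range(0, len(x), 128)]
streamed = np.concatenate(blocks)
```

The streamed output matches the first 1024 samples of an offline convolution of the whole signal, so block processing adds no error beyond floating-point rounding.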

Zero-glitch hot-swap. When the backend returns a new IR, the buffer is replaced atomically on a frame boundary, so the user never hears a click during transitions. The right trigger toggles convolution on and off, which makes the difference audible on demand during the demo.
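
The swap-on-a-boundary idea can be sketched as a small holder object. This is an assumption-laden illustration, not the plugin's mechanism: the class and method names are hypothetical, and it uses a lock for clarity, where a native audio thread would more likely use a lock-free pointer exchange.

```python
import threading

class HotSwapIR:
    """Holds the active IR. A new IR is staged by any thread and adopted
    only at the start of the next audio block, never in the middle of one."""

    def __init__(self, initial_ir):
        self._active = initial_ir
        self._pending = None
        self._lock = threading.Lock()
        self.enabled = True            # hypothetical flag mirroring the right-trigger toggle

    def submit(self, new_ir):
        """Called from the network thread when the backend returns a new IR."""
        with self._lock:
            self._pending = new_ir

    def begin_block(self):
        """Called by the audio thread once per block. Returns the IR to use, or None if bypassed."""
        with self._lock:
            if self._pending is not None:
                self._active = self._pending
                self._pending = None
        return self._active if self.enabled else None

holder = HotSwapIR(initial_ir=[1.0])
before = holder.begin_block()          # still the initial IR
holder.submit([0.5, 0.25])             # backend response arrives mid-block
after = holder.begin_block()           # adopted at the next boundary
holder.enabled = False
bypassed = holder.begin_block()        # convolution toggled off
```

Because the audio thread reads the IR only once per block, a submission that arrives mid-block cannot change the filter partway through a frame.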

Backend server

Framework: Flask · Python
GPU: RTX 4070 Ti Super
Latency: ~3 s / request
Trigger: 'A' button on controller
Returns: WAV impulse response
[ Backend server screenshot ]
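
For the "Returns" row, here is a hedged sketch of how a predicted IR could be packed into WAV bytes for an HTTP response, using only the Python standard library. The function name, the 16-bit mono format, and the clipping policy are assumptions. The 22.05 kHz default matches the waveform rate quoted in section 04.

```python
import io
import struct
import wave

def ir_to_wav_bytes(samples, sample_rate=22050):
    """Encode float samples in [-1, 1] as 16-bit mono PCM WAV bytes."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]          # guard against overflow
    ints = [int(round(s * 32767)) for s in clipped]
    pcm = struct.pack("<%dh" % len(ints), *ints)                 # little-endian int16
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)                                      # 2 bytes = 16 bits
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()

# Toy IR: a unit impulse followed by two quieter reflections.
toy_ir = [1.0, 0.0, 0.0, 0.5, 0.0, -0.25]
wav_bytes = ir_to_wav_bytes(toy_ir)

# Round trip: read the header back to confirm the format.
with wave.open(io.BytesIO(wav_bytes), "rb") as check:
    n_frames = check.getnframes()
    rate = check.getframerate()
```

A Flask route could return these bytes with an `audio/wav` content type. A float WAV format would avoid the 16-bit quantisation if the client can read it.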
07 · Demo

Three rooms, one model.

Walking through a Unity scene built with three acoustically distinct zones — the same sources sound dramatically different in each one as the IR re-renders on demand.

// Walkthrough · YouTube · 3 scenes, one model
Cathedral · T60 ≈ 3–4 s

Large stone interior — long, bright reverberant tail, dense early reflections.

Graveyard · T60 ≈ 0.5–1 s

Semi-open outdoor — short tail with low-frequency emphasis, sparse reflections.

Forest · T60 ≈ 0.2 s

Damped, diffuse — minimal tail, dry directional sources stay crisp.

08 · Looking forward

Where this goes next.

01

Cut the latency.

3 s is workable but not invisible. Targets: model distillation, a smaller GAN decoder, and server-side caching of IRs by scene embedding so revisits are instant.

→ Performance
02

360° screenshots.

A single rectilinear view biases the IR toward what's on-camera. Sampling a panoramic capture and aggregating predictions should give a much more faithful estimate — needs paired 360°/IR data and fine-tuning.

→ Accuracy
03

Automatic scene detection.

Right now the user presses 'A' to re-sample. We'd like periodic captures plus an image-embedding similarity check, so the IR re-renders only when the scene actually changes.

→ Autonomy
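
Items 01 and 03 both come down to comparing scene embeddings. The sketch below shows one possible shape for that: a cosine-similarity lookup that returns a cached IR when a new embedding is close enough to a stored one. None of this exists in the prototype. The class name, the 0.95 threshold, and the linear scan are all assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class IRCache:
    """Maps scene embeddings to IRs, matching by similarity rather than exact equality."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold     # hypothetical value, would need tuning
        self.entries = []              # list of (embedding, ir) pairs

    def store(self, embedding, ir):
        self.entries.append((list(embedding), ir))

    def lookup(self, embedding):
        """Return the IR of the most similar stored scene, or None on a miss."""
        best_ir, best_sim = None, self.threshold
        for stored, ir in self.entries:
            sim = cosine_similarity(embedding, stored)
            if sim >= best_sim:
                best_ir, best_sim = ir, sim
        return best_ir

cache = IRCache()
cache.store([1.0, 0.0, 0.0], "cathedral_ir")      # strings stand in for real IR buffers
hit = cache.lookup([0.99, 0.05, 0.0])             # nearly the same view
miss = cache.lookup([0.0, 1.0, 0.0])              # a different scene
```

A hit means the scene has not changed enough to justify a new inference, so the same check could gate periodic captures. A miss triggers a fresh request and a `store` of the result.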
09 · Team

Individual contributions.

Four work-streams: ML backend, Unity plugin, audio system, demo scene. Each member led one area and contributed to the others.

Anushk
Team member
  • ML Backend — Image2Reverb pipeline, Flask server
  • Unity Plugin — STFT convolution, IR hot-swap
  • Audio System — directional + global routing
  • Demo Scene — Unity authoring, controller input
Dahong
Team member
  • ML Backend — Image2Reverb pipeline, Flask server
  • Unity Plugin — STFT convolution, IR hot-swap
  • Audio System — directional + global routing
  • Demo Scene — Unity authoring, controller input
Parth
Team member
  • ML Backend — Image2Reverb pipeline, Flask server
  • Unity Plugin — STFT convolution, IR hot-swap
  • Audio System — directional + global routing
  • Demo Scene — Unity authoring, controller input