Plugin-based reverb.
Tools like Steam Audio handle reverb and transmission, but every parameter — decay, absorption, room size — must be hand-tuned per scene.
From what you see to what you hear — synthesizing room acoustics on the fly from a single in-game screenshot. We pair a generative model with a real-time Unity audio plugin so the world sounds like the space the player is standing in.
Immersion in VR depends on more than visuals — but every existing path to good room acoustics forces a tradeoff. Either creators tune for hours, or the headset's GPU burns through frames doing physics, or the model's too heavy to run live.
Geometric and ray-traced acoustics produce realistic results, but their compute cost is high and scales poorly with scene complexity.
ML models can predict realistic acoustics directly, but running them every frame requires a powerful GPU on the headset side.
An impulse response (IR) is a short audio recording of how a space responds to a single broadband stimulus such as a clap, a balloon pop, or a sine sweep. It encodes the room's full linear response for that source and listener position: the direct path, the early reflections, and the long reverberant tail.
Once you have it, applying it is cheap: one convolution per audio source, carried out as a multiplication in the STFT domain. The expensive part is getting the right IR for the room, and that's exactly where the ML model earns its keep.
Our system uses Image2Reverb to generate an IR for whatever the user is looking at, on demand. Heavy work happens once. Light work happens always.
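y(t) = x(t) ∗ h(t), where x is the dry source signal, h is the room's impulse response, ∗ denotes convolution, and y is the reverberant output.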
In practice we move into the STFT domain — multiplication is much cheaper than full-length convolution, and the IR can be swapped between frames without audible clicks.
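To make the equivalence concrete, here is a minimal sketch showing that convolving in the time domain and multiplying spectra give the same output. It uses a small hand-written radix-2 FFT so it runs without dependencies; the signal and IR values are made up for illustration, and a real plugin would use an optimized FFT library and process audio block by block.

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(spectrum):
    """Inverse FFT via the conjugation trick."""
    n = len(spectrum)
    y = fft([v.conjugate() for v in spectrum])
    return [v.conjugate() / n for v in y]

def direct_convolve(x, h):
    """Time-domain convolution: O(len(x) * len(h)) multiply-adds."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def fft_convolve(x, h):
    """Zero-pad both signals, multiply their spectra bin by bin, transform back."""
    out_len = len(x) + len(h) - 1
    n = 1
    while n < out_len:
        n *= 2
    X = fft(list(x) + [0.0] * (n - len(x)))
    H = fft(list(h) + [0.0] * (n - len(h)))
    Y = [a * b for a, b in zip(X, H)]
    return [v.real for v in ifft(Y)[:out_len]]

x = [1.0, 0.5, -0.25, 0.0, 0.75]   # dry source (illustrative values)
h = [1.0, 0.0, 0.6, 0.3]           # toy impulse response
y_time = direct_convolve(x, h)
y_freq = fft_convolve(x, h)
max_err = max(abs(a - b) for a, b in zip(y_time, y_freq))
```

The two results agree up to floating-point rounding. The zero-padding to at least len(x) + len(h) - 1 samples matters: without it the product of spectra corresponds to circular rather than linear convolution.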
Screenshot → preprocessing → Image2Reverb on a Flask backend → IR → custom convolution plugin → spatialized audio in Unity.
Press 'A' on the controller. The current frame is captured directly from the headset's render target.
Resize, normalize, encode to an RGB tensor matching the model's expected input.
Python service on an RTX 4070 Ti Super accepts the frame and runs inference.
Cross-modal conditional GAN emits a log-mel spectrogram of the predicted IR.
STFT convolution applies the IR; a hot-swap mechanism switches without artifacts.
Directional sources stay spatialized; a global bus carries the room's reverberant signature.
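As a rough illustration of the preprocessing stage in the pipeline above, the sketch below resizes a captured frame and lays it out as a channels-first RGB tensor. Everything specific here is an assumption: the function name `preprocess`, the nearest-neighbour resize, and the scaling to [0, 1] are placeholders, and the real backend presumably uses an image library and whatever normalization the model was trained with.

```python
def preprocess(frame, size=224):
    """Turn an H x W frame of (r, g, b) tuples (0-255) into a
    3 x size x size nested list, channels first, scaled to [0, 1].

    Nearest-neighbour resampling and [0, 1] scaling are assumptions
    made for this sketch, not the project's confirmed settings.
    """
    src_h, src_w = len(frame), len(frame[0])
    tensor = [[[0.0] * size for _ in range(size)] for _ in range(3)]
    for y in range(size):
        sy = min(src_h - 1, y * src_h // size)      # source row for this output row
        for x in range(size):
            sx = min(src_w - 1, x * src_w // size)  # source column
            pixel = frame[sy][sx]
            for c in range(3):
                tensor[c][y][x] = pixel[c] / 255.0
    return tensor

# Hypothetical 640 x 360 capture filled with a single orange colour.
frame = [[(255, 128, 0)] * 640 for _ in range(360)]
tensor = preprocess(frame)
```

The output shape is 3 × 224 × 224, matching the RGB part of the model input before any extra channel is stacked on.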
Monodepth2 produces a per-pixel depth map, which is stacked with the RGB channels to form a 4 × 224 × 224 input.
ResNet50 pre-trained on Places365, fine-tuned end-to-end, emits a 365-d scene embedding.
Generator decodes (embedding ⊕ noise) into a 512 × 512 log-mel spectrogram.
The generator is trained with an LSGAN adversarial loss plus an ℓ1 term and a differentiable T60 term (≈ JND for reverberation time).
Griffin-Lim reconstructs a 5.94 s, 22.05 kHz waveform we then feed straight into the plugin.
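For readers unfamiliar with T60, the sketch below estimates it from an impulse response using Schroeder backward integration, a standard measurement technique. This is a plain, non-differentiable estimator for checking a generated IR, not the loss term used during training. The synthetic IR (a pure exponential decay with a 1.5 s T60) and the -5 to -25 dB fitting range are assumptions chosen for the example; only the 5.94 s length and 22.05 kHz rate come from the text.

```python
import math

def schroeder_t60(ir, sample_rate, hi_db=-5.0, lo_db=-25.0):
    """Estimate T60 by fitting a line to the energy decay curve (EDC)."""
    # Backward integration: EDC[i] = energy remaining from sample i onward.
    edc = [0.0] * len(ir)
    running = 0.0
    for i in range(len(ir) - 1, -1, -1):
        running += ir[i] * ir[i]
        edc[i] = running
    total = edc[0]

    # Collect the part of the curve between hi_db and lo_db.
    times, levels = [], []
    for i, e in enumerate(edc):
        if e <= 0.0:
            break
        level = 10.0 * math.log10(e / total)
        if level < lo_db:
            break
        if level <= hi_db:
            times.append(i / sample_rate)
            levels.append(level)

    # Least-squares slope in dB per second, extrapolated to a 60 dB drop.
    n = len(times)
    mean_t = sum(times) / n
    mean_l = sum(levels) / n
    slope = (sum((t - mean_t) * (l - mean_l) for t, l in zip(times, levels))
             / sum((t - mean_t) ** 2 for t in times))
    return -60.0 / slope

sample_rate = 22050
num_samples = round(5.94 * sample_rate)
true_t60 = 1.5                                  # assumed value for the toy IR
decay = 3.0 * math.log(10.0) / true_t60         # amplitude falls 60 dB in true_t60 s
ir = [math.exp(-decay * n / sample_rate) for n in range(num_samples)]
estimated_t60 = schroeder_t60(ir, sample_rate)
```

On this idealized decay the estimate recovers the 1.5 s used to build the IR; real or generated IRs have a noise floor and uneven decay, so the fitting range has a visible effect on the result.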
We split every sound the user hears into two simultaneous paths — one anchored to its source in space, one carrying the room's reverberant signature.
HRTF-style spatialization: the listener hears where the sound is coming from. The source itself stays dry or minimally processed so left/right and near/far cues survive.
Footsteps, NPC speech, object interactions — anything that belongs to a point in space.
A shared "wet" bus — every source feeds the same room IR. The path carries the echo and the reverberant tail predicted by Image2Reverb, swapped instantly when the user enters a new acoustic zone.
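The two-path split can be sketched as follows. Each source is placed in the stereo field on its own, while all sources are summed into one bus that is convolved with the room IR once. Because convolution is linear, convolving the summed bus gives the same result as convolving every source separately, which is why one convolution can serve the whole scene. Constant-power panning stands in for HRTF spatialization here, and the names `render`, `pan`, and `wet_gain` are hypothetical.

```python
import math

def convolve(x, h):
    """Plain time-domain convolution, kept simple for the sketch."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def pan(signal, azimuth):
    """Constant-power pan; azimuth runs from -1 (left) to +1 (right).
    A stand-in for HRTF processing, which is far more involved."""
    angle = (azimuth + 1.0) * math.pi / 4.0
    gain_l, gain_r = math.cos(angle), math.sin(angle)
    return [s * gain_l for s in signal], [s * gain_r for s in signal]

def render(sources, room_ir, wet_gain=0.5):
    """sources: list of (mono_signal, azimuth); all signals share one length."""
    n = len(sources[0][0])
    out_len = n + len(room_ir) - 1
    left, right = [0.0] * out_len, [0.0] * out_len
    bus = [0.0] * n
    for signal, azimuth in sources:
        dry_l, dry_r = pan(signal, azimuth)      # direct path: positioned, dry
        for i in range(n):
            left[i] += dry_l[i]
            right[i] += dry_r[i]
            bus[i] += signal[i]                  # every source feeds the shared bus
    wet = convolve(bus, room_ir)                 # one convolution for the whole scene
    for i in range(out_len):
        left[i] += wet_gain * wet[i]
        right[i] += wet_gain * wet[i]
    return left, right

footsteps = [1.0, 0.0, 0.0, 0.0]    # toy signals
voice = [0.0, 0.5, 0.0, 0.0]
room_ir = [1.0, 0.0, 0.5]           # toy IR with a single echo
left, right = render([(footsteps, -1.0), (voice, 1.0)], room_ir)
```

In this toy scene the footsteps appear only in the left channel's dry signal and the voice only in the right, while both channels carry the same reverberant contribution from the bus.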
The plugin holds the active IR in memory and convolves it with each source frame in the STFT domain — a single complex-multiply per bin, much cheaper than running an ML model every frame.
Zero-glitch hot-swap. When the backend returns a new IR, the buffer is replaced atomically on a frame boundary, so the user never hears a click during transitions. The right trigger toggles convolution on and off, which makes the difference audible on demand during the demo.
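One way to picture the hot-swap is a convolver that only picks up a new IR at the start of an audio block. The sketch below is an assumption-laden illustration, not the plugin's code: the class and method names are hypothetical, the convolution is done in the time domain for brevity where the plugin works in the STFT domain, and a lock guards the hand-off where a native plugin would more likely use an atomic pointer exchange.

```python
import threading

class HotSwapConvolver:
    """Block-based convolver whose IR can be replaced between blocks."""

    def __init__(self, ir):
        self._lock = threading.Lock()
        self._active_ir = list(ir)
        self._pending_ir = None
        self._tail = []          # reverb that spills into future blocks

    def submit_ir(self, ir):
        """Called from the network thread when the backend returns a new IR."""
        with self._lock:
            self._pending_ir = list(ir)

    def process_block(self, block):
        """Called from the audio thread once per block."""
        # The swap happens only here, at a block boundary.
        with self._lock:
            if self._pending_ir is not None:
                self._active_ir = self._pending_ir
                self._pending_ir = None

        ir = self._active_ir
        y = [0.0] * (len(block) + len(ir) - 1)
        for i, s in enumerate(block):
            for j, hj in enumerate(ir):
                y[i + j] += s * hj

        # Overlap-add: mix in the tail left over from earlier blocks.
        for i, t in enumerate(self._tail):
            if i < len(y):
                y[i] += t
            else:
                y.append(t)

        self._tail = y[len(block):]
        return y[:len(block)]

conv = HotSwapConvolver([1.0, 0.0, 0.0, 0.0, 0.5])    # toy IR: echo after 4 samples
out1 = conv.process_block([1.0, 0.0, 0.0, 0.0])
conv.submit_ir([0.25])                                 # new "room" arrives mid-stream
out2 = conv.process_block([1.0, 0.0, 0.0, 0.0])
```

In this sketch the second block is processed with the new IR, but the echo generated by the first block under the old IR still rings out through the overlap buffer, so the old room is not cut off abruptly at the swap.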
Walking through a Unity scene built with three acoustically distinct zones — the same sources sound dramatically different in each one as the IR re-renders on demand.
Large stone interior — long, bright reverberant tail, dense early reflections.
Semi-open outdoor — short tail with low-frequency emphasis, sparse reflections.
Damped, diffuse — minimal tail, dry directional sources stay crisp.
A 3 s wait for a new IR is workable but not invisible. Targets: model distillation, a smaller GAN decoder, and server-side caching of IRs by scene embedding so revisits are instant.
A single rectilinear view biases the IR toward what's on-camera. Sampling a panoramic capture and aggregating predictions should give a much more faithful estimate — needs paired 360°/IR data and fine-tuning.
Right now the user presses 'A' to re-sample. We'd like periodic captures plus an image-embedding similarity check, so the IR re-renders only when the scene actually changes.
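The proposed similarity check could be as simple as comparing each new image embedding against the one that produced the current IR. The sketch below is hypothetical throughout: the class name, the 0.9 cosine threshold, and the three-dimensional toy embeddings are placeholders, and a suitable threshold would have to be tuned on real scenes.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

class SceneChangeGate:
    """Decides whether a periodic capture warrants a new IR."""

    def __init__(self, threshold=0.9):      # threshold is an assumed value
        self.threshold = threshold
        self.reference = None                # embedding behind the current IR

    def should_regenerate(self, embedding):
        if (self.reference is None
                or cosine_similarity(embedding, self.reference) < self.threshold):
            self.reference = list(embedding)
            return True
        return False

gate = SceneChangeGate()
first = gate.should_regenerate([1.0, 0.0, 0.0])      # no IR yet
same = gate.should_regenerate([0.99, 0.05, 0.0])     # small head movement
moved = gate.should_regenerate([0.0, 1.0, 0.0])      # different space
```

The reference embedding is updated only when a regeneration is triggered, so slow drift accumulates against the view that produced the current IR rather than against the previous capture.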
Four work-streams: ML backend, Unity plugin, audio system, demo scene. Each member led one area and contributed to the others.