We present WaveFormer, a framework that reconstructs dense, temporally consistent 3D sea surfaces from standard monocular video. Unlike rigid terrain, water surfaces are optically complex and lack static features; our method solves this persistent photogrammetric challenge without requiring cumbersome stereo rigs.
Our approach leverages a Spatio-Temporal Transformer to learn the predictable hydrodynamic laws governing wave motion. Observing that deep regression models inherently suffer from a smoothing bias, we propose a Wave Texture Refinement Module to recover fine-grained ripples and high-frequency geometry. To bridge the domain gap between simulation and reality, we employ a self-training strategy in which a simulation-trained teacher generates pseudo-labels on real footage, ensuring the model learns realistic photometric features while retaining physical consistency.
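The self-training strategy above can be sketched as a simple teacher-student loop. This is only an illustrative outline, not the actual WaveFormer training code: `teacher`, `student_update`, `real_frames`, and `confidence` are all hypothetical stand-ins (plain callables here) for the simulation-trained teacher network, the student's optimization step, the real footage, and an optional pseudo-label filter.

```python
# Illustrative sketch of teacher-student self-training with pseudo-labels.
# All names are hypothetical; neural networks are stood in for by plain
# callables so that only the loop structure is shown.

def self_train(teacher, student_update, real_frames, confidence=lambda h: True):
    """Generate pseudo-labels on real footage with a simulation-trained
    teacher, then fit the student on the (frame, pseudo-label) pairs."""
    pseudo_labeled = []
    for frame in real_frames:
        height_map = teacher(frame)        # teacher predicts a wave-height map
        if confidence(height_map):         # optionally drop unreliable labels
            pseudo_labeled.append((frame, height_map))
    for frame, label in pseudo_labeled:
        student_update(frame, label)       # one supervised step on pseudo-labels
    return len(pseudo_labeled)

# Toy usage: "frames" are numbers, the "teacher" doubles them, and the
# "student" just records the pairs it was trained on.
seen = []
n = self_train(teacher=lambda f: 2 * f,
               student_update=lambda f, y: seen.append((f, y)),
               real_frames=[1, 2, 3])
```

In practice the confidence filter matters: keeping only pseudo-labels the teacher is sure of is a common way to stop simulation-domain errors from being amplified during self-training.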
WaveFormer turns standard ocean footage into pixel-aligned, absolute wave-height maps suitable for marine robotics. We evaluate our method against state-of-the-art estimators and stereo-vision benchmarks, and show that it significantly outperforms existing solutions in reconstruction accuracy and temporal coherence, offering a scalable, hardware-light alternative for dynamic surface photogrammetry.
We demonstrate the zero-shot generalization capability of WaveFormer. Applied directly to unseen real-world footage without any fine-tuning, our method achieves high-fidelity depth reconstruction (Left) and robust semantic segmentation (Right).
Recovering wave geometry in a zero-shot manner.
Robustly segmenting complex real-world scenes.
We compare our method against WASS (a state-of-the-art stereo-based pipeline). Crucially, WASS requires a stereo rig with wide baselines, whereas WaveFormer uses only a single monocular camera. As shown below, our method recovers comparable geometric details and wave structures without the need for complex hardware calibration.
Comparisons on the WASS benchmark. (Left: Ours [Monocular], Right: WASS [Stereo])