WaveFormer: Monocular Sea Surface Reconstruction

Huangjinchen Zheng1,*, Qiancheng Zhou1,*, Zhenglin Li1, Wenbo Xie1, Rui Song1, Xiaodong Yue1, Nenglun Chen2, Yan Peng1, Wenhua Zhang1,†
1Shanghai University, 2Nanjing University of Information Science and Technology

WaveFormer takes a single RGB video (Left) and simultaneously reconstructs dense 3D wave geometry (Center) and estimates a semantic segmentation mask (Right).

Abstract

We present WaveFormer, a novel framework capable of reconstructing dense, temporally consistent 3D sea surfaces from standard video captured by a monocular camera. Unlike rigid terrain, water surfaces are optically complex and lack static features; our method solves this persistent photogrammetric challenge without requiring cumbersome stereo rigs.

Our approach leverages a Spatio-Temporal Transformer to decouple and learn the predictable hydrodynamic laws governing wave motion. We observe that deep regression models inherently suffer from a smoothing bias, and propose a Wave Texture Refinement Module to dynamically recover fine-grained ripples and high-frequency geometry. To bridge the domain gap between simulation and reality, we employ a self-training strategy where a simulation-trained teacher generates pseudo-labels on real footage, ensuring the model learns realistic photometric features while retaining physical consistency.
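To make the self-training idea concrete, here is a minimal sketch (not the paper's implementation) of pseudo-label generation: a simulation-trained teacher predicts wave heights on unlabeled real frames, and a temporal-consistency proxy filters out unreliable pixels before they supervise the student. The `toy_teacher` and threshold are hypothetical stand-ins.

```python
import numpy as np

def generate_pseudo_labels(teacher, frames, consistency_thresh=0.05):
    """Run a simulation-trained teacher on unlabeled real frames and
    keep only temporally consistent predictions as pseudo-labels.

    frames: (T, H, W) grayscale video clip.
    Returns (labels, mask): per-frame height maps and a boolean mask
    marking pixels confident enough to supervise the student.
    """
    preds = np.stack([teacher(f) for f in frames])        # (T, H, W)
    # Confidence proxy: a physical wave field varies smoothly in time,
    # so large frame-to-frame jumps in predicted height are suspect.
    jumps = np.abs(np.diff(preds, axis=0))                # (T-1, H, W)
    stable = jumps < consistency_thresh
    mask = np.concatenate([stable, stable[-1:]], axis=0)  # pad last frame
    return preds, mask

# Hypothetical stand-in teacher: maps pixel brightness to height linearly.
toy_teacher = lambda frame: 0.5 * frame - 0.25

rng = np.random.default_rng(0)
clip = rng.random((4, 8, 8)).astype(np.float32)
labels, mask = generate_pseudo_labels(toy_teacher, clip)
# Only pixels where mask is True would enter the student's regression loss.
```

In practice the confidence filter could equally be a photometric or physics-based criterion; the key point is that the student only sees labels the teacher is confident about.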

We show that WaveFormer can turn standard ocean footage into pixel-aligned absolute wave height maps suitable for marine robotics. We evaluate our method against state-of-the-art estimators and stereo-vision benchmarks. We show that our method significantly outperforms existing solutions in reconstruction accuracy and temporal coherence, offering a scalable, hardware-light alternative for dynamic surface photogrammetry.
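As an illustration only (not the paper's exact post-processing), pixel-aligned wave heights can be made signed and absolute by referencing per-frame elevation maps to the clip's mean sea level, so crests are positive and troughs negative:

```python
import numpy as np

def to_wave_height(elevation_maps):
    """Reference per-frame surface elevation maps (T, H, W) to the
    clip-wise mean sea level. Hypothetical convention: crests > 0,
    troughs < 0, mean height is zero over the clip."""
    mean_level = elevation_maps.mean()
    return elevation_maps - mean_level

rng = np.random.default_rng(1)
elev = rng.random((3, 4, 4))   # placeholder predicted elevations
heights = to_wave_height(elev)
```

A zero-mean convention like this is convenient for marine robotics, where significant wave height is computed from the signed height distribution.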

Zero-shot Generalization to Real-world Scenes

We demonstrate the zero-shot generalization capability of WaveFormer. Applied directly to unseen real-world footage without any fine-tuning, our method achieves high-fidelity depth reconstruction (Left) and robust semantic segmentation (Right).

Depth Reconstruction

Recovering wave geometry in a zero-shot manner.

Sea-Sky Segmentation

Robustly segmenting complex real-world scenes.

Comparison on WASS Dataset

We compare our method against WASS (a state-of-the-art stereo-based pipeline). Crucially, WASS requires a stereo rig with wide baselines, whereas WaveFormer uses only a single monocular camera. As shown below, our method recovers comparable geometric details and wave structures without the need for complex hardware calibration.

Comparisons on the WASS benchmark. (Left: Ours [Monocular], Right: WASS [Stereo])