3D Gaussian Splatting (3DGS) has emerged as a powerful representation due to its efficiency and high-fidelity rendering. However, 3DGS training requires a known camera pose for each input view, typically obtained by Structure-from-Motion (SfM) pipelines. Pioneering works have attempted to relax this restriction but still face difficulties when handling long sequences with complex camera trajectories. In this work, we propose Rob-GS, a robust framework that progressively estimates camera poses and optimizes 3DGS for arbitrarily long video sequences. Leveraging the inherent continuity of videos, we design an adjacent pose tracking method to ensure stable pose estimation between consecutive frames. To handle arbitrarily long inputs, we adopt a "divide and conquer" scheme that adaptively splits the video sequence into several segments and optimizes them separately. Extensive experiments on the Tanks and Temples dataset and our collected real-world dataset show that Rob-GS outperforms state-of-the-art methods.