Gallant
Voxel Grid-based Humanoid Locomotion and Local-navigation across 3D Constrained Terrains
Qingwei Ben*, Botian Xu*, Kailin Li*, Feiyu Jia, Wentao Zhang, Jingping Wang, Jingbo Wang, Dahua Lin, Jiangmiao Pang (*: equal contribution)
Shanghai Artificial Intelligence Laboratory; The Chinese University of Hong Kong; University of Science and Technology of China; University of Tokyo; Shanghai Jiao Tong University

Gallant is a voxel-grid-based framework for humanoid locomotion and local navigation in 3D constrained terrains.

Abstract.
Robust humanoid locomotion requires accurate and globally consistent perception of the surrounding 3D environment. However, existing perception modules, mainly based on depth images or elevation maps, offer only partial and locally flattened views of the environment and fail to capture its full 3D structure. This paper presents Gallant, a voxel-grid-based framework for humanoid locomotion and local navigation in 3D constrained terrains. It uses voxelized LiDAR data as a lightweight and structured perceptual representation, and employs a z-grouped 2D CNN to map this representation to the control policy, enabling fully end-to-end optimization. A high-fidelity LiDAR simulation that dynamically generates realistic observations is developed to support scalable, LiDAR-based training and ensure sim-to-real consistency. Experimental results show that Gallant's broader perceptual coverage allows a single policy to go beyond previous methods, which were confined to ground-level obstacles, and to handle lateral clutter, overhead constraints, multi-level structures, and narrow passages. Through improved end-to-end optimization, Gallant also achieves, for the first time, near-100% success rates in challenging scenarios such as stair climbing and stepping onto elevated platforms.
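To make the "voxelized LiDAR data" representation concrete, the following is a minimal sketch of turning a point cloud into a binary occupancy grid. It is an illustration only, not the authors' implementation: the function names `voxelize` and `to_dense`, the grid origin, the voxel size, and the grid dimensions are all hypothetical choices made for this example.

```python
import math

def voxelize(points, origin, voxel_size, dims):
    """Return the set of occupied (ix, iy, iz) cells for points inside the grid.

    points:     iterable of (x, y, z) coordinates, e.g. in the robot frame.
    origin:     (x, y, z) of the grid's minimum corner (assumed value).
    voxel_size: edge length of one cubic voxel (assumed value).
    dims:       (nx, ny, nz) number of cells along each axis (assumed value).
    """
    occupied = set()
    for p in points:
        idx = tuple(math.floor((p[k] - origin[k]) / voxel_size) for k in range(3))
        # Points that fall outside the grid are simply dropped.
        if all(0 <= idx[k] < dims[k] for k in range(3)):
            occupied.add(idx)
    return occupied

def to_dense(occupied, dims):
    """Expand the occupied set into a dense binary grid indexed as grid[iz][iy][ix]."""
    nx, ny, nz = dims
    grid = [[[0] * nx for _ in range(ny)] for _ in range(nz)]
    for ix, iy, iz in occupied:
        grid[iz][iy][ix] = 1
    return grid

# Example: two nearby points share a cell, one point is out of range.
points = [(0.1, 0.1, 0.1), (0.2, 0.2, 0.2), (0.6, -0.9, 1.9), (5.0, 0.0, 0.0)]
cells = voxelize(points, origin=(-1.0, -1.0, 0.0), voxel_size=0.25, dims=(8, 8, 8))
dense = to_dense(cells, (8, 8, 8))
```

Indexing the dense grid with z first keeps each height slice as a separate 2D layer, which is convenient if the slices are later fed to a 2D network as channels.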
Real World Demo.
Approach.
Training in Sim

Figure 1: Method Overview. (a) Curriculum-based training over 8 representative terrains enhances generalization. (b) Realistic voxel path alignment achieved via efficient LiDAR simulation with domain-randomized latency and noise. (c) A 2D CNN-based perceptual module processes the voxel grid using the z-dimension as input channels, balancing efficiency and representation capability. (d) A latent-aware PPO policy enables zero-shot sim-to-real transfer across diverse obstacles, including ground, lateral, and overhead challenges.
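The idea in Figure 1(c) of using the z-dimension as input channels can be sketched with a single 2D filter that slides over the x-y plane while summing across all height slices. This is a pure-Python illustration under simplifying assumptions (one filter, no padding, stride 1, no nonlinearity); the function name `conv2d_z_as_channels` is hypothetical, and the actual network architecture, including how z slices are grouped, is not specified here.

```python
def conv2d_z_as_channels(grid, kernel, bias=0.0):
    """Valid 2D cross-correlation over x-y, treating z slices as channels.

    grid:   grid[c][y][x], where c indexes the z slice (channel).
    kernel: kernel[c][ky][kx], one filter spanning every channel.
    Returns a 2D feature map out[y][x].
    """
    channels, height, width = len(grid), len(grid[0]), len(grid[0][0])
    kh, kw = len(kernel[0]), len(kernel[0][0])
    out = []
    for y in range(height - kh + 1):
        row = []
        for x in range(width - kw + 1):
            acc = bias
            # Spatial sliding happens only in x-y; z is summed like colour channels.
            for c in range(channels):
                for dy in range(kh):
                    for dx in range(kw):
                        acc += kernel[c][dy][dx] * grid[c][y + dy][x + dx]
            row.append(acc)
        out.append(row)
    return out

# Example: two z slices of a 3x3 occupancy grid and a 2x2 all-ones filter.
grid = [
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],  # lower slice: fully occupied
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],  # upper slice: one occupied cell
]
kernel = [[[1, 1], [1, 1]], [[1, 1], [1, 1]]]
features = conv2d_z_as_channels(grid, kernel)
```

Compared with a 3D convolution, the filter here never slides along z, so the cost scales with the x-y extent only, while the per-channel weights still let the filter respond differently to occupancy at different heights.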

Results.
Simulation Results

Figure 2: Ablation results for Gallant. \(E_{succ}\) is the terrain traversal success rate, and \(E_{collision}\) is the average impulse of collisions during traversal. The experiments show that Gallant's design yields a higher success rate and fewer collisions.
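As a rough illustration of how the two metrics in Figure 2 could be computed from logged episodes, the sketch below takes \(E_{succ}\) as the fraction of successful traversals and \(E_{collision}\) as the total collision impulse per episode, averaged over episodes. The averaging convention, the episode format, and the function name `traversal_metrics` are assumptions made for this example; the caption does not state the exact definition.

```python
def traversal_metrics(episodes):
    """Compute (e_succ, e_collision) from a list of episodes.

    episodes: list of (succeeded, impulses) pairs, where `succeeded` is a bool
              and `impulses` is a list of collision impulses recorded during
              that traversal (assumed log format).
    """
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")
    # Fraction of episodes in which the terrain was traversed successfully.
    e_succ = sum(1 for ok, _ in episodes if ok) / n
    # Assumed convention: sum impulses within an episode, then average over episodes.
    e_collision = sum(sum(impulses) for _, impulses in episodes) / n
    return e_succ, e_collision

# Example with four logged episodes.
episodes = [(True, []), (True, [0.5]), (False, [1.0, 1.5]), (True, [])]
e_succ, e_collision = traversal_metrics(episodes)
```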

Real-world Results

Figure 3: Comparison of success rates between Gallant and baselines in real-world deployment demonstrates that Gallant's design, utilizing voxel grids and domain randomization for LiDAR, is essential.

Acknowledgements.

We thank Huayi Wang and Moji Shi for their guidance on deploying the elevation map. We are grateful to Junli Ren, Tao Huang, Zirui Wang, Weishuai Zeng, Weixiang Zhong, Xiaojie Niu, and Shunlin Lu for helpful advice and discussions during the preparation of the paper. We thank Shi Zhang, Shenghan Zhang, Quanli Xuan, Haihua Zhu, and Wei Shi for their help in modifying the robot's structure and building real-world terrains.

BibTeX
@article{ben2025gallant,
  title   = {Gallant: Voxel Grid-based Humanoid Locomotion and Local-navigation across 3D Constrained Terrains},
  author  = {Qingwei Ben and Botian Xu and Kailin Li and Feiyu Jia and Wentao Zhang and Jingping Wang and Jingbo Wang and Dahua Lin and Jiangmiao Pang},
  journal = {arXiv preprint arXiv:2511.14625},
  year    = {2025}
}