PointHPS: Cascaded 3D Human Pose and
Shape Estimation from Point Clouds

¹S-Lab, Nanyang Technological University, ²SenseTime Research
*equal contributions, corresponding author


Human pose and shape estimation (HPS) has attracted increasing attention in recent years. While most existing studies focus on HPS from 2D images or videos, which suffer from inherent depth ambiguity, there is a surging need to investigate HPS from 3D point clouds, as depth sensors are now frequently employed in commercial devices. However, real-world sensory 3D points are usually noisy and incomplete, and human bodies can take highly diverse poses. To tackle these challenges, we propose a principled framework, PointHPS, for accurate 3D HPS from point clouds captured in real-world settings, which iteratively refines point features through a cascaded architecture. Specifically, each stage of PointHPS performs a series of downsampling and upsampling operations to extract and collate both local and global cues, which are further enhanced by two novel modules: 1) Cross-stage Feature Fusion (CFF) for multi-scale feature propagation that allows information to flow effectively through the stages, and 2) Intermediate Feature Enhancement (IFE) for body-aware feature aggregation that improves feature quality after each stage. Notably, previous benchmarks for HPS from point clouds consist of synthetic data with over-simplified settings (e.g., SURREAL) or real data with limited diversity (e.g., MHAD). To facilitate a comprehensive study under various scenarios, we conduct our experiments on two large-scale benchmarks, comprising i) a dataset that features diverse subjects and actions captured by real commercial sensors in a laboratory environment, and ii) controlled synthetic data generated with realistic considerations such as clothed humans in crowded outdoor scenes. Extensive experiments demonstrate that PointHPS, with its powerful point feature extraction and processing scheme, outperforms state-of-the-art methods by significant margins across the board. Ablation studies validate the effectiveness of the cascaded architecture, powered by CFF and IFE.
The pretrained models, code, and data will be publicly available to facilitate future investigation in HPS from point clouds.
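
The downsample-then-upsample pattern and the stage-by-stage refinement described in the abstract can be illustrated with a toy sketch. This is not the PointHPS implementation: the function names, the scalar per-point features, the mean pooling, and the 0.5 blend between stages are all assumptions made for illustration, and the learned layers, CFF, and IFE are not modeled at all.

```python
import math
import random


def farthest_point_sampling(points, k):
    """Pick k indices (k <= len(points)) so the chosen points are spread out."""
    chosen = [0]  # start from the first point; an arbitrary choice
    dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=dist.__getitem__)
        chosen.append(nxt)
        # Track each point's distance to its closest already-chosen point.
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return chosen


def nearest_index(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))


def toy_stage(points, feats, k):
    """One hypothetical stage: downsample, pool features, upsample back."""
    coarse_pts = [points[i] for i in farthest_point_sampling(points, k)]
    assign = [nearest_index(p, coarse_pts) for p in points]
    # Downsampling: average the features of all points assigned to a center.
    sums, counts = [0.0] * k, [0] * k
    for j, f in zip(assign, feats):
        sums[j] += f
        counts[j] += 1
    coarse_feats = [s / max(c, 1) for s, c in zip(sums, counts)]
    # Upsampling: copy each center's pooled feature back to its dense points.
    return [coarse_feats[j] for j in assign]


def toy_cascade(points, feats, num_stages=3, k=8):
    """Refine features over several stages.

    The 0.5 blend is only a stand-in for passing information between
    stages; it is an assumption, not the paper's CFF module.
    """
    for _ in range(num_stages):
        refined = toy_stage(points, feats, k)
        feats = [0.5 * (a + b) for a, b in zip(feats, refined)]
    return feats


random.seed(0)
cloud = [(random.random(), random.random(), random.random()) for _ in range(64)]
features = [p[2] for p in cloud]  # toy scalar feature: the z coordinate
refined = toy_cascade(cloud, features)
print(len(refined))
```

Each stage mixes a point's own feature with a summary of its neighborhood, so repeated stages spread information over progressively larger regions of the cloud. A real network would replace the mean pooling and the fixed blend with learned operations.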



        title   =   {PointHPS: Cascaded 3D Human Pose and Shape Estimation from Point Clouds},
        author  =   {Cai, Zhongang and Pan, Liang and Wei, Chen and Yin, Wanqi and Hong, Fangzhou and Zhang, Mingyuan and Loy, Chen Change and Yang, Lei and Liu, Ziwei},
        year    =   {2023},
        journal =   {arXiv preprint arXiv:2308.14492}


This study is supported under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s). The project is also supported by NTU NAP and Singapore MOE AcRF Tier 2 (MOET2EP20221-0012).

We referred to the project page of ProPainter when creating this project page.

More Fantastic Works on 3D Virtual Humans 🔥

Motion Generation

(Coming Soon) FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing

ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model

MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model

Bailando: 3D Dance Generation by Actor-Critic GPT with Choreographic Memory

3D Human Generation

EVA3D: Compositional 3D Human Generation from 2D Image Collections

AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars


SynBody: Synthetic Dataset with Layered Human Models for 3D Human Perception and Modeling

HuMMan: Multi-Modal 4D Human Dataset for Versatile Sensing and Modeling

GTA-Human: Playing for 3D Human Recovery

Human Segmentation

Human3D: 3D Segmentation of Humans in Point Clouds with Synthetic Data