We present DAD-3DHeads, a dense and diverse large-scale dataset, and a robust model for accurate 3D Dense Head Alignment in-the-wild.
DAD-3DHeads contains annotations of over 3.5k landmarks that accurately represent 3D head shape compared to the ground-truth scans. The data-driven model, DAD-3DNet, trained on our dataset, learns shape, expression, and pose parameters and performs 3D head reconstruction using a FLAME mesh. The model also incorporates a landmark prediction branch to take advantage of rich supervision and co-training of multiple related tasks.
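The FLAME mesh used by DAD-3DNet deforms a template head by linear shape and expression blendshapes before pose is applied. Below is a minimal NumPy sketch of that linear 3DMM step; the template and basis matrices are random placeholders (the real FLAME model from flame.is.tue.mpg.de additionally applies joint rotations and linear blend skinning).

```python
import numpy as np

# Minimal sketch of the linear blendshape part of a FLAME-style head model.
# All tensors here are random placeholders, not the real FLAME assets.
rng = np.random.default_rng(0)
V = 5023                                        # FLAME's vertex count
template = rng.standard_normal((V, 3))          # mean head shape
shape_basis = rng.standard_normal((V, 3, 300))  # 300 shape components
expr_basis = rng.standard_normal((V, 3, 100))   # 100 expression components

def blend_shapes(beta: np.ndarray, psi: np.ndarray) -> np.ndarray:
    """Deform the template by shape (beta) and expression (psi) coefficients."""
    return template + shape_basis @ beta + expr_basis @ psi

vertices = blend_shapes(rng.standard_normal(300), rng.standard_normal(100))
print(vertices.shape)  # (5023, 3)
```

A network such as DAD-3DNet predicts the coefficient vectors (and pose), while the bases stay fixed.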
Experimentally, we show that DAD-3DNet outperforms or is comparable to the state-of-the-art models in (i) 3D Head Pose Estimation on AFLW2000-3D and BIWI, (ii) 3D Face Shape Reconstruction on NoW and Feng et al., and (iii) 3D Dense Head Alignment and 3D Landmarks Estimation on DAD-3DHeads dataset.
DAD-3DHeads is well-balanced over a wide range of poses, facial expressions, and occlusions. It enables a benchmark to study in-the-wild generalisation and robustness to distribution shifts.
These are randomly selected (not cherry-picked) samples from our dataset, displayed as (image, fitted mesh) pairs at multiple views.
DAD-3DHeads accuracy on selected samples from the NoW dataset.
First row: input image; second row: GT scan; third row: the result of our annotation; fourth row: alignment of the mesh (wireframe) and the GT scan (with color-coded errors overlaid).
The scale of the errors relates to the real-world size of the scans. Note that the resulting meshes accurately capture the coarse shape of the frontal part of the head, while the regions of higher error heavily overlap with finer facial structures.
As the DAD-3DHeads dataset is dense, it allows training different models that localize many more than the usual 68 landmarks. This flexibility saves annotation effort, because the data does not need to be relabeled every time a different setup is required. Moreover, the DAD-3DNet training pipeline allows inference on any subset of head vertices via its 3DMM prediction branch, as the vertices can be subsampled after the entire mesh is predicted (see examples of different landmark subsets in the figure on the left).
Upper row: the "face" subset, which captures the frontal part of the head without the ears.
Lower row: the "head" subset, which captures the head without the neck. The eyeballs are excluded in both.
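Subsampling landmarks from the predicted mesh amounts to indexing the full vertex array. The sketch below illustrates this; the index arrays are placeholders, as the real "face"/"head" vertex-index lists ship with the DAD-3DHeads tooling and are not reproduced here.

```python
import numpy as np

# Hypothetical vertex-index subsets (placeholders, not the official lists).
FACE_SUBSET = np.arange(0, 3000)   # e.g. frontal face without ears
HEAD_SUBSET = np.arange(0, 4500)   # e.g. head without neck

def subsample_landmarks(mesh_vertices: np.ndarray, subset: np.ndarray) -> np.ndarray:
    """Select a subset of 3D vertices from a predicted full-head mesh.

    mesh_vertices: (V, 3) array of all predicted vertex positions.
    subset: 1-D array of vertex indices to keep.
    """
    return mesh_vertices[subset]

# Toy example: a fake predicted mesh with 5023 vertices (FLAME's vertex count).
mesh = np.zeros((5023, 3))
face_landmarks = subsample_landmarks(mesh, FACE_SUBSET)
print(face_landmarks.shape)  # (3000, 3)
```

Because the subsetting happens after the full mesh is predicted, no retraining is needed to switch between landmark configurations.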
@InProceedings{dad3dheads,
author = {Martyniuk, Tetiana and Kupyn, Orest and Kurlyak, Yana and Krashenyi, Igor and Matas, Ji\v{r}{\'\i} and Sharmanska, Viktoriia},
title = {DAD-3DHeads: A Large-Scale Dense, Accurate and Diverse Dataset for 3D Head Alignment From a Single Image},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2022},
pages = {20942--20952}
}
For dataset inquiries, drop an e-mail to orest@pinatafarm.com.