Estimating the pose of a person from a single monocular frame is a challenging task due to many confounding factors such as perspective projection, the variability of lighting and clothing, self-occlusion, occlusion by objects, and the simultaneous presence of multiple interacting people. Nevertheless, the performance of human pose estimation algorithms has recently improved dramatically, thanks to the development of suitable deep architectures and the availability of well-annotated image datasets, such as MPII Human Pose and COCO.
There is broad consensus that performance is saturated on simpler single-person datasets (LSP, FLIC), and researchers' focus is shifting towards less constrained and more challenging datasets, where images may contain multiple instances of people and a variable number of body parts (or keypoints) are visible.
However, evaluation is challenging: more complex datasets make it harder to benchmark algorithms due to the many sources of error that may affect performance, and existing metrics, such as Average Precision (AP) or mean Percentage of Correct Parts (mPCP), hide the underlying causes of error and are not sufficient for truly understanding the behaviour of algorithms.
We study the errors occurring in multi-instance pose estimation, and how they are affected by the physical characteristics of the portrayed people. We build upon currently adopted evaluation metrics and provide the tools for a fine-grained description of performance, which makes it possible to quantify the impact of different types of error at a single glance. The fine-grained Precision-Recall curves are obtained by fixing an OKS threshold and evaluating the performance of an algorithm after progressively correcting its detections.
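As a concrete reference for the matching criterion mentioned above, the Object Keypoint Similarity (OKS) between a detection and a ground-truth instance can be sketched as follows. This is a minimal NumPy sketch following the COCO definition (distances normalized by object scale and per-keypoint constants, averaged over labeled keypoints); `compute_oks` and its argument names are illustrative, not the released code:

```python
import numpy as np

def compute_oks(gt_kpts, dt_kpts, area, sigmas):
    """Object Keypoint Similarity between one ground-truth and one
    detected instance.

    gt_kpts: (K, 3) array of x, y, visibility for the ground truth.
    dt_kpts: (K, 3) array of x, y, score for the detection (score unused).
    area:    ground-truth segment area (the scale s^2 in the OKS formula).
    sigmas:  (K,) per-keypoint constants, e.g. the COCO values.
    """
    gt = np.asarray(gt_kpts, dtype=float)
    dt = np.asarray(dt_kpts, dtype=float)
    vis = gt[:, 2] > 0                    # only labeled keypoints count
    if not vis.any():
        return 0.0
    d2 = (gt[:, 0] - dt[:, 0]) ** 2 + (gt[:, 1] - dt[:, 1]) ** 2
    k2 = (2.0 * np.asarray(sigmas)) ** 2  # k_i = 2*sigma_i, as in COCOeval
    e = d2 / (2.0 * area * k2 + np.spacing(1))
    return float(np.exp(-e)[vis].mean())
```

A detection that coincides with the ground truth yields an OKS of 1; fixing a threshold on this value (e.g. 0.75) determines which detections count as matches when the PR curves are computed.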
Our goal is to propose a principled method for analyzing pose algorithms' performance. Specifically our main contributions are:
A taxonomy of the types of error that are typical of the multi-instance pose estimation framework: Miss, Swap, Inversion, Jitter, Scoring, Background False Positive, and False Negative.
An analysis of state-of-the-art methods for multi-instance pose estimation, showing that, despite design differences, they display similar error patterns and tend to fail in the presence of high occlusion and crowding.
The finding that about 25% of the predicted keypoints have localization errors, and that miss errors have the largest impact on performance.
The finding that instance scoring is key to performance, and the suggestion that it may be improved by learning to regress the matching score.
An analysis of the most widely adopted keypoint dataset (COCO), showing that occlusion and crowding affect performance more than the size of instances, and that these two hard cases are not sufficiently represented.
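The localization errors in the taxonomy above can be illustrated with a small sketch that labels a single predicted keypoint by its per-keypoint similarity (the single-keypoint analogue of OKS) to the correct annotation, its left/right counterpart, and the same keypoint on other people. The thresholds 0.85 and 0.5 follow the paper's analysis; `classify_error`, its arguments, and the helper `ks` are hypothetical names, not the released API:

```python
import numpy as np

def ks(d2, area, sigma):
    """Per-keypoint similarity: exp(-d^2 / (2 s^2 k^2)) with k = 2*sigma."""
    return float(np.exp(-d2 / (2.0 * area * (2.0 * sigma) ** 2)))

def classify_error(dt_xy, gt_same, gt_flip, gt_others, area, sigma,
                   hi=0.85, lo=0.5):
    """Label one predicted keypoint as Good / Jitter / Inversion / Swap / Miss.

    gt_same:   (x, y) of the annotated keypoint the prediction should hit.
    gt_flip:   (x, y) of the left/right counterpart on the same person.
    gt_others: list of (x, y, area) for the same keypoint on other people.
    hi, lo:    similarity thresholds (0.85 / 0.5 in the paper's analysis).
    """
    d2 = lambda p, q: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    s = ks(d2(dt_xy, gt_same), area, sigma)
    if s >= hi:
        return 'Good'
    if s >= lo:
        return 'Jitter'       # small displacement near the correct location
    if ks(d2(dt_xy, gt_flip), area, sigma) >= lo:
        return 'Inversion'    # left/right confusion on the same person
    for ox, oy, oarea in gt_others:
        if ks(d2(dt_xy, (ox, oy)), oarea, sigma) >= lo:
            return 'Swap'     # hit the same body part of another person
    return 'Miss'             # far from every candidate keypoint
```

Scoring errors, Background False Positives, and False Negatives are instance-level rather than keypoint-level, so they are diagnosed from the detection-to-ground-truth assignment instead of from individual keypoints.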
Our analysis extends beyond humans, to any object category where the location of parts is estimated along with detections, and to situations where cluttered scenes may contain multiple object instances. This is common in fine-grained categorization, or animal behavior analysis, where part alignment is often crucial.
We provide the IDs of all the COCO training-set ground-truth instances contained in the benchmarks defined in our analysis, which studies how the physical characteristics of the portrayed people, such as the number of visible keypoints (occlusion), the amount of overlap between instances (crowding), and size, affect the errors and performance of algorithms.
We will keep an up-to-date list of evaluation results for the most recent multi-instance pose estimation algorithms. Use the contact form if you would like your method's analysis published.
We provide a Python implementation of our evaluation methodology. The GitHub repository contains (i) the COCOanalyze class, a wrapper of the COCOeval class for extended keypoint error estimation analysis, and (ii) an API for creating an automatic performance report.
If you find our paper or the released data or code useful to your work, please cite:
@inproceedings{DBLP:conf/iccv/RonchiP17,
  author    = {Matteo Ruggero Ronchi and Pietro Perona},
  title     = {Benchmarking and Error Diagnosis in Multi-instance Pose Estimation},
  booktitle = {IEEE International Conference on Computer Vision, {ICCV} 2017, Venice, Italy, October 22-29, 2017},
  pages     = {369--378},
  year      = {2017},
  crossref  = {DBLP:conf/iccv/2017},
  url       = {https://doi.org/10.1109/ICCV.2017.48},
  doi       = {10.1109/ICCV.2017.48},
  timestamp = {Thu, 11 Jan 2018 13:21:37 +0100},
  biburl    = {https://dblp.org/rec/bib/conf/iccv/RonchiP17},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}