Reconstructing 3D animal poses from single images

Animal behaviours are vastly diverse, from the dexterous leg movements of a fruit fly foraging for food to the acrobatic swing of a macaque between trees. To understand the neural basis of these behaviours, neuroscientists have long pushed the frontiers of describing these motions with increasing fidelity. Early approaches relied on techniques such as cyclograms, long-exposure photographs taken under stroboscopic lighting. Technology has since progressed immensely, and modern neuroscience increasingly relies on 3-dimensional pose tracking, i.e., following the coordinates of a set of relevant body parts over time. 3D poses provide a complete description of a movement: from them, the angle of any joint can be computed and any other descriptor can be unambiguously derived. In addition, 3D poses are becoming increasingly relevant because they bridge the gap between the converging advances in biomechanics and robotics.

3D pose tracking has high hardware requirements

Despite the advantages of 3D poses, obtaining them has required considerable effort and specialised hardware setups with multiple synchronised cameras keeping the moving animal in focus. This is because previous techniques for obtaining 3D poses relied on triangulation, a method of inferring the 3D coordinates of a point from its 2D locations in multiple camera views. A line drawn through a point in 3D space and the focal point of a camera has a unique projection on the camera plane (sensor), and two (non-coincident) lines can have at most one intersection. It follows that triangulating a point requires at least two cameras and camera calibration, in order to know the relative orientation of the cameras. To ensure that all points of interest on a moving animal are in focus of at least two, but preferably more, cameras for increased accuracy, state-of-the-art fruit fly and rodent studies have used six cameras, while a recent study in macaques used as many as 62 cameras!
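To make the triangulation step concrete, here is a minimal sketch of the classic direct linear transform (DLT) for two views. It assumes each camera's 3x4 projection matrix is already known from calibration; the function and variable names are illustrative, not taken from any particular toolbox.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Triangulate one 3D point from its 2D projections in two views.

    P1, P2 : (3, 4) camera projection matrices, obtained via calibration.
    x1, x2 : (2,) pixel coordinates of the point in each view.
    """
    # Each view contributes two linear constraints on the homogeneous
    # 3D point X, e.g. x1[0] * (P1[2] @ X) - P1[0] @ X = 0 (the DLT method).
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # Solve A @ X = 0 in the least-squares sense via SVD.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # de-homogenise

```

Adding more cameras simply appends two more rows to the linear system per view, which is why extra views improve accuracy; but the point remains unrecoverable unless at least two cameras see it.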

3D pose estimation has required many synchronised cameras in the past. The example shows recordings of tethered fruit flies using six cameras.

Such hardware requirements pose substantial challenges, particularly when studying animal behaviour in naturalistic environments, where the animal can be in any position relative to the cameras. As a result, some joints may be self-occluded (not visible from a given camera), which can mean that there are not enough camera views to perform triangulation. Therefore, many previous studies have used 2D poses or other metrics such as gait diagrams. However, because these descriptors do not allow the unambiguous inference of 3D poses, they often lead to uncertainties in kinematic or behavioural analyses.

Lifting 2D poses to 3D poses

Our approach, LiftPose3D, relies on a different technique called lifting, which surmounts the challenges associated with 3D pose estimation by reconstructing 3D poses directly from 2D poses. At first sight this may sound like a contradiction: if a 2D pose can correspond to multiple 3D poses, how can the 3D pose be reconstructed from a given 2D pose? The contradiction is resolved by realising that the pose repertoire of animals covers only a fraction of all possible poses. Indeed, owing to limits in the range of motion of joints and the fact that animals seek efficient ways to coordinate their joints, the position of any limb is in a strict geometric relationship with the other limbs. LiftPose3D uses a deep neural network to learn these geometric relationships between keypoints from a library of typically used 3D poses. Effectively, the network acts as a regressor which, for a given 2D pose input, infers the most likely 3D pose in the pose library. The network is trained on an extensive array of 2D-3D pose pairs that cover the typical pose repertoire of the animal during its movement.

The deep neural architecture used by LiftPose3D. For details, click here.
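For readers who prefer code to diagrams, the following is a simplified PyTorch sketch of such a lifting regressor, in the spirit of the fully connected residual design the figure depicts (which builds on Martinez et al., 2017). The layer width, dropout rate, and number of keypoints are illustrative assumptions, not the exact published configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two fully connected layers with a skip connection."""
    def __init__(self, width=1024, p_drop=0.5):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(width, width), nn.BatchNorm1d(width),
            nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(width, width), nn.BatchNorm1d(width),
            nn.ReLU(), nn.Dropout(p_drop),
        )

    def forward(self, x):
        return x + self.block(x)  # residual (skip) connection

class Lifter(nn.Module):
    """Regress a 3D pose (n_joints x 3) from a 2D pose (n_joints x 2)."""
    def __init__(self, n_joints=38, width=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n_joints, width),  # input layer
            ResidualBlock(width),
            ResidualBlock(width),
            nn.Linear(width, 3 * n_joints),  # output layer
        )

    def forward(self, pose2d):
        # pose2d: (batch, n_joints * 2), flattened 2D coordinates
        return self.net(pose2d)
```

With this sketch, the input layer, the two linear layers in each of the two residual blocks, and the output layer give six linear layers in total, consistent with the shallow design described below.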

Although the neural network that we use is relatively shallow (it contains only 6 layers in total), it still requires a substantial amount of data to be trained, which is often not available to experimentalists. LiftPose3D's unique advantages lie in the data augmentation techniques it uses, which can overcome the training-data challenges often faced in laboratory studies across a wide range of animals, camera angles, experimental systems, and behaviours.
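One augmentation strategy of this kind, sketched below under strong assumptions, is to generate additional 2D-3D training pairs by projecting each ground-truth 3D pose from randomly perturbed virtual viewpoints. The rotation ranges and the orthographic projection here are illustrative simplifications, not the exact published procedure.

```python
import numpy as np

def augment_pair(pose3d, n_views=10, max_tilt_deg=10.0, rng=np.random):
    """pose3d: (n_joints, 3) array. Returns a list of (pose2d, pose3d) pairs."""
    pairs = []
    for _ in range(n_views):
        yaw = rng.uniform(0, 2 * np.pi)  # the animal's heading is arbitrary
        tilt = np.deg2rad(rng.uniform(-max_tilt_deg, max_tilt_deg))
        Rz = np.array([[np.cos(yaw), -np.sin(yaw), 0],
                       [np.sin(yaw),  np.cos(yaw), 0],
                       [0,            0,           1]])
        Rx = np.array([[1, 0,             0],
                       [0, np.cos(tilt), -np.sin(tilt)],
                       [0, np.sin(tilt),  np.cos(tilt)]])
        rotated = pose3d @ (Rx @ Rz).T   # rotate the whole 3D pose
        pose2d = rotated[:, :2]          # orthographic projection to the x-y plane
        pairs.append((pose2d, rotated))
    return pairs
```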

Our study demonstrated in fruit flies and macaques that a library of 3D poses can be used to train our network to lift 2D poses from a single camera, without having to know the camera positioning. Consequently, no camera calibration is required, and pre-trained LiftPose3D networks can be used across laboratories and datasets. Strikingly, the accuracy of the lifted 3D poses is almost as good as that of triangulation.

Obtaining 3D poses in tethered Drosophila. (left) Tethering makes it easy to keep the animal in focus of multiple synchronised cameras. (right) Since multiple viewpoints are available for each joint, their positions can be triangulated.

Occlusions

As mentioned earlier, occlusions can be very common in freely moving animals because some body parts may be hidden from some cameras due to obstacles or simply the position of the animal. These conditions often make triangulation impossible for occluded joints, so only incomplete 3D poses can be obtained. The challenge here is that, with such incomplete 3D poses, it is not clear how to train a LiftPose3D network that can infer complete 3D poses from 2D poses.

We found that by aligning the animals in the same coordinate frame and replacing the unknown coordinates with zeros, the network could be efficiently trained to predict full 3D poses. Alignment removes the uncertainty regarding the animal's position in 3D space, so the network only needs to learn the geometric relationships between keypoints in animal poses. Further, replacing the unseen coordinates with zeros means that occluded points do not bias the network training. We also showed in fruit flies, mice, and rats that our network can overcome occlusions and outliers in training data. A minimal sketch of this recipe appears below.
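In the sketch, with illustrative function names and PyTorch, poses are centred on a root keypoint to place all animals in a common frame, occluded coordinates are zeroed, and the training loss is masked so that joints without ground truth contribute no gradient.

```python
import torch

def preprocess(pose2d, visible, root=0):
    """pose2d: (n_joints, 2) tensor; visible: (n_joints,) boolean mask."""
    aligned = pose2d - pose2d[root]  # centre on a root keypoint
    aligned[~visible] = 0.0          # zero out occluded joints
    return aligned

def masked_loss(pred3d, target3d, visible3d):
    """Squared error summed per joint, averaged over visible joints only."""
    mask = visible3d.unsqueeze(-1).float()   # (n_joints, 1), broadcast to xyz
    err = (pred3d - target3d) ** 2 * mask    # occluded joints contribute zero
    return err.sum() / mask.sum().clamp(min=1)
```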

Training the network with incomplete information. (left) The moving fruit fly is viewed from two angles, a bottom (ventral) view and a side (lateral) view. Depending on the orientation of the animal, some joints are not visible from the side view. In the video, line segments are drawn whenever the 2D pose detector could detect the leg segment. (right) Using the ventral view, where all joints are visible, LiftPose3D could predict the 3D position of the joints, even those that are not visible from the side view (dashed line). The predictions correspond closely to the true positions (solid line) when these are known.

Adapting a LiftPose3D network to new experiments

Finally, when 3D pose information is not available, for example in experiments where only 2D data have been collected using a single camera, LiftPose3D may still be used to obtain 3D poses. At first sight this scenario may not seem different from the above, because it simply involves predicting 3D poses from 2D poses. Naively, a network could be trained on another dataset where 3D poses are available and applied directly to the new 2D poses. However, different experiments can involve different animals with varying body proportions and different cameras with different hardware-related distortions. It is therefore unlikely that this naive approach will succeed, because the network may not be able to interpret 2D poses given this unknown variability. Even if it could, it is unlikely to yield 3D poses with the desired accuracy.

To account for the variability in the new dataset, we used pre-trained networks in combination with domain adaptation to generalise networks across datasets. In other words, we initially proceeded with the naive approach by training a network on another dataset where 3D poses were available. However, rather than applying the network directly to the 2D poses in the new dataset, we deformed these 2D poses to 'look like' those in the dataset on which the network was trained. We found that this approach could effectively remove small differences between datasets and lift realistic 3D poses.
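As an illustration, one very simple form of such a deformation is to re-centre and rescale the new dataset's 2D poses so that their overall scale matches that of the training dataset; the actual adaptation can be more sophisticated, so treat this as a toy sketch with illustrative names.

```python
import numpy as np

def adapt_2d_poses(poses_new, poses_train):
    """poses_*: (n_samples, n_joints, 2) arrays of 2D poses."""
    def centre_and_scale(p):
        centred = p - p.mean(axis=1, keepdims=True)          # remove position
        scale = np.linalg.norm(centred, axis=(1, 2)).mean()  # typical pose size
        return centred, scale

    centred_new, scale_new = centre_and_scale(poses_new)
    _, scale_train = centre_and_scale(poses_train)
    # Deform the new poses to 'look like' the training distribution.
    return centred_new * (scale_train / scale_new)
```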

It has never been easier to obtain 3D poses of animals! The following image sequence shows images obtained using an open-source setup we call the LiftPose3D station, which we built for less than $150. We provide the hardware list, build instructions, a 2D pose network, and a LiftPose3D network for your next Drosophila project!

Reconstructing 3D poses of Drosophila in the LiftPose3D station. (left) Ventral video of a walking Drosophila. (middle) Cropped and aligned video superimposed with 2D poses annotated by a trained DeepLabCut network. (right) 3D poses reconstructed by LiftPose3D from the ventral 2D poses.

References

Gosztolai, A., Günel, S., Rios, V.L., Abrate, M.P., Morales, D., Rhodin, H., Fua, P. and Ramdya, P. LiftPose3D, a deep learning-based approach for transforming 2D to 3D poses in laboratory animals. Nature Methods 18, 975–981 (2021). [article] [code] [video summary]