Summary

This IBM Developer code pattern explains how to detect human poses in a given image by using the Human Pose Estimator model from the Model Asset eXchange that is hosted on the Machine Learning eXchange. Using coordinates, the pose lines that are created by the model are assembled into full body poses for each of the humans that are detected in the image.

Description

The Human Pose Estimator model detects humans and their poses in a given image. The model first detects the humans in the input image and then identifies their body parts, including the nose, neck, eyes, shoulders, elbows, wrists, hips, knees, and ankles. Next, each pair of associated body parts is connected by a “pose line,” as shown in the following image. For example, one line might connect the left eye to the nose, while another might connect the nose to the neck.

Each pose line is represented by a list [x1, y1, x2, y2], where the first pair of coordinates (x1, y1) is the starting
point of the line for one body part, while the second pair of coordinates (x2, y2) is the ending point of the line for the
other associated body part. The pose lines are assembled into full body poses for each of the humans detected in the
image.
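
As a minimal sketch of the representation described above, the following Python splits each [x1, y1, x2, y2] list into its two endpoints and groups one person's lines into a map of connected joints. The function names and the sample coordinates are hypothetical and chosen only for illustration; they are not taken from the model's code.

```python
from collections import defaultdict


def split_pose_line(line):
    """Split [x1, y1, x2, y2] into (start, end) coordinate pairs."""
    if len(line) != 4:
        raise ValueError(f"expected 4 coordinates, got {len(line)}")
    x1, y1, x2, y2 = line
    return (x1, y1), (x2, y2)


def assemble_pose(lines):
    """Build a joint adjacency map from one human's pose lines.

    Each key is a joint position; each value lists the joints it connects to.
    """
    pose = defaultdict(list)
    for line in lines:
        start, end = split_pose_line(line)
        pose[start].append(end)
        pose[end].append(start)
    return dict(pose)


# Hypothetical pose lines for one person: eye to nose, then nose to neck.
example_lines = [[110, 40, 100, 50], [100, 50, 100, 80]]
example_pose = assemble_pose(example_lines)
```

Because the two lines share the point (100, 50), that joint ends up connected to both of the others, which is how separate lines join into a single body pose.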

The model is based on the TensorFlow implementation of the OpenPose model. The code in this repository deploys the model as a web service in a Docker container.

Yogait is a yoga assistant that uses the Human Pose Estimator MAX Model to guess which yoga pose a user is performing. It classifies poses with a pre-trained support vector machine (SVM). Instead of using the Cartesian lines that the MAX model returns, Yogait uses a polar representation, which makes the poses much easier to classify. Training the SVM on an x-y coordinate system would require translation and rotation when augmenting data, whereas the polar representation relies only on the location of the joints relative to the center of the estimated model.


The [x, y] coordinates of each joint are converted to polar coordinates [phi, rho], where phi is the angle and rho is the distance from the center.

The SVM performs classification on a flattened version of the polar vectors. Compared to a Cartesian representation, this polar representation needs little data and can classify a human in any part of a captured frame. If the Cartesian representation were used, you would have to perform all of the poses in the center of the camera frame.
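
The conversion and flattening can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Yogait's actual code: the center is assumed to be the centroid of the joints, the function names are hypothetical, and the joint coordinates are made up.

```python
import math


def to_polar(joints):
    """Convert [x, y] joints to [phi, rho] pairs relative to their centroid.

    Assumption: the "center" is taken to be the mean of all joint positions.
    """
    cx = sum(x for x, _ in joints) / len(joints)
    cy = sum(y for _, y in joints) / len(joints)
    polar = []
    for x, y in joints:
        dx, dy = x - cx, y - cy
        # phi is the angle in radians, rho is the distance from the center.
        polar.append([math.atan2(dy, dx), math.hypot(dx, dy)])
    return polar


def flatten(polar):
    """Flatten [[phi, rho], ...] into one feature vector for a classifier."""
    return [value for pair in polar for value in pair]


# Hypothetical joints, and the same pose moved elsewhere in the frame.
joints = [[100, 50], [100, 80], [80, 85], [120, 85]]
shifted = [[x + 200, y + 30] for x, y in joints]

features = flatten(to_polar(joints))
shifted_features = flatten(to_polar(shifted))
```

Both feature vectors come out the same because every joint is measured from the center of the pose rather than from the corner of the frame, which is why the person does not need to stand in the middle of the camera view.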

When you have completed the code pattern, you’ll understand how to:

Build a Docker image of the Human Pose Estimator MAX Model
Deploy a deep learning model with a REST endpoint
Generate a pose estimation for a person in a frame of video using the MAX Model’s REST API
Run a web application that uses the model’s REST API

Flow

The server sends the captured video frame-by-frame from the webcam to the model API.
The web UI requests the pose lines estimated for the frame from the server.
The server receives the data from the model API and sends the updated result to the web UI.
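
The server's side of this flow might look like the following Python sketch, which posts one encoded frame to the model API and pulls the pose lines out of the JSON reply. The endpoint URL, the form field name, and the response keys (predictions, pose_lines, line) are assumptions made for illustration; confirm them against the README or the model's API documentation before relying on them.

```python
import json
import urllib.request
import uuid

# Assumed values; verify against the model's README or API docs.
MODEL_URL = "http://localhost:5000/model/predict"
FILE_FIELD = "file"


def build_multipart(field, filename, data, content_type="image/jpeg"):
    """Encode one file as multipart/form-data; returns (body, header value)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def extract_pose_lines(response):
    """Return one list of [x1, y1, x2, y2] lines per detected human.

    Assumes the reply looks like
    {"predictions": [{"pose_lines": [{"line": [...]}, ...]}, ...]}.
    """
    return [
        [entry["line"] for entry in human.get("pose_lines", [])]
        for human in response.get("predictions", [])
    ]


def estimate_poses(frame_bytes, url=MODEL_URL):
    """POST one encoded frame to the model API and return its pose lines."""
    body, header = build_multipart(FILE_FIELD, "frame.jpg", frame_bytes)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": header}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=10) as reply:
        return extract_pose_lines(json.load(reply))
```

Only estimate_poses touches the network, so the encoding and parsing helpers can be checked without a running container.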

Instructions

Find the detailed steps for this pattern in the README. Those steps show you how to:

Set up the MAX model.
Start the web app.
Run locally with a Python script.
Use a Jupyter Notebook.
