Understanding Pose Inputs in the EchoMimicV2 Pipeline | Portrait Animation Series #3

This article analyses the pose input format and its role in the EchoMimicV2 (EM2) pipeline. In the previous post we set customised pose inputs; now we want to dive in and gain a better understanding of this data and its role in the pipeline.
Customised Pose Inputs in EchoMimicV2 | Portrait Animation Series #2 (heyulong3d.medium.com)
The EchoMimicV2 paper [1] does not lock itself to a specific keypoint detector. It only mentions that a keypoint map needs to be passed into the Pose Encoder.
Besides, we also integrate a Pose Encoder Ep to extract keypoint maps.
However, the code in the repository [2] appears to depend on DWPose.

Although we can change the pose inputs, the question here is: how do they work in the pipeline?
The key observation is that EM2 passes pose data into the encoder as keypoint images/maps rather than raw keypoint data. This was a little surprising to me. I expected it to be costly, but the system benefits from this design, and the pose encoder is not the bottleneck.
· .npy to pose image (per frame)
∘ Entrance point
∘ Simple optimisation
· The pose sequence goes to the Pose Encoder
· Back to NPY Format
∘ Data dict
∘ Index ordering
· References
.npy to pose image (per frame)
Each *.npy file is a dict with hands, hands_score, bodies, faces, faces_score, num and draw_pose_params. You can use the script below to load the data from *.npy files.
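Here is a minimal loading sketch; the path assumes the demo pose folder from the repo's assets, so adjust it to your own frames if needed:

import numpy as np

npy_path = "assets/halfbody_demo/pose/01/100.npy"  # demo frame; adjust to your data
pose = np.load(npy_path, allow_pickle=True).item()  # 0-d object array -> dict

print(sorted(pose.keys()))
# ['bodies', 'draw_pose_params', 'faces', 'faces_score', 'hands', 'hands_score', 'num']
print(pose["hands"].shape, pose["hands_score"].shape)  # (2, 21, 2), (2, 21)
print(pose["draw_pose_params"])                        # [imh_new, imw_new, rb, re, cb, ce]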

Entrance point
When you run app.py, the "Generate video" action invokes a function called generate. It initialises all the models and reads the arguments from the Web page. It then loads those .npy files, one file per frame, and calls src/utils/dwpose_util.py::draw_pose_select_v2(...).
# app.py
# ...
pose_list = []
for index in range(start_idx, start_idx + length):
    # step 1
    tgt_musk = np.zeros((height, width, 3)).astype('uint8')
    # step 2
    tgt_musk_path = os.path.join(inputs_dict['pose'], "{}.npy".format(index))
    detected_pose = np.load(tgt_musk_path, allow_pickle=True).tolist()
    imh_new, imw_new, rb, re, cb, ce = detected_pose['draw_pose_params']
    # step 3
    im = draw_pose_select_v2(detected_pose, imh_new, imw_new, ref_w=800)
    # step 4
    im = np.transpose(np.array(im), (1, 2, 0))
    tgt_musk[rb:re, cb:ce, :] = im
    # step 5
    tgt_musk_pil = Image.fromarray(np.array(tgt_musk)).convert('RGB')
    pose_list.append(torch.Tensor(np.array(tgt_musk_pil)).to(dtype=dtype, device=device).permute(2, 0, 1) / 255.0)
poses_tensor = torch.stack(pose_list, dim=1).unsqueeze(0)
Step 1: Allocate a blank canvas per frame at the final video size (768×768).

Step 2: Load the .npy file.
detected_pose contains all the pose information.

A pose patch is just a small image that shows the detected keypoints (e.g., the hand skeleton) on a black background. We draw only the part that matters (the hands) into a small sticker of size imh_new × imw_new. Then we paste it onto a big blank page (the full frame, H × W) at the position given by:
- rows rb:re (top→bottom): row_begin and row_end
- cols cb:ce (left→right): col_begin and col_end
So those numbers mean:
- imh_new: height of the sticker
- imw_new: width of the sticker
- rb, re: where to place it vertically on the big page
- cb, ce: where to place it horizontally on the big page
Two checks always hold:
- re - rb == imh_new
- ce - cb == imw_new
In this demo, however, the pose patch has the same size as the target frame. Honestly, I still do not know when a smaller patch is useful; we can simply accept it and move on.
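A tiny illustration of the paste step with made-up numbers (the patch size and offsets below are hypothetical, just to show the slicing and the two invariants):

import numpy as np

H, W = 768, 768
imh_new, imw_new, rb, re, cb, ce = 256, 192, 100, 356, 300, 492  # hypothetical values

canvas = np.zeros((H, W, 3), dtype=np.uint8)  # big blank page
patch = np.random.randint(0, 255, (imh_new, imw_new, 3), dtype=np.uint8)  # stand-in sticker

assert re - rb == imh_new and ce - cb == imw_new  # the two invariants above
canvas[rb:re, cb:ce, :] = patch                   # paste the sticker onto the page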
Step 3: Call draw_pose_select_v2(...)
This function draws 21-keypoint hands (left and right) into a CHW uint8 image. Here we can see the result image:

So, basically, this function uses the pose data to draw the hands. It returns an image.
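A quick way to confirm that layout is a hedged check you could drop into the loop right after the call; the expected shape follows from the paste step above rather than from any documented contract:

im = draw_pose_select_v2(detected_pose, imh_new, imw_new, ref_w=800)
arr = np.array(im)
print(arr.shape, arr.dtype)  # expected: (3, imh_new, imw_new) uint8, i.e. CHW, hands only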
Step 4: Paste it back
Nothing special here: because the patch size equals the target size, tgt_musk simply holds the image above.
Step 5: NumPy → PIL → NumPy → Tensor
These two lines take tgt_musk, ensure it is in RGB format, convert it to a normalised PyTorch tensor (C, H, W) on the right device, and append it to pose_list.
Simple optimisation
We can avoid the PIL round-trip (it is slower and allocates more) and go straight to torch.
# imports needed if this lives outside app.py
import os
from typing import Callable

import numpy as np
import torch

from src.utils.dwpose_util import draw_pose_select_v2


def build_poses_tensor(
    pose_dir: str,
    start_idx: int,
    length: int,
    H: int,
    W: int,
    device: torch.device,
    dtype: torch.dtype,
    renderer: Callable = draw_pose_select_v2,
    ref_w: int = 800,
) -> torch.Tensor:
    """
    Loads per-frame pose .npy files, renders hands to a patch, pastes onto an HxW canvas,
    and returns a tensor shaped [1, 3, T, H, W] in [0, 1].
    """
    # preallocate to avoid Python lists + stack
    poses = torch.empty((1, 3, length, H, W), device=device, dtype=dtype)
    for t, index in enumerate(range(start_idx, start_idx + length)):
        # 1) blank canvas (H, W, C)
        canvas = np.zeros((H, W, 3), dtype=np.uint8)
        # 2) load pose dict
        npy_path = os.path.join(pose_dir, f"{index}.npy")
        pose = np.load(npy_path, allow_pickle=True).item()
        # 3) draw small patch then paste
        imh, imw, rb, re, cb, ce = pose["draw_pose_params"]
        patch_chw = renderer(pose, imh, imw, ref_w=ref_w)  # CHW uint8
        canvas[rb:re, cb:ce, :] = patch_chw.transpose(1, 2, 0)  # HWC uint8
        # 4) to torch [C,H,W] in [0,1], place into buffer
        poses[0, :, t] = torch.from_numpy(canvas).to(device=device).permute(2, 0, 1).to(dtype) / 255.0
    return poses
def generate(image_input, audio_input, pose_input, width, height, length, steps, sample_rate, cfg, fps, context_frames,
             context_overlap, quantization_input, seed):
    # ...
    start_idx = 0

    import time
    start = time.time()
    pose_list = []
    for index in range(start_idx, start_idx + length):
        tgt_musk = np.zeros((height, width, 3)).astype('uint8')
        tgt_musk_path = os.path.join(inputs_dict['pose'], "{}.npy".format(index))
        detected_pose = np.load(tgt_musk_path, allow_pickle=True).tolist()
        imh_new, imw_new, rb, re, cb, ce = detected_pose['draw_pose_params']
        im = draw_pose_select_v2(detected_pose, imh_new, imw_new, ref_w=800)
        im = np.transpose(np.array(im), (1, 2, 0))
        tgt_musk[rb:re, cb:ce, :] = im
        tgt_musk_pil = Image.fromarray(np.array(tgt_musk)).convert('RGB')
        pose_list.append(torch.Tensor(np.array(tgt_musk_pil)).to(dtype=dtype, device=device).permute(2, 0, 1) / 255.0)
    poses_tensor = torch.stack(pose_list, dim=1).unsqueeze(0)
    end = time.time()
    elapsed = end - start
    print(f"Elapsed time: {elapsed:.2f} seconds")

    start = time.time()
    # TODO: this one may be more efficient
    poses_tensor2 = build_poses_tensor(
        pose_dir=inputs_dict["pose"],
        start_idx=start_idx,
        length=length,
        H=height,  # note: H first
        W=width,   # then W
        device=torch.device(device),
        dtype=dtype,
        renderer=draw_pose_select_v2,  # swap this to include body/face if you want
        ref_w=800,
    )
    end = time.time()
    elapsed = end - start
    print(f"Elapsed time: {elapsed:.2f} seconds")

    assert torch.allclose(poses_tensor, poses_tensor2, rtol=1e-5, atol=1e-8)
    # ...
I did a rough comparison of performance and equivalence (original loop first, then the new function):
Elapsed time: 1.19 seconds
Elapsed time: 0.51 seconds
Then I replaced the original logic with my new version on my forked repository.
The pose sequence goes to the Pose Encoder
Now you should have a better understanding of what a pose image is: it is a very sparse image/matrix. The pose sequence poses_tensor is [B=1, C=3, T, H, W] in [0, 1].
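A quick sanity check (not part of the repo) confirming the shape, the value range, and just how sparse the rendered frames are:

print(poses_tensor.shape)                                    # torch.Size([1, 3, T, 768, 768])
print(poses_tensor.min().item(), poses_tensor.max().item())  # both within [0, 1]
nonzero = (poses_tensor > 0).float().mean().item()
print(f"non-zero pixels: {nonzero:.2%}")                     # a small fraction: the maps are sparse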

The Pose Encoder is a small CNN that turns each pose image into features. Those pose features are fed into the 3D denoising UNet. The PoseEncoder is defined in src/model/pose_encoder.py and its weight is pretrained_weights/pose_encoder.pth.
Currently, we only need to know the input and output of the pose encoder. We will dive into it in a future post.
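If you want a sneak peek before that deep dive, you can inspect the checkpoint itself. This assumes pose_encoder.pth is a plain state dict, which is common but not something I have verified here:

import torch

state = torch.load("pretrained_weights/pose_encoder.pth", map_location="cpu")
for name, tensor in list(state.items())[:5]:
    print(name, tuple(tensor.shape))  # the layer names hint at the small CNN structure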

Back to NPY Format
Data dict
For the detail-oriented, let us look back at the NPY format.
- bodies/candidate: all body joints as normalised [x, y] in [0, 1]
- bodies/subset: joint indices for each detected person
- bodies/score: per-joint confidence
- hands: (2, 21, 2) → left and right hands
- hands_score: (2, 21) → confidences
- faces: (F, 68, 2) → face landmarks
- faces_score: (F, 68) → confidences
- draw_pose_params: size and paste location of the pose patch
All coordinates are normalised. You can convert to pixels with x_px = x * W, y_px = y * H.
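For example, a tiny hypothetical helper (not in the repo) that converts one hand to pixel coordinates:

import numpy as np

def to_pixels(points_xy: np.ndarray, W: int, H: int) -> np.ndarray:
    """points_xy: (N, 2) normalised [x, y] in [0, 1] -> pixel coordinates."""
    return points_xy * np.array([W, H], dtype=points_xy.dtype)

left_hand_px = to_pixels(pose["hands"][0], W=768, H=768)  # 'pose' is the dict loaded earlier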
Index ordering
The critical question here is which index ordering is used.
DWPose is a modern, distillation-enhanced pose estimation model, similar to OpenPose. Importantly, its keypoint format is based on COCO-WholeBody [3–5].

DWPose predicts COCO-WholeBody (133 points) internally. However, the EM2 repo converts the body joints (COCO-17, the MMPose convention) to the OpenPose "BODY_18" ordering.

You can see that in src/models/dwpose/wholebody.py, where they:
- compute a neck point as the mean of left & right shoulders,
- insert it, and
- reorder indices to match OpenPose.
# wholebody.py
# ...
mmpose_idx = [
    17, 6, 8, 10, 7, 9, 12, 14, 16, 13, 15, 2, 1, 4, 3
]
openpose_idx = [
    1, 2, 3, 4, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17
]
new_keypoints_info[:, openpose_idx] = \
    new_keypoints_info[:, mmpose_idx]
# ...
Here, 17 (in the code) is the reconstructed neck index, mapped to 1. Index 6 (in the code) is 7 in the COCO-17 figure (this can be confusing because indices in that figure start at 1): the right shoulder, mapped to 2 in the OpenPose ordering. Index 8 (in the code) is 9 in the COCO-17 figure: the right elbow, mapped to 3. And so on.
Therefore, the body order is:
0 Nose
1 Neck (inserted: mean(L-shoulder, R-shoulder))
2 R-Shoulder
3 R-Elbow
4 R-Wrist
5 L-Shoulder
6 L-Elbow
7 L-Wrist
8 R-Hip
9 R-Knee
10 R-Ankle
11 L-Hip
12 L-Knee
13 L-Ankle
14 R-Eye
15 L-Eye
16 R-Ear
17 L-Ear
Hands (21) order (MediaPipe/DWPose):
0 wrist
Thumb: 1,2,3,4
Index: 5,6,7,8
Middle: 9,10,11,12
Ring: 13,14,15,16
Pinky: 17,18,19,20
Face (68) order (iBUG-68):
0–16 jawline
17–26 eyebrows
27–35 nose
36–41 right eye, 42–47 left eye
48–67 mouth
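Before the visual check, we can verify the body remap purely with joint names. The name list below is the standard COCO-17 ordering plus the inserted neck; applying the same index swap as wholebody.py should reproduce the OpenPose-18 list above:

import numpy as np

# COCO-17 joint names (0-indexed) plus the inserted neck at index 17
coco = ["nose", "l_eye", "r_eye", "l_ear", "r_ear", "l_shoulder", "r_shoulder",
        "l_elbow", "r_elbow", "l_wrist", "r_wrist", "l_hip", "r_hip",
        "l_knee", "r_knee", "l_ankle", "r_ankle", "neck"]

mmpose_idx = [17, 6, 8, 10, 7, 9, 12, 14, 16, 13, 15, 2, 1, 4, 3]
openpose_idx = [1, 2, 3, 4, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17]

names = np.array(coco, dtype=object)
names[openpose_idx] = names[mmpose_idx]  # same swap as in wholebody.py

for i, n in enumerate(names):
    print(i, n)  # matches the list above: 0 nose, 1 neck, 2 r_shoulder, ...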
Finally, we can validate the ordering by visualising those keypoints:

# experiments/validate_joint_ordering.py
import os, numpy as np, matplotlib.pyplot as plt
from pathlib import Path
from src.utils.dwpose_util import draw_bodypose_white, draw_handpose, draw_facepose

# pick one frame
npy_path = Path("../assets/halfbody_demo/pose/01/100.npy")
pose = np.load(npy_path, allow_pickle=True).item()

H, W = 768, 768
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# ------------ BODY (OpenPose-18) ------------
ax = axes[0]; ax.set_title("Body (OpenPose-18)")
canvas_body = np.full((H, W, 3), 255, dtype=np.uint8)
b = pose["bodies"]
canvas_body = draw_bodypose_white(canvas_body, b["candidate"], b["subset"], b["score"])
# (optional) label the 18 joints to verify ordering
if len(b["subset"]) > 0:
    idxs = b["subset"][0].astype(int)
    for j, idx in enumerate(idxs):
        if idx == -1:
            continue
        x, y = b["candidate"][idx]
        ax.text(x*W + 3, y*H, str(j), color="black", fontsize=8)
ax.imshow(canvas_body[..., ::-1])  # BGR->RGB for matplotlib
ax.set_xlim(0, W); ax.set_ylim(H, 0); ax.axis("off")

# ------------ LEFT HAND (21) ------------
ax = axes[1]; ax.set_title("Left hand (21)")
canvas_l = np.full((H, W, 3), 255, dtype=np.uint8)
if pose["hands"].shape[0] >= 1:
    # draw_handpose expects a LIST of hands and a LIST of scores
    canvas_l = draw_handpose(canvas_l, [pose["hands"][0]], [pose["hands_score"][0]])
    # label points
    for j, (x, y) in enumerate(pose["hands"][0]):
        ax.text(x*W + 3, y*H, str(j), color="black", fontsize=8)
ax.imshow(canvas_l[..., ::-1])
ax.set_xlim(0, W); ax.set_ylim(H, 0); ax.axis("off")

# ------------ RIGHT HAND (21) ------------
ax = axes[2]; ax.set_title("Right hand (21)")
canvas_r = np.full((H, W, 3), 255, dtype=np.uint8)
if pose["hands"].shape[0] >= 2:
    canvas_r = draw_handpose(canvas_r, [pose["hands"][1]], [pose["hands_score"][1]])
    for j, (x, y) in enumerate(pose["hands"][1]):
        ax.text(x*W + 3, y*H, str(j), color="black", fontsize=8)
ax.imshow(canvas_r[..., ::-1])
ax.set_xlim(0, W); ax.set_ylim(H, 0); ax.axis("off")

plt.tight_layout(); plt.show()
If you found this article helpful, please show your support by clicking the clap icon 👏 and following me 🙏. Thank you for taking the time to read it, and have a wonderful day!
References
[1] Meng, R., Zhang, X., Li, Y., & Ma, C. (2025). EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation. In Proceedings of the Computer Vision and Pattern Recognition Conference (pp. 5489–5498).
[2] https://github.com/he-yulong/echomimic_v2/tree/8ad07a4017efc5c674dfef5dcdc64f394d70a8ce
[3] jin-s13/COCO-WholeBody
[4] DeepWiki: huggingface/controlnet_aux
[5] Keypoints in wholebody-coco format?