Understanding Pose Inputs in the EchoMimicV2 Pipeline | Portrait Animation Series #3

This article analyses the pose input format and its role in the EchoMimicV2 (EM2) pipeline. In the previous post we set customised pose inputs; now we want to dive in and gain a better understanding of this data and its role in the pipeline.
Customised Pose Inputs in EchoMimicV2 | Portrait Animation Series #2 (heyulong3d.medium.com)
The EchoMimicV2 paper [1] does not lock itself to a specific keypoint detector. It only mentions that a keypoint map needs to be passed into the Pose Encoder.
Besides, we also integrate a Pose Encoder Ep to extract keypoint maps.
However, the code in the repository [2] appears to depend on DWPose.

Although we can change the pose inputs, the question here is: how do they work in the pipeline?
The key observation is that EM2 passes pose data into the encoder as keypoint images/maps rather than raw keypoint data. This was a little surprising to me. I expected it to be costly, but the system benefits from this design, and the pose encoder is not the bottleneck.
· .npy to pose image (per frame)
∘ Entrance point
∘ Simple optimisation
· The pose sequence goes to the Pose Encoder
· Back to NPY Format
∘ Data dict
∘ Index ordering
· References
.npy to pose image (per frame)
Each *.npy file is a dict with hands, hands_score, bodies, faces, faces_score, num and draw_pose_params. You can use the script below to load the data from *.npy files.
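Here is a minimal loading sketch; the path assumes the demo pose folder from the repo's assets, so adjust it to your own frames if needed:

import numpy as np

npy_path = "assets/halfbody_demo/pose/01/100.npy"  # demo frame; adjust to your data
pose = np.load(npy_path, allow_pickle=True).item()  # 0-d object array -> dict

print(sorted(pose.keys()))
# ['bodies', 'draw_pose_params', 'faces', 'faces_score', 'hands', 'hands_score', 'num']
print(pose["hands"].shape, pose["hands_score"].shape)  # (2, 21, 2), (2, 21)
print(pose["draw_pose_params"])                        # [imh_new, imw_new, rb, re, cb, ce]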

Entrance point
When you run app.py, the "Generate video" action invokes a function called generate. It initialises all the models and reads the arguments from the Web page. It then loads those .npy files, one file per frame, and calls src/utils/dwpose_util.py::draw_pose_select_v2(...).
# app.py
# ...
pose_list = []
for index in range(start_idx, start_idx + length):
    # step 1
    tgt_musk = np.zeros((height, width, 3)).astype('uint8')
    # step 2
    tgt_musk_path = os.path.join(inputs_dict['pose'], "{}.npy".format(index))
    detected_pose = np.load(tgt_musk_path, allow_pickle=True).tolist()
    imh_new, imw_new, rb, re, cb, ce = detected_pose['draw_pose_params']
    # step 3
    im = draw_pose_select_v2(detected_pose, imh_new, imw_new, ref_w=800)
    # step 4
    im = np.transpose(np.array(im), (1, 2, 0))
    tgt_musk[rb:re, cb:ce, :] = im
    # step 5
    tgt_musk_pil = Image.fromarray(np.array(tgt_musk)).convert('RGB')
    pose_list.append(torch.Tensor(np.array(tgt_musk_pil)).to(dtype=dtype, device=device).permute(2, 0, 1) / 255.0)
poses_tensor = torch.stack(pose_list, dim=1).unsqueeze(0)
Step 1: Allocate a blank canvas per frame at the final video size (768×768).

Step 2: Load the .npy file.
detected_pose contains all the pose information.

A pose patch is just a small image that shows the detected keypoints (e.g., the hand skeleton) on a black background. We draw only the part that matters (the hands) into a small sticker of size imh_new × imw_new. Then we paste it onto a big blank page (the full frame, H × W) at the position given by:
- rows rb:re (top→bottom): row_begin and row_end
- cols cb:ce (left→right): col_begin and col_end
So those numbers mean:
- imh_new: height of the sticker
- imw_new: width of the sticker
- rb, re: where to place it vertically on the big page
- cb, ce: where to place it horizontally on the big page
Two checks always hold:
- re - rb == imh_new
- ce - cb == imw_new
In this demo, however, the pose patch has the same size as the target frame. Honestly, I still do not know when a smaller patch is useful; we can simply accept it and move on.
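A tiny illustration of the paste step with made-up numbers (the patch size and offsets below are hypothetical, just to show the slicing and the two invariants):

import numpy as np

H, W = 768, 768
imh_new, imw_new, rb, re, cb, ce = 256, 192, 100, 356, 300, 492  # hypothetical values

canvas = np.zeros((H, W, 3), dtype=np.uint8)  # big blank page
patch = np.random.randint(0, 255, (imh_new, imw_new, 3), dtype=np.uint8)  # stand-in sticker

assert re - rb == imh_new and ce - cb == imw_new  # the two invariants above
canvas[rb:re, cb:ce, :] = patch                   # paste the sticker onto the page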
Step 3: Call draw_pose_select_v2(...)
This function draws 21-keypoint hands (left and right) into a CHW uint8 image. Here we can see the result image:

So, basically, this function uses the pose data to draw the hands. It returns an image.
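A quick way to confirm that layout is a hedged check you could drop into the loop right after the call; the expected shape follows from the paste step above rather than from any documented contract:

im = draw_pose_select_v2(detected_pose, imh_new, imw_new, ref_w=800)
arr = np.array(im)
print(arr.shape, arr.dtype)  # expected: (3, imh_new, imw_new) uint8, i.e. CHW, hands only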
Step 4: Paste it back
Nothing special here: because the patch size equals the target size, tgt_musk simply holds the image above.
Step 5: NumPy → PIL → NumPy → Tensor
These two lines take tgt_musk, ensure it is in RGB format, convert it to a normalised PyTorch tensor (C, H, W) on the right device, and append it to pose_list.
Simple optimisation
We can avoid the PIL round-trip (it is slower and allocates more) and go straight to torch.
# imports needed if this lives outside app.py
import os
from typing import Callable

import numpy as np
import torch

from src.utils.dwpose_util import draw_pose_select_v2


def build_poses_tensor(
    pose_dir: str,
    start_idx: int,
    length: int,
    H: int,
    W: int,
    device: torch.device,
    dtype: torch.dtype,
    renderer: Callable = draw_pose_select_v2,
    ref_w: int = 800,
) -> torch.Tensor:
    """
    Loads per-frame pose .npy files, renders hands to a patch, pastes onto an HxW canvas,
    and returns a tensor shaped [1, 3, T, H, W] in [0, 1].
    """
    # preallocate to avoid Python lists + stack
    poses = torch.empty((1, 3, length, H, W), device=device, dtype=dtype)
    for t, index in enumerate(range(start_idx, start_idx + length)):
        # 1) blank canvas (H, W, C)
        canvas = np.zeros((H, W, 3), dtype=np.uint8)
        # 2) load pose dict
        npy_path = os.path.join(pose_dir, f"{index}.npy")
        pose = np.load(npy_path, allow_pickle=True).item()
        # 3) draw small patch then paste
        imh, imw, rb, re, cb, ce = pose["draw_pose_params"]
        patch_chw = renderer(pose, imh, imw, ref_w=ref_w)  # CHW uint8
        canvas[rb:re, cb:ce, :] = patch_chw.transpose(1, 2, 0)  # HWC uint8
        # 4) to torch [C,H,W] in [0,1], place into buffer
        poses[0, :, t] = torch.from_numpy(canvas).to(device=device).permute(2, 0, 1).to(dtype) / 255.0
    return poses
def generate(image_input, audio_input, pose_input, width, height, length, steps, sample_rate, cfg, fps, context_frames,
             context_overlap, quantization_input, seed):
    # ...
    start_idx = 0

    import time
    start = time.time()
    pose_list = []
    for index in range(start_idx, start_idx + length):
        tgt_musk = np.zeros((height, width, 3)).astype('uint8')
        tgt_musk_path = os.path.join(inputs_dict['pose'], "{}.npy".format(index))
        detected_pose = np.load(tgt_musk_path, allow_pickle=True).tolist()
        imh_new, imw_new, rb, re, cb, ce = detected_pose['draw_pose_params']
        im = draw_pose_select_v2(detected_pose, imh_new, imw_new, ref_w=800)
        im = np.transpose(np.array(im), (1, 2, 0))
        tgt_musk[rb:re, cb:ce, :] = im
        tgt_musk_pil = Image.fromarray(np.array(tgt_musk)).convert('RGB')
        pose_list.append(torch.Tensor(np.array(tgt_musk_pil)).to(dtype=dtype, device=device).permute(2, 0, 1) / 255.0)
    poses_tensor = torch.stack(pose_list, dim=1).unsqueeze(0)
    end = time.time()
    elapsed = end - start
    print(f"Elapsed time: {elapsed:.2f} seconds")

    start = time.time()
    # TODO: this one may be more efficient
    poses_tensor2 = build_poses_tensor(
        pose_dir=inputs_dict["pose"],
        start_idx=start_idx,
        length=length,
        H=height,  # note: H first
        W=width,   # then W
        device=torch.device(device),
        dtype=dtype,
        renderer=draw_pose_select_v2,  # swap this to include body/face if you want
        ref_w=800,
    )
    end = time.time()
    elapsed = end - start
    print(f"Elapsed time: {elapsed:.2f} seconds")

    assert torch.allclose(poses_tensor, poses_tensor2, rtol=1e-5, atol=1e-8)
    # ...
I did a rough comparison of performance and equivalence (original loop first, then the new function):
Elapsed time: 1.19 seconds
Elapsed time: 0.51 seconds
Then I replaced the original logic with my new version on my forked repository.
The pose sequence goes to the Pose Encoder
Now you should have a better understanding of what a pose image is: it is a very sparse image/matrix. The pose sequence poses_tensor is [B=1, C=3, T, H, W] in [0, 1].
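A quick sanity check (not part of the repo) confirming the shape, the value range, and just how sparse the rendered frames are:

print(poses_tensor.shape)                                    # torch.Size([1, 3, T, 768, 768])
print(poses_tensor.min().item(), poses_tensor.max().item())  # both within [0, 1]
nonzero = (poses_tensor > 0).float().mean().item()
print(f"non-zero pixels: {nonzero:.2%}")                     # a small fraction: the maps are sparse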

The Pose Encoder is a small CNN that turns each pose image into features. Those pose features are fed into the 3D denoising UNet. The PoseEncoder is defined in src/model/pose_encoder.py and its weight is pretrained_weights/pose_encoder.pth.
Currently, we only need to know the input and output of the pose encoder. We will dive into it in a future post.
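If you want a sneak peek before that deep dive, you can inspect the checkpoint itself. This assumes pose_encoder.pth is a plain state dict, which is common but not something I have verified here:

import torch

state = torch.load("pretrained_weights/pose_encoder.pth", map_location="cpu")
for name, tensor in list(state.items())[:5]:
    print(name, tuple(tensor.shape))  # the layer names hint at the small CNN structure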

Back to NPY Format
Data dict
For the detail-oriented, let us look back at the NPY format.
- bodies/candidate: all body joints as normalised [x, y] in [0, 1]
- bodies/subset: joint indices for each detected person
- bodies/score: per-joint confidence
- hands: (2, 21, 2) → left and right hands
- hands_score: (2, 21) → confidences
- faces: (F, 68, 2) → face landmarks
- faces_score: (F, 68) → confidences
- draw_pose_params: size and paste location of the pose patch
All coordinates are normalised. You can convert to pixels with x_px = x * W, y_px = y * H.
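For example, a tiny hypothetical helper (not in the repo) that converts one hand to pixel coordinates:

import numpy as np

def to_pixels(points_xy: np.ndarray, W: int, H: int) -> np.ndarray:
    """points_xy: (N, 2) normalised [x, y] in [0, 1] -> pixel coordinates."""
    return points_xy * np.array([W, H], dtype=points_xy.dtype)

left_hand_px = to_pixels(pose["hands"][0], W=768, H=768)  # 'pose' is the dict loaded earlier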
Index ordering
The critical question here is which index ordering is used.
DWPose is a modern, distillation-enhanced pose estimation model, similar to OpenPose. Importantly, its keypoint format is based on COCO-WholeBody [3–5].

DWPose predicts COCO-WholeBody (133 points) internally. However, the EM2 repo converts the body joints (COCO-17, the MMPose convention) to the OpenPose "BODY_18" ordering.

You can see that in src/models/dwpose/wholebody.py, where they:
- compute a neck point as the mean of left & right shoulders,
- insert it, and
- reorder indices to match OpenPose.
# wholebody.py
# ...
mmpose_idx = [
    17, 6, 8, 10, 7, 9, 12, 14, 16, 13, 15, 2, 1, 4, 3
]
openpose_idx = [
    1, 2, 3, 4, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17
]
new_keypoints_info[:, openpose_idx] = \
    new_keypoints_info[:, mmpose_idx]
# ...
Here, 17 (in the code) is the reconstructed neck index, mapped to 1. Index 6 (in the code) is 7 in the COCO-17 figure (this can be confusing because indices in that figure start at 1): the right shoulder, mapped to 2 in the OpenPose ordering. Index 8 (in the code) is 9 in the COCO-17 figure: the right elbow, mapped to 3. And so on.
Therefore, the body order is:
0 Nose
1 Neck (inserted: mean(L-shoulder, R-shoulder))
2 R-Shoulder
3 R-Elbow
4 R-Wrist
5 L-Shoulder
6 L-Elbow
7 L-Wrist
8 R-Hip
9 R-Knee
10 R-Ankle
11 L-Hip
12 L-Knee
13 L-Ankle
14 R-Eye
15 L-Eye
16 R-Ear
17 L-Ear
Hands (21) order (MediaPipe/DWPose):
0 wrist
Thumb: 1,2,3,4
Index: 5,6,7,8
Middle: 9,10,11,12
Ring: 13,14,15,16
Pinky: 17,18,19,20
Face (68) order (iBUG-68):
0–16 jawline
17–26 eyebrows
27–35 nose
36–41 right eye, 42–47 left eye
48–67 mouth
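Before the visual check, we can verify the body remap purely with joint names. The name list below is the standard COCO-17 ordering plus the inserted neck; applying the same index swap as wholebody.py should reproduce the OpenPose-18 list above:

import numpy as np

# COCO-17 joint names (0-indexed) plus the inserted neck at index 17
coco = ["nose", "l_eye", "r_eye", "l_ear", "r_ear", "l_shoulder", "r_shoulder",
        "l_elbow", "r_elbow", "l_wrist", "r_wrist", "l_hip", "r_hip",
        "l_knee", "r_knee", "l_ankle", "r_ankle", "neck"]

mmpose_idx = [17, 6, 8, 10, 7, 9, 12, 14, 16, 13, 15, 2, 1, 4, 3]
openpose_idx = [1, 2, 3, 4, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17]

names = np.array(coco, dtype=object)
names[openpose_idx] = names[mmpose_idx]  # same swap as in wholebody.py

for i, n in enumerate(names):
    print(i, n)  # matches the list above: 0 nose, 1 neck, 2 r_shoulder, ...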
Finally, we can validate the ordering by visualising those keypoints:

# experiments/validate_joint_ordering.py
import os, numpy as np, matplotlib.pyplot as plt
from pathlib import Path
from src.utils.dwpose_util import draw_bodypose_white, draw_handpose, draw_facepose

# pick one frame
npy_path = Path("../assets/halfbody_demo/pose/01/100.npy")
pose = np.load(npy_path, allow_pickle=True).item()

H, W = 768, 768
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# ------------ BODY (OpenPose-18) ------------
ax = axes[0]; ax.set_title("Body (OpenPose-18)")
canvas_body = np.full((H, W, 3), 255, dtype=np.uint8)
b = pose["bodies"]
canvas_body = draw_bodypose_white(canvas_body, b["candidate"], b["subset"], b["score"])
# (optional) label the 18 joints to verify ordering
if len(b["subset"]) > 0:
    idxs = b["subset"][0].astype(int)
    for j, idx in enumerate(idxs):
        if idx == -1:
            continue
        x, y = b["candidate"][idx]
        ax.text(x*W + 3, y*H, str(j), color="black", fontsize=8)
ax.imshow(canvas_body[..., ::-1])  # BGR->RGB for matplotlib
ax.set_xlim(0, W); ax.set_ylim(H, 0); ax.axis("off")

# ------------ LEFT HAND (21) ------------
ax = axes[1]; ax.set_title("Left hand (21)")
canvas_l = np.full((H, W, 3), 255, dtype=np.uint8)
if pose["hands"].shape[0] >= 1:
    # draw_handpose expects a LIST of hands and a LIST of scores
    canvas_l = draw_handpose(canvas_l, [pose["hands"][0]], [pose["hands_score"][0]])
    # label points
    for j, (x, y) in enumerate(pose["hands"][0]):
        ax.text(x*W + 3, y*H, str(j), color="black", fontsize=8)
ax.imshow(canvas_l[..., ::-1])
ax.set_xlim(0, W); ax.set_ylim(H, 0); ax.axis("off")

# ------------ RIGHT HAND (21) ------------
ax = axes[2]; ax.set_title("Right hand (21)")
canvas_r = np.full((H, W, 3), 255, dtype=np.uint8)
if pose["hands"].shape[0] >= 2:
    canvas_r = draw_handpose(canvas_r, [pose["hands"][1]], [pose["hands_score"][1]])
    for j, (x, y) in enumerate(pose["hands"][1]):
        ax.text(x*W + 3, y*H, str(j), color="black", fontsize=8)
ax.imshow(canvas_r[..., ::-1])
ax.set_xlim(0, W); ax.set_ylim(H, 0); ax.axis("off")

plt.tight_layout(); plt.show()
If you found this article helpful, please show your support by clicking the clap icon 👏 and following me 🙏. Thank you for taking the time to read it, and have a wonderful day!
References
[1] Meng, R., Zhang, X., Li, Y., & Ma, C. (2025). EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation. In Proceedings of the Computer Vision and Pattern Recognition Conference (pp. 5489–5498).
[2] https://github.com/he-yulong/echomimic_v2/tree/8ad07a4017efc5c674dfef5dcdc64f394d70a8ce
[3] jin-s13/COCO-WholeBody
[4] DeepWiki: huggingface/controlnet_aux
[5] Keypoints in wholebody-coco format?