Running EchoMimicV2 on Windows 11


EchoMimicV2 is an open-source framework for audio-driven human portrait animation. Unlike many other (talking-head) solutions that focus only on facial animation and head movement, this framework also animates the upper body.

The official repository was tested on Linux. This article records the process of running the framework on Windows 11 with an NVIDIA GeForce RTX 4080.

Environment Setup

My CUDA version: 12.4

nvcc --version  

The output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:28:36_Pacific_Standard_Time_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0
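
Optionally, nvidia-smi shows the installed driver and the highest CUDA version the driver supports; this should be greater than or equal to the 12.4 toolkit targeted by the PyTorch wheels installed below:

nvidia-smi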

Then use Anaconda to create a Python virtual environment:

conda create -n EM2 python=3.10
conda activate EM2
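
As a quick sanity check (not part of the official instructions), confirm that the new environment is active:

python --version

This should report Python 3.10.x.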

Follow the instructions below, adapted from the official repository's README file:

git clone https://github.com/antgroup/echomimic_v2
# or my tested version: https://github.com/antgroup/echomimic_v2/tree/a312dec05ec7f9b3e0e2c2802e4a1a5d3788cfb3
cd echomimic_v2
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 xformers==0.0.28.post3 --index-url https://download.pytorch.org/whl/cu124
pip install torchao
pip install -r requirements.txt
pip install --no-deps facenet_pytorch==2.6.0
# some fixes for Windows
pip install -U "gradio==4.44.1" "gradio_client==1.3.0" "fastapi==0.115.5" "starlette==0.41.2" "pydantic==2.9.2"
pip install triton-windows==3.1.0.post17
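
Before continuing, it is worth checking that this PyTorch build actually sees the GPU (again, just a sanity check rather than an official step):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"

This should print 2.5.1+cu124, 12.4, and True. If torch.cuda.is_available() returns False, recheck the --index-url used in the install command above.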

We need to install another dependency: FFmpeg. Download the prebuilt Windows binaries and add the bin folder to the Path environment variable. Then, when you run:

ffmpeg -version

The output should look like this:

ffmpeg version 7.1.1-essentials_build-www.gyan.dev Copyright (c) 2000-2025 the FFmpeg developers
built with gcc 14.2.0 (Rev1, Built by MSYS2 project)
configuration: --enable-gpl --enable-version3 --enable-static --disable-w32threads --disable-autodetect --enable-fontconfig --enable-iconv --enable-gnutls --enable-libxml2 --enable-gmp --enable-bzlib --enable-lzma --enable-zlib --enable-libsrt --enable-libssh --enable-libzmq --enable-avisynth --enable-sdl2 --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxvid --enable-libaom --enable-libopenjpeg --enable-libvpx --enable-mediafoundation --enable-libass --enable-libfreetype --enable-libfribidi --enable-libharfbuzz --enable-libvidstab --enable-libvmaf --enable-libzimg --enable-amf --enable-cuda-llvm --enable-cuvid --enable-dxva2 --enable-d3d11va --enable-d3d12va --enable-ffnvcodec --enable-libvpl --enable-nvdec --enable-nvenc --enable-vaapi --enable-libgme --enable-libopenmpt --enable-libopencore-amrwb --enable-libmp3lame --enable-libtheora --enable-libvo-amrwbenc --enable-libgsm --enable-libopencore-amrnb --enable-libopus --enable-libspeex --enable-libvorbis --enable-librubberband
libavutil 59. 39.100 / 59. 39.100
libavcodec 61. 19.101 / 61. 19.101
libavformat 61. 7.100 / 61. 7.100
libavdevice 61. 3.100 / 61. 3.100
libavfilter 10. 4.100 / 10. 4.100
libswscale 8. 3.100 / 8. 3.100
libswresample 5. 3.100 / 5. 3.100
libpostproc 58. 3.100 / 58. 3.100

The next step is to download some pretrained weights (time-consuming):

git lfs install
git clone https://huggingface.co/BadToBest/EchoMimicV2 pretrained_weights
git clone https://huggingface.co/stabilityai/sd-vae-ft-mse pretrained_weights/sd-vae-ft-mse
git clone https://huggingface.co/lambdalabs/sd-image-variations-diffusers pretrained_weights/sd-image-variations-diffusers

Then create a folder pretrained_weights/audio_processor and place the file tiny.pt (the Whisper tiny checkpoint linked in the official README) inside it. The final pretrained_weights directory should look like this:
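
(The tree below is reconstructed from the official README; the contents of the two diffusers folders are omitted.)

pretrained_weights/
├── denoising_unet.pth
├── reference_unet.pth
├── motion_module.pth
├── pose_encoder.pth
├── sd-image-variations-diffusers/
├── sd-vae-ft-mse/
└── audio_processor/
    └── tiny.pt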

Running the demo

Now we are ready to go! 🚀 Run the script to launch a web server:

python app.py
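
If everything is installed correctly, Gradio starts and prints a local URL to the console, typically a line like the one below (the port may differ on your machine):

Running on local URL:  http://127.0.0.1:7860

Open that address in a browser to reach the demo UI.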

Set the static image, audio, and pose inputs. For example:

  • image: assets/halfbody_demo/refimag/natural_bk_openhand/0066.png
  • audio: assets/halfbody_demo/audio/chinese/fight.wav
  • pose: assets/halfbody_demo/pose/01

Press the Generate Video (生成视频) button. After 10+ minutes, a video is generated and can be downloaded.

If you found this article helpful, please show your support by clicking the clap icon 👏 and following me 🙏. Thank you for taking the time to read it, and have a wonderful day!

References

  1. EchoMimicV2: https://github.com/antgroup/echomimic_v2
  2. Meng, R., Zhang, X., Li, Y., & Ma, C. (2025). EchoMimicV2: Towards striking, simplified, and semi-body human animation. In Proceedings of the Computer Vision and Pattern Recognition Conference (pp. 5489–5498).