
EchoMimicV2 is an open-source framework for audio-driven human portrait animation. Unlike many other talking-head solutions, which focus only on facial animation and head movement, this framework also animates the upper body.
The official repository was tested on Linux. This article records the process of running the framework on Windows 11 with an NVIDIA GeForce RTX 4080.
Environment Setup
My CUDA version is 12.4. You can check yours with:
nvcc --version
The output:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:28:36_Pacific_Standard_Time_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0
Then use Anaconda to create a Python virtual environment:
conda create -n EM2 python=3.10
conda activate EM2
Follow the instructions below, adapted from the official repository's README file:
git clone https://github.com/antgroup/echomimic_v2
# or my tested version: https://github.com/antgroup/echomimic_v2/tree/a312dec05ec7f9b3e0e2c2802e4a1a5d3788cfb3
cd echomimic_v2
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 xformers==0.0.28.post3 --index-url https://download.pytorch.org/whl/cu124
pip install torchao
pip install -r requirements.txt
pip install --no-deps facenet_pytorch==2.6.0
# some fixes for Windows
pip install -U "gradio==4.44.1" "gradio_client==1.3.0" "fastapi==0.115.5" "starlette==0.41.2" "pydantic==2.9.2"
pip install triton-windows==3.1.0.post17
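Before moving on, it is worth verifying that the installed PyTorch build actually targets CUDA 12.4 and can see the GPU. A minimal check, run inside the EM2 environment (the versions should match the wheels pinned above):
import torch
import torchvision
import xformers

# Versions should match the pinned wheels: torch 2.5.1+cu124, torchvision 0.20.1, xformers 0.0.28.post3.
print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
print("torchvision:", torchvision.__version__)
print("xformers:", xformers.__version__)

# CUDA must be available, otherwise inference will be painfully slow or fail outright.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))  # should report the GeForce RTX 4080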
We also need to install another dependency: FFmpeg. Simply download the build and add its bin folder to the Path environment variable. Then, when you run:
ffmpeg -version
The output should look like this:
ffmpeg version 7.1.1-essentials_build-www.gyan.dev Copyright (c) 2000-2025 the FFmpeg developers
built with gcc 14.2.0 (Rev1, Built by MSYS2 project)
configuration: --enable-gpl --enable-version3 --enable-static --disable-w32threads --disable-autodetect --enable-fontconfig --enable-iconv --enable-gnutls --enable-libxml2 --enable-gmp --enable-bzlib --enable-lzma --enable-zlib --enable-libsrt --enable-libssh --enable-libzmq --enable-avisynth --enable-sdl2 --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxvid --enable-libaom --enable-libopenjpeg --enable-libvpx --enable-mediafoundation --enable-libass --enable-libfreetype --enable-libfribidi --enable-libharfbuzz --enable-libvidstab --enable-libvmaf --enable-libzimg --enable-amf --enable-cuda-llvm --enable-cuvid --enable-dxva2 --enable-d3d11va --enable-d3d12va --enable-ffnvcodec --enable-libvpl --enable-nvdec --enable-nvenc --enable-vaapi --enable-libgme --enable-libopenmpt --enable-libopencore-amrwb --enable-libmp3lame --enable-libtheora --enable-libvo-amrwbenc --enable-libgsm --enable-libopencore-amrnb --enable-libopus --enable-libspeex --enable-libvorbis --enable-librubberband
libavutil 59. 39.100 / 59. 39.100
libavcodec 61. 19.101 / 61. 19.101
libavformat 61. 7.100 / 61. 7.100
libavdevice 61. 3.100 / 61. 3.100
libavfilter 10. 4.100 / 10. 4.100
libswscale 8. 3.100 / 8. 3.100
libswresample 5. 3.100 / 5. 3.100
libpostproc 58. 3.100 / 58. 3.100
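Since FFmpeg is invoked as an external program, it also needs to be resolvable from inside the Python environment, not just from the shell where you ran the check above. A quick sketch using only the standard library:
import shutil
import subprocess

# shutil.which resolves ffmpeg through the same Path variable the shell uses.
ffmpeg_path = shutil.which("ffmpeg")
print("ffmpeg found at:", ffmpeg_path)

if ffmpeg_path is not None:
    # Print just the first line of the version banner.
    result = subprocess.run(["ffmpeg", "-version"], capture_output=True, text=True)
    print(result.stdout.splitlines()[0])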
The next step is to download the pretrained weights (this step is time-consuming):
git lfs install
git clone https://huggingface.co/BadToBest/EchoMimicV2 pretrained_weights
git clone https://huggingface.co/stabilityai/sd-vae-ft-mse pretrained_weights/sd-vae-ft-mse
git clone https://huggingface.co/lambdalabs/sd-image-variations-diffusers pretrained_weights/sd-image-variations-diffusers
Then create a folder pretrained_weights/audio_processor and place the tiny.pt file inside it. The final pretrained_weights directory should look like this:
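Before launching the demo, a quick sanity check helps catch a misplaced folder. The sketch below only verifies the paths mentioned in the steps above (the individual weight files inside the EchoMimicV2 download are not listed here); run it from the repository root:
from pathlib import Path

# Paths assembled in the download steps above.
root = Path("pretrained_weights")
expected = [
    root,
    root / "sd-vae-ft-mse",
    root / "sd-image-variations-diffusers",
    root / "audio_processor" / "tiny.pt",
]

for path in expected:
    status = "OK" if path.exists() else "MISSING"
    print(f"{status:8} {path}")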

Running the demo
Now we are ready to go! Run the script below to launch a web server:
python app.py

Set the static image, audio, and pose inputs. For example (a quick path check follows this list):
- image: assets/halfbody_demo/refimag/natural_bk_openhand/0066.png
- audio: assets/halfbody_demo/audio/chinese/fight.wav
- pose: assets/halfbody_demo/pose/01
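To rule out typos in these paths, the small sketch below (run from the repository root) simply confirms that the sample assets listed above exist:
from pathlib import Path

# Sample assets bundled with the repository, as listed above.
samples = [
    Path("assets/halfbody_demo/refimag/natural_bk_openhand/0066.png"),  # reference image
    Path("assets/halfbody_demo/audio/chinese/fight.wav"),               # driving audio
    Path("assets/halfbody_demo/pose/01"),                               # pose sequence folder
]

for path in samples:
    status = "OK" if path.exists() else "MISSING"
    print(f"{status:8} {path}")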

Press the generate video (生成视频) button. After 10+ minutes, the video is generated and can be downloaded.

If you found this article helpful, please show your support by clicking the clap icon and following me. Thank you for taking the time to read it, and have a wonderful day!
References
- EchoMimicV2 GitHub repository: https://github.com/antgroup/echomimic_v2
- Meng, R., Zhang, X., Li, Y., & Ma, C. (2025). EchoMimicV2: Towards striking, simplified, and semi-body human animation. In Proceedings of the Computer Vision and Pattern Recognition Conference (pp. 5489–5498).