Running EchoMimicV2 on Windows 11


EchoMimicV2 is an open-source framework for audio-driven human portrait animation. Unlike many other (talking-head) solutions that focus only on facial animation and head movement, this framework also animates the upper body.

The official repository was tested on Linux. This article records the process of running the framework on Windows 11 with an NVIDIA GeForce RTX 4080.

Environment Setup

My CUDA version: 12.4

nvcc --version  

The output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:28:36_Pacific_Standard_Time_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0
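
Optionally, nvidia-smi shows the installed driver and the highest CUDA version the driver supports; this should be greater than or equal to the 12.4 toolkit targeted by the PyTorch wheels installed below:

nvidia-smi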

Then use Anaconda to create a Python virtual environment:

conda create -n EM2 python=3.10
conda activate EM2
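
As a quick sanity check (not part of the official instructions), confirm that the new environment is active:

python --version

This should report Python 3.10.x.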

Follow the instructions below, adapted from the official repository's README file:

git clone https://github.com/antgroup/echomimic_v2
# or my tested version: https://github.com/antgroup/echomimic_v2/tree/a312dec05ec7f9b3e0e2c2802e4a1a5d3788cfb3
cd echomimic_v2
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 xformers==0.0.28.post3 --index-url https://download.pytorch.org/whl/cu124
pip install torchao
pip install -r requirements.txt
pip install --no-deps facenet_pytorch==2.6.0
# some fixes for Windows
pip install -U "gradio==4.44.1" "gradio_client==1.3.0" "fastapi==0.115.5" "starlette==0.41.2" "pydantic==2.9.2"
pip install triton-windows==3.1.0.post17
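
Before continuing, it is worth checking that this PyTorch build actually sees the GPU (again, just a sanity check rather than an official step):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"

This should print 2.5.1+cu124, 12.4, and True. If torch.cuda.is_available() returns False, recheck the --index-url used in the install command above.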

We need to install another dependency: FFmpeg. Download the prebuilt Windows binaries and add the bin folder to the Path environment variable. Then, when you run:

ffmpeg -version

The output should look like this:

ffmpeg version 7.1.1-essentials_build-www.gyan.dev Copyright (c) 2000-2025 the FFmpeg developers
built with gcc 14.2.0 (Rev1, Built by MSYS2 project)
configuration: --enable-gpl --enable-version3 --enable-static --disable-w32threads --disable-autodetect --enable-fontconfig --enable-iconv --enable-gnutls --enable-libxml2 --enable-gmp --enable-bzlib --enable-lzma --enable-zlib --enable-libsrt --enable-libssh --enable-libzmq --enable-avisynth --enable-sdl2 --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxvid --enable-libaom --enable-libopenjpeg --enable-libvpx --enable-mediafoundation --enable-libass --enable-libfreetype --enable-libfribidi --enable-libharfbuzz --enable-libvidstab --enable-libvmaf --enable-libzimg --enable-amf --enable-cuda-llvm --enable-cuvid --enable-dxva2 --enable-d3d11va --enable-d3d12va --enable-ffnvcodec --enable-libvpl --enable-nvdec --enable-nvenc --enable-vaapi --enable-libgme --enable-libopenmpt --enable-libopencore-amrwb --enable-libmp3lame --enable-libtheora --enable-libvo-amrwbenc --enable-libgsm --enable-libopencore-amrnb --enable-libopus --enable-libspeex --enable-libvorbis --enable-librubberband
libavutil 59. 39.100 / 59. 39.100
libavcodec 61. 19.101 / 61. 19.101
libavformat 61. 7.100 / 61. 7.100
libavdevice 61. 3.100 / 61. 3.100
libavfilter 10. 4.100 / 10. 4.100
libswscale 8. 3.100 / 8. 3.100
libswresample 5. 3.100 / 5. 3.100
libpostproc 58. 3.100 / 58. 3.100

The next step is to download some pretrained weights (time-consuming):

git lfs install
git clone https://huggingface.co/BadToBest/EchoMimicV2 pretrained_weights
git clone https://huggingface.co/stabilityai/sd-vae-ft-mse pretrained_weights/sd-vae-ft-mse
git clone https://huggingface.co/lambdalabs/sd-image-variations-diffusers pretrained_weights/sd-image-variations-diffusers

Then create a folder pretrained_weights/audio_processor and place the file tiny.pt (the Whisper tiny checkpoint linked in the official README) inside it. The final pretrained_weights directory should look like this:
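
(The tree below is reconstructed from the official README; the contents of the two diffusers folders are omitted.)

pretrained_weights/
├── denoising_unet.pth
├── reference_unet.pth
├── motion_module.pth
├── pose_encoder.pth
├── sd-image-variations-diffusers/
├── sd-vae-ft-mse/
└── audio_processor/
    └── tiny.pt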

Running the demo

Now we are ready to go! 🚀 Run the script to launch a web server:

python app.py
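
If everything is installed correctly, Gradio starts and prints a local URL to the console, typically a line like the one below (the port may differ on your machine):

Running on local URL:  http://127.0.0.1:7860

Open that address in a browser to reach the demo UI.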

Set the static image, audio, and pose inputs. For example:

  • image: assets/halfbody_demo/refimag/natural_bk_openhand/0066.png
  • audio: assets/halfbody_demo/audio/chinese/fight.wav
  • pose: assets/halfbody_demo/pose/01

Press the Generate Video (生成视频) button. After 10+ minutes, a video is generated and can be downloaded.

If you found this article helpful, please show your support by clicking the clap icon 👏 and following me 🙏. Thank you for taking the time to read it, and have a wonderful day!

References

  1. EchoMimicV2: https://github.com/antgroup/echomimic_v2
  2. Meng, R., Zhang, X., Li, Y., & Ma, C. (2025). EchoMimicV2: Towards striking, simplified, and semi-body human animation. In Proceedings of the Computer Vision and Pattern Recognition Conference (pp. 5489–5498).