Thanks for continuing this conversation in your usual clear and informative way!
I'm not sure I understand how AI video can be so much more energy intensive than images or text. To a first approximation, isn't the energy consumption proportional to how many GPUs you need to run the model, and for how long? So unless video models need 10x more chips or take 10x as long to generate an output I don't see how they can consume 10x more energy (picked 10 as a random multiplier). AFAIK, the models aren't bigger and the videos don't take that much longer to generate, so where does the energy consumption come from? Is the user-facing latency just really uncorrelated with the actual generation time?
I’m confused too. Might be that they made bad assumptions in the report.
The video estimates are way off (see table below). I do inference with several open-source image and video models on a daily basis and I can report that they do not use that much energy at all, even accounting for the 40%-50% additional energy consumption overhead (from CPUs, DRAM, fans, NICs, PSUs and VRMs) stated in the Microsoft paper Characterizing Power Management Opportunities for LLMs in the Cloud.
What the author writes in the article even deviates from what the official CogVideoX authors state in their repo (550 sec on an H100 for a 5 sec video): https://github.com/THUDM/CogVideo?tab=readme-ov-file#model-introduction
Video generation is currently having its "Stable Diffusion moment". The open-source scene has seen several high-quality models this year, and even the release of optimizations (TeaCache, SageAttention) that cut inference time by up to 50%. More importantly, the hardware you run these models on is crucial for the estimates, as some GPUs will have lower utilization and be more inefficient if paired incorrectly with a model.
The table below summarizes data I've gathered from inference with open-source models (links to the specific reports below). Sora is included, but only based on the estimate made by Factorial Funds (March 2024): 5 minutes of video per H100 per hour, plus their statement on the website that it takes 4x longer to generate 720p than 480p.
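The Sora rows in the table follow directly from that throughput figure. A minimal sketch of the derivation (the 700 W board power for an H100 SXM5 is my assumption, the nameplate TDP rather than measured draw):

```python
# Factorial Funds: one H100 produces ~5 minutes (300 s) of 480p video per hour,
# so a 5-second clip occupies the GPU for (5 / 300) * 3600 = 60 s.
h100_w = 700                        # assumed H100 SXM5 nameplate TDP, in watts
runtime_480p = 5 / 300 * 3600       # 60 s of GPU time per 5-second clip
runtime_720p = 4 * runtime_480p     # 720p stated to take ~4x longer than 480p

wh_480p = runtime_480p * h100_w / 3600
wh_720p = runtime_720p * h100_w / 3600
print(round(wh_480p, 2), round(wh_720p, 2))  # 11.67 46.67
```

These are the GPU-only figures; doubling them for whole-system overhead gives the "System Wh" column.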
All numbers below are for 5 second (81 frame) video clips. Feel free to copy this and paste it into an online markdown viewer (it's not possible to include screenshots in the comments):
| Rank ↑ | Model / Preset | GPU | Res. | Runtime (s) | **GPU Wh / clip** | **System Wh / clip (≈ 2×)** | Wh / frame (GPU) |
| ------ | ---------------------------- | --------- | -------- | ----------: | ----------------: | --------------------------: | ---------------: |
| 1 | **HunyuanVideo 480p (opt)** | H100 SXM5 | 960×544 | 18.5 | **3.6** | **7.2** | 0.045 |
| 2 | HunyuanVideo 480p (opt) | L40 | 960×544 | 43.5 | 3.63 | 7.26 | 0.045 |
| 3 | HunyuanVideo 480p (opt) | A100 SXM | 960×544 | 36.5 | 4.06 | 8.11 | 0.051 |
| 4 | HunyuanVideo 480p (opt) | RTX A5000 | 960×544 | 70 | 4.47 | 8.94 | 0.056 |
| 5 | HunyuanVideo 480p (opt) | RTX 4090 | 960×544 | 38.5 | 4.81 | 9.62 | 0.060 |
| 6 | HunyuanVideo 480p (baseline) | H100 SXM5 | 960×544 | 37 | 7.19 | 14.4 | 0.090 |
| 7 | **Sora 480p** | H100 SXM5 | 854×480 | 60 | 11.67 | 23.34 | 0.146 |
| 8 | Wan 2.1 480p (base) | H100 SXM5 | 480p | 80 | 15.56 | 31.12 | 0.194 |
| 9 | Wan 2.1 480p (opt) | RTX 4090 | 480p | 248 | 31.0 | 62.0 | 0.388 |
| 10 | **Sora 720p** | H100 SXM5 | 1280×720 | 240 | 46.67 | 93.34 | 0.583 |
| 11 | Wan 2.1 720p (base) | H100 SXM5 | 720p | 280 | 54.44 | 108.88 | 0.681 |
| 12 | **CogVideoX 768p** | H100 SXM5 | 1360×768 | 550 | 106.9 | 213.8 | 1.337 |
Not added to the table (because nobody uses it due to low quality and size): Open-Sora 2.0. 1651 seconds on an H100 for a 5 second video - that is 322 Wh of GPU energy, roughly 600 Wh total, and about 2M joules: https://github.com/hpcaitech/Open-Sora?tab=readme-ov-file#computational-efficiency
Conclusion: with totals ranging from roughly 7 Wh (25k joules) to 214 Wh (750k joules), the MIT article certainly doesn't give us a nuanced overview of the energy consumption of video models at all.
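For anyone who wants to sanity-check the table, the arithmetic is just runtime times GPU board power, with a 2x multiplier for whole-system overhead. A minimal sketch, assuming nameplate TDPs (not measured draw) for each card:

```python
# Energy per clip = runtime (s) * GPU power (W) / 3600 -> Wh.
# TDP values are assumed nameplate board power, not measured draw.
TDP_W = {"H100 SXM5": 700, "A100 SXM": 400, "L40": 300,
         "RTX A5000": 230, "RTX 4090": 450}

FRAMES = 81  # 5-second clip

def clip_energy(gpu: str, runtime_s: float) -> dict:
    gpu_wh = runtime_s * TDP_W[gpu] / 3600
    return {
        "gpu_wh": round(gpu_wh, 2),
        "system_wh": round(2 * gpu_wh, 2),      # ~2x for CPU, DRAM, fans, PSU losses
        "wh_per_frame": round(gpu_wh / FRAMES, 3),
    }

# HunyuanVideo 480p (opt) on H100 SXM5, 18.5 s runtime -> ~3.6 Wh GPU, ~7.2 Wh system
print(clip_energy("H100 SXM5", 18.5))
# CogVideoX 768p on H100 SXM5, 550 s runtime -> ~106.9 Wh GPU, ~213.8 Wh system
print(clip_energy("H100 SXM5", 550))
```

Every row in the table reproduces this way from its runtime and the assumed TDP.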
Details on Hunyuan inference: https://www.instasd.com/post/hunyuanvideo-performance-testing-across-gpus
Details on Wan2.1 inference: https://www.instasd.com/post/wan2-1-performance-testing-across-gpus
Factorial Funds Sora estimate: https://factorialfunds.com/blog/under-the-hood-how-openai-s-sora-model-works#anchor-5
I also do not get it.
When you create a five second video at 16 Hz, that means 80 image frames to create. So I would expect such a video to use approximately 80 times more energy than an image of similar resolution and quality. This would lead to 80*6706 J, or roughly 537 kJ per video.
Alternatively, let's consider the power usage of the computer running the AI. Say it's a 7 kW compute node (a CPU + 8 high-end GPUs). When you issue a prompt to such a system and get the answer in 20 seconds, it cannot use more than 20*7000/3600 = 39 Wh, or 140,000 joules. Note that this excludes the energy used to train the model and to transfer the video over the internet.
Both these estimates come to much lower numbers than the 944 Wh / 3.4M joules stated above.
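Both back-of-envelope bounds above are easy to reproduce; a quick sketch, using the per-image figure of 6706 J and the 7 kW node from the estimates (both are the commenter's assumed inputs, not measurements):

```python
# Two independent upper-bound estimates for a 5-second AI video clip.

# (1) Per-frame scaling: 5 s at 16 fps = 80 frames, each costing roughly
#     one image generation (~6706 J per image, the assumed figure).
frames = 5 * 16
per_image_j = 6706
video_j = frames * per_image_j
print(video_j)                   # 536480 J ~= 537 kJ

# (2) Whole-node bound: a 7 kW node (CPU + 8 high-end GPUs) running
#     flat-out for the 20 s it takes to return the video.
node_w, latency_s = 7000, 20
bound_j = node_w * latency_s
print(bound_j, bound_j / 3600)   # 140000 J ~= 39 Wh
```

Either way, the result sits far below the 944 Wh figure being questioned.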
https://snipboard.io/gjE14D.jpg
The graph above should be very eye-opening for any non-vegans who voice their concerns over AI's damage to the environment.
Credit: https://x.com/beeaannZzZ/status/1922280522283094378/photo/1
I am so glad that someone else is doing energy use of AI so I don't have to.