AITemplate, a revolutionary new inference engine by Meta AI
73 points by azurezyq 2 years ago | 35 comments

- haolu7 2 years ago: AITemplate-PyTorch Stable Diffusion is the fastest Stable Diffusion inference solution; it pushes image generation below one second on an A100 for the first time (batch 1: 0.7s / 25 steps, 1.3s / 50 steps; batch 3: 1.6s / 25 steps, 0.55s per image; batch 16: 7.9s / 25 steps, 0.49s per image). That is 2.57x faster than Keras' XLA-based GPU compilation solution.
More benchmark numbers and repro at: https://github.com/facebookincubator/AITemplate/tree/main/ex...
- Llamamoe 2 years ago: Wow. Considering that with the better samplers you can reduce steps to 10-15, this is getting close to near-instant results.
One or two more optimizations and we're gonna have live-update results.
- tveita 2 years ago: This lists "OOM" for PyTorch on an RTX 3080-10GB, but I believe people have optimized the PyTorch SD model to run on even 6 GiB GPUs.
Would AITemplate be able to run with those constraints?
- ipiszy 2 years ago: RTX 3080-10GB should work. You could check https://github.com/facebookincubator/AITemplate/tree/main/ex... and https://www.reddit.com/r/StableDiffusion/comments/xv7m89/met....
- PresentHarmony 2 years ago: Can you please elaborate: how many milliseconds does it take to generate 1 image with these wonderful improvements? I will be very grateful for your answer! Thank you very much!
- PresentHarmony 2 years ago: Or, counting another way: how many pictures will it be able to generate in one second, with these parameters? It could be 1.05, 1.1, or say 1.5 or even 2 pictures. Thank you very much for your post! I will be very grateful for the answer!
- PresentHarmony 2 years ago: Do I get it right that it takes 0.55 seconds or 0.49 seconds to generate an image, depending on the batch size?
Thank you so much for your post! I would be very grateful for the response!
- ipiszy 2 years ago: Yes, this is correct. "batch 16: 7.9s / 25 steps, per image 0.49s" means it generates 16 images for each prompt within 7.9s, so it's about 0.49s per image.
- PresentHarmony 2 years ago: One more question, if you don't mind. One image is generated in 0.7 seconds (25 steps), and the same single image with 50 steps will be generated in 1.3 seconds. So it's relatively cheaper to run more steps for the same prompt. Am I right, or am I missing something? Thanks in advance for your answer.
P.S. Though it should be 1.4 seconds: 0.7 * 2 = 1.4, if twice the steps means twice the time.
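(A quick sanity check of the arithmetic in this subthread, in Python, taking the quoted timings at face value:)

    # Back-of-the-envelope check of the quoted A100 timings (25 steps).
    batch_times = {1: 0.7, 3: 1.6, 16: 7.9}  # total seconds per batch
    for batch, total in batch_times.items():
        print(f"batch {batch}: {total / batch:.2f}s per image")
    # -> 0.70s, 0.53s, 0.49s per image, close to the quoted figures

    # Doubling the steps does not quite double the time:
    print(0.7 * 2)  # naive estimate for 50 steps: 1.4s
    # measured: 1.3s, so some cost appears independent of step count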
- PresentHarmony 2 years ago: Thank you indeed, my friend!
- ghoomketu 2 years ago: For all the hate that Facebook gets, their one redeeming quality is the open source projects they have been releasing all along.
Maybe this is to attract better engineers, but all in all it has been a net positive for software development. So credit where it is due.
- version_five 2 years ago: Yes, it's hard to know whether the net contribution of these advertising companies (FB and Google mainly) is positive on balance, but their contribution to ML research is unmatched and has created an insane amount of value in business and research built on the tools they've made (I'd speculate rivaling their market caps, but someone can probably prove me wrong).
- ETH_start 2 years ago: The net impact of these companies is massively positive. Facebook, with its trust-engendering social graph, enables huge numbers of businesses and social groups to exist that otherwise couldn't, while Google has enabled so much information discovery that we now take for granted.
Of course I would argue there's a better way to provide these kinds of services that concentrates power less, and that's decentralization with cryptoeconomic incentives to maintain consensus, but for their generation, they did well.
- yinghai83 2 years ago: Very impressive results!
- ipiszy 2 years ago: tl;dr:
Meta is open sourcing AITemplate, an inference engine for both Nvidia and AMD GPUs. Code: https://github.com/facebookincubator/AITemplate.
AITemplate delivers much better perf (1.9x ~ 12.8x) compared to PyTorch eager on SOTA models, including BERT, ResNet, ViT, and Stable Diffusion.
AITemplate also delivers high perf numbers on AMD GPUs (MI-250). With AITemplate, an MI-250 achieves 80% ~ 96% of A100 perf on various ResNet / BERT / ViT models.
AITemplate uses sophisticated fusion techniques to optimize perf, including vertical, horizontal, and memory fusions.
btw, I'm one of the authors of AITemplate, happy to answer any questions.
- Narew 2 years ago: How does AITemplate's performance compare to state-of-the-art inference engines like TVM or ONNX Runtime? Does AITemplate optimize/quantize the network?
Edit: link for TVM: https://tvm.apache.org/
- ipiszy 2 years ago: AITemplate only supports fp16 data types with fp16 or fp32 accumulation right now. We are working on supporting more data types and quantization.
We don't have an official comparison between AITemplate and TVM / ONNX Runtime for now, but we do have perf numbers like https://github.com/facebookincubator/AITemplate/tree/main/ex... and https://github.com/facebookincubator/AITemplate/tree/main/ex.... Feel free to run these examples on other frameworks and compare perf.
- davidatbu 2 years ago: I'd love to hear about this too, especially after running the model through an ONNX optimizer, like this one [0].
- throwaway81523 2 years ago: Thanks, that is very helpful. Do you have to train the model differently for use with AITemplate? Could it be helpful for Leela Chess Zero (LC0)? I think LC0 has a generic PyTorch backend that is several times slower than its Nvidia-specific CUDA backend. I'm not very clueful about this stuff though.
- haolu7 2 years ago: No, you don't need to train the model differently to use it with AITemplate. Here is an intro example of doing inference with AITemplate on a very simple PyTorch model: https://facebookincubator.github.io/AITemplate/tutorial/how_.... For more advanced examples, check out https://github.com/facebookincubator/AITemplate/tree/main/ex...
- ipiszy 2 years ago: As @haolu7 mentioned, you can take a pre-trained model and use AITemplate to do model inference. All you need to do is re-write the model using the AITemplate frontend and map the PyTorch params to AITemplate params. Note that AITemplate has limited operator coverage compared to mature frameworks like PyTorch, so you may need to implement your own kernels if necessary (though it already supports BERT, ViT, Stable Diffusion, ResNet, Detectron, and general recommendation models).
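(For the curious, a rough sketch of what "re-write the model using the AITemplate frontend" looks like, paraphrased from the linked tutorial; the exact API may differ between releases, and the weight-mapping step is only hinted at in a comment:)

    import torch
    from aitemplate.compiler import compile_model
    from aitemplate.frontend import nn, Tensor
    from aitemplate.testing import detect_target

    # Re-declare the pre-trained model in AITemplate's PyTorch-like frontend.
    class AITLinear(nn.Module):
        def __init__(self, hidden):
            super().__init__()
            self.dense = nn.Linear(hidden, hidden)

        def forward(self, x):
            return self.dense(x)

    batch, hidden = 1, 512
    X = Tensor(shape=[batch, hidden], dtype="float16", name="X", is_input=True)
    Y = AITLinear(hidden)(X)
    Y._attrs["name"] = "Y"
    Y._attrs["is_output"] = True

    # Codegen + compile a fused GPU binary for the detected target.
    module = compile_model(Y, detect_target(), "./tmp", "ait_linear")

    # After mapping the PyTorch params onto the module's constants
    # (see the tutorial), run inference on half-precision CUDA tensors.
    x = torch.randn(batch, hidden, dtype=torch.float16, device="cuda")
    y = torch.empty(batch, hidden, dtype=torch.float16, device="cuda")
    module.run_with_tensors({"X": x}, {"Y": y})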
- fooblaster 2 years ago: How does the performance compare with TensorRT? I didn't see any benchmarks comparing against that. I expect it to be lower for now, but I'm excited to see what the future brings.
- upbeat_general 2 years ago: Do you know of any good explanations of the techniques you used, for those of us who only touch PyTorch eager and, occasionally, TorchScript?
- ipiszy 2 years ago: You could check the "AITemplate optimizations" section in the blog (https://ai.facebook.com/blog/gpu-inference-engine-nvidia-amd...) and https://github.com/facebookincubator/AITemplate#more-about-a.... The basic idea is to do aggressive kernel fusions.
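(A toy illustration of what vertical fusion buys, in plain PyTorch rather than AITemplate code: the win comes from intermediates never round-tripping through GPU memory:)

    import torch

    x = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    w = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    b = torch.randn(1024, device="cuda", dtype=torch.float16)

    # Eager / unfused: three kernel launches; two intermediate tensors
    # are written to and re-read from GPU memory.
    y = torch.matmul(x, w)
    y = y + b
    y = torch.relu(y)

    # Partially fused: addmm lets the bias add ride along with the GEMM.
    # A codegen compiler goes further and folds the ReLU into the same
    # kernel's epilogue, so no intermediate ever touches memory.
    y_fused = torch.relu(torch.addmm(b, x, w))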
- papersnake 2 years ago: Have you tested this on big models involving multi-GPU communication, or do you have any plans to?
- ipiszy 2 years ago: For now it's for single-GPU inference only.
- pretty_dumm_guy 2 years ago: How do you verify the correctness of your fusion operations?
- ipiszy 2 years ago: We have a bunch of unit tests and E2E tests that compare numerical outputs between AITemplate and PyTorch eager.
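(Roughly the pattern sketched below; the check_close helper and the "X"/"Y" tensor names are made up for illustration, not AITemplate's actual test harness:)

    import torch

    def check_close(ait_module, pt_model, x, atol=1e-2, rtol=1e-2):
        # Reference output from PyTorch eager.
        with torch.no_grad():
            ref = pt_model(x)
        # Output from the compiled AITemplate module.
        out = torch.empty_like(ref)
        ait_module.run_with_tensors({"X": x}, {"Y": out})
        # Everything is fp16, so tolerances are looser than fp32 defaults.
        assert torch.allclose(out, ref, atol=atol, rtol=rtol)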
- house_road 2 years ago: It supports both Nvidia and AMD, and both get a pretty good speedup. This is a great achievement!
- enoch2090 2 years ago: How would this perform compared with TensorFlow?
- devcat 2 years ago: Sadly, it doesn't have an Apple GPU backend.
- mbroncano 2 years ago: It mentions it is in the works.
- throwaway81523 2 years ago: Tldr?
- theflyingelvis 2 years ago: Unfortunately your comment was too long. I didn't read it. Try being more succinct next time.