Discover how to build smarter, more efficient AI inference systems. Learn about quantization, sparsity, and high-performance runtimes like vLLM with Red Hat AI.
This e-book introduces the fundamentals of inference performance engineering and model optimization. It focuses on quantization, sparsity, and other techniques that reduce compute and memory requirements, as well as runtime systems like virtual large language model (vLLM) that make inference more efficient.
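As a taste of what the e-book covers, here is a minimal sketch of serving a quantized model with vLLM's offline Python API. The model name is an example of a pre-quantized AWQ checkpoint, not a recommendation, and the sampling settings are illustrative assumptions:

```python
# Minimal sketch: offline inference with vLLM using a quantized model.
# Assumes vLLM is installed (pip install vllm) and a GPU is available.
from vllm import LLM, SamplingParams

# Load an AWQ-quantized checkpoint; quantization reduces memory and
# compute requirements compared with full-precision weights.
# The model ID below is a hypothetical example of an AWQ checkpoint.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")

# Illustrative sampling parameters.
params = SamplingParams(temperature=0.7, max_tokens=64)

# Generate completions; vLLM batches and schedules requests efficiently.
outputs = llm.generate(["Explain quantization in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```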