Serving Quantized LLMs on NVIDIA H100 Tensor Core GPUs
January 30, 2024 · by Nikhil Sardana, Julian Quevedo, and Daya Khudia · Mosaic Research
Quantization is a technique for making machine learning models smaller and faster. We quantize Llama2-70B-Chat, producing an equivalent-quality model that generates 2.2x more...
LLM Inference Performance Engineering: Best Practices
October 12, 2023 · by Megha Agarwal, Asfandyar Qureshi, Nikhil Sardana, Linden Li, Julian Quevedo, and Daya Khudia · Mosaic Research
In this blog post, the MosaicML engineering team shares best practices for capitalizing on popular open source large language models (LLMs)...