Introduction After a whirlwind year of developments in 2023, many enterprises are eager to adopt increasingly capable generative AI models to supercharge their...
Quantization is a technique for making machine learning models smaller and faster. We quantize Llama2-70B-Chat, producing an equivalent-quality model that generates 2.2x more...
Over the past six months, we've been working with NVIDIA to get the most out of their new TensorRT-LLM library. TensorRT-LLM provides an easy-to-use Python interface to integrate with a web server for fast, efficient inference performance with LLMs. In this post, we're highlighting some key areas where our collaboration with NVIDIA has been particularly important.
Llama2-70B-Chat is a leading AI model for text completion, comparable with ChatGPT in terms of quality. Today, organizations can leverage this state-of-the-art model...
Benchmarking Large Language Models on NVIDIA H100 GPUs with CoreWeave The research and engineering teams here at MosaicML collaborated with CoreWeave, one of...
MosaicML’s Open Division submission to the MLPerf Image Classification benchmark delivers a score of 23.8 minutes (4.5x speed-up relative to our baseline) on...