Databricks is proud to be a platinum sponsor of NeurIPS 2024. The conference runs from December 10 to 15 in Vancouver, British Columbia.
Visit our Booth
Stop by booth #591 in the Expo Hall from December 10-12 to meet members of the team and learn about our latest work.
Demo
- Observability and Automated Evaluation with DSPy, MLflow, and Mosaic AI Agent Framework
Join us for a demonstration of how MLflow Tracing and the Mosaic AI Agent Framework provide observability and automated evaluation while we iteratively improve the factuality and accuracy of a GenAI application with DSPy. MLflow Tracing captures detailed information about LLM and agent inputs and outputs, helping developers pinpoint the source of bugs and unexpected behaviors. The Mosaic AI Agent Framework, part of the Databricks Data Intelligence Platform, provides capabilities for improving the quality of GenAI applications through human feedback and automated evaluation.
Presentations and accepted publications
Talks
- Table Representation Learning Workshop, Saturday, December 14, 1:30-2:05 PM, Matei Zaharia
The Table Representation Learning (TRL) workshop is the premier venue for research into tabular data as a modality for representation learning and generative models. At this year’s workshop, Matei Zaharia is the featured speaker for the session focused on natural language interfaces to tables.
Workshop Accepted Papers
In this work, we compare the effectiveness of sparse upcycling against continued pretraining (CPT) across different model sizes, compute budgets, and pretraining durations. Our experiments show that sparse upcycling can achieve better quality, with improvements of over 20% relative to CPT in certain scenarios. However, this comes with a significant inference cost, leading to 40% slowdowns in high-demand inference settings for larger models. Our findings highlight the trade-off between model quality and inference efficiency, offering insights for practitioners who need to weigh quality gains against deployment constraints.
This paper presents a comprehensive study of the impact of increased context length on RAG performance across 20 popular open source and commercial LLMs. We run RAG workflows while varying the total context length from 2,000 to 128,000 tokens (and 2 million tokens when possible) on three domain-specific datasets, and report key insights on the benefits and limitations of long context in RAG applications. Our findings reveal that while retrieving more documents can improve performance, only a handful of the most recent state-of-the-art LLMs can maintain consistent accuracy at context lengths above 64k tokens. We also identify distinct failure modes in long context scenarios, suggesting areas for future research.
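To make the idea of varying the total context length concrete, here is a minimal sketch of how retrieved chunks might be packed into a fixed token budget. It is not the paper's evaluation harness: `pack_context` is a hypothetical helper, and the default whitespace-based token counter is an assumption standing in for a real model tokenizer.

```python
from typing import Callable, List, Sequence


def pack_context(
    chunks: Sequence[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> List[str]:
    """Keep retrieved chunks, in ranked order, until the token budget is reached.

    `count_tokens` defaults to a whitespace split, which is only a rough
    stand-in (an assumption) for a real tokenizer.
    """
    selected: List[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # the next chunk would overflow the budget
        selected.append(chunk)
        used += cost
    return selected


# Hypothetical sweep: the same ranked chunks packed under growing budgets.
ranked_chunks = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
contexts_by_budget = {
    budget: pack_context(ranked_chunks, budget) for budget in (2, 4, 5, 9)
}
```

Sweeping the budget this way shows how a larger context window admits more retrieved documents, which is the variable the study changes.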
In this work, we explore the use of MixAttention, a model architecture modification that combines sliding window attention, where only a small subset of recent tokens is stored in the KV cache, with KV cache sharing across layers. Our experiments demonstrate that MixAttention significantly reduces memory usage and improves inference speed without sacrificing model performance in both short and long-context tasks. We also explore various configurations of this architecture, identifying those that maintain quality across evaluation metrics while optimizing resource efficiency.
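As a rough illustration of why these two ideas reduce memory, the sketch below estimates KV cache size for a single sequence. It is a back-of-the-envelope calculation, not the paper's method: `kv_cache_bytes` is a hypothetical helper, it applies the sliding window and the sharing uniformly to every layer (a simplification), and the model dimensions are illustrative assumptions rather than values from the paper.

```python
from typing import Optional


def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    window: Optional[int] = None,
    share_group: int = 1,
    bytes_per_value: int = 2,
) -> int:
    """Estimate the KV cache size in bytes for one sequence.

    window: if set, each cache holds at most `window` recent tokens.
    share_group: number of consecutive layers that share one KV cache.
    bytes_per_value: 2 corresponds to 16-bit values.
    """
    cached_tokens = seq_len if window is None else min(seq_len, window)
    # Layers in the same sharing group store a single cache between them.
    distinct_caches = -(-num_layers // share_group)  # ceiling division
    # The factor of 2 accounts for separate key and value tensors.
    return (
        2 * distinct_caches * cached_tokens
        * num_kv_heads * head_dim * bytes_per_value
    )


# Illustrative (assumed) configuration: 32 layers, 8 KV heads of dimension 128.
full_cache = kv_cache_bytes(32, 8, 128, seq_len=32_768)
reduced_cache = kv_cache_bytes(
    32, 8, 128, seq_len=32_768, window=1_024, share_group=4
)
```

Under these assumed numbers, the window cuts the cached tokens by a factor of 32 and the sharing cuts the number of distinct caches by a factor of 4, so the two savings multiply.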
We introduce Critique-out-Loud (CLoud) RLHF reward models that reason explicitly about the quality of a response from an LLM assistant. CLoud reward models operate by first generating a natural language critique of the assistant’s response that is then used to predict a scalar reward for the quality of the response. We demonstrate the success of CLoud reward models for both Llama-3-8B and 70B base models: compared to classic reward models, CLoud reward models improve pairwise preference classification accuracy on RewardBench by 4.65 and 5.84 percentage points for the 8B and 70B base models respectively. Furthermore, CLoud reward models lead to a Pareto improvement for win rate on ArenaHard when used as the scoring model for Best-of-N. Finally, we explore how to exploit the dynamic inference compute capabilities of CLoud reward models by performing self-consistency decoding for reward prediction.
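The critique-then-score flow described above can be sketched as follows. This is a minimal outline under stated assumptions, not the authors' implementation: `generate_critique` and `score_with_critique` are hypothetical callables standing in for the model's critique generation and reward head, and averaging the rewards across sampled critiques is one plausible reading of self-consistency for reward prediction.

```python
from statistics import mean
from typing import Callable, Sequence

CritiqueFn = Callable[[str, str], str]        # (prompt, response) -> critique
ScoreFn = Callable[[str, str, str], float]    # (prompt, response, critique) -> reward


def self_consistent_reward(
    prompt: str,
    response: str,
    generate_critique: CritiqueFn,
    score_with_critique: ScoreFn,
    num_samples: int = 4,
) -> float:
    """Sample several critiques, score the response with each, and average.

    Averaging is an assumption about how the sampled rewards are combined.
    """
    rewards = []
    for _ in range(num_samples):
        critique = generate_critique(prompt, response)
        rewards.append(score_with_critique(prompt, response, critique))
    return mean(rewards)


def best_of_n(
    prompt: str,
    candidates: Sequence[str],
    reward_fn: Callable[[str, str], float],
) -> str:
    """Return the candidate response with the highest reward."""
    return max(candidates, key=lambda response: reward_fn(prompt, response))
```

Spending more inference compute here simply means raising `num_samples`, which trades extra critique generations for a less noisy reward estimate.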
Join our Team
Are you interested in working with us? We’re hiring! Check out our open jobs and join our growing research team.