The observation that "software is eating the world" has shaped the modern tech industry. Today, software is ubiquitous in our lives, from the watches we wear, to our houses, cars, factories and farms. At Databricks, we believe that soon, AI will eat all software. That is, the software built over the past decades will become intelligent by leveraging data, making it much smarter. The implications are vast and varied, impacting everything from customer support to healthcare and education.
In this blog, we give our view on how AI will change data platforms. We argue that the impact of AI on data platforms will not be incremental, but fundamental: massively democratizing access to data, automating manual administration, and enabling turnkey creation of custom AI applications. All this will be enabled by a new wave of unified platforms that deeply understand an organization's data. We call this new generation of systems Data Intelligence Platforms.
Data Platforms So Far and Their Challenges
Data warehouses emerged in the 1980s as a solution for organizing structured business data in enterprises. However, by 2010, organizations began accumulating a significant amount of unstructured data to support more varied use cases, such as AI. To address this, data lakes were introduced as an open, scalable system for any type of data. By 2015, it became common for most organizations to operate both data warehouses and data lakes. This dual-platform approach, however, presented significant challenges in governance, security, reliability and management.
Five years ago, Databricks pioneered the concept of the lakehouse to combine and unify the best of both worlds. Lakehouses store and govern all your data in open formats, and natively support workloads ranging from BI to AI. For the first time, lakehouses offered a unified system to (1) query all data sources in an organization together and (2) govern all the workloads that use data (BI, AI, etc.) in a unified way. The lakehouse became its own category of data platform and is now widely adopted by enterprises and incorporated into most vendors' stacks.
Despite the progress, all current data platforms in the market still face several major challenges:
- Technical Skill Barrier: Querying data requires specialized skills in SQL, Python or BI, creating a steep learning curve
- Data Accuracy and Curation: In large organizations, finding the right and accurate data is a challenge, requiring extensive curation and planning
- Management Complexity: Data platform costs can skyrocket, and performance can degrade, unless the platform is managed by highly technical personnel
- Governance and Privacy: Governance requirements across the world are rapidly evolving, and with the advent of AI, concerns around lineage, security and privacy are amplified
- Emerging AI Applications: To enable generative AI applications that answer domain-specific requests, organizations must develop and tune LLMs on platforms separate from their data, then connect them to that data through manual engineering
Many of these issues arise because data platforms do not fundamentally understand the data in organizations and how it is used. Fortunately, generative AI presents a powerful new tool to address exactly these challenges.
The Core Idea Behind Data Intelligence Platforms
Data Intelligence Platforms revolutionize data management by employing AI models to deeply understand the semantics of enterprise data; we call this data intelligence. They build on the foundation of the lakehouse – a unified system to query and manage all data across the enterprise – but automatically analyze both the data (contents and metadata) and how it is used (queries, reports, lineage, etc.) to add new capabilities. Through this deep understanding of data, Data Intelligence Platforms enable:
- Natural Language Access: Leveraging AI models, DI Platforms enable working with data in natural language, tailored to each organization's jargon and acronyms. The platform observes how data is used in existing workloads to learn the organization's terms and offers a tailored natural language interface to all users – from nonexperts to data engineers.
- Semantic Cataloguing and Discovery: Generative AI can understand each organization's data model, metrics and KPIs to offer unparalleled discovery features or automatically identify discrepancies in how data is being used.
- Automated Management and Optimization: AI models can optimize data layout, partitioning and indexing based on data usage, reducing the need for manual tuning and knob configuration.
- Enhanced Governance and Privacy: DI Platforms can automatically detect, classify and prevent misuse of sensitive data, while simplifying management using natural language.
- First-Class Support for AI Workloads: DI Platforms can enhance any enterprise AI application by allowing it to connect to the relevant business data and leverage the semantics learned by the DI Platform (metrics, KPIs, etc.) to deliver accurate results. AI application developers no longer have to "hack" intelligence together through brittle prompt engineering.
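To make the natural language access idea above concrete, here is a minimal sketch of one way a platform might ground a user's question in catalog metadata and organization-specific terminology before handing it to an LLM for SQL generation. All table names, columns and jargon mappings here are hypothetical, and a real system would learn them from observed workloads rather than hard-code them:

```python
# Illustrative sketch only: grounding a natural-language question in
# catalog metadata so an LLM can translate it into accurate SQL.
# The schema and jargon below are hypothetical examples.

SCHEMA = {
    "sales.orders": ["order_id", "customer_id", "order_ts", "gmv_usd"],
    "sales.customers": ["customer_id", "region", "segment"],
}

# Organization-specific terms, assumed to be learned from past queries.
SYNONYMS = {"GMV": "gmv_usd (gross merchandise value)"}

def build_prompt(question: str) -> str:
    """Assemble an LLM prompt pairing the user's question with schema
    context and the organization's learned terminology."""
    schema_lines = "\n".join(
        f"- {table}({', '.join(cols)})" for table, cols in SCHEMA.items()
    )
    jargon_lines = "\n".join(f"- {term}: {meaning}"
                             for term, meaning in SYNONYMS.items())
    return (
        "Translate the question into SQL over these tables:\n"
        f"{schema_lines}\n"
        f"Organization terminology:\n{jargon_lines}\n"
        f"Question: {question}\nSQL:"
    )

print(build_prompt("What was GMV by region last quarter?"))
```

The design point is that the prompt carries context the BI layer alone never sees: the full catalog and the organization's jargon, which is why a platform with visibility into all workloads can answer questions a standalone tool cannot.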
Some might wonder how this differs from the natural language Q&A capabilities that BI tools have added over the last few years. BI tools represent only one narrow (although important) slice of overall data workloads, and as a result have no visibility into the vast majority of workloads, or into the data's lineage and uses before it reaches the BI layer. Without that visibility, they cannot develop the deep semantic understanding necessary. As a result, these natural language Q&A capabilities have yet to see widespread adoption. With Data Intelligence Platforms, BI tools will be able to leverage the underlying AI models for much richer functionality. We therefore believe this core functionality will reside in data platforms.