Introduction to data streaming
Over the last several years, the need for real-time data has grown exponentially. Organizations are increasingly building applications and platforms that leverage data streams to deliver real-time analytics and machine learning to drive business growth. By continuously collecting, processing and analyzing data, leaders can gain immediate insights, enable faster decision-making and make more accurate predictions.
Companies may leverage real-time data streaming to track things like business transactions in operational systems and potential fraud, as well as inform dynamic pricing models. Meanwhile, the proliferation of the Internet of Things (IoT) means that everyday devices and sensors transmit enormous quantities of raw data, and immediate access to those datasets can help troubleshoot potential issues or make location-specific recommendations.
In short, real-time data has the potential to transform an organization, whether by creating new and innovative opportunities or by providing continuous insight into the datasets it already collects.
Streaming vs. batch processing
To handle their data, organizations have traditionally relied on batch processing, which refers to the collection and processing of data in large chunks, or “batches,” at specified intervals. Today, companies may leverage batch processing when they require timely, but not real-time, data. This includes applications such as sales forecasting, inventory management, data ingestion from mainframes and even consumer survey processing.
However, to compete in today’s global business environment, organizations increasingly need access to data as it’s collected. Streaming data helps organizations make timely decisions by ensuring data is processed quickly, accurately and in near real time. By processing data within seconds or milliseconds, streaming is an ideal solution for use cases such as high-frequency trading, real-time bidding, log processing, real-time analytics or fraud detection.
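To make the contrast concrete, here is a minimal PySpark sketch (the table and column names are hypothetical): the same aggregation can be run as a one-off batch job or kept continuously up to date as a stream.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Batch: read the whole table on a schedule and process it as one job.
daily_revenue = (
    spark.read.table("sales.orders")          # hypothetical table
    .groupBy("region")
    .agg(F.sum("amount").alias("revenue"))
)

# Streaming: the same aggregation, recomputed incrementally as new orders arrive.
running_revenue = (
    spark.readStream.table("sales.orders")
    .groupBy("region")
    .agg(F.sum("amount").alias("revenue"))
)
```

The streaming version stays current without rescanning the full table, but it still needs a sink, an output mode and a checkpoint before it can run, which hints at the operational overhead discussed next.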
While organizations may recognize their need for streaming data, it can be difficult to transition from batch to streaming data because of:
- New APIs and languages to learn. Existing data teams may be forced to pick up unfamiliar languages and tools rather than working with the ones they already know.
- Complex operational tooling to build. Organizations may find it hard to deploy and maintain streaming data pipelines that run reliably in their production environment.
- Real-time and historical data in separate systems. Keeping the two apart can result in incompatible governance models that limit the ability to control access for the right users and groups.
Databricks is helping customers move beyond the traditional bifurcation of batch versus streaming data with the Data Intelligence Platform. By integrating real-time analytics, machine learning (ML) and applications on one platform, organizations benefit from simplified data processing on a single platform that handles both batch and streaming data.
With the Databricks Data Intelligence Platform, users can:
- Build streaming pipelines and applications faster. Customers can use the languages and tools they already know with unified batch and streaming APIs in SQL and Python. They can unlock real-time analytics, ML and applications for the entire organization.
- Simplify operations with automated tooling. Easily deploy and manage your real-time pipelines and applications in production. Automated tooling simplifies task orchestration, fault tolerance/recovery, automatic checkpointing, performance optimization and autoscaling.
- Unify governance for all of your real-time data across clouds. Unity Catalog delivers one consistent governance model for all your streaming and batch data, simplifying how you discover, access and share real-time data.
Streaming vs. real-time processing
Streaming and real-time processing are closely related concepts, and they are often used interchangeably. However, they do have subtle but important distinctions.
“Streaming data” refers to the continuous flow of records generated by data in motion. It is a data pipeline approach in which data is processed in small chunks or events as it is generated. “Real-time processing,” on the other hand, emphasizes the immediacy of analysis and response, aiming to deliver insights with minimal delay after data is received. In other words, a streaming data system ingests real-time data and processes it as it arrives.
It is important to note that, even within the scope of “real-time streaming,” there is a further distinction between “real time” and “near real time,” primarily with respect to latency. Real-time data refers to systems that analyze and act on data with negligible delays, usually within milliseconds of data generation. These systems are designed for scenarios where immediate action is critical, such as automated stock trading, medical monitoring systems or fraud detection in financial transactions.
Near real-time processing, on the other hand, involves a slight delay, usually measured in seconds. This approach is suitable for situations where an instantaneous response is not necessary, but timely updates are still preferred, such as social media feed updates, logistics tracking and aggregating data for operational dashboards.
Incremental processing in data pipelines
While stream processing can be the right choice for some organizations, it can be costly and resource-intensive to run. One way to gain the benefit of data streaming without continuous data processing is via incrementalization. This method processes only newly added, modified or changed data rather than a complete dataset.
One example of how incrementalization can be run is via materialized views in Databricks. A materialized view is a database object that stores the results of a query as a physical table. Unlike regular database views, which are virtual and derive their data from the underlying tables, materialized views contain precomputed data that is incrementally updated on a schedule or on demand. This precomputation of data allows for faster query response times and improved performance in certain scenarios.
Materialized views are most useful when workloads only need to process smaller, recently changed sets of data rather than entire datasets. Overall, incrementalization of data within a pipeline can boost efficiency by reducing computational effort, time and resource consumption. This is especially valuable for large-scale pipelines, where processing only the updates can lead to faster analysis and decision-making.
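As a rough sketch of what this looks like in practice, a materialized view can be declared with the DLT Python API roughly as follows; the table and column names are hypothetical, and the `spark` session is assumed to be provided by the pipeline runtime.

```python
import dlt
from pyspark.sql import functions as F

# Precompute revenue per region; the engine refreshes the result incrementally
# where it can, rather than recomputing the full aggregation on every run.
@dlt.table(comment="Revenue per region, refreshed on a schedule or on demand")
def regional_revenue():
    return (
        spark.read.table("sales.orders")      # hypothetical source table
        .groupBy("region")
        .agg(F.sum("amount").alias("total_revenue"))
    )
```

Because the query is a batch read, the result is materialized as a precomputed table, and downstream queries hit that table instead of re-aggregating the raw orders.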
Considerations and trade-offs in streaming
As organizations implement real-time data streams, there are some important factors to consider within the data processing architecture. How you design your system can introduce some important trade-offs and depends on your organization’s workload demands and business outcomes. Some features to consider include:
Latency: This refers to the time it takes for data to be processed and delivered from the moment it is received. Low-latency data is critical for real-time applications such as fraud detection or even live video streaming, but it can be costly to maintain.
Opting for higher latency in your data may be ideal for workflows that require only periodic reporting or where immediate processing and decision-making are not critical. Systems that store log data or generate daily or weekly sales reports usually leverage higher-latency data streams.
Throughput: This is a measure of the volume of data a system can process over time, usually expressed as events per second. High throughput is crucial for IoT workloads, which must handle massive data flows efficiently. But pushing throughput higher typically involves some compromise on latency.
Cost: For many organizations, cost is the driving factor in determining the right level of latency and throughput for their systems. For some workloads that require timely data processing, it may be worth the investment to design a low-latency, high-throughput system. However, if your data needs are not immediate, or your workloads require larger batches of data, then a higher-latency system may be the right choice.
Not all streaming architectures are created equal, and it is important to find the right balance to meet the demands of your workload as well as your budget. Think of it as accessing your data at the right time — when you need it — instead of in real time.
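One concrete knob for this balance, in Apache Spark Structured Streaming (covered in the next section), is the trigger setting on a streaming query. The sketch below is illustrative only: `stream_df` stands in for any streaming DataFrame, and the table names and checkpoint paths are placeholders.

```python
# Highest latency, lowest cost: process whatever has accumulated, then shut down.
# Suited to scheduled jobs such as daily or weekly reporting.
(stream_df.writeStream
    .trigger(availableNow=True)
    .option("checkpointLocation", "/tmp/checkpoints/daily_rollup")
    .toTable("reports.daily_rollup"))

# Middle ground: a new micro-batch every minute keeps dashboards fresh
# without keeping compute busy around the clock.
(stream_df.writeStream
    .trigger(processingTime="1 minute")
    .option("checkpointLocation", "/tmp/checkpoints/ops_metrics")
    .toTable("ops.metrics"))
```

Shorter trigger intervals reduce latency but keep compute running longer, which is where the cost trade-off shows up.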
Spark Streaming architecture
Apache Spark™ Structured Streaming is the core technology that unlocks data streaming on the Databricks Data Intelligence Platform, providing a unified API for batch and stream processing. Spark is an open source project, and Structured Streaming divides continuous data streams into small, manageable batches for processing. Structured Streaming allows you to take the same operations that you perform in batch mode using Spark’s structured APIs and run them in a streaming fashion. This can reduce latency and allow for incremental processing, with latencies as low as 250ms.
In Structured Streaming, data is treated as an infinite table and processed incrementally. Spark collects incoming data over a short time interval, forms a batch and then processes it like traditional batch jobs. This approach combines the simplicity of batch processing with near real-time capabilities, and features checkpoints that enable fault tolerance and failure recovery.
Spark’s approach to the data pipeline is designed to use resources efficiently. The pipeline begins with the ingestion of raw data, which is then filtered, aggregated or mapped on its way to the data sink. Each stage processes data incrementally as it moves through the pipeline, so anomalies or errors can be caught before the data is stored in its destination.
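A hedged end-to-end sketch of that flow in PySpark might look like the following; the Kafka broker, topic, schema, checkpoint path and table names are all assumptions for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Ingest: read raw events incrementally from a (placeholder) Kafka topic.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "payments")
    .load()
)

# Filter and map: parse the payload and drop malformed records.
events = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(F.from_json("json", "amount DOUBLE, country STRING, ts TIMESTAMP").alias("e"))
    .select("e.*")
    .where(F.col("amount").isNotNull())
)

# Aggregate: incremental event-time windows, bounded by a watermark.
per_country = (
    events.withWatermark("ts", "10 minutes")
    .groupBy(F.window("ts", "1 minute"), "country")
    .agg(F.sum("amount").alias("total"))
)

# Sink: the checkpoint is what enables fault tolerance and recovery.
query = (
    per_country.writeStream
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/per_country")
    .toTable("finance.per_country_totals")
)
```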
For workloads that demand high responsiveness, Spark features a Continuous Processing mode that offers real-time capabilities by processing each record individually as it arrives. You can learn more about managing streaming data in the Databricks documentation.
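For completeness, a continuous query built on the parsed `events` stream from the sketch above might look like this; note that continuous mode supports only simple map- and filter-style operations and a limited set of sources and sinks, and the broker and topic names remain placeholders.

```python
from pyspark.sql import functions as F

# `events` is the parsed payments stream from the previous sketch.
# Flag unusually large payments record by record rather than in micro-batches.
query = (
    events.where("amount > 10000")
    .select(F.to_json(F.struct("amount", "country", "ts")).alias("value"))
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "large_payments")
    .option("checkpointLocation", "/tmp/checkpoints/large_payments")
    .trigger(continuous="1 second")   # checkpoint interval, not a batch interval
    .start()
)
```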
Streaming ETL
Streaming ETL (extract, transform, load) helps organizations process and analyze data in real time or near real time to meet the demands of data-driven applications and workflows. ETL has usually been run in batches; however, streaming ETL ingests data as it is generated to ensure data is ready for analysis almost immediately.
Streaming ETL minimizes latency by processing data incrementally, allowing for continuous updates rather than waiting for a full batch to accumulate. It also reduces the risks associated with data that is out of date or irrelevant, ensuring decisions are based on the latest available information.
Any ETL tool must also be able to scale as the business grows. Databricks launched DLT (Delta Live Tables) as the first ETL framework to use a simple declarative approach to building reliable data pipelines. Your teams can use languages and tools they already know, such as SQL and Python, to build and run your batch and streaming data pipelines in one place with controllable and automated refresh settings. This not only saves time but also reduces operational complexity. No matter where you plan to send your data, building streaming data pipelines on the Databricks Data Intelligence Platform ensures you don’t lose time between raw and cleaned data.
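As a hedged sketch of what a declarative streaming pipeline can look like with the DLT Python API, the example below ingests files incrementally with Auto Loader and cleans them in a second step; the landing path, column names and table names are hypothetical, and `spark` is assumed to be provided by the pipeline runtime.

```python
import dlt
from pyspark.sql import functions as F

# Extract: pick up new files incrementally from a (placeholder) landing path.
@dlt.table(comment="Raw click events as they land in cloud storage")
def clicks_raw():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/demo/web/clicks/")
    )

# Transform and load: clean the stream so it is analysis-ready on arrival.
@dlt.table(comment="Cleaned click events")
def clicks_clean():
    return (
        dlt.read_stream("clicks_raw")
        .where(F.col("user_id").isNotNull())
        .withColumn("event_date", F.to_date("event_ts"))
    )
```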
Streaming analytics
As we’ve seen, data streaming offers continuous processing of data at low latency and the ability to transmit real-time analytics as events occur. Access to real-time (or near real-time) raw data can be critical for business operations, as it gives decision-makers access to the latest and most relevant data. Some of the advantages of streaming analytics include:
Data visualization. Keeping an eye on the most important company information can help organizations manage their key performance indicators (KPIs) on a daily basis. Streaming data can be monitored in real time, allowing companies to know what is happening at any given moment.
Business insights. Real-time dashboards can help alert you when an out-of-the-ordinary business event occurs. For instance, they may be used to automate detection and response to a business threat or flag an area where abnormal behavior should be investigated.
Increased competitiveness. Businesses looking to gain a competitive advantage can use streaming data to quickly discern trends and set benchmarks. This can give them an edge over competitors relying on batch analysis.
Cutting preventable losses. With the help of streaming analytics, organizations can prevent or reduce the damage of incidents like security breaches, manufacturing issues or customer churn.
Analyzing routine business operations. Streaming analytics helps organizations ingest and obtain instant, actionable insights from data in real time. When leaders have access to relevant, timely and trusted data, they can be sure they are making sound decisions.
Streaming for AI/ML
As artificial intelligence (AI) and ML models develop and mature, traditional batch processing can struggle to keep pace with the size and diversity of data these applications require. Delays in data transmission can lead to inaccurate responses and an uptick in application inefficiency.
Streaming data provides a continuous flow of real-time information based on the most current available data, ensuring AI/ML models adapt and make predictions as events happen. There are two ways streaming data helps prepare AI models:
AI training: In the early stages of AI/ML development, streaming data is key to training models by providing large datasets of structured or unstructured data. The models are trained to recognize patterns or correlations and then make initial predictions based on random or predefined parameters. This process is repeated and refined with large amounts of data to improve the model’s accuracy and reliability over time. By learning from patterns and trends — as well as any deviations in those patterns — these models develop more-precise outputs and predictions.
AI inference: Once an AI/ML system has been trained, it can be deployed in a production environment where it uses the learned parameters from its training to make predictions (inferences) based on input data. Streaming data provides fresh and unseen data, while the models generate near-instant insights and predictions.
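As an illustrative sketch of the inference side, a previously trained model can be wrapped as a Spark UDF with MLflow and applied to a stream, so predictions are produced as events arrive; the model URI, feature columns and table names are assumptions.

```python
import mlflow
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Load a registered model (placeholder URI) as a UDF usable on streaming data.
score = mlflow.pyfunc.spark_udf(spark, model_uri="models:/fraud_detector/1")

# Score each incoming transaction as it arrives.
scored = (
    spark.readStream.table("payments.transactions")   # hypothetical streaming table
    .withColumn("fraud_score", score(F.struct("amount", "merchant_id", "country")))
)

query = (
    scored.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/fraud_scoring")  # placeholder
    .toTable("payments.scored_transactions")
)
```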
Organizations across sectors leverage the insights of AI built on streaming datasets. Health and wellness retailers leverage real-time reporting on customer data to help pharmacists provide personalized recommendations and advice. Telecommunications companies can use real-time machine learning to detect fraudulent activity like illegal device unlocks and identity theft. Meanwhile, retailers can leverage streaming data to automate real-time pricing based on inventory and market factors.
While streaming data is crucial for these models, it’s important to note that integrating AI/ML with data streaming presents a unique set of challenges. Some of these challenges include:
Data volume: Organizations have a deluge of data at their fingertips, such as customer information, transaction data, device usage data and more. Managing all of this data and integrating it into an AI/ML model requires a robust data architecture and processing capabilities that are scalable and resilient.
Data quality: While the amount of data is growing exponentially, not all data is high-quality and accurate. Data is often sourced from various systems, in disparate formats and may be incomplete or inconsistent. For AI/ML models to function well, data must be continuously tested and validated to ensure reliability.
Data pipelines: Building robust and efficient data pipelines to handle real-time data ingestion, transformation and delivery for AI/ML can be complex. It’s crucial that your organization invests in scalable infrastructure to handle large volumes of data ingestion and processing.
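On the data quality point above, one hedged illustration is declaring validation rules directly in the pipeline with DLT expectations, so bad records are handled continuously rather than in a separate cleanup pass; the constraints and table names are hypothetical.

```python
import dlt

# Drop records that fail basic checks; the pipeline tracks how many are rejected.
@dlt.table(comment="Validated transactions ready for training and inference")
@dlt.expect_or_drop("non_negative_amount", "amount >= 0")
@dlt.expect_or_drop("known_customer", "customer_id IS NOT NULL")
def transactions_validated():
    return dlt.read_stream("transactions_raw")   # hypothetical upstream table
```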
Databricks is addressing these problems through Mosaic AI, which provides customers with unified tooling to build, deploy, evaluate and govern AI and ML solutions. Users receive accurate outputs customized with enterprise data and can train and serve their own custom large language models (LLMs) at 10x lower cost.
Streaming on Databricks
Deploying data streaming within your organization can require a good deal of effort. Databricks makes it easier by simplifying data streaming. The Databricks Data Intelligence Platform delivers real-time analytics, machine learning and applications — all on one platform. By building streaming applications on Databricks, you can:
- Enable all your data teams to easily build streaming data workloads with the languages and tools they already know and the APIs they already use.
- Simplify development and operations by leveraging out-of-the-box capabilities that automate many of the production aspects associated with building and maintaining real-time data pipelines.
- Eliminate data silos and centralize your security and governance models with a single platform for streaming and batch data.
Additionally, with the help of DLT, customers receive automated tooling to simplify data ingestion and ETL, preparing datasets for deployment across real-time analytics, ML and operational applications.
Spark Structured Streaming lies at the heart of Databricks’ real-time capabilities. Widely adopted by hundreds of thousands of individuals and organizations, it provides a single and unified API for batch and stream processing, making it easy for data engineers and developers to build real-time applications without changing code or learning new skills.
Across the world, organizations have leveraged data streaming on the Databricks Data Intelligence Platform to optimize their operational systems, manage digital payment networks, explore new innovations in renewable energy and help protect consumers from fraud.
Databricks offers all of these tightly integrated capabilities to support your real-time use cases on one platform.