- Apache Kudu
Apache Kudu is a free and open source columnar storage system developed for Apache Hadoop. It is an engine intended for structured data that supports low-latency, millisecond-scale random access to individual rows together with efficient analytical access patterns. It is a Big Data(...)
- Apache Kylin
Apache Kylin is a distributed, open source online analytical processing (OLAP) engine for interactive analytics on Big Data. Apache Kylin is designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop/Spark. In addition, it easily integrates with BI tools via ODBC(...)
- Big Data Analytics
Before Hadoop, both storage and compute technology were limited; as a result, the analytics process was long and rigid.
Before a new data source could be stored, it had to go through a lengthy process, usually known as ETL. Once the data was ready, it had to be stored in a(...)
- Catalyst Optimizer
At the core of Spark SQL is the Catalyst optimizer, which leverages advanced programming language features (e.g. Scala’s pattern matching and quasiquotes) in a novel way to build an extensible query optimizer.
Catalyst is based on functional programming constructs in Scala and designed(...)
- Continuous Applications
A continuous application is an end-to-end application that reacts to data in real time. In particular, developers would like to use a single programming interface to support the facets of continuous applications that are currently handled in separate systems, such as query serving or(...)
- Data lake
A data lake is a central location that holds a large amount of data in its native, raw format, as well as a way to organize large volumes of highly diverse data.
Compared to a hierarchical data warehouse which stores data in files or folders, a data lake uses a different approach; it uses(...)
- Databricks Runtime
Databricks Runtime is the set of software artifacts that run on the clusters of machines managed by Databricks. It includes Spark but also adds a number of components and updates that substantially improve the usability, performance, and security of big data analytics. The primary(...)
- DataFrames
A DataFrame is the most common Structured API and simply represents a table of data with rows and columns. The list of columns and the types in those columns is called the schema. A simple analogy would be a spreadsheet with named columns. The fundamental difference is that while a spreadsheet sits on(...)
- Datasets
Datasets are a type-safe version of Spark’s structured API for Java and Scala. This API is not available in Python and R, because those are dynamically typed languages, but it is a powerful tool for writing large applications in Scala and Java.
Recall that DataFrames are a distributed(...)
- Deep learning
Deep learning is a subset of machine learning concerned with large amounts of data and with algorithms that have been inspired by the structure and function of the human brain, which is why deep learning models are often referred to as deep neural networks. It is a part of a broader family of(...)
- Dense Tensor
Dense tensors store values in a contiguous sequential block of memory where all values are represented.
Tensors or multi-dimensional arrays are used in a diverse set of multi-dimensional data analysis applications.
There are a number of software products that can perform tensor(...)
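To illustrate the contiguous layout this entry describes, here is a minimal sketch in plain Python. It assumes row-major (C-style) ordering, which is only one possible convention, and the helper name `dense_offset` is hypothetical rather than part of any tensor library.

```python
from array import array


def dense_offset(index, shape):
    """Return the row-major offset of a multi-dimensional index in a flat buffer."""
    offset = 0
    for i, dim in zip(index, shape):
        if not 0 <= i < dim:
            raise IndexError(f"index {index} is out of bounds for shape {shape}")
        # Each step shifts the running offset by the size of the current axis.
        offset = offset * dim + i
    return offset


shape = (2, 3, 4)

# A dense tensor reserves one slot per element, zeros included,
# in a single contiguous block of doubles.
buffer = array("d", [0.0]) * (2 * 3 * 4)

buffer[dense_offset((1, 2, 3), shape)] = 9.5
last_offset = dense_offset((1, 2, 3), shape)
```

Because every element has a fixed position, a lookup is a single offset computation, at the cost of storing zeros explicitly.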
- Extract Transform Load
ETL stands for Extract-Transform-Load. It refers to the process of collecting data from numerous disparate databases, applications, and systems, transforming the data so that it matches the target system’s required formatting, and loading it into a destination database.
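The three stages can be sketched with Python's standard library alone. This is only an illustration: the inline CSV text, the `customers` table, and the function names are assumptions made for the example, and a production pipeline would read from real source systems.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system, with inconsistent formatting.
RAW_CSV = """name,signup_date,amount
 Alice ,2021-01-05,10.50
BOB,2021-02-10,7
"""


def extract(text):
    """Extract: read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize names and convert amounts to numbers."""
    return [
        (row["name"].strip().title(), row["signup_date"], float(row["amount"]))
        for row in rows
    ]


def load(records, conn):
    """Load: write the cleaned records into the destination database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(name TEXT, signup_date TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT name, amount FROM customers ORDER BY name").fetchall()
```

Keeping each stage in its own function makes it easy to swap the source or the destination without touching the transformation logic.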
- Hadoop Ecosystem
The Apache Hadoop ecosystem refers to the various components of the Apache Hadoop software library; it includes open source projects as well as a complete range of complementary tools. Some of the most well-known tools of the Hadoop ecosystem include HDFS, Hive, Pig, YARN, MapReduce, Spark, HBase(...)
- Hash Buckets
In computing, a hash table (hash map) is a data structure that provides virtually direct access to objects based on a key (a unique String or Integer). A hash table uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.
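As a rough sketch of the idea, the toy class below maps each key to a bucket index with a hash function and resolves collisions by chaining entries inside the bucket. The class name `BucketHashTable` and the choice of separate chaining are assumptions for illustration; real implementations such as Python's `dict` use different strategies.

```python
class BucketHashTable:
    """Toy hash table that stores (key, value) pairs in a fixed array of buckets."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        # The hash function turns the key into a bucket index.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        # Only the one bucket selected by the hash has to be scanned.
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        return default


table = BucketHashTable(num_buckets=4)
table.put("spark", 1)
table.put(42, "answer")
table.put("spark", 2)  # same key, so the value is replaced
```

A lookup touches a single bucket rather than the whole collection, which is what makes access "virtually direct" as long as the buckets stay short.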
- Hive Date Function
Hive provides many built-in functions to help us in the processing and querying of data. Some of the functionalities provided by these functions include string manipulation, date manipulation, type conversion, conditional operators, mathematical functions, and several others.
- Lambda Architecture
Lambda architecture is a way of processing massive quantities of data (i.e. “Big Data”) that provides access to batch-processing and stream-processing methods with a hybrid approach.
Lambda architecture is used to solve the problem of computing arbitrary functions. The lambda architecture(...)
- Machine Learning Library (MLlib)
Apache Spark’s Machine Learning Library (MLlib) is designed for simplicity, scalability, and easy integration with other tools. With the scalability, language compatibility, and speed of Spark, data scientists can focus on their data problems and models instead of solving the complexities(...)
- ML Pipelines
Running machine learning algorithms typically involves a sequence of tasks including pre-processing, feature extraction, model fitting, and validation stages. For example, classifying text documents might involve text segmentation and cleaning, extracting features, and training a(...)
- Parquet
Parquet is an open source file format available to any project in the Hadoop ecosystem. Apache Parquet is designed to be an efficient and performant flat columnar storage format, compared to row-based files like CSV or TSV.
Parquet uses the record shredding and assembly(...)
- Predictive Analytics
Predictive analytics is a form of advanced analytics that uses both new and historical data to determine patterns and predict future outcomes and trends.
How does predictive analytics work?
Predictive analytics uses many techniques such as statistical analysis techniques, analytical(...)
- PyCharm
PyCharm is an integrated development environment (IDE) used in computer programming, created for the Python programming language. When using PyCharm on Databricks, by default PyCharm creates a Python Virtual Environment, but you can configure it to create a Conda environment or use an existing(...)
- PySpark
Apache Spark is written in the Scala programming language. PySpark was released to support the collaboration of Apache Spark and Python; it is a Python API for Spark. In addition, PySpark helps you interface with Resilient Distributed Datasets (RDDs) in Apache Spark and(...)
- Resilient Distributed Dataset (RDD)
RDD has been the primary user-facing API in Spark since its inception. At the core, an RDD is an immutable distributed collection of elements of your data, partitioned across nodes in your cluster, that can be operated on in parallel with a low-level API that offers transformations and actions.
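The distinction between transformations and actions can be sketched without a cluster. The `ToyRDD` class below is a hypothetical, single-machine stand-in and is not Spark's API: partitions are plain tuples, threads stand in for cluster nodes, and only `map`, `filter`, `collect`, and `count` are modeled.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyRDD:
    """Immutable, partitioned collection with lazy transformations (illustration only)."""

    def __init__(self, partitions, ops=()):
        self._partitions = tuple(tuple(p) for p in partitions)
        self._ops = tuple(ops)

    @classmethod
    def parallelize(cls, data, num_partitions=2):
        data = list(data)
        size = max(1, -(-len(data) // num_partitions))  # ceiling division
        return cls([data[i:i + size] for i in range(0, len(data), size)])

    # Transformations return a new ToyRDD and do no work yet.
    def map(self, fn):
        return ToyRDD(self._partitions, self._ops + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._partitions, self._ops + (("filter", predicate),))

    def _compute(self, partition):
        items = iter(partition)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    # Actions trigger the computation, one task per partition.
    def collect(self):
        with ThreadPoolExecutor() as pool:
            results = pool.map(self._compute, self._partitions)
        return [item for part in results for item in part]

    def count(self):
        return len(self.collect())


numbers = ToyRDD.parallelize(range(1, 7), num_partitions=2)
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
result = even_squares.collect()
```

Note that `numbers` is never modified: each transformation produces a new object that records what to do, and nothing is computed until an action such as `collect` is called.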
- Spark API
If you are working with Spark, you will come across three APIs: DataFrames, Datasets, and RDDs.
An RDD, or Resilient Distributed Dataset, is a distributed collection of records that is fault tolerant and immutable in nature. They can be operated on in parallel with(...)
- Spark Applications
Spark Applications consist of a driver process and a set of executor processes. The driver process runs your main() function, sits on a node in the cluster, and is responsible for three things: maintaining information about the Spark Application; responding to a user’s program or input; and(...)
- Spark SQL
Many data scientists, analysts, and general business intelligence users rely on interactive SQL queries for exploring data. Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. It(...)
- Spark Streaming
Apache Spark Streaming is a scalable, fault-tolerant stream processing system that natively supports both batch and streaming workloads. Spark Streaming is an extension of the core Spark API that allows data engineers and data scientists to process real-time data from various sources(...)
- Spark Tuning
What is Spark Performance Tuning?
Spark Performance Tuning refers to the process of adjusting the settings for memory, cores, and instances used by the system. This process helps ensure that Spark performs optimally and also prevents bottlenecking of resources in(...)
- Sparklyr
Sparklyr is an open-source package that provides an interface between R and Apache Spark. You can now leverage Spark’s capabilities in a modern R environment, due to Spark’s ability to interact with distributed data with little latency. Sparklyr is an effective tool for interfacing with large(...)
- SparkR
SparkR is a tool for running R on Spark. It follows the same principles as all of Spark’s other language bindings. To use SparkR, we simply import it into our environment and run our code. It’s all very similar to the Python API except that it follows R’s syntax instead of Python’s. For the most(...)
- Sparse Tensor
Python offers a widely used library called NumPy to manipulate multi-dimensional arrays. The organization and use of this library is a primary requirement for developing the pytensor library.
Sptensor is a class that represents the sparse tensor. A sparse tensor is a dataset in which most of(...)
- Streaming Analytics
How does Stream Analytics work?
Streaming analytics, also known as event stream processing, is the analysis of huge pools of current and “in-motion” data through the use of continuous queries, called event streams.
These streams are triggered by a specific event that happens as a direct(...)
- Structured Streaming
Structured Streaming is a high-level API for stream processing that became production-ready in Spark 2.2. Structured Streaming allows you to take the same operations that you perform in batch mode using Spark’s structured APIs, and run them in a streaming fashion. This can reduce latency and(...)
- TensorFlow
In November of 2015, Google released its open-source framework for machine learning and named it TensorFlow. It supports deep learning, neural networks, and general numerical computations on CPUs, GPUs, and clusters of GPUs. One of the biggest advantages of TensorFlow is its open-source(...)
- Tensorflow Estimator API
Estimators represent a complete model but are also intuitive enough for less experienced users. The Estimator API provides methods to train the model, to judge the model’s accuracy, and to generate predictions.
TensorFlow provides a programming stack consisting of multiple API layers like in the below(...)
- Transformations
In Spark, the core data structures are immutable, meaning they cannot be changed once created. This might seem like a strange concept at first: if you cannot change it, how are you supposed to use it? In order to “change” a DataFrame you will have to instruct Spark how you would like to modify(...)
- Tungsten
Tungsten is the codename for the umbrella project of changes to Apache Spark’s execution engine, focused on substantially improving the efficiency of memory and CPU for Spark applications to push performance closer to the limits of modern hardware. This effort includes the following(...)
- Unified AI Framework
Unified Artificial Intelligence, or UAI, was announced by Facebook during F8 this year. It brings together two specific deep learning frameworks that Facebook created and open-sourced: PyTorch focused on research assuming access to large-scale compute resources while Caffe focused on model(...)
- Unified Analytics
Unified Analytics is a new category of solutions that unify data processing with AI technologies, making AI much more achievable for enterprise organizations and enabling them to accelerate their AI initiatives. Unified Analytics makes it easier for enterprises to build data pipelines across(...)
- Unified Data Analytics Platform
Databricks' Unified Data Analytics Platform helps organizations accelerate innovation by unifying data science with engineering and business. With Databricks as your Unified Data Analytics Platform, you can quickly prepare and clean data at massive scale with no limitations. The platform also(...)