  • C

  • Catalyst Optimizer
    At the core of Spark SQL is the Catalyst optimizer, which leverages advanced programming language features (e.g., Scala’s pattern matching and quasiquotes) in a novel way to build an extensible query optimizer. Catalyst is based on functional programming constructs in Scala and designed(...)
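    A quick way to see Catalyst at work is to ask Spark for the plans it produces. A minimal sketch (the column names here are made up):

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("catalyst-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label")

    // However the query is written, Catalyst rewrites it into an optimized
    // logical plan and then a physical plan before execution.
    val query = df.select($"id", $"label").filter($"id" > 1)

    // Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
    query.explain(true)
    ```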
  • Continuous Applications
    A continuous application is an end-to-end application that reacts to data in real time. In particular, developers would like to use a single programming interface to support the facets of continuous applications that are currently handled in separate systems, such as query serving or(...)
  • D

  • Databricks Runtime
    Databricks Runtime is the set of software artifacts that run on the clusters of machines managed by Databricks. It includes Spark but also adds a number of components and updates that substantially improve the usability, performance, and security of big data analytics. The primary(...)
  • DataFrames
    A DataFrame is the most common Structured API and simply represents a table of data with rows and columns. The list of columns and the types in those columns is called the schema. A simple analogy would be a spreadsheet with named columns. The fundamental difference is that while a spreadsheet sits on(...)
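    A minimal sketch of the idea (the data and column names are invented): the schema is the set of named, typed columns, while the rows can live partitioned across many machines.

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("df-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val people = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")
    people.printSchema()  // the schema: column names and their types
    people.show()         // the rows themselves
    ```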
  • Datasets
    Datasets are a type-safe version of Spark’s structured API for Java and Scala. This API is not available in Python or R, because those are dynamically typed languages, but it is a powerful tool for writing large applications in Scala and Java. Recall that DataFrames are a distributed(...)
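    A small sketch of the type-safe API (the Person class is hypothetical): mistakes like referencing a missing field are caught at compile time rather than at runtime.

    ```scala
    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Long)

    val spark = SparkSession.builder.appName("ds-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    // A Dataset[Person]: the compiler knows each element is a Person,
    // so person.age is checked as a Long at compile time.
    val ds = Seq(Person("Alice", 34), Person("Bob", 45)).toDS()
    val adults = ds.filter(person => person.age >= 18)
    adults.show()
    ```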
  • M

  • Machine Learning Library (MLlib)
    Apache Spark’s Machine Learning Library (MLlib) is designed for simplicity, scalability, and easy integration with other tools. With the scalability, language compatibility, and speed of Spark, data scientists can focus on their data problems and models instead of solving the complexities(...)
  • ML Pipelines
    Running machine learning algorithms typically involves a sequence of tasks, including pre-processing, feature extraction, model fitting, and validation stages. For example, classifying text documents might involve text segmentation and cleaning, extracting features, and training a(...)
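    The sketch below shows such a text-classification sequence expressed as a single Pipeline (the toy training data is invented); fitting the pipeline runs each stage in order and yields one reusable model.

    ```scala
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("pipeline-demo").master("local[*]").getOrCreate()

    // Toy labeled documents: (id, text, label).
    val training = spark.createDataFrame(Seq(
      (0L, "spark is great", 1.0),
      (1L, "hadoop map reduce", 0.0)
    )).toDF("id", "text", "label")

    // Each stage feeds the next: tokenize -> hash term frequencies -> fit a model.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(training)  // a single fitted PipelineModel
    ```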
  • R

  • Resilient Distributed Dataset (RDD)
    The RDD has been the primary user-facing API in Spark since its inception. At its core, an RDD is an immutable distributed collection of elements of your data, partitioned across the nodes in your cluster, that can be operated on in parallel with a low-level API that offers transformations and actions. 5(...)
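    A minimal sketch of that low-level API: map is a transformation (lazy), while reduce is an action that actually runs the computation across partitions.

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("rdd-demo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // An immutable collection split into 4 partitions across the cluster.
    val numbers = sc.parallelize(1 to 100, numSlices = 4)
    val squares = numbers.map(n => n * n)   // transformation: nothing runs yet
    println(squares.reduce(_ + _))          // action: triggers the job; prints 338350
    ```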
  • S

  • Spark Applications
    Spark Applications consist of a driver process and a set of executor processes. The driver process runs your main() function, sits on a node in the cluster, and is responsible for three things: maintaining information about the Spark Application; responding to a user’s program or input; and(...)
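    As a rough sketch of that split: everything in main() below runs in the driver process, while the work triggered by the count() action is broken into tasks that run on the executors (the app name is arbitrary; the master is supplied by spark-submit).

    ```scala
    import org.apache.spark.sql.SparkSession

    object MyApp {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("my-app").getOrCreate()
        val df = spark.range(1000000L)  // the driver only builds the plan here
        println(df.count())             // executors perform the counting in parallel
        spark.stop()
      }
    }
    ```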
  • Spark SQL
    Many data scientists, analysts, and general business intelligence users rely on interactive SQL queries for exploring data. Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. It(...)
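    A minimal sketch of both sides of that abstraction (the table and column names are invented): build a DataFrame, expose it as a view, and query it with plain SQL.

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("sql-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val sales = Seq(("US", 100), ("DE", 80), ("US", 40)).toDF("country", "amount")
    sales.createOrReplaceTempView("sales")

    // The same engine answers SQL and DataFrame queries alike.
    spark.sql("SELECT country, SUM(amount) AS total FROM sales GROUP BY country").show()
    ```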
  • Spark Streaming
    Apache Spark Streaming is a scalable, fault-tolerant stream processing system that natively supports both batch and streaming workloads. Spark Streaming is an extension of the core Spark API that allows data engineers and data scientists to process real-time data from various sources(...)
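    The classic word-count sketch below assumes text arriving on a local socket (e.g. fed by `nc -lk 9999`); each 5-second batch is processed with the same operations you would use on an RDD.

    ```scala
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-demo")
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()  // prints each batch's word counts

    ssc.start()
    ssc.awaitTermination()
    ```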
  • SparkR
    SparkR is a tool for running R on Spark. It follows the same principles as all of Spark’s other language bindings. To use SparkR, we simply import it into our environment and run our code. It’s all very similar to the Python API except that it follows R’s syntax instead of Python’s. For the most(...)
  • Structured Streaming
    Structured Streaming is a high-level API for stream processing that became production-ready in Spark 2.2. Structured Streaming allows you to take the same operations that you perform in batch mode using Spark’s structured APIs, and run them in a streaming fashion. This can reduce latency and(...)
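    A minimal sketch (assuming a local socket source): the groupBy/count below is exactly what you would write for a static DataFrame, but Spark runs it incrementally as new lines arrive.

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("structured-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val lines = spark.readStream.format("socket")
      .option("host", "localhost").option("port", 9999).load()

    // Identical to a batch word count; the result table is updated per micro-batch.
    val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

    counts.writeStream.outputMode("complete").format("console").start().awaitTermination()
    ```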
  • T

  • TensorFlow
    In November of 2015, Google released its open-source framework for machine learning, named TensorFlow. It supports deep learning, neural networks, and general numerical computations on CPUs, GPUs, and clusters of GPUs. One of the biggest advantages of TensorFlow is its open-source(...)
  • Transformations
    In Spark, the core data structures are immutable, meaning they cannot be changed once created. This might seem like a strange concept at first: if you cannot change it, how are you supposed to use it? In order to “change” a DataFrame you will have to instruct Spark how you would like to modify(...)
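    A small sketch of the idea: withColumn does not modify the original DataFrame; it returns a new one describing the change, and nothing executes until an action such as show() is called.

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("transform-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = spark.range(10).toDF("number")

    // A transformation: builds a *new* DataFrame, leaving df untouched.
    val doubled = df.withColumn("doubled", $"number" * 2)

    df.show()       // still only the original column
    doubled.show()  // original plus the derived column
    ```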
  • Tungsten
    Tungsten is the codename for the umbrella project to make changes to Apache Spark’s execution engine that focus on substantially improving the efficiency of memory and CPU use for Spark applications, pushing performance closer to the limits of modern hardware. This effort includes the following(...)
  • U

  • Unified AI Framework
    Unified Artificial Intelligence, or UAI, was announced by Facebook at its F8 conference. It brings together two deep learning frameworks that Facebook created and open-sourced: PyTorch, focused on research assuming access to large-scale compute resources, and Caffe2, focused on model(...)
  • Unified Analytics
    Unified Analytics is a new category of solutions that unify data processing with AI technologies, making AI much more achievable for enterprise organizations and enabling them to accelerate their AI initiatives. Unified Analytics makes it easier for enterprises to build data pipelines across(...)