
Spark Elasticsearch


What is Spark Elasticsearch?

Spark Elasticsearch is a NoSQL, distributed database that stores, retrieves, and manages document-oriented and semi-structured data. It is an open source, RESTful search engine built on top of Apache Lucene and originally released under the terms of the Apache License.

Elasticsearch is written in Java and is therefore available on many platforms; it can search and index document files in diverse formats. The data stored in Elasticsearch takes the form of schema-less JSON documents, similar to other NoSQL databases.

[Figure: the Spark Elasticsearch process. A log document is collected and transformed, searched and analyzed with Elasticsearch, and finally visualized and managed with Kibana.]

The history of and an introduction to Elasticsearch

An integral part of a larger set of open-source tools known as the Elastic Stack, Elasticsearch is a popular full-text search engine, originally created by Shay Banon and first released in 2010. It's widely used in a number of commercial applications, from Reddit to eBay. For many companies, text-based search has become an essential component of their business processes. In this way, Elasticsearch is similar to other search engines.

One key difference between Elasticsearch and other search engines is that Elasticsearch can store and manage distributed data. In other words, it's designed to deal with data that has a constantly varying size. This makes very complex queries possible no matter how large a data set grows. By contrast, a single database server handling data from many users can quickly become a bottleneck.

Who uses Elasticsearch?

Thousands of top companies use Elasticsearch for both their online and offline data, including tech giants and many other household names.

But you don't have to be a tech giant to want an easy way to index structured data. You just have to know it exists and understand how it works.

But what is Elasticsearch used for, exactly?

Elasticsearch can serve a broad range of use cases, such as:

  • Logging and Log Analysis: The ecosystem of complementary open source software and platforms built up around Elasticsearch has made it one of the easiest logging solutions to implement and scale.
  • Scraping and Combining Public Data: Elasticsearch is flexible enough to take in multiple different sources of data and keep it all manageable and searchable.
  • Full-Text Search: Elasticsearch is document-oriented. It stores and indexes documents. Indexing creates or updates documents. Once the indexing is finished, you can search, sort, and filter complete documents, not rows of columnar data.
  • Event Data and Metrics: Elasticsearch is also known for working well on time-series data such as metrics and application events. No matter the technology you are using, Elasticsearch probably has the needed components to easily grab data for common applications; and in the rare case that it doesn't, adding that capability is quite easy.
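To illustrate the logging use case, the sketch below builds the newline-delimited JSON body that Elasticsearch's _bulk API expects: one action line followed by one source line per event. The index name and the log events are invented for the example.

```python
import json

def build_bulk_body(index, events):
    """Build an NDJSON payload for Elasticsearch's _bulk API:
    one {"index": ...} action line, then one source line, per event."""
    lines = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(event))
    # The _bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"

events = [
    {"level": "ERROR", "message": "disk full", "service": "api"},
    {"level": "INFO", "message": "request served", "service": "api"},
]
body = build_bulk_body("logs-2024", events)
print(body)
```

A payload like this would be POSTed to the cluster's /_bulk endpoint; batching writes this way is far cheaper than indexing log events one request at a time.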

Elasticsearch architecture: key components

To understand how Spark Elasticsearch works, when to use it and when not to use it, you have to first understand the infrastructure behind the Elasticsearch architecture. These key components include everything from the Elasticsearch cluster, ports 9200 and 9300, and Elasticsearch shards to Elasticsearch replicas, analyzers and documents.

Elasticsearch cluster

An Elasticsearch cluster is a group of interconnected computing nodes, each of which stores a piece of the cluster's data. As a user, you can adjust a node's settings, including the name of the cluster it joins, by altering the "elasticsearch.yml" file found in the configuration folder. While a cluster can contain as many nodes as you'd like, most users typically find one node is all it takes to achieve their desired results.
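For illustration, here is a minimal elasticsearch.yml sketch for a single node. The cluster and node names are made-up examples; the keys are standard Elasticsearch settings.

```yaml
# elasticsearch.yml -- minimal single-node sketch (names are examples)
cluster.name: my-search-cluster   # nodes with the same cluster.name join the same cluster
node.name: node-1                 # unique name for this node
network.host: 127.0.0.1           # address the node binds to
http.port: 9200                   # REST API port
```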

Elasticsearch node

An Elasticsearch node is a single running instance of Elasticsearch, a computing resource that is specifically tuned for searching, indexing and scaling the database. Since Elasticsearch is a distributed database, your data is spread across the cluster's data nodes rather than held by any single machine. Each node in a cluster has a unique name. An index on such nodes commonly holds tens of millions of documents, though the practical limit depends on document size and shard layout.

Ports 9200 and 9300

Elasticsearch uses two default ports. Port 9200 is the default port for the RESTful HTTP API, which clients use to index documents and run searches. Port 9300 is the default port for the transport layer, the node-to-node communication that nodes in a cluster use to talk to each other.

Elasticsearch shards

An Elasticsearch index is divided into shards, each of which is a self-contained Lucene index that can be hosted on any node in the cluster. Splitting an index into shards lets Elasticsearch distribute the data, and the work of indexing and searching it, across the cluster. Documents are bound to identifiers, and indexes to a unique name.
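Elasticsearch decides which primary shard receives a document with a routing formula: by default, a hash of the document ID modulo the number of primary shards. The sketch below is a simplified Python version of that idea (Elasticsearch itself uses a Murmur3 hash; plain hashlib stands in here just for illustration).

```python
import hashlib

def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    """Simplified version of Elasticsearch's routing formula:
    shard = hash(_routing) % number_of_primary_shards
    (the real implementation hashes with Murmur3)."""
    digest = hashlib.md5(doc_id.encode()).hexdigest()
    return int(digest, 16) % num_primary_shards

# The same document ID always routes to the same shard, which is why
# the primary shard count of an index is fixed at creation time.
for doc_id in ["user-1", "user-2", "user-3"]:
    print(doc_id, "-> shard", pick_shard(doc_id, 5))
```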

Elasticsearch replicas

A replica is a copy of a primary shard, with all changes to the primary being reflected on the replica while remaining transparent to the client. The replica is updated automatically when data is added, deleted, updated or otherwise modified on the primary. Replicas provide redundancy if a node fails, and they can also serve read requests, increasing search throughput.

Elasticsearch analyzers

An analyzer is the part of Elasticsearch that processes text as it is indexed (and, by default, as it is queried): it breaks text into tokens and then transforms those tokens, for example by lowercasing them or removing stop words. This is what makes it possible to search, filter and rank the data that's returned to the user.
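As a toy sketch of what an analyzer does, the function below tokenizes, lowercases, and drops stop words, roughly mimicking Elasticsearch's standard analyzer combined with a stop filter. The stop-word list is a minimal made-up example.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and"}  # minimal example list

def analyze(text: str) -> list[str]:
    """Toy analyzer pipeline: tokenize on letters, lowercase each token,
    then remove stop words."""
    tokens = re.findall(r"[A-Za-z]+", text)
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]

print(analyze("The History of an Elasticsearch Cluster"))
# -> ['history', 'elasticsearch', 'cluster']
```

The tokens that survive analysis are what actually get written to the inverted index, which is why two differently-capitalized spellings of a word match the same query.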

Elasticsearch document

Documents are the basic unit of information in Elasticsearch. Each document is a JSON object with a unique ID, stored in an index and routed to one of that index's shards. A document's fields are indexed individually, which is what makes each of them searchable.
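For example, a document is just a JSON object like the one below; the index name and fields are invented for illustration.

```python
import json

# A hypothetical document as it might be indexed into a "products" index.
document = {
    "_id": "product-42",          # unique ID within the index
    "name": "Fiber router",
    "category": "telecommunications",
    "price": 129.99,
    "tags": ["networking", "hardware"],
}

# Elasticsearch stores the JSON source and indexes each field separately.
print(json.dumps(document, indent=2))
```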

How does Elasticsearch work?

In short, Elasticsearch works by splitting an index's data into shards, distributing those shards across the nodes in the cluster, and rebalancing them as nodes are added or removed and as the amount of stored data grows and shrinks. The benefit is a single, elastic index container that behaves like one database even though it spans many machines.

Because Elasticsearch is a document-oriented, RESTful search engine, it has a variety of useful tools and can work with large, and otherwise intimidating, data sets. Additionally, this software can be used alongside complementary tools. For instance, Elasticsearch + Spark.

An Elasticsearch query example

Let's say you wanted to search for the word "telecommunications". Assuming a cluster running locally on the default port 9200 and a hypothetical index named "articles", the following simple search will do the trick:

curl 'localhost:9200/articles/_search?q=telecommunications'

Since Elasticsearch stores documents with arbitrary fields, you can also restrict the search to a single field. We'll use the query "type:telecommunications" to ensure we only get documents whose "type" field meets the search criteria:

curl 'localhost:9200/articles/_search?q=type:telecommunications'

To test this further, you could first create a simple example document by running the following:

curl -X POST 'localhost:9200/articles/_doc' -H 'Content-Type: application/json' -d '{"type": "telecommunications", "title": "Example document"}'
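The same search can also be expressed with Elasticsearch's full Query DSL in the request body. The sketch below only builds the JSON body; actually sending it (with curl or a client library) assumes a running cluster, and the field name is an example.

```python
import json

def match_query(field: str, value: str) -> str:
    """Build a Query DSL request body for a full-text match query."""
    return json.dumps({"query": {"match": {field: value}}})

body = match_query("type", "telecommunications")
print(body)
# You would POST this body to /<index>/_search on a running cluster.
```

Body-based queries are the usual choice in applications, since they support the full range of query types, while the URI `q=` form is handy for quick checks.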

What type of database does Elasticsearch use?

By combining the Lucene index format with a robust distributed database model, Elasticsearch is able to fragment data sets into small components known as shards and distribute them across various nodes.

But where does Elasticsearch store data?

The data stored in Elasticsearch is in JSON format (tools in the ecosystem can ingest other formats, such as CSV, by converting them to JSON documents). Every index has its own mapping, a template for the documents stored in that index, and each index's shards are replicated to keep the data available. Log files are commonly written to Elasticsearch as indexes.

Under the hood, these documents are stored in Lucene segments: immutable, on-disk data structures optimized for fast search. Because a segment never changes once it has been written, updating a document actually means writing a new version of it and marking the old version as deleted.

Elasticsearch's storage is therefore optimized for ingest, indexing and search operations, with in-memory buffers written to the disk at regular intervals. Deleted and superseded document versions are physically purged later, when Lucene merges smaller segments into larger ones, reclaiming the space they occupied.
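The segment behavior described above can be sketched with a toy simulation: segments are append-only, updates soft-delete the old version, and a merge rewrites the live documents into one segment and drops the deleted ones. All names here are invented for the illustration.

```python
class ToySegmentStore:
    """Toy model of Lucene-style segments: append-only writes,
    soft deletes, and a merge that purges deleted documents."""

    def __init__(self):
        self.segments = []      # each segment: list of (doc_id, doc) pairs
        self.deleted = set()    # (segment_index, position) pairs marked deleted

    def index(self, doc_id, doc):
        # Updating means soft-deleting old versions and appending a new one.
        for s, segment in enumerate(self.segments):
            for p, (existing_id, _) in enumerate(segment):
                if existing_id == doc_id:
                    self.deleted.add((s, p))
        self.segments.append([(doc_id, doc)])  # new immutable segment

    def merge(self):
        # Rewrite all live docs into a single segment; purge deletions.
        live = [(doc_id, doc)
                for s, segment in enumerate(self.segments)
                for p, (doc_id, doc) in enumerate(segment)
                if (s, p) not in self.deleted]
        self.segments, self.deleted = [live], set()

    def search(self, doc_id):
        return [doc for s, segment in enumerate(self.segments)
                for p, (existing_id, doc) in enumerate(segment)
                if existing_id == doc_id and (s, p) not in self.deleted]

store = ToySegmentStore()
store.index("1", {"msg": "first version"})
store.index("1", {"msg": "second version"})   # soft-deletes the first
print(len(store.segments))                     # 2 segments before merging
store.merge()
print(len(store.segments), store.search("1"))  # 1 segment, only the live doc
```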

What is Elasticsearch aggregation?

Elasticsearch aggregations are a powerful search feature that summarizes your data as metrics, statistics and other analytics, computed over the documents that match a query. Aggregations let you answer questions like "what is the average response time per service?" without pulling the raw documents back to the client and without affecting how your search queries behave.

Every aggregation belongs to one of three families. The aggregation families are:

  1. Bucket aggregations
  2. Metric aggregations
  3. Pipeline aggregations

Bucket aggregations

Bucket aggregations group documents into buckets based on field values, ranges or other criteria, for example one bucket per value of a "status" field, or one bucket per calendar day. Each document falls into the buckets whose criteria it matches, and bucket aggregations can be nested inside one another to build multi-level breakdowns of a data set.

Metric aggregations

Metric aggregations compute numbers, such as sums, averages, minimums, maximums and cardinalities, over the documents in a bucket or over all matching documents. They are typically nested inside bucket aggregations to produce per-group statistics.

Pipeline aggregations

A key distinction of pipeline aggregations is that they take the output of other aggregations, rather than documents, as their input, for example computing a moving average over the per-day sums produced by a date histogram. This makes it possible to build derived analytics without issuing additional queries against the underlying data set.
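Tying the aggregation families together, the sketch below builds a request body with a bucket aggregation (terms on a "service" field) and a nested metric (average of "latency_ms"). The field names are invented for the example; sending the body assumes a running cluster.

```python
import json

# Query DSL body: a terms (bucket) aggregation with a nested avg (metric).
aggregation_body = {
    "size": 0,  # return only aggregation results, no raw documents
    "aggs": {
        "per_service": {                   # bucket: one bucket per service value
            "terms": {"field": "service"},
            "aggs": {
                "avg_latency": {           # metric: computed within each bucket
                    "avg": {"field": "latency_ms"}
                }
            },
        }
    },
}
print(json.dumps(aggregation_body, indent=2))
```

A pipeline aggregation would then consume results like these, for instance smoothing the per-bucket averages, rather than reading documents itself.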

How to install Elasticsearch and use it

Installation is fairly straightforward. It's possible to install Elasticsearch from default package repositories and to set a default environment for Elasticsearch, too.

Elasticsearch uses a configuration file called elasticsearch.yml as the basis for its configuration. You can modify the file to suit your needs. You can also pair Elasticsearch with other popular tools in its ecosystem, such as Kibana and Logstash.

The steps to installing Elasticsearch:

  1. Install a Java runtime if your Elasticsearch package does not bundle one (recent releases ship with a bundled JDK). You can use the default repositories for Java on your operating system to install it.
  2. Install the Elasticsearch package for your platform, along with its dependencies.
  3. Adjust elasticsearch.yml, for example the cluster name and network host, to suit your environment.
  4. Launch Elasticsearch.
  5. That's it. Elasticsearch is up and running on your machine, and you can verify it by sending a request to port 9200.
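The steps above can be sketched for a Debian-based system using Elastic's apt repository. Treat this as an ops sketch, not a definitive recipe: the repository path and version channel are assumptions, so check the current install documentation before running it.

```shell
# Hedged sketch: install Elasticsearch on a Debian-based system
# (repository URL and 8.x channel are assumptions -- verify against the docs).
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch \
  | sudo gpg --dearmor -o /usr/share/keyrings/elastic.gpg
echo "deb [signed-by=/usr/share/keyrings/elastic.gpg] https://artifacts.elastic.co/packages/8.x/apt stable main" \
  | sudo tee /etc/apt/sources.list.d/elastic-8.x.list
sudo apt-get update && sudo apt-get install elasticsearch
sudo systemctl start elasticsearch
curl localhost:9200   # verify: the node answers with cluster and version info
```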

Elasticsearch data visualization

Elasticsearch allows you to search and filter through all sorts of data via a simple API. The API is RESTful, so users can not only use it for data analysis but can also use it in production for web-based applications. Elasticsearch also includes faceted search via aggregations, functionality that allows you to compute summaries of your data. Here are some of the most relevant features:

  • It provides a scalable search solution.
  • Performs near-real-time searches.
  • Provides support for multi-tenancy.
  • Streamlines backup processes and ensures data integrity.
  • An index can be recovered in case of a server crash.
  • Uses JavaScript Object Notation (JSON) as well as Java application program interfaces (APIs).
  • Automatically indexes JSON documents.
  • Indexing uses unique type-level identifiers.
  • Each index can have its own settings.
  • Searches can be done with Lucene-based query strings.

Why use Elasticsearch instead of SQL?

The Elasticsearch service is one of the most widely adopted, powerful and useful search technologies because, when it comes to processing large amounts of text data quickly and efficiently, it's notably better than most of its SQL counterparts.

Elasticsearch is purpose-built for enterprise search use, providing powerful features and ease-of-use tools to businesses that rely on data analytics, offering them a more practical and flexible way to store, search and analyze batches of data in a less resource-intensive way.

How to check Elasticsearch version

There are two quick ways to check the version of Elasticsearch you're running. The first is to query the root endpoint of a running cluster: an HTTP GET on port 9200 returns a JSON body that includes the version number. The second is to run the elasticsearch binary with the --version flag.
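For example, the root endpoint returns JSON along the lines of the abbreviated, hypothetical response below, with the version under version.number. The sketch parses a canned response rather than calling a live cluster.

```python
import json

# Abbreviated example of what GET localhost:9200 returns (values are made up).
sample_response = """
{
  "name": "node-1",
  "cluster_name": "my-search-cluster",
  "version": {"number": "8.13.0", "lucene_version": "9.10.0"},
  "tagline": "You Know, for Search"
}
"""

info = json.loads(sample_response)
print("Elasticsearch version:", info["version"]["number"])
```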

Top three Elasticsearch alternatives on the market

When considering which software to use, there are three top Elasticsearch alternatives to take into account before making your decision:

AWS

Amazon Web Services (AWS) has become a top computing platform for startups, cutting-edge research and the biggest enterprises looking to enhance their computing infrastructure. Alongside the industry's widest set of cloud-computing services, AWS offers Amazon OpenSearch Service (formerly Amazon Elasticsearch Service), a managed, Elasticsearch-compatible search and analytics service, as it competes with Microsoft's Azure and Google's GCP in the so-called "cloud wars".

Solr

Apache Solr is an open source (Apache License 2.0) search and analytics engine written in Java, and one of the most popular open-source search engines. In fact, Solr powers the search functionality of many of the world's largest e-commerce sites and social media platforms.

Solr uses a distributed architecture, SolrCloud, to provide rapid search at scale, and it integrates with a wide range of storage mechanisms used by the enterprise.

ArangoDB

ArangoDB is a distributed, NoSQL, multi-model database, supporting document, graph and key-value data, and has become a popular choice due to its powerful analytical processing and ease of use. Its SQL-like query language, AQL, operates over the ArangoDB key-value store, allowing users to create collections, joins and queries much the way they would in relational databases.

ArangoDB does a good job of keeping all of its code up to date, and the support pages are well designed. As the project matures and more people contribute, you can expect these pages to stay up to date and easy to navigate. Not to mention, it's compatible with all the major programming languages like Python and JavaScript.

The three best Elasticsearch tools

To make the most of your data, we recommend using Elasticsearch in tandem with other tools and software—most notably Hevo Data, Logstash, and Apache Nifi.

Hevo Data

Hevo Data is a no-code data pipeline platform that can collect data from many sources, transform it, and load it into destinations such as Elasticsearch. It lets you feed your analytics with near-real-time data as well as with backups of that data.

Logstash

Simply put, Logstash is an Elastic Stack tool that allows you to define rules for managing incoming data before it is indexed by Elasticsearch: it collects data from many sources, transforms it on the fly, and ships it onward. By processing data as soon as it arrives, Logstash pairs naturally with the analytical and visualization tools that are perfect for making the most out of your data.

Apache NiFi

Apache NiFi is a dataflow automation tool that routes and transforms data between a wide variety of sources and destinations, including popular open REST APIs and data stores. With Apache NiFi, users are able to link their own APIs and systems and make all of a dataset's information available to various other software.

Is Elasticsearch right for you?

With everything you know now about Elasticsearch, from its capabilities to its infrastructure and architecture, all that's left is deciding whether it's an ideal tool for your business.
