Introduction
We are thrilled to unveil the English SDK for Apache Spark, a transformative tool designed to enrich your Spark experience. Apache Spark™, celebrated globally with over a billion annual downloads from 208 countries and regions, has significantly advanced large-scale data analytics. With the innovative application of Generative AI, our English SDK seeks to expand this vibrant community by making Spark more user-friendly and approachable than ever!
Motivation
GitHub Copilot has revolutionized the field of AI-assisted code development. While it's powerful, it expects users to understand the generated code before committing it, and reviewers must understand the code as well in order to review it. This can be a limiting factor for broader adoption. Copilot also occasionally struggles with context, especially when dealing with Spark tables and DataFrames. The attached GIF illustrates this point: Copilot proposes a window specification that references a non-existent 'dept_id' column, a mistake that takes some expertise to catch.
Instead of treating AI as the copilot, shall we make AI the chauffeur and take the luxury backseat ourselves? This is where the English SDK comes in. We find that state-of-the-art large language models know Spark remarkably well, thanks to the great Spark community, which over the past ten years has contributed a wealth of open, high-quality content: API documentation, open source projects, questions and answers, tutorials, books, and more. Now we bake Generative AI's expert knowledge of Spark into the English SDK. Instead of having to understand complex generated code, you can get the result with a simple English instruction that many more people understand:
transformed_df = df.ai.transform('get 4 week moving average sales by dept')
The English SDK, with its understanding of Spark tables and DataFrames, handles the complexity, returning a DataFrame directly and correctly!
Our journey began with the vision of using English as a programming language, with Generative AI compiling these English instructions into PySpark and SQL code. This innovative approach is designed to lower the barriers to programming and simplify the learning curve. This vision is the driving force behind the English SDK and our goal is to broaden the reach of Spark, making this very successful project even more successful.
Features of the English SDK
The English SDK simplifies the Spark development process by offering the following key features:
- Data Ingestion: The SDK can perform a web search using your provided description, utilize the LLM to determine the most appropriate result, and then smoothly incorporate this chosen web data into Spark—all accomplished in a single step.
- DataFrame Operations: The SDK provides functionalities on a given DataFrame that allow for transformation, plotting, and explanation based on your English description. These features significantly enhance the readability and efficiency of your code, making operations on DataFrames straightforward and intuitive.
- User-Defined Functions (UDFs): The SDK supports a streamlined process for creating UDFs. With a simple decorator, you only need to provide a docstring, and the AI handles the code completion. This feature simplifies the UDF creation process, letting you focus on function definition while the AI takes care of the rest.
- Caching: The SDK incorporates caching to boost execution speed, make results reproducible, and save cost; see the sketch after this list.
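As a rough illustration of the caching feature: results of LLM calls and web searches are staged in an in-memory cache that can be persisted across runs. The commit() call below reflects our reading of the project README; treat the exact API as an assumption and check pyspark.ai for the current interface.
# A minimal caching sketch; spark_ai.commit() is assumed from the project README.
from pyspark_ai import SparkAI

spark_ai = SparkAI()
auto_df = spark_ai.create_df("2022 USA national auto sales by brand")  # result staged in the in-memory cache
spark_ai.commit()  # assumed call: persists the staging cache so later runs can reuse it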
Examples
To illustrate how the English SDK can be used, let's look at a few examples:
Data Ingestion
If you're a data scientist who needs to ingest 2022 USA national auto sales, you can do this with just two lines of code (after importing the SDK):
from pyspark_ai import SparkAI

spark_ai = SparkAI()
auto_df = spark_ai.create_df("2022 USA national auto sales by brand")
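By default the SDK constructs its own language model. If you want to pin a specific model, the constructor also accepts an LLM instance; the langchain-based snippet below matches our understanding of the project README, so treat the exact parameter as an assumption:
# Assumed configuration sketch: passing an explicit langchain chat model to SparkAI.
from langchain.chat_models import ChatOpenAI
from pyspark_ai import SparkAI

spark_ai = SparkAI(llm=ChatOpenAI(model_name="gpt-4", temperature=0))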
DataFrame Operations
Given a DataFrame df, the SDK lets you invoke methods under the df.ai namespace, including transformation, plotting, DataFrame explanation, and more.
To activate these partial functions for PySpark DataFrames:
spark_ai.activate()
To get an overview of `auto_df`:
auto_df.ai.plot()
To view the market share distribution across automotive companies:
auto_df.ai.plot("pie chart for US sales market shares, show the top 5 brands and the sum of others")
To get the brand with the highest growth:
auto_top_growth_df = auto_df.ai.transform("top brand with the highest growth")
auto_top_growth_df.show()
| brand    | us_sales_2022 | sales_change_vs_2021 |
|----------|---------------|----------------------|
| Cadillac | 134726        | 14                   |
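Because ai.transform returns an ordinary PySpark DataFrame, the result composes with standard DataFrame operations; for example, selecting the columns shown above:
# The transformed result is a plain DataFrame, so regular PySpark methods apply.
auto_top_growth_df.select("brand", "us_sales_2022").show()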
To get the explanation of a DataFrame:
auto_top_growth_df.ai.explain()
In summary, this DataFrame is retrieving the brand with the highest sales change in 2022 compared to 2021. It presents the results sorted by sales change in descending order and only returns the top result.
User-Defined Functions (UDFs)
The SDK supports a simple and neat UDF creation process. With the @spark_ai.udf decorator, you only need to declare a function with a docstring, and the SDK automatically generates the code behind the scenes:
@spark_ai.udf
def convert_grades(grade_percent: float) -> str:
    """Convert the grade percent to a letter grade using standard cutoffs"""
    ...
Now you can use the UDF in SQL queries or on DataFrames:
SELECT student_id, convert_grades(grade_percent) FROM grade
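For the SQL query above to resolve convert_grades, the function must be visible to the active SparkSession. If the decorator does not register it automatically (an assumption worth verifying against the SDK docs), the standard PySpark registration step would look like this, assuming a SparkSession named spark:
# Assumed step: expose the AI-completed function to Spark SQL via the standard PySpark API.
# "grade" is the example table from the query above.
spark.udf.register("convert_grades", convert_grades)
spark.sql("SELECT student_id, convert_grades(grade_percent) FROM grade").show()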
Conclusion
The English SDK for Apache Spark is an extremely simple yet powerful tool that can significantly enhance your development process. It's designed to simplify complex tasks, reduce the amount of code required, and allow you to focus more on deriving insights from your data.
While the English SDK is in the early stages of development, we're very excited about its potential. We encourage you to explore this innovative tool, experience the benefits firsthand, and consider contributing to the project. Don't just observe the revolution—be a part of it. Explore and harness the power of the English SDK at pyspark.ai today. Your insights and participation will be invaluable in refining the English SDK and expanding the accessibility of Apache Spark.