Note: The solution in this blog is only known to work on AWS single-tenant deployments and may not work on Azure Databricks. Please review the official documentation at https://docs.databricks.com/ to understand what is available to meet your requirements.
This spring, I worked as a software engineering intern on the Clusters team at Databricks. My internship project was to enable NFS mounting for the Databricks product, which lets you mount your own storage (AWS EFS, Azure Files, or an on-prem filesystem) using the NFS protocol. In this blog, I will discuss how we integrated the Network File System into the Databricks product, as well as my internship experience.
Network File System
Network File System (NFS) is a distributed file system protocol that lets you access files over a network much as you would access local storage. NFS is widely used in cloud environments (AWS EFS and Azure Files) and in on-prem file storage, and a large number of instances can share the same NFS server and interact with the same file system simultaneously.
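To a program, a mounted NFS share looks just like a local directory. As a minimal sketch (the /mnt/nfs mount point is purely illustrative and assumes an NFS share is already mounted there), the same standard Python file APIs work against a local path and an NFS path:

# The same POSIX file APIs work on a local directory and on an NFS mount;
# only the mount point differs. /mnt/nfs is an illustrative path that
# assumes an NFS share is already mounted there.
import os

for base in ("/tmp", "/mnt/nfs"):
    path = os.path.join(base, "hello.txt")
    with open(path, "w") as f:
        f.write("same code, local disk or NFS\n")
    with open(path) as f:
        print(base, "->", f.read().strip())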
However, NFS mounting was not supported by the Databricks product. Previously, if you wanted to access your own filesystem, you had to mount it manually using FUSE.
FUSE Limitations
A major limitation of FUSE is its performance: every operation context-switches between user space and the Linux kernel, adding latency. Another limitation is that many FUSE clients are not production-ready. If you use AWS EFS or keep data on-prem, the experience is even worse, since no suitable FUSE client is available.
Databricks provides a local POSIX filesystem via a FUSE mount into DBFS, backed by S3/Azure Blob Storage. Enabling NFS mounting also opens up the possibility of migrating DBFS to NFS to offer higher performance in the future.
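For example, because DBFS is exposed through a FUSE mount at /dbfs, ordinary file APIs already work against it today (the file path below is illustrative):

# DBFS is FUSE-mounted at /dbfs on Databricks clusters, so standard
# Python file I/O reads and writes objects backed by S3/Azure Blob Storage.
with open("/dbfs/tmp/example.txt", "w") as f:
    f.write("stored in DBFS\n")

with open("/dbfs/tmp/example.txt") as f:
    print(f.read())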
Based on the above limitations and the strong demand, we decided to add support for NFS mounting in Databricks.
How NFS on Databricks Works
As a qualified AWS customer, you can enable NFS mounting by turning on the NFS configuration flag and mounting NFS with the following init script. With this init script, EFS is mounted on each node of the cluster, and you can access the filesystem under /efs. You can now read from and write to the filesystem!
dbutils.fs.put("/home/bootstrap/install-efs.sh", """
#!/bin/bash

# Install the NFS package
apt-get -y install nfs-common

# Create the mount directory and mount EFS
mkdir /efs
mount -t nfs4 -o nosuid,nodev fs-abcdefg.efs.us-REGION-2.amazonaws.com:/ /efs
""", True)
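Once the cluster starts, a quick way to convince yourself that every node shares the same filesystem is to write a marker file from each Spark task and list the files from the driver. This is a minimal sketch, assuming the /efs mount from the init script above, an illustrative /efs/markers directory, and the notebook's predefined SparkContext sc:

# Each Spark task writes a small marker file into the shared mount; since
# every node mounts the same EFS filesystem, the driver sees all of them.
import os

def write_marker(i):
    import socket
    path = "/efs/markers/task-%d-%s.txt" % (i, socket.gethostname())
    with open(path, "w") as f:
        f.write("written by %s\n" % socket.gethostname())
    return path

os.makedirs("/efs/markers", exist_ok=True)  # runs on the driver
sc.parallelize(range(8), 8).map(write_marker).collect()
print(sorted(os.listdir("/efs/markers")))  # markers from all executors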
NFS Use Cases
NFS mounting addresses the following use cases:
- Offers highly performant, low-latency I/O operations (see the latency probe sketch after this list).
- Enables usage of RStudio on Databricks (RStudio relies on POSIX features that the DBFS FUSE mount cannot easily support).
- Provides easy access to datasets in existing NFS deployments.
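To put rough numbers on the first bullet, a simple probe can time many small synchronous writes on the NFS mount and on the DBFS FUSE mount. This is an illustrative sketch, not a benchmark from the project; it assumes the /efs mount above and the standard /dbfs mount, and absolute numbers will vary with instance type, region, and filesystem configuration:

# Time n small, fsync'ed writes under a directory and report the mean
# per-write latency; compares the NFS mount against the DBFS FUSE mount.
import os
import time

def mean_write_latency(base, n=100):
    start = time.time()
    for i in range(n):
        path = os.path.join(base, "probe-%d.txt" % i)
        with open(path, "w") as f:
            f.write("x")
            f.flush()
            os.fsync(f.fileno())  # force the write through the mount
        os.remove(path)
    return (time.time() - start) / n

for base in ("/efs", "/dbfs/tmp"):
    print("%s: %.2f ms per write" % (base, mean_write_latency(base) * 1000))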
Conclusion
Enabling NFS mounting opens up new possibilities for the Databricks product and significantly improves the performance of latency-sensitive storage workloads in Databricks.
During this internship, I gained experience designing, implementing, and testing a feature in a production-scale system. I also experienced real-world project management, where project requirements change and you need to adapt the design and implementation on short notice. Interning at Databricks, I felt I was part of the team and was given the opportunity to make a real impact on the Databricks product. More importantly, I learned a lot from other team members and grew technically through this project. Databricks engineers are very supportive, always open to discussion and feedback, and act with ownership and responsibility. Special thanks to the Clusters team members for their support! I would like to thank my manager, Ihor, for always being there to support me and for caring about my life in and outside of work, and my mentor, Qian, for helping me with the problems I encountered and teaching me valuable lessons about writing robust code and building large-scale systems!
It's awesome to be part of a team where people believe in you and help you become a better engineer! Thank you, Databricks and the Clusters team, for this wonderful time and work experience!