Tomer Shiran is the CEO and co-founder of Dremio. Prior to Dremio, he was VP Product and employee #5 at MapR, where he was responsible for product strategy, roadmap and new feature development. As a member of the executive team, Tomer helped grow the company from five employees to over 300 employees and 700 enterprise customers. Prior to MapR, Tomer held numerous product management and engineering positions at Microsoft and IBM Research. He holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from Technion – Israel Institute of Technology.
June 23, 2020 05:00 PM PT
In the era of microservices and cloud apps, it is often impractical for organizations to physically consolidate all data into one system. Apache Arrow is an open source, columnar, in-memory data representation that enables analytical systems and data sources to exchange and process data in real-time, simplifying and accelerating data access, without having to copy all data into one location. As companies continue to embrace modern architectures based on microservices and cloud applications, it has become increasingly difficult to physically consolidate all data into a single system. In a world where data is extremely fragmented, and users expect instant gratification, the age-old approach of constructing and maintaining ETL pipelines can be prohibitively cumbersome and expensive. Apache Arrow is an open source project, initiated by over a dozen open source communities, which provides a standard columnar in-memory data representation and processing framework. Arrow has emerged as a popular way way to handle in-memory data for analytical purposes.
In the last year, Arrow has been embedded into a broad range of open source (and commercial) technologies, including GPU databases, machine learning libraries and tools, execution engines and visualization frameworks (e.g., Anaconda, Dremio, Graphistry, H2O, MapD, Pandas, R, Spark). In this talk, we provide an overview of Arrow, and outline how several open source projects are utilizing it to achieve high performance data processing and interoperability across systems. For example, we demonstrate a 50x speedup in PySpark (Spark-Pandas interoperability). We then show how companies can utilize Arrow to enable users to access and analyze data across disparate data sources without having to physically consolidate it into a centralized data repository.