Vinoo is CTO at Veraset, a data-as-a-service startup focused on understanding the world from a geospatial perspective. Vinoo previously led the compute team at Palantir Technologies, which was tasked with managing Spark and its interaction with HDFS, S3, Parquet, YARN, and Kubernetes across the company (including a DataSource V2 implementation and the External Shuffle Service SPIP). Vinoo is also an experienced startup advisor, advising Databand on solving data observability problems across the stack, and advising Horangi on building and enhancing their best-in-class cybersecurity product, Warden.
May 28, 2021 11:40 AM PT
As the data-as-a-service ecosystem continues to evolve, data brokers are faced with an unprecedented challenge: demonstrating the value of their data. Successfully crafting and selling a compelling data product relies on a broker’s ability to differentiate their product from the rest of the market. In smaller or static datasets, measures like row count and cardinality can speak volumes. However, when datasets are in the terabytes or petabytes, differentiation becomes much more difficult. On top of that, “data quality” is a somewhat ill-defined term, and the definition of a “high quality dataset” can change daily or even hourly.
This breakout session will describe Veraset’s partnership with Databricks, and how we have white-labeled Databricks to showcase and accelerate the value of our data. We’ll discuss the challenges that data brokers have faced to date and some of the primitives of our business that have guided our direction thus far. We will also demo our white-label instance and notebook to show how we’ve been able to provide key insights to our customers and reduce the TTFB of data onboarding.
June 24, 2020 05:00 PM PT
In a world where compute is paramount, it is all too easy to overlook the importance of storage and IO in the performance and optimization of Spark jobs. In reality, the choice of file format has drastic implications for everything from the ongoing stability to the compute cost of Spark jobs. These file formats also employ a number of optimization techniques to minimize data exchange, permit predicate pushdown, and prune unnecessary partitions. This session aims to introduce and concisely explain the key concepts behind some of the most widely used file formats in the Spark ecosystem, namely Parquet, ORC, and Avro. We'll discuss the history of these file formats, from their origins in the Hadoop / Hive ecosystems to their functionality and use today. We'll then deep dive into the core data structures that back these formats, covering specifics around the row groups of Parquet (including the recently deprecated summary metadata files), the stripes and footers of ORC, and the schema evolution capabilities of Avro. We'll then describe the specific SparkConf / SQLConf settings that developers can use to tune the behavior of these file formats. We'll conclude with specific industry examples of the impact of the file format on the performance or stability of a job (with examples around incorrect partition pruning introduced by a Parquet bug), and look forward to emerging technologies (Apache Arrow).
After this presentation, attendees should understand the core concepts behind the prevalent file formats, the relevant format-specific settings, and how to select the correct file format for their jobs. This presentation is relevant to Spark + AI Summit because, as more AI/ML workflows move into the Spark ecosystem (especially IO-intensive deep learning), choosing the correct file format is paramount to performant model training.
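As a rough illustration of the predicate pushdown idea mentioned in the abstract above, the following is a minimal pure-Python sketch, not how Parquet or ORC are actually implemented. It assumes each chunk of rows (loosely analogous to a Parquet row group or an ORC stripe) carries min/max statistics recorded at write time, so a reader can skip whole chunks that cannot match a filter. The names `RowGroup`, `write_row_group`, and `scan_greater_than` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group or ORC stripe."""
    rows: List[int]
    min_value: int  # statistic recorded once, at write time
    max_value: int  # statistic recorded once, at write time


def write_row_group(rows: List[int]) -> RowGroup:
    """Build a chunk and record its min/max statistics alongside the data."""
    return RowGroup(rows=rows, min_value=min(rows), max_value=max(rows))


def scan_greater_than(groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return the values > threshold and how many chunks were actually read."""
    matches: List[int] = []
    groups_read = 0
    for group in groups:
        # If the largest value in the chunk is not above the threshold,
        # no row in it can match, so the chunk is skipped without reading rows.
        if group.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in group.rows if v > threshold)
    return matches, groups_read


groups = [
    write_row_group([1, 4, 7]),
    write_row_group([10, 12, 15]),
    write_row_group([20, 25, 30]),
]
matches, groups_read = scan_greater_than(groups, 12)
print(matches, groups_read)
```

Here the first chunk is skipped entirely because its recorded maximum (7) cannot satisfy the filter, so only two of the three chunks are read. Real formats keep such statistics in file metadata and apply the same reasoning at a much larger scale.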