There’s a new version of Dremio, an open-source project designed to give business analysts and data scientists a way to explore and analyze data no matter what its structure or size. New in this release are a data catalog, prioritized workload management, and Kubernetes support.
The developers of Dremio describe it as a data virtualization platform. The software is based on Apache Arrow, Apache Parquet, and Apache Calcite, and the company behind Dremio is a major contributor to Arrow. Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized, language-independent columnar memory format for flat and hierarchical data. Apache Parquet offers similar features for file-based storage. Apache Calcite is used for SQL parsing and query optimization.
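To make the idea of a columnar memory format concrete, here is a minimal sketch in plain Python. It is not Arrow's actual format, which also defines validity bitmaps, offset buffers, and nested types. The sample records and the `to_columns` helper are made up for illustration.

```python
from array import array

# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": 12.0},
    {"id": 3, "price": 7.25},
]


def to_columns(records):
    """Pivot row records into one contiguous typed buffer per field.

    Hypothetical helper: it only illustrates the columnar idea and does
    not produce Arrow-compatible buffers.
    """
    return {
        "id": array("q", (r["id"] for r in records)),       # 64-bit signed ints
        "price": array("d", (r["price"] for r in records)),  # 64-bit floats
    }


columns = to_columns(rows)

# An aggregate over one field now scans a single contiguous buffer
# instead of touching every record.
total_price = sum(columns["price"])
```

The point of the layout is that a query that reads only `price` never has to look at `id`, which is the access pattern analytical queries tend to have.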
Dremio builds Arrow-based structures called Reflections. These are optimized copies of data based on queries against data sources. Dremio also has a query optimizer that uses Apache Arrow to work out the best representation of data to make the query faster. This might mean that a query against an Elasticsearch cluster, for example, would use the Arrow representation of the data instead of going to the cluster itself.
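The general idea of answering a query from an optimized copy rather than the original source can be sketched as follows. This is a toy illustration under stated assumptions, not Dremio's implementation: the `reflections` registry, its key, and the `sum_by_region` function are all hypothetical names.

```python
from collections import defaultdict

# Stand-in for rows held in a slow external source.
raw_rows = [("east", 10), ("west", 5), ("east", 7), ("west", 1)]


def aggregate(rows):
    """Compute SUM(amount) GROUP BY region by scanning every row."""
    totals = defaultdict(int)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)


def sum_by_region(rows, reflections):
    """Answer the aggregate query, preferring a precomputed copy.

    Returns the result together with a label saying where it came from.
    """
    key = "sum_amount_by_region"  # hypothetical identifier for the stored copy
    if key in reflections:
        return reflections[key], "reflection"
    return aggregate(rows), "source scan"


# Build the optimized copy once, ahead of time.
reflections = {"sum_amount_by_region": aggregate(raw_rows)}

fast_result, fast_path = sum_by_region(raw_rows, reflections)
slow_result, slow_path = sum_by_region(raw_rows, {})
```

Both calls return the same totals; only the path differs. A real optimizer would also have to decide whether a stored copy is fresh enough and whether it covers the query, which this sketch ignores.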
Dremio also has a built-in SQL-based query language that provides features similar to those of cost-based optimizers such as Spark SQL, with Reflections taking the idea further by providing an optimized copy of the data.
The new version of Dremio adds a data catalog with the idea that users will be able to carry out a simple Google-like search to find datasets. Under the covers, Dremio administrators tag datasets to organize them so they can be discovered by data consumers. The catalog includes built-in wiki pages that can store information such as whom to ask questions, how often the data is updated, what sources of data make up the dataset, and screenshots of reports and visualizations that use the dataset.
This release also includes support for Gandiva, a new execution kernel for Arrow that is based on LLVM. Gandiva provides performance improvements for low-level operations on Arrow buffers. The developers say that, in the right circumstances, using Gandiva can improve query performance dramatically; some early testers have reported improvements of more than 70x.
Security has been improved with native integration with Apache Ranger for centralized access control. In addition, Dremio 3.0 now supports end-to-end TLS encryption.
New multi-tenant workload controls have been added so that administrators can control resource allocation based on user, group membership, time of day, data source, and query type using standard SQL.
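A rough way to picture rule-based workload assignment is a list of rules evaluated in order, with the first match deciding which queue a query lands in. The sketch below is written in Python purely for illustration. Dremio expresses its rules in SQL, and none of the names here (`Query`, `Rule`, `assign_queue`, the queue names, the query types) come from Dremio.

```python
from dataclasses import dataclass
from datetime import time
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class Query:
    """Attributes a rule may inspect (all field names are hypothetical)."""
    user: str
    groups: FrozenSet[str]
    submitted_at: time
    source: str
    query_type: str


@dataclass(frozen=True)
class Rule:
    name: str
    matches: Callable[[Query], bool]
    queue: str


# Rules are checked top to bottom; the first match wins.
RULES: List[Rule] = [
    Rule(
        "analysts during business hours",
        lambda q: "analysts" in q.groups and time(9) <= q.submitted_at < time(17),
        "high",
    ),
    Rule(
        "scheduled jobs",
        lambda q: q.query_type == "scheduled",
        "background",
    ),
]


def assign_queue(query: Query, rules: List[Rule] = RULES, default: str = "low") -> str:
    """Return the queue of the first matching rule, or the default queue."""
    for rule in rules:
        if rule.matches(query):
            return rule.queue
    return default


daytime = Query("ana", frozenset({"analysts"}), time(10, 30), "s3", "adhoc")
nightly = Query("svc", frozenset(), time(2, 0), "postgres", "scheduled")
other = Query("bob", frozenset(), time(2, 0), "s3", "adhoc")
```

Here `assign_queue(daytime)` yields the high-priority queue, `assign_queue(nightly)` the background queue, and `assign_queue(other)` falls through to the default.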
The Kubernetes support comes via an official Docker image and templates for elastic, highly available deployments using the Kubernetes orchestration framework.
Elsewhere there’s a new declarative engine for relational database sources that is designed to provide more efficient processing on systems such as Postgres, SQL Server, Oracle, and Teradata, along with support for new data sources including Azure Data Lake Store, Elasticsearch 6, AWS S3 GovCloud, and Teradata.