What the Heck is GlareDB?

by Shawn Gordon, September 20th, 2023

Too Long; Didn't Read

GlareDB is an open-source query engine written in Rust. It can query data in GCS or S3 as well as sources such as BigQuery, MongoDB, Postgres, and Snowflake, with ClickHouse and Redshift support coming soon. GlareDB also adds cloud storage and a new hybrid execution feature.

Introduction

It has been a while since my last “What the heck is?” article, and I’ve recently seen some rapid growth from GlareDB and wanted to learn more. What really piqued my interest were the recent announcements, including a new hybrid execution model. So, what the heck is GlareDB? Let’s take a look!

Overview

GlareDB is an open-source project built on the DataFusion project, part of the Apache Arrow project. DataFusion is a fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format. It offers SQL and Dataframe APIs, built-in support for CSV, Parquet, JSON, and Avro, and extensive customization possibilities. GlareDB is adding many features on top of it, such as cloud storage and the aforementioned hybrid execution feature, providing a layer on top of various compute engines that can:

Query local and remote files
Query other databases and data sources
Store data and queries (as views)
Copy data from sources to destinations
Interop with DataFrame libraries in Python
Run one-off queries from the command line

They describe how it fits in the stack in this diagram:

It can query data located on GCS or S3, as well as the following data sources:

BigQuery
MongoDB (early release)
MySQL
Postgres
Snowflake
Iceberg (preliminary support)
Redshift (coming soon)
ClickHouse (coming soon)

They are quickly adding support for various engines, so this list could be incomplete by the time you read this.

What can I do with it?

At first blush, you look at this and think, hey, this seems a lot like Trino in that it is a federated query engine. On second glance, it seems kind of like DuckDB for a couple of reasons. The first is that, like DuckDB, GlareDB is a single, tight executable, but written in Rust instead of C++. Second, they also support the hybrid execution model (MotherDuck did it first), which I’ll cover shortly.

Given that Trino is written in Java, there is a lot of Java ecosystem to deal with if you want to use it. Sure, there are pre-built Docker containers that can shorten this path, but generally, if you are “just trying to do something,” installing and setting up Trino is a heavy lift. With GlareDB, you can download and run a single executable or use their SaaS product, which looks like this when you first use it:

Now to Hybrid Execution. I’ll paraphrase some of what GlareDB had to say in their blog post on the topic. Say you have a CSV list of user IDs that some other tool extracted from your database. Now, you want to enrich that data with some of the users’ demographic information from your database. We’ll say our table name is user_demo and our CSV file is user_id.csv, and our query would look something like this:

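-- Look up demographics in user_demo for each ID in the local CSV file.
-- DISTINCT returns one row per user even if an ID repeats in the file.
SELECT DISTINCT
   m.user_id,
   m.first_name,
   m.last_name,
   m.birth_date
FROM
   user_demo m
INNER JOIN '/user_id.csv' u ON m.user_id = u.id;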

Clearly, this is a simple example, but you could extend it to pull information from other joined tables as well. You can also go in the other direction: start with a local file that has a key field plus some data that doesn’t exist in the database, and join it to a database table. Either way, you skip creating and loading a new table just for an ad-hoc report, which saves a lot of time.
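To make the “other direction” a bit more concrete, here is a rough sketch of what that kind of ad-hoc report looks like when you do the staging by hand. This is not GlareDB's API. It uses Python's built-in sqlite3 module as a stand-in for “your database,” and the loyalty_tier column, the sample rows, and the enrich_from_file and demo helpers are all made up for illustration. The temp-table step in the middle is the work that joining directly against the file path is meant to replace.

```python
import csv
import io
import sqlite3

# Hypothetical local file: a key field plus data the database does not have.
LOCAL_CSV = """id,loyalty_tier
1,gold
3,silver
"""


def enrich_from_file(conn, csv_text):
    """Join CSV rows onto user_demo; returns (user_id, first_name, loyalty_tier)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(int(r["id"]), r["loyalty_tier"]) for r in reader]

    # Manual staging: create a table and load the file into it.
    # This is the step the article says you get to skip.
    conn.execute("CREATE TEMP TABLE local_file (id INTEGER, loyalty_tier TEXT)")
    conn.executemany("INSERT INTO local_file VALUES (?, ?)", rows)

    # Same join shape as the article's query, driven from the file side.
    return conn.execute(
        """
        SELECT m.user_id, m.first_name, u.loyalty_tier
        FROM local_file u
        INNER JOIN user_demo m ON m.user_id = u.id
        ORDER BY m.user_id
        """
    ).fetchall()


def demo():
    """Build a tiny in-memory user_demo table (made-up data) and run the join."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE user_demo ("
        "user_id INTEGER PRIMARY KEY, first_name TEXT, "
        "last_name TEXT, birth_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO user_demo VALUES (?, ?, ?, ?)",
        [
            (1, "Ada", "Lovelace", "1815-12-10"),
            (2, "Alan", "Turing", "1912-06-23"),
            (3, "Grace", "Hopper", "1906-12-09"),
        ],
    )
    try:
        return enrich_from_file(conn, LOCAL_CSV)
    finally:
        conn.close()


if __name__ == "__main__":
    for row in demo():
        print(row)
```

Only users 1 and 3 appear in the file, so the inner join leaves user 2 out of the result.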

That’s all just meant to give you a quick tickle about what GlareDB can do and where it is at currently. The docs and blogs on their site are well done, making it pretty quick to jump in.

Summary

GlareDB is very interesting, and I appreciate how quickly they are iterating and updating the software. I need to spend some more time thinking about where it fits among the other tools in this space. Between the speed and the federated queries, there are some exciting possibilities. I really like the new hybrid execution, which could shortcut work in various situations. Try out a free account yourself if you’d like to give it a spin.

You can read the other “What the heck” articles at these links:

What The Heck Is DuckDB? (I was pretty out front on this one.)

What the Heck Is Malloy? (I was out front on this one, too.)

What the Heck is PRQL? (slower, but also growing)