I like to spend time on LinkedIn reading through posts from companies about news, new releases, funding, new products, etc. During a recent perusal, I saw something about "DuckDB."
DuckDB is often briefly described as "SQLite for analytic workloads." While SQLite is an embeddable, row-based, B-tree-indexed data store well suited for OLTP workloads, DuckDB is an embeddable, column-based data store that uses vectorized processing to optimize OLAP workloads. You could also think of it as an embedded OLAP database; the table below contrasts the two workload styles, and a short sketch after it shows what "embedded" means in practice:
| OLTP | OLAP |
|---|---|
| For your operational workloads | For your analytic workloads |
| Shorter queries | Longer queries for complex questions |
| Tables are more highly normalized | Tables are de-normalized |
| Typically implemented as row-oriented data stores | Typically implemented as column-oriented data stores |
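Embedded here means the same thing it does for SQLite: the engine runs inside your own process, with no server to install, configure, or connect to. As a minimal sketch of that, assuming a recent version of the `duckdb` Python package (`pip install duckdb`):

```python
import duckdb

# No server and no setup: duckdb.sql() runs against a
# default in-memory database inside this Python process.
print(duckdb.sql("SELECT 42 AS answer").fetchall())  # [(42,)]
```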
I decided to use their very clever WASM shell, which runs DuckDB entirely in the browser.
Then I did some basic SQL to check it out:
duckdb> select count(*) from userdata1.parquet;
┌──────────────┐
│ count_star() │
╞══════════════╡
│ 1000 │
└──────────────┘
Elapsed: 1 ms
duckdb>
duckdb> select first_name, last_name, email from userdata1.parquet where country = 'Nigeria';
┌────────────┬───────────┬────────────────────────────┐
│ first_name ┆ last_name ┆ email │
╞════════════╪═══════════╪════════════════════════════╡
│ Emily ┆ Stewart ┆ [email protected] │
│ Annie ┆ Torres ┆ [email protected] │
│ William ┆ Green ┆ [email protected] │
│ Jack ┆ Medina ┆ [email protected] │
│ Jeremy ┆ Bennett ┆ [email protected] │
│ Carlos ┆ Day ┆ [email protected] │
│ Ryan ┆ Mills ┆ [email protected] │
│ Betty ┆ Gibson ┆ [email protected] │
│ Wanda ┆ Stanley ┆ [email protected] │
│ Evelyn ┆ Spencer ┆ [email protected] │
│ George ┆ Howard ┆ [email protected] │
└────────────┴───────────┴────────────────────────────┘
Elapsed: 2 ms
duckdb> explain select first_name, last_name, email from userdata1.parquet where country = 'Nigeria';
┌───────────────────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ first_name │
│ last_name │
│ email │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PARQUET_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ country │
│ first_name │
│ last_name │
│ email │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ Filters: country=Nigeria │
│ AND country IS NOT NULL │
└───────────────────────────┘
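Note what the plan is doing: both the projection and the filter are pushed down into the Parquet scan, so DuckDB reads only the four columns it needs and applies `country = 'Nigeria'` while scanning, which is a big part of why the queries above are so fast. You can get the same plan from Python as well; here's a minimal sketch, assuming the `duckdb` package and the same `userdata1.parquet` file in the working directory:

```python
import duckdb

# An argument-less connect() gives an in-memory database
con = duckdb.connect()

# Through the Python API, EXPLAIN comes back as (key, plan_text) rows
plan = con.execute("""
    EXPLAIN
    SELECT first_name, last_name, email
    FROM 'userdata1.parquet'
    WHERE country = 'Nigeria'
""").fetchall()

for _, text in plan:
    print(text)
```

The Python API can do more than explain queries, of course; the snippet below uses it to create a persistent database and load the Parquet file into a proper table.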
```python
import duckdb

# Connect to (or create) a persistent DuckDB database file
con = duckdb.connect('myduckdb.duckdb')
cursor = con.cursor()

# Create a table whose schema matches userdata1.parquet
cursor.execute("""
CREATE TABLE userdata(
    registration_dttm timestamp,
    id int,
    first_name varchar,
    last_name varchar,
    email varchar,
    gender varchar,
    ip_address varchar,
    cc varchar,
    country varchar,
    birthdate varchar,
    salary float,
    title varchar,
    comments varchar
)
""")

# Load the Parquet file into the table
cursor.execute("COPY userdata FROM 'userdata1.parquet' (FORMAT PARQUET)")

print(cursor.execute('select count(*) from userdata').fetchall())

cursor.close()
con.close()
```
In the code snippet shown above, we connect to the ‘myduckdb.duckdb’ database, create a table that matches our parquet file, copy the data into it and then perform a simple count query.
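Spelling out the schema by hand isn't strictly necessary, though: DuckDB can infer it from the Parquet file itself, or query the file in place with no table at all. A minimal sketch of both, reusing the file and database names from above (`userdata2` is just an arbitrary new table name):

```python
import duckdb

con = duckdb.connect('myduckdb.duckdb')

# Let DuckDB infer the schema straight from the Parquet file
con.execute("CREATE TABLE userdata2 AS SELECT * FROM 'userdata1.parquet'")

# Or skip the table entirely and query the file in place
print(con.execute("SELECT count(*) FROM 'userdata1.parquet'").fetchall())

con.close()
```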
This is a really cool project. I first learned about the advantages of columnar data stores about eight years ago through Sisense, but I've only started working with them more extensively in the past year. I'm especially excited by their WASM implementation and the clever things they did with it.