Data engineers and scientists often work with a variety of tools to handle different types of data operations—from large-scale distributed processing to in-memory data manipulation. The alexmerced/spark35nb
Docker image simplifies this by offering a pre-configured environment where you can experiment with multiple popular data tools, including PySpark, Pandas, DuckDB, Polars, and DataFusion.
To get started, you'll first need to pull the alexmerced/spark35nb
Docker image from Docker Hub. This image comes with a pre-configured environment that includes Spark 3.5.2, JupyterLab, and many popular data manipulation libraries like Pandas, DuckDB, and Polars.
docker pull alexmerced/spark35nb
Next, run the container using the following command:
docker run -p 8888:8888 -p 4040:4040 -p 7077:7077 -p 8080:8080 -p 18080:18080 -p 6066:6066 -p 7078:7078 -p 8081:8081 alexmerced/spark35nb
Once the container is up and running, open your browser and navigate to localhost:8888 to access JupyterLab, where you will perform all your data operations.
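Once inside JupyterLab, you can optionally run a quick cell to confirm that the bundled libraries are importable. The following is a minimal sketch; the exact versions printed will depend on the image build.
import pyspark
import pandas as pd
import duckdb
import polars as pl
import datafusion
# Print the version of each bundled library
for name, module in [
    ("PySpark", pyspark),
    ("Pandas", pd),
    ("DuckDB", duckdb),
    ("Polars", pl),
    ("DataFusion", datafusion),
]:
    print(name, module.__version__)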
Now that you have your environment set up, we can move on to performing some basic data operations using PySpark, Pandas, DuckDB, Polars, and DataFusion.
from pyspark.sql import SparkSession
# Initialize the Spark session
spark = SparkSession.builder.appName("PySpark Example").getOrCreate()
# Sample data: a list of tuples containing names and ages
data = [("Alice", 34), ("Bob", 45), ("Catherine", 29)]
# Create a DataFrame
df = spark.createDataFrame(data, ["Name", "Age"])
# Show the DataFrame
df.show()
In this example, we created a DataFrame with three rows of data, representing people's names and ages. The df.show() function allows us to display the contents of the DataFrame, making it easy to inspect the data we just created.
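If you want to persist this DataFrame for later steps, you can also write it out, for example as Parquet. The snippet below is a small sketch; the data/people_parquet path is just an illustrative location inside the container.
# Write the DataFrame to Parquet (the path is illustrative)
df.write.mode("overwrite").parquet("data/people_parquet")
# Read it back to verify the round trip
spark.read.parquet("data/people_parquet").show()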
# Load a CSV file into a DataFrame
df_csv = spark.read.csv("data/people.csv", header=True, inferSchema=True)
# Show the first few rows of the DataFrame
df_csv.show()
# Filter the data to only include people older than 30
df_filtered = df_csv.filter(df_csv["Age"] > 30)
# Show the filtered DataFrame
df_filtered.show()
# Group by Age and count the number of people in each age group
df_grouped = df_csv.groupBy("Age").count()
# Show the result of the grouping
df_grouped.show()
In this example, we loaded a CSV file into a PySpark DataFrame using spark.read.csv(). Then we applied two different operations: a filter that keeps only people older than 30, and a groupBy aggregation that counts the number of people in each age group.
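The same filter and aggregation can also be expressed as SQL by registering the DataFrame as a temporary view; the snippet below is a brief sketch of that equivalent approach.
# Register the CSV-backed DataFrame as a temporary view
df_csv.createOrReplaceTempView("people")
# Run the filter and aggregation as a single SQL query
spark.sql("""
    SELECT Age, COUNT(*) AS count
    FROM people
    WHERE Age > 30
    GROUP BY Age
""").show()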
import pandas as pd
# Load a CSV file into a Pandas DataFrame
df_pandas = pd.read_csv("data/people.csv")
# Display the first few rows of the DataFrame
print(df_pandas.head())
In this example, we read the CSV file people.csv using pd.read_csv() and loaded it into a Pandas DataFrame. The head() method lets you view the first few rows of the DataFrame, which is useful for quickly inspecting the data.
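Beyond head(), a couple of other built-in inspection methods are useful for a first look at a new dataset:
# Column names, dtypes, and non-null counts
df_pandas.info()
# Summary statistics for the numeric columns
print(df_pandas.describe())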
# Filter the data to show only people older than 30
df_filtered = df_pandas[df_pandas["Age"] > 30]
# Display the filtered data
print(df_filtered)
# Group the data by 'Age' and count the number of people in each age group
df_grouped = df_pandas.groupby("Age").count()
# Display the grouped data
print(df_grouped)
Here, we filtered the data to include only people older than 30 using a simple boolean expression. Then, we used the groupby() function to group the DataFrame by age and count the number of people in each age group.
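If you want to hand these results to the other engines later, you can also persist them to a columnar file. This is a minimal sketch, assuming a Parquet engine such as pyarrow is available in the image; the output path is illustrative.
# Persist the filtered rows as Parquet (the path is illustrative)
df_filtered.to_parquet("data/people_over_30.parquet", index=False)
# Read it back to confirm the file is usable
print(pd.read_parquet("data/people_over_30.parquet"))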
Pandas is incredibly efficient for in-memory data operations, making it a go-to tool for smaller datasets that can fit in your machine's memory. In the next section, we’ll explore DuckDB, a SQL-based tool that enables fast querying over in-memory data.
import duckdb
# Connect to an in-memory DuckDB instance
conn = duckdb.connect()
# Create a table in DuckDB from the Pandas DataFrame
conn.execute("CREATE TABLE people AS SELECT * FROM df_pandas")
# Show the content of the 'people' table
conn.execute("SELECT * FROM people").df()
In this example, we connected to DuckDB and created a new table people from the Pandas DataFrame df_pandas. DuckDB’s execute()
function allows you to run SQL commands, making it easy to interact with data using SQL queries.
# Query to select people older than 30
result = conn.execute("SELECT Name, Age FROM people WHERE Age > 30").df()
# Display the result of the query
print(result)
# Query to group people by age and count the number of people in each age group
result_grouped = conn.execute("SELECT Age, COUNT(*) as count FROM people GROUP BY Age").df()
# Display the grouped result
print(result_grouped)
In this example, we used SQL to filter the people table, selecting only those who are older than 30. We then ran a grouping query to count the number of people in each age group.
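DuckDB can also query files directly without creating a table first. The sketch below scans the same data/people.csv file used earlier in place:
# Query the CSV file directly with DuckDB's CSV reader
result_csv = conn.execute(
    "SELECT Name, Age FROM read_csv_auto('data/people.csv') WHERE Age > 30"
).df()
print(result_csv)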
DuckDB is an excellent choice when you need SQL-like functionality directly in your Python environment. It allows you to leverage the power of SQL without the overhead of setting up and managing a database server. In the next section, we will explore Polars, a DataFrame library known for its speed and efficiency.
import polars as pl
# Create a Polars DataFrame
df_polars = pl.DataFrame({
"Name": ["Alice", "Bob", "Catherine"],
"Age": [34, 45, 29]
})
# Display the Polars DataFrame
print(df_polars)
In this example, we created a Polars DataFrame using a Python dictionary. The syntax is similar to Pandas, but the operations are optimized for speed. Polars offers lazy evaluation, which means it can optimize the execution of multiple operations at once, reducing computation time.
# Filter the DataFrame to show only people older than 30
df_filtered = df_polars.filter(pl.col("Age") > 30)
# Display the filtered DataFrame
print(df_filtered)
# Group by 'Age' and count the number of people in each age group
df_grouped = df_polars.group_by("Age").agg(pl.len().alias("count"))
# Display the grouped result
print(df_grouped)
In this example, we filtered the data to show only rows where the age is greater than 30, and then we grouped the data by age to count how many people are in each group. These operations are highly efficient in Polars due to its optimized memory management and query execution engine.
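As a quick illustration of the lazy evaluation mentioned above, the same filter and count can be expressed against a LazyFrame so Polars can optimize the whole query before executing it. This is a minimal sketch reusing the df_polars data:
# Build a lazy query plan; nothing executes until collect() is called
lazy_result = (
    df_polars.lazy()
    .filter(pl.col("Age") > 30)
    .group_by("Age")
    .agg(pl.len().alias("count"))
)
# Execute the optimized plan and materialize the result
print(lazy_result.collect())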
Polars is ideal when you need the speed of a DataFrame library for both small and large datasets, and when performance is a key requirement. Next, we will explore DataFusion, a tool for SQL-based querying over Apache Arrow data.
import pyarrow as pa
from datafusion import SessionContext
# Initialize a DataFusion session
ctx = SessionContext()
# Sample data as an Arrow table (a list of row dictionaries)
table = pa.Table.from_pylist([
    {"Name": "Alice", "Age": 34},
    {"Name": "Bob", "Age": 45},
    {"Name": "Catherine", "Age": 29}
])
# Register the Arrow data as a table named 'people'
ctx.register_record_batches("people", [table.to_batches()])
# Query the data to select people older than 30
result = ctx.sql("SELECT Name, Age FROM people WHERE Age > 30").collect()
# Display the result
print(result)
In this example, we used DataFusion's SessionContext to register Arrow data as a table named people. We then performed a simple SQL query to filter the data for people older than 30. DataFusion allows you to combine the power of SQL with the speed and efficiency of Apache Arrow's in-memory format.
# Group by 'Age' and count the number of people in each age group
result_grouped = ctx.sql("SELECT Age, COUNT(*) as count FROM people GROUP BY Age").collect()
# Display the grouped result
print(result_grouped)
In this query, we grouped the data by the 'Age' column and counted how many people were in each age group. DataFusion’s SQL execution engine ensures that queries run efficiently, even on large datasets stored in-memory.
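When you want to hand a DataFusion result to other Python tools, the query result can be converted directly instead of collected as raw record batches. This is a small sketch, assuming a recent datafusion release that exposes to_pandas() on its DataFrame:
# Run the same aggregation and convert the result to a Pandas DataFrame
df_counts = ctx.sql("SELECT Age, COUNT(*) AS count FROM people GROUP BY Age").to_pandas()
print(df_counts)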
DataFusion is a great tool for users who need fast, SQL-based querying of large in-memory datasets and want to take advantage of Apache Arrow’s high-performance columnar data format. It’s particularly useful for building analytical pipelines that involve heavy querying of structured data.
In this section, we will demonstrate how to set up Dremio in a Docker container and use Python to query Dremio's data sources using the dremio-simple-query
library.
docker run -p 9047:9047 -p 31010:31010 -p 45678:45678 -p 32010:32010 -e DREMIO_JAVA_SERVER_EXTRA_OPTS=-Dpaths.dist=file:///opt/dremio/data/dist --name try-dremio dremio/dremio-oss
Once Dremio is up and running, navigate to http://localhost:9047 in your browser to access the Dremio UI. Here, you can configure your data sources, create virtual datasets, and explore the platform's capabilities.
pip install dremio-simple-query
from dremio_simple_query.connect import get_token, DremioConnection
from os import getenv
from dotenv import load_dotenv
# Load environment variables from a .env file (optional, e.g., for credentials)
load_dotenv()
# Login to Dremio and get a token
login_endpoint = "http://{host}:9047/apiv2/login"
payload = {
"userName": "your_username",
"password": "your_password"
}
token = get_token(uri=login_endpoint, payload=payload)
# Dremio Arrow Flight endpoint, make sure to put in the right host for your Dremio instance
arrow_endpoint = "grpc://{host}:32010"
# Establish connection to Dremio using Arrow Flight
dremio = DremioConnection(token, arrow_endpoint)
If you are running this locally using the docker run command above, the host should be the IP address of the Dremio container on the Docker network, which you can find by running docker inspect on the container.
In this code, we use the get_token
function to retrieve an authentication token from Dremio's REST API and establish a connection to Dremio's Arrow Flight endpoint.
# Query Dremio and return data as an Apache Arrow Table
stream = dremio.toArrow("SELECT * FROM my_table;")
arrow_table = stream.read_all()
# Display Arrow Table
print(arrow_table)
# Query Dremio and return data as a Pandas DataFrame
df = dremio.toPandas("SELECT * FROM my_table;")
print(df)
# Query Dremio and return data as a Polars DataFrame
df_polars = dremio.toPolars("SELECT * FROM my_table;")
print(df_polars)
# Query Dremio and return as a DuckDB relation
duck_rel = dremio.toDuckDB("SELECT * FROM my_table")
# Perform a query on the DuckDB relation
result = duck_rel.query("my_table", "SELECT * FROM my_table WHERE Age > 30").fetchall()
# Display results
print(result)
With the dremio-simple-query
library, you can efficiently query large datasets from Dremio and immediately start analyzing them with various tools like Pandas, Polars, and DuckDB, all while leveraging the high-performance Apache Arrow format under the hood.
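For example, an Arrow table fetched from Dremio can be queried locally with DuckDB without copying the data, since DuckDB can scan pyarrow tables referenced in the local Python scope. The snippet below is a short sketch reusing the arrow_table from the earlier example:
import duckdb
# DuckDB resolves 'arrow_table' to the pyarrow Table in local scope
local_con = duckdb.connect()
over_30 = local_con.execute("SELECT Name, Age FROM arrow_table WHERE Age > 30").df()
print(over_30)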
In this blog, we explored how to use a variety of powerful tools for data operations within a Python notebook environment. Starting with the alexmerced/spark35nb
Docker image, we demonstrated how to set up a development environment that includes PySpark, Pandas, DuckDB, Polars, and DataFusion—each optimized for different data processing needs. We showcased basic operations like writing, querying, and aggregating data using each tool’s unique strengths.
Finally, we introduced Dremio, which integrates with Apache Arrow to enable lightning-fast queries across a range of data sources. With the dremio-simple-query
library, Dremio allows analysts to quickly fetch and analyze data using tools like Pandas, Polars, and DuckDB, ensuring that data is available when and where it's needed without the overhead of traditional data warehouses.