This is the fifth post in my series, Towards Open Options Chains: A Data Pipeline for Collecting Options Data at Scale.
In this post, we’ll be using Docker. I understand that this might be a big jump for some: from running services through the shell to deploying containers. It is a different paradigm and requires you to learn a new toolkit. That’s why I summarised the launch and teardown code in the penultimate section of Part IV: Building the DAG - it’s for readers who are happy with the pipeline we have built so far and don’t feel the need to venture further. That’s perfectly fine.
If you’ve ever built Lego, you’d know that the first two options are out of the question. Option 1 sounds plain silly. We need to know where exactly each brick goes! Option 2 is a slight improvement, but it’s not guaranteed that we can reproduce the product. What if the customer builds a section of the house, only to find that there are insufficient bricks to create a ceiling?
Option 3 is what we expect from a Lego product. We have everything we need to re-assemble the house. It doesn’t matter who is assembling it or where it is shipped from - it works!
root
├── dags # Contains DAGs to be made accessible to Airflow
│ └── open_options_chains.py # Our dynamic DAG script
├── db # Contains scripts for the Postgres instance
│ └── init-postgres.sh # Shell script to set up Postgres
├── logs # Contains logs from the Airflow instance
├── pgdata # Contains data from the Postgres instance
├── scripts # Contains scripts to initialise Airflow and run its services
│ ├── init-entrypoint.sh # Shell script to initialise Airflow
│ ├── scheduler-entrypoint.sh # Shell script to launch scheduler
│ └── webserver-entrypoint.sh # Shell script to launch webserver
├── .env # File containing environment variables
└── docker-compose.yml # File specifying what services to run, along with the configs
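If you’re putting this folder structure together from scratch, the skeleton can be created with a few commands. This is just a convenience sketch and not a step from the original walkthrough; the DAG script and shell scripts themselves come from the earlier parts of this series and the sections below.

# Create the folder skeleton for the app (sketch; adjust paths as needed)
mkdir -p dags db logs pgdata scripts
# Placeholder files, to be filled in from this post and earlier parts of the series
touch .env docker-compose.yml

The docker-compose.yml file itself is shown below.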
version: '3.8'
services:
  postgres:
    image: postgres
    env_file:
      - .env
    volumes:
      - ./db:/docker-entrypoint-initdb.d/
      - ./pgdata:/var/lib/postgresql/data
    ports:
      - "5432:5432"
  airflow-init:
    image: apache/airflow
    entrypoint: ./scripts/init-entrypoint.sh
    env_file:
      - .env
    volumes:
      - ./scripts:/opt/airflow/scripts
    restart: on-failure:10
    depends_on:
      - postgres
  scheduler:
    image: apache/airflow
    entrypoint: ./scripts/scheduler-entrypoint.sh
    restart: on-failure:10
    depends_on:
      - postgres
    env_file:
      - .env
    volumes:
      - ./scripts:/opt/airflow/scripts
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
  webserver:
    image: apache/airflow
    entrypoint: ./scripts/webserver-entrypoint.sh
    restart: on-failure:10
    depends_on:
      - postgres
      - scheduler
    env_file:
      - .env
    volumes:
      - ./scripts:/opt/airflow/scripts
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
    ports:
      - "8081:8080"
This file (.env) defines variables that our containers can use. The contents should be as shown below. Do remember to input your TD Ameritrade (TDA) API key.
# Postgres
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
POSTGRES_DB=airflow
APP_DB_USER=openoptions
APP_DB_PASS=openoptions
APP_DB_NAME=optionsdata
# Airflow
APP_AIRFLOW_USERNAME=admin
APP_AIRFLOW_PASSWORD=password
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres/${POSTGRES_DB}
AIRFLOW__CORE__EXECUTOR=LocalExecutor
# TDA API
API_KEY=<your TDA API key here>
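With docker-compose.yml and .env in place, an optional sanity check (not part of the original walkthrough) is to let Compose validate the file and print the configuration it will actually use:

# Validate docker-compose.yml and print the resolved configuration
docker-compose config
# List only the service names that Compose found
docker-compose config --services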
The Docker Compose file defines four services, which we’ll walk through in turn:

- postgres: An instance for the Postgres database
- airflow-init: A short-lived instance to initialize Airflow
- scheduler: An instance for the Airflow scheduler
- webserver: An instance for the Airflow webserver
postgres:
  image: postgres
  env_file:
    - .env
  volumes:
    - ./db:/docker-entrypoint-initdb.d/
    - ./pgdata:/var/lib/postgresql/data
  ports:
    - "5432:5432"
- The image tag defines which pre-existing image on Docker Hub (see below) the container should be run on. In this case, we’re using the official postgres image.
- The env_file tag points to the environment file we created earlier. This allows us to define variables in one place, and our containers can then use them.
- The volumes tag indicates which local directories are mounted into the container. Any data generated in the specified folder inside the container is persisted in the local directory. For our app, we mount two volumes (a sketch of the initialization script follows the note below):
  - ./db on our local machine, which contains the initialization scripts for the database, to /docker-entrypoint-initdb.d/ inside the container. Scripts in the latter are run when the container is launched.
  - ./pgdata on our local machine, which persists the Postgres data, to /var/lib/postgresql/data inside the container, where Postgres stores its data.
- The ports tag exposes a port to our local machine. The first port number is the port on our local machine, and the second is the port inside the container for the Postgres service.
Note: Docker Hub is an open online repository for creating, managing, and sharing container images with others.
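The contents of db/init-postgres.sh aren’t reproduced in this post. As a rough sketch only - assuming it creates the application user and database from the APP_DB_* variables in .env - it would look something like this:

#!/usr/bin/env bash
# Sketch only: assumes the script creates the application user and database
# from the APP_DB_* variables in .env. Adapt to the actual script you use.
set -e

psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
    CREATE USER $APP_DB_USER WITH PASSWORD '$APP_DB_PASS';
    CREATE DATABASE $APP_DB_NAME;
    GRANT ALL PRIVILEGES ON DATABASE $APP_DB_NAME TO $APP_DB_USER;
EOSQL

Because the script sits in /docker-entrypoint-initdb.d/, the postgres image runs it automatically the first time the container starts with an empty pgdata directory.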
airflow-init:
  image: apache/airflow
  entrypoint: ./scripts/init-entrypoint.sh
  env_file:
    - .env
  volumes:
    - ./scripts:/opt/airflow/scripts
  restart: on-failure:10
  depends_on:
    - postgres
- The image used is apache/airflow.
- The entrypoint tag specifies commands/executables that will always run when the container is launched. We will examine the script below.
- The restart tag specifies when to restart the container. We request that this service restart on failure, up to 10 times.
- The depends_on tag expresses the dependency between services. This affects the order in which services are started: the airflow-init service will start only after the Postgres instance has. Note that the Postgres service need not be ready for this service to start, which causes problems that our entrypoint script will resolve.
- The profiles tag helps to group services together. It comes in handy when we want to choose specific related services to run.
- The env_file and volumes tags do the same thing as they did for the Postgres service.
The entrypoint script (below) runs every time the airflow-init container is started: it waits for the Postgres service to be ready, initialises the Airflow metadata database, creates an admin account, and adds a connection for the options database.
#!/usr/bin/env bash
# Wait for Postgres service to be ready
until PGPASSWORD=$POSTGRES_PASSWORD psql -h "postgres" -U "$POSTGRES_USER" -c '\q'; do
    >&2 echo "PostgreSQL service unavailable. Retrying in 5 seconds..."
    sleep 5
done
>&2 echo "PostgreSQL service started successfully. Initialising Airflow..."
# Initialise database
airflow db init
# Create admin account (any valid email address will do here)
airflow users create -u "$APP_AIRFLOW_USERNAME" -p "$APP_AIRFLOW_PASSWORD" -f Firstname -l Lastname -r Admin -e admin@example.com
# Add connection to the options database
airflow connections add 'postgres_optionsdata' \
    --conn-type 'postgres' \
    --conn-login "$APP_DB_USER" \
    --conn-password "$APP_DB_PASS" \
    --conn-host 'postgres' \
    --conn-port '5432' \
    --conn-schema "$APP_DB_NAME"
The values for these variables come from the .env file we defined earlier.
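As an optional sanity check (not part of the original walkthrough), once the full stack is up - see the launch steps later in this post - you can confirm that the admin user and the postgres_optionsdata connection were created:

# List Airflow users and connections inside the running webserver container
docker-compose exec webserver airflow users list
docker-compose exec webserver airflow connections list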
scheduler:
  image: apache/airflow
  entrypoint: ./scripts/scheduler-entrypoint.sh
  restart: on-failure:10
  depends_on:
    - postgres
  env_file:
    - .env
  volumes:
    - ./dags:/opt/airflow/dags
    - ./scripts:/opt/airflow/scripts
    - ./logs:/opt/airflow/logs
- The image used is apache/airflow - the same as for the Airflow initialization service defined previously.
- The entrypoint tag points to the scheduler’s own entrypoint script. We’ll discuss this along with the entrypoint script for the webserver below.
- The restart, depends_on, env_file and volumes tags do the same thing as they did for the other services.
- Under the volumes tag, we mount the DAGs, scripts, and logs folders from our local machine to the Docker container. This allows us to persist our DAGs, write scripts, and monitor logs on our local disk.
webserver:
  image: apache/airflow
  entrypoint: ./scripts/webserver-entrypoint.sh
  restart: on-failure:10
  depends_on:
    - postgres
    - scheduler
  env_file:
    - .env
  volumes:
    - ./scripts:/opt/airflow/scripts
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
  ports:
    - "8081:8080"
The ports tag specifies the port on our local machine (8081) that we can use to access the webserver (8080 inside the container). The scheduler’s entrypoint script (below) simply waits for the Postgres service to be ready before launching the scheduler:
#!/usr/bin/env bash
# Wait for Postgres service to be ready
until PGPASSWORD=$POSTGRES_PASSWORD psql -h "postgres" -U "$POSTGRES_USER" -c '\q'; do
    >&2 echo "PostgreSQL service unavailable. Retrying in 5 seconds..."
    sleep 5
done
>&2 echo "PostgreSQL service started successfully. Launching Airflow Scheduler..."
# Launch scheduler
airflow scheduler
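The webserver’s entrypoint script (scripts/webserver-entrypoint.sh) isn’t reproduced here. Assuming it mirrors the scheduler’s - wait for Postgres, then launch the corresponding Airflow service - a sketch would look like this:

#!/usr/bin/env bash
# Sketch of scripts/webserver-entrypoint.sh, assumed to mirror the scheduler's script
# Wait for Postgres service to be ready
until PGPASSWORD=$POSTGRES_PASSWORD psql -h "postgres" -U "$POSTGRES_USER" -c '\q'; do
    >&2 echo "PostgreSQL service unavailable. Retrying in 5 seconds..."
    sleep 5
done
>&2 echo "PostgreSQL service started successfully. Launching Airflow Webserver..."
# Launch webserver
airflow webserver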
With the scripts in place, we can launch the app. First, give the Airflow containers write access to the dags and logs folders (run this from the app’s root folder):

chmod -R 777 dags
chmod -R 777 logs

Then run the Airflow initialization service:

cd <root folder for app>
docker-compose up airflow-init
There will be some activity in the terminal as Docker Compose launches Postgres and then the Airflow initialization service. Once the initialization is complete, the Airflow initialization service will shut down. However, the Postgres service will continue running. You can stop it with docker-compose stop if you want. Next, launch the app in full:
docker-compose up -d
The -d flag runs the containers in detached mode, i.e. in the background. This allows you to close your terminal. In a browser, go to http://localhost:8081 to access the UI for your Airflow instance.
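If you’d like to confirm that the services came up cleanly before opening the browser, these optional commands (not part of the original walkthrough) report container status and stream logs:

# Check that the postgres, scheduler, and webserver containers are running
docker-compose ps
# Follow the scheduler's logs (Ctrl+C to stop following)
docker-compose logs -f scheduler

When you’re done with the app, stop it: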
docker-compose stop
This will stop the containers without deleting them. Airflow will remain initialized, and when you launch the app again with docker-compose up, all your data and settings will be intact.
If you happened to do a complete shutdown with docker-compose down instead, don’t sweat it. This command stops the running containers and removes the containers (but not the images) and the networks defined for the app. Fortunately, because we persisted the data, scripts, logs, and DAGs on our local disk, re-launching the app with docker-compose up (1) will not re-initialize the app, and (2) will keep all your data and settings intact.
To completely reset the app, shut it down fully, then delete everything inside the logs and pgdata folders on your local disk:
# Completely shut down app
docker-compose down
# Remove contents of mounted volumes
rm -rf logs/*
rm -rf pgdata/*
There are several ways to extract data from the app. First, we can export the tables to CSV files. We need to (1) find the ID of the running Postgres container, (2) navigate to a folder for storing the CSV files, and (3) run a command in the container to save the data to a file. For (3), we run psql -U airflow -d optionsdata -c "..." inside the Postgres container using docker exec -t, and write the output to our target file with > filename.csv.
# Check Postgres container ID
docker ps
# Go to folder to store files
cd <directory-to-store-csv-files>
# Extract CSV file
docker exec -t <first-few-characters-of-docker-container> psql -U airflow -d optionsdata -c "COPY table_name to STDOUT WITH CSV HEADER" > "filename.csv"
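As a variation on the above - assuming you run it from the app’s root folder - you can let Docker Compose resolve the container by its service name instead of looking up the container ID:

# Same COPY command, but addressing the container by its Compose service name
# (run from the folder containing docker-compose.yml)
docker-compose exec -T postgres psql -U airflow -d optionsdata -c "COPY table_name TO STDOUT WITH CSV HEADER" > "filename.csv"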
The second way to extract data is to connect directly to the Postgres instance. Notice that we used the ports tag in the Docker Compose file to expose a port so we could access Postgres from our host machine. We use psycopg2 to establish the connection in Python.
import psycopg2 as pg2
# Connect to database
conn = pg2.connect(host='localhost', database='optionsdata', user='airflow', password='airflow', port='5432')
You can then query the data using the connection object conn, just as you would for any other Postgres database.
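Alternatively, assuming you have the psql client installed on your local machine, the exposed port lets you query the database directly from the shell (table_name below is a stand-in for one of your options tables):

# Query Postgres through the port exposed on the host machine
PGPASSWORD=airflow psql -h localhost -p 5432 -U airflow -d optionsdata -c "SELECT * FROM table_name LIMIT 5;"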