Weaviate is a pioneering, open-source vector database, designed to enhance semantic search through the utilization of machine learning models. Unlike traditional search engines that rely on keyword matching, Weaviate employs semantic similarity principles. This innovative approach transforms various forms of data (texts, images, and more) into vector representations, numerical forms that capture the essence of the data’s context and meaning. By analyzing the similarities between these vectors, Weaviate delivers search results that truly understand the user’s intent, offering a significant leap beyond the limitations of keyword-based searches.
This guide demonstrates how to integrate MinIO and Weaviate, combining Kubernetes-native object storage with AI-powered semantic search. Using Docker Compose for container orchestration, it walks through building a robust, scalable, and efficient data management system. The setup is aimed at developers, DevOps engineers, and data scientists who want to combine modern storage solutions with AI-driven data retrieval.
In this demonstration, we'll be focusing on backing up Weaviate with MinIO buckets using Docker. This setup ensures data integrity and accessibility in our AI-enhanced search and analysis projects.
The docker-compose.yaml file provided here establishes a straightforward setup for Weaviate. This configuration enables a robust environment where MinIO acts as a secure storage service and Weaviate uses this storage for its backups.
version: '3.8'
services:
  weaviate:
    container_name: weaviate_server
    image: semitechnologies/weaviate:latest
    ports:
      - "8080:8080"
    environment:
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      ENABLE_MODULES: 'backup-s3'
      BACKUP_S3_BUCKET: 'weaviate-backups'
      BACKUP_S3_ENDPOINT: 'play.min.io:443'
      BACKUP_S3_ACCESS_KEY_ID: 'minioadmin'
      BACKUP_S3_SECRET_ACCESS_KEY: 'minioadmin'
      BACKUP_S3_USE_SSL: 'true'
      CLUSTER_HOSTNAME: 'node1'
    volumes:
      - ./weaviate/data:/var/lib/weaviate
Docker-Compose: Deploy Weaviate with the backup-s3 module enabled and the play.min.io MinIO server
With the above docker-compose.yaml, Weaviate is configured to use MinIO for backups. The key environment variables are ENABLE_MODULES, set to backup-s3, and the settings for the S3 bucket, endpoint, access keys, and SSL usage. Additionally, PERSISTENCE_DATA_PATH determines where data is persistently stored, and CLUSTER_HOSTNAME identifies the node.
ENABLE_MODULES: 'backup-s3'
BACKUP_S3_BUCKET: 'weaviate-backups'
BACKUP_S3_ENDPOINT: 'play.min.io:443'
BACKUP_S3_ACCESS_KEY_ID: 'minioadmin'
BACKUP_S3_SECRET_ACCESS_KEY: 'minioadmin'
BACKUP_S3_USE_SSL: 'true'
PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
CLUSTER_HOSTNAME: 'node1'
Note: The MinIO bucket must exist beforehand; Weaviate will not create it for you.
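Because the bucket has to exist before Weaviate starts writing backups, it can help to create it from a script. The following is a minimal sketch, assuming the MinIO Python SDK is installed (pip install minio) and reusing the endpoint and credentials from the compose file. The helper names ensure_bucket and connect_from_compose are hypothetical and only serve this illustration.

```python
def ensure_bucket(client, bucket_name):
    """Create bucket_name if it is missing; return True when it was created.

    `client` is any object exposing bucket_exists() and make_bucket(),
    such as an instance of minio.Minio.
    """
    if client.bucket_exists(bucket_name=bucket_name):
        return False
    client.make_bucket(bucket_name=bucket_name)
    return True


def connect_from_compose():
    """Build a MinIO client from the values used in docker-compose.yaml."""
    from minio import Minio  # imported lazily; requires `pip install minio`

    return Minio(
        endpoint="play.min.io:443",
        access_key="minioadmin",
        secret_key="minioadmin",
        secure=True,  # mirrors BACKUP_S3_USE_SSL: 'true'
    )
```

Calling ensure_bucket(connect_from_compose(), "weaviate-backups") once before docker-compose up should be enough; running it again is harmless because an existing bucket is left untouched.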
Saving or Updating the Docker Compose File
docker-compose up -d --build
During the build and execution process, Docker Compose creates the persistent directory specified in the docker-compose.yaml file (./weaviate/data for Weaviate). Because Weaviate stores its data there, the data remains intact across container restarts and redeployments.
Once you’ve deployed your Docker Compose stack, you can visit your Weaviate server’s URL in a browser, followed by /v1/meta, to check that your deployment configuration is correct.
The first line of the JSON payload at http://localhost:8080/v1/meta should look like this:
{"hostname":"http://[::]:8080","modules":{"backup-s3":{"bucketName":"weaviate-backups","endpoint":"play.min.io:443","useSSL":true}...[truncated]...}
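The same check can be scripted instead of being read off in a browser. This is a rough sketch using only the standard library; it assumes the server is reachable at http://localhost:8080 and that the payload has the shape shown above. The function names are hypothetical.

```python
import json
from urllib.request import urlopen


def backup_module_problems(meta, expected_bucket="weaviate-backups"):
    """Return a list of human-readable problems found in a /v1/meta payload."""
    module = (meta.get("modules") or {}).get("backup-s3")
    if module is None:
        return ["backup-s3 module is not enabled"]
    problems = []
    if module.get("bucketName") != expected_bucket:
        problems.append(f"unexpected bucket: {module.get('bucketName')!r}")
    if not module.get("endpoint"):
        problems.append("no S3 endpoint configured")
    return problems


def fetch_meta(base_url="http://localhost:8080"):
    """Download and parse /v1/meta from a running Weaviate server."""
    with urlopen(f"{base_url}/v1/meta") as response:
        return json.load(response)
```

With the stack running, backup_module_problems(fetch_meta()) returns an empty list when the module and bucket match the compose file.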
weaviate-backups Bucket
To integrate Weaviate with MinIO, the Access Policy of the designated backup bucket, weaviate-backups, must be set to Public. This adjustment grants the Weaviate backup-s3 module the permissions it needs to interact with the MinIO bucket for backup operations.
Note: In a production environment you probably need to lock this down, which is beyond the scope of this tutorial.
It's essential to approach this configuration with a clear understanding of the security implications of setting a bucket to “public”. While this setup facilitates the backup process in a development environment, alternative approaches, such as fine-grained access controls through IAM policies or “presigned” URLs, should be considered for production systems to maintain data security and integrity.
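As an illustration of the fine-grained option, the sketch below generates an S3-style policy document scoped to the backup bucket, which could be attached to a dedicated MinIO user instead of making the bucket public. The list of actions is an assumption about what a backup workflow needs (list, read, write, delete), not a confirmed requirement of the backup-s3 module, and the function name is hypothetical.

```python
import json


def backup_bucket_policy(bucket="weaviate-backups"):
    """Build an S3-style policy document scoped to a single bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # bucket-level access: locate and list the bucket
                "Effect": "Allow",
                "Action": ["s3:GetBucketLocation", "s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {   # object-level access: read, write, and delete backup files
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


print(json.dumps(backup_bucket_policy(), indent=2))
```

Keeping the policy in code makes it easy to review which permissions the backup user actually holds and to tighten the action list after testing.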
By the end of this demonstration you will be able to see the bucket objects that Weaviate creates throughout the process when using the backup-s3 module.
Before diving into the technical operations, I should mention that I am demonstrating the following steps in a JupyterLab environment, which has the added benefit of encapsulating our pipeline in a notebook.
The first step is to set up the environment by installing the weaviate-client library for Python with pip. This package interfaces with Weaviate's RESTful API in a more Pythonic way, allowing for seamless interaction with the database for operations such as schema creation, data indexing, backup, and restoration.
In this demonstration we are using the Weaviate v3 API, so you might see a message like the one below when you run the Python script:
`DeprecationWarning: Dep016: You are using the Weaviate v3 client, which is deprecated.
Consider upgrading to the new and improved v4 client instead!
See here for usage: https://weaviate.io/developers/weaviate/client-libraries/python
warnings.warn(`
This message is only a warning and can be ignored.
!pip install weaviate-client
This section introduces the data structure and schema for 'Article' and 'Author' classes, laying the foundation for how data will be organized. It demonstrates how to programmatically define and manage the schema within Weaviate, showcasing the flexibility and power of Weaviate to adapt to various data models tailored to specific application needs.
import weaviate

# Connect to the Weaviate instance started by Docker Compose
client = weaviate.Client("http://localhost:8080")

# Schema classes to be created
schema = {
    "classes": [
        {
            "class": "Article",
            "description": "A class to store articles",
            "properties": [
                {"name": "title", "dataType": ["string"], "description": "The title of the article"},
                {"name": "content", "dataType": ["text"], "description": "The content of the article"},
                {"name": "datePublished", "dataType": ["date"], "description": "The date the article was published"},
                {"name": "url", "dataType": ["string"], "description": "The URL of the article"},
                {"name": "customEmbeddings", "dataType": ["number[]"], "description": "Custom vector embeddings of the article"}
            ]
        },
        {
            "class": "Author",
            "description": "A class to store authors",
            "properties": [
                {"name": "name", "dataType": ["string"], "description": "The name of the author"},
                {"name": "articles", "dataType": ["Article"], "description": "The articles written by the author"}
            ]
        }
    ]
}

# Remove any existing classes so the schema can be recreated from scratch
client.schema.delete_class('Article')
client.schema.delete_class('Author')
client.schema.create(schema)
Python: create schema classes
# JSON data to be ingested
data = [
    {
        "class": "Article",
        "properties": {
            "title": "LangChain: OpenAI + S3 Loader",
            "content": "This article discusses the integration of LangChain with OpenAI and S3 Loader...",
            "url": "https://blog.min.io/langchain-openai-s3-loader/",
            "customEmbeddings": [0.4, 0.3, 0.2, 0.1]
        }
    },
    {
        "class": "Article",
        "properties": {
            "title": "MinIO Webhook Event Notifications",
            "content": "Exploring the webhook event notification system in MinIO...",
            "url": "https://blog.min.io/minio-webhook-event-notifications/",
            "customEmbeddings": [0.1, 0.2, 0.3, 0.4]
        }
    },
    {
        "class": "Article",
        "properties": {
            "title": "MinIO Postgres Event Notifications",
            "content": "An in-depth look at Postgres event notifications in MinIO...",
            "url": "https://blog.min.io/minio-postgres-event-notifications/",
            "customEmbeddings": [0.3, 0.4, 0.1, 0.2]
        }
    },
    {
        "class": "Article",
        "properties": {
            "title": "From Docker to Localhost",
            "content": "A guide on transitioning from Docker to localhost environments...",
            "url": "https://blog.min.io/from-docker-to-localhost/",
            "customEmbeddings": [0.4, 0.1, 0.2, 0.3]
        }
    }
]

# Create one Weaviate object per entry, using the entry's class name
for item in data:
    client.data_object.create(
        data_object=item["properties"],
        class_name=item["class"]
    )
Python: index data by class
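Before indexing, it can be useful to confirm that every object only uses properties its class declares, since a misspelled property name may otherwise be rejected or silently added, depending on the server's auto-schema settings. The helper below is a small sketch that works on dictionaries shaped like the schema and data variables above; undeclared_properties is a hypothetical name, and the tiny example data (including the deliberate typo "titel") is made up for illustration.

```python
def undeclared_properties(schema, objects):
    """Map each object's index to the property names its class does not declare."""
    declared = {
        cls["class"]: {prop["name"] for prop in cls["properties"]}
        for cls in schema["classes"]
    }
    problems = {}
    for index, obj in enumerate(objects):
        known = declared.get(obj["class"], set())
        extra = sorted(set(obj["properties"]) - known)
        if extra:
            problems[index] = extra
    return problems


# Illustrative input: the first object misspells "title" as "titel"
example_schema = {
    "classes": [
        {"class": "Article", "properties": [{"name": "title"}, {"name": "content"}]}
    ]
}
example_objects = [
    {"class": "Article", "properties": {"titel": "Typo", "content": "..."}},
    {"class": "Article", "properties": {"title": "Fine", "content": "..."}},
]
print(undeclared_properties(example_schema, example_objects))  # {0: ['titel']}
```

Running undeclared_properties(schema, data) against the tutorial's own variables should return an empty dictionary.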
result = client.backup.create(
    backup_id="backup-id",
    backend="s3",
    include_classes=["Article", "Author"],  # specify classes to include or omit this for all classes
    wait_for_completion=True,
)
print(result)
Python: create backup
Expect:
{'backend': 's3', 'classes': ['Article', 'Author'], 'id': 'backup-id', 'path': 's3://weaviate-backups/backup-id', 'status': 'SUCCESS'}
Successful Backup Response
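A backup ID identifies one backup within a backend and, to our knowledge, Weaviate refuses to create a second backup under an ID that already exists, so rerunning the cell above with the same backup_id is likely to fail. One way around this is to derive the ID from a timestamp. The sketch below assumes IDs are limited to lowercase letters, digits, underscores, and hyphens; make_backup_id is a hypothetical helper.

```python
import re
from datetime import datetime, timezone

# Assumption: IDs may only contain lowercase letters, digits, "_" and "-"
_VALID_ID = re.compile(r"^[a-z0-9_-]+$")


def make_backup_id(prefix="backup", now=None):
    """Return an ID such as 'backup-20240131-093000' (UTC timestamp suffix)."""
    now = now or datetime.now(timezone.utc)
    backup_id = f"{prefix}-{now:%Y%m%d-%H%M%S}"
    if not _VALID_ID.match(backup_id):
        raise ValueError(f"invalid backup id: {backup_id!r}")
    return backup_id
```

The result can be passed as backup_id to client.backup.create, and the same value must then be kept and supplied when restoring.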
client.schema.delete_class("Article")
client.schema.delete_class("Author")
result = client.backup.restore(
    backup_id="backup-id",
    backend="s3",
    wait_for_completion=True,
)
print(result)
Python: restore backup
Expect:
{'backend': 's3', 'classes': ['Article', 'Author'], 'id': 'backup-id', 'path': 's3://weaviate-backups/backup-id', 'status': 'SUCCESS'}
Successful Restore Response
from weaviate.exceptions import BackupFailedError

try:
    result = client.backup.restore(
        backup_id="backup-id",
        backend="s3",
        wait_for_completion=True,
    )
    print("Backup restored successfully:", result)
except BackupFailedError as e:
    print("Backup restore failed with error:", e)
    # Here you can add logic to handle the failure, such as retrying the operation or logging the error.
Expect:
Backup restored successfully: {'backend': 's3', 'classes': ['Author', 'Article'], 'id': 'backup-id', 'path': 's3://weaviate-backups/backup-id', 'status': 'SUCCESS'}
Successful Backup Restoration
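The comment in the error handler mentions retrying. A generic way to do that is a small retry wrapper with exponential backoff, sketched here with the standard library only. The name with_retries and its parameters are hypothetical; the wrapper knows nothing about Weaviate and simply calls whatever function it is given.

```python
import time


def with_retries(operation, attempts=3, delay=2.0, retry_on=(Exception,), sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on as error:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            print(f"attempt {attempt} failed ({error}); retrying in {delay}s")
            sleep(delay)
            delay *= 2  # exponential backoff
```

In the notebook this could wrap the restore call, for example with_retries(lambda: client.backup.restore(backup_id="backup-id", backend="s3", wait_for_completion=True), retry_on=(BackupFailedError,)). Retrying only helps with transient failures such as a brief network interruption; a restore that fails for a permanent reason will fail on every attempt.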
client.schema.get("Article")
Returns the Article class as a JSON object
Expect:
{'class': 'Article', 'description': 'A class to store articles'... [Truncated]...}
So far we’ve shown you how to do this the Pythonic way. We thought it would be helpful to also show how the same operations can be achieved via curl, without writing a script.
Backups are triggered through a POST request to the backups endpoint, and restoration is done via a POST request to the restore endpoint. Each of these operations requires the appropriate JSON payload, typically provided as a file reference in the curl command using the @ symbol.
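To make the shape of these two requests concrete, here is a sketch that builds them with Python's standard library rather than the Weaviate client. The endpoint paths follow the curl commands in this post; the base URL assumes the local deployment, and the function names are hypothetical.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8080"  # assumed local deployment


def _post(url, payload):
    """Build a JSON POST request without sending it."""
    return Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def backup_request(backup_id, include=None, backend="s3"):
    """Build the POST request that starts a backup."""
    payload = {"id": backup_id}
    if include:
        payload["include"] = list(include)
    return _post(f"{BASE_URL}/v1/backups/{backend}", payload)


def restore_request(backup_id, exclude=None, backend="s3"):
    """Build the POST request that restores a backup."""
    payload = {"exclude": list(exclude)} if exclude else {}
    return _post(f"{BASE_URL}/v1/backups/{backend}/{backup_id}/restore", payload)


def send(request):
    """Send a prepared request and return the decoded JSON response."""
    with urlopen(request) as response:
        return json.load(response)
```

For example, send(backup_request("backup-id", ["Article", "Author"])) issues the same call as the backup curl command. Separating request construction from sending makes the payloads easy to inspect before anything reaches the server.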
I’ve included the following:
schema.json outlines the structure of the data we want to index.
data.json is where our actual data comes into play; its structure aligns with the classes in the schema.json file.
The schema.json and data.json files are available in the MinIO blog-assets repository.
{
  "classes": [
    {
      "class": "Article",
      "description": "A class to store articles",
      "properties": [
        {"name": "title", "dataType": ["string"], "description": "The title of the article"},
        {"name": "content", "dataType": ["text"], "description": "The content of the article"},
        {"name": "datePublished", "dataType": ["date"], "description": "The date the article was published"},
        {"name": "url", "dataType": ["string"], "description": "The URL of the article"},
        {"name": "customEmbeddings", "dataType": ["number[]"], "description": "Custom vector embeddings of the article"}
      ]
    },
    {
      "class": "Author",
      "description": "A class to store authors",
      "properties": [
        {"name": "name", "dataType": ["string"], "description": "The name of the author"},
        {"name": "articles", "dataType": ["Article"], "description": "The articles written by the author"}
      ]
    }
  ]
}
Example schema classes for Article and Author
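The schema.json file above defines the Article and Author classes. The data.json file, on the other hand, contains the sample articles to index: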
[
  {
    "class": "Article",
    "properties": {
      "title": "LangChain: OpenAI + S3 Loader",
      "content": "This article discusses the integration of LangChain with OpenAI and S3 Loader...",
      "url": "https://blog.min.io/langchain-openai-s3-loader/",
      "customEmbeddings": [0.4, 0.3, 0.2, 0.1]
    }
  },
  {
    "class": "Article",
    "properties": {
      "title": "MinIO Webhook Event Notifications",
      "content": "Exploring the webhook event notification system in MinIO...",
      "url": "https://blog.min.io/minio-webhook-event-notifications/",
      "customEmbeddings": [0.1, 0.2, 0.3, 0.4]
    }
  },
  {
    "class": "Article",
    "properties": {
      "title": "MinIO Postgres Event Notifications",
      "content": "An in-depth look at Postgres event notifications in MinIO...",
      "url": "https://blog.min.io/minio-postgres-event-notifications/",
      "customEmbeddings": [0.3, 0.4, 0.1, 0.2]
    }
  },
  {
    "class": "Article",
    "properties": {
      "title": "From Docker to Localhost",
      "content": "A guide on transitioning from Docker to localhost environments...",
      "url": "https://blog.min.io/from-docker-to-localhost/",
      "customEmbeddings": [0.4, 0.1, 0.2, 0.3]
    }
  }
]
Sample data containing articles
curl -X POST -H "Content-Type: application/json" \
  --data @schema.json http://localhost:8080/v1/schema
CURL: create
curl -X POST -H "Content-Type: application/json" \
  --data @data.json http://localhost:8080/v1/objects
CURL: index
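One caveat: as far as we understand the REST API, /v1/objects accepts a single object per request, whereas data.json holds an array. If the request above is rejected, one option is the batch endpoint /v1/batch/objects, which expects the array under an "objects" key. The sketch below wraps the array accordingly; the output file name batch.json and both function names are hypothetical.

```python
import json


def to_batch_payload(objects):
    """Wrap a list of objects in the {"objects": [...]} envelope."""
    if not isinstance(objects, list):
        raise TypeError("expected a list of objects")
    return {"objects": objects}


def convert_file(source="data.json", target="batch.json"):
    """Read an array of objects from source and write the batch envelope to target."""
    with open(source, encoding="utf-8") as handle:
        objects = json.load(handle)
    with open(target, "w", encoding="utf-8") as handle:
        json.dump(to_batch_payload(objects), handle, indent=2)
```

After running convert_file(), the resulting file can be posted with the same curl pattern, pointing --data at @batch.json and the URL at http://localhost:8080/v1/batch/objects.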
curl -X POST 'http://localhost:8080/v1/backups/s3' -H 'Content-Type:application/json' -d '{
  "id": "backup-id",
  "include": [
    "Article",
    "Author"
  ]
}'
CURL: backup-s3
Expect:
{'backend': 's3', 'classes': ['Article', 'Author'], 'id': 'backup-id', 'path': 's3://weaviate-backups/backup-id', 'status': 'SUCCESS'}
Successful Backup-S3 Response
This output is formatted as a JSON object. It includes the backend used (in this case, 's3'), a list of the classes that were included in the backup ('Article', 'Author'), the unique identifier given to the backup ('backup-id'), the path indicating where the backup is stored within the S3 bucket (s3://weaviate-backups/backup-id), and the status of the operation ('SUCCESS').
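The fields described above are easy to inspect programmatically once the response has been parsed. The following sketch condenses a response into one line. It assumes SUCCESS and FAILED are the terminal statuses and treats anything else as still in progress; summarize_backup is a hypothetical helper.

```python
FINISHED = {"SUCCESS", "FAILED"}  # assumed terminal statuses


def summarize_backup(response):
    """Return a one-line summary of a parsed backup or restore response."""
    status = response.get("status", "UNKNOWN")
    classes = ", ".join(response.get("classes", [])) or "no classes"
    state = "finished" if status in FINISHED else "in progress"
    return f"{response.get('id')} on {response.get('backend')}: {status} ({state}); {classes}"


example = {
    "backend": "s3",
    "classes": ["Article", "Author"],
    "id": "backup-id",
    "path": "s3://weaviate-backups/backup-id",
    "status": "SUCCESS",
}
print(summarize_backup(example))
# backup-id on s3: SUCCESS (finished); Article, Author
```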
curl -X POST 'http://localhost:8080/v1/backups/s3/backup-id/restore' \
  -H 'Content-Type:application/json' \
  -d '{
  "id": "backup-id",
  "exclude": ["Author"]
}'
CURL: restore
Expect:
{
  "backend": "s3",
  "classes": ["Article"],
  "id": "backup-id",
  "path": "s3://weaviate-backups/backup-id",
  "status": "SUCCESS"
}
Successful restoration response
Multi-node Backups: For multi-node setups, especially in Kubernetes environments, ensure that your configuration correctly specifies the backup module (like backup-s3 for MinIO) and the related environment variables.
If you encounter issues during backup or restore, check your environment variable configuration, especially the SSL settings for S3-compatible storage like MinIO. Disabling SSL (BACKUP_S3_USE_SSL: false) might resolve certain connection issues.
We are truly inspired by the remarkable innovation that springs from the minds of dedicated and passionate developers like you. It excites us to offer our support and be part of your journey towards exploring advanced solutions and reaching new heights in your data-driven projects. Please don't hesitate to reach out to us.