5,882 reads

PGSync Introduction: Real-time Integration Tool For PostgreSQL And Elasticsearch

by Tolu AinaSeptember 17th, 2020

Too Long; Didn't Read

PGSync is a change data capture tool for moving data from Postgres to Elasticsearch. It allows you to keep Postgres as your source-of-truth and expose structured denormalized documents in Elasticsearch. The tool can be useful for building Backend services for driving text search applications or building real-time dashboard applications. PGSsync’s advanced query builder then generates queries based on your schema. It operates both a polling and an event-driven model to capture changes made to date and notification for changes.

Companies Mentioned

Coin Mentioned

featured image - PGSync Introduction: Real-time Integration Tool For PostgreSQL And Elasticsearch

is a change data capture tool for moving data from to . It allows you to keep Postgres as your source-of-truth and expose structured denormalized documents in .This can be useful for building Backend services for driving text search applications or building real-time dashboard applications. Changes to nested entities are propagated to . PGSync’s advanced query builder then generates SQL queries based on your schema.Simply describe your schema in JSON and PGSync will continuously capture changes in your data and load it into . PGSync provides a self-managed solution for change data capture.

Why?

At a high level, you have data in a Postgres database and you want to expose it in . This means every change to your data needs to be replicated to Elasticsearch.At first, this seems easy and then it’s not. Simply add some code to copy the data to after updating the database or perform the so-called dual writes at your application level. Writing SQL queries spanning multiple tables and involving multiple relationships can be nontrivial.

Detecting changes within nested documents can also be quite hard.
Of course, if your data never changed, then you could just take a snapshot in time and load it into . Keep in mind, you shouldn’t really store your primary data in . Elasticsearch is more suited as a secondary denormalized search engine to be used alongside a traditional normalized datastore.

One of the challenges is getting the data out of the source of truth and into the secondary store in a reasonable timeframe. Existing tools such as Apaches’ Kafka, Amazons’ Kinesis or Elastics’ Logstash require a fair amount of engineering and expertise.

How it works

leverages the feature of Postgres introduced in PostgreSQL 9.4 to capture a continuous stream of change events.PGSync’s query builder is capable of building advanced relational queries dynamically from your schema.Simply, define a schema (JSON) describing the structure of your data in Elasticsearch, bootstrap the databases and start the PGSync daemon.It operates both a polling and an event-driven model to capture changes made to date and notification for changes that occur at a point in time. The initial sync polls the database for changes since the last iteration and thereafter reverts to event notifications (based on triggers and handled by ) for changes to the database.

There is no need to pollute your database with fields such as `updated_at`, `timestamp` or `status` flags to detect and track row-level changes.

When to use PGSync

reduces the complexity of most application stacks.

Postgres is your read/write source of truth whilst Elasticsearch is your read-only search layer.
You have data that is constantly changing.
You have data in an existing relational database such as Postgres and you need a secondary NoSQL database like Elasticsearch for text-based or auto-complete queries.
You want to avoid the development overhead and complexity mandated by other tools.

Installing PGSync

Prerequisites: Python 3.6+, Redis 3.1.0+, Elasticsearch 5.0+, PostgreSQL 9.4+.

Install from .

 $ pip install pgsync

Create a schema configuration file in JSON formatThis should mirror the structure of the resulting document in .Here is an example schema:

{
    "nodes": [
        {
            "table": "book",
            "schema": "public",
            "columns": [
                "isbn",
                "title",
                "description"
            ]
        }
    ]
}

Usage

$ pgsync --config <absolute path to JSON schema config> --daemon

More details about can be found on its .

Market Alternatives

: This is an Apache project which is an open-source stream-processing software platform.
: Amazon managed service similar to Kafka.
: This is a product of that collects data from various sources, then parses, transforms and ships to various destinations.
: Postgres extension that provides full-text search via the use of Elasticsearch indices.

Also published at

L O A D I N G
. . . comments & more!