We are a big data and AI start-up that uses geospatial data to monitor environmental parameters. Our goal is to become the Bloomberg of environmental data: the reference source for environmental monitoring, ESG (environmental, social, and governance) due diligence, and climate risk assessment.
In other words, we aim to be the go-to source for environmental data that drives sustainable decision making. We believe that safeguarding the global market against climate risk and other climate-change-induced threats requires temporally and spatially continuous, high-resolution datasets that tell us about pollution levels, water quality, emissions, fires, changes in soil composition, and more.

We asked ourselves: "what if there was a platform that presented environmental data the way we have access to financial data?" Sounds like an impossible dream, right? That's why we need a technology stack built to power our mission and a team too young to know what's impossible (our average age is 24.8). Our air quality and farm fire crisis solutions have already won us awards (including for Healthy Cities Solutions), so there's power in believing in and pursuing what others may deem impossible.

In this post, we'll outline our technology, which handles historical as well as continuous data, and talk a bit about how we've architected our platform to give spatial-temporal context to pre-existing data, what we've learned along the way, and what's next. In short, you'll get a glimpse of how we make sense of terabytes of raw data and turn them into environmental insights. Our goal? That you're inspired to follow your dreams and build your own "impossible" projects.
What we do (and why we do it)
We started this journey by creating a high-resolution database for monitoring air pollution in India, and we aim to cover surface water monitoring in 2020, land composition and urban informatics in 2021, and other environmental parameters in the coming years.
Currently, despite the alarming air pollution levels in India, monitoring is inadequate. The existing ground monitoring stations are few and unequally distributed, and, as a result, large parts of the country go unmonitored.

Our products aim to bridge this data gap by combining satellite data with public monitors to provide a more comprehensive picture of the air pollution crisis. We make our data available to customers via accessible APIs and platforms; our approach centers on ensuring public access (democratising high-resolution environmental information) and facilitating policy-making and enterprise decisions (allowing corporations and government entities to understand current environmental conditions as they craft initiatives).

But what we can't monitor, we can't solve. Climate change is a global challenge that we must come together to fix; it poses a huge risk and will have grave effects on public health, ecology, and the financial system. It's therefore vital for governments, industries, and scientists to collaborate on innovative solutions, leveraging AI, big data, and space technology, to prevent and mitigate negative outcomes.

That's where we come in: we build a platform that delivers the high-quality data people need to make data-driven, sustainable decisions, such as how to transition from carbon-intensive economies to low-carbon approaches.

The Tech Stack: How we do it and why our time-series database matters to us
In order to create a large-scale environmental monitoring platform from geospatial data, we needed a database (specifically a time-series database, or TSDB) that could handle enormous quantities of spatial time-series data. We looked at various options and landed on TimescaleDB.

The most common way to solve our "problem" is to use NoSQL databases, which can be treated as streams. However, that would mean writing our own logic to add spatial and temporal context to our data.

Because we monitor environmental parameters using satellite data, the ability to handle and fill null values is especially critical. We had a lot of hacked-together ways to query data with null filling and deal with data irregularities, using multiple cron jobs for each source and code at the application layer to do spatial queries. Essentially, we were spending a lot of time managing our time-series data and not enough time analysing it. For example, we wrote one script for null-filling that was very slow to begin with, and we had to write it in bash, separate from our main app, to keep it performant enough that it didn't block the main thread.

So, NoSQL databases weren't enough. But what could we use? We looked at options like Amazon Timestream, but found that, while it works for IoT data, it didn't fit our scenario: we handle not just IoT data, but satellite data as well.

For us, TimescaleDB proved to be unique among TSDBs: it has its roots in Postgres, which lets us power spatial queries on our time-series data. The features that mattered most for our adoption were continuous aggregation, gap-filling of null values, time bucketing, and, most importantly, SQL support. TimescaleDB has changed our development paradigm: it serializes everything in time order, and we can store, query, and analyze our temporal and spatial data using SQL, directly from the database.

To give a sense of scale: covering an area of roughly 3.3 million km² (about the area of India) at a 10 km × 10 km grid resolution works out to 3.3 × 1,000,000 / (10 × 10) = 33,000 grid points.
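To make that concrete, here is a minimal sketch of the kind of null-filling we previously scripted in bash, expressed as a single query against the measurements table defined below (the column choice is ours, not from the original queries): daily buckets over the last week, with empty days filled by carrying the last observed daily average forward.

-- A minimal gap-filling sketch (hypothetical column choice):
-- one bucket per day over the last week; days with no reading are
-- filled by carrying the last observed daily average forward (locf).
SELECT
    time_bucket_gapfill('1 day', recorded_at) AS day,
    locf(avg(aod550)) AS aod550
FROM measurements
WHERE
    recorded_at > NOW() - interval '1 week'
    AND recorded_at < NOW()
GROUP BY day
ORDER BY day;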
Measurements (raw satellite data) from multiple satellite sources, downsampled to a common denominator
-- Measurements table
CREATE TABLE measurements (
    u_wind NUMERIC(10, 4),       -- zonal (east-west) wind component
    v_wind NUMERIC(10, 4),       -- meridional (north-south) wind component
    albedo NUMERIC(10, 4),       -- surface albedo
    aod469 NUMERIC(10, 4),       -- aerosol optical depth at 469 nm
    aod550 NUMERIC(10, 4),       -- aerosol optical depth at 550 nm
    aod670 NUMERIC(10, 4),       -- aerosol optical depth at 670 nm
    aod865 NUMERIC(10, 4),       -- aerosol optical depth at 865 nm
    aod1240 NUMERIC(10, 4),      -- aerosol optical depth at 1240 nm
    aod_s5p NUMERIC(10, 4),      -- aerosol reading derived from Sentinel-5P
    blh NUMERIC(10, 4),          -- boundary layer height
    temperature NUMERIC(10, 4),  -- temperature
    recorded_at DATE,            -- observation date (time dimension)
    grid GEOMETRY                -- PostGIS geometry of the grid cell
);
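The snippet above doesn't show it, but to get TimescaleDB's chunking and time-series functions on this table, it would be converted into a hypertable keyed on recorded_at. A minimal sketch under that assumption:

-- Assumed step (not shown in the original post): register measurements as a
-- hypertable so TimescaleDB chunks it along the recorded_at time dimension.
SELECT create_hypertable('measurements', 'recorded_at');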
-- Enum for shape types
CREATE TYPE shapes_type AS ENUM ('Country', 'State', 'District', 'Region');
-- Create shapes table (administrative boundaries)
CREATE TABLE shapes (
    id UUID NOT NULL,
    name VARCHAR(255),   -- e.g. 'Punjab'
    type shapes_type,    -- level of the boundary (Country, State, District, Region)
    shape GEOMETRY       -- PostGIS geometry of the boundary
);
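The post doesn't list any indexes, but since the queries that follow filter heavily with ST_Within, spatial GiST indexes on both geometry columns are the usual way to keep those lookups fast. A hypothetical sketch:

-- Hypothetical spatial indexes (our assumption, not from the original post):
-- GiST indexes let PostGIS evaluate the ST_Within filters below without scanning every row.
CREATE INDEX measurements_grid_idx ON measurements USING GIST (grid);
CREATE INDEX shapes_shape_idx ON shapes USING GIST (shape);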
Crunching 33,000,330 data points
SELECT
name,
-- Aggregates all the points in one district
json_agg(json_build_object('datetime', datetime, 'u_wind', u_wind, 'v_wind', v_wind,
'albedo', albedo, 'aod469', aod469, 'aod550', aod550, 'aod670',
aod670, 'aod865', aod865, 'aod1240', aod1240, 'aod_s5p', aod_s5p, 'temperature',
temperature, 'blh', blh)) AS pollutants
FROM (
-- Selects all the districts in Punjab
SELECT
name,
shape
FROM
"shapes"
WHERE
TYPE = 'District'
-- Selects all the `Districts` in `Punjab` state
AND ST_WITHIN(shape, (
SELECT
shape FROM "shapes"
WHERE
name = 'Punjab'))) AS Districts
LEFT JOIN (
-- Select all the points in last week with daily average and within Punjab
SELECT
time_bucket_gapfill ('1 day', recorded_at, NOW() - interval '1 week',
NOW()) AS datetime, grid, avg(u_wind) AS u_wind, avg(v_wind) AS v_wind,
avg(albedo) AS albedo, avg(aod469) AS aod469, avg(aod550) AS aod550,
avg(aod670) AS aod670, avg(aod865) AS aod865, avg(aod1240) AS aod1240,
avg(aod_s5p) AS aod_s5p, avg(temperature) AS temperature, avg(blh) AS blh
FROM
"measurements"
WHERE
-- Get the points within 1 week
recorded_at < NOW()
AND recorded_at > NOW() - interval '1 week'
-- Get the point only in Punjab
AND ST_WITHIN(grid, (
SELECT
shape FROM "shapes"
WHERE
name = 'Punjab'))
GROUP BY
grid, datetime) AS Records
-- Join the points based on geometry
ON ST_Within(grid, Districts.shape)
-- finally group them together
GROUP BY
name;
As mentioned previously, we collect an enormous amount of data from various sources, namely 1,000+ ground monitors across India and several satellite missions. To power our platform, we use TimescaleDB's built-in functions to chunk the data and store historical data in an easily accessible, scalable way. For example, we might have 12 values for March 1, 2014, but we only need one value for that day for historical analysis.
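As a rough sketch of that roll-up (the columns shown are our choice, not from the original post), a time_bucket query can collapse the intra-day readings for each grid cell into a single daily average; in practice, TimescaleDB's continuous aggregates can keep a roll-up like this maintained automatically.

-- Hypothetical daily roll-up: collapse multiple same-day readings per grid cell
-- into one daily average, which is all we need for historical analysis.
SELECT
    time_bucket('1 day', recorded_at) AS day,
    grid,
    avg(aod550) AS aod550,
    avg(temperature) AS temperature
FROM measurements
GROUP BY day, grid;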
When we started:
Post the TimescaleDB transition:
Previously published at