Learn how time-weighted averages are calculated, why they’re so powerful for data analysis, and how to use TimescaleDB hyperfunctions to calculate them faster – all using SQL.
Time-weighted averages are a way to get an unbiased average when you are working with irregularly sampled data. That kind of irregular sampling comes up all the time, for example in:
Industrial IoT, where teams “compress” data by only sending points when the value changes;
Remote sensing, where sending data back from the edge can be costly, so you only send high-frequency data for the most critical operations;
Trigger-based systems, where the sampling rate of one sensor is affected by the reading of another (i.e., a security system that sends data more frequently when a motion sensor is triggered)
If you prefer to manage your own database instances, you can install the TimescaleDB Toolkit extension from GitHub, after which you’ll be able to use time_weight and other hyperfunctions.
That’s a battery discharge curve, which describes how long a battery can power something. The x-axis shows capacity in Amp-hours, and since this is a constant current discharge, the x-axis is really just a proxy for time. The y-axis displays voltage, which determines the battery’s power output; as you continue to discharge the battery, the voltage drops until it gets to a point where it needs to be recharged.
The “interesting” parts are the ones at the beginning and end of the discharge, where the voltage changes rapidly. Between these two sections, there’s that long period in the middle where the voltage hardly changes at all.
Now, when I said before that I was an electrochemist, I will admit that I was exaggerating a little bit. I knew enough about electrochemistry to be dangerous, but I worked with folks with PhDs who knew a lot more than I did.
By sampling the data adaptively, taking readings frequently where the voltage changes quickly and infrequently where it doesn’t, we'd get more data during the interesting parts and less during the boring middle section. That’s great!
It let us answer more interesting questions about the quickly changing parts of the curve and gave us all the information we needed about the slowly changing sections – without storing gobs of redundant data. But, here’s a question: given this dataset, how do we find the average voltage during the discharge?
If we just took a naive average over the whole curve, adding the value at each point and dividing by the number of points, it would mean that a change to our sampling rate could change our calculated average...even though the underlying effect was really the same!
So the big question is: how do we get a representative average when we’re working with irregularly spaced data points?
Let’s call our values v1 through vn, sampled at times t1 through tn. Then, the normal average would be the sum of the values, divided by the total number of points:
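average = (v1 + v2 + … + vn) / n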
But, because they’re irregularly spaced, we need some way to account for that. The trick is to work with the area under the curve traced out by the data points, rather than with the raw points themselves. (In this case, we’re doing a linear interpolation between the points, i.e., connecting them with straight lines.) So, let’s focus on finding that area. The area between the first two points, (t1, v1) and (t2, v2), is a trapezoid.
Okay, let’s calculate that area:
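area = v1 * (t2 - t1) + (v2 - v1) / 2 * (t2 - t1)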
So just to be clear, that’s the rectangle defined by the first value (v1 times the elapsed time) plus the triangle formed between the two points (half the change in value times the elapsed time).
Okay. So now, if we notice that both terms share the factor (t2 - t1), we can simplify this equation pretty nicely. Start with:
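area = v1 * (t2 - t1) + (v2 - v1) / 2 * (t2 - t1)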
Factor out:
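area = (t2 - t1) * (v1 + (v2 - v1) / 2)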
Simplify:
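area = (t2 - t1) * (2 * v1 + v2 - v1) / 2 = (t2 - t1) * (v1 + v2) / 2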
One cool thing to note is that this gives us a new way to think about this solution: it’s the average of each pair of adjacent values, weighted by the time between them:
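(v1 + v2) / 2 * (t2 - t1)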
It’s also equal to the area of the rectangle drawn up to the midpoint between v1 and v2, over the same time interval.
Now that we’ve derived the formula for two adjacent points, we can repeat this for every pair of adjacent points in the dataset. Then all we need to do is sum that up, and that will be the time-weighted sum, which is equal to the area under the curve. (Folks who have studied calculus may actually remember some of this from when they were learning about integrals and integral approximations!)
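To turn that time-weighted sum into the time-weighted average, we just divide by the total elapsed time. For points (t1, v1) through (tn, vn), that works out to:

time-weighted average = [ (v1 + v2) / 2 * (t2 - t1) + (v2 + v3) / 2 * (t3 - t2) + … + (v(n-1) + vn) / 2 * (tn - t(n-1)) ] / (tn - t1)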
Let’s consider the scenario of an ice cream manufacturer or shop owner who is monitoring their freezers. It turns out that ice cream needs to stay in a relatively narrow range of temperatures (~0-10 ℉; see footnote 1) so that it doesn’t melt and re-freeze, causing those weird crystals that no one likes. Similarly, if ice cream gets too cold, it’s too hard to scoop. To keep an eye on this, we’ll track temperature readings from each freezer in a simple table:
CREATE TABLE freezer_temps (
freezer_id int,
ts timestamptz,
temperature float);
INSERT INTO freezer_temps VALUES
( 1, '2020-01-01 00:00:00+00', 4.0),
( 1, '2020-01-01 00:05:00+00', 5.5),
( 1, '2020-01-01 00:10:00+00', 3.0),
( 1, '2020-01-01 00:15:00+00', 4.0),
( 1, '2020-01-01 00:20:00+00', 3.5),
( 1, '2020-01-01 00:25:00+00', 8.0),
( 1, '2020-01-01 00:30:00+00', 9.0),
( 1, '2020-01-01 00:31:00+00', 10.5), -- door opened!
( 1, '2020-01-01 00:31:30+00', 11.0),
( 1, '2020-01-01 00:32:00+00', 15.0),
( 1, '2020-01-01 00:32:30+00', 20.0), -- door closed
( 1, '2020-01-01 00:33:00+00', 18.5),
( 1, '2020-01-01 00:33:30+00', 17.0),
( 1, '2020-01-01 00:34:00+00', 15.5),
( 1, '2020-01-01 00:34:30+00', 14.0),
( 1, '2020-01-01 00:35:00+00', 12.5),
( 1, '2020-01-01 00:35:30+00', 11.0),
( 1, '2020-01-01 00:36:00+00', 10.0), -- temperature stabilized
( 1, '2020-01-01 00:40:00+00', 7.0),
( 1, '2020-01-01 00:45:00+00', 5.0);
-- Time-weighted average per freezer, computed by hand with window functions
WITH setup AS (
    -- pair each reading with the previous one, converting timestamps to epoch seconds
    SELECT lag(temperature) OVER (PARTITION BY freezer_id ORDER BY ts) as prev_temp,
        extract('epoch' FROM ts) as ts_e,
        extract('epoch' FROM lag(ts) OVER (PARTITION BY freezer_id ORDER BY ts)) as prev_ts_e,
        *
    FROM freezer_temps),
nextstep AS (
    -- area of the trapezoid between each pair of adjacent points
    SELECT CASE WHEN prev_temp is NULL THEN NULL
        ELSE (prev_temp + temperature) / 2 * (ts_e - prev_ts_e) END as weighted_sum,
        *
    FROM setup)
SELECT freezer_id,
    avg(temperature), -- the regular average
    sum(weighted_sum) / (max(ts_e) - min(ts_e)) as time_weighted_average -- our derived average
FROM nextstep
GROUP BY freezer_id;
freezer_id | avg | time_weighted_average
------------+-------+-----------------------
1 | 10.2 | 6.6361
We’ve got a few window functions in there, some CASE statements to deal with NULLs, and several CTEs to try to make it reasonably clear what’s going on. This is the kind of query that can easily become a maintenance headache when people later try to figure out how it works and tweak it.
This is why we created hyperfunctions: to make complicated time-series data analysis less complex. Let’s look at what the time-weighted average freezer temperature query looks like if we use the time_weight hyperfunction:
SELECT freezer_id,
avg(temperature),
average(time_weight('Linear', ts, temperature)) as time_weighted_average
FROM freezer_temps
GROUP BY freezer_id;
freezer_id | avg | time_weighted_average
------------+-------+-----------------------
1 | 10.2 | 6.6361
Isn’t that so much more concise?! Calculate a time_weight with a 'Linear' weighting method (that’s the kind of weighting we derived above; see footnote 2), then take the average of the weighted values, and we’re done. I like that API much better (and I’d better, because I designed it!).
What’s more, not only do we save ourselves from writing all that SQL, but it also becomes far, far easier to compose (build up more complex analyses over top of the time-weighted average). This is a huge part of the design philosophy behind hyperfunctions; we want to make fundamental things simple so that you can easily use them to build more complex, application-specific analyses.
SELECT time_bucket('10 mins'::interval, ts) as bucket,
freezer_id,
avg(temperature),
average(time_weight('Linear', ts, temperature)) as time_weighted_average
FROM freezer_temps
GROUP BY bucket, freezer_id;
We added a time_bucket, grouped by it, and done! Let’s look at some other kinds of sophisticated analysis that hyperfunctions enable. For instance, here’s a rolling time-weighted average over the previous 15 minutes of data for each point:
SELECT *,
average(time_weight('Linear', ts, temperature) OVER fifteen_min) as rolling_twa
FROM freezer_temps
WINDOW fifteen_min AS
(PARTITION BY freezer_id ORDER BY ts RANGE '15 minutes'::interval PRECEDING)
ORDER BY freezer_id, ts;
freezer_id | ts | temperature | rolling_twa
------------+------------------------+-------------+--------------------
1 | 2020-01-01 00:00:00+00 | 4 |
1 | 2020-01-01 00:05:00+00 | 5.5 | 4.75
1 | 2020-01-01 00:10:00+00 | 3 | 4.5
1 | 2020-01-01 00:15:00+00 | 4 | 4.6667
1 | 2020-01-01 00:20:00+00 | 3.5 | 3.8333333333333335
1 | 2020-01-01 00:25:00+00 | 8 | 4.333333333333333
1 | 2020-01-01 00:30:00+00 | 9 | 6
1 | 2020-01-01 00:31:00+00 | 10.5 | 7.363636363636363
1 | 2020-01-01 00:31:30+00 | 11 | 7.5392
1 | 2020-01-01 00:32:00+00 | 15 | 7.739583333333333
1 | 2020-01-01 00:32:30+00 | 20 | 8.13
1 | 2020-01-01 00:33:00+00 | 18.5 | 8.557692307692308
1 | 2020-01-01 00:33:30+00 | 17 | 8.8989
1 | 2020-01-01 00:34:00+00 | 15.5 | 9.4286
1 | 2020-01-01 00:34:30+00 | 14 | 9.35344827586207
1 | 2020-01-01 00:35:00+00 | 12.5 | 9.483333333333333
1 | 2020-01-01 00:35:30+00 | 11 | 11.369047619047619
1 | 2020-01-01 00:36:00+00 | 10 | 11.329545454545455
1 | 2020-01-01 00:40:00+00 | 7 | 10.575
1 | 2020-01-01 00:45:00+00 | 5 | 9.7467
We also provide a special rollup function so you can re-aggregate time-weighted values from subqueries. For instance:
SELECT average(rollup(time_weight)) as time_weighted_average
FROM (SELECT time_bucket('10 mins'::interval, ts) as bucket,
freezer_id,
time_weight('Linear', ts, temperature)
FROM freezer_temps
GROUP BY bucket, freezer_id) t;
time_weighted_average
-----------------------
6.6361
-- Compare each 10-minute bucket's time-weighted average to the overall one
WITH t as (
    SELECT time_bucket('10 mins'::interval, ts) as bucket,
        freezer_id,
        time_weight('Linear', ts, temperature)
    FROM freezer_temps
    GROUP BY bucket, freezer_id)
SELECT bucket,
    freezer_id,
    average(time_weight) as bucketed_twa,
    (SELECT average(rollup(time_weight)) FROM t) as overall_twa,
    average(time_weight) / (SELECT average(rollup(time_weight)) FROM t) as normalized_twa
FROM t;
This lets you compute the overall time-weighted average directly from the already-bucketed aggregates, rather than having to go back and recompute everything from the raw data.
In general, linear weighting is appropriate for cases where the sampling rate is variable, but the system makes no guarantee that it only sends data when the value changes. LOCF (last observation carried forward, the other weighting method time_weight supports) works best when there is some guarantee that your system will send data only when it changes, so you can accurately carry the old value forward until you receive a new one.
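As a rough sketch (reusing the same freezer_temps table from above), switching methods is just a matter of passing 'LOCF' instead of 'Linear' to time_weight:

-- per-freezer time-weighted average, carrying each reading forward until the next one
SELECT freezer_id,
    avg(temperature),
    average(time_weight('LOCF', ts, temperature)) as time_weighted_average
FROM freezer_temps
GROUP BY freezer_id;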