Sometimes, all it takes to handle those pesky NAs is dropping them, i.e. removing the rows containing missing data. However, this might not always be optimal, especially for time series data, and even more so for financial data. Of course, this problem is well-studied, so many alternatives to dropping exist.
I will look into a few of those (listed below) and discuss the pros and cons:
1. Dropping
2. LOCF (last observation carried forward)
3. Mean imputation
4. Interpolation
Note: if one wants to be pedantic, methods 2-4 are all examples of some form of imputation.
import numpy as np
import pandas as pd

np.random.seed(10) # needed for reproducibility
price_moves = 3*pd.Series(np.random.randn(100)) # generating random "steps" with 0 mean
price_vec = 100 + price_moves.cumsum() # generating a random walk (discretized Brownian motion)
price_vec.plot()
price_vec.diff().mean() #sample mean
>0.20030544816842052
The sample mean is obviously non-zero, unlike the mean of the generating steps - but this is just sample noise. So far so good.
price_vec_na_simple = price_vec.copy()
price_vec_na_simple.iloc[90:95] = np.nan # knock out 5 consecutive observations
price_vec_na_simple.diff().mean()
>0.83252
The mean is somehow non-NA even though the diff vector clearly contains NAs. Now, #1 is quite easy - pandas’ mean() automatically removes NAs by default.
It’s easy to show that, at least without NAs, the mean price difference should simply be (price_vec[99] - price_vec[0])/99 - indeed, when we sum the price differences, all the “intermediate” pieces cancel out: (price_vec[1] - price_vec[0]) + (price_vec[2] - price_vec[1]) + ... = price_vec[99] - price_vec[0].
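A quick sanity check of that telescoping identity on the clean series, using only the objects already defined above:
# the sum of all daily differences collapses to "last minus first"
assert np.isclose(price_vec.diff().sum(), price_vec.iloc[99] - price_vec.iloc[0])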
Now, with the missing data inserted, if we first take differences and then drop NAs, this cancellation is broken - some easy math shows that you’re now computing (price_vec[99] - price_vec[0] - price_vec[95] + price_vec[89])/93. To see this, notice that six differences (those at positions 90 through 95) are NA and get dropped, since (NA - any number) evaluates to NA; in particular, the two boundary terms price_vec[90] - price_vec[89] and price_vec[95] - price_vec[94] are gone, which is what breaks the cancellation and leaves 93 terms in the mean.
Let’s verify this:
(price_vec[99] - price_vec[0])/99
>0.20030544816842052
(price_vec[99] - price_vec[0] - price_vec[95] + price_vec[89])/93
>0.83252
Now it becomes clearer how we can fix things - we need to first drop NAs and then diff:
price_vec_na_simple.dropna().diff().mean()
>0.210960
The mean is almost back to where it should be - a small discrepancy happens because we now have fewer terms in the mean - 94 instead of 99.
Ok, looks like if we only care about the mean, we’re fine just using dropna (as long as we do it right)? After all, the difference between 0.2 and 0.21 is clearly within our noise tolerance in the first place. Well, not quite - let’s see why.
Suppose you record multiple quantities at once, especially those that don’t usually change too much too quickly - like hourly recordings of temperature and humidity. Suppose you have both values for 10:00, 11:00, and 12:00, but only humidity at 13:00. Do you just delete that “row” - i.e. pretend you don’t have a reading for 13:00? Well, that’s ok if you just have two variables - even though you just removed some potentially valuable info (the 13:00 humidity). But if you have many such occurrences or many variables at once, dropping might leave you with virtually no data at all!
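To make this concrete, here is a tiny made-up illustration (the readings DataFrame below is hypothetical and only serves to mirror the scenario just described):
# hypothetical hourly sensor data: the 13:00 temperature reading is missing
readings = pd.DataFrame(
    {"temperature": [20.1, 20.4, 20.3, np.nan],
     "humidity": [55.0, 54.0, 56.0, 55.5]},
    index=["10:00", "11:00", "12:00", "13:00"])
readings.dropna() # drops the whole 13:00 row, discarding the perfectly good humidity reading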
A very attractive alternative is to just assume that nothing has changed for the temperature between 12:00 and 13:00. After all, if someone came to us at 12:30 and asked us - “what’s the current temperature”, we would have rightfully replied with the 12:00 reading (if we’re not able to get a new reading immediately, of course). Why not use the same logic for the 13:00 value?
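Applied to the toy readings above, that assumption is just a forward fill:
readings.ffill() # the missing 13:00 temperature becomes the 12:00 value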
price_vec_na_simple.ffill().diff().mean() # ffill performs LOCF by default
>0.20030544816842052
Looks like we recovered our old value precisely! In addition, if you want to do further research on the price difference data, it looks more “orderly” now as it has an entry for every day - even though five of those entries are now 0 (why? try running price_vec_na_simple.ffill().diff().iloc[90:95] to see for yourself).
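For reference, that check and its answer:
price_vec_na_simple.ffill().diff().iloc[90:95] # all five are 0.0 - ffill repeats the day-89 price, so the differences vanish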
In addition to that, in finance, missing data and outlier data often come together. Let me illustrate that:
price_moves_na = price_moves.copy()
# inflate two observations, then delete the three that follow
price_moves_na[90] += 20
price_moves_na[91] += 30
price_moves_na[92] -= 50 # to "deflate" the price shock back
price_vec_na = 100 + price_moves_na.cumsum()
price_vec_na[92:95] = [np.nan, np.nan, np.nan]
price_vec_na.tail(20).plot()
price_vec_na.diff().dropna().mean()
>0.7093365245831178
We can see that after a sharp price increase, the data is simply not available for 3 days straight. This is not as “artificial” an example as it might sound! Imagine that trading halted after the spike, at least on this particular exchange. Then things settled down a bit, so the price went back to the normal regime. Maybe something gradual was going on behind the scenes that actually “connected” the dots between the spike and the post-spike calm-down. But you don’t know that and don’t have any data for it!
What is the most natural assumption if we don’t have any new data? Well, recall that our data-generating model was fundamentally based on price changes. So if there’s no new data, perhaps the price is not changing at all? This is exactly what LOCF (Last Observation Carried Forward) assumes.
A side note for an inquisitive reader - perhaps a more fundamental view on why LOCF is particularly suitable for stock price data is that the price is usually modeled as a martingale. Roughly speaking, a martingale is something where our best guess for tomorrow is what we see today, or E[x_{t+1} | x_t] = x_t.
price_vec_na.ffill().tail(20).plot()
price_vec_na.ffill().diff().mean()
>0.20030544816842052
# another option - mean imputation: fill the NAs with the mean of the first 90 (pre-spike) prices
price_vec_na_impute = price_vec_na.copy()
price_vec_na_impute[price_vec_na_impute.isna()] = price_vec_na.iloc[:90].mean()
price_vec_na_impute.diff().mean()
>0.20030544816842052
We recover the “correct” price change mean, same as LOCF. BUT we introduce an artificial price drop between days 91 and 92, which in some ways is even worse than the one we had before. After all, that one happened at or after the point when things had likely calmed down, whereas this one just assumes everything goes back to normal right away. Aside from that, in practice it can be somewhat challenging to balance the look-back window so that we a) capture recent trends but also b) capture long-term tendencies (the usual bias-variance tradeoff).
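As a rough sketch of that tradeoff, one could impute with a trailing rolling mean instead of the full-history mean (the 30-observation window here is an arbitrary choice for illustration, not a recommendation):
rolling_mean = price_vec_na.rolling(window=30, min_periods=1).mean() # trailing mean; NAs within the window are skipped
price_vec_na_roll_impute = price_vec_na.fillna(rolling_mean) # fill the gap with the rolling estimate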
np.random.seed(2) # needed to ensure a fixed second series
price_moves_2 = pd.Series(np.random.randn(100))
price_vec_2 = 50+(0.4*price_moves/3 + np.sqrt(1-0.4**2)*price_moves_2).cumsum() # all this math to ensure we get a 0.4 "theoretical" correlation with the first one
pd.concat([price_vec, price_vec_2], axis = 1).diff().corr().iloc[0,1]
>0.4866403018044526
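As a quick reminder of where the 0.4 figure comes from (just unpacking the weights above): the second series’ steps are 0.4*price_moves/3 + np.sqrt(1 - 0.4**2)*price_moves_2, so their covariance with price_moves is 0.4/3 * 9 = 1.2, the step standard deviations are 3 and 1 respectively, and corr = 1.2/(3*1) = 0.4.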
As a next step, we’ll examine the case with NAs, but no outliers. We’ll also compare what happens if we dropna before vs. after diff:
pd.concat([price_vec_na_simple, price_vec_2], axis = 1).diff().corr().iloc[0,1] # implicit dropna after diff by default in corr
>0.5022675176281746
pd.concat([price_vec_na_simple, price_vec_2], axis = 1).dropna().diff().corr().iloc[0,1]
>0.5287405341268966
pd.concat([price_vec_na_simple, price_vec_2], axis = 1).ffill().diff().corr().iloc[0,1]
>0.5049380499525835
price_vec_na_simple_impute = price_vec_na_simple.copy()
price_vec_na_simple_impute[price_vec_na_simple_impute.isna()] = price_vec_na_simple_impute.iloc[:90].mean()
pd.concat([price_vec_na_simple_impute, price_vec_2], axis = 1).ffill().diff().corr().iloc[0,1]
>0.4866728183859715
Remember, to stay consistent we need to expose the second price series to the same price shocks the first one experienced, but without the following NAs. Going back to the example above - imagine some major event causing a spike in the first risk factor that eventually halts trading in the first asset. The second asset will experience those too, sure, but perhaps to a lesser extent, and so no halting would take place and thus no NAs.
price_vec_na_2 = 50+(0.4*price_moves_na/3 + np.sqrt(1-0.4**2)*price_moves_2).cumsum()
pd.concat([price_vec_na, price_vec_na_2], axis = 1).diff().corr().iloc[0,1]
>0.65274
pd.concat([price_vec_na, price_vec_na_2], axis = 1).dropna().diff().corr().iloc[0,1]
>0.79506
pd.concat([price_vec_na, price_vec_na_2], axis = 1).ffill().diff().corr().iloc[0,1]
>0.339984
pd.concat([price_vec_na_impute, price_vec_na_2], axis = 1).dropna().diff().corr().iloc[0,1]
>0.7280990594963112
import statsmodels.formula.api as smf
smf.ols('y ~ x', # this includes intercept by default
pd.concat([price_vec, price_vec_2], axis = 1, keys = ['x', 'y']).diff()).fit().params[1]
>0.036273 # empirical beta
0.4/3
>0.133333 # "theoretical" regression beta
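To unpack that number: beta = cov(diff_y, diff_x)/var(diff_x) = (0.4/3 * 9)/9 = 0.4/3, or equivalently corr * std_y/std_x = 0.4 * 1/3 ≈ 0.1333.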
The beta is biased, but if one checks the confidence interval for it using the summary() function, the two values are still in agreement. How about adding the NAs?
smf.ols('y ~ x',
pd.concat([price_vec_na_simple, price_vec_2], axis = 1, keys = ['x', 'y']).dropna().diff()).fit().params[1]
>0.170504
smf.ols('y ~ x',
pd.concat([price_vec_na_simple, price_vec_2], axis = 1, keys = ['x', 'y']).ffill().diff()).fit().params[1]
>0.45616
Pretty smooth so far. But what changes if we add outliers?
smf.ols('y ~ x',
pd.concat([price_vec_na_alt, price_vec_na_2], axis = 1, keys = ['x', 'y']).dropna().diff().dropna()).fit().params[1]
>0.912401
smf.ols('y ~ x',
pd.concat([price_vec_na_alt, price_vec_na_2], axis = 1, keys = ['x', 'y']).ffill().diff()).fit().params[1]
>0.06785360083909411
Well, that’s surprising! Why is LOCF underperforming so much here? To understand, it’s helpful to examine some values:
pd.concat([price_vec_na_alt, price_vec_na_2], axis = 1, keys = ['x', 'y']).ffill().diff().loc[95]
> x -47.556730
y 1.321239
pd.concat([price_vec_na_alt, price_vec_na_2], axis = 1, keys = ['x', 'y']).dropna().diff().loc[95]
> x -47.556730
y -5.503415
What’s going on here? By doing LOCF, we’re effectively comparing price_1[95] - price_1[91] vs price_2[95] - price_2[94]. Hence, one of the price moves does contain a jump whereas the other does not. When you do dropping instead, this does not happen - you compare price_1[95] - price_1[91] vs price_2[95] - price_2[91], and now the second price move also contains the jump! The mean imputation works just as well here:
smf.ols('y ~ x',
pd.concat([price_vec_na_impute, price_vec_na_2], axis = 1, keys = ['x', 'y']).ffill().diff()).fit().params[1]
>0.955972
So how come things seemed to work with correlation? Well, secretly, they didn’t! The point is, the original claim of a 0.4 “theoretical” correlation also relied on the standard deviation of price_moves/3 being 1 - this was the reason for the 1/3 factor in the definition. The vector price_moves/3 had st.dev = 1, but that was broken by introducing the price jumps! This messes up the “theoretical” correlation computation - recall that corr(x,y) = cov(x,y)/(std(x) * std(y)). So changing the standard deviation changes all our conclusions from the previous chapter. A curious reader can go back to that computation to convince themselves that the correlation results for dropping were actually “correct” in that sense - as also evidenced by the beta!
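A quick way to see the broken assumption is to compare the step volatilities with and without the jumps:
(price_moves/3).std() # close to 1, as designed
(price_moves_na/3).std() # noticeably larger once the jumps are in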
Of course, this shows that LOCF is not 100% robust. For one, by doing LOCF we introduce a large price drop when the missing data ends. If it coincides with some outliers on the second price vector, the results could change by quite a bit. (An exercise for the reader - flip the sign of the price move in price_vec_na_2[95] and check how it affects the results.) It’s not quite clear whether it’s “clean” to just introduce this price drop as opposed to, e.g., interpolating between the price peak price_vec_na[91] and the “normal” value afterwards, price_vec_na[95]. However, especially for “live” usage, interpolation is not really possible! After all, if today is day #93, how can we interpolate using a future value recorded at the end of day #95? For a historical study - sure, that remains an option, but then it remains unclear how to interpret it and use it for actual forecasting. In conclusion, interpolation across the time dimension is possible, but somewhat more questionable.