Inspecting Transformed Data
In this example, we will use the same store sales dataset from other examples.
Instead of making predictions, we will use some internal properties of the forecasting models to inspect the transformed training data.
It is good practice to manually inspect the transformed dataset before inferencing to ensure it aligns with user expectations.
Although this model uses the DirectForecaster class, the same process can be completed for RecursiveForecaster.
Data Preparation
# imports
from clustercast.datasets import load_store_sales
from clustercast import DirectForecaster, RecursiveForecaster
# load store sales data
data = load_store_sales()
print(data)
# keep only certain data for training
data_train = data.loc[
data['YM'] < dt.datetime(year=2018, month=1, day=1)
]
ID YM Region Category Sales
0 1 2015-01-01 Central Furniture 506.358
1 1 2015-02-01 Central Furniture 439.310
2 1 2015-03-01 Central Furniture 3639.290
3 1 2015-04-01 Central Furniture 1468.218
4 1 2015-05-01 Central Furniture 2304.382
.. .. ... ... ... ...
568 12 2018-08-01 West Technology 6230.788
569 12 2018-09-01 West Technology 5045.440
570 12 2018-10-01 West Technology 4651.807
571 12 2018-11-01 West Technology 7584.580
572 12 2018-12-01 West Technology 8064.524
[573 rows x 5 columns]
Direct Forecaster
First, we can create the direct forecaster model. We will use the following parameters. Although these are likely not optimal, they will be useful for demonstrating the feature transformations.
- A Box-Cox transform parameter of 0.5
- Differencing
- Inclusion of a level feature
- 3 lag features
- Two ordinal seasonality features: one for 12 months (yearly), and one for 3 months (quarterly)
# create the forecasting model
model = DirectForecaster(
data=data_train,
endog_var='Sales',
id_var='ID',
group_vars=['Region', 'Category'],
boxcox=0.5,
differencing=True,
include_level=True,
timestep_var='YM',
lags=3,
seasonality_ordinal=[12, 3],
)
# fit the model; data transformation is performed within the fit method
model.fit(max_steps=3)
First, let's take a look at the timestep inferred by the model. The forecasting classes work with both datetime and non-datetime timesteps, but it is good practice to inspect. In this case, we are working with monthly data. It appears that the timestep delta inferred by the model is correct.
# show the inferred timestep
print(model._inferred_timestep)
<DateOffset: months=1>
Next, let's look at the transformed data. We will sort it by series ID and timestep in order to make it easier to understand, since there are multiple time series in the dataset.
There are several things to note about the transformed data:
- There are multiple columns that begin with an underscore. These will not be used for training, but are included to make it easier for the user to double-check the transformations. In this case, the
Salescolumn is put through a Box-Cox transformation (_endog_boxcoxcolumn). Then, differencing is applied (_endog_differencedcolumn). The final transformed endogenous variable is stored in theendogcolumn, which will be used for training. - The endogenous variable after being Box-Cox transformed is stored in the
endog_levelcolumn, as specified by theinclude_levelargument. - There are 3 lag variables included in the transformed data:
endog_lag_1,endog_lag_2, andendog_lag_3. These are lags of the finalendogcolumn. - Because we specified for the
DirectForecasterto fit 3 lookahead models (max_steps=3), there were 3 target columns calculated:endog_lookahead_1,endog_lookahead_2, andendog_lookahead_3. - There are two seasonality columns:
season_ordinal_p12andseason_ordinal_p3. - There are onehot-encoded columns for both
RegionandCategory(the two grouping variables).
# display the transformed data
print(model._data_trans.sort_values(by=['ID', 'YM']))
YM ID Region Category Sales endog _endog_boxcox endog_level _endog_differenced endog_lookahead_1 endog_lookahead_2 endog_lookahead_3 endog_lag_1 endog_lag_2 endog_lag_3 season_ordinal_p12 season_ordinal_p3 Region_Central Region_East Region_South Region_West Category_Furniture Category_Office Supplies Category_Technology
0 2015-01-01 1 Central Furniture 506.358 NaN 43.049218 43.049218 NaN -3.082088 75.620414 31.611542 NaN NaN NaN 1 1 1 0 0 0 1 0 0
12 2015-02-01 1 Central Furniture 439.310 -3.082088 39.967130 39.967130 -3.082088 78.702502 34.693629 54.061657 NaN NaN NaN 2 2 1 0 0 0 1 0 0
24 2015-03-01 1 Central Furniture 3639.290 78.702502 118.669632 118.669632 78.702502 -44.008872 -24.640844 9.307329 -3.082088 NaN NaN 3 3 1 0 0 0 1 0 0
36 2015-04-01 1 Central Furniture 1468.218 -44.008872 74.660759 74.660759 -44.008872 19.368028 53.316202 16.867761 78.702502 -3.082088 NaN 4 1 1 0 0 0 1 0 0
48 2015-05-01 1 Central Furniture 2304.382 19.368028 94.028787 94.028787 19.368028 33.948174 -2.500268 -33.884375 -44.008872 78.702502 -3.082088 5 2 1 0 0 0 1 0 0
.. ... .. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
383 2017-08-01 12 West Technology 4075.004 -29.493989 125.687180 125.687180 -29.493989 52.117225 -14.274518 18.847880 -39.848737 60.500935 63.291658 8 2 0 0 0 1 0 0 1
395 2017-09-01 12 West Technology 8081.406 52.117225 177.804405 177.804405 52.117225 -66.391742 -33.269344 5.086028 -29.493989 -39.848737 60.500935 9 3 0 0 0 1 0 0 1
407 2017-10-01 12 West Technology 3214.608 -66.391742 111.412662 111.412662 -66.391742 33.122398 71.477770 NaN 52.117225 -29.493989 -39.848737 10 1 0 0 0 1 0 0 1
419 2017-11-01 12 West Technology 5367.131 33.122398 144.535061 144.535061 33.122398 38.355372 NaN NaN -66.391742 52.117225 -29.493989 11 2 0 0 0 1 0 0 1
431 2017-12-01 12 West Technology 8545.118 38.355372 182.890432 182.890432 38.355372 NaN NaN NaN 33.122398 -66.391742 52.117225 12 3 0 0 0 1 0 0 1
[432 rows x 24 columns]
As previously mentioned, not all columns in the transformed data will be used for training. If we want to see the columns that will be used for training in the X data, we can do the following. Note that the timestep variable, series IDs, original group variables, and intermediate endogenous variable transformations are not included.
# display the columns that will be used for training
for c in model._X_cols:
print(c)
endog
endog_level
endog_lag_1
endog_lag_2
endog_lag_3
season_ordinal_p12
season_ordinal_p3
Region_Central
Region_East
Region_South
Region_West
Category_Furniture
Category_Office Supplies
Category_Technology