The load profile shows the typical five-day week of an industrial company. Occasionally, for example in May, July or at the end of December, different patterns emerge. We first want to explore and explain these deviations and then take them into account in the forecast for 2017.
First, the data is read from the database as a time series (see following figure) and converted into a data frame. At the same time, the power consumption column is renamed ‘Values’. The second component plots the data.
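A minimal sketch of this step in Python with pandas, assuming the consumption data is available as a table with a timestamp column; the file name and original column name are placeholders, not the company’s actual source:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical source: quarter-hourly consumption readings with a timestamp column.
df = pd.read_csv("consumption_2016.csv", parse_dates=["timestamp"],
                 index_col="timestamp")

# Rename the consumption column to 'Values', as in the workflow
df = df.rename(columns={"consumption_kwh": "Values"})

# Plot the load profile
df["Values"].plot(figsize=(12, 4), title="Load profile 2016")
plt.ylabel("Consumption")
plt.show()
```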
We assign the corresponding month, day of the week and hour of the day to each consumption value. For a more precise analysis of the underlying time data, we choose a representation using sine and cosine. This makes it possible to capture cyclical dependencies between the respective time specifications. We also flag for each value whether it falls on a weekend, a public holiday or a school vacation day. The data set generated in this way forms the basis for developing a model to predict electricity consumption for May 25, 2017.
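A sketch of the sine/cosine encoding, assuming the data frame from above with a datetime index; the function and feature names are illustrative:

```python
import numpy as np
import pandas as pd

def add_cyclical_features(df: pd.DataFrame) -> pd.DataFrame:
    """Encode month, weekday and hour as sine/cosine pairs so that
    neighbouring values on the clock (e.g. hour 23 and hour 0) also
    end up close to each other numerically."""
    ts = df.index
    for name, values, period in [
        ("month", ts.month, 12),
        ("weekday", ts.weekday, 7),
        ("hour", ts.hour, 24),
    ]:
        df[f"{name}_sin"] = np.sin(2 * np.pi * values / period)
        df[f"{name}_cos"] = np.cos(2 * np.pi * values / period)
    return df

df = add_cyclical_features(df)
```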
The data frame generated above is first extended to include the cyclical representation of the time information (Circular Representation of Time Components). Then a flag is added indicating whether a consumption value falls on a weekend (Weekend). Three further components determine whether a value falls on a school vacation day, a public holiday, or on both a weekend and a public holiday. These components can be used for any geographical region by providing the corresponding inputs. For our company in Switzerland, for example, we select “country = CH, province = AG, state = None, year = 2016”.
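A possible implementation of these calendar flags, using the `holidays` package (in recent versions the canton is passed via `subdiv`; older versions use `prov`). The school vacation calendar is not part of that package, so an empty placeholder set stands in for that input here:

```python
import holidays

# Public holidays for canton Aargau, Switzerland; mirrors the component
# input country=CH, province=AG, year=2016.
ch_holidays = holidays.country_holidays("CH", subdiv="AG", years=2016)

# School vacation dates would come from a cantonal calendar (placeholder).
school_vacations = set()

def add_calendar_flags(df, holiday_cal, vacation_dates):
    """Flag weekends, public holidays, school vacations and the
    combination of weekend and public holiday."""
    dates = df.index.date
    df["Weekend"] = (df.index.weekday >= 5).astype(int)
    df["Holiday"] = [int(d in holiday_cal) for d in dates]
    df["Vacation"] = [int(d in vacation_dates) for d in dates]
    df["Weekend_and_Holiday"] = df["Weekend"] * df["Holiday"]
    return df

df = add_calendar_flags(df, ch_holidays, school_vacations)
```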
In addition to predicting the consumption values, the company is primarily interested in how well the model reproduces the load profile. We assess the quality of the model using the R² value. We also output the coefficients of the influencing variables to get an impression of which factors have a particularly strong influence on electricity consumption.
The processed data is first prepared for the regression (‘train, test, split’). For this purpose, the data is split into training and test data. We pass the ‘Values’ column as the target variable (label) of the regression. We select 20 percent of the data (test_size = 0.2) as test data and train the model on the remaining 80 percent. This training takes place in the next step (‘Linear Regression – Trained Model’). In the upper strand of the workflow, predicted values are then generated with the trained model (‘Predict Sklearn Trained Model’). The regression can occasionally predict negative values. As this makes no sense in terms of content, we set these values to zero (‘Negative to Zero’). Finally, the predicted values are visualized together with the test data. The lower two components generate the coefficients (‘Linear Regression – Coefficients’) and the R² value (‘Linear Regression – Goodness of Fit’) of the linear regression.
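The same steps can be sketched with scikit-learn, assuming the feature data frame built above; the random seed and feature selection are illustrative choices, not part of the original workflow:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Feature matrix X (time features and calendar flags) and target y ('Values')
X = df.drop(columns=["Values"])
y = df["Values"]

# 80/20 split into training and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train the linear regression and predict the test data
lin_reg = LinearRegression().fit(X_train, y_train)
y_pred = lin_reg.predict(X_test)

# Negative predictions make no physical sense, so clip them to zero
y_pred = np.clip(y_pred, 0, None)

# Goodness of fit and coefficients of the influencing variables
print("R²:", r2_score(y_test, y_pred))
print(dict(zip(X.columns, lin_reg.coef_)))
```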
In a second step, we pass the processed data set from 2016 to a random forest algorithm. Once again, we analyze the quality of the model using the R² value. We also output the relative importance of the influencing variables as percentages, in descending order.
The components have the same functionalities as in linear regression. However, whereas the linear regression output the coefficients of the influencing variables, the random forest directly quantifies how strongly each variable influences the target variable, i.e. electricity consumption.
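A sketch of the random forest step, reusing the split from above; the number of trees and the seed are illustrative defaults:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Train the random forest on the 2016 training data
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Goodness of fit on the test data
print("R²:", r2_score(y_test, rf.predict(X_test)))

# Percentage influence of each variable, in descending order
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print((importances.sort_values(ascending=False) * 100).round(1))
```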
The day of the week has by far the greatest influence on electricity consumption. However, weekends, school vacations and public holidays also have significant influences that need to be taken into account when making a prediction. The R² value is now 0.96; switching from linear regression to the random forest has therefore improved the model significantly. The random forest is thus a suitable basis for looking into the future.
If we zoom in on the week around May 25 in the visualization of the load profile from 2016, we see a familiar picture: a five-day week with a consistent structure, framed by weekends with lower consumption.
For the prediction based on the random forest algorithm, we first generate a data set for 2017 that includes the day of the week, the time of day, public holidays and school vacations as features. We then use the random forest, trained on the 2016 data, to predict the 2017 values.
As above, the random forest algorithm is trained on the data for 2016. In parallel, a data set for 2017 is generated in the “Time Data” component, for our company with the input “country = CH, province = AG, state = None, year = 2017”. This is then passed to the random forest for prediction.
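A sketch of this prediction step, reusing the feature functions and the trained model from above; the quarter-hourly resolution is an assumption and should match the 2016 data:

```python
import pandas as pd
import holidays

# Timestamps for 2017 at the assumed resolution
index_2017 = pd.date_range("2017-01-01", "2017-12-31 23:45", freq="15min")
df_2017 = pd.DataFrame(index=index_2017)

# Re-use the feature engineering with the 2017 calendars
# (country=CH, province=AG, year=2017); school vacations are again a placeholder.
ch_holidays_2017 = holidays.country_holidays("CH", subdiv="AG", years=2017)
school_vacations_2017 = set()
df_2017 = add_cyclical_features(df_2017)
df_2017 = add_calendar_flags(df_2017, ch_holidays_2017, school_vacations_2017)

# Predict the 2017 load profile with the random forest trained on 2016
df_2017["Predicted"] = rf.predict(df_2017[X_train.columns])
```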
A look at the calendar reveals that May 25, 2017 is a public holiday, Ascension Day. The company’s machines stand still on this day. Work resumes on the following Friday. However, it can be assumed that many employees take vacation on this bridge day and that the company is not running at full capacity, which leads to significantly lower forecast values. The random forest recognized this situation and adjusted its energy forecast accordingly.