Traffic flow forecasting is a well-established time series problem whose solutions have been essentially grounded on statistical-based models. Recently, however, Recurrent Neural Networks (RNNs), such as Long Short-Term Memory networks (LSTMs), have shown promising results in accurately addressing time series problems. The literature is, however, vague about several aspects of the conceived models and often exhibits misconceptions that may lead to important pitfalls. This study aims to conceive and find the best possible LSTM model for traffic flow forecasting while addressing several important aspects of such models, namely the multitude of input features, the time frames used by the model and the approach employed for multi-step forecasting. To overcome the spatial problem of open source datasets, this study presents and describes a new dataset collected by the authors of this work. After several weeks of model fitting, Recursive Multi-Step Multi-Variate models showed the best performance, strengthening the perception that LSTMs can be used to accurately forecast traffic flow for several future timesteps.

Recent years have been undoubtedly beneficial to the machine learning community. Deep learning, in particular, has assumed a prominent position in many distinct fields such as computer vision (Karpathy

Time series problems are among those that have benefited from the progress of deep learning. In essence, a time series problem consists of forecasting future values based on a set of previous observations, ordered in time. Indeed, there is an endless number of time series problems, with the major constraint being the availability of past observations. Until recently, the main focus was on using statistical-based models such as the AutoRegressive Integrated Moving Average (ARIMA) or ARIMA with Explanatory Variable (ARIMAX) to forecast future points (Babu and Reddy,

An important time series problem is traffic flow forecasting. Nowadays, road safety has become a major concern of our society, a fact easily explained by the substantial number of deaths happening on the roads every day. Hence, knowing beforehand how the traffic flow will change in the next minutes or hours would enable a driver to opt for a different road, a cyclist to choose another hour to go cycling or a pedestrian to choose a less polluted zone. Due to the difficulty of obtaining real contemporary data, the great majority of traffic flow forecasting studies use open datasets such as PeMS, a dataset describing the occupancy rate of different car lanes of the San Francisco Bay Area. However, the use of such datasets raises some constraints. The first is related to the different characteristics of traffic in distinct countries, cities and roads, i.e. a model that has been conceived over California roads is likely to behave poorly in other countries and cities. Secondly, the lack of real-time data prevents the models from actually being deployed and used to forecast the traffic flow in a real-life scenario. In contrast, we developed a software artefact, entitled as

Do LSTM networks have better accuracy than statistical-based models for traffic flow forecasting?

Are LSTM networks capable of accurately forecasting the traffic flow of a road for multiple future timesteps (multi-step)?

Do recursive multi-step LSTM networks have better accuracy than multi-step vector output ones?

Do multi-variate LSTM networks have better accuracy than uni-variate ones?

The remainder of this paper is structured as follows. The next section describes the current state of the art, the ARIMA and LSTM models, as well as the importance of spatial-temporal dependencies in time series. The subsequent section describes the materials used, the implemented methods and the software developed for data collection. Later, the performed experiments are explained, with results being gathered and analysed. Finally, conclusions are drawn.

Traffic flow forecasting has been in the mind of researchers for the last decades, with the initial approach being focused on statistical-based models. However, recent times have come with very promising results in regard to the use of RNNs to accurately address time series problems.

ARIMA is a forecasting algorithm originally developed by Box and Jenkins (

The Auto-Regressive Integrated Moving Average with Explanatory Variable (ARIMAX) model is considered an ARIMA extension to a multiple variable (multi-variate) problem. It is a multiple regression model with autoregressive (AR) and moving average (MA) terms. Taking in consideration a different representation of the ARIMA

The expression

On the other hand, recent times come with very promising results in regard to the use of RNNs (Ma

The forget gate,

The input gate,

The output gate,

From this base some variants have emerged (Greff

Intuition is enough to understand how space affects traffic flow forecasts, a stochastic non-linear time series problem. No two countries are the same, no two cities are the same and no two roads are the same. A model will lack the ability to generalize to roads that have distinct traffic patterns. Hence, it becomes extremely difficult, not to say unfeasible, to deploy a model to predict the traffic of several roads of a specific city when the model was trained on data from California roads. To deploy accurate solutions that make real-time predictions, the model needs to have information on the space it is operating. Therefore, possible solutions would include the conception of road-specific models or the categorization of roads that share similar patterns. Another possibility, even though computationally heavier, would be to create a multi-variate multi-road model receiving, as input, multiple observations from different roads at the same timestep. This work, as explained ahead, presents road-specific models that are able to provide, at any time, traffic flow forecasts of a road for each one of the subsequent twelve hours.

LSTMs are specially useful for time series problems (Gers

There are essentially two main parameters to consider. The first is the number of timesteps that will compose an input sequence. This assumes critical importance when performing backpropagation through time (BPTT). BPTT is a gradient-based technique that unfolds a RNN in time to find the gradient of the cost, allowing LSTMs to learn from input sequences of timesteps (Werbos,

Finally, it should be noted that the goal of any model is to provide accurate and reliable forecasts. Hence, models could be conceived to be single-step, i.e. provide a single forecast for the next immediate timestep, or multi-step, i.e. provide a set of forecasts for several future timesteps. Using the hourly dataset example, a single-step model that receives twelve input timesteps will give a prediction for the thirteenth timestep. A multi-step model would give forecasts for the thirteenth, fourteenth and fifteenth timesteps, for example. If single-step models are fairly easy to conceive and evaluate, multi-step models are, on the other hand, harder. In essence, there are two main options to consider when conceiving multi-step models. The first is to have as many neurons in the output layer of the model as timesteps to forecast (Multi-Step Vector-Output). Hence, if one aims to forecast three future timesteps the model would have three output neurons. The second option would be to conceive a single-step model and recursively call it for as many future timesteps as one aims to forecast (Recursive Multi-Step). Again, as an example, let us consider the hourly dataset and an input sequence of twelve timesteps. First, we would conceive the model as single-step, i.e. the model would receive twelve hours as input and would output the thirteenth hour. Then, we would evaluate its performance, i.e. how accurate is the model in forecasting the thirteenth timestep. We would then push the predicted value of the thirteenth timestep to the input sequence, remove the oldest timestep (to ensure that the input sequence consists of twelve timesteps) and give the updated input sequence to the model to get the forecast of the fourteenth timestep. We would keep repeating this process until the forecasting horizon is reached. This is what we call a
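The Recursive Multi-Step procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `recursive_forecast` and `mean_model` are hypothetical names, and the averaging function merely stands in for a trained single-step LSTM.

```python
from typing import Callable, List, Sequence


def recursive_forecast(predict_next: Callable[[Sequence[float]], float],
                       window: Sequence[float],
                       horizon: int) -> List[float]:
    """Call a single-step model repeatedly to forecast `horizon` timesteps."""
    current = list(window)                 # copy, so the caller's data is untouched
    forecasts = []
    for _ in range(horizon):
        y_hat = predict_next(current)      # forecast the next immediate timestep
        forecasts.append(y_hat)
        current = current[1:] + [y_hat]    # drop the oldest value, push the forecast
    return forecasts


def mean_model(seq: Sequence[float]) -> float:
    """Hypothetical stand-in for a trained single-step model."""
    return sum(seq) / len(seq)


window = [float(v) for v in range(1, 13)]  # twelve hourly observations
forecasts = recursive_forecast(mean_model, window, horizon=3)
print(forecasts)                           # thirteenth, fourteenth and fifteenth timesteps
```

The input sequence keeps its length of twelve timesteps at every iteration, which is the property the text insists on; only the newest element changes from an observation to a forecast.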

Recent studies have emerged where LSTM networks are used for traffic flow forecasting. However, it is possible to verify that many exhibit flaws and leave untreated several important aspects of LSTMs.

In Fu

In Tian and Pan (

In Cui

Another work, performed by Zheng

Other studies focus on different deep learning models for time series forecasting. Indeed, new trends are emerging regarding the use of Convolutional Neural Networks (CNNs) (Cai

The next lines describe the materials and methods used in this work, including the collected dataset and all the applied treatments, the evaluation metrics, and the technologies used. The dataset used in this study is available, in its raw state, in an online repository (github.com/brunofmf/Datasets4SocialGood), under an MIT license.

The dataset used in this work was created from scratch and contains real world data. A software artefact, entitled as

Available features in the traffic and weather datasets.

# | Features | Description | |

Traffic dataset | 1 | Name of the city the road belongs to | |

2 | Road identification number | ||

3 | Road name | ||

4 | Road category description | ||

5 | Current speed observed at the road (km/h) | ||

6 | Speed expected under ideal conditions (km/h) | ||

7 | Speed difference (#6–#5) | ||

8 | Current travel time observed at the road (s) | ||

9 | Travel time expected under ideal conditions (s) | ||

10 | Time difference (#9–#8) | ||

11 | Timestamp (YYYY-MM-DD HH24:MI:SS) | ||

Weather dataset | 1 | Name of the city the road belongs to | |

2 | Textual description of the weather | ||

3 | In celsius | ||

4 | Atmospheric pressure on the sea level (hPa) | ||

5 | In percentage | ||

6 | In meter/second | ||

7 | In percentage | ||

8 | Precipitation volume for the last hour (mm) | ||

9 | Current luminosity (categorical) | ||

10 | Sunrise time (unix, UTC) | ||

11 | Sunset time (unix, UTC) | ||

12 | Timestamp (YYYY-MM-DD HH24:MI:SS) |

The software went live on 24th July 2018 and has been collecting data uninterruptedly ever since. It works by making an API call every twenty minutes using an HTTP GET request. It parses the received JSON object and saves the records in the database. The software was designed so that any road of any city or country can be easily added to the fetch list. As of June 2019, the database contains approximately ten million records for several roads of several cities.
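The parse-and-store step of such a collector could look like the sketch below. Everything specific here is an assumption made for illustration: the JSON field names, the table layout and the use of SQLite are hypothetical, since the text does not describe the API response or the database engine. The network call itself is left out on purpose.

```python
import json
import sqlite3

# Hypothetical payload; the real API's field names are not given in the text.
payload = ('{"city": "Braga", "road_num": 1, "current_speed": 32, '
           '"free_flow_speed": 42, "timestamp": "2019-04-23 17:40:00"}')


def save_record(conn: sqlite3.Connection, raw_json: str) -> None:
    """Parse one JSON object and persist it as a row."""
    rec = json.loads(raw_json)
    conn.execute(
        "INSERT INTO traffic (city, road_num, current_speed, free_flow_speed, ts) "
        "VALUES (?, ?, ?, ?, ?)",
        (rec["city"], rec["road_num"], rec["current_speed"],
         rec["free_flow_speed"], rec["timestamp"]),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")         # in-memory database for the example
conn.execute("CREATE TABLE traffic (city TEXT, road_num INTEGER, "
             "current_speed REAL, free_flow_speed REAL, ts TEXT)")
save_record(conn, payload)
rows = conn.execute("SELECT city, current_speed FROM traffic").fetchall()
print(rows)
```

In a running collector, a scheduler would call the fetch and `save_record` pair once every twenty minutes for each road in the fetch list.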

Two distinct datasets were available, namely the traffic and the weather ones (Table

The first step to prepare both datasets was to focus on a specific road of the city of Braga, in Portugal, and filter out the remaining cities and roads. Then, in regard to the traffic dataset, features that were static, such as the

Features present in the final dataset.

# | Features | Observation example |

0 | 2019-04-23 17:40 | |

1 | 32 | |

2 | 10 | |

3 | 0 | |

4 | 1 | |

5 | 17 |

No missing values were present. However, due to API limitations or to the fact that

Computation of missing timesteps.
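As a rough illustration of how missing timesteps can be located, the sketch below walks over consecutive timestamps and lists the expected ones that are absent. It assumes observations aligned on the twenty-minute collection interval; `missing_timesteps` is a hypothetical helper, and how the gaps are subsequently filled is not covered here.

```python
from datetime import datetime, timedelta


def missing_timesteps(timestamps, step=timedelta(minutes=20)):
    """Return the expected timestamps absent between consecutive observations."""
    missing = []
    for prev, nxt in zip(timestamps, timestamps[1:]):
        expected = prev + step
        while expected < nxt:              # every step skipped before the next record
            missing.append(expected)
            expected += step
    return missing


fmt = "%Y-%m-%d %H:%M"
observed = [datetime.strptime(s, fmt) for s in
            ["2019-04-23 17:00", "2019-04-23 17:20", "2019-04-23 18:20"]]
gaps = missing_timesteps(observed)
print([g.strftime(fmt) for g in gaps])
```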

Considering that LSTMs work internally with the hyperbolic tangent function (tanh), the decision was to normalize all features so that they would fit into the interval
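A min-max normalization of a single feature can be sketched as follows. The target interval is not legible in this text, so the default of [-1, 1] used here is an assumption, chosen only because it matches the output range of tanh. `fit_min_max` is a hypothetical helper name.

```python
def fit_min_max(values, low=-1.0, high=1.0):
    """Build scale/unscale functions mapping [min, max] of `values` onto [low, high]."""
    v_min, v_max = min(values), max(values)
    span = (v_max - v_min) or 1.0          # guard against constant features

    def scale(v):
        return low + (v - v_min) * (high - low) / span

    def unscale(s):
        return v_min + (s - low) * span / (high - low)

    return scale, unscale


speed_diff = [0.0, 10.0, 20.0, 40.0]       # toy values, not taken from the dataset
scale, unscale = fit_min_max(speed_diff)
scaled = [scale(v) for v in speed_diff]
restored = [unscale(s) for s in scaled]
print(scaled)
```

Keeping the inverse function around matters in practice, because forecasts have to be mapped back to the original unit before error metrics are reported.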

Finally, due to computational constraints, the dataset was grouped by hour. The final multi-variate dataset contained 5050 observations, with a shape of

The dataset was now a single sequence of 5050 ordered timesteps. However, it was still not ready to be used by LSTM models. Indeed, in order to create such models one is required to frame the problem as a supervised one, with inputs and corresponding labels. Hence, data was divided into smaller sequences, with their length depending on the number of input timesteps to be used. The label of each new sequence was the next immediate timestep, or a sequence of timesteps, depending on whether the model was to be Single-Step or Recursive Multi-Step, or Multi-Step Vector-Output, respectively (Algorithm

From an unsupervised to a supervised problem.

Consider the scenario of a single-step model that receives the last twenty-four hours of traffic and predicts how the traffic will change in the next hour. In this scenario, the initial dataset with 5050 ordered timesteps will be divided into smaller sequences of 24 timesteps, i.e. the length of the input sequence. Since this is a single-step model, or a recursive multi-step one, the label will be composed of just one timestep and will have the value of the timestep that follows the twenty-fourth one. A sliding window over the initial dataset is used to create batches of sequences and labels. In the end, the dataframe (a Pandas object) will have a shape of (5025, 24, 5), where the first dimension corresponds to the number of samples, the second to the number of input timesteps, and the third to the number of features.
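The sliding-window framing can be illustrated with a small, self-contained sketch. It is not the authors' algorithm: `to_supervised` is a hypothetical helper operating on plain lists, the toy series has ten timesteps and two features, and the exact number of samples obtained on a real dataset depends on how the last window is handled.

```python
def to_supervised(series, n_input, n_output=1):
    """Slide a window over `series`, returning (input sequences, labels)."""
    samples, labels = [], []
    last_start = len(series) - n_input - n_output
    for start in range(last_start + 1):
        end = start + n_input
        samples.append(series[start:end])            # n_input consecutive timesteps
        labels.append(series[end:end + n_output])    # the timestep(s) that follow
    return samples, labels


# Toy series: 10 timesteps, each with 2 features.
series = [[float(t), float(t % 24)] for t in range(10)]

X, y = to_supervised(series, n_input=3)              # single-step / recursive labels
X_vec, y_vec = to_supervised(series, n_input=3, n_output=2)  # vector-output labels
print(len(X), len(X[0]), len(X[0][0]))               # samples, timesteps, features
```

With `n_output=1` the labels suit Single-Step and Recursive Multi-Step models, while a larger `n_output` yields one label per forecast timestep, as required by Multi-Step Vector-Output models.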

Python, version 3.6, was the programming language used for data preparation and pre-processing as well as for model development and evaluation. Pandas, NumPy, scikit-learn, matplotlib and statsmodels were the libraries used. TensorFlow v1.12.0 was the machine learning library used to conceive the deep learning models. tf.keras, TensorFlow’s implementation of the Keras API specification, was also used. For increased performance, and to fit the models in reasonable times, Tesla T4 GPUs were used together with an optimized version of LSTMs based on cuDNN (CudnnLSTM), NVIDIA’s library of GPU-accelerated primitives for deep neural networks. It is worth mentioning that this hardware was made available by Google’s Colaboratory, a free Python environment that requires minimal setup and runs entirely in the cloud.

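To evaluate the effectiveness of the candidate models, two error metrics were used. The first one corresponds to the Root Mean Squared Error (RMSE). This allows us to penalize outliers and easily interpret the obtained results since they are in the same unit as the feature being predicted by the model. Its formula is as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

The second error metric corresponds to the Mean Absolute Error (MAE). It was mainly used to complement and strengthen the confidence in the obtained values. Its formula is as follows:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

In both formulas, $n$ is the number of forecasts, $y_i$ is the observed value and $\hat{y}_i$ is the corresponding forecast.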
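For reference, both metrics can be computed with the standard library alone. The sketch below uses toy values that are not taken from the experiments; it only illustrates why RMSE reacts more strongly than MAE to a single large error.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared difference."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean Absolute Error: mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [10.0, 12.0, 8.0, 15.0]   # toy observations
y_pred = [11.0, 12.0, 6.0, 19.0]   # toy forecasts; the last one is a large miss

rmse_value = rmse(y_true, y_pred)
mae_value = mae(y_true, y_pred)
print(rmse_value, mae_value)
```

Here the absolute errors are 1, 0, 2 and 4, so the MAE is 1.75, whereas the RMSE is the square root of 5.25 (about 2.29) because the squared term gives the largest error more weight.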

All candidate models were evaluated in regard to the mentioned metrics. A time series cross-validator was also used to provide train/test indices to split time series data samples. In this case, the

The main goal of this study is to develop and tune a deep learning model, in particular a LSTM neural network, to forecast the traffic flow. This is achieved by forecasting the

Due to the random initialization of LSTM weights, three repetitions of each combination of parameters were performed on incremental sets of data, taking the mean of RMSE and MAE to verify the model’s quality. Several experiments were performed to find the best set of parameters for the final model. More than finding the optimized set of hyperparameters, such experiments aim to study and evaluate the nature and architecture of LSTMs in regard to the shape of its output, the multitude of used features, and the ability to forecast multiple future timesteps. An experiment was also set where ARIMA models are deployed, evaluated and later compared with LSTMs.

Algorithm

Function for LSTM model’s conception and compilation.

When compared to LSTMs, ARIMA and ARIMAX are considered more traditional approaches to time series forecasting: they are older and based on classical mathematical regression operations. Nevertheless, there are several examples of their usage in the literature, especially in the same area of study as the one presented in this paper. Therefore, ARIMA and ARIMAX models are used as benchmarks for the conceived LSTMs. Using the collected data, these models were trained so as to mimic the conditions in which the LSTMs were trained, i.e. using ARIMA for uni-variate and ARIMAX for multi-variate problems. For both models, data preparation is less strict, as there is no need to turn the data into sequences; the algorithms process the ordered dataset directly. The remaining data pre-processing operations were already explained.

The experiments made with the ARIMA model used a grid search approach to fine tune the

In order to tackle multi-step predictions, a forecast method was used that directly forecasts a number of instances starting from the last point of the training dataset. This is consistent with the blind multi-step forecasting approach. For a multi-step approach incorporating real observations, the model needs to be re-trained after each dataset change. Training an ARIMA or ARIMAX model is usually faster than training an LSTM. However, when forecasting one step ahead with real observations, the need to re-train at each timestep, together with the number of experiments being performed, made ARIMA and ARIMAX slower than the worst performing LSTM. For simplicity, only blind multi-step forecasting models were considered.

The goal of this experiment was to compare the performance, both computationally and in terms of accuracy, of two distinct multi-step approaches. It soon became obvious that the Multi-Step Vector Output model would require a more complex architecture. That came, however, with a significantly higher computational and training time cost. Therefore, for this experiment, only uni-variate models were considered. Both models receive input sequences of one feature, i.e.

A Recursive Multi-Step LSTM model was conceived and tuned in regard to a set of parameters. As explained before, such a model forecasts the next immediate timestep, being then called recursively twelve times. RMSE and MAE values are gathered both for blind and non-blind forecasts. Non-blind forecasts use real values when iterating recursively. Obviously, non-blind forecasts produce a better result than blind ones since this last approach suffers from the accumulation of errors with higher forecasting horizons. This serves the purpose of showing that misconceptions may lead researchers to present models that have been inappropriately evaluated. Nonetheless, blind forecasts are the ones to be considered. Table
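The difference between blind and non-blind evaluation can be made concrete with a toy example. The sketch below is an illustration under stated assumptions: `rolling_forecast` is a hypothetical helper, and a naive last-value model stands in for the trained LSTM. The only thing that changes between the two modes is which value is pushed back into the input sequence.

```python
def rolling_forecast(predict_next, window, actual_future, blind=True):
    """Forecast len(actual_future) steps, feeding back forecasts (blind) or real values."""
    current = list(window)
    forecasts = []
    for actual in actual_future:
        y_hat = predict_next(current)
        forecasts.append(y_hat)
        fed_back = y_hat if blind else actual   # the single difference between modes
        current = current[1:] + [fed_back]
    return forecasts


def last_value_model(seq):
    """Hypothetical stand-in model: repeat the most recent value."""
    return seq[-1]


def mean_abs_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


window = [1.0, 2.0, 3.0]
future = [4.0, 5.0, 6.0]
blind = rolling_forecast(last_value_model, window, future, blind=True)
non_blind = rolling_forecast(last_value_model, window, future, blind=False)
print(mean_abs_error(future, blind), mean_abs_error(future, non_blind))
```

In this toy case the blind error grows with the horizon while the non-blind error stays constant, which is why reporting non-blind figures as multi-step accuracy would overstate what a deployed model can deliver.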

Recursive Multi-Step Uni-Variate parameters’ searching space.

Parameter | Searched values | Rationale |

Epochs | – | |

Timesteps | Input of 0.5 to 4 days | |

Batch size | 1 day to 4 weeks | |

LSTM layers | Number of LSTM layers | |

Dense layers | 1 | Number of dense layers |

Dense activation | [ReLU, tanh] | Activation function |

Neurons | For dense and LSTM layers | |

Dropout rate | For dense and LSTM layers | |

Learning rate | Tuned via callback | Keras callback |

Multisteps | 12 | 12 hours |

Features | speed_diff | Uni-variate |

CV Splits | 3 | Time series cross-validator |

As soon as the first Multi-Step Vector Output candidate model started training, it became obvious that it would be computationally very expensive and that going through the whole search space would take a considerable amount of time. Therefore, a smaller parameter space was considered (Table

Multi-Step Vector Output parameters’ searching space.

Parameter | Searched values | Rationale |

Epochs | – | |

Timesteps | Input of 2 and 4 days | |

Batch size | 1, 2 and 4 weeks | |

LSTM layers | Number of LSTM layers | |

Dense layers | Number of dense layers | |

Dense activation | [ReLU, tanh] | Activation function |

Neurons | For dense and LSTM layers | |

Dropout rate | For dense and LSTM layers | |

Learning rate | Tuned via callback | Keras callback |

Multisteps | 12 | 12 hours |

Features | speed_diff | Uni-variate |

CV Splits | 3 | Time series cross-validator |

This experiment aimed to compare the performance, both computationally and in terms of accuracy, of uni-variate and multi-variate LSTM models. The first type, uni-variate, refers to models that use a single feature, while multi-variate ones use multiple features with the expectation of accurately generalizing the problem at hand. Both uni-variate and multi-variate models were conceived as Recursive Multi-Step. This decision was based on the fact that Recursive Multi-Step forecasting was shown to be less expensive and more accurate than Multi-Step Vector Output. Intuition and random search were, again, used to reduce the size of the parameter space and to find the best set of parameters for each model.

The model conceived for the uni-variate experiment is the same as the Recursive Multi-Step one presented in Section 4.2. Indeed, such model was already a uni-variate one, using only the

For the multi-variate experiments, from the features considered in Table

Recursive Multi-Step Multi-Variate parameters’ searching space.

Parameter | Searched values | Rationale |

Epochs | – | |

Timesteps | Input of 0.5 to 4 days | |

Batch size | 0.5 to 4 weeks | |

LSTM layers | Number of LSTM layers | |

Dense layers | Number of dense layers | |

Dense activation | [ReLU, tanh] | Activation function |

Neurons | For dense and LSTM layers | |

Dropout rate | For dense and LSTM layers | |

Learning rate | Tuned via callback | Keras callback |

Multisteps | 12 | 12 hours |

Features | speed_diff, precipitation, week_day and hour | |

CV Splits | 3 | Time series cross-validator |

The conceived models were evaluated in regard to the RMSE and MAE error metrics. A time series cross-validator was used to provide train and test indices to split time series data samples. The test sets were used to evaluate the model. Each training set was further split into training and validation sets in a ratio of 9:1. The results presented in the following lines are the output of a forecast function that was developed to plot and predict three entire days of the test set (72 timesteps). Mean RMSE and MAE values were computed for each prediction of each split of the time series cross-validator.
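The splitting scheme can be sketched with index arithmetic only. The function below mirrors the expanding-window logic of scikit-learn's `TimeSeriesSplit`; whether that exact class was used is an assumption, as is taking the validation set from the chronological end of each training set. Both helper names are hypothetical.

```python
def time_series_splits(n_samples, n_splits=3):
    """Expanding-window splits: each test block follows its training block in time."""
    test_size = n_samples // (n_splits + 1)
    splits = []
    for k in range(n_splits):
        test_start = n_samples - (n_splits - k) * test_size
        train_idx = list(range(test_start))
        test_idx = list(range(test_start, test_start + test_size))
        splits.append((train_idx, test_idx))
    return splits


def train_validation_split(train_idx, train_parts=9, val_parts=1):
    """Split a training set 9:1, keeping the most recent samples for validation."""
    cut = len(train_idx) * train_parts // (train_parts + val_parts)
    return train_idx[:cut], train_idx[cut:]


splits = time_series_splits(n_samples=40, n_splits=3)
for train_idx, test_idx in splits:
    fit_idx, val_idx = train_validation_split(train_idx)
    print(len(fit_idx), len(val_idx), test_idx[0], test_idx[-1])
```

No shuffling takes place at any point, so every model is always evaluated on observations that come after the ones it was trained on.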

ARIMA-based models were used as a benchmark. During our analysis, it was possible to observe that ARIMA and ARIMAX models tend to be unreliable when addressing a dynamic problem such as traffic flow in an open data stream format. A different approach to data modelling could be designed; nonetheless, there is no assumption or guarantee that the data will follow any specific distribution or will not be affected by outlier events.

Model training is itself problematic, and it is often impossible to fit the model at all. Also, if the dataset of past observations is constantly updated, the models need to be re-trained every time such an update happens. Hence, when comparing blind forecasts with forecasts that use real observations, the latter imply that the model needs to be re-trained for each prediction. Blind prediction, on the other hand, only requires re-training each time the starting point of the prediction changes, which decreases the number of re-trainings even though it remains high.

Two random blind multi-step forecasts for the best ARIMA and the best ARIMAX model.

In terms of accuracy, both ARIMA and ARIMAX can yield good results for few timestep forecasts. These models behave especially badly when the series incurs, for some time, in stagnant data, i.e. when there is no speed variation for several timesteps. Figure

Top-four ARIMA and ARIMAX candidate models.

Moving average | Dif. Operators | Auto-regressive | RMSE | MAE |

12 | 1 | 2 | 6.336 | 5.088 |

12 | 1 | 1 | 6.452 | 5.200 |

8 | 1 | 2 | 7.967 | 6.275 |

8 | 1 | 1 | 8.869 | 7.474 |

12 | 1 | 2 | 6.110 | 4.918 |

12 | 1 | 1 | 6.171 | 4.972 |

8 | 1 | 2 | 8.301 | 6.822 |

8 | 1 | 1 | 8.325 | 6.736 |

ARIMAX behaves similarly; however, the presence of contextual data makes the model slightly more accurate for each parameter configuration. On the other hand, ARIMAX models are significantly more costly to train, as the computational overhead of the contextual data translates directly into an increase in training time. Contrary to what may be expected, ARIMAX breaks less often than the standard uni-variate ARIMA, which means that it is also more reliable for blind multi-step predictions.

When comparing ARIMA models to LSTM ones, there are some considerations to take into account. Firstly, LSTMs can achieve better forecasting performance and produce stable results regardless of data properties. ARIMA models, on the other hand, require the data to be stationary, which might not be possible in continuous streams of data. Regarding training times, ARIMA and ARIMAX train quickly but require re-training when new observations are added to the dataset and the forecast should start after the latest observation. LSTMs do not share this problem, as they can predict the next sequence of data without re-training the network, regardless of the starting point. This detail makes LSTMs better suited for real-time applications, where trained models can deliver highly accurate results without the constant need for re-training.

Other aspects, such as the presence of contextual data, benefit both approaches. Results indicate that the presence of contextual data does improve forecasting performance. From an analytical point of view, the major advantages of LSTMs over ARIMA and ARIMAX models are their accuracy, performance, ability to train models under any data constraints, and the need to train models only once.

Both error metrics are clear when the mission is to find the best multi-step model. Recursive Multi-Step LSTM models had significantly lower RMSE and MAE values when compared to Multi-Step Vector Output ones. As depicted in Table

Recursive Multi-Step vs Multi-Step Vector Output LSTMs top-five results.

# | Timesteps | Batch | Layers | Neurons | Dropout | Act. | RMSE | MAE |

209 | 96 | 672 | 5 | 64 | 0.2 | relu | 3.496 | 1.567 |

188 | 96 | 672 | 4 | 64 | 0.2 | tanh | 3.518 | 1.567 |

95 | 96 | 252 | 5 | 32 | 0.2 | relu | 3.555 | 1.592 |

195 | 96 | 672 | 4 | 128 | 0.5 | tanh | 3.583 | 1.598 |

12 | 48 | 252 | 3 | 64 | 0.5 | relu | 3.649 | 1.629 |

24 | 96 | 672 | 4 | 128 | 0.5 | relu | 5.040 | 1.846 |

31 | 96 | 672 | 5 | 128 | 0.2 | relu | 5.134 | 1.839 |

29 | 96 | 672 | 5 | 128 | 0.2 | tanh | 5.272 | 1.852 |

20 | 48 | 168 | 5 | 512 | 0.5 | relu | 5.279 | 1.880 |

30 | 96 | 672 | 5 | 128 | 0.5 | tanh | 5.325 | 1.885 |

It is worth highlighting that vector output models have as many neurons in the output layer as timesteps to forecast. Intuitively, this would lead the model to demand a more complex architecture when compared to those that have a single output neuron. This was indeed verified in the performed experiments. The first round of experiments with vector output models limited the maximum number of neurons to 128 and the number of fully-connected layers to 1. This maximum number was the one used by all the best candidates. Then, the second round of experiments with vector output models considered a higher number of neurons, layers and training epochs. However, very few experiments were performed with combinations of 256 and 512 neurons, 4 and 5 hidden LSTM layers and 2 fully-connected layers, since each candidate model was taking more than eight hours to train.

Six random multi-step vector output forecasts of the best Multi-Step Vector Output Uni-Variate LSTM model (#24). Comparison of real values vs predicted ones using vector output forecasting.

It is our conclusion that deeper and more complex architectures could improve the performance of Multi-Step Vector Output models. This comes, however, with a significant increase in training times. On the other hand, the Recursive Multi-Step LSTM models were between eight and sixteen times faster to train, required a shallower architecture and produced results that are more than 40% better. It is also interesting to note that the best models were the ones using a bigger batch size combined with input sequences of four entire days (96 timesteps). There was no clear distinction between using rectified linear units and the hyperbolic tangent. The same holds for dropout values.

Six random multi-step forecasts of the best Recursive Multi-Step Uni-Variate LSTM model (#209). Comparison of real values vs predicted ones using blind forecasting vs predicted ones using known observations.

Table

An increase of the multitude of input features led to a decrease of the RMSE and MAE values. As depicted in Table

Uni-Variate vs Multi-Variate LSTMs top-five results.

# | Timesteps | Batch | Layers | Neurons | Dropout | Act. | RMSE | MAE |

209 | 96 | 672 | 5 | 64 | 0.2 | relu | 3.496 | 1.567 |

188 | 96 | 672 | 4 | 64 | 0.2 | tanh | 3.518 | 1.567 |

95 | 96 | 252 | 5 | 32 | 0.2 | relu | 3.555 | 1.592 |

195 | 96 | 672 | 4 | 128 | 0.5 | tanh | 3.583 | 1.598 |

12 | 48 | 252 | 3 | 64 | 0.5 | relu | 3.649 | 1.629 |

53* | 24 | 168 | 4 | 64 | 0.5 | tanh | 2.907 | 1.346 |

24* | 24 | 96 | 4 | 32 | 0.5 | tanh | 3.006 | 1.412 |

16* | 24 | 48 | 4 | 32 | 0.5 | tanh | 3.031 | 1.419 |

17* | 24 | 168 | 5 | 64 | 0.5 | tanh | 3.037 | 1.402 |

37* | 48 | 84 | 2 | 64 | 0.5 | tanh | 3.038 | 1.425 |

* Used features:

It is worth noting that including the hour of the day as an input feature allowed the model to make more accurate forecasts. On the other hand, the presence of the precipitation feature led to worse results. Moreover, it is interesting to note that the presence of more input features led to a decrease in the number of input timesteps. Indeed, the best uni-variate candidates required 96 input timesteps, while the best multi-variate ones require just 24 input timesteps. In simple terms, while the uni-variate models require four days of input to forecast the next twelve hours, the multi-variate ones require just a single day of input to forecast the same number of hours. The batch size required by multi-variate models is also substantially lower when compared to uni-variate ones. Regarding the activation function, while there is no clear distinction between relu and tanh in uni-variate candidates, it is clear that tanh performs better in multi-variate ones with a dropout of 50%.

Even though multi-variate models took longer to train than uni-variate ones, such variation is negligible (a few tens of minutes). Moreover, the addition of more features to the model allowed it to perform better than a uni-variate one. Indeed, the addition of the hour of the day was the factor that allowed the candidate models to achieve lower error values. Table

Six random multi-step forecasts of the best Recursive Multi-Step Multi-Variate LSTM model (#53). Comparison of real values vs predicted ones using blind forecasting vs predicted ones using known observations.

Traffic flow forecasting has been assuming a prominent position with the rise of deep learning. This is essentially due to the fact that state-of-the-art RNN models have now surpassed previous statistical-based models, both in terms of performance and accuracy. From all the candidate models, and after several weeks of combined training times, the model that most accurately forecast the traffic flow of a road for multiple future timesteps was the fifty-third Recursive Multi-Step Multi-Variate candidate model, with an RMSE and MAE of 2.907 and 1.346, respectively. The Uni-Variate models also presented interesting results, with the best one being 20% worse than the best Multi-Variate one. On the other hand, the best Vector Output model had an RMSE that was more than 73% worse than the best Multi-Variate model and 40% worse than the best Uni-Variate one. In addition, Vector Output models took significantly more time to train when compared to the other models. The training time difference between the Uni-Variate and the Multi-Variate models can be considered negligible. On the other hand, ARIMA and ARIMAX presented results that are two times worse than the best LSTM model. In addition, ARIMA models need to be re-fitted significantly more often than LSTM ones, which may be problematic in open data stream scenarios. All this is in agreement with what the literature has been presenting, confirming that these statistical-based models have now been superseded by deep learning ones. Table

Summary results for the best model of each approach.

Model | RMSE | MAE |

Recursive Multi-Step Multi-Variate | 2.907 | 1.346 |

Recursive Multi-Step Uni-Variate | 3.496 | 1.567 |

Multi-Step Vector Output Uni-Variate | 5.040 | 1.846 |

ARIMAX (Multi-Variate) | 6.110 | 4.918 |

ARIMA (Uni-Variate) | 6.336 | 5.088 |
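The relative differences quoted in the text can be re-derived from the RMSE values of the summary, taking 2.907 as the RMSE of the best Recursive Multi-Step Multi-Variate model, as stated in the text. The short check below is a worked calculation only; the dictionary keys and the `pct_worse` helper are labels introduced here for illustration.

```python
# RMSE of the best model of each approach, as reported in the text and summary table.
rmse_by_model = {
    "recursive_multi_variate": 2.907,
    "recursive_uni_variate": 3.496,
    "vector_output_uni_variate": 5.040,
    "arimax": 6.110,
    "arima": 6.336,
}


def pct_worse(value, reference):
    """Relative increase of `value` over `reference`, in percent."""
    return 100.0 * (value - reference) / reference


best = rmse_by_model["recursive_multi_variate"]
uni_vs_multi = pct_worse(rmse_by_model["recursive_uni_variate"], best)
vector_vs_multi = pct_worse(rmse_by_model["vector_output_uni_variate"], best)
vector_vs_uni = pct_worse(rmse_by_model["vector_output_uni_variate"],
                          rmse_by_model["recursive_uni_variate"])
arimax_ratio = rmse_by_model["arimax"] / best

print(round(uni_vs_multi, 1), round(vector_vs_multi, 1),
      round(vector_vs_uni, 1), round(arimax_ratio, 2))
```

The computed figures are roughly 20%, 73% and 44%, with ARIMAX at about 2.1 times the best RMSE, which is consistent with the rounded percentages given in the conclusions.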

It should be noted that, as expected, time frames impact the model’s accuracy. The obtained results show that the number of input timesteps, as well as the batch size, may affect accuracy significantly. It was interesting to note that the presence of more input features led to a decrease in the number of input timesteps required by the model. While the best uni-variate models required four days of input, the multi-variate ones require just a single day. The batch size required by multi-variate models is also substantially lower when compared to uni-variate ones.

As direct answers to the elicited research questions, it can be said that (RQ1) LSTM networks have significantly better forecasting accuracy than ARIMA models; (RQ2) LSTM networks are able to accurately forecast several future timesteps; (RQ3) it has been demonstrated that recursive multi-step LSTM networks have better accuracy than multi-step vector output ones, with these last ones requiring a more complex and deeper architecture, which, in turn, increases training times; and (RQ4) the addition of more input features, namely the day of the week and the hour of the day, allowed LSTM models to behave 20% better when compared to the uni-variate ones.

It should be noted that contextual data, such as holiday periods, events or heavy precipitation, may impact traffic flow patterns. Such data may be seasonal, cyclic or episodic. In the performed experiments, such contextual data was not directly inputted to the candidate models. Nonetheless, all candidate models were conceived under the same conditions. It is also worth highlighting that, in practice, deep learning models are not deployed as standalone products. Instead, models are usually deployed within, for example, rule based systems which have, per se, the ability to penalize, or compensate, the output of the network upon the presence of events such as football games and concerts, school holidays and heavy precipitation, just to name a few. This allows a system to be able to, in real time, adjust, for example, to accidents or sudden intense precipitation.