She is a full researcher at the Andalusian Research Institute in Data Science and Computational Intelligence. Her current research interests include the application of applied mathematics to the resolution of real-world problems, decision making, theoretical computer science and operational research. In this regard, her mathematical background (from pure algebra to applied mathematics) means that Dr. García Cabello's research is characterized by the use of a wide range of mathematical tools, from stochastic processes to dynamical systems. Dr. Julia García Cabello is also a regular reviewer for journals.

The quality of the input data is amongst the decisive factors affecting the speed and effectiveness of recurrent neural network (RNN) learning. We present here a novel methodology to select optimal training data (those with the highest learning capacity) by approaching the problem from a decision-making point of view. The key idea, which underpins the design of the mathematical structure that supports the selection, is to first define a binary relation that gives preference to inputs with higher estimator abilities. The Von Neumann–Morgenstern theorem (VNM), a cornerstone of decision theory, is then applied to determine the level of efficiency of the training dataset based on the probability of success derived from a purpose-designed framework based on Markov networks. To the best of the author's knowledge, this is the first time that this result has been applied to data selection tasks. Hence, it is shown that Markov Networks, mainly known as generative models, can successfully participate in discriminative tasks when used in conjunction with the VNM theorem.

The simplicity of our design allows the selection to be carried out alongside the training. Hence, since learning progresses with only the optimal inputs, the data noise gradually disappears: the result is an improvement in performance while minimising the likelihood of overfitting.

The superiority of artificial neural networks (ANNs) in various tasks (classification, pattern identification, prediction, etc.) has led researchers to focus much of their efforts on the study of the functioning of their components from a theoretical perspective, see Higham and Higham (

The main objective of this paper is to provide a robust methodology to select optimal training datasets (those with the highest learning capacity) that can be used in any context to maximise the performance of the trained models. This methodology has been designed to run in parallel with RNN learning so that, while the RNN learning evolves progressively with only the optimal training inputs, the data noise gradually disappears. This has a positive impact on the quality of the RNN results while minimising the likelihood of overfitting.

Overfitting in data-driven learning models (which extract a predictive model from a data set) is the flaw of failing to generalise the features/patterns present in the training dataset. It occurs in models which extract features from datasets having too much noise.

The key idea in the design of the mathematical structure that supports this selection is to define, using Utility theory, a binary relation that gives preference to those datasets with higher estimator abilities. A second contribution of our work is to have designed our methodology based on tools that have not been used previously for this purpose. The novelty lies in showing that Markov Networks (MNs), widely known as generative models (Gordon and Hernandez-Lobato,

Markov Random Fields (MRFs) are also known as Markov Networks (MNs) in contexts that require highlighting the undirected-graph condition (Dynkin,

Regarding the literature review, the selection of optimal training sets has not been studied in a general framework so far. To this author's knowledge, this is the first analysis that aims to provide guidance for a general context. Published papers have studied this issue either only in contexts of ANN classification tasks or in very precise scenarios (electrical, financial or chemical engineering), taking advantage of their specific techniques. Within the first category, the genetic algorithm (GA) is widely used as a tool to create high-quality training sets as the first step in designing robust ANN classifiers, see Reeves and Taylor (

Within the second category, in the paper (Zapf and Wallek,

As for the use of MNs/MRFs (prior probability) for problems that involve posterior probability: in the literature, the terms "MNs/MRFs" and "discriminative" appear together only to refer to discriminative random fields (DRFs) or, equivalently, conditional random fields (CRFs), both types of random fields which by definition provide a posterior probability.

The rest of the paper is structured as follows: preliminaries of Section

When facing a situation of uncertainty (known as a lottery), there is a set

Mathematically, a preference relation is a binary relation ⪰ in a set

completeness: for all

transitivity: for all
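For a finite set of alternatives, the two rationality axioms can be checked mechanically. The sketch below uses a hypothetical score-based relation (any relation induced by a numeric score is automatically complete and transitive); all names are illustrative:

```python
from itertools import product

def is_complete(items, prefers):
    # Completeness: for every pair (a, b), at least one of a >= b or b >= a holds.
    return all(prefers(a, b) or prefers(b, a) for a, b in product(items, repeat=2))

def is_transitive(items, prefers):
    # Transitivity: a >= b and b >= c together imply a >= c.
    return all(not (prefers(a, b) and prefers(b, c)) or prefers(a, c)
               for a, b, c in product(items, repeat=3))

# Toy example: preference induced by a numeric score.
score = {"A": 3, "B": 2, "C": 1}
prefers = lambda x, y: score[x] >= score[y]
print(is_complete(score, prefers), is_transitive(score, prefers))  # True True
```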

The Von Neumann–Morgenstern expected utility theorem (VNM theorem) (Yang and Qiu,
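Under the VNM theorem, a rational preference over lotteries is represented by expected utility: a lottery assigning probabilities p_i to outcomes with utilities u(x_i) is evaluated as the sum of p_i · u(x_i), and the lottery with the higher expected utility is preferred. A minimal sketch with hypothetical lotteries and utilities:

```python
def expected_utility(probs, utils):
    # VNM representation: U(L) = sum_i p_i * u(x_i)
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(p * u for p, u in zip(probs, utils))

# Two lotteries over the same three outcomes, with utilities u = [0, 1, 4].
L1 = [0.2, 0.5, 0.3]
L2 = [0.1, 0.8, 0.1]
u = [0.0, 1.0, 4.0]

# L1 is preferred iff its expected utility is higher.
print(expected_utility(L1, u), expected_utility(L2, u))  # 1.7 1.2
```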

Many authors have shown, however, that in practice the axiom of independence is not fulfilled (the seminal paper (Machina,

In the paper (Machina,

Let

We use the term "site" instead of "node" for its additional connotation of location, which allows us to model each node as a different geographical place that could generate its own estimate of the variable, as explained below in Proposition

Graphical Models are commonly used to visually describe the probabilistic relationships amongst stochastic variables. Basic knowledge on GMs comprises the concepts of neighbourhood of a site and clique: sites
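As an illustration, the neighbourhood and clique notions can be sketched on a small hypothetical undirected graph (site names and edges are invented for the example):

```python
# Undirected graph as adjacency sets; a hypothetical 4-site example.
graph = {
    "s1": {"s2", "s3"},
    "s2": {"s1", "s3"},
    "s3": {"s1", "s2", "s4"},
    "s4": {"s3"},
}

def neighbourhood(site):
    # The neighbourhood of a site is the set of sites connected to it.
    return graph[site]

def is_clique(sites):
    # A clique is a set of sites that are pairwise connected.
    sites = list(sites)
    return all(b in graph[a] for i, a in enumerate(sites) for b in sites[i + 1:])

print(sorted(neighbourhood("s3")))    # ['s1', 's2', 's4']
print(is_clique({"s1", "s2", "s3"}))  # True
print(is_clique({"s1", "s4"}))        # False
```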

Dynamic graphical models (DGMs) are the time-varying version of GMs. The set of dynamic stochastic variables will be denoted by

Neural networks (NNs) have a general functional definition as composition of parametric functions which disaggregates the linear component of the non-linear activation function (see García Cabello,
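This functional definition (a composition of parametric layers, each separating a linear map from a non-linear activation) can be sketched as follows; the layer shapes and random weights are purely illustrative:

```python
import numpy as np

def layer(W, b, activation):
    # One parametric layer: x -> activation(W @ x + b), disaggregating the
    # linear component (W, b) from the non-linear activation.
    return lambda x: activation(W @ x + b)

relu = lambda z: np.maximum(z, 0.0)

rng = np.random.default_rng(0)
f1 = layer(rng.normal(size=(4, 3)), np.zeros(4), relu)
f2 = layer(rng.normal(size=(2, 4)), np.zeros(2), relu)

# The network is the composition f2 ∘ f1.
x = np.ones(3)
y = f2(f1(x))
print(y.shape)  # (2,)
```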

The loss function

RNN learning process.

In this section we will develop the theoretical framework that will enable the Markov Network methodology to perform discriminative tasks based on prior probability and the VNM Theorem.

Recall that, among neural networks as a whole, Recurrent Neural Networks (RNNs) stand out for their predictive abilities, based on their potential for processing temporal data. Thus, we are dealing with a time-dependent learning process using RNNs, where the superscript

As is well known, in RNN prediction tasks, training sets are fed with temporal sequences composed of pairs of previous data of the form (input, labels), with the objective of predicting the future, referred to as the future target value,
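The construction of such (input, label) training pairs can be sketched as a sliding window over a temporal sequence; the window length and the weekly prices below are hypothetical:

```python
def make_training_pairs(series, window):
    # Each pair is (previous `window` values, next value as the target label).
    return [(series[i:i + window], series[i + window])
            for i in range(len(series) - window)]

prices = [3.9, 4.0, 4.1, 4.0, 4.2, 4.3]   # hypothetical weekly prices (€/kg)
pairs = make_training_pairs(prices, window=3)
print(pairs[0])  # ([3.9, 4.0, 4.1], 4.0)
```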

Let

The objective is to select from the

We will list the steps to be taken:

Main lottery (in

Consider the RNN process as composed of several uncertain processes,

Secondary lottery (for each

To ensure that our selection rule is applied, we define a preference relation on

For

We thus apply Theorem

In order to compute the (prior) probabilities

In order to compute the utility

Here, we first design an abstract graph-based framework that shall provide a model for a dynamic context of

Let

Remember that two random variables are equivalent if they have identical distributions. Then, we will define the edges by equivalently defining the neighbourhood
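A minimal sketch of this construction, assuming (as a simplification that is exact for Gaussian variables) that equality of empirical mean and variance is used as a proxy for identical distributions; all site names and samples are hypothetical:

```python
from collections import defaultdict
import numpy as np

def neighbourhoods_by_distribution(samples, decimals=2):
    # Sites whose variables share the same (mean, variance) are grouped into
    # the same neighbourhood, i.e. they are treated as equivalent variables.
    groups = defaultdict(list)
    for name, xs in samples.items():
        key = (round(float(np.mean(xs)), decimals),
               round(float(np.var(xs)), decimals))
        groups[key].append(name)
    return list(groups.values())

# Hypothetical sites with sampled values.
samples = {
    "s1": [1.0, 2.0, 3.0],
    "s2": [2.0, 3.0, 1.0],   # same empirical distribution as s1
    "s3": [10.0, 10.0, 10.0],
}
print(neighbourhoods_by_distribution(samples))  # [['s1', 's2'], ['s3']]
```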

A DGM

It is worth highlighting that Definition

Recall that the marginal distribution is also known as the prior probability, in contrast with the posterior distribution (the conditional one). From the former definition, sites in the same neighbourhood have the same prior probability. Moreover,

Let

The following theorem proves then that the DGM defined in Definition

We shall prove that the local dynamic Markov property is verified, i.e. the probability of

Insofar as the distribution function is the tool used to make estimates, previous Theorem

Under Definition

Each clique has its own common estimation function:

This follows straightforwardly from the definition of a clique. □

In discrimination/classification works, the commonly used probability is the conditional or posterior probability

Inputs for a TD-MRF are specific values of the stochastic variable

TD-MRF operational process.

Figure

Moreover, the operation of an MRF applies equally to the calculation of the following probabilities:

Our goal here is to provide a graph-structure (which will become TD-MRF structure according to Theorem

To achieve this, the steps to follow are listed below:

sites

The neighbourhood of a site is defined as

Note that the deterministic nature of an RNN, which always assigns the same

In this section, we will further develop equation (

Recall that in the preceding sections we have considered a second lottery, which arises naturally when we test how good the output

First of all, it has to be shown that it is a preference relation:

The standard axiom for a preference relation is rationality which includes both completeness and transitivity.

Completeness: for all

Transitivity: for all

Moreover, the former preference relation satisfies the following conditions:

Continuity: let us prove that if each element

Let us suppose that

By assuming that both limits exist, the inequality

Fréchet differentiability: recall that the Fréchet derivative in finite-dimensional spaces is the usual derivative. Thus, the loss function

Therefore, the preference relation stated in Definition

The objective of this section is to highlight (after proving) the main results of our proposal. First of all, we define the level of efficiency of an input set

We define the level of efficiency of an input set
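A hedged sketch of how such a selection rule could operate, assuming (as a hypothetical reading of the definition) that the level of efficiency of an input set is computed as a VNM expected utility combining the prior (TD-MRF) success probabilities with utilities; all numbers below are illustrative:

```python
def level_of_efficiency(probs, utils):
    # Hypothetical reading: efficiency = VNM expected utility, sum_i p_i * u_i,
    # with p_i the prior probabilities of success and u_i the utilities.
    return sum(p * u for p, u in zip(probs, utils))

# Comparing two candidate training sets: select the one with the higher level.
set_A = level_of_efficiency([0.6, 0.4], [1.0, 0.2])
set_B = level_of_efficiency([0.3, 0.7], [1.0, 0.2])
print(set_A > set_B)  # True -> set A would be selected
```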

Following Proposition

We start from expression

This section is aimed at developing an example of the method application. In order to make a choice between two real data sets, their level of efficiency will be computed. As stated in Definition

The data sets we shall use here contain prices (€/1 kg) over time for the most common olive oil varieties. These data are available on the websites of the Government of Spain:

where prices are published on a weekly basis. Specifically, we will consider data sets corresponding to weeks 28/2023 and 39/2022 (the reasons for this choice will be explained later). We shall thus apply the above formula taking into account the following considerations. On one hand, note that the choice of the threshold

From the above formula, since

We assume that the energy functions
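Recall that an MRF prior factorises in Gibbs form, P(x) = exp(−E(x))/Z, where the energy E is a sum of clique energies and Z is the partition function. A minimal sketch on a hypothetical three-site binary field with quadratic pair energies (not the paper's actual energy functions):

```python
import math
from itertools import product

def energy(x):
    # Total energy = sum of clique energies E_C(x_C); here two pair cliques.
    x1, x2, x3 = x
    return (x1 - x2) ** 2 + (x2 - x3) ** 2

states = list(product([0, 1], repeat=3))
Z = sum(math.exp(-energy(s)) for s in states)  # partition function

def prior(x):
    # Gibbs form: P(x) = exp(-E(x)) / Z
    return math.exp(-energy(x)) / Z

# The prior is a valid distribution, and lower-energy states are more probable.
print(abs(sum(prior(s) for s in states) - 1.0) < 1e-12)  # True
print(prior((0, 0, 0)) > prior((0, 1, 0)))               # True
```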

In order to achieve our goal, each input

From the TD-MRF structure proved in Theorem

As discussed before, there are multiple factors (physical and socio-economic) that influence the price. Such factors are the features

In dry seasons, water shortages lead to a drop in the olive production and, therefore, in the olive oil production. Hence, under severe drought conditions, olive oil prices skyrocket.

The computation of the level of efficiency

Mean and variance of cliques

Mean     | 3.988 | 3.884 | 3.851 | 3.858 | 4.343 | 3.878 | 3.916
Variance | 0.172 | 0.156 | 0.143 | 0.108 | 0.472 | 0.134 | 0.117

From the information provided by the above tables, the required probability is finally computed (see Table

Aggregated probability for

Similarly, the computation of

Finally, the required probability is computed (see Table

Aggregated probability for

Hence, as expected (since the inputs of

This paper deals with the selection of optimal training sets (those that have a higher capacity as estimators) in Recurrent Neural Networks under prediction tasks (or pattern recognition with time series as inputs), although this may also apply to other data-driven models regulated by dynamic systems. Our objective is to fill the gap left by the absence of clear guidelines for selecting optimal training sets in a general context.

We design here a novel methodology to select optimal training data sets that can be used in any context. The key idea, which underpins the design of the mathematical structure that supports the selection, is a binary relation that gives preference to inputs with higher estimator abilities. A second novelty of our approach is to use dynamic tools that have not been used previously for this purpose: dynamic Markov Networks, which are widely regarded as generative models, successfully compute the prior probabilities involved in the formula for calculating the degree of efficiency of the training set (Theorem

The simplicity of this calculation allows it to be carried out in parallel with the learning process without adding computational cost. Thus, the optimal sets are selected as the learning process evolves; the data noise therefore gradually disappears, which decreases the likelihood of overfitting.