1 Introduction
With the rapid development of deep learning (DL), the issue of insufficient datasets has become increasingly prominent. The performance of deep neural networks (DNN) depends strongly on the quantity, quality, and variety of training data (Sarker, 2021). This problem is particularly evident in mechanical engineering, where machine learning (ML) is widely used. In experimental scenarios, obtaining sufficient data to train robust models can be difficult and expensive. The reliance of DL models on large training datasets highlights the need to obtain sufficient data while lowering data collection costs.
Our research focuses on applying DL models to the classification of conveyor belt (CB) states (damaged CB and loaded with 0.5 kg, 1 kg, 2 kg, 3 kg, or 5 kg), specifically using belt tension time series signals. DL methods for classifying CB states from images have already been widely researched, but methods for classifying CB tension signals remain limited. Previous studies have demonstrated the potential of DL models such as LSTM for this purpose (Žvirblis et al., 2022).
In the industrial sector, CB systems are among the essential elements of production processes, enabling smooth transportation of various items. Depending on the specific industrial application, CB systems must meet certain criteria and requirements, such as sterility in the food industry (Klištincová et al., 2024) or high wear resistance and durability (Bortnowski et al., 2022b). The reliability and efficiency of these conveyors are important for optimizing work processes and avoiding unplanned stops. An integral aspect of the maintenance of conveyor systems is monitoring their operational status to ensure correct functioning of the system and timely detection of potential faults (Dąbek et al., 2023). Traditional CB monitoring methods, such as manual, spectral, or radiographic damage detection (Li et al., 2011), are usually expensive, labour-intensive, and prone to human error.
Monitoring the status of CB systems is a critical aspect of their operational efficiency and safety. In the past, classification tasks in this area were performed using conventional ML algorithms and shallow models such as logistic regression and decision trees (Andrejiova et al., 2021). Over the last decade, DL methods have been applied increasingly widely because of their higher accuracy and efficiency. Santos et al. (2020) introduced binary classification models that used CB images. Classification was performed using deep convolutional neural networks (CNN) such as the visual geometry group (VGG) network, residual network (ResNet), and densely connected convolutional network (DenseNet). The highest average classification accuracy (89.8%) on their data was achieved using the DenseNet model. Zhang et al. (2021) performed a detailed analysis of ML algorithms and a comparison of DL models such as region-based CNN (R-CNN), single-shot detector (SSD), receptive field block net (RFBNet), Yolov3, and Yolov4 for the classification of CB damage images. The authors' improved Yolov3 architecture achieved an average classification accuracy of 97.3% over four damage classes.
Recent research has further revealed the application potential of DNNs in CB monitoring systems. Wang et al. (2023) presented a computer vision model capable of identifying CB defects with 94% accuracy, but this model was very sensitive to environmental effects and image quality. In another study, Bortnowski et al. (2022a) presented a long short-term memory (LSTM) autoencoder for automating damage detection using recorded CB vibration signals. However, this model was not adapted to detect different types of CB damage. In addition, the vibration signals used in the study may be volatile due to various factors, such as load conditions or conveyor operating speed, which may affect the accuracy of the monitoring system.
Žvirblis et al. (2022) used CB tension signal data to train various ML and DL models and to determine the minimum signal length at which high classification accuracy is maintained. However, in that study, DL models were only applied to classify two CB states (loaded with a 2 kg weight and unloaded), and the detection of CB damage was not included. Moreover, since the initial dataset of CB tension signals was insufficient to train the models, the authors applied two data augmentation methods: the addition of random Laplace noise and of drifted Gaussian noise. The main aim of that study, however, was to develop high-accuracy classification models.
Achieving high classification accuracy with DL models requires a sufficiently large and diverse dataset. Collecting large amounts of CB tension time series data can be difficult and expensive; therefore, data augmentation methods can be used to increase the amount and variety of data. Data augmentation involves creating new data that is modified or synthesized from the original dataset, enabling the model to generalize better and recognize features in unseen data. Data augmentation techniques are widely studied in computer vision and natural language processing, but their application to time series data is still developing.
For time series data, traditional image augmentation methods such as scaling, rotating, or cropping are often unsuitable because of the temporal dependence in the data. Improper time series augmentations can negatively affect the accuracy and robustness of the model in real-world scenarios. In previous research, some of the most effective time series augmentation methods were the application of a sliding window, the addition of noise, and the synthesis of data using variational autoencoders (VAE) (Kingma and Welling, 2019) or generative adversarial networks (GAN) (Goodfellow et al., 2014).
Data augmentation methods in DL aim to provide effective strategies for improving the accuracy and robustness of models trained on limited or unbalanced datasets. Several works on data augmentation have recently been published (Chlap et al., 2021; Wang et al., 2017a). However, most of these works address image augmentation for computer vision models.
Below, this work reviews various time series data augmentation methods that could expand the available dataset and potentially improve the classification accuracy and robustness of DL models.
Raw time series data usually forms one long series, and the sliding window method can be used to generate more training samples from it. This augmentation method was used by Žvirblis et al. (2022) in a study where the initial CB tension signal data was processed in two stages. In the first stage, a sliding window was used to divide the original signal into smaller signals and thereby expand the dataset. One disadvantage of the sliding window method is that dividing the original signal into smaller segments can cut off important features of the time series. Because of this, a DL model may fail to learn to classify time series with small window sizes.
Added noise simulates real measurement conditions, since equipment such as signal sensors introduces noise into observations; it therefore makes DL models more robust to small variations in the data. Laplace, drifted Gaussian, and uniform noise are commonly used for augmenting time series data (Um et al., 2017; Iwana and Uchida, 2021; Žvirblis et al., 2022). However, adding too much noise to the original dataset can hinder the DL model's ability to extract features from the signal. For this reason, it is important to study the influence of different amounts of noise on the model's classification accuracy.
In time series augmentation, scaling means changing the amplitude or extent of the original series. Scaling-based augmentation methods for time series include magnitude warping, time warping, window warping, and frequency warping (Iwana and Uchida, 2021; Um et al., 2017). However, these augmentations can over-distort important features of the data, so it is important to choose methods appropriate for a specific dataset.
VAEs can also be used for augmentation of time series data (Kingma and Welling, 2019; Goubeaud et al., 2021). Desai et al. (2021) presented a time series variational autoencoder architecture (TimeVAE), which was compared with other time series synthesis models such as TimeGAN (Yoon et al., 2019). On average, TimeVAE synthesized time series data more accurately than the other models, especially on small datasets.
GANs are another widely researched framework for data augmentation and synthesis, including for time series data (Goodfellow et al., 2014). A GAN learns the distribution of the data by extracting its key features; the trained generator can then synthesize completely new data.
Currently, there are many different GAN architectures for time series data (Huang et al., 2023; Iglesias et al., 2023). TimeGAN, one of the most widely used, adds embedder and recovery networks to the conventional generator-discriminator architecture (Yoon et al., 2019). These embedding and recovery networks form an autoencoder within TimeGAN that aims to learn the temporal dependence and key features of the data. The autoencoder uses a reconstruction loss function, which ensures that the network can accurately recover the original time series from the latent space.
Conditional GANs extend traditional architectures by incorporating conditional information into the training process. This allows the network to generate more accurate data based on specific inputs, such as class labels. The conditional time series GAN architecture (TSGAN) has achieved higher accuracy in synthesizing time series data for classification tasks than other GAN architectures (Smith and Smith, 2020). TSGAN was tested on 70 datasets and compared with the Wasserstein Generative Adversarial Network (WGAN) architecture (Gregor Hartmann et al., 2018). The accuracy of the data synthesized by TSGAN was higher than that of WGAN by about 11% on average.
There are many DNN architectures and methods for augmenting time series data, but not all of them can be adapted to a specific dataset. This work aims to develop DL models for time series, apply data augmentation methods to CB tension signal data, and investigate the influence of these methods on the accuracy of CB state classification.
The rest of the paper is organized as follows. Materials and methods are described in Section 2. The main study results are provided in Section 3. Conclusions close the paper in Section 4.
2 Materials and Methods
The aim of this study on the classification of CB load and defect states is not only to compare the classification accuracy of different DNN models but also to assess how the accuracy of the models is affected by different data augmentation methods. This section describes the selected time series augmentation methods applied to CB tension signals, as well as the selected DL models, accuracy metrics, and data acquisition.
2.1 Experimental Design and Data Collection
The test stand on which the measurements were carried out was a CB model, shown in Fig. 1. Its supporting structure consists of four self-aligning ball bearing units; two drums, drive and return, are embedded in the inner raceways, and strain gauges T1, T2, and T3 are located on them. The housings were connected by threaded rods and bolted with nuts, and sets of lenticular washers were used between the surfaces of the bearing housings on both sides to compensate for curvature.
The system uses strain gauges whose resistance ${R_{T}}$ depends directly on the resultant belt force ${F_{n}}$. During dynamic testing of the CB (Fig. 2), the speed of the drive drum is set at v, and the recorded waveforms of the strain gauge signals depend on the pre-tension of the belt, longitudinal damage UW, transverse damage UP, and the load m. The strain gauge sensors have nonlinear characteristics, and the readings depend nonlinearly on where the belt presses on the belt strain gauge girding the shaft. An analog-to-digital converter (ADC) circuit receives the analog data, converts it to digital form, and sends it via Bluetooth to a computer. During conversion, the signal is discretized and sampled at a frequency of 200 Hz. The digital signal received by the computer is represented in analog-to-digital units (ADUs), and the acquired values are subject to rounding and linearization.
Fig. 2
CB condition monitoring system.
Fig. 3
CB damage diagram. Here UW I is a longitudinal cut of 50 mm, UW II is a longitudinal cut of 70 mm, UW III is a longitudinal cut of 45 mm with a depth of 1 mm, UW IV is a longitudinal cut of 50 mm with a depth of 1.5 mm and UP I is a cross cut of 10 mm.
The main purpose of data collection was to observe and analyse the influence of various belt loads and defects on CB tension signals. The observations were collected using three strain gauge sensors placed in parallel at different sections of the CB (top, middle, and bottom) to fully record the belt's strain signal. For this reason, further data collection, analysis, and model building were done on a multi-domain data input structure. Observations were carried out in two stages. In the first stage, observations were made with the conveyor loaded with one of five different weights: 0.5 kg, 1 kg, 2 kg, 3 kg, or 5 kg. Each weight category was designed to simulate different loading conditions on the CB. In the second stage, the CB was intentionally damaged in certain places to simulate defects under real conditions, as shown in Fig. 3. During this stage, CB tension signals were recorded with belt damage but without any weight load.
Fig. 4
Sensor signal values in different conditions of the conveyor belt: a) first lower sensor; b) second middle sensor; c) third upper sensor.
Fig. 5
The scheme of experiment.
To simulate different rotational speeds of the CB, three revolutions-per-minute (RPM) settings were chosen: 159, 318, and 540, corresponding to linear speeds of 0.5 m/s, 1 m/s, and 1.7 m/s, respectively. Each observation was performed 9 times (3 times for each RPM), and each observation lasted about 8 seconds on average.
To gain more insight into the collected data, the mean amplitudes of the tension signals under different weight loads and defect states were calculated and are presented in Fig. 4. The damaged CB signal has the highest average amplitude of all the conditions and is therefore clearly identifiable. The higher amplitude can be explained by defects and irregularities on the CB surface, which cause larger fluctuations in the tension signals. It can also be observed that, among the weight loads, the average tension amplitude was lowest for the maximum load of 5 kg and highest for 0.5 kg.
Augmentation of the collected data and the other steps of the experiment are shown in Fig. 5.
2.2 Data Augmentation Techniques
Testing a wide range of data augmentation techniques is crucial in DL because different augmentations can have varied and sometimes profound impacts on model performance, generalization, and robustness. Data augmentation helps to prevent overfitting by exposing the model to diverse data transformations, which simulates the variability it may encounter in real-world scenarios. Different data types or domains may benefit from specific augmentations. For instance, noise injections can be useful for vibration or sound data. Testing various augmentations helps identify those that address domain-specific challenges effectively, enhancing the model’s adaptability.
We applied a range of data augmentation techniques, both traditional and advanced. These included:
• Basic augmentations: sliding window.
• Advanced augmentations: random Laplace noise, drifted Gaussian noise, uniform noise, and magnitude warping.
• Generative augmentations: variational autoencoders.
A sliding window splits the data into multiple shorter time series. The sliding window used in this work is defined as
\[ W[i]=\{{x_{1+i}},{x_{2+i}},\dots ,{x_{m+i}}\}, \]
where a time series of size k is split into multiple time series of length m, and $W[i]$ represents the i-th window, starting at index $1+i$ and ending at index $m+i$.
To expand the dataset of CB signals and to standardize their length, the signals of all original observations were divided into 0.5 s (200-point), 1.0 s (400-point), and 1.5 s (600-point) segments. The step of each sliding window was 100% of the window size itself; for example, the step for a 0.5 s signal is 0.5 s (200 points). This was done to create enough different signals for training the models and to avoid overfitting.
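For concreteness, a minimal NumPy sketch of this non-overlapping windowing is given below; the function name and example sizes are illustrative.

```python
import numpy as np

def sliding_window(signal: np.ndarray, m: int) -> np.ndarray:
    """Split a 1-D signal into consecutive non-overlapping windows of length m
    (the step equals the window size, as in this study)."""
    n_windows = len(signal) // m
    return signal[:n_windows * m].reshape(n_windows, m)

# Example: an ~8 s recording sampled at 200 Hz gives 1600 points,
# i.e. eight 0.5 s (200-point) windows.
raw = np.random.randn(1600)
windows = sliding_window(raw, 200)  # shape: (8, 200)
```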
Random Laplace noise is based on sampling random values from a Laplace distribution, characterized by the probability density function
\[ f(x\mid \mu ,\sigma )=\frac{1}{2\sigma }\exp \left(-\frac{|x-\mu |}{\sigma }\right), \]
where μ is the mean of the distribution and σ is the scale parameter controlling the width of the distribution.
Drifted Gaussian noise adds a random value from a Gaussian (normal) distribution to each point of the signal. The Gaussian distribution is characterized by the probability density function
\[ f(x\mid \mu ,\sigma )=\frac{1}{\sigma \sqrt{2\pi }}\exp \left(-\frac{{(x-\mu )}^{2}}{2{\sigma ^{2}}}\right), \]
where μ is the mean of the distribution and σ is the standard deviation.
Uniform noise generates values from a uniform distribution, characterized by a density function expressing the equal likelihood of any value within the specified interval $[a;b]$:
\[ f(x)=\left\{\begin{array}{ll}\frac{1}{b-a},& a\leqslant x\leqslant b,\\ 0,& \text{otherwise},\end{array}\right. \]
where a and b are the lowest and highest values of x, respectively.
Magnitude warping of time series data involves randomly scaling certain segments of the data. To perform the deformations, knots $u={u_{1}},{u_{2}},\dots ,{u_{i}}$ are generated randomly from a Gaussian distribution, and the scaling curve is defined by the cubic spline interpolation $S(x)$ of the knots (Iglesias et al., 2023). The magnitude warping of a time series $x=\{{x_{1}},{x_{2}},\dots ,{x_{i}}\}$ is represented by the formula
\[ \hat{x}=\{{\alpha _{1}}{x_{1}},{\alpha _{2}}{x_{2}},\dots ,{\alpha _{i}}{x_{i}}\}, \]
where $\alpha =\{{\alpha _{1}},{\alpha _{2}},\dots ,{\alpha _{i}}\}=S(x)$ is the cubic spline interpolation of the knots.
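A compact sketch of magnitude warping with SciPy's cubic spline interpolation follows; the number of knots and the knot standard deviation are illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import CubicSpline

rng = np.random.default_rng(seed=42)

def magnitude_warp(x, n_knots=4, sigma=0.2):
    """Multiply the signal by a smooth random curve: knot values are drawn
    from N(1, sigma^2) and interpolated with a cubic spline S."""
    t = np.arange(len(x))
    knot_pos = np.linspace(0, len(x) - 1, num=n_knots + 2)
    knot_val = rng.normal(loc=1.0, scale=sigma, size=n_knots + 2)
    alpha = CubicSpline(knot_pos, knot_val)(t)  # alpha = S(x) in the formula
    return x * alpha
```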
The TimeVAE architecture is another method for augmenting CB tension signals (Desai et al., 2021). TimeVAE is trained using the evidence lower bound (ELBO) loss function
\[ \mathcal{L}(\theta ,\phi ;x)=-{\mathbb{E}_{{q_{\phi }}(z|x)}}[\log {p_{\theta }}(x|z)]+{D_{\mathrm{KL}}}({q_{\phi }}(z|x)||{p_{\theta }}(z)), \]
where the first term $-{\mathbb{E}_{{q_{\phi }}(z|x)}}[\log {p_{\theta }}(x|z)]$ is the reconstruction loss, which measures how accurately the model reconstructs the input data. It is the expected log-likelihood over the variable z drawn from the distribution ${q_{\phi }}(z|x)$, the encoded latent distribution for the input x. The second term ${D_{\mathrm{KL}}}({q_{\phi }}(z|x)||{p_{\theta }}(z))$ is the Kullback-Leibler divergence between the ${q_{\phi }}(z|x)$ and ${p_{\theta }}(z)$ distributions. This regularization term ensures that the learned latent space remains similar to the prior distribution. In TimeVAE, the variable z is sampled from a Gaussian distribution and passed to the decoder, which makes the VAE decoder generative.
The encoder passes the input through a one-dimensional convolutional layer with a ReLU activation function. The output is flattened and connected to the encoder's fully connected layer. The encoder's output parameters are used to construct the Gaussian distribution from which the variable z is drawn. This variable is then passed to the decoder, which consists of fully connected, convolutional, and time-distributed concatenation layers. The decoder output has the same shape as the input data.
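The sampling step and the ELBO above can be sketched in TensorFlow as follows; the squared-error reconstruction term stands in for the negative log-likelihood of a Gaussian decoder and is an assumption of this sketch.

```python
import tensorflow as tf

def sample_z(z_mean, z_log_var):
    # Reparameterization trick: z = mu + sigma * eps keeps sampling differentiable.
    eps = tf.random.normal(tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * eps

def elbo_loss(x, x_rec, z_mean, z_log_var):
    # Reconstruction term over (batch, time, channels) inputs: squared error
    # as a stand-in for -log p_theta(x|z) under a Gaussian decoder.
    recon = tf.reduce_sum(tf.square(x - x_rec), axis=[1, 2])
    # Closed-form KL divergence between q_phi(z|x) = N(mu, diag(exp(log_var)))
    # and the standard normal prior p(z) = N(0, I).
    kl = -0.5 * tf.reduce_sum(
        1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1)
    return tf.reduce_mean(recon + kl)
```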
2.3 Deep Learning Model Architectures
To ensure the broad applicability of the findings, multiple DL models were tested, including a fully convolutional network (FCN), a convolutional neural network combined with a long short-term memory network (CNN-LSTM), a residual network (ResNet), and an inception network (InceptionTime). All the models were built to classify six CB states: five load conditions (0.5 kg, 1 kg, 2 kg, 3 kg, and 5 kg) and one damaged-belt condition.
All the models were built using the Python programming language with the TensorFlow and Keras DL libraries. The Kaggle platform was used for executing the experiments, and an NVIDIA Tesla P100 graphics processor was used for model training. In addition, each model was trained using the cross-entropy loss function and the Adam optimizer with a learning rate of $\eta =0.001$, as illustrated in the sketch below. The following subsections present the detailed architecture and parameter configuration of each model.
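A minimal Keras sketch of this shared training configuration is given below; the one-hot label encoding implied by `categorical_crossentropy` is an assumption.

```python
import tensorflow as tf

def compile_model(model: tf.keras.Model) -> tf.keras.Model:
    # Shared training configuration used for all models in this study:
    # cross-entropy loss and the Adam optimizer with a learning rate of 0.001.
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss="categorical_crossentropy",  # assumes one-hot encoded CB state labels
        metrics=["accuracy"],
    )
    return model
```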
2.3.1 FCN Model
FCN is an architecture based on deep CNNs originally developed for image segmentation (Long et al., 2015). For time series, the FCN architecture can be used for feature extraction. In the output layer, classification can be performed using either the exponential normalization (softmax) or the sigmoid activation function (Wang et al., 2017b). The basic block of the FCN architecture consists of a convolutional layer followed by a batch normalization layer and a rectified linear unit (ReLU) activation layer. During training, the batch normalization layer accelerates gradient convergence and improves the model's robustness. Batch normalization is given by the formula
\[ \mathrm{BN}(x)=\gamma \frac{x-\mu }{\sqrt{{\sigma ^{2}}+\epsilon }}+\beta , \]
where x is the input, μ is the mini-batch mean, σ is the mini-batch standard deviation, ϵ is the numerical stability constant, γ is the learned scale parameter, and β is the learned shift parameter.
The mathematical expression of the FCN architecture block is given in formulas (1), (2), and (3):
\[ y=W\ast x+b,\hspace{2em}(1) \]
\[ s=\mathrm{BN}(y),\hspace{2em}(2) \]
\[ h=\mathrm{ReLU}(s),\hspace{2em}(3) \]
where ∗ denotes the convolution operation, x is the input data, W is the convolution layer kernel, b is the bias, BN denotes the batch normalization operation, and ReLU is the rectified linear unit operation.
The final FCN is formed by concatenating three convolutional blocks. After these blocks, the extracted features are passed to a global average pooling (GAP) layer, which reduces the feature map size (Hsiao et al., 2019). The GAP layer is preferable to a traditional fully connected layer because it significantly reduces the number of weights and helps the model avoid overfitting. The final layer uses the softmax or sigmoid activation function.
The architecture of the built FCN model is shown in Fig. 6. The model is composed of three one-dimensional convolutional layers with 64, 128, and 64 filters, respectively, each with a filter size of $3\times 1$. The choice of 64, 128, and 64 filters was based on previous studies that demonstrated good accuracy in time series classification tasks using a similar FCN model (Wang et al., 2017b). The final layers consist of a GAP layer and a fully connected layer with a softmax activation function. The resulting FCN model was the smallest among all the models built in this study, with only 26,437 trainable parameters.
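A Keras sketch of the described FCN follows; the input shape (0.5 s windows from three sensors) is an assumption, so the exact trainable parameter count may differ from the reported 26,437.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_fcn(input_shape=(200, 3), n_classes=6):
    """FCN as described above: three Conv1D blocks (64, 128, 64 filters,
    kernel size 3) with batch normalization and ReLU, then GAP + softmax."""
    inputs = tf.keras.Input(shape=input_shape)  # 0.5 s at 200 Hz, 3 sensors (assumed)
    x = inputs
    for n_filters in (64, 128, 64):
        x = layers.Conv1D(n_filters, kernel_size=3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```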
Fig. 6
FCN model architecture.
2.3.2 CNN-LSTM Model
The architecture and parameter configuration of the second, hybrid CNN-LSTM model is shown in Fig. 7. The model is composed of four one-dimensional convolutional blocks followed by two LSTM layers. The number of filters in the convolutional blocks decreases from 512 to 8. The convolutional blocks use batch normalization, a ReLU activation function, average pooling, and dropout layers. Various filter configurations were tested empirically, and the chosen structure provided the best classification accuracy. The convolutional blocks are followed by two LSTM layers, each composed of 16 cells. The final layer is fully connected and uses the softmax activation function. The number of trainable parameters of the CNN-LSTM model ranged from 226,077 for 0.5 s signals to 245,533 for 2.0 s signals.
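A sketch of the hybrid architecture is given below; the intermediate filter counts, kernel size, pooling size, and dropout rate are illustrative assumptions, as the text fixes only the 512-to-8 filter range and the two 16-cell LSTM layers.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cnn_lstm(input_shape=(200, 3), n_classes=6):
    """Hybrid sketch: four Conv1D blocks with filter counts decreasing from
    512 to 8, then two 16-cell LSTM layers and a softmax output."""
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for n_filters in (512, 128, 32, 8):  # "decreases from 512 to 8"
        x = layers.Conv1D(n_filters, kernel_size=3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        x = layers.AveragePooling1D(pool_size=2)(x)
        x = layers.Dropout(0.2)(x)
    x = layers.LSTM(16, return_sequences=True)(x)
    x = layers.LSTM(16)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```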
Fig. 7
CNN-LSTM model architecture.
2.3.3 ResNet Model
The architecture and parameter configuration of the ResNet model is shown in Fig. 8. The model consists of three residual blocks. Each one-dimensional convolution in the first residual block has 64 filters; in the second and third blocks, 128 filters. The selected filter counts were based on previous studies that applied the ResNet architecture to time series classification (Wang et al., 2017b). In the architecture diagram, arrows with a plus sign represent skip connections. The last two layers of the model consist of a GAP layer and a fully connected layer with a softmax activation function. The ResNet model had the largest number of parameters among all the built models, with 508,357 trainable parameters.
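A sketch of a single residual block is given below; the kernel sizes (8, 5, 3) follow the time series ResNet of Wang et al. (2017b) and are assumptions with respect to the exact model built here.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, n_filters):
    """One residual block sketch: three Conv1D/BN stages plus a 1x1-convolution
    shortcut (the plus-sign arrows in Fig. 8)."""
    shortcut = layers.Conv1D(n_filters, kernel_size=1, padding="same")(x)
    shortcut = layers.BatchNormalization()(shortcut)
    for i, kernel_size in enumerate((8, 5, 3)):
        x = layers.Conv1D(n_filters, kernel_size, padding="same")(x)
        x = layers.BatchNormalization()(x)
        if i < 2:  # ReLU after the first two stages; the third joins the shortcut first
            x = layers.Activation("relu")(x)
    return layers.Activation("relu")(layers.add([x, shortcut]))

# The full model stacks blocks with 64, 128, and 128 filters,
# followed by GAP and a softmax layer, as described above.
```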
2.3.4 InceptionTime Model
The architecture and parameter configuration of the InceptionTime model with six inception modules is shown in Fig. 9. Every third module is connected by ResNet-style skip connections. The final two layers of the model include a GAP layer and a fully connected layer with a softmax activation function. The InceptionTime model has a total of 427,685 trainable parameters.
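A sketch of one inception module follows; the bottleneck and the branch kernel sizes (10, 20, 40) follow common InceptionTime implementations and are assumptions with respect to the model built here.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, n_filters=32):
    """Inception module sketch: a 1x1 bottleneck convolution feeding parallel
    convolutions of different kernel sizes, plus a max-pool branch."""
    bottleneck = layers.Conv1D(n_filters, 1, padding="same", use_bias=False)(x)
    branches = [
        layers.Conv1D(n_filters, k, padding="same", use_bias=False)(bottleneck)
        for k in (10, 20, 40)
    ]
    pooled = layers.MaxPooling1D(pool_size=3, strides=1, padding="same")(x)
    branches.append(layers.Conv1D(n_filters, 1, padding="same", use_bias=False)(pooled))
    x = layers.Concatenate(axis=-1)(branches)
    x = layers.BatchNormalization()(x)
    return layers.Activation("relu")(x)
```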
Fig. 8
ResNet model architecture.
Fig. 9
InceptionTime model architecture.
2.4 Classification Accuracy Metrics
Overall classification accuracy is the ratio of correct predictions to all predictions. Expressed as a percentage, it is given by
\[ \mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}\cdot 100\% , \]
where TP represents true positives, TN true negatives, FP false positives, and FN false negatives.
The accuracy of each experiment was measured more than once, so it is important to estimate the error in order to compare samples of different accuracy. The standard error (SE) of the mean indicates how representative the sample is of the population and supports drawing reasonable conclusions from it. SE was calculated by the formula
\[ SE=\frac{s}{\sqrt{n}}, \]
where s is the sample standard deviation and n is the sample size.
Classification accuracy alone can lead to misinterpretation of results when the dataset is unbalanced. For this reason, recall and precision metrics of the classification model can also be calculated. Expressed as percentages,
\[ \mathrm{Recall}=\frac{TP}{TP+FN}\cdot 100\% ,\hspace{2em}\mathrm{Precision}=\frac{TP}{TP+FP}\cdot 100\% , \]
where recall shows whether the model can find all the instances of each class, and precision shows how often positive predictions are correct.
In addition, the F1-score, the harmonic mean of recall and precision, can be used to assess classification accuracy on unbalanced datasets. Expressed as a percentage,
\[ F1=2\cdot \frac{\mathrm{Precision}\cdot \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}, \]
where Recall is the proportion of actual positives correctly identified and Precision is the proportion of positive predictions that were correct.
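These metrics can be computed, for example, with scikit-learn; macro averaging over the six CB states is an assumption of this sketch.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

def report(y_true, y_pred):
    # Macro averaging weights the six CB states equally, which matters
    # when the dataset is unbalanced.
    return {
        "accuracy (%)":  100 * accuracy_score(y_true, y_pred),
        "recall (%)":    100 * recall_score(y_true, y_pred, average="macro"),
        "precision (%)": 100 * precision_score(y_true, y_pred, average="macro"),
        "F1 (%)":        100 * f1_score(y_true, y_pred, average="macro"),
    }

def standard_error(accuracies):
    # SE of the mean accuracy over repeated runs: s / sqrt(n).
    a = np.asarray(accuracies, dtype=float)
    return a.std(ddof=1) / np.sqrt(len(a))
```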
4 Conclusion
In this research, we examined existing DL algorithms and DNN architectures for classifying CB states. In addition, various time series data augmentation methods applicable to CB tension signals were reviewed.
This study developed and evaluated several DNN models for the classification of CB load and defect states from tension signals, based on the FCN, ResNet, InceptionTime, and CNN-LSTM architectures. The FCN model classified CB states with an accuracy of $92.6\% \pm 1.54\% $, making it the most accurate of the studied models. The ResNet and InceptionTime models also performed well, with accuracies of $92.1\% \pm 1.47\% $ and $92.5\% \pm 0.71\% $, respectively. The CNN-LSTM model demonstrated the worst results, with a maximum accuracy of only $75.7\% \pm 0.88\% $.
The impact of various data augmentation methods on classification accuracy was also analysed. The combined addition of Laplace and drifted Gaussian noise increased the baseline (no augmentation) accuracy of the FCN model by 4.5%, to $92.6\% \pm 1.54\% $. Adding Laplace and uniform noise increased the accuracy of the InceptionTime model by 3.4%, to $92.5\% \pm 0.71\% $. The classification accuracy of the CNN-LSTM model trained with signals generated by TimeVAE increased by 11.4%, to $75.7\% \pm 0.88\% $, but still remained much lower than that of the other models. The baseline accuracy of the ResNet model increased by only 1%, to $92.1\% \pm 1.47\% $, after training with data augmented by drifted Gaussian noise.
These results underline the effectiveness of applying data augmentation to small CB tension signal datasets, enhancing the classification accuracy of the FCN- and InceptionTime-based models. In classifying CB states, the FCN-based model showed higher accuracy and speed than the other models despite having the fewest trainable parameters. The successful application of the FCN model demonstrates the importance of selecting and optimizing the right architecture for specific data and classification tasks.
A limitation of this study is that CB status classification was performed under fixed loads and rotation speeds. A set of fixed parameters does not reflect real-world conditions, and future investigations should address CB status classification under unseen experimental parameters. The empirical selection of model parameters can also be considered a limitation of this study.
Further research on CB state classification could aim to improve accuracy in classifying weights of similar mass (1 kg, 2 kg, 3 kg). Additionally, future research could explore advanced generative data augmentation techniques, for example, those utilizing GANs or other VAE architectures to enhance the quality of CB tension signal data.