The widespread use of sensors has resulted in an unprecedented amount of time series data. Time series mining has experienced a particular surge of interest, among which, subsequence matching is one of the most primary problem that serves as a foundation for many time series data mining techniques, such as anomaly detection and classification. In literature there exist many works to study this problem. However, in many real applications, it is uneasy for users to accurately and clearly elaborate the query intuition with a single query sequence. Consequently, in this paper, we address this issue by allowing users to submit a small query set, instead of a single query. The multiple queries can embody the query intuition better. In particular, we first propose a novel probability-based representation of the query set. A common segmentation is generated which can approximate the queries well, in which each segment is described by some features. For each feature, the corresponding values of multiple queries are represented as a Gaussian distribution. Then, based on the representation, we design a novel distance function to measure the similarity of one subsequence to the multiple queries. Also, we propose a breadth-first search strategy to find out similar subsequences. We have conducted extensive experiments on both synthetic and real datasets, and the results verify the superiority of our approach.

In recent years, the plummet in the cost of sensors and storage devices has resulted in massive time series data being captured and has substantially driven the need for the analysis of time series data. Among various analysis and applications, the problem of subsequence matching is of primordial importance, which serves as the foundation for many other data mining techniques, such as anomaly detection (Boniol and Palpanas,

Specifically, given a long time series

Another type of works define the query in a more flexible way. Query-by-Sketch (Muthumanickam

Nevertheless, in many real applications, users are unable to accurately and clearly elaborate the query intuition with a

EOG pattern.

Signal interference pattern.

The above examples clearly demonstrate the limitation of the single query mechanism. Although a single query can express the shape the user is interested with, it is not enough to express the extent of time shifting and amplitude scaling, as well as the error range.

To solve this problem, in this paper, we propose a multiple query approach. Compared to a single query, submitting a small number of queries together can express the query intuition more accurately. Consider the example in Fig.

We first propose a

Our contributions can be summarized as follows:

We analyse the problem of expressing the query intuition, and propose a multi-query mechanism to solve this problem.

We present a probability-based representation of the multiple queries, as well as the corresponding distance between it and any subsequence.

We present a breadth-first search algorithm to efficiently search for similar subsequences.

We conduct extensive experiments on both synthetic and real datasets to verify the effectiveness and efficiency of our approach.

The rest of the paper is organized as follows. The related works are reviewed in Section

In last two decades, the problem of subsequence matching has been extensively studied. Existing approaches can be classified into two groups:

In summary, up to now, no existing work has attempted to express the query intuition via the multi-query mechanism.

In this section, we begin by introducing all the necessary definitions and notations, followed by a formal problem statement.

In this work, we are dealing with time series. A time series

Given a time series

The two representative distance measures are

Euclidean Distance:

Dynamic Time Warping:

In this paper, instead of processing one single query, we attempt to find out subsequences similar to multiple queries. That is, given a set of queries,

Since each query sequence varies in length, we do not impose a constraint on the length of subsequences in

In this section, we first present the probability-based representation of the query set, and then propose a distance definition based on the representation.

In this paper, instead of processing the queries in

In many real world applications, the meaningful query sequence can be approximately represented as a sequence of line segments. Recall the EOG pattern in Fig.

In response, we propose a two-step approach to represent the bundle of queries together. In the first step, we represent each single query

In step one, we perform a traditional segmentation. We use a bottom-up approach to convert the query

For each line segment

After obtaining

Specifically, given query set

Formally, we denote the representation of

Now we introduce our approach to generate

Given the fact that the query representation consists of

Formally, to define the distance between subsequence

In this section, we introduce the search process. Obviously, it is exhaustive to find out the best segmentation of all subsequences in the time series

Candidate generation. Given the submitted query set

Post-processing. All subsequences in

We first introduce our search strategy to generate the candidate set

In the second round, we obtain candidate set

After

Now we introduce our approach to generate possible candidates in each round.

In the first round, we enumerate all subsequences

In the

Note that the subsequences in

The objective of the segmentation is to minimize the distance between

Afterwards, we simply sort the remaining subsequences in

In this section, we propose two optimization strategies to further accelerate the search process.

In the search process, for each candidate, we adopt linear regression to find out the best line segment

However, to process extremely long sequences, the storage overheads are unaffordable. As a consequence, we propose to split the time series into several blocks and conduct subsequence matching within each block sequentially and summarize the results in the end. Obviously, this approach reduces memory consumption while being accurate and efficient.

Up to now, we have generated the candidates segment-by-segment sequentially. However, different standard deviation of the feature values of different segments in

The overall process of our approach can be divided into two phases: query representation and query processing.

Before the two phases, we have to scan the whole time series

Then, in the first phase, we scan all the queries and perform traditional piecewise linear approximations. Thanks to the basic aggregates, the time cost of conducting linear regression is negligible. This phase is then

In the second phase, we first assume

Since

In this section, we conduct extensive experiments to verify the effectiveness and efficiency of our approach. All experiments are run on a PC with Intel Core i7-7700K CPU (8 cores @ 4.2 GHz) and 16 GB RAM.

We generate the synthetic sequence as follows. First, we generate a long random walk sequence

Two fundamental patterns.

(a) W-wave |

(b) Backward wave |

We generate two types of patterns, W-wave from the Adiac dataset in UCR archive (Dau

Length: It is obvious that the W-wave in Fig.

Amplitude: Similar to the case of length scaling, we increase or decrease the amplitude of the middle two segments in the W-wave. We still use a factor

Shape: For the backward wave shown in Fig.

Influence of variations.

Length | ||||

Amplitude | ||||

Shape |

For each type of variation, we test our approach under different extents. Take the length variation as an example. To generate a dataset, we first set a parameter

For each variation, given the fixed

A summary of the synthetic datasets.

Dataset | Pattern | Variation | Length |

1 | W-wave | Length | 0.7 million |

2 | W-wave | Amplitude | 0.7 million |

3 | Backward wave | Shape | 0.4 million |

The real dataset is the train monitoring dataset, which is the time series collected by the vibration sensor. Its total length is 15 million. There exist more than 100 interference subsequences that vary in length and amplitude. The length of these subsequences is within the range of 200 to 2500. Consequently, we maintain a query set of size 15 whose lengths almost distribute uniformly between 200 and 2500. The rest ones are leaved as the target results.

Note that it is difficult to find reasonable baselines to compare our approach to because the existing methods are only devised for the case of a single query. Since no approach can deal with multiple queries as a whole, we choose two representative subsequence matching algorithms, UCR Suite (Rakthanmanon

To make UCR-Suite (or SpADe) support multiple queries, we first find out top-

Accuracy comparisons under different variations.

(a) Length scaling |

(b) Amplitude scaling |

(c) Shape scaling |

Let

Efficiency comparisons under different variations.

(a) Length scaling |

(b) Amplitude scaling |

(c) Shape scaling |

In the first experiment, we compare our approach MQ with UCR-ED, UCR-DTW, and SpADe on synthetic datasets. Both accuracy and efficiency are tested. To compare these approaches extensively, we vary the parameter

We set the number of queries,

UCR-ED is omitted for its consistent inferiority to UCR-DTW.

For MQ, we set the only parameter, the size of the candidate setIn each set of experiments, we attempt to find out the top-40 subsequences in the time series. The accuracy is the ratio between the number of

The results are shown in Fig.

Comparison on real dataset.

(a) Accuracy vs. |

(b) Efficiency vs. |

Influence of parameter

(a) Accuracy vs. |

(b) Efficiency vs. |

In this experiment, we compare our approach with other ones as the number of query sequences

It can be seen that in Fig.

Figure

In this experiment, we investigate the influence of the parameter

Results are shown in Fig.

In this paper, we have proposed a novel subsequence matching approach, Multi-querying, to reflect the query intuition. Given multiple queries, we use a multi-dimensional probability distribution to represent them. Then, a breadth-first search algorithm is then applied to finding out the top-