1 Introduction
In recent years, the plummeting cost of sensors and storage devices has resulted in massive time series data being captured, and has substantially driven the need for time series analysis. Among the various analyses and applications, the problem of subsequence matching is of fundamental importance, serving as the foundation for many other data mining techniques, such as anomaly detection (Boniol and Palpanas, 2020; Boniol et al., 2020; Wang et al., 2022) and classification (Wang et al., 2018; Abanda et al., 2019; Iwana and Uchida, 2020; Boniol et al., 2022).
Specifically, given a long time series X, for any query series Q, the subsequence matching problem finds the K subsequences of X most similar to Q (top-K query), or finds the subsequences whose distance to Q falls within a threshold ε (range query). In the last two decades, plenty of works have been proposed for this problem. Most existing works find results based on a rigid distance measure, like Euclidean distance or Dynamic Time Warping. Among them are scanning-based approaches (Li et al., 2017; Rakthanmanon et al., 2012) and index-based ones (Linardi and Palpanas, 2018; Wu et al., 2019).
Another line of work defines the query in a more flexible way. Query-by-Sketch (Muthumanickam et al., 2016) and SpADe (Chen et al., 2007) approximate the query with a sequence of line segments, and find subsequences that can be approximated in a similar way.
Nevertheless, in many real applications, users are unable to accurately and clearly express their query intuition with a single query sequence. Specifically, users may have different strictness requirements for different parts of the query. We illustrate this with the following two examples.
Case study 1. In the field of wind power generation, the Extreme Operating Gust (EOG) (Hu et al., 2018) is a typical gust pattern, a phenomenon of dramatic changes in wind speed over a short period. Early detection of EOG can prevent damage to the turbine. A typical EOG pattern, as Q and ${S_{1}}$ in Fig. 1, has three physical phases, whose corresponding shape contains a slight decrease (1–100), followed by a steep rise and a steep drop (101–200), and a rise back to the original value (200–300). Users usually emphasize the steep increase/decrease in the second part, which means ${S_{1}}$ is preferred over ${S_{2}}$ in Fig. 1. However, if the analyst submits query Q, ${S_{2}}$ is more similar to Q under either ED or DTW distance.
Case study 2. While a high-speed train is in operation, a sensor continuously collects vibration data for monitoring. When the train passes by some source of interference, the sensor values increase sharply, and return to normal after some time, as Q, ${S_{1}}$ and ${S_{2}}$ in Fig. 2. However, if Q is issued as a query, subsequence ${S_{3}}$ is more similar to it, which is an unexpected result. By combining Q, ${S_{1}}$ and ${S_{2}}$, we can learn that the pattern occurrences may have variable durations and distinct amplitudes. The strictest constraint is that the pattern should include an almost upright rise and an almost upright fall.
The above examples clearly demonstrate the limitation of the single-query mechanism. Although a single query can express the shape the user is interested in, it cannot express the admissible extent of time shifting and amplitude scaling, nor the error range.
To solve this problem, in this paper we propose a multiple-query approach. Compared to a single query, submitting a small number of queries together can express the query intuition more accurately. Consider the example in Fig. 2: if users take Q, ${S_{1}}$ and ${S_{2}}$ as the query set, it indicates that the user is more tolerant in relation to the subsequence length and the value range, but less tolerant in relation to the slope of the increasing and decreasing parts. Moreover, submitting multiple queries is also a natural solution in real-world applications. For example, in the above case of train monitoring, analysts hope to find all interfered subsequences and then correct them. An analyst will go through a small part of the long sequence; once he/she comes across a few interfered subsequences, he/she can submit them together in order to find more instances.
We first propose a probability-based representation of the multiple queries. Then, we design a novel distance function to measure the similarity of a subsequence to the multiple queries. In the end, a breadth-first search algorithm is proposed to find the desired subsequences. To the best of our knowledge, this is the first work to study how to express the query intuition with multiple queries.
Our contributions can be summarized as follows:

• We analyse the problem of expressing the query intuition, and propose a multi-query mechanism to solve it.

• We present a probability-based representation of the multiple queries, as well as the corresponding distance between it and any subsequence.

• We present a breadthfirst search algorithm to efficiently search for similar subsequences.

• We conduct extensive experiments on both synthetic and real datasets to verify the effectiveness and efficiency of our approach.
The rest of the paper is organized as follows. Related work is reviewed in Section 2. Section 3 introduces definitions and notations. In Sections 4 and 5, we introduce our approach in detail. Section 6 presents an experimental study of our approach using synthetic and real datasets, and we offer conclusions in Section 7.
2 Related Work
In the last two decades, the problem of subsequence matching has been extensively studied. Existing approaches can be classified into two groups.
Fixed-Length Queries: Traditionally, to find similar subsequences, two representative distance measures are adopted: Euclidean distance (ED) and Dynamic Time Warping (DTW). ED computes the similarity by a one-to-one mapping, while DTW allows misalignment and thus supports time shifting. UCR Suite (Rakthanmanon et al., 2012) is a well-known approach that supports both ED and DTW for subsequence matching and proposes cascading lower bounds for DTW to accelerate the search. FAST (Li et al., 2017) builds on UCR Suite and proposes further lower bounds for the sake of efficiency. Both UCR Suite and FAST have to scan the whole time series to conduct distance computations. EBSM (Papapetrou et al., 2011), however, reduces the subsequence matching problem to a vector matching problem, and identifies candidate matches by nearest-neighbour search in the vector space. Also, some index-based approaches have been proposed for similarity search. Most of them build indexes based on summarizations of the data series (e.g. Piecewise Aggregate Approximation (PAA) (Keogh et al., 2001) or Symbolic Aggregate approXimation (SAX) (Shieh and Keogh, 2008)). Coconut (Kondylakis et al., 2018) overcomes the limitation that existing summarizations cannot be sorted while keeping similar data series close to each other, and proposes to organize data series based on a z-order curve. To further reduce the index creation time, adaptive indexing techniques, such as ADS (Zoumpatianos et al., 2016), have been proposed to iteratively refine an initial coarse index.
Variable-Length Queries: For variable-length queries, SpADe (Chen et al., 2007) proposes a continuous distance calculation approach, which is not sensitive to shifting and scaling in either the temporal or the amplitude dimension. It scans data series to get local patterns, and dynamically finds the shortest path among all local patterns as the distance between two sequences. Query-by-Sketch (Muthumanickam et al., 2016) proposes an interactive approach to explore user-sketched patterns. It extracts a shape grammar, a combination of basic elementary shapes, from the sketched series, and then applies a symbolic approximation based on regular expressions. To better satisfy the user, Eravci and Ferhatosmanoglu (2013) attempt to improve the search results by incorporating diversity in the results for relevance feedback. Relatively speaking, indexing for variable-length queries is more intractable. KV-match${_{DP}}$ (Wu et al., 2019) utilizes multiple varied-length indexes to support normalized subsequence matching under either ED or DTW distance. ULISSE (Linardi and Palpanas, 2018), by comparison, uses a single index to answer similarity search queries of variable length. It organizes the series and their summaries in a hierarchical tree structure called the ULISSE index.
In summary, to date, no existing work has attempted to express the query intuition via a multi-query mechanism.
3 Preliminaries
In this section, we begin by introducing all the necessary definitions and notations, followed by a formal problem statement.
3.1 Definition
In this work, we deal with time series. A time series $X=({x_{1}},{x_{2}},\dots ,{x_{N}})$ is an ordered sequence of real-valued numbers, where $N=|X|$ is the length of X. A subsequence S, ${S^{\prime }}$ or $X[i,j]=({x_{i}},{x_{i+1}},\dots ,{x_{j}})$ $(1\leqslant i\leqslant j\leqslant N)$ denotes the contiguous sequence of length $j-i+1$ starting from the ith position of X. Note that a subsequence is itself a time series.
Given a time series X, a query sequence Q, and a distance function D, the problem of subsequence matching is to find the top-K subsequences of X, denoted as $\mathbb{R}=\{{S_{1}},{S_{2}},\dots ,{S_{K}}\}$, which are most similar to Q.
The two representative distance measures are Euclidean Distance (ED) and Dynamic Time Warping (DTW). Formally, given two length-L sequences, S and ${S^{\prime }}$, their ED and DTW distances are computed as follows:
Definition 1.
Euclidean Distance: $\textit{ED}(S,{S^{\prime }})=\sqrt{{\textstyle\sum _{i=1}^{L}}{({s_{i}}-{s^{\prime }_{i}})^{2}}}$, where ${s_{i}}$ and ${s^{\prime }_{i}}$ are the values at the ith $(1\leqslant i\leqslant L)$ time stamp of S and ${S^{\prime }}$, respectively.
Definition 2.
Dynamic Time Warping:
\[\begin{aligned}{}& \textit{DTW}\big(\langle \rangle ,\langle \rangle \big)=0;\hspace{2em}\textit{DTW}\big(S,\langle \rangle \big)=\textit{DTW}\big(\langle \rangle ,{S^{\prime }}\big)=\infty ;\\ {} & \textit{DTW}\big(S,{S^{\prime }}\big)=\sqrt{{\big({s_{1}}-{s^{\prime }_{1}}\big)^{2}}+\min \left\{\begin{aligned}{}& \textit{DTW}\big(\textit{suf}(S),\textit{suf}\big({S^{\prime }}\big)\big),\\ {} & \textit{DTW}\big(S,\textit{suf}\big({S^{\prime }}\big)\big),\\ {} & \textit{DTW}\big(\textit{suf}(S),{S^{\prime }}\big),\end{aligned}\right.}\end{aligned}\]
where $\langle \rangle $ indicates the empty series and $\textit{suf}(S)=({s_{2}},\dots ,{s_{L}})$ is the suffix subsequence of S.

In this paper, instead of processing one single query, we attempt to find subsequences similar to multiple queries. That is, given a set of queries, $\mathbb{Q}=\{{Q_{1}},{Q_{2}},\dots ,{Q_{N}}\}$, our objective is to find the top-K subsequences similar to the queries in $\mathbb{Q}$, denoted as $\mathbb{R}$.
Since the query sequences vary in length, we do not impose a constraint on the length of the subsequences in $\mathbb{R}$. In this way, we find variable-length subsequences answering multiple queries, which is valuable in real time series applications.
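For reference, the two measures in Definitions 1 and 2 can be sketched in a few lines. This is an illustrative sketch, not part of the original work; DTW is written in its equivalent dynamic-programming form, accumulating squared costs and taking the square root once at the end.

```python
import math

def euclidean(s, t):
    """ED between two equal-length sequences (Definition 1)."""
    assert len(s) == len(t)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, t)))

def dtw(s, t):
    """DTW between two sequences (Definition 2), via dynamic programming."""
    n, m = len(s), len(t)
    INF = float("inf")
    # d[i][j] = accumulated squared cost of aligning prefixes s[:i], t[:j]
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (s[i - 1] - t[j - 1]) ** 2
            d[i][j] = cost + min(d[i - 1][j - 1], d[i][j - 1], d[i - 1][j])
    return math.sqrt(d[n][m])
```

Note how DTW can align one point of S with several points of ${S^{\prime }}$, which is exactly the time-shifting tolerance discussed above.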
4 Query Representation and Distance Definition
In this section, we first present the probabilitybased representation of the query set, and then propose a distance definition based on the representation.
4.1 Query Representation
In this paper, instead of processing the queries in $\mathbb{Q}$ independently, we first represent them in a unified form, a multi-dimensional probability distribution. Then we find target subsequences in X based on this representation.
In many real-world applications, a meaningful query sequence can be approximately represented as a sequence of line segments. Recall the EOG pattern in Fig. 1. The query sequence Q can be approximated with 4 line segments, which capture the most representative characteristics of Q.
In response, we propose a two-step approach to represent the bundle of queries together. In the first step, we represent each single query ${Q_{i}}$ in $\mathbb{Q}$ individually by a traditional segmentation, in which each segment is described by a set of features. Then, in the second step, we represent each feature as a Gaussian distribution over the values from the multiple queries.
4.1.1 Step One: Represent Each Single Query
In step one, we perform a traditional segmentation. We use a bottom-up approach to convert the query ${Q_{i}}=({q_{1}},{q_{2}},\dots ,{q_{|{Q_{i}}|}})$ into a piecewise linear representation, where ${q_{f}}$ is the value at the fth time stamp of ${Q_{i}}$ $(1\leqslant f\leqslant |{Q_{i}}|)$. Initially, we approximate ${Q_{i}}$ with $\big\lfloor \frac{|{Q_{i}}|}{2}\big\rfloor $ line segments. The jth line, ${H_{j}}$, connects ${q_{2j-1}}$ and ${q_{2j}}$. Next, we iteratively merge neighbouring lines: in each iteration, we merge the pair of neighbouring segments whose merged line segment has the minimal approximation error. The merging process repeats until m (a preset parameter) line segments remain. For each ${Q_{i}}$, we obtain its segmentation, denoted as $({Q_{i}^{1}},{Q_{i}^{2}},\dots ,{Q_{i}^{m}})$, and its linear representation, denoted as $({H_{i}^{1}},{H_{i}^{2}},\dots ,{H_{i}^{m}})$.
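As an illustration, the bottom-up procedure can be sketched as follows. The function names are ours, not the paper's; `fit_error` computes the least-squares residual of a segment, used here as the merging cost.

```python
def fit_error(points):
    """Sum of squared residuals of the least-squares line through
    (1, p1), ..., (l, pl)."""
    l = len(points)
    if l < 2:
        return 0.0
    sx = sum(points)
    six = sum((i + 1) * p for i, p in enumerate(points))
    si = l * (l + 1) / 2           # sum of 1..l
    sii = l * (l + 1) * (2 * l + 1) / 6  # sum of squares of 1..l
    theta = (l * six - si * sx) / (l * sii - si * si)
    b = (sx - theta * si) / l
    return sum((p - (theta * (i + 1) + b)) ** 2 for i, p in enumerate(points))

def bottom_up_segment(q, m):
    """Merge adjacent segments until only m remain; returns index ranges."""
    # initial fine-grained segmentation: pairs of consecutive points
    segs = [list(range(i, min(i + 2, len(q)))) for i in range(0, len(q), 2)]
    while len(segs) > m:
        # cost of merging each adjacent pair of segments
        costs = [fit_error([q[k] for k in segs[i] + segs[i + 1]])
                 for i in range(len(segs) - 1)]
        i = costs.index(min(costs))
        segs[i:i + 2] = [segs[i] + segs[i + 1]]
    return [(s[0], s[-1]) for s in segs]
```

For example, a series made of two linear pieces is split exactly at the breakpoint once only two segments remain.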
For each line segment ${H_{i}^{j}}$ ($1\leqslant i\leqslant N$ and $1\leqslant j\leqslant m$), we represent it as a 4-dimensional vector, ${f_{i}^{j}}=({l_{i}^{j}},{\theta _{i}^{j}},{v_{i}^{j}},{\varepsilon _{i}^{j}})$, which corresponds to the length, the slope, the value of the starting point, and the MSE error of ${H_{i}^{j}}$, respectively. As a result, the query sequence ${Q_{i}}$ is represented by a $4m$-dimensional vector, ${F_{i}}=({f_{i}^{1}},{f_{i}^{2}},\dots ,{f_{i}^{m}})$.
4.1.2 Step Two: Represent Multiple Queries
After obtaining the ${F_{i}}$'s ($1\leqslant i\leqslant N$), we can generate the uniform representation of the query set $\mathbb{Q}$, which is a multi-dimensional probability distribution. We first present the general form of the distribution, and then give our approach to generate the specific distribution for $\mathbb{Q}$.
Specifically, given query set $\mathbb{Q}$, its representation, denoted as ${P_{\mathbb{Q}}}$, consists of $4m$ individual Gaussian distributions, each of which corresponds to one feature in ${f_{i}^{j}}$. For each feature, we produce a Gaussian distribution to capture its latent semantics, determined by two parameters: the mean value and the standard deviation. The former encodes the ideal value of the feature, while the latter provides an elastic range.
Formally, we denote the representation of $\mathbb{Q}$ as ${P_{\mathbb{Q}}}=({P^{1}},{P^{2}},\dots ,{P^{m}})$, where ${P^{j}}=({p_{l}^{j}},{p_{\theta }^{j}},{p_{v}^{j}},{p_{\varepsilon }^{j}})$ corresponds to the jth line segments $({H_{1}^{j}},{H_{2}^{j}},\dots ,{H_{N}^{j}})$. Take the slope feature as an example: ${p_{\theta }^{j}}$ is the Gaussian density function of the slope of the jth segment, denoted as
(1)
\[ {p_{\theta }^{j}}(x)=\frac{1}{\sqrt{2\pi }{\sigma _{\theta }^{j}}}\exp \bigg(-\frac{{(x-{\mu _{\theta }^{j}})^{2}}}{2{({\sigma _{\theta }^{j}})^{2}}}\bigg),\]
where ${\mu _{\theta }^{j}}$ (or ${\sigma _{\theta }^{j}}$) is the mean value (or standard deviation) of the slope values of $({H_{1}^{j}},{H_{2}^{j}},\dots ,{H_{N}^{j}})$. Specifically, ${\mu _{\theta }^{j}}=\frac{{\textstyle\sum _{i=1}^{N}}{\theta _{i}^{j}}}{N}$ and ${\sigma _{\theta }^{j}}=\sqrt{\frac{1}{N}{\textstyle\sum _{i=1}^{N}}{({\theta _{i}^{j}}-{\mu _{\theta }^{j}})^{2}}}$. The mean value ${\mu _{\theta }^{j}}$ describes the slope the user prefers, and the standard deviation ${\sigma _{\theta }^{j}}$ represents how strictly the user stresses this feature: the smaller ${\sigma _{\theta }^{j}}$ is, the stricter the user's requirement.

Now we introduce our approach to generate ${P_{\mathbb{Q}}}$ for the query set $\mathbb{Q}$. We first collect the feature vectors $({l_{i}^{j}},{\theta _{i}^{j}},{v_{i}^{j}},{\varepsilon _{i}^{j}})$ that correspond to the jth line segment of each ${Q_{i}}$. To obtain the specific Gaussian distributions in ${P^{j}}=({p_{l}^{j}},{p_{\theta }^{j}},{p_{v}^{j}},{p_{\varepsilon }^{j}})$, we directly compute the mean value and the standard deviation of the feature values of $({H_{1}^{j}},{H_{2}^{j}},\dots ,{H_{N}^{j}})$. Then ${P_{\mathbb{Q}}}=({P^{1}},{P^{2}},\dots ,{P^{m}})$.
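As an illustrative sketch (names are ours), given the per-query feature vectors from step one, the representation ${P_{\mathbb{Q}}}$ reduces to per-segment, per-feature means and standard deviations:

```python
import math

def build_representation(features):
    """features[i][j] = (l, theta, v, eps) of the j-th segment of query Q_i.
    Returns P_Q as, for each segment j, a list of four (mu, sigma) pairs,
    one per feature."""
    n = len(features)        # number of queries N
    m = len(features[0])     # number of segments per query
    p_q = []
    for j in range(m):
        params = []
        for k in range(4):   # length, slope, start value, MSE error
            vals = [features[i][j][k] for i in range(n)]
            mu = sum(vals) / n
            sigma = math.sqrt(sum((v - mu) ** 2 for v in vals) / n)
            params.append((mu, sigma))
        p_q.append(params)
    return p_q
```

With two queries whose first segments have lengths 10 and 14, the length feature of segment 1 becomes a Gaussian with mean 12 and standard deviation 2; identical slopes yield a zero deviation, i.e. the strictest possible requirement.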
4.2 Distance Definition
Given that the query representation consists of $4m$ Gaussian distributions rather than a sequence, the existing distance measures, like ED and DTW, are inapplicable. In this paper, we propose a novel distance function $D(S,\mathbb{Q})$ based on the probability distribution.
Formally, to define the distance between subsequence S and the query representation ${P_{\mathbb{Q}}}$, we first approximate S with m line segments. We denote the segmentation as $\textit{seg}=({S^{1}},{S^{2}},\dots ,{S^{m}})$, where ${S^{j}}$ ($1\leqslant j\leqslant m$) is the jth segment of S. Note that a segment here refers to a subsequence rather than a line segment. We extract four features, the length ${l^{j}}$, the slope ${\theta ^{j}}$, the value of the starting point ${v^{j}}$, and the MSE error ${\varepsilon ^{j}}$, from the linear representation of ${S^{j}}$. For ease of presentation, we denote all the jth segments in $\mathbb{Q}$ as ${\mathbb{Q}^{j}}$, that is, ${\mathbb{Q}^{j}}=({Q_{1}^{j}},{Q_{2}^{j}},\dots ,{Q_{N}^{j}})$. Then the distance between ${S^{j}}$ and ${\mathbb{Q}^{j}}$ is
(2)
\[ \textit{dist}\big({S^{j}},{\mathbb{Q}^{j}}\big)=-\log \big({p_{l}^{j}}\big({l^{j}}\big){p_{\theta }^{j}}\big({\theta ^{j}}\big){p_{v}^{j}}\big({v^{j}}\big){p_{\varepsilon }^{j}}\big({\varepsilon ^{j}}\big)\big).\]
$\textit{dist}({S^{j}},{\mathbb{Q}^{j}})$ is the negative logarithm of the probability. The smaller the value, the more similar ${S^{j}}$ and ${\mathbb{Q}^{j}}$ are. Accordingly, under segmentation $\textit{seg}$, the distance between S and the query set $\mathbb{Q}$ can be computed as
(3)
\[ D(S,\mathbb{Q},\textit{seg})={\sum \limits_{j=1}^{m}}\textit{dist}\big({S^{j}},{\mathbb{Q}^{j}}\big).\]
Since S can be split by different segmentations, the value of $D(S,\mathbb{Q},\textit{seg})$ may differ. In this paper, we define the distance between S and $\mathbb{Q}$ as the minimal one among all possible segmentations, that is,
(4)
\[ D(S,\mathbb{Q})=\underset{\textit{seg}}{\min }D(S,\mathbb{Q},\textit{seg}).\]
5 Query Processing Approach
In this section, we introduce the search process. Obviously, it would be prohibitively expensive to find the best segmentation of every subsequence in the time series X. In response, we divide the search process into two phases:

1. Candidate generation. Given the submitted query set $\mathbb{Q}$, we utilize a Breadth-First Search (BFS) strategy to find at most ${n_{c}}$ candidates from X, denoted as $\textit{CS}$.

2. Post-processing. All subsequences in $\textit{CS}$ are verified and reordered by computing their actual distances. Moreover, we dismiss the trivial match subsequences.
5.1 BFS-Based Search Process
We first introduce our search strategy to generate the candidate set $\textit{CS}$ whose size does not exceed the parameter ${n_{c}}$. We adopt an iterative approach and generate the candidates segment by segment. In the first round, we generate at most ${n_{c}}$ candidates with only one segment. The candidate set is denoted as ${\textit{CS}_{1}}=\{c{s_{1}},c{s_{2}},\dots ,c{s_{{n_{c}}}}\}$. Each candidate, $c{s_{i}}$, is a triple $\langle {s_{i}},{e_{i}},{d_{i}}\rangle $, in which ${s_{i}}$ is its starting point and ${e_{i}}$ is its ending point, so $c{s_{i}}$ corresponds to the subsequence $X[{s_{i}},{e_{i}}]$. The third element ${d_{i}}$ is the distance $\textit{dist}(c{s_{i}},{\mathbb{Q}^{1}})$. All candidates in ${\textit{CS}_{1}}$ are ordered in ascending order of ${d_{i}}$. In other words, ${\textit{CS}_{1}}$ contains the ${n_{c}}$ subsequences with the smallest distance to ${\mathbb{Q}^{1}}$. We discuss how to select the top-${n_{c}}$ candidates in the next section.
In the second round, we obtain candidate set ${\textit{CS}_{2}}$ by extending the candidates in ${\textit{CS}_{1}}$ with the second segment. Specifically, given any candidate subsequence $cs=\langle s,e,d\rangle $ in ${\textit{CS}_{1}}$, if we extend $cs$ to $c{s^{\prime }}$ with a length-L segment, the new candidate $c{s^{\prime }}=\langle s,{e^{\prime }},{d^{\prime }}\rangle $ contains two segments: one corresponds to $X[s,e]$ and the other to $X[e+1,e+L]$. Note that the starting point s remains unchanged, while the new ending point ${e^{\prime }}$ becomes $e+L$. Also, the new distance ${d^{\prime }}$ is updated to $d+\textit{dist}(X[e+1,{e^{\prime }}],{\mathbb{Q}^{2}})$. Since each candidate $cs$ in ${\textit{CS}_{1}}$ can be extended to multiple candidates by concatenating $X[s,e]$ with segments of variable length, we generate all possible candidates, compute their distances, and add the top-${n_{c}}$ candidates into ${\textit{CS}_{2}}$.
After m rounds, we obtain the candidate set ${\textit{CS}_{m}}$, which is the final candidate set $\textit{CS}$. Now each candidate in $\textit{CS}$ consists of m segments.
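Under the notation above, the round-by-round extension can be sketched as follows. This is our illustration, not the paper's implementation: `segment_dist(s, e, j)` is a hypothetical helper standing for $\textit{dist}(X[s,e],{\mathbb{Q}^{j+1}})$ (0-based), `lengths[j]` stands for the admissible segment lengths, and each candidate set is kept as a size-bounded heap.

```python
import heapq

def bfs_candidates(x, m, n_c, lengths, segment_dist):
    """Generate candidates segment by segment; returns a list of (d, s, e)."""
    # round 1: enumerate all starting points and admissible first lengths
    cs = []
    for s in range(len(x)):
        for l in lengths[0]:
            e = s + l - 1
            if e >= len(x):
                continue
            d = segment_dist(s, e, 0)
            heapq.heappush(cs, (-d, s, e))   # max-heap on distance
            if len(cs) > n_c:
                heapq.heappop(cs)            # drop the worst candidate
    # rounds 2..m: extend each survivor with the next segment
    for j in range(1, m):
        nxt = []
        for neg_d, s, e in cs:
            for l in lengths[j]:
                e2 = e + l
                if e2 >= len(x):
                    continue
                d2 = -neg_d + segment_dist(e + 1, e2, j)
                heapq.heappush(nxt, (-d2, s, e2))
                if len(nxt) > n_c:
                    heapq.heappop(nxt)
        cs = nxt
    return sorted((-nd, s, e) for nd, s, e in cs)
```

The bounded heap keeps candidate maintenance at $O(\log {n_{c}})$ per push, matching the complexity analysis in Section 5.5.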
5.2 Candidate Generation
Now we introduce our approach to generate possible candidates in each round.
In the first round, we enumerate subsequences $X[s,e]$ from all possible starting points, that is, we try all s's within $[1,n]$. To avoid low-quality candidates, we only select subsequences whose length satisfies the $3\sigma $ standard. Formally, given any starting point s, the ending point e must satisfy $e-s+1\in [{\mu _{l}^{1}}-3{\sigma _{l}^{1}},{\mu _{l}^{1}}+3{\sigma _{l}^{1}}]$. For each candidate subsequence $X[s,e]$, we compute the optimal linear approximation, $y=\theta \cdot x+b$, as well as the MSE error ε, as follows:
(5)
\[ \left\{\begin{array}{l}\theta =\displaystyle \frac{12\textstyle\sum ix-6(l+1)\textstyle\sum x}{l(l+1)(l-1)},\hspace{1em}\\ {} b=\displaystyle \frac{6\textstyle\sum ix-2(2l+1)\textstyle\sum x}{l(1-l)},\hspace{1em}\\ {} \varepsilon =\displaystyle \sum {x^{2}}+{\theta ^{2}}\displaystyle \sum {i^{2}}+l{b^{2}}-2\theta \displaystyle \sum ix-2b\displaystyle \sum x+2\theta b\displaystyle \sum i,\hspace{1em}\end{array}\right.\]
where $l=e-s+1$ is the length of $X[s,e]$. After that, we compute $\textit{dist}(X[s,e],{\mathbb{Q}^{1}})$. Also, if any other feature of a candidate (θ, v or ε) violates the $3\sigma $ standard, we discard this candidate. While enumerating different candidates, we maintain ${\textit{CS}_{1}}$ as a priority queue to keep the top-${n_{c}}$ candidates.

In the jth round ($2\leqslant j\leqslant m$), we generate candidates by extending previous ones in ${\textit{CS}_{j-1}}$. For each candidate $cs=\langle s,e,d\rangle $ in ${\textit{CS}_{j-1}}$, we try all possible segments next to $X[s,e]$ whose length falls within $[{\mu _{l}^{j}}-3{\sigma _{l}^{j}},{\mu _{l}^{j}}+3{\sigma _{l}^{j}}]$. Similar to the first round, we also discard the candidates violating the $3\sigma $ standard. When extending candidate $cs=\langle s,e,d\rangle $ to $c{s^{\prime }}=\langle s,{e^{\prime }},{d^{\prime }}\rangle $, besides the new ending point ${e^{\prime }}$, we also update the distance as ${d^{\prime }}=d+\textit{dist}(X[e+1,{e^{\prime }}],{\mathbb{Q}^{j}})$.
5.3 Post-Processing
Note that the subsequences in $\textit{CS}$ are not approximated optimally. Hence, an additional refinement step has to be performed, in which subsequences are verified and reordered using the optimal segmentation. Specifically, we fetch the subsequences in $\textit{CS}$, approximate each subsequence $cs$ with m line segments via a dynamic programming algorithm, and thus get the actual distance $D(cs,\mathbb{Q})$ under the new segmentation.
The objective of the segmentation is to minimize the distance between $cs$ and $\mathbb{Q}$. We search for the optimal segmentation sequentially from left to right on $cs$. We define $E(i,j)$ ($1\leqslant i\leqslant m$ and $1\leqslant j\leqslant |cs|$) to be the minimal distance between the length-j prefix of $cs$ (i.e. $cs[1,j]$) and the prefix of $\mathbb{Q}$ with i segments (i.e. $[{\mathbb{Q}^{1}},{\mathbb{Q}^{2}},\dots ,{\mathbb{Q}^{i}}]$). We begin by initializing $E(1,j)$ to be $\textit{dist}(cs[1,j],{\mathbb{Q}^{1}})$. When computing $E(i,j)$ $(2\leqslant i\leqslant m)$, we consider all the possible segmentations of $cs[1,k]$ $(i\leqslant k\leqslant j)$ with $i-1$ segments, compare the sum of $E(i-1,k)$ and $\textit{dist}(cs[k,j],{\mathbb{Q}^{i}})$, and define $E(i,j)$ to be the minimal one. Formally, the dynamic programming equation is presented as follows:
(6)
\[ E(i,j)=\left\{\begin{array}{l@{\hskip4.0pt}l}\textit{dist}(cs[1,j],{\mathbb{Q}^{1}}),\hspace{1em}& i=1,\\ {} +\infty ,\hspace{1em}& i\gt j,\\ {} \underset{i\leqslant k\leqslant j}{\min }\big(E(i-1,k)+\textit{dist}\big(cs[k,j],{\mathbb{Q}^{i}}\big)\big),\hspace{1em}& \text{otherwise}.\end{array}\right.\]

We recompute the distance between each subsequence in $\textit{CS}$ and $\mathbb{Q}$. Before reordering, we have to dismiss the trivial match subsequences, since candidates can overlap with each other. Formally, given two candidates $cs=X[s,e]$ and $c{s^{\prime }}=X[{s^{\prime }},{e^{\prime }}]$, if their overlapping ratio, $\frac{|cs\cap c{s^{\prime }}|}{|cs\cup c{s^{\prime }}|}$, exceeds $\min (0.8,\frac{\min |Q|}{\max |Q|})$, where the latter term is the ratio between the minimum and maximum query length in $\mathbb{Q}$, we dismiss the subsequence with the larger distance.

Afterwards, we simply sort the remaining subsequences in $\textit{CS}$ according to $D(cs,\mathbb{Q})$ and select the K smallest as the final results $\mathbb{R}$.
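A direct transcription of this dynamic program reads as follows. This is an illustrative sketch with 0-based indices and non-overlapping split points; `seg_dist(a, b, i)` is a hypothetical helper standing for the per-segment distance against the $(i+1)$th query segment.

```python
def optimal_distance(cs, m, seg_dist):
    """Minimal D(cs, Q) over all segmentations of cs into m segments.
    seg_dist(a, b, i) = dist(cs[a..b], Q^(i+1)), 0-based inclusive indices."""
    n = len(cs)
    INF = float("inf")
    # E[i][j]: minimal distance matching cs[0..j] with the first i+1 segments
    E = [[INF] * n for _ in range(m)]
    for j in range(n):
        E[0][j] = seg_dist(0, j, 0)
    for i in range(1, m):
        for j in range(i, n):           # entries with i > j stay +infinity
            E[i][j] = min(E[i - 1][k - 1] + seg_dist(k, j, i)
                          for k in range(i, j + 1))
    return E[m - 1][n - 1]
```

Each candidate costs $O(m|cs{|^{2}})$ to refine, which is where the $O({n_{c}}m|cs{|^{2}})$ term in Section 5.5 comes from.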
5.4 Optimization
In this section, we propose two optimization strategies to further accelerate the search process.
5.4.1 Basic-Aggregate-Based Linear Representation Computation
In the search process, for each candidate, we adopt linear regression to find the best line segment $y=\theta \cdot x+b$ in a least-squares sense using Eq. (5). It is noteworthy that θ, b, and ε can each be computed from three basic aggregates, $\textstyle\sum x$, $\textstyle\sum {x^{2}}$ and $\textstyle\sum ix$, at cost $O(1)$, as proposed in Wasay et al. (2017). As long as we maintain three in-memory arrays storing these basic aggregates over the time series X, linear regression can be conducted in $O(1)$ time for any subsequence of X.
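A sketch of how such an index might look (our illustration of the idea): three prefix-sum arrays are filled in one $O(n)$ scan, after which Eq. (5) is evaluated for any window in $O(1)$.

```python
class AggregateIndex:
    """Prefix sums of x, x^2 and (global position)*x over a series X,
    enabling O(1) least-squares fits of any window (cf. Eq. (5))."""
    def __init__(self, x):
        n = len(x)
        self.sx = [0.0] * (n + 1)
        self.sx2 = [0.0] * (n + 1)
        self.six = [0.0] * (n + 1)
        for i, v in enumerate(x):
            self.sx[i + 1] = self.sx[i] + v
            self.sx2[i + 1] = self.sx2[i] + v * v
            self.six[i + 1] = self.six[i] + (i + 1) * v  # global position

    def fit(self, s, e):
        """Least-squares line over X[s..e] (1-based, inclusive, l >= 2):
        returns (theta, b, eps) as in Eq. (5)."""
        l = e - s + 1
        sx = self.sx[e] - self.sx[s - 1]
        sx2 = self.sx2[e] - self.sx2[s - 1]
        # shift global positions to the local range 1..l
        six = (self.six[e] - self.six[s - 1]) - (s - 1) * sx
        theta = (12 * six - 6 * (l + 1) * sx) / (l * (l + 1) * (l - 1))
        b = (6 * six - 2 * (2 * l + 1) * sx) / (l * (1 - l))
        si = l * (l + 1) / 2
        si2 = l * (l + 1) * (2 * l + 1) / 6
        eps = (sx2 + theta * theta * si2 + l * b * b
               - 2 * theta * six - 2 * b * sx + 2 * theta * b * si)
        return theta, b, eps
```

For a perfectly linear window such as $(2,4,6,8)$, `fit` recovers slope 2 with zero residual, regardless of where the window starts.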
However, for extremely long sequences, the storage overhead of these arrays becomes unaffordable. As a consequence, we propose to split the time series into several blocks, conduct subsequence matching within each block sequentially, and merge the results in the end. This approach reduces memory consumption while remaining accurate and efficient.
5.4.2 Adjusting the Searching Order
Up to now, we have generated the candidates segment by segment sequentially. However, the standard deviations of the feature values differ across the segments of $\mathbb{Q}$, which results in search spaces of different sizes. For example, suppose one segment of $\mathbb{Q}$ has a length standard deviation of ${\sigma _{l}}=5$, while another has ${\sigma _{l}}=10$. According to the $3\sigma $ standard, the former admits $5\cdot 3\cdot 2+1=31$ candidate lengths, while the latter admits 61. Obviously, we should start the search from the former. Specifically, we perform the search process in an optimized order: we first consider the segment whose standard deviation of the length feature is the smallest, and then expand to its neighbouring segments. Suppose there are 3 sets of line segments in $\mathbb{Q}$, i.e. $m=3$, and their standard deviations of the length feature are $2,3,1$ respectively, that is, ${\sigma _{l}^{1}}=2$, ${\sigma _{l}^{2}}=3$, ${\sigma _{l}^{3}}=1$. Then, in the search process, we generate candidates from right to left sequentially.
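One way to realize this ordering is the following greedy sketch. It is our illustration: the text only specifies starting from the segment with the smallest length deviation and expanding to its neighbours, so the tie-breaking is an assumption.

```python
def visit_order(sigmas):
    """Order segments starting from the one with the smallest length-sigma,
    then repeatedly take whichever adjacent neighbour has the smaller sigma."""
    m = len(sigmas)
    start = min(range(m), key=lambda i: sigmas[i])
    order = [start]
    lo = hi = start                  # contiguous range visited so far
    while len(order) < m:
        cand = []
        if lo > 0:
            cand.append(lo - 1)      # neighbour on the left
        if hi < m - 1:
            cand.append(hi + 1)      # neighbour on the right
        nxt = min(cand, key=lambda i: sigmas[i])
        order.append(nxt)
        lo, hi = min(lo, nxt), max(hi, nxt)
    return order
```

For the deviations $(2,3,1)$ from the example above, this yields the right-to-left order described in the text.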
5.5 Complexity Analysis
The overall process of our approach can be divided into two phases: query representation and query processing.
Before the two phases, we have to scan the whole time series X once to store the basic aggregates. Its time cost is $O(n)$, where n is the length of X.
Then, in the first phase, we scan all the queries and perform traditional piecewise linear approximations. Thanks to the basic aggregates, the time cost of conducting linear regression is negligible. This phase is then $O(N|Q{|^{2}})$ in time complexity, where N is the number of queries. For ease of presentation, although the queries can vary in length, we denote $|Q|$ as the length of the queries.
In the second phase, we first let ${w_{i}}=6{\sigma _{l}^{i}}+1$, where ${\sigma _{l}^{i}}$ is the standard deviation of the length values of $({H_{1}^{i}},{H_{2}^{i}},\dots ,{H_{N}^{i}})$. Suppose we generate candidates from left to right sequentially; when considering the first segment, it requires $O({w_{1}}n\log ({n_{c}}))$ time to maintain the candidate set. For each following ith segment, the time complexity is $O({w_{i}}{n_{c}}\log ({n_{c}}))$. In the post-processing phase, we have to verify and compute the actual distance between $\mathbb{Q}$ and every subsequence $cs$ in the candidate set. It takes $O({n_{c}}m|cs{|^{2}})$ time to conduct the dynamic programming algorithm for the ${n_{c}}$ candidates, where m is the number of line segments. Moreover, the reordering process is at most $O({n_{c}}\log {n_{c}})$ in time complexity.
Since ${w_{i}}$ is a constant, ${n_{c}}$ is proportional to n, and $|cs|$ and $|Q|$ are much smaller than n, the overall time complexity of our approach is $O(n\log ({n_{c}}))$.
6 Experiments
In this section, we conduct extensive experiments to verify the effectiveness and efficiency of our approach. All experiments are run on a PC with an Intel Core i7-7700K CPU (8 cores @ 4.2 GHz) and 16 GB RAM.
6.1 Datasets
6.1.1 Synthetic Datasets
We generate the synthetic sequence as follows. First, we generate a long random-walk sequence T. Then we embed in T some meaningful pattern instances, of which some are taken as queries and others as target results.
We generate two types of patterns, the W-wave from the Adiac dataset in the UCR archive (Dau et al., 2019), and the backward wave, common in stock series. For each pattern, we first generate a seed instance and add some noise following a Gaussian distribution $N(0,0.05)$, as shown in Fig. 3. Then we modify them to generate more instances. Specifically, we adopt three types of variations, on length, amplitude, and shape, as follows.
Note that the three types of variations will influence different features in the line segments. We list them respectively in Table 1.

∙ Length: It is obvious that the W-wave in Fig. 3(a) can be split into four segments. For the middle two segments, we change the segment length with a scaling factor λ by inserting or deleting some data points. In other words, for a length-l segment, the length of the new segment is $l\cdot \lambda $.

∙ Amplitude: Similar to the case of length scaling, we increase or decrease the amplitude of the middle two segments in the W-wave. We still use a factor λ to control the extent. Specifically, every value v in the new segment changes to $v\cdot \lambda $.

∙ Shape: For the backward wave shown in Fig. 3(b), we change the global shape of the pattern by modifying the last two segments on both length and amplitude. To be more specific, we change the segment length from l to $l\cdot \lambda $, and the data values from v to $v\cdot \lambda $.
For each type of variation, we test our approach under different extents. Take the length variation as an example. To generate a dataset, we first set a parameter r, which determines the length variation range. Specifically, given r, we can only set the length scaling factor λ within the range $[1/(1+r),1+r]$. Obviously, the larger the value of r, the larger the length scaling extent.
For each variation, given a fixed r, we pick 50 values from the range $[1/(1+r),1+r]$ as values of λ. We then generate the 50 corresponding pattern instances, 40 of which are randomly planted into the long random walk time series. The remaining 10 instances form the query set $\mathbb{Q}$. A summary of the synthetic datasets is presented in Table 2.
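A sketch of this assembly step, under the stated range $[1/(1+r),1+r]$ (the uniform sampling of λ and the additive embedding are our own assumptions; the paper does not specify either detail):

```python
import numpy as np

def sample_lambdas(r, k=50, seed=0):
    """Draw k scaling factors uniformly from [1/(1+r), 1+r]."""
    rng = np.random.default_rng(seed)
    return rng.uniform(1.0 / (1.0 + r), 1.0 + r, k)

def plant(series, instance, pos):
    """Embed a pattern instance into the series at offset pos
    (here additively; replacement would be an equally plausible choice)."""
    out = np.asarray(series, dtype=float).copy()
    out[pos:pos + len(instance)] += instance
    return out
```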
6.1.2 Real Dataset
The real dataset is a train monitoring dataset: a time series of total length 15 million collected by a vibration sensor. It contains more than 100 interference subsequences that vary in length and amplitude, with lengths ranging from 200 to 2500. Accordingly, we maintain a query set of size 15 whose lengths are distributed roughly uniformly between 200 and 2500. The remaining instances are left as the target results.
6.2 Counterpart Approaches
Note that it is difficult to find reasonable baselines for our approach, because existing methods are designed for the case of a single query. Since no existing approach can deal with multiple queries as a whole, we choose two representative subsequence matching algorithms, UCR Suite (Rakthanmanon et al., 2012) and SpADe (Chen et al., 2007), and extend them to handle multiple queries. UCR Suite finds the best normalized subsequences and supports both Euclidean distance and Dynamic Time Warping (UCR-ED and UCR-DTW for short). SpADe finds the shortest path within several local patterns, and is able to handle shifting and scaling in both the temporal and amplitude dimensions.
To make UCR Suite (or SpADe) support multiple queries, we first find the top-K similar subsequences for each query using UCR Suite (or SpADe). We then sort these $K\cdot N$ subsequences in ascending order of their distances (normalized by subsequence length) and pick the top-K ones, excluding any trivial results, as the final answer. For UCR Suite, we utilize both ED and DTW as the distance, denoted UCR-ED and UCR-DTW, respectively.
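The merging step for the extended baselines might look as follows (a sketch; we assume each per-query result is a `(start, length, distance)` tuple and treat "trivial results" as candidates overlapping an already selected one, which is the usual convention):

```python
def merge_topk(per_query_results, K):
    """Merge per-query top-K lists into one global top-K.

    per_query_results: one list per query, each containing
    (start, length, distance) tuples. Distances are normalized by
    subsequence length, the pooled K*N candidates are sorted ascending,
    and overlapping (trivial) matches are skipped.
    """
    pooled = [(d / l, s, l) for results in per_query_results
              for (s, l, d) in results]
    pooled.sort()
    final, taken = [], []
    for nd, s, l in pooled:
        # skip trivial matches: candidates overlapping an already chosen one
        if any(s < ts + tl and ts < s + l for ts, tl in taken):
            continue
        final.append((s, l, nd))
        taken.append((s, l))
        if len(final) == K:
            break
    return final
```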
Let N be the number of queries in $\mathbb{Q}$; we denote our approach and the three competitors as MQ$_{N}$, UCR-ED$_{N}$, UCR-DTW$_{N}$, and SpADe$_{N}$, respectively. Note that the three rivals have to scan the time series N times. For fairness, we do not count I/O time when comparing efficiency.
6.3 Results on Synthetic Datasets
In the first experiment, we compare our approach MQ with UCR-ED, UCR-DTW, and SpADe on the synthetic datasets, testing both accuracy and efficiency. To compare these approaches extensively, we vary the parameter r for all three variations, length, amplitude, and shape. Specifically, for the length variation, we vary the maximal scaling factor r from 0.25 to 0.7 with step 0.05. For the amplitude variation, we vary r from 0.2 to 2 with step 0.2. For the shape variation, we vary r from 0.05 to 0.5 with step 0.05.
We set the number of queries, N, to 5 and 10, respectively. The corresponding experimental results are denoted by MQ$_{5}$, MQ$_{10}$, UCR-DTW$_{5}$, UCR-DTW$_{10}$, SpADe$_{5}$, and SpADe$_{10}$. For MQ, we set its only parameter, the size of the candidate set ${n_{c}}$, to $0.05\cdot n$, where n is the length of the time series X.
In each set of experiments, we retrieve the top-40 subsequences in the time series. The accuracy is the ratio of the number of correct subsequences to 40. A subsequence S is correct if the overlap ratio between S and some planted instance exceeds the tolerance parameter $\epsilon =\min \big(0.8,\frac{\min (Q)}{\max (Q)}\big)$.
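The accuracy computation can be sketched as follows (illustrative only; the paper does not define the overlap ratio precisely, so we assume the common choice of dividing the overlap length by the longer of the two subsequences):

```python
def overlap_ratio(a_start, a_len, b_start, b_len):
    """Overlap length divided by the longer of the two subsequences
    (one plausible definition; the paper leaves this unspecified)."""
    inter = min(a_start + a_len, b_start + b_len) - max(a_start, b_start)
    return max(0, inter) / max(a_len, b_len)

def accuracy(found, planted, eps):
    """Fraction of found subsequences whose overlap ratio with some
    planted instance exceeds the tolerance eps.

    found, planted: lists of (start, length) pairs.
    """
    correct = sum(1 for f in found
                  if any(overlap_ratio(*f, *p) > eps for p in planted))
    return correct / len(found)
```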
The results are shown in Fig. 4 and Fig. 5, respectively. MQ outperforms UCR-DTW and SpADe in both accuracy and efficiency under all variations. The reason is that MQ summarizes the common characteristics of multiple queries, while the other approaches can only find subsequences similar to one given query. Consequently, when the number of query sequences N increases, UCR-DTW and SpADe yield better results. Nevertheless, UCR-DTW$_{10}$ (or SpADe$_{10}$) is on average twice as slow as UCR-DTW$_{5}$ (or SpADe$_{5}$): with more query sequences provided, UCR-DTW and SpADe find more satisfying subsequences, but at the cost of efficiency. MQ, in contrast, captures the latent semantics and thus demonstrates its superiority in both accuracy and efficiency under different variations. The only exception is in Fig. 4(c), where the accuracy of MQ drops sharply at $r=0.4$, mainly because excessive shape scaling makes what users really want ambiguous.
6.4 Results on Real Dataset
In this experiment, we compare our approach with the others on the real dataset as the number of query sequences N varies. We pick queries from the query set of size 15 so that their lengths are distributed uniformly. Note that regardless of the value of N, we retrieve the top-100 subsequences from the real time series. The results are shown in Fig. 6.
As shown in Fig. 6(a), the accuracy of MQ, UCR-ED, and UCR-DTW increases with N, while the accuracy of SpADe stays consistently low. The reason is that SpADe extracts local patterns using a fixed-size sliding window and consequently fails to capture the query intuition. Moreover, it is noteworthy that when $N=6$, the accuracy of MQ already exceeds 0.9, meaning that MQ can find what users really want with only a small query set.
Figure 6(b) compares MQ with the other approaches on efficiency. MQ clearly outperforms all the others, and its running time is insensitive to N, since MQ summarizes all the queries and searches the time series only once. Due to the specific pattern shapes and its lack of abundant pruning strategies, UCR-ED is inferior to UCR-DTW in terms of efficiency on this dataset.
6.5 Influence of Parameter ${n_{c}}$
In this experiment, we investigate the influence of the parameter ${n_{c}}$, the size of the candidate set. On the real dataset, we vary the ratio ${n_{c}}/n$ from 0.01 to 0.1, and set the number of queries N to 3, 6, and 9, respectively.
Results are shown in Fig. 7. As ${n_{c}}$ grows, the accuracy of MQ increases while the efficiency decreases. This is because, once the query set is fixed, enlarging the candidate set (i.e., increasing ${n_{c}}$) maintains more candidates at the cost of more running time. We also find that MQ already achieves satisfying results when ${n_{c}}=0.05\cdot n$, so we set ${n_{c}}$ to this default in the previous experiments. Generally, as shown in Fig. 7(a), the accuracy of MQ increases as the number of queries N increases. The only exception is when ${n_{c}}=0.01\cdot n$: the standard deviation of the query feature values grows with N, resulting in a larger search space; the small candidate set then fails to maintain enough candidates and thus achieves poor results. Meanwhile, the enlarged search space causes the running time to increase proportionally, as shown in Fig. 7(b).
7 Conclusions
In this paper, we have proposed a novel subsequence matching approach, Multi-querying, to reflect the query intuition. Given multiple queries, we represent them with a multi-dimensional probability distribution. A breadth-first search algorithm is then applied to find the top-K most similar subsequences. Extensive experiments demonstrate that Multi-querying outperforms state-of-the-art algorithms in terms of both accuracy and efficiency. To the best of our knowledge, this is the first study to introduce the concept of multiple queries to express the query intuition.