3.1 Key Frame Extraction Algorithm (KKF)
A drawback of unsupervised key-frame-based clustering algorithms is that they rely solely on the Euclidean distance between skeleton joints when performing clustering. Pose similarity may group frames that are far apart in time, disrupting the inherent temporal continuity of actions and leading to misjudgments in HAR. Many current algorithms that do maintain the temporal stability of actions face a further limitation: they either fail to achieve scale-consistent key frames or introduce random errors during scale normalization, typically by deleting or inserting frames at random, which risks losing informative frames and adding low-saliency ones. To address these issues and establish network inputs with a unified scale without compromising input quality, we propose the KKF algorithm.

Fig. 2
Fifteen joints of the human skeleton, used as original points for feature extraction.
The proposed KKF algorithm aims to select the K key frames that best represent the motion trend of the entire action. Inspired by previous experiments (Zhang et al., 2023), we employ the concepts of inter-frame difference and cluster mean frame. As indicated in Fig. 2, we take 15 skeleton joints as an example, and the position data of these joints in each frame are represented as a 45-dimensional feature vector. In all the following equations, $i$ represents the frame number of the human behavioural skeleton and $n$ represents the $n$-th skeletal joint point of each frame; the position information of each frame ${F_{i}}$ can be expressed as:
\[ {F_{i}}=\big[{x_{i}^{1}},{y_{i}^{1}},{z_{i}^{1}},{x_{i}^{2}},{y_{i}^{2}},{z_{i}^{2}},\dots ,{x_{i}^{15}},{y_{i}^{15}},{z_{i}^{15}}\big].\]

Fig. 3
Joint angles are used for our proposed method. Each joint is used to construct our KDJMM.
Although inter-frame difference methods effectively capture motion magnitude, they exhibit limitations in detecting subtle kinematic patterns such as hand clapping. This paper therefore introduces a hybrid method that combines joint-angle and joint-position features, improving key frame extraction accuracy, especially in fine-grained action recognition tasks. As shown in Fig. 3, we select the angle of each joint and map it across the three planes using the calculation methods in formulas (16) and (17). The angle information of each frame ${A_{i}}$ can also be represented as a 45-dimensional vector:
\[ {A_{i}}=\big[{\theta _{i,xy}^{1}},{\theta _{i,yz}^{1}},{\theta _{i,zx}^{1}},\dots ,{\theta _{i,xy}^{15}},{\theta _{i,yz}^{15}},{\theta _{i,zx}^{15}}\big].\]
When extracting key frames, we apply min-max normalization to standardize the scale of both angle and position features to a
$[0,1]$ range. This process ensures compatibility between features that originally have divergent value ranges (e.g., joint angles in radians vs. displacement in meters), allowing their joint utilization in the unified key frame selection metric.
The normalized data of each frame are summed to form a 90-dimensional vector ${W_{i}}$ for calculating the key frame algorithm index.
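As an illustrative sketch (function and variable names are ours, not from the paper), the min-max normalization and fusion of the two 45-dimensional feature sets into the 90-dimensional vectors ${W_{i}}$ could be implemented as follows:

```python
import numpy as np

def build_frame_vectors(positions, angles):
    """Build the 90-D per-frame vectors W_i from position and angle features.

    positions: (T, 45) array of flattened joint coordinates F_i
    angles:    (T, 45) array of joint-angle features A_i
    Min-max normalization maps each feature dimension to [0, 1] so that
    radians and meters can be combined in one selection metric.
    """
    def min_max(x):
        lo, hi = x.min(axis=0), x.max(axis=0)
        span = np.where(hi - lo > 0, hi - lo, 1.0)  # guard constant dims
        return (x - lo) / span

    # Concatenate the normalized 45-D position and 45-D angle features.
    return np.hstack([min_max(positions), min_max(angles)])  # (T, 90)
```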
The inter-frame difference $({V_{i,j}})$ and frame-to-CMF distance (${V_{i,{c_{j}}}}$) can be expressed as
\[ {V_{i,j}}={\sum \limits_{n=1}^{90}}\big|{W_{i}^{n}}-{W_{j}^{n}}\big|,\]
\[ {V_{i,{c_{j}}}}={\sum \limits_{n=1}^{90}}\big|{W_{i}^{n}}-{\textit{CMF}_{{c_{j}}}^{n}}\big|,\]
where ${W_{i}^{n}}$ represents the $n$-th dimension of the normalized vector and ${\textit{CMF}_{{c_{j}}}^{n}}$ represents the $n$-th dimension of ${\textit{CMF}_{{c_{j}}}}$.
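Assuming an absolute-difference metric over the 90 normalized dimensions (the exact metric is our reading of the surrounding definitions), the two quantities can be sketched as:

```python
import numpy as np

def inter_frame_difference(W, i, j):
    # Comprehensive difference between frames i and j over all 90 dims;
    # an absolute-difference (L1) metric is assumed here.
    return np.abs(W[i] - W[j]).sum()

def frame_to_cmf_distance(W, i, cmf):
    # Distance between frame i and a cluster mean frame (CMF) vector.
    return np.abs(W[i] - cmf).sum()
```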
This indicates that a larger V value corresponds to a greater comprehensive difference between frames. The core objective of our KKF algorithm is to cluster frames with high spatio-temporal similarity into one category, while assigning frames with significant spatio-temporal differences to distinct clusters. To ensure temporal stability, the KKF algorithm first divides all frames evenly into K initialization intervals in temporal order to form K clusters $C=\{{c_{1}},{c_{2}},\dots ,{c_{k-1}},{c_{k}}\}$. The first and last frames of each initialization interval are selected to form two sets: the left boundary set $\{{l_{{c_{1}}}},{l_{{c_{2}}}},\dots ,{l_{{c_{k-1}}}},{l_{{c_{k}}}}\}$ and the right boundary set $\{{r_{{c_{1}}}},{r_{{c_{2}}}},\dots ,{r_{{c_{k-1}}}},{r_{{c_{k}}}}\}$. This initialization satisfies the requirement of significant spatial differences between clusters and reduces the time consumed by iteration.
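A minimal sketch of this temporal initialization (names are ours) might be:

```python
import numpy as np

def init_clusters(num_frames, K):
    """Split frame indices 0..num_frames-1 into K contiguous temporal
    intervals and return (left_boundaries, right_boundaries)."""
    edges = np.linspace(0, num_frames, K + 1, dtype=int)
    lefts = edges[:-1]       # l_{c_1..c_K}: first frame of each interval
    rights = edges[1:] - 1   # r_{c_1..c_K}: last frame of each interval
    return lefts, rights
```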
The concept of a cluster mean frame (CMF) is proposed and defined in equations (7) and (8):
\[ {N_{{c_{i}}}^{n}}=\frac{1}{{r_{{c_{i}}}}-{l_{{c_{i}}}}+1}{\sum \limits_{j={l_{{c_{i}}}}}^{{r_{{c_{i}}}}}}{W_{j}^{n}},\]
\[ {\textit{CMF}_{{c_{i}}}}=\big[{N_{{c_{i}}}^{1}},{N_{{c_{i}}}^{2}},\dots ,{N_{{c_{i}}}^{90}}\big],\]
where ${l_{{c_{i}}}}$ and ${r_{{c_{i}}}}$ are the left and right boundaries of cluster ${c_{i}}$, and ${N_{{c_{i}}}^{n}}$ is the $n$-th dimension of the $\textit{CMF}$ of cluster ${c_{i}}$. The CMF is derived from the per-dimension mean of all frames in cluster ${c_{i}}$ and is the vector that best comprehensively represents the spatial information of the cluster's frames. The CMF facilitates computation of inter-cluster and frame-to-cluster dissimilarities so that the selected frames optimally characterize their clusters, preserving the holistic action semantics to the greatest extent. To operationalize this principle, we iteratively refine the attribution of the boundary frames of adjacent clusters by recomputing the CMF of each cluster, balancing the spatial information of each cluster and avoiding cross-time compression. Finally, after the cluster boundaries converge, we select the frame closest to each cluster's CMF as the key information frame. In this way, we extract the key frames that represent human behaviour. The pseudo-code of the key frame selection algorithm is given in Algorithm 1 (i.e. KKF).

Algorithm 1
Skeleton key frame extraction (KKF)
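The following Python sketch illustrates only our reading of the procedure: temporal initialization, boundary refinement against the CMFs, and selection of the frame closest to each CMF. The one-directional refinement rule shown here is a simplification for illustration, not the exact algorithm.

```python
import numpy as np

def kkf(W, K, n_iters=10):
    """Sketch of KKF key-frame selection over normalized frame vectors.

    W: (T, 90) normalized per-frame vectors; returns K key-frame indices.
    """
    T = W.shape[0]
    edges = np.linspace(0, T, K + 1, dtype=int)  # temporal initialization

    for _ in range(n_iters):
        # Cluster mean frame (CMF): per-dimension mean of each cluster.
        cmfs = [W[edges[k]:edges[k + 1]].mean(axis=0) for k in range(K)]
        moved = False
        # Refine only boundary frames between adjacent clusters, so the
        # temporal order of frames is never broken.
        for k in range(K - 1):
            b = edges[k + 1]  # first frame of cluster k+1
            left_d = np.abs(W[b] - cmfs[k]).sum()
            right_d = np.abs(W[b] - cmfs[k + 1]).sum()
            # Reassign the boundary frame to the closer cluster, keeping
            # at least one frame in cluster k+1.
            if left_d < right_d and edges[k + 2] - edges[k + 1] > 1:
                edges[k + 1] += 1
                moved = True
        if not moved:
            break

    # Key frame of each cluster: the frame closest to its CMF.
    keys = []
    for k in range(K):
        seg = W[edges[k]:edges[k + 1]]
        cmf = seg.mean(axis=0)
        keys.append(edges[k] + int(np.abs(seg - cmf).sum(axis=1).argmin()))
    return keys
```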
The proposed method optimizes the trade-off between accuracy and efficiency by dynamically weighting kinematic metrics. During human action analysis tasks, there exist some classic key frames predefined by human kinematics. These moments manifest as: 1) peak kinematic discontinuity, and 2) maximum spatial deviation. Such frames are rich in dynamic information and crucial for understanding the whole action sequence. We introduce two functions, $S(x)$ and $L(x)$, to capture them. $S(x)$ denotes the short-term motion difference between consecutive frames, so the peak kinematic discontinuity is the frame with the most extensive variation from the previous frame; it is formulated as
\[ S(x)={\sum \limits_{n=1}^{90}}\big|{W_{x}^{n}}-{W_{x-1}^{n}}\big|.\]
In contrast, $L(x)$ denotes the long-term spatial deviation of frame $x$ relative to the initial frame, so the maximum spatial deviation is the frame with the most extensive variation from the first frame; it is defined as
\[ L(x)={\sum \limits_{n=1}^{90}}\big|{W_{x}^{n}}-{W_{1}^{n}}\big|.\]
Taking the action of waving the hand twice as an example, the change curves of $S(x)$ and $L(x)$ are visualized in Fig. 4. When a peak is detected in a curve, we directly set that cluster’s CMF value to the value of the special frame.

Fig. 4
Ten key frames extracted by angle and displacement, taking the action waving as an example; $S(x)$ and $L(x)$ curves; the red triangles mark the special key frames.
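The two curves and their peaks can be located as in the following sketch (the L1 metric and the strict local-maximum test are our assumptions):

```python
import numpy as np

def special_frames(W):
    """Locate candidate 'special' frames: peaks of S (kinematic
    discontinuity) and of L (maximum spatial deviation)."""
    T = W.shape[0]
    S = np.abs(W[1:] - W[:-1]).sum(axis=1)  # S(x): frame x vs. frame x-1
    S = np.concatenate([[0.0], S])          # S undefined for the first frame
    L = np.abs(W - W[0]).sum(axis=1)        # L(x): frame x vs. first frame

    def peaks(v):
        # Local maxima: strictly greater than both neighbours.
        return [x for x in range(1, T - 1) if v[x] > v[x - 1] and v[x] > v[x + 1]]

    return peaks(S), peaks(L)
```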
3.2 Key Dense Joint Motion Matrix (KDJMM)
In our previous work (Yao et al., 2022), the DJMM solved the problem of motion-information loss when joint motion trajectories overlap. Inspired by this approach, we use the key frames obtained by the KKF algorithm to extract features such as displacement, angle, velocity, and angular velocity of the corresponding skeleton. The feature calculations are based on relative coordinates, which enhances viewpoint robustness. Finally, the features are arranged into a uniform-scale KDJMM, allowing us to better capture subtle movement changes.
The construction of the KDJMM is shown in Fig. 5. Each frame contains information on 15 3D skeletal joints, and 19 features covering position, displacement, velocity, angle, and angular velocity are extracted for each joint, as demonstrated in Fig. 5(a). The constructed K 2D matrices are then stacked in the order of the extracted K key frames to form a temporally stable, uniform-scale KDJMM, as shown in Fig. 5(b). The details of the feature calculation procedure are given below:

Fig. 5
KDJMM (a) Extracting feature information from a key frame. (b) Extracting skeletal joint information from K key frames to construct the KDJMM.
3.2.1 Relative Coordinate Feature
Relative coordinates are less sensitive to changes in viewing angle than absolute coordinates. Therefore, the features we extract are based on relative coordinates. Meanwhile, to enhance the fine-grained description of human action, we introduce a variety of geometric features, including inter-joint angles, velocities, displacements, and other features, as well as their mapping in the three planes.
The physical significance of the position information of each joint is that it provides the most basic spatial information about human behaviour. The most basic relative coordinates are calculated as follows:
\[ {x_{il}^{n}}={x_{i}^{n}}-{x_{1}^{1}},\hspace{1em}{y_{il}^{n}}={y_{i}^{n}}-{y_{1}^{1}},\hspace{1em}{z_{il}^{n}}={z_{i}^{n}}-{z_{1}^{1}},\]
where ${x_{il}^{n}}$ denotes the calculated relative $x$ coordinate of the $n$-th joint of the $i$-th frame, ${x_{i}^{n}}$ is the original coordinate of the $n$-th joint of the $i$-th frame, and ${x_{1}^{1}}$ denotes the original coordinate of the first joint of the first frame. The relative positions of all other joints are calculated similarly, yielding three feature values for each joint.
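The same conversion, vectorized over a whole sequence (the array shapes are our assumption), might look like:

```python
import numpy as np

def relative_coordinates(joints):
    """Convert absolute joint coordinates to relative ones.

    joints: (T, 15, 3) array; every coordinate is expressed relative to
    the first joint of the first frame, as described in the text.
    """
    origin = joints[0, 0]  # (x_1^1, y_1^1, z_1^1)
    return joints - origin
```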
3.2.2 Displacement Feature
The physical significance of the total displacement of each joint from the first frame is that it provides a long-range temporal dependence for observing human body movement from a global perspective. To obtain a more detailed and comprehensive picture of the change, we also map it onto the three planes $(x-y,y-z,z-x)$. The three-plane mapping of each joint and the calculation of the total displacement are given below:
\[ D{x_{i}^{n}}={x_{il}^{n}}-{x_{1l}^{n}}.\]
Here $D{x_{i}^{n}}$ represents the displacement in the X-plane between the $n$-th joint of the $i$-th frame and the corresponding joint of the first frame, and ${x_{il}^{n}}$ represents the relative position of the $n$-th joint of the $i$-th frame. The $D{y_{i}^{n}}$ and $D{z_{i}^{n}}$ are calculated in the same way.
The total displacement is calculated as follows:
\[ {D_{i}^{n}}=\sqrt{{\big(D{x_{i}^{n}}\big)^{2}}+{\big(D{y_{i}^{n}}\big)^{2}}+{\big(D{z_{i}^{n}}\big)^{2}}},\]
where ${D_{i}^{n}}$ denotes the total displacement of the $n$-th joint of the $i$-th frame from the corresponding joint of the first frame. The four feature values for each joint are obtained with formulas (12)–(13).
3.2.3 Velocity Feature
In a physical sense, the intensity of displacement change per unit time of each skeletal joint is its velocity. The eigenvalues of the joint velocity components in the $x$, $y$, and $z$ directions are computed as follows:
\[ V{x_{i}^{n}}={x_{il}^{n}}-{x_{i-1l}^{n}},\hspace{1em}V{y_{i}^{n}}={y_{il}^{n}}-{y_{i-1l}^{n}},\hspace{1em}V{z_{i}^{n}}={z_{il}^{n}}-{z_{i-1l}^{n}},\]
where $V{x_{i}^{n}}$, $V{y_{i}^{n}}$, and $V{z_{i}^{n}}$ denote the three-plane mapping values of the velocity of the $n$-th joint in the $i$-th frame, and ${x_{i-1l}^{n}}$, ${y_{i-1l}^{n}}$, and ${z_{i-1l}^{n}}$ denote the three-dimensional relative coordinates of the $n$-th joint in the $(i-1)$-th frame. For the first frame, the three mapping values are defined to be zero.
The velocity feature is calculated as follows:
\[ {V_{i}^{n}}=\sqrt{{\big(V{x_{i}^{n}}\big)^{2}}+{\big(V{y_{i}^{n}}\big)^{2}}+{\big(V{z_{i}^{n}}\big)^{2}}}.\]
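The corresponding velocity computation, with the first frame's velocity fixed to zero as stated above, can be sketched as (unit frame time is assumed):

```python
import numpy as np

def velocity_features(rel):
    """Per-joint velocity components and magnitude between consecutive frames.

    rel: (T, 15, 3) relative coordinates. Returns (T, 15, 4) features:
    (Vx, Vy, Vz, V); the first frame's velocity is defined as zero.
    """
    v = np.zeros_like(rel)
    v[1:] = rel[1:] - rel[:-1]  # difference of consecutive relative coords
    mag = np.linalg.norm(v, axis=-1, keepdims=True)
    return np.concatenate([v, mag], axis=-1)
```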
3.2.4 Angle Feature
The physical meaning of an angle is that it provides a geometric feature of the human body structure and represents the geometric relationship between neighbouring joints. In our definition, an angle is composed of three joints, and we preferentially define two neighbouring joints as forming the angle of the intermediate joint. The angle combinations are (J2, J1, J3), (J1, J2, J3), (J2, J3, J1), (J2, J4, J5), (J4, J5, J6), (J5, J6, J3), (J2, J7, J8), (J7, J8, J9), (J8, J9, J3), (J1, J10, J11), (J10, J11, J12), (J11, J12, J3), (J1, J13, J14), (J13, J14, J15), (J14, J15, J3), where each combination represents the angle of the joint at its intermediate position. Taking (A, B, C) as an example, we calculate the angle between the $\overrightarrow{\mathrm{BA}}$ vector and the $\overrightarrow{\mathrm{BC}}$ vector as the angle feature of joint B. The vectors $\overrightarrow{\mathrm{BA}}$ and $\overrightarrow{\mathrm{BC}}$ are obtained by subtracting the coordinates of joint B from those of joints A and C, respectively.
The angle is calculated as follows:
\[ {\theta _{B}}={\cos ^{-1}}\bigg(\frac{\overrightarrow{\mathrm{BA}}\cdot \overrightarrow{\mathrm{BC}}}{\| \overrightarrow{\mathrm{BA}}\| \times \| \overrightarrow{\mathrm{BC}}\| }\bigg),\]
where ${\theta _{B}}$ is the angular feature of joint B: the dot product of $\overrightarrow{\mathrm{BA}}$ and $\overrightarrow{\mathrm{BC}}$ is divided by the product of their magnitudes $\| \overrightarrow{\mathrm{BA}}\| \times \| \overrightarrow{\mathrm{BC}}\| $ to obtain the cosine value, which the inverse cosine function ${\cos ^{-1}}$ converts to radians. We obtain the three-plane angular features by taking pairwise combinations of the coordinates of the $\overrightarrow{\mathrm{BA}}$ and $\overrightarrow{\mathrm{BC}}$ vectors and applying the same angular formula. The change in angle per unit time for each joint and its mapping on the three planes are also computed, giving three further mapping values per joint. In the end, 19 features are calculated for each joint in the proposed method. The main features focus on displacement and angle; we choose 15 joints and finally form a KDJMM of size $K\times 15\times 19$. Our framework enables task-adaptive kinematic feature computation (displacement, velocity, direction, acceleration) through configurable joint selection. Specifically for gesture recognition, incorporating detailed hand joints (e.g. metacarpophalangeal) improves recognition accuracy compared to using only major limb joints.
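A sketch of the angle computation for one triplet, including the three-plane mappings obtained from pairwise coordinate combinations (the handling of projections is our assumption, and degenerate zero-length projections are assumed not to occur):

```python
import numpy as np

def joint_angle(A, B, C):
    """Angle (radians) at joint B formed by the triplet (A, B, C)."""
    ba, bc = np.asarray(A) - B, np.asarray(C) - B
    cos = np.dot(ba, bc) / (np.linalg.norm(ba) * np.linalg.norm(bc))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))  # clip guards rounding

def plane_angles(A, B, C):
    """Three-plane mappings: the same angle computed from the
    x-y, y-z, and z-x coordinate pairs."""
    pairs = [(0, 1), (1, 2), (2, 0)]
    return [joint_angle(np.asarray(A)[list(p)], np.asarray(B)[list(p)],
                        np.asarray(C)[list(p)]) for p in pairs]
```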
3.4 Data Augmentation Strategy
Human action characterization must account for inherent biological variability. Consider two individuals executing identical motions, one standing 1.8 meters tall, the other 1.6 meters. Their height disparity introduces measurement artifacts. Furthermore, inter-subject kinematic variations manifest in three domains: temporal (movement pacing discrepancies), spatial (viewpoint-dependent morphological distortions), and directional (trajectory orientation effects). These observations motivate our multi-modal augmentation protocol incorporating camera perspective adjustments, anthropometric scaling, and velocity profile warping.
Considering that our feature construction is based on relative coordinates, the KDJMM is unaffected by viewpoint changes. In this experiment, we follow our previous work (Yao et al., 2020). Body size is controlled by multiplying all coordinates by a body size coefficient (BS), and the speed of movement is controlled by a velocity coefficient (VS). We keep the deviation of both coefficients from 1.0 at around 0.2: BS = $(0.8,0.85,0.9,0.95,1.0,1.05,1.10,1.15,1.2)$ and VS = $(0.8,0.9,1.0,1.1,1.2,1.3)$. By combining BS and VS, the data set is enlarged up to 54 times its original size. This effectively increases data diversity, helping the model accelerate training and learn a more comprehensive feature representation, thus improving classification or detection accuracy.
The proper setting of VS and BS coefficients is crucial. Taking a 1.7-meter-tall individual as an example, when BS = 1.5 is applied, their height would be scaled to 2.55 meters, which clearly exceeds normal human body dimensions. Similarly, excessively large VS coefficients would distort the speed characteristics of actions like walking and running, making it difficult to distinguish these basic motion patterns. We recommend maintaining these ranges: 0.8 ⩽ BS ⩽ 1.2 and 0.8 ⩽ VS ⩽ 1.3.
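A sketch of the augmentation loop; the nearest-frame resampling used for the VS time warping is our assumption:

```python
import numpy as np

def augment(seq, bs_list=(0.8, 0.85, 0.9, 0.95, 1.0, 1.05, 1.10, 1.15, 1.2),
            vs_list=(0.8, 0.9, 1.0, 1.1, 1.2, 1.3)):
    """Generate BS x VS augmented copies of one skeleton sequence.

    seq: (T, J, 3) coordinates. BS scales the coordinates (body size);
    VS warps the time axis (speed) via nearest-frame resampling.
    """
    out = []
    T = seq.shape[0]
    for vs in vs_list:
        # A higher VS means a faster action, i.e. fewer sampled frames.
        idx = np.clip(np.round(np.arange(0, T, vs)).astype(int), 0, T - 1)
        warped = seq[idx]
        for bs in bs_list:
            out.append(bs * warped)
    return out  # 6 x 9 = 54 sequences
```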