1 Introduction
An important trend in the development of mathematical models simulating various real-world applications deals with new types of nonlocal boundary and conjugation conditions. The nonlocality mechanism helps to simulate important physical processes more accurately. Still, the existence, uniqueness and stability of the solution are very sensitive to the approximation of such nonlocal boundary conditions and require the development and analysis of special discrete approximation techniques and modifications of implementation algorithms. As one important example we mention the simulation of enzyme-catalysed glucose oxidation and redox reactions over inhomogeneous surfaces (Skakauskas et al., 2018; Čiegis et al., 2018).
It is a standard situation in applications of mathematical modelling that systems of nonstationary 3D differential equations must be solved. Even if state-of-the-art high-order discrete approximations are constructed on adaptive meshes, the total complexity of the resulting computational algorithms is very large. This restricts the possibility of including such models in general optimization routines or of using them in real-time decision-making applications.
Recently, an important new technique has been proposed to solve such applied problems more efficiently. A model reduction technique is applied to avoid the effects of the curse of dimensionality. The given 3D heat conduction problem is reduced to a hybrid-dimension problem, keeping the full dimension only in some parts of the domain and reducing the equation to a one-dimensional one in the subdomains located at some distance from the base regions. Such mathematical models are typical for industrial installations such as pipelines.
Our aim is to add two improvements to this methodology. First, an economical ADI-type finite volume scheme is constructed to solve the non-classical heat conduction problem. It is proved that the ADI scheme is unconditionally stable. Second, a parallel factorization algorithm is proposed to solve the obtained systems of discrete equations. Due to both modifications the run-time of the computations is reduced substantially.
The method of partial dimension reduction was first introduced in Panasenko (1998) and then developed in Panasenko (2005). It is based on the asymptotic analysis of the solution of a partial differential equation set in a thin domain and on a projection of the variational formulation of the model onto a subspace of functions having the form of the asymptotic expansion in all zones of regular behaviour of the solution. The dimension of the problem is then reduced within the larger part of the domain, keeping the full-dimension description only in small zones of singular behaviour of the solution.
Applying this approach, a model with nonlocal junctions of one-dimensional and full-dimensional equations is obtained. Note that the nonlocal conjugation conditions pose the main challenge in the development of effective parallel numerical methods for this class of hybrid-dimension problems. Let us mention some papers where different techniques were used to discretize the obtained hybrid mathematical model. A finite volume setting was studied in Panasenko and Viallon (2015). The method of asymptotic partial decomposition of the domain was compared with several existing methods of dimension reduction in Amar and Givoli (2018), and its effectiveness with respect to precision and execution time was shown. In a recent paper (Čiegis et al., 2020) an ADI-type finite volume scheme was considered for a 2D heat conduction problem.
However, the question of the most effective parallel numerical solvers for such hybrid-dimension problems is still an important topic. Non-classical conjugation conditions require the development of special solvers to compute the discrete solutions. A straightforward way is to solve the obtained systems of linear equations by using general iterative solvers, e.g. the well-known multigrid solvers. For multidimensional parabolic problems an alternative and very popular approach is to use splitting methods (Hundsdorfer and Verwer, 2003; Samarskii, 2001). It is well known that in many cases ADI-type methods are the best choice.
Thus, in this paper we develop and analyse special efficient ADI schemes to solve the hybrid 3D partial dimension reduction models for a nonstationary heat conduction problem. Our second goal is to propose efficient parallel implementations of the constructed ADI schemes. All main topics required in the analysis of discrete schemes, i.e. the approximation, positivity and stability of the solution, are investigated. The accuracy of the reduced dimension model and the scalability of the parallel algorithms are tested in a series of computational experiments.
The approximation error is investigated quite straightforwardly, since the regularity of the solution and its dependence on the reduction parameter δ are known from the results given in Amosov and Panasenko (2018) and Panasenko (2005). A more complicated step is to investigate the stability of the proposed parallel ADI schemes in the case of the new nonlocal conjugation conditions.
The rest of the paper is organized in the following way. In Section 2 the problem is formulated: the nonstationary heat conduction equation is posed in a 3D parallelepiped, and the initial and boundary conditions are specified. In this section the partially reduced dimension model is also constructed, and an approximate solution is defined by considering a simplified domain in which small 3D subregions are coupled with a 1D line. The nonlocal conjugation conditions are defined at the truncations of the tube. In Section 3 the full problem in the 3D domain is solved by using the ADI stability correction scheme. It is proved that this scheme is unconditionally stable. Its implementation requires solving many linear systems with tridiagonal matrices, for which the classical factorization algorithm is used; the Wang parallel factorization algorithm can also be used on parallel computers. In Section 4 the modified ADI stability correction scheme is constructed for the reduced dimension heat conduction problem. It is proved that a unique solution of the obtained system of linear equations exists at each time level. An efficient factorization algorithm is defined; it is based on the algorithms presented in Čiegis et al. (2020). A parallel version of the modified factorization algorithm for the implementation of the ADI scheme approximating the reduced dimension heat conduction problem is constructed in Section 5, and a theoretical scalability analysis of this parallel algorithm is carried out. In Section 6 the results of computational experiments are presented and analysed. Some final conclusions are given in Section 7.
2 Problem Formulation
Let us consider a parallelepiped $\Omega \subset {\mathbb{R}^{3}}$, which has the following form $\Omega =(0,{X_{1}})\times D$, where D is a rectangle $D=\{({x_{2}},{x_{3}}):0<{x_{j}}<{X_{j}},\hspace{0.2778em}j=2,3\}$.
We are interested in solving a linear heat equation in
$\Omega \times (0,T]$:
where
$\partial {\Omega _{1}}=\partial \Omega \setminus (D\times \{{x_{1}}=0\}\cup \{{x_{1}}={X_{1}}\})$ denotes a part of the boundary, where the Neumann boundary condition is specified.
Next, following Amosov and Panasenko (
2018), we consider a modified approximate problem. First, let
$S(u)$
denote the averaging operator. We assume that the initial condition
${u^{0}}$ and source function
f satisfy the relations
This means that ${u^{0}}$ and f do not depend on ${x_{2}}$ and ${x_{3}}$ within Ω.
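For the reader's convenience, a sketch of the averaging operator and of these relations is given below; the normalization by the cross-section area ${X_{2}}{X_{3}}$ follows the usual convention and is our assumption here:
\[
S(u)({x_{1}},t)=\frac{1}{{X_{2}}{X_{3}}}{\int _{0}^{{X_{2}}}}{\int _{0}^{{X_{3}}}}u({x_{1}},{x_{2}},{x_{3}},t)\,d{x_{3}}\,d{x_{2}},\qquad S({u^{0}})={u^{0}},\quad S(f)=f.
\]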
Denote the reduced dimension domain ${\Omega _{\delta }}=\{({x_{1}},{x_{2}},{x_{3}})\in (\delta ,{X_{1}}-\delta )\times D\}$. A function U is called an approximate solution to problem (1)–(4) if it satisfies the following problem (Amosov and Panasenko, 2018)
where
$\partial {\Omega _{\delta ,1}}=\partial {\Omega _{\delta }}\setminus (D\times \{{x_{1}}=\delta \}\cup \{{x_{1}}={X_{1}}-\delta \})$.
In
${\Omega _{\delta }}\times (0,T]$ the solution
U does not depend on ${x_{2}}$ and ${x_{3}}$, i.e.
Thus, in the reduced domain it is sufficient to find the 1D function $U({x_{1}},t)$.
By analysing the weak form of the heat equation, it is shown in Amosov and Panasenko (
2018) that the following conjugation conditions are valid at the truncation points of the space domain
Conditions (10) are classical and mean that U is continuous at the truncation points, while the remaining two conditions (11) are nonlocal and define the conservation of the total fluxes across the separation planes.
3 ADI Scheme for the 3D Differential Problem
The uniform spatial mesh
${\bar{\Omega }_{h}}={\bar{\omega }_{1}}\times {\bar{\omega }_{2}}\times {\bar{\omega }_{3}}$ is defined as
For simplicity of notation we also consider a uniform time mesh:
Let
${U_{ijk}^{n}}$ be a numerical approximation to the exact solution
$u({x_{1i}},{x_{2j}},{x_{3k}},{t^{n}})$ of problem (
1)–(
4) at the grid point
$({x_{1i}},{x_{2j}},{x_{3k}},{t^{n}})$.
The following standard operators are defined for discrete functions:
The discrete approximation of diffusion operators is obtained by using the finite volume method.
Then the heat conduction problem (
1)–(
4) is approximated by the following alternating direction implicit (ADI) stability correction scheme (Hundsdorfer and Verwer,
2003)
At the first step the heat conduction problem (
1)–(
4) is approximated by the explicit Euler scheme. It is well known that this scheme is only conditionally stable. The next three steps of scheme (12) guarantee the unconditional stability of the discrete scheme and increase the accuracy of the approximation in time to second order.
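To make this description concrete, the following is a sketch of the standard stabilizing correction ADI form from Hundsdorfer and Verwer (2003), written in the fractional-step notation used later in the paper; the ordering of the directions and the placement of the source term ${f^{n}}$ are our assumptions and need not coincide with the exact form of scheme (12):
\[
\begin{aligned}
\frac{U^{n+\frac{1}{4}}-U^{n}}{\tau} &= -\big({A_{1}^{h}}+{A_{2}^{h}}+{A_{3}^{h}}\big)U^{n}+f^{n},\\
\frac{U^{n+\frac{2}{4}}-U^{n+\frac{1}{4}}}{\tau} &= -\tfrac{1}{2}{A_{3}^{h}}\big(U^{n+\frac{2}{4}}-U^{n}\big),\\
\frac{U^{n+\frac{3}{4}}-U^{n+\frac{2}{4}}}{\tau} &= -\tfrac{1}{2}{A_{2}^{h}}\big(U^{n+\frac{3}{4}}-U^{n}\big),\\
\frac{U^{n+1}-U^{n+\frac{3}{4}}}{\tau} &= -\tfrac{1}{2}{A_{1}^{h}}\big(U^{n+1}-U^{n}\big).
\end{aligned}
\]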
At all three steps 3D systems are split into a set of 1D problems along ${x_{3}}$, ${x_{2}}$ and ${x_{1}}$ coordinates respectively, and each subproblem is solved by using the classical fast factorization algorithm.
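Each 1D subproblem leads to a tridiagonal linear system. For illustration, a minimal sketch of the classical factorization (Thomas) algorithm is given below for a generic system ${a_{i}}{U_{i-1}}+{c_{i}}{U_{i}}+{b_{i}}{U_{i+1}}={F_{i}}$; the notation and sign convention are ours and do not reproduce the concrete matrices arising from scheme (12):

```cpp
#include <vector>

// Classical factorization (Thomas) algorithm for a tridiagonal system
//   a[i]*U[i-1] + c[i]*U[i] + b[i]*U[i+1] = F[i],  i = 0..J-1,
// with a[0] = 0 and b[J-1] = 0.  O(J) operations per system.
std::vector<double> thomas(const std::vector<double>& a,
                           const std::vector<double>& c,
                           const std::vector<double>& b,
                           const std::vector<double>& F)
{
    const std::size_t J = F.size();
    std::vector<double> alpha(J), beta(J), U(J);

    // Forward sweep: eliminate the sub-diagonal,
    // expressing U[i] = alpha[i]*U[i+1] + beta[i].
    double denom = c[0];
    alpha[0] = -b[0] / denom;
    beta[0]  =  F[0] / denom;
    for (std::size_t i = 1; i < J; ++i) {
        denom    = c[i] + a[i] * alpha[i - 1];
        alpha[i] = -b[i] / denom;
        beta[i]  = (F[i] - a[i] * beta[i - 1]) / denom;
    }

    // Backward substitution.
    U[J - 1] = beta[J - 1];
    for (std::size_t i = J - 1; i-- > 0; )
        U[i] = alpha[i] * U[i + 1] + beta[i];

    return U;
}
```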
Lemma 1.
If a solution of the problem (
1)
–(
4)
is sufficiently smooth, then the approximation error of ADI scheme (
12)
is $O({\tau ^{2}}+{h_{1}^{2}}+{h_{2}^{2}}+{h_{3}^{2}})$.
Proof.
For self-completeness of the text, and in order to explain the approximation property of the ADI scheme (
12) we present a short proof of this result (for a detailed analysis see Hundsdorfer and Verwer,
2003). By adding all equations (
12) the following discrete equation is obtained
In order to compute the approximation error of (
12), we use the relations
Thus the approximation error of the ADI scheme (12) with respect to the time coordinate is the same as that of the classical Crank–Nicolson method. The second-order accuracy of the constructed approximations of the space derivatives is shown, e.g. in Hundsdorfer and Verwer (
2003), Samarskii (
2001). □
It is a straightforward task to show the result presented in the following lemma.
Lemma 2.
The discrete operators ${A_{1}^{h}}$, ${A_{2}^{h}}$ and ${A_{3}^{h}}$ are symmetric, ${A_{1}^{h}}$ is positive definite and ${A_{2}^{h}}$, ${A_{3}^{h}}$ are non-negative operators.
All operators
${A_{1}^{h}}$,
${A_{2}^{h}}$ and
${A_{3}^{h}}$ commute, thus there exists a complete system of functions
where
${\phi _{lm}}({x_{l}})$ are eigenvectors of the operator
${A_{l}^{h}}$,
$l=1,2,3$:
and
${\lambda _{lm}}$ are eigenvalues
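For completeness, we recall the well-known explicit form of these eigenpairs on uniform meshes, assuming Dirichlet conditions in the ${x_{1}}$ direction and Neumann conditions in the ${x_{2}}$, ${x_{3}}$ directions as in problem (1)–(4):
\[
{\phi _{1m}}({x_{1i}})=\sin \frac{m\pi {x_{1i}}}{{X_{1}}},\qquad {\lambda _{1m}}=\frac{4}{{h_{1}^{2}}}{\sin ^{2}}\frac{m\pi {h_{1}}}{2{X_{1}}},\qquad m=1,\dots ,{J_{1}}-1,
\]
\[
{\phi _{lm}}({x_{lj}})=\cos \frac{m\pi {x_{lj}}}{{X_{l}}},\qquad {\lambda _{lm}}=\frac{4}{{h_{l}^{2}}}{\sin ^{2}}\frac{m\pi {h_{l}}}{2{X_{l}}},\qquad m=0,\dots ,{J_{l}},\ l=2,3,
\]
so that ${\lambda _{1m}}>0$ and $0\leqslant {\lambda _{lm}}\leqslant 4/{h_{l}^{2}}$.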
Lemma 3.
ADI scheme (
12)
is unconditionally stable.
Proof.
For the given normal operators
${A_{l}^{h}}$,
$l=1,2,3$ the Fourier stability analysis can be used. Let us consider the solution of ADI scheme (
12) in the case when boundary conditions
${g_{j}}=0$,
$j=1,2$ and
$f(x,t)=0$. The solution of (
12) can be written as
Substituting this formula into equations (
12), we obtain for each mode the stability equations
where the discrete stability function
${q_{lmr}}$ is defined as
Since eigenvalues satisfy estimates (
14), then
$|{q_{lmr}}|<1$ and the ADI scheme (
12) is unconditionally stable in the
${L_{2}}$ norm. □
It is easy to see that the discrete stability function approximates the continuous stability function of the differential equation (
1) with second-order accuracy.
4 The ADI Scheme for the Partially Dimension Reduced Problem
Let
${K_{1}}$ and
${K_{2}}$ define the truncation points of the domain
${x_{1{K_{1}}}}=\delta $,
${x_{1{K_{2}}}}={X_{1}}-\delta $. Then the spatial mesh
${\omega _{1}}$ is split into three parts:
We note that more general meshes are used in Viallon (2013) and Panasenko and Viallon (2015), where some atypical cells along the interface are constructed, which leads to non-admissible meshes. In this paper we use the finite volume approach. For more general non-uniform meshes it can be recommended to use the general framework of finite element and finite difference schemes for the construction of discrete approximations on non-matching grids (Eymard et al., 2000; Faille, 1992). A numerical comparison of these methods for 1D-2D discrete schemes is given in Viallon (2013).
The heat conduction problem (
5)–(
11) is approximated by the modified version of the ADI scheme
Here
${S_{h}}$ denotes the discrete averaging operator:
where ${d_{lm}}=1$ if $0<m<{J_{l}}$ and ${d_{lm}}=0.5$ if $m=0,{J_{l}}$, $l=2,3$. The notation of the argument ${U_{i}^{n}}$ indicates that the two-dimensional average of the discrete function ${U^{n}}$ is computed at the grid point ${x_{1i}}$.
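A minimal C++ sketch of this discrete averaging over the cross-section is given below; the flattened array layout, the function name and the normalization by ${J_{2}}{J_{3}}$ are our illustrative choices:

```cpp
#include <vector>

// Discrete averaging operator S_h over the cross-section (x2, x3):
// trapezoidal weights d = 1 at interior nodes and d = 0.5 at boundary
// nodes, normalized so that S_h of a constant equals that constant.
// U is stored as U[j*(J3+1) + k] for the fixed index i (x1 = x1i).
double average_Sh(const std::vector<double>& U, int J2, int J3)
{
    double sum = 0.0;
    for (int j = 0; j <= J2; ++j) {
        const double d2 = (j == 0 || j == J2) ? 0.5 : 1.0;
        for (int k = 0; k <= J3; ++k) {
            const double d3 = (k == 0 || k == J3) ? 0.5 : 1.0;
            sum += d2 * d3 * U[j * (J3 + 1) + k];
        }
    }
    return sum / (double(J2) * double(J3));
}
```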
Note that equations (
19), (
20) approximate the nonlocal flux conjugation conditions (
11).
We define two discrete meshes ${\Omega _{h,RD}}=({\bar{\omega }_{2}}\times {\bar{\omega }_{3}}\times ({\omega _{11}}\cup {\omega _{13}}))\cup {\bar{\omega }_{12}}$ and ${\bar{\Omega }_{h,RD}}={\Omega _{h,RD}}\cup ({\bar{\omega }_{2}}\times {\bar{\omega }_{3}}\times ({x_{10}}\cup {x_{1{J_{1}}}}))$. Let us consider discrete functions ${U_{ijk}}=U({x_{1i}},{x_{2j}},{x_{3k}})$ defined on the spatial mesh ${\bar{\Omega }_{h,RD}}$. We denote the set of such vectors by ${D_{h}}$.
Let us define three operators for
$U\in {D_{h}}$:
Then we can write the ADI scheme as
Next, our goal is to prove that the discrete operators ${\mathcal{A}_{1}^{h}}$, ${\mathcal{A}_{2}^{h}}$ and ${\mathcal{A}_{3}^{h}}$ are symmetric.
For
$U,V\in {D_{h}}$, such that
${U_{0jk}}={U_{{J_{1}}jk}}=0$,
${V_{0jk}}={V_{{J_{1}}jk}}=0$,
$({x_{2}},{x_{3}})\in {\bar{\omega }_{2}}\times {\bar{\omega }_{3}}$ the formulas
define a scalar product and a norm in this vector space. Recall that ${d_{lm}}=1$ if $0<m<{J_{l}}$ and ${d_{lm}}=0.5$ if $m=0,{J_{l}}$, $l=2,3$.
Lemma 4.
The discrete operators ${\mathcal{A}_{1}^{h}}$, ${\mathcal{A}_{2}^{h}}$ and ${\mathcal{A}_{3}^{h}}$ are symmetric, ${\mathcal{A}_{1}^{h}}$ is positive definite and ${\mathcal{A}_{2}^{h}}$, ${\mathcal{A}_{3}^{h}}$ are non-negative operators.
Proof.
First, we investigate the operator
${\mathcal{A}_{2}^{h}}$. Applying the summation by parts formula, we get
In a similar way we get the estimates
It follows from the obtained estimates that
${\mathcal{A}_{2}^{h}}$ and
${\mathcal{A}_{3}^{h}}$ are symmetric and non-negative definite operators.
Next, we investigate the operator
${\mathcal{A}_{1}^{h}}$. Applying the summation by parts formula and taking into account the boundary conditions for vectors
U,
V, and the nonlocal conjugation conditions (
19), (
20) we get
Here we use the notation
${U_{{K_{1}}jk}}={U_{{K_{1}}00}}$,
${U_{{K_{2}}jk}}={U_{{K_{2}}00}}$ for
$({x_{2j}},{x_{3k}})\in {\bar{\omega }_{2}}\times {\bar{\omega }_{3}}$.
It follows from the obtained estimates that ${\mathcal{A}_{1}^{h}}$ is a symmetric and positive definite operator. □
Due to nonlocal conjugation conditions (
19), (
20) the classical factorization algorithm should be modified in order to solve 1D subproblems (
17)–(
20).
Lemma 5.
A unique solution of the linear system of equations (15)–(20) exists, and it can be computed by using an efficient factorization algorithm.
Proof.
The given proof also defines a constructive algorithm for solving (17)–(20). Since the first part (15) of the discrete ADI scheme defines an explicit algorithm, and the 1D subproblems (16) in the second part of the scheme are efficiently solved by using the classical factorization algorithm, it is sufficient to consider in detail subproblem (17)–(20). For each $({x_{2j}},{x_{3k}})\in {\bar{\omega }_{2}}\times {\bar{\omega }_{3}}$ the solution is factorized separately in the subdomains ${\omega _{11}}$, ${\omega _{12}}$ and ${\omega _{13}}$. Let us write the equations of system (17)–(20) as
1.
Domain ${\omega _{11}}$. The solution is presented in the following form:
By induction it can be proved that the estimates
$0\leqslant {\alpha _{ijk}}\leqslant 1$ are valid.
2.
Domain ${\omega _{12}}$. The solution is presented in the following form:
This factorization is done in two steps. First, the solution is written in the form
Let us assume that $0\leqslant {\tilde{\alpha }_{i-1,0,0}}$, ${\tilde{\beta }_{i-1,0,0}}\leqslant 1$ and ${\tilde{\alpha }_{i-1,0,0}}+{\tilde{\beta }_{i-1,0,0}}\leqslant 1$. Then it follows from the given formulas that
$0\leqslant {\tilde{\beta }_{i00}}\leqslant 1$ and
${\tilde{\alpha }_{i00}}\geqslant 0$. It remains to prove that
${\tilde{\alpha }_{i00}}\leqslant 1$ and
${\tilde{\alpha }_{i00}}+{\tilde{\beta }_{i00}}\leqslant 1$. It follows from simple inequalities
that
${a_{i00}}{\tilde{\alpha }_{i-1,00}}+{b_{i00}}\leqslant {c_{i00}}-{a_{i00}}{\tilde{\beta }_{i-1,0,0}}$. Thus we get the estimate
The proof by induction is completed.
Next, we compute coefficients
${\alpha _{i00}}$,
${\beta _{i00}}$ and
${\gamma _{i00}}$ in formula (
25)
From estimates for
${\tilde{\alpha }_{0k}}$,
${\tilde{\beta }_{0k}}$ and (
29) it follows that
3.
Domain ${\omega _{13}}$. For each
$({x_{2j}},{x_{3k}})\in {\omega _{2}}\times {\omega _{3}}$ the solution is presented in the following form:
By induction it can be proved that the estimates
$0\leqslant {\beta _{ijk}}\leqslant 1$ are valid.
Substituting (
23)–(
30) into equations (
19) and (
20) we get a linear system of two equations to find
${U_{{K_{1}}00}^{n+1}}$,
${U_{{K_{2}}00}^{n+1}}$:
where coefficients are defined by
From the estimates given above we have that
thus the determinant of the matrix of system (
31) is positive and the unique solution
${U_{{K_{1}}00}^{n+1}}$,
${U_{{K_{2}}00}^{n+1}}$ exists. Then the backward factorization step is applied and
${U_{ijk}^{n+1}}$ are computed. □
It is noted in Hundsdorfer and Verwer (2003) that the stability analysis of such ADI schemes in the three-dimensional case is far from easy and even the linear commuting case needs closer attention. Since in our case the operators ${\mathcal{A}_{1}^{h}}$, ${\mathcal{A}_{2}^{h}}$, ${\mathcal{A}_{3}^{h}}$ do not commute, we cannot apply the spectral stability analysis.
Here we restrict ourselves to writing down the stability matrix
It follows from Lemma
4 that
Then it is sufficient to assume that the discrete problem is such that the stability estimate
is satisfied. The validity of this assumption should be tested experimentally for each particular discrete problem.
5 Parallel Algorithm
In this section a parallel version of the ADI algorithm (15)–(20) is presented. We note that the development of efficient parallel algorithms for the implementation of ADI-type solvers is quite a non-trivial task. First, the factorization algorithm for 3D problems depends heavily on data movement in the memory of the computer (to and from the different levels of the memory hierarchy). Only a few arithmetical operations are done with each small portion of data, therefore the cost of data movement is a very important issue (see Guo and Lu, 2016; Otero et al., 2020). Second, the nonlocality of some parts of the discrete scheme leads to modifications of the standard factorization algorithm, and these changes should be handled efficiently by the parallel algorithm (see Čiegis et al., 2014, where a parallel ADI algorithm is constructed to solve multidimensional parabolic problems with nonlocal boundary conditions). Third, different parallelization approaches can be used for different architectures of parallel computers. At least three main classes can be considered, including GPU processors (Imankulov et al., 2021; Xue and Feng, 2018; Otero et al., 2020), shared memory (multicore) processors, and general distributed memory parallel machines, for which the implementation of parallel algorithms can be done, e.g. by using the MPI library (Čiegis et al., 2014). In this paper we restrict ourselves to the construction of parallel algorithms for general distributed memory computers. It is obvious that these algorithms can also be used efficiently on multicore computers and distributed clusters of multicore nodes.
First, in defining a parallel version of the ADI algorithm for the reduced dimension model we restrict ourselves to a simple modification of the standard sequential factorization algorithm. Only two parts are used to decompose the domain ${\Omega _{h}}$ in each dimension. Therefore the maximum number of parallel processes is restricted to eight.
We start by defining a parallel algorithm for the solution of a tridiagonal system of linear equations:
Let us assume that two processes are used to solve it in parallel. System (33) is partitioned between the two processes. The first process is responsible for the first $(J+1)/2$ equations, $0\leqslant j<(J+1)/2$, and the second process solves the remaining equations. Both processes implement the classical sequential factorization algorithm in parallel but in different directions. The first process applies this algorithm from left to right:
while the second process computes in the opposite direction:
Then both processes exchange the coefficients $({\alpha _{\frac{J+1}{2}-1}},{\beta _{\frac{J+1}{2}-1}})$, $({\alpha _{\frac{J+1}{2}}},{\beta _{\frac{J+1}{2}}})$, solve a $2\times 2$ system and find the solution values ${U_{\frac{J+1}{2}-1}}$, ${U_{\frac{J+1}{2}}}$. The second, backward part of the factorization algorithm is again implemented in parallel.
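The two-directional sweep just described can be summarized by the following serial sketch, in which the two halves play the roles of the two processes; the system notation $a_{j}{U_{j-1}}+c_{j}{U_{j}}+b_{j}{U_{j+1}}={F_{j}}$, the coefficient names p, q, r, s (standing for the two $(\alpha ,\beta )$ pairs) and the function name are our assumptions, and in the MPI version only the interface coefficients are exchanged before the $2\times 2$ solve:

```cpp
#include <vector>

// Two-directional (twisted) factorization for a tridiagonal system
//   a[j]*U[j-1] + c[j]*U[j] + b[j]*U[j+1] = F[j],  j = 0..J,
// with a[0] = 0, b[J] = 0.  The first half is eliminated from the left,
// the second half from the right; the two interface unknowns are found
// from a 2x2 system.  In the parallel version each half is processed by
// its own MPI process and only the interface coefficients are exchanged.
void twisted_solve(const std::vector<double>& a, const std::vector<double>& c,
                   const std::vector<double>& b, const std::vector<double>& F,
                   std::vector<double>& U)
{
    const int J = static_cast<int>(F.size()) - 1;
    const int h = (J + 1) / 2;                    // first index of the right half
    std::vector<double> p(J + 1), q(J + 1), r(J + 1), s(J + 1);

    // Process 0: left-to-right elimination, U[j] = p[j]*U[j+1] + q[j].
    p[0] = -b[0] / c[0];  q[0] = F[0] / c[0];
    for (int j = 1; j < h; ++j) {
        const double d = c[j] + a[j] * p[j - 1];
        p[j] = -b[j] / d;
        q[j] = (F[j] - a[j] * q[j - 1]) / d;
    }

    // Process 1: right-to-left elimination, U[j] = r[j]*U[j-1] + s[j].
    r[J] = -a[J] / c[J];  s[J] = F[J] / c[J];
    for (int j = J - 1; j >= h; --j) {
        const double d = c[j] + b[j] * r[j + 1];
        r[j] = -a[j] / d;
        s[j] = (F[j] - b[j] * s[j + 1]) / d;
    }

    // Exchange of (p,q)[h-1] and (r,s)[h], then the 2x2 interface system.
    U[h - 1] = (p[h - 1] * s[h] + q[h - 1]) / (1.0 - p[h - 1] * r[h]);
    U[h]     = r[h] * U[h - 1] + s[h];

    // Backward substitution, again done independently by both halves.
    for (int j = h - 2; j >= 0; --j) U[j] = p[j] * U[j + 1] + q[j];
    for (int j = h + 1; j <= J; ++j) U[j] = r[j] * U[j - 1] + s[j];
}
```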
Now we are ready to describe the main decomposition details of the parallel version of ADI algorithm (
15)–(
20).
1.
The case of $p=2$ processes. The discrete grid
${\bar{\omega }_{1}}$ is split into two parts
${\bar{\omega }_{1}}={\omega _{1,p0}}\cup {\omega _{1,p1}}$, where
The first process computes the solution on the discrete grid
$({\omega _{11}}\times {\bar{\omega }_{2}}\times {\bar{\omega }_{3}})\cup {\bar{\omega }_{12}}$, and the second process computes the solution on the grid
${\omega _{13}}\times {\bar{\omega }_{2}}\times {\bar{\omega }_{3}}$.
Before the computation of the explicit step (
21), the first process sends the value
${U_{{K_{2}}00}^{n}}$ to the second process, and the second process computes the value of
$S({U_{{K_{2}}+1}^{n}})$ and sends it to the first process. Then both processes compute their parts of
${U^{n+\frac{1}{4}}}$ in parallel.
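A minimal MPI sketch of this scalar exchange is shown below, assuming rank 0 owns the left 3D block together with the 1D segment and rank 1 owns the right 3D block; the tags, names and the commented helper are illustrative only:

```cpp
#include <mpi.h>

// Exchange preceding the explicit step (p = 2): rank 0 sends the 1D value
// U(K2) to rank 1, rank 1 returns the cross-section average S_h(U(K2+1)).
// u_K2 and s_K2p1 are local scalars; compute_average_K2p1() stands for the
// trapezoidal averaging of the local 3D layer (an illustrative helper).
void exchange_interface_p2(int rank, double& u_K2, double& s_K2p1)
{
    const int tag = 100;
    if (rank == 0) {
        MPI_Send(&u_K2, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
        MPI_Recv(&s_K2p1, 1, MPI_DOUBLE, 1, tag + 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(&u_K2, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        // s_K2p1 = compute_average_K2p1(...);  // S_h(U^n_{K2+1}), local data
        MPI_Send(&s_K2p1, 1, MPI_DOUBLE, 0, tag + 1, MPI_COMM_WORLD);
    }
}
```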
The
$l=2,3$ steps of the ADI scheme (
22) do not require any communication between processes and can be done in parallel.
The last step
$l=4$ of the ADI scheme (
22) is implemented in the following way. The first process computes the factorization coefficients of (
23) and (
25), and the second process computes coefficients of (
30). These computations can be done in parallel. Then the second process computes the values of
$S({\beta _{{K_{2}}+1}^{n}})$,
$S({\gamma _{{K_{2}}+1}^{n}})$ and sends them to the first process. The latter solves system (
31) and sends the value
${U_{{K_{2}}00}^{n+1}}$ to the second process. Then both processes compute their parts of
${U^{n+1}}$ in parallel.
A simple but still accurate complexity model of the parallel ADI algorithm can be constructed
where
${T_{2}}$ is computation time of the one time step of the parallel algorithm,
α is data communication start-up time,
β defines time required to send one element.
We note that this model takes into account only main computational and data communication costs. In real computations the values of these parameters can vary. One more important detail, i.e. automatic memory caching is also not estimated in this formula. Thus efficiency estimates of such models are only asymptotically valuable, but exactly such information is sufficient in most cases.
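One plausible form of such a model, written under the assumptions that the arithmetic work of one time step is $c{J_{1}}{J_{2}}{J_{3}}$ and that for $p=2$ only a fixed number k of scalar values is exchanged (both k and the concrete form below are our assumptions), is
\[
{T_{2}}\approx \frac{c{J_{1}}{J_{2}}{J_{3}}}{2}+k(\alpha +\beta ).
\]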
Let us define the speed-up
${S_{p}}$ and efficiency
${E_{p}}$ of the parallel algorithm
where
p is the number of processes,
${T_{0}}$ is the computation time of the sequential algorithm.
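These coefficients are understood in the standard way, with ${T_{p}}$ denoting the wall-clock time of one run on p processes:
\[
{S_{p}}=\frac{{T_{0}}}{{T_{p}}},\qquad {E_{p}}=\frac{{S_{p}}}{p}.
\]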
Since ${T_{0}}=c{J_{1}}{J_{2}}{J_{3}}$ and the additional data communication costs are constant, even for moderate-size grids ${\omega _{1}}\times {\omega _{2}}\times {\omega _{3}}$ it follows from (34) that
i.e. the parallel algorithm is scalable.
2.
The case of $p=4$ processes. In addition to splitting the discrete grid
${\bar{\omega }_{1}}$, the grid
${\bar{\omega }_{2}}$ is also divided into two parts
${\bar{\omega }_{2}}={\omega _{2,p0}}\cup {\omega _{2,p1}}$, where
The first process computes the solution on the grid
$({\omega _{11}}\times {\bar{\omega }_{2,p0}}\times {\bar{\omega }_{3}})\cup {\bar{\omega }_{12}}$, the second process computes the solution on the grid
${\omega _{13}}\times {\bar{\omega }_{2,p0}}\times {\bar{\omega }_{3}}$, the third process computes the solution on the grid
$({\omega _{11}}\times {\bar{\omega }_{2,p1}}\times {\bar{\omega }_{3}})\cup {\bar{\omega }_{12}}$, and the fourth process computes the solution on the grid
${\omega _{13}}\times {\bar{\omega }_{2,p1}}\times {\bar{\omega }_{3}}$.
Before the parallel computations of the explicit step (21), the first process sends the value ${U_{{K_{2}}00}^{n}}$ to the second and fourth processes; these two processes compute their parts of $S({U_{{K_{2}}+1}^{n}})$ and send the obtained values to the first process. In a similar way the first process sends the value
${U_{{K_{1}}00}^{n}}$ to the third process, which computes its part of
$S({U_{{K_{1}}-1}^{n}})$ and sends the obtained value to the first process.
The second part of data communication is done among two pairs of processes: the first and third processes exchange the values of ${U_{i,{J_{2}}/2,k}^{n}}$ and ${U_{i,{J_{2}}/2+1,k}^{n}}$, $0<i<{K_{1}}$, $0\leqslant k\leqslant {J_{3}}$, and the second and fourth processes exchange the values of ${U_{i,{J_{2}}/2,k}^{n}}$ and ${U_{i,{J_{2}}/2+1,k}^{n}}$, ${K_{2}}<i<{J_{1}}$, $0\leqslant k\leqslant {J_{3}}$. These communications are done in parallel. Then all four processes compute their parts of ${U^{n+\frac{1}{4}}}$ in parallel.
The
$l=2$ step of the ADI scheme (
22) does not require any communication between processes and can be done in parallel.
The
$l=3$ step of the ADI scheme (
22) is implemented by using the parallel factorization algorithm, which is described above for the system of linear equations (
33). The data communication part between the given pairs of the processes is equivalent to the one defined for the explicit part of the ADI algorithm (
21). This step again can be done in parallel.
The last step
$l=4$ of the ADI scheme (
22) is implemented as for
$p=2$ processes with one small modification. The second and fourth processes compute the values of
$S({\beta _{{K_{2}}+1}^{n}})$,
$S({\gamma _{{K_{2}}+1}^{n}})$ and the third process computes the values of
$S({\alpha _{{K_{1}}-1}^{n}})$,
$S({\gamma _{{K_{1}}-1}^{n}})$. Then they send these values to the first process. It solves system (
31) and sends the values
${U_{{K_{1}}00}^{n+1}}$,
${U_{{K_{2}}00}^{n+1}}$ to the remaining three processes. Then all four processes compute their parts of
${U^{n+1}}$ in parallel.
The complexity model of the parallel ADI algorithm for
$p=4$ processes is given by
It follows from this theoretical model that the data communication costs now depend linearly on ${J_{1}}{J_{3}}$. Thus, only for sufficiently large ${J_{2}}$ can these costs be neglected and the estimates
are valid. We conclude that the parallel algorithm is scalable for sufficiently large ${J_{2}}$, and this specific size depends on both parameters: the computational complexity constant c and the data communication cost β.
3.
The case of $p=8$ processes. In addition to splitting discrete grids
${\bar{\omega }_{1}}$,
${\bar{\omega }_{2}}$ the discrete grid
${\bar{\omega }_{3}}$ is also divided into two parts
${\bar{\omega }_{3}}={\omega _{3,p0}}\cup {\omega _{3,p1}}$, where
The parallel algorithm is modified to include data distribution steps in the ${x_{3}}$ dimension; all communications are defined by formulas analogous to those given for the $p=4$ case in the ${x_{2}}$ dimension.
The complexity model of the parallel ADI algorithm for
$p=8$ processes is given by
It follows from this model that for sufficiently large grid sizes, when
${J_{1}}{J_{2}}{J_{3}}\gg {J_{1}}({J_{2}}+{J_{3}})$, the estimates
are valid and the parallel algorithm is scalable.
Remark 1.
Let us consider the case when the number of processors/cores is larger than 8. In order to use all
p processors, we modify the constructed parallel algorithm. We assume that
p is an even number and factorize it as
$p=2\times {p_{2}}\times {p_{3}}$. The decomposition of the discrete grid
${\Omega _{h}}$ is done in two steps. First, the
${\bar{\omega }_{1}}$ grid is divided into two parts following the algorithm defined for
$p\leqslant 8$ processes. Thus the implementation of the parallel algorithm in the ${x_{1}}$ dimension remains unchanged. Next, we divide the grids ${\bar{\omega }_{j}}$ into ${p_{j}}$ parts, $j=2,3$. If ${p_{j}}>2$, then in that dimension the proposed parallel factorization algorithm is replaced by a more expensive but more general algorithm. We recommend using Wang’s or the cyclic reduction parallel factorization algorithm (Kumar et al., 1994; Imankulov et al., 2021).
6 Computational Experiments
Consider the problem (
1)–(
4) in the domain Ω with geometry parameters
${X_{1}}=6$,
${X_{2}}=1$,
${X_{3}}=1$. We solve this problem until $T=1$ with the given initial and boundary conditions and a zero source function
The accuracy of time integration of ADI schemes. First, we have tested the time integration accuracy of the ADI solver for the full dimension model. A uniform space grid
${\Omega _{h}}$ with
${J_{1}}=120$,
${J_{2}}=20$,
${J_{3}}=20$ is used. Table
1 gives for a sequence of decreasing time step widths
τ the errors
$e(\tau )$ and the experimental convergence rates
$\rho (\tau )$ of the discrete solution for ADI scheme (
12) in the maximum norm:
where the reference solution $U({x_{1i}},{x_{2j}},{x_{3k}},T)$ is computed by using a very small time step $\tau =6.25\times {10^{-5}}$. In this way the time integration error introduced by the ADI scheme can be measured accurately.
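Here the error and the experimental convergence rate are computed in the usual way; the explicit convention below is our reconstruction, consistent with the values reported in Table 1:
\[
e(\tau )=\underset{i,j,k}{\max }\big|{U_{ijk}^{N}}(\tau )-U({x_{1i}},{x_{2j}},{x_{3k}},T)\big|,\qquad \rho (\tau )={\log _{2}}\frac{e(2\tau )}{e(\tau )}.
\]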
Table 1
Errors $e(\tau )$ and experimental convergence rates $\rho (\tau )$ for the discrete solution of ADI scheme (12) for a sequence of time steps τ.

τ | $e(\tau )$ | $\rho (\tau )$
0.002 | $3.0352\times {10^{-5}}$ | 2.016
0.001 | $7.5272\times {10^{-6}}$ | 2.012
0.0005 | $1.8549\times {10^{-6}}$ | 2.021
0.00025 | $4.411\times {10^{-7}}$ | 2.072
It follows from the presented results that the accuracy of the ADI scheme agrees well with the theoretical prediction.
The accuracy of the reduced dimension model. Next we investigate the accuracy of the reduced dimension model (
5)–(
11). For the space discretization a uniform grid
${\Omega _{h}}$ with
${J_{1}}=300$,
${J_{2}}=50$,
${J_{3}}=50$ is used, and integration in time is done with
$\tau =0.001$.
The numerical tests were performed on a computer with an Intel® Xeon® E5-2670 processor and 8 GB RAM, and the algorithms were implemented in the C++ language.
We note that the CPU time for computing the full model solution by using the ADI scheme (
12) is 55.9 seconds.
Table 2 gives, for a sequence of values of the reduction parameter δ, the errors $e(\delta )$ of the discrete solution of the reduced dimension model (5)–(11) in the maximum norm, where ${U_{ijk}^{N}}$ is the solution of the ADI scheme (12) and ${U_{ijk}^{N}}(\delta )$ is the solution of the reduced dimension ADI scheme (15)–(20). CPU times are also presented for scheme (15)–(20).
Table 2
Errors $e(\delta )$ of the discrete solution of the reduced dimension model (5)–(11) for a sequence of values of the truncation parameter δ.

 | $\delta =1$ | $\delta =1.5$ | $\delta =2$ | $\delta =2.8$
$e(\delta )$ | $4.304\times {10^{-3}}$ | $1.748\times {10^{-4}}$ | $7.201\times {10^{-6}}$ | $4.522\times {10^{-8}}$
CPU time, s | 19.4 | 29.1 | 40.8 | 60.2
It follows from the presented results that already for $\delta =1$ the accuracy of the reduced dimension model is sufficient for most real-world applications, while the CPU time for the reduced dimension model is reduced by up to a factor of 3.
The efficiency of the parallel ADI scheme (
15)–(
20). In this section we present the results of the parallel scalability tests. All parallel numerical tests in this work were performed on the computer cluster “HPC Vanagas” at the High Performance Computing Laboratory of Vilnius Gediminas Technical University. We used up to 8 cores of an Intel® Xeon® E5-2630 processor (20 cores, 2.20 GHz) with 32 GB of DDR4 RAM. The parallel algorithms were implemented using the MPI library.
Our goal is to investigate the efficiency of the proposed parallel version of the ADI scheme (
15)–(
20). We have solved two problems with discrete meshes of different sizes:
$450\times 75\times 75$ and
$600\times 100\times 100$. The CPU time
${T_{p}}$, speed-up
${S_{p}}$ and efficiency
${E_{p}}$ coefficients are presented in Table
3.
Table 3
Scalability analysis of the parallel algorithm. The speed-up ${S_{p}}$ and efficiency ${E_{p}}$ coefficients for two problems of dimension $450\times 75\times 75$ and $600\times 100\times 100$.

 | ${T_{p}}(450)$ | ${S_{p}}(450)$ | ${E_{p}}(450)$ | ${T_{p}}(600)$ | ${S_{p}}(600)$ | ${E_{p}}(600)$
$p=1$ | 77.4 | – | – | 254.1 | – | –
$p=4$ | 17.5 | 4.42 | 1.12 | 57.79 | 4.40 | 1.10
$p=8$ | 9.85 | 7.86 | 0.97 | 29.2 | 8.70 | 1.08
Two conclusions follow from the presented results of computational experiments. First, the constructed parallel algorithm scales well even for problems of moderate sizes.
Second, the MPI library enables efficient usage of the local memory allocated to each process; thus, a parallel efficiency ${E_{p}}$ above 1 is achieved.
As a future task, it would be interesting to investigate the scalability of the modified parallel algorithm from Remark 1, when the number of processes is larger than 8.