A Parallel Algorithm for Solving a Two-Stage Fixed-Charge Transportation Problem

. This paper deals with the two-stage transportation problem with ﬁxed charges, denoted by TSTPFC. We propose a fast solving method, designed for parallel environments, that allows solving real-world applications eﬃciently. The proposed constructive heuristic algorithm is iterative and its primary feature is that the solution search domain is reduced at each iteration. Our achieved computational results were compared with those of the existing solution approaches. We tested the method on two sets of instances available in literature. The outputs prove that we have identiﬁed a very competitive approach as compared to the methods than one can ﬁnd in literature.


Introduction
When looking at the definition of supply chains (SCs), we find the commonly accepted variant: they are considered worldwide networks in which the actors are suppliers, manufacturers, distribution centres (DCs), retailers and customers. The typical SC performs several functions; these are: the purchase and processing of raw materials, and their subsequent conversion into intermediary and finished manufactured goods, along with the delivery of the goods to the customers. The major goal of this entire operation is the satisfaction of the customers' needs and wants.
A particular SC network design problem is the focus of this paper, more specifically, the two-stage transportation problem with fixed charges for opening the distribution centres. This is a modelling problem for a distribution network in a supply chain that is described as two-stage. This two-stage supply chain network design problem includes manufacturers, DCs and customers and its primary feature resides in the fact that for the opening of distribution centres there exist fixed charges added to the variable costs of transportation, that are proportionate to the quantity of goods delivered. The aim of the envisaged optimization problem is to determine which DCs should be opened and to pinpoint and choose the shipping routes, starting from the manufacturers and passing through the picked DCs to reach the customers, and to satisfy all the capacity restrictions at manufacturers and DCs so as to meet the customers' specific demands, minimizing the total costs of distribution. The problem with this design was considered first by Gen et al. (2006). For a survey on the fixed-charge transportation problem and its variants we refer to Buson et al. (2014), Cosma et al. (2018Cosma et al. ( , 2019Cosma et al. ( , 2020, Calvete et al. (2018), Pirkul and Jayaraman (1998), Pop et al. (2016Pop et al. ( , 2017, etc. The variant addressed within the current paper envisages a TSTPFC for opening the DCs, as presented by Gen et al. (2006). The same problem was also considered by Raj and Rajendran (2012). The authors of the two specified papers developed GAs that build, firstly, a distribution tree for the distribution network that links the DCs to customers, and secondly, a distribution tree for the distribution network that links the manufacturers to DCs. In both GAs, the chromosome contains two parts, each encoding one of the distribution trees. Calvete et al. (2016) designed an innovative hybrid genetic algorithm, whose principal characteristic is the employment of a distinct chromosome representation that offers information on the DCs employed within the transportation system. Lately, Cosma et al. (2019) described an effective solution approach that is based on progresively shrinking the solutions search domain. In order to avoid the loss of quality solutions, a mechanism of perturbations was created, which reconsiders the feasible solutions that were discarded, and which might eventually lead to the optimal solution.
The investigated TSTPFC for opening the DCs is an NP-hard optimization problem because it expands the classical fixed charge transportation problem, which is known to be NP-hard, for more information see Guisewite and Pardalos (1990). That is why we describe an efficient parallel heuristic algorithm.
Parallel computing seeks to exploit the availability of several CPU cores which can operate simultaneously. For more information on parallel computing we refer to Trobec et al. (2018).
In this paper, we aim to illustrate an innovative parallel implementation of the Shrinking Domain Search (SDS) algorithm described in Cosma et al. (2019), that is dealing with the TSTPFC for opening the DCs. Our constructive heuristic approach is called Parallel Shrinking Domain Search and its principal features are the reduction of the solutions search domain to a reasonably sized subdomain by taking into account a perturbation mechanism which permits us to reevaluate abandoned feasible solutions whose outcome could be optimal or sub-optimal solutions and its parallel implementation that allows us to solve real-world applications in reasonable computational time. The proposed solution approach was implemented and tested on the existing benchmark instances from the literature.
The paper is organized as follows: in Section 2, we define the investigated TSTPFC for opening the DCs. In Section 3 we describe the novel solution approach for solving the problem, designed for parallel environments. In Section 4 we present implementation details and in Section 5 we describe and discuss the computational experiments and our achieved results. Finally, the conclusions are depicted in Section 6.

Definition of the Problem
In order to define and model the TSTPFC for opening the DCs we consider a tripartite directed graph G = (V , A), that consists of a set of vertices V = V 1 ∪ V 2 ∪ V 3 and a set of arcs A = A 1 ∪ A 2 defined as follows: The entire set of vertices V is divided into three mutually exclusive sets corresponding to the set of manufacturers denoted by V 1 with |V 1 | = p, the set of distribution centres denoted by V 2 with |V 2 | = q and the set of customers denoted by V 3 with |V 3 | = r.
In addition, we suppose that: • Every manufacturer i ∈ V 1 has S i units of supply, every DC j ∈ V 2 has a given capacity T j , each customer k ∈ V 3 has a demand D k and the total number of units received by DC j, j ∈ V 2 from manufacturers and sent from DC j to customers is denoted by d j ; • Every manufacturer may transport to any of the q DCs at a transportation cost b ij per unit from manufacturer i ∈ V 1 , to DC j ∈ V 2 ; • Every DC may transport to any of the r customers at a transportation cost c jk per unit from DC j ∈ V 2 , to customer k ∈ V 3 ; • In order to open any of the DCs we have to pay a given fixed charge, denoted by f j and there exists a limitation on the number of the DCs that are permitted to be opened, denoted by w.
The goal of the investigated TSTPFC for opening the DCs is to select the DCs, the shipping routes and corresponding transported quantities on these routes, so that the customer demands are satisfied, all the transportation restrictions are fulfilled, and the total transportation costs are minimized. In Fig. 1 we present the investigated TSTPFC for opening the DCs. In order to provide the mathematical formulation of the investigated transportation problem with fixed charges, we consider the following decision variables: the binary vari-ables v j ∈ {0, 1} that indicate if DC j has been opened and the linear variables x ij 0 representing the amount of units have been transported from manufacturer i to DC j and y jk 0 representing the amount of units have been shipped from DC j to customer k.
Then the TSTPFC for opening the distribution centres can be formulated as the following mixed integer problem, proposed by Raj and Rajendran (2012): In order to have a nonempty solution set we make the following suppositions: The aim of the investigated problem is to minimize the total transportation costs, therefore the objective function has three terms associated to the transportation costs between manufacturers and distribution centres, between depots distribution centres and customers and the costs of opening the DCs, respectively. Constraints (1) guarantee that the capacity of the manufacturers is not exceeded, while constraints (2) ensure that the total shipment received from DCs by each customer satisfies its demand. Restrictions (3) are the flow conservation conditions and they guarantee that the units received by a DC from manufacturers are equal to the units shipped from the distribution centres to the customers and as well ensure that the capacity of the DCs is not exceeded. Constraint (4) limits the number of distribution centres that can be opened and the last three constraints ensure the integrality and non-negativity of the decision variables.

Description of the Parallel SDS Algorithm
The difficulty of the investigated transportation problem lies in the large number of feasible solutions, from which the optimal option should be chosen. Because each used DC adds a certain cost to the objective function, the fundamental decision of the algorithm is related to the distribution centres that should be used in order to optimize the total distribution costs. Thus, the operations of the algorithm can be separated into the following steps: S1: Choosing a set of promising distribution centres. S2: Solving an optimization subproblem in which only the DCs in the chosen set are used.
Each iteration of the PSDS algorithm involves one or more operations. The set of DCs that are used in the optimal solution is called the best set. At the initialization of the algorithm, the number of DCs in the optimal set (DCno) will be estimated based on the minimum capacity of the DCs and the total customer demand. This estimate will be permanently updated throughout the algorithm. If the distribution system has q DCs, then the optimum set search domain has q DCno elements. Evaluating each set involves solving the S2 subproblem. For large systems, it is not possible to evaluate all these variants. We will refer to the number of DCs in a set by the set type and to the cost of the distribution solution obtained by solving subproblem S2 through the set cost.
The proposed Parallel Shrinking Domain Search algorithm (PSDS) is an iterative algorithm that applies the following strategy: at each iteration, a fixed number of sets of the search domain will be evaluated, after which the search domain for the next iteration will be reduced. As the search domains narrow down, they will be explored at more detail. In the last iterations, the search domains can be explored exhaustively because they will contain fewer elements than the number settled for evaluation. The algorithm ends when a single set exists in the search domain. In order to avoid losing the optimal solution in the domain search reduction and for DCno estimate adjustment, a perturbation mechanism has been created to reconsider some sets outside the search domain. The search strategy of the PSDS algorithm is shown in Fig. 2.
The solution-building process is relatively expensive because it requires a large number of operations and the problem is even more complex as the size of the distribution system is larger. The performance of the algorithm has been improved by building solutions in parallel. For this purpose, the Java Fork and Join framework has been used.
In Fig. 3, we illustrate the operating principle of the proposed PSDS algorithm. The parallel shrinking domain algorithm maintains the following set of lists: • Good DCs -contains the promising DCs, based on which the sets within the search domain will be generated at every iteration. This list contains the features of surviving sets; • Bad DCs -a list of disadvantageous DCs that is necessary for the implementation of the perturbation operation, and to correct the DCno estimation; • Good sets -a list that preserves the best performing sets discovered during the execution of the algorithm. This list has a fixed number of items representing the surviving sets.
The sets found at the beginning of this list, contain only good DCs. In the remaining of this document, they will be called Best sets; • Sets for evaluation -a list of prepared sets for evaluation representing the sets from next generation of sets; • Sets types -a list that contains a quality estimation of all types of sets found in the Best sets list. This is a list of structures {type, quality}; • All sets -a hash set containing every set created and placed in the Sets for evaluation list during the execution of the algorithm.
At the initialization of the algorithm, the optimal set type (DCno) is estimated and all available DCs are added to the Bad DCs list. Then the Good Sets list, the Sets for evaluation list and the All sets hash set are constructed and a single item of DCno type and quality = 1 is added to the Sets Types list. Next, a Thread Pool is created with a number of Worker Threads equal to the number of CPU logic processors. For performance comparison, experiments with fewer threads were also performed. The worker threads will be enabled at each iteration by creating a Recursive Evaluator task that will be sent to the Thread Pool for parallel evaluation of the sets in the Sets for evaluation list.
Every iteration of the proposed algorithm executes in sequence the four main blocks presented in Fig. 3: Generation, Evaluation, Selection and Classification. The first block prepares the sets to be evaluated, the second one deals with the evaluation of the sets, and the last two handle the results. From the point of view of complexity, the first and the last two blocks are negligible in relation to the second. The efficiency of the algorithm has been significantly improved by parallel implementation of the processing in the Evaluation block.
The Generation block contains three types of generators for feeding the Sets for evaluation list. All the sets created during the algorithm are kept in a hash set All Sets. Thus, each duplicate can be detected and removed easily. Such a mechanism could be implemented because the total number of sets is relatively small. The optimization process ends after a very small number of iterations. The total number of sets generated during the execution of the algorithm is relatively small. Due to this property, the PSDS algorithm can be applied for solving large instances of the problem.
The first generator type creates a fixed number of sets by picking at random DCs from the Good DCs list. The types of the created sets are retrieved from the Best types list. For each type present in this list, a number of sets proportional to the quality of the type are generated. The quality of the type is determined by the Classification block.
The second generator type creates perturbations by inserting "bad" distribution centres taken from the Bad DCs list in the good sets from the Good sets list. This operation is essential for our optimization process: there could be distribution centres erroneously categorized by the Classification module, because they were found only in sets composed mainly by "bad" distribution centres. Due to the perturbation mechanism, at each iteration these sets get an opportunity to return to the Good DCs list. This mechanism creates a new set for each "bad" distribution centre in the Bad DCs list, by changing one element of a set taken from the Good sets list. The Good sets list is processed in the order given by the cost of the corresponding distribution solutions. This attempts to place each "bad" distribution centre into the best possible set.
Another key operation is the update of the DCno estimation. For this purpose, both larger and smaller sets than those existing in the Best sets list will be created by the third generator type. For creating larger sets, each "bad" distribution centre is added to the best possible set from the Best sets list. The smaller sets are generated by cloning sets from the Best sets list and randomly deleting one of their elements.
The Evaluation block has the role of evaluating the sets from the Sets for evaluation list. For this purpose, a Recursive Evaluator task is created, which is sent to the Thread Pool for execution. The operation of this task is shown in Fig. 4. The Recursive Evaluator task divides the Sets for evaluation list into two equal parts, then creates two sub-tasks (ST ) to evaluate the two halves. When the number of items a task has to evaluate drops below a certain threshold, a Final Task (FT ) is created that will be executed by one of the Worker Threads. For the results presented in this article, a threshold of 10 was used. The algorithm by which each set is evaluated by a FT will be presented at the end of this section.
The Selection and Classification blocks are triggered when all the worker threads have ended. The Selection module takes all the sets in the Sets for evaluation list that are better than the last element in the Good sets list and moves them to that list. Next, the Good sets list is sorted by the set costs and only the best elements are kept, so that its size stays constant. Then the Sets for evaluation list is cleared to make room for the next generation of sets.
The Classification block uses the information in the Good sets list for updating the Good DCs and Bad DCs lists. The Good sets list is traversed based on the cost order. The first distribution centres found in the Good sets elements are added to the Good DCs list, and the remaining ones form the Bad DCs list. The number of sets selected to form the Good sets list decreases each time. Due to this, the PSDS algorithm ends after a small number of iterations.
The Classification block estimates the quality of the set types in the Best sets list and places the result in the Sets types list. The quality of each set type is estimated based on the number of items of that type found in the Best sets list and the positions of those items in the list.
The DC set evaluation operation is presented in Fig. 5. The relatively bulky data structure representing the characteristics of the distribution system (the unit costs b ij and c jk ), the fixed costs (f j ), the demands of the customers (Dk) and the manufacturers and distribution centers capacities (S i and T j ) is static, so it will not be copied at the creation of each task. Because this data is shared by all the final tasks running in paralel, it must remain unchanged. For the construction of each distribution solution, two small lists (used DCs list and used Ms list) will be constructed by the final task, in which the quantities to be delivered by the DCs and manufacturers used in the solution will be kept. Each solution is constructed in r stages. At each stage the best supply variant for one customer is searched for. Every decision taken in this stage will affect all decisions to be taken in the next stages, as certain transport links might be opened and some of the capacities of manufacturers and distribution centres will be consumed.
Each customer demand is resolved in one or a few stages. At every stage, the cheapest Manufacturer -Dictribution Centre -Customer supply route is sought by the Find Route module, that performs a greedy search. The cost of a route depends on the amount of transported goods, the unit costs of the transport lines and the fixed costs of the transport lines that have not been opened previously. If the found route can not ensure the entire demand of the customer, because of the limited capacities of the distribution centres and manufacturers, then an extra stage is added for the remaining quantity.
The Supply module only uses the two local lists as storage area and updates the total unit costs corresponding to the distribution solution. The last operations of the Supply module is the removal of the unused distribution centres from the evaluated set, and the addition of the remaining distribution centres fixed costs to the final cost of the set.

Implementation Details
In the description of our algorithm we will use the following abbreviations: The hierarchy of the procedures that make up the PSDS algorithm is shown in Fig. 6. The startup of the algorithm is presented in Algorithm 1. Based on experiments, the following initialization parameters were used: totalSetsInit = 6q, goodPercentInit = 0.5, where q is the total number of distribution centres. The calls on lines 15 and 16 generate and evaluate an initial collection of sets and then, the call on line 17 starts the optimization process.
The generation procedure is presented in Algorithm 2. It generates multiple sets based on the distribution centres found in the DCList parameter. This procedure produces perturbations only if the second parameter is true. Petrurbations are almost always generated. The only exception occurs in the case of the call on line 15 of the startup procedure. A number of sets equal to totalSets will be generated at most. For each set type in the SetsTypes list, a number of sets in proportion with the set quality will be generated. If the DCList does not have enough elements for generating the required amount of sets, then the exhaustiveGenerator procedure will be called to generate all the possible sets. Otherwise procedure randomGenerator will be called. If perturbations are required, then the procedures largerSetsGenerator and perturbationsGenerator are called for each distribution center in the BadDCs list.
The randomGenerator procedure presented in Algorithm 3 creates at least variantsNo random sets of certain type with distribution centres taken from the DCList. The new created sets are added to the Sets for evaluation list. The procedure shuffles the distribution centres in a working list and avoids putting a distribution centre multiple times in the same set. Each set is validated on line 10. For a valid set, the sum of the distribution centres capacities must be large enough to satisfy all the customers, and the set has to pass the Duplicates detector.
The exhaustiveGenerator procedure is presented in Algorithm 4. It generates all possible sets of certain type with the distribution centres from the DCList. The newly created sets are validated on line 3 and added to the Sets for evaluation list.
The largerSetsGenerator procedure is presented in Algorithm 5. It searches through the Best sets list to find the best ranked set in which the badDC can be inserted to create a new validated set. The procedure is called in line 14 of the generation procedure, aiming to insert each bad distribution centre in a new valid set.
The perturbationsGenerator procedure presented in Algorithm 6 searches in the Good sets list, the first set in which an item can be substituted by the badDC. All the new created sets are added to the Sets for evaluation list. The procedure is called in line 15 of the generation procedure. It tries to insert each "bad" distribution centre in the best possible set.
The evaluation procedure is presented in Algorithm 7. It is called in the startup and shrinkingDomainSearch procedures, when the preparation of the Sets for evaluation list is finished. The procedure creates a Recursive evaluator task for all the sets in the Sets for evaluation list, and sends it to the Thread pool for execution. When the evaluation ends, all the sets that are not worse than the ones in the Good sets list are moved to that list. The Sets for evaluation list is cleared, to be prepared for a new iteration, and the Good sets list is sorted according to sets cost.
The recursiveEvaluator procedure is presented in Algorithm 8. The first parameter represents the position of the first set from the SetsForEvaluation list to be considered, and the length parameter represents the number of sets to be evaluated, starting from the first. If the value of the length parameter exceeds the value of the threshold wThreshold, then through the calls on lines 9 and 10 two sub-tasks are created, each having to evaluate half of the initial number of sets. When the length parameter falls below wThreshold, the procedure is converted into a Final Task that is retrieved and executed by one of the available Worker Threads in the Thread Pool.
The shrinkingDomainSearch presented in Algorithm 9 is the central procedure of the PSDS algorithm. The search domain is reduced at every iteration of the main loop, by decreasing the goodPercent. Therefore, the goodDCsNo is also reduced. The speed parameter controls the convergence of the algorithm. By increasing the speed parameter, the total number of iterations is reduced. This reduces the total number of operations, the algorithm runs quicker, but the optimal solution might be lost, because the search domains are explored less thoroughly. For the results included in this paper, the speed parameter was fixed to 1.1. The updateGoodDCs procedure call rebuilds the Good DCs and Bad DCs lists, at each iteration. The for loop estimates the quality of all the set types from the Best sets list, considering the number of items of each type, and the positions of those items in the list. These estimations will determine the number of sets that will be generated for each type. The DCNo estimation is updated on line 18. The generators and evaluation procedures are called at the end of the main loop.
The updateGoodDCs procedure is presented in Algorithm 10. It moves from the Bad DCs list to the Good DCs list a number of distribution centres equal to GoodDCsNo. The distribution centres are taken from the best items of the Good sets list. The bestSetsNo variable is recalculated in the process.
The smallerSetsGenerator procedure presented in Algorithm 11 generates all the possible sets by removing one distribution centre from the items in the Best sets list. Each new created set is validated before being added to the Sets for evaluation list.
The remaining of this section is dedicated to the adjustment of the algorithm's operating parameters. The charts presented in Figs. 7-12 show the gaps for the average of the best solutions and for the average running times required to find the best solutions, in the cases of four different instances. The selected instances have been run 10 times, using five different values for the following algorithm parameters: speed, goodPercentInit and totalSetsInit.
The gap of the average best solution Zg v is given by relation (9), where v is the value of the studied parameter, ref is the reference value of the same parameter, Z v is the average of the best solutions found in the ten runs of the instance when using v and Z ref is the average of the best solutions found in the ten runs of the instance when using ref. The gap of the running time T g v is given by relation (10), where T v is the average of the running times when using v and T ref is the average of the running times when using ref.
Figures 7 and 8 deal with the speed parameter. The reference value of this parameter that has been set for building the charts is 1.1. For higher values of this parameter, the algorithm ends faster because it performs fewer iterations, but this increases the likelihood of missing the optimal solution, because the search domains are reduced too much with  The reference value of this parameter that has been set for building the charts is 50. The graph in Fig. 9 shows that the decrease of this parameter below the reference value increases the probability of missing the optimal solution, and the graph in Fig. 10 shows that increasing this parameter over the reference value also unjustifiably increases the running time. Figures 11 and 12 deal with the totalSetsInit parameter, that is: totalSetsInit = q × t c , where t c is the totalSetsInit coefficient, and q is the number of distribution centres. The reference value of the totalSetsInit coefficient that has been set for building the charts is 6. The chart in Fig. 11 shows that decreasing the initial number of sets below q × 6 increases the likelihood of missing the optimal solution, and the chart in Fig. 12 shows that increasing too much the initial number of sets has a negative impact on the running time of the algorithm.

Computational Results
This section is dedicated to the achieved computational results with the aim of assessing the effectiveness of our approaches suggested for solving the TSTPFC for opening the DCs. The results presented in this section were obtained by running our algorithm for solving the TSTPFC for opening the DCs on a set of 16 test instances of medium sizes, and on a set of 8 instances of larger sizes. Both sets of benchmark instances are known from literature. We refer to Calvete et al. (2016) for more information regarding the characteristics of the first set of instances, and to Cosma et al. (2019) for more information regarding the characteristics of the second set instances.
We implemented our parallel heuristic algorithm for solving the considered transportation problem, in Java language. We have run each of the instances 10 times, as Calvete et al. (2016) did. For our tests, we used two computing systems having the following two significantly different Central Processing Units: • Intel Core i5-4590 CPU at 3.3 GHz having 4 cores / 4 logical processors; • Intel Xenon Silver 4114 at 2.2 GHz having 10 cores / 20 logical processors.
For the envisaged test instances, we compared our developed parallel heuristic algorithm with the existing solution approaches with the aim of analysing the performance of our solution. The obtained computational results are presented in Tables 1 and 2. In Table 1, we describe the results of the computational experiments in the case of the two classes of instances introduced by Calvete et al. (2016). The first column of the table provides the name and the characteristics of the test instance, the next column provides the optimal solution of the problem, denoted by Z opt achieved by the professional optimization software LINGO and as well the corresponding execution time T opt required to obtain it, the next column displays the best solution Z best achieved by Calvete et al. (2016) and as well the corresponding average computational time T avg and average number of iterations I t avg required to achieve the best solution. Finally, the last seven columns display results achieved by our novel parallel heuristic algorithm: the best solution Z best achieved in all ten runs of the computational experiments, the corresponding best computational times T best and average computational times T avg for obtaining the solution, the best and the avearage iteration at which the best solution appears, the average duration of an iteration avg T /it in seconds and the number of worker threads used in each experiment wt#. The computational times are displayed in seconds, with four exceptions, in the case of problems P 6,1 , P 7,1 , P 8,1 and P 8,2 where LINGO needs more than two hours to solve the problems.  The results in Table 1 show that our developed parallel heuristic algorithm delivers the same result as the one provided by Calvete et al. (2016), in all ten runs of computing experiments. These results correspond to the optimal solutions of the problem obtained by LINGO. In terms of efficiency, our parallel heuristic algorithm runs faster than the hybrid evolutionary algorithm proposed by Calvete et al. (2016) when using a single working thread and our calculation runtimes decrease as the number of worker threads increases, for all the tested instances.
Since Table 1 shows a comparison of the running times of the proposed algorithm with those obtained by Calvete et al. (2016), a comparison of the effectiveness of the programming languages in which the two algorithms were implemented and a comparison of the processing power of the CPUs used in the experiments, are required. The algorithm proposed by Calvete et al. (2016) has been run on an Intel Pentium D CPU at 3.0 GHz. For the results presented in Table 1, we used an Intel Core i5-4590 processor at 3.3 GHz. The single thread ratings of the two processors are shown in PassMark. Pentium D rating: 698, Core i5 rating: 2114. The processor used in our experiments is 3.03 times more powerful. Regarding the languages, the proposed algorithm is implemented in Java while the algorithm proposed by Calvete et al. (2016) is programmed in C++. A comparison of the two programming languages in terms of efficiency is shown in Hundt (2011). C++ has a time factor of 1, and 64 bit Java has a time factor of 5.8. The programming language used for implementing the PSDS algorithm is 5.8 times slower. We considered that the greater speed of the processor roughly compensates the slowness of the Java language. Because the ratings are always approximate, we did not use a scaling factor. The times shown in Table 1 were actually measured during the experiments.
The results corresponding to the set of larger instances are presented in Table 2, and they are compared to the results achieved by Cosma et al. (2019). The first two columns display the instance number (I #) and the instance features (p, q, r and w). The next five columns display the results of the Shrinking Domain Search (SDS) algorithm reported by Cosma et al. (2019): the best solution Z best , the best and the average running time for finding the best solution and the best and the average iteration in which the best solution appears. The next column displays the CPU used in the experiments. The last six columns display the results achieved by the PSDS algorithm: the best and the average running times for obtaining the best solution, the best and the average iteration at which the best solution appears, the average duration of an iteration avg T /it in seconds and the number of worker threads wt#. We reported the computational times in seconds.
Two different processors (Intel i5-4590 and Intel Xenon 4114) were used for the experiments shown in Table 2. Because the working frequencies of the 2 processors are different, for analysing the results we calculated a scaling factor based on the single thread results as follows: s = average(t Xe /t i5 ), where t i5 and t Xe are the average running times required for finding the best solution in the case of i5 and Xenon processors. Thus, based on the data in Table 2, s = 0.89. Analysing the data in Table 2, it turns out that in single thread mode, the PSDS algorithm is on average 67.58% less efficient than the SDS algorithm. This decrease in efficiency occurs because some of the CPU power used to initiate the parallel processing, and the Recursive evaluator cannot be as effective as the evaluation procedure in the SDS algorithm. When 4 worker threads are enabled, then in the case of the i5 processor, the average running time required for finding the optimal solution decreases by an average of 55.09%. When 20 worker threads and the Xenon processor are used, then the average running time required to find the optimal solution decreases by an average of 80.06%. The scaling factor s = 0.89 was used to calculate this gain. It should be noted here that although 20 worker threads have been activated, the Xenon processor has only 10 physical cores, so the efficiency of the PSDS algorithm cannot be increased by increasing the number of worker threads above 10.
In Tables 1 and 2 we may remark that, in the case of all the test instances our parallel heuristic algorithm obtained the same results in all ten runs of computational experiments. This confirms both the robustness and the quality of our developed innovative method. The computational execution time decreases as the number of worker threads increases for all instances tested. Because the algorithm has an important random component, the number of iterations required until the optimal solution is found differs at each of the runs. For this reason, the run times are not inversely proportional to the number of threads. To better  quantify the gain due to parallelism, the average time of an iteration was calculated for each run, after which an average was calculated for each test instance. Thus, it can be seen that for relatively small instances (P1,1-P4,1 and P2,2-P4,4), the gain is negligible because there is not enough data to be processed. For the other instances, the gains are significant. The average duration of an iteration roughly halves when doubling the number of worker threads. In terms of single thread performance, the Xenon processor is weaker than the i5 processor, because of the lower clock frequency. The Xenon processor has 10 cores and 20 logical processors. As expected, our algorithm could not obtain any significant gain in terms of efficiency when increasing the number of worker threads over the number of physical cores. Figures 13 and 14 show a comparison of the time evolution of the solutions found by the PSDS algorithm according to the number of used worker threads. The Intel Core i5-4590 CPU at 3.3 GHz was used in the case of Fig. 13 and the Intel Xenon Silver 4114 CPU at 2.2 GHz was used in the case of Fig. 14. Each graph represents the average of the best found solution as a function of the running time, when using a certain number of worker threads. At least ten runs of the second last instance from Table 2 were performed for each of the graphs. The graphs demonstrate the effectiveness of parallel processing. The time required to obtain the same result roughly halves when doubling the number of worker threads.

Conclusions
This study suggests an effective and fast constructive parallel heuristic algorithm whose purpose is to solve the two-stage transportation problem with fixed-charges for opening the distribution centres, which generates an essential design for the distribution system from manufacturers to customers via the DCs.
Our parallel solution approach is based on reducing of the solutions search domain to a subdomain with a reasonable size by considering a perturbation mechanism that permits us to reevaluate abandoned solutions that could conduct to optimal or sub-optimal solutions. Our approach is designed for parallel environments and takes advantage of the multi-core processor architectures.
The achieved computational results on two sets of instances from the existing literature: the first one consisting of 20 medium size benchmark instances and the second one consisting of 8 large size benchmark instances prove that our suggested innovative method is remarkably competitive, and surpasses in terms of execution time the other existing solution approaches meant for providing solutions to the TSTPFC for opening the DCs, allowing us to solve real-world applications in reasonable computational time.
Here are some significant characteristics of the method we suggest: it is designed for parallel environments and takes advantage of the new multi-core processor architectures; it is based on the reduction of the solution search domain to a subdomain with a reasonable size by considering a perturbation mechanism that permits us to reevaluate abandoned solutions that could conduct to optimal or sub-optimal solutions; it is extremely effective, offering outstanding solutions to all the instances tested and in all ten runs of the computing experiments, and it can be adapted easily to various supply chain network design problems, proving its flexibility. D. Dănciulescu received the BS degree in informatics from the University of Craiova in 1994, the BS degree in accounting and management informatics, in 2009, the MS degree in management, in 2011, the PhD degree in cybernetics and economic statistics from the Faculty of Economic Sciences, University of Craiova, in 2003, the second PhD degree in computer science from the West University of Timişoara, in 2015, and the habilitation degree in economic informatics from the A. I. Cuza University of Iaşi, in 2018. She is currently an associate professor with the University of Craiova. She has authored a monograph, eight course books for students and over 80 papers in peer-reviewed journals, out of which 18 ISI papers and papers in ISI or conference proceedings indexed in well-known international databases.