Optimizing electrostatic similarity for virtual screening: A new methodology

Ligand Based Virtual Screening (LBVS) methods are widely used in drug discovery as ﬁlters for subsequent in-vitro and in-vivo characterization. This means, increasing accuracy of LBVS approaches may have a huge impact on increasing chances of success. Since the databases processed in drug discovery campaigns are enormously large, this pre-selection process requires the use of fast and precise methodologies. The similarity between compounds can be measured using diﬀerent descriptors such as shape, pharmacophore or electrostatic similarity. The latter is the goal of this work, i.e., we want to improve the process of obtaining the compounds most similar to a query in terms of electrostatic similarity. To do so, the current and widely proposed methodology in the literature is based on the use of ROCS to assess the similarity of compounds in terms of shape and then evaluate a small subset of them with ZAP for priori-tization regarding electrostatic similarity. This paper proposes an alternative methodology that consists of directly optimizing electrostatic similarity and works with the entire database of compounds without using shape cut-oﬀs. For this purpose, a new and improved version of the OptiPharm software has been developed. OptiPharm implements a parameterizable metaheuristic algorithm able to solve any optimization problems directly related to the involved molecular conformations. We show that our new method completely outperforms the classical proposal widely used in the literature. Accordingly, we are able to conclude that many of the compounds proposed with our novel approach could not be discovered with the classical one. As a result, this methodology opens up new horizons in Drug Discovery.


Introduction
The constant increase in the size of the databases used in Drug Discovery requires efficient techniques and methods that can be used to select the compounds most similar to a query molecule and at the lowest possible cost. One of these techniques is Virtual Screening (VS). VS is an in-silico technique that allows large libraries with millions of compounds to be processed in order to find new compounds related to a pharmacological query based on one or more features. [1][2][3][4] This represents a great advantage over experimental methods such as High-throughput screening (HTS) in terms of efficiency, budget, time and development cost. 5 The resulting compounds from VS are subsequently acquired and empirically tested in the laboratory. In addition, VS techniques are often used as a pre-filter for HTS. 6 All these advantages have increased the popularity of these techniques, which have experienced great advances over the last two decades. The interested reader is referred to previous works 7-11 for a description of different methods and tools currently used on VS.
However, there is still room for improvement regarding the accuracy of VS predictions so as not to discard promising compounds, or to reduce the time and error of calculations that compute the different features of the studied compounds. 12 VS applied to the electrostatic similarity of compounds is a clear example of this. Contrary to what happens when VS is applied to select the most similar compounds in shape or pharmacophore properties, where the tools base their predictions on scoring functions that measure these particular features, 7,13,14 the predictions in this field are not exclusively based on this descriptor, but on both the similarity of the three dimensional shape and electrostatic similarity. [15][16][17][18][19][20][21][22][23][24][25][26][27] Broadly speaking, all the previous works follow the same methodology, although they may differ in the selection procedure used to determine the compounds proposed as best predictions. Essentially, they initially optimize the compounds in the database against the query in terms of shape by using ROCS, 28 they prioritize the top N compounds with the highest shape similarity values and then evaluate them in terms of electrostatic similarity. It is noteworthy that in many different studies, the compounds with the highest values of shape are proposed as the most promising ones, while in just a few , only compounds where the combination of shape and potential is greater than a certain threshold are selected for further experimental validation. In this work, we will focus on comparison with the former approach, since it is the most widely used. The value of N is not fixed, as it depends on the particular study. Usually, N is less than 10% of the total compounds in the database. 18,23,24 A search for the best compounds basing on shape pre-filtering may be counterproductive, since the selection of a low value of N can rule many promising compounds out, which may have a significant impact on the final results.
In this work, we propose a new approach that involves the direct optimization of the electrostatic similarity, which can provide a more realistic description of compound bioactivity. This has been possible thanks to a novel software, called OptiPharm, able to optimize the similarity between two given molecules for any scoring function provided. Not only that, but OptiPharm also possesses other advantages, since it can be parameterized according to different aims such as carrying out an exhaustive search for a compound or, alternatively, obtaining a re-sult of good quality but in a short time. It means OptiPharm can be adapted to different sorts of problems which is an advantage regarding the virtual screening problem, where compounds have different sizes and complexities. The interested reader is referred to the original work 13 for an in-depth description of this method. Additionally, they have software available with the web tool BRUSELAS 29 (http://bio-hpc.eu/software/Bruselas). In this work, OptiPharm has maximized electrostatic similarity by including an objective function measuring it as a unique modification. In the near future new objectives such as the pharmacophoric scoring function or the quantum mechanical based field scoring function will be also included. As results show, new algorithms such as OptiPharm are required since the solutions found are better and hence total time and budget are reduced in the discovery process. Furthermore, most of the research has always followed the same methodology, so the new results obtained are practically unexplored.
The rest of the paper is organized as follows. Section Methods describes the two methods used for virtual screening based on electrostatic similarity, as the literature approach as the novel proposal. Additionally, a brief description about the mathematical formulation of the scoring functions is included. Section Results describes the framework where the experiments have been carried out and the main results obtained. Finally, the conclusions and lines for future research are summarized in the last section.

Methods
In this section, we describe the two optimization methodologies for Ligand Based Virtual Screening (LBVS) based on electrostatic similarity. The former is currently the method most frequently used in the literature. In short, it computes a sublist of molecules with the highest three-dimensional shape similarity. Usually, such a sublist is only composed of less than 10% of the total number of compounds in the database. From the reduced list, the compound(s) with the greatest electrostatic similarity is/are selected. The second one involves the solving of an optimization problem guided by a function measuring the electrostatic similarity. Subsections Literature approach: The LBVS method guided by molecular shape (LBVS-S) and The new approach: A LBVS method guided by electrostatic similarity (LBVS-E) briefly describe both techniques. For the sake of completeness, it is included here a first subsection devoted to defining the mathematical functions used to guide the searching processes. The figures in which the values of these objective functions are graphically represented have been created with VIDA v4.4.0 30 using the default configuration.
Scoring functions to measure similarity between compounds.

Shape Similarity
The shape similarity of two compounds is calculated as follows: where p i and p j are set to 2.7, α i and α j obtain the van der Waals value for each atom and where R ij is the distance between atoms i and j. Notice that the accuracy obtained from Equation 1 depends on the number of atoms in the two compared molecules, i.e., the higher this number, the longer the value of V AB as an absolute value. To be able to measure the level of similarity between compounds, regardless of the number of atoms that they are composed of and the descriptor used, the Tanimoto Similarity 31 value is computed as follows: where V AB is the A molecule overlaid onto B molecule. V AA and V BB is the overlap of the molecules A and B, respectively. Equation 3 has a value in the range [0, 1], where 0 means there is no overlapping and 1 means the shape of both molecules is the same.

Electrostatic similarity
The electrostatic similarities are obtained by numerical solution of the Poisson equation, 32 viz: where ∇(r) is the electrostatic potential, (r) is the dielectric constant, and ρ mol (r) is the molecular charge distribution. Electrostatic similarity between two compounds is compared by determining E AB : where Θ is a masking function to ensure potentials inside the compound are not considered part of the comparison. The integral appearing in Equation 5 is a volume integral, computed using a grid-spacing parameter, h. Again the accuracy obtained by Equation 5 depends on the number of atoms in the compared molecules. As such, similarly to what was done previously, the Tanimoto Similarity 31 value has been computed as follows: where E AB is the A molecule overlaid onto B molecule. E AA and E BB is the overlap of the molecules A and B, respectively. In this case, Equation 6 has a value in the range [−0.33, 1], where −0.33 means the charges of both compounds have the same value but opposite loads, 0 means there is no overlapping, and 1 means the charges of both molecules are the same.
Literature approach: The LBVS method guided by molecular shape (LBVS-S) This method bases its predictions on a previous pre-filtering process consisting of identifying the N candidate compounds from the database with the highest shape similarity. After that, for each selected compound, the electrostatic similarity is calculated at the optimum superimposition obtained in the previous stage. Finally, the molecule with the highest electrostatic similarity value is selected as the one for the solution.
In this work, we have used the tool ROCS 28 to optimize the shape similarity between two molecules. ROCS is a parametrized piece of software used to maximize volume overlapping similarity and utilizes the previously described Equation 3 to represent molecules by means of Gaussian functions. 33,34 Electrostatic similarity has been calculated using the ZAP Toolkit (see Equation 6). This software has been downloaded without modification from. 35 It is worth mentioning that ROCS and ZAP are, by far, the most widely used tools in the literature for VS based on shape and electrostatic similarity. [36][37][38][39][40] For this reason they have been selected as part of this study; i.e. a fair and complete study must be carried out by making a comparison with the state-of-the-art methods.

The new approach: A LBVS method guided by electrostatic similarity (LBVS-E)
Our main aim when using this approach is to obtain the compound(s) with the highest electrostatic similarity values. Thus, an optimization problem must be defined with this aim in mind. Broadly speaking, any tool, method or algorithm used will be better guided towards the optima if the objective function is a numerical model representing the real objective. Until now, the methods proposed were based on the idea that if two compounds have the same shape, they can probably have the same electrostatic properties. Consequently, they solve a shape similarity optimization problem, instead of focusing on the electrostatic similarity, which may be more useful from the drug discovery point of view.The new approach presented here is based on the idea that the scoring function used to guide the optimization method must be the one based on electrostatic similarity, and not any other.
If not, the search may converge to a sub-optimal solution. [41][42][43] To prove our hypothesis, OptiPharm, 13 a recently proposed tool which implements an evolutionary optimization algorithm specifically designed to work on LBVS problems, is used. It is not limited to a particular objective function; on the contrary, OptiPharm admits any scoring function. In this work, to be sure that we are comparing in the same terms and equal conditions, the electrostatic similarity between two compounds has been computed by using the source code of the ZAP Toolkit, also downloaded from https://docs.eyesopen.com/toolkits/python/zaptk/thewayofzap.html. 35 We have only included a slight modification related to the function that reads compounds. Specifically, the method open belonging to oemolstream ifs has been replaced by openstring. The idea behind this change is to reduce the computing time, i.e. open involves writing to and reading from a local file while openstring allows reading from a c++ string directly.

Benchmarks
In this work, a database provided by The Food and Drug Administration has been used (FDA). The Food and Drug Administration is a federal agency of the United States Department of Health and Human Services responsible for protecting and promoting public health by controlling, among other things, prescription and overthe-counter pharmaceutical drugs (medications). This agency provides a data set containing 1751 compounds, which represents approved medicines that can be safely used on humans in the USA. This database is useful since in the high similarity cases it would directly contribute to drug re-purposing. This is of relevant utility given the clear trend regarding re-purposing drugs observed over the last 5 years. [44][45][46] The version of the database used in this work was obtained from DrugBank v5.0.1 47 and necessary mol2 files for the VS calculations were set up by using Am-berTools 48 by removing salts and neutralizing their protonation state, computing partial charges by MMFF94 force field, adding hydrogen atoms and minimizing energies (default parameters). 49 A comprehensive computational analysis may cover a representative sample of the database. The compounds included in the FDA database have different attributes, one of the most relevant for the study at hand being the number of atoms. In this work, a selection of 50 compounds has been made in the following way: the compounds in the database have been sorted by the number of atoms, including hydrogen, and then divided into 24 intervals (see Figure 1). From each sector, at least one compound was chosen at random and proportional to the number of compounds in the sector.

Results
Influence of the size list of top-ranked compounds in the LBVS-S method.
As previously mentioned, the LBVS method guided by molecular shape, bases its predictions on pre-filtering of the N best compounds in terms of superimposition score. In this subsection, a study has been conducted to know how the value of N affects the final results from the point of view of electrostatic similarity. In particular, the LBVS-S has been performed on the selected 50 queries and for five different values of N , i.e. N has been set to 175, 438, 876, 1313 and 1751. Figure 2 illustrates the main steps of the LBVS-S method for the Query DB01213 and N = 1751, i.e. the total number of compounds in the FDA set. Initially, the Query is compared to each compound T arget S from the database to obtain their optimum position and corresponding shape similarity value T c S . As previously mentioned, this stage is carried out by using ROCS. Afterwards, compounds are sorted (Rk S ) in decreasing order by T c S . The N best compounds are selected and evaluated to measure the corresponding electrostatic similarity value T c Eval E . Notice that the evaluation of the electrostatic similarity considers the pose obtained with the shape similarity optimization. The compound with the highest T c Eval E , called BestComp throughout this paper, is selected as the best prediction. Finally, as an additional and unconsidered stage in the LBVS-S method, we have computed the optimized superposition between the BestComp and the Query by using OptiPharm. The corresponding T c E value is then provided.
To get an overview of the results, average values of the BestComp found for the 50 queries and each value for N have been computed, and shown in Table 1. In particular, the average position Av(Rk S ) in the sorted list where the BestComp were located have been computed, together with the following: their mean number of atoms Av(N S ), their average shape similarity value Av(T c S ), their corresponding electrostatic similarity value Av(T c Eval E ) when they are evaluated and finally their mean electrostatic similarity when they are optimized Av(T c E ).
As can be seen, the predictions seem to improve in term of electrostatic similarity as the number N of selected molecules in the sorted list increases (see columns Av(T c Eval E ) and Av(T c E )). In accordance with these results, the posterior comparison between LBVS-S and LBVS-E methods has been carried out by setting N = 1751. ) + +,-..

)
Optimizing shape similarity of the candidate compounds.
Selecting the N best candidates and evaluating Electrostatic similarity.
) + Selecting the compound with the highest electrostatic similarity.
) ) ) ) +  Figure 2: Performance of the LBVS-S method for a particular case where Query = DB01213 and N = 1751 using the FDA database. For each value of N , the following average values from the 50 queries, are shown: position in the shape ranking (Av(Rk S )), number of atoms (Av(N S )), shape similarity score (Av(T c S )) , electrostatic similarity evaluation score (Av(T c Eval E )) and electrostatic optimized similarity value (T c E ).

Performance comparison between LBVS-S and LBVS-E methods.
To analyze the performance of both methods, we have conducted a study in which the selected 50 molecular queries are processed with reference to the FDA database.

DB03255
Selecting the compound with the highest electrostatic similarity.

Figure 3: Performance of the LBVS-E method for a particular case where
Query=DB01213 is compared to the FDA database.
Notice that comparing a query with itself always reaches the maximum similarity value, both for electrostatic potential as well as for shape. Subsequently, these results were removed when ranking the compounds. In other words, the compounds given as a result are not the most similar ones, but the second compounds in the ranked list. Additionally, as previously mentioned, the traditional method has been carried out considering the total number of compounds in the database N = 1751, so as to increase the probability of finding better predictions.
To illustrate how we generate the later summarizing tables, a sample of the results obtained by both methods when comparing a query to the molecules in the dataset is studied. In particular, the instance Query = DB01213 is analyzed. Notice that this is the example used to illustrate the stages of the LBVS-S method in Figure  2. After that, the same instance is considered to exemplify the performance of the LBVS-E method (see Figure 3). Notice that this Query has been selected because it is small and it helps to see the main ideas of the paper very easily by using figures. However, the conclusions inferred from the associated results can be extrapolated to any other Query. As can be observed, the LBVS-E technique solves an optimization problem to determine the electrostatic similarity, T c E , between the pharmaceutical Query and every T arget E in the database. Afterwards, the list of compounds is sorted by the T c E value and the one located in first position, Rk E = 1, is selected as the best prediction. Finally, to complete the study, optimization is carried out to calculate the shape similarity T c S between the chosen compound and the Query. Once the specific case of DB01213 has been explained in detail, the results of the 50 queries have been summarized in Table 3. Columns Rk Eval E and Rk E have been removed in this table because their values are always 1. The last row summarizes the average of the results.
As evidenced, LBVS-E obtains on average T c E = 0.738, which is higher than that given by LBVS-S, T c Eval E = 0.497. Similar conclusions can be inferred when comparing the T c E average values for both methods. Additionally, when the results are analyzed individually, we can see that LBVS-E provides solutions with higher T c E values than those achieved by LBVS-S. In fact, in 48 out of 50 cases, LBVS-E obtains a different compound than that reached by LBVS-S.
Regarding shape similarity, it is possible to infer that, on average, the methods are equivalent in terms of accuracy of the predictions, i.e. LBVS-S obtains an average value of T c s = 0.554 while LBVS-E reaches a mean value of T c s = 0.505. Furthermore, analyzing the obtained results individually, we can see that in 2 out of 50 cases, LBVS-E offers better or equivalent predictions than that achieved by Table 2: Summary of the results obtained for both LBVS-S and LBVS-E methods for the query compound DB01213. The column notation, the colors included and the corresponding results come from Figures 2 and 3, i.e. they maintain the same meaning as shown previously for those pictures. The last row indicates the results associated with the top solution selected for each method. LBVS-S in terms of shape (see columns T c s in LBVS-S and T C Eval s in LBVS-E). It means that cases exist where two compounds can be very similar in terms of electrostatic potential, although they can be very different in terms of three-dimensional shape. It means that those solutions could not be obtained by using the methodology followed by the traditional LBVS-S method, since it only focuses on the compounds with the highest similarity in shape.
Making a somewhat more detailed approach for compounds smaller than 50 atoms, which means the first 23 query compounds in the table, there are 5 cases where the difference is less than 0.05 (DB00529, DB00173, DB00331, DB00915 and DB01352) and in another 3 cases the difference is 0.1 (DB01043, DB07615 and DB01268). Considering the values of these 7 cases in which the shape LBVS-E is smaller than that of LBVS-S, the average difference is 0.048, while the mean gain in electrostatic similarity for those 7 compounds is 0.271. In large compounds, which includes 27 queries, there are only two cases with similar characteristics, which are compounds DB09236 with a difference of 0.07 and DB06699 with a difference of 0.013, both of them for shape similarity. In view of these results, the LBVS-E method seems to be justified when proposing new solutions for small compounds.
However, not all the improvements are related to electrostatic fields. The optimization of electrostatic potential using OptiPharm might allow a better solution to be found in terms of shape too. Compounds DB01119 and DB1213 in Table 3 are some outstanding examples. For example, in the case of Query = DB01119, the best compound found by LBVS-S is DB00828 with T c S = 0.655 and T c Eval E = 0.519. Moreover, LBVS-E's best compound is DB00173. It has a better T c E , i.e. 0.829, but also the position of those compounds after the electrostatic optimization is improved, T c Eval S = 0.779. Table 3: Rows are sorted by the number of atoms of queries. For each query, the same procedure explained in Table 2 is followed. The last row summarizes the average values for each column.

ZAP Toolkit accuracy problem
The ZAP Toolkit has been widely used in the literature to calculate the electrostatic similarity score for two compounds. 2,[15][16][17][18][19][20][21][22][23][24][25][26][27]50 In this subsection we would like to remark that the ZAP Toolkit can return an erroneous value, which was discovered when using Optipharm. During the optimization procedure, Optipharm can progressively separate two input compounds aimed to escape from local optima and explore the searching space in depth. In fact, it is possible to analyze cases where no overlap exists between the input molecules. During the analysis of the results, we discovered that cases exist where the ZAP Toolkit can overflow, mainly when situations such as the previously mentioned happen. See Figure 5 to see a particular example. Herein, compound DB01365 remains fixed on the left while compound DB00459 occupies three positions (red, blue and pink). The red compound obtains an electrostatic similarity value of 1. The light blue compound is displaced half a unit to the left, i.e., closer to the reference compound and its similarity value is 0.38. The pink compound is shifted 0.5 units to the right, that is, away from the reference compound. Its similarity value is 0. Calculations can be made using the Zap python script available at https://docs.eyesopen.com/toolkits/python/zaptk/thewayofzap.html in the Electrostatic Similarity section.
This problem has been solved in OptiPharm by considering the poses with the previously mentioned problem unfeasible. It means that they are no longer considered during the optimization process.

Conclusions
In this work, a new approach to solve the LBVS problem based on the electrostatic similarity has been put forward. More specifically, the software OptiPharm, recently proposed in the literature, has been used as alternative to the current state-of-theart method. OptiPharm implements an evolutionary optimization algorithm able to include as objective, any scoring function. As such, it can directly solve the electro- static similarity optimization problem by only including as an objective a function measuring such a descriptor. The method resulting from joining OptiPharm with the electrostatic similarity objective has been called LVBS-E. Conversely, the method proposed in the literature, which has been named LBVS-S throughout the paper, looks for a sublist of the top compounds with the highest shape similarity by using ROCS, to later evaluate their electrostatic similarity with ZAP. In this work, a study to analyze the influence of the number of compounds in such a sublist has been carried out. As the results have shown, the larger the number of molecules considered, the better the prediction obtained in terms of electrostatic similarity. From this conclusion, a computational study has been carried out to compare the new method LBVS-E with the one in the literature LBVS-S. To increase the probability of finding good predictions, LBVS-S has been executed taking into account the whole database prior to the electrostatic similarity evaluation. Even so, LBVS-E performs better than LBVS-S, achieving better predictions in electrostatic potential for the 50 queries included in the study. Regarding the shape similarity, both methods behave in a similar fashion, on average obtaining compounds with similar shape similarity values. It is important to mention that the new methodology proposed in this paper is novel, which means that the predictions proposed have not been analyzed previously.
Finally, we have shown that ZAP can return erroneous values. This is an important discovery, since it is the most commonly used software in the literature to measure the electrostatic similarity.
In the future, we have plans to implement this objective function from scratch, but for the study at hand, we considered that it was more important to compare it with the state-of-the-art software. Please note that, this implementation will also decrease the execution time in OptiPharm, since from a computational point of view, integrating a function into a program is a very time consuming task. Additionally, other functions measuring the pharmacophore similarity will be implemented. Additionally, we will analyze the problem from a multi-objective perspective, where shape an electrostatic similarity are optimized simultaneously.