Computational evaluation of some compounds as potential anti-breast cancer agents

The emergence of resistance to, and the toxicity of, existing anti-breast cancer drugs has created the need to design new drugs with improved activity against breast cancer. A computational approach combining quantitative structure–activity relationship (QSAR) modelling and virtual template-based design was used to evaluate thirty-four compounds, derivatives of thiophene, pyrimidine, coumarin, pyrazole and pyridine, with anti-breast cancer activities. The chemical structures of the compounds were drawn with ChemDraw v.12.0.2 and optimized using the Spartan 14 software. The molecular descriptors were calculated with the PaDEL descriptor software. The dataset was curated and then divided into a training set and a test set, which were used to generate and validate the model, respectively. The first of the four models generated was chosen as the best model, with validation statistics of $R^{2}$ = 0.9847, $R_{\text{adj}}^{2}$ = 0.9814, $Q_{\text{cv}}^{2}$ = 0.9763, min expt. error for non-significant LOF (95%) = 0.0679, an external validation $R_{\text{test}}^{2}$ of 0.8240 and a coefficient of Y-randomization ($cR_{p}^{2}$) of 0.8200, which confirm the robustness of the model. The high predictive power of the generated model supports its reliability, and among the designed compounds, compound 2 with pGI50 = 4.2504 was identified as the best candidate to inhibit breast cancer, compared with its co-designed compounds and the template. The results of this research provide vital information to pharmaceutical chemists and pharmacologists in the course of developing new breast cancer drugs.


Background
Cancer is a term used to describe the abnormal growth of cells, and it is one of the most dangerous health problems for humans worldwide [1]. Despite the availability of improved drugs for cancer therapy, the worldwide cancer burden remains high, with an estimated 19.3 million new cancer cases and nearly 10 million cancer deaths in the year 2020 [2].
Breast cancer is the most common cancer among women worldwide, and mortality from breast cancer is commonly due to tumour metastasis [3]. It constitutes a major public health issue globally, with over 1 million new cases diagnosed annually, resulting in over 400,000 annual deaths and about 4.4 million women living with the disease [4]. The mortality rate of breast cancer among Nigerian women is about 16% [5].
Amino-thiophene derivatives are known to be one of the most important groups of heterocyclic compounds, with a wide spectrum of biological activities such as antitumor [6], anti-mitotic [7] and antiviral [8] activity. Furthermore, thieno[2,3-d]pyrimidine derivatives show anti-proliferative activity [9], while pyrazole derivatives have a specific effect with favourable antitumor activity [10]. The coumarin scaffold has become an attractive subject owing to its broad spectrum of pharmacological activities; its derivatives are extensively explored for anticancer activity as they possess minimal side effects along with multi-drug reversal activity [11]. Many pyridine derivatives have been synthesized as potentially biologically active compounds with a multitude of pharmacological characteristics, in particular anti-cancer activity [12][13][14].
Idris et al. Futur J Pharm Sci (2021) 7:167
Quantitative structure–activity relationship (QSAR) modelling is one of the most commonly used computational methods for predicting the activities and properties of molecules in drug design, as it saves time and cost [15]. Generating a good QSAR model depends on factors such as the quality of the biological data, the choice of descriptors, variable selection, statistical methods and validation.
The aim of this research is to develop a good QSAR model for predicting the activity of some selected compounds against breast cancer and also design new compounds with better activities against breast cancer.

Data collection
The dataset used in this work was collected from the literature [16], and the activities were reported as fifty percent growth inhibition (GI50) concentrations in mmol L−1. These reported inhibitory activities were converted to a logarithmic scale (pGI50) to obtain a well-defined range, using Eq. (1) shown below.
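The conversion to the logarithmic scale can be illustrated with a minimal Python sketch. It assumes the relation pGI50 = −log10(GI50 × 10−3), with GI50 in mmol L−1; the function name `pgi50` and the input values are hypothetical and are not taken from the dataset.

```python
import math


def pgi50(gi50_mmol_per_l: float) -> float:
    """Convert a GI50 given in mmol/L to pGI50 = -log10(GI50 * 1e-3)."""
    if gi50_mmol_per_l <= 0:
        raise ValueError("GI50 must be a positive concentration")
    # The factor 1e-3 converts mmol/L to mol/L before taking the logarithm.
    return -math.log10(gi50_mmol_per_l * 1e-3)


# Hypothetical GI50 values in mmol/L, for illustration only.
example_gi50 = [0.09, 0.5, 1.2]
example_pgi50 = [round(pgi50(value), 4) for value in example_gi50]
```

Lower GI50 concentrations map to higher pGI50 values, so a more potent compound has a larger pGI50.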

Compounds sketching, optimization and descriptors calculations
The two-dimensional (2D) structures of the compounds were sketched using ChemDraw software version 12.0.2 [17]. They were then imported into the Spartan 14 V.1.1.4 software to obtain the optimized three-dimensional (3D) conformers at the Density Functional Theory (DFT) level, applying the B3LYP functional with the 6-31G* basis set [18]. The optimized compounds in Spartan format were converted to SD file format and later imported into the PaDEL software to calculate the molecular descriptors.

Dataset normalization and pre-treatment
To give each descriptor an equal weight, the descriptor values were normalized using Eq. (2) [19]. The normalized data were then pre-treated using the data pretreatment software obtained from the Drug Theoretics and Cheminformatics Laboratory (DTC Lab) to remove all empty columns and uninformative descriptors [20].
(1) $pGI_{50} = -\log_{10}(GI_{50} \times 10^{-3})$
(2) $X = \frac{X_{i} - X_{min}}{X_{max} - X_{min}}$
where $X_{i}$ in Eq. (2) is the value of each descriptor for a given molecule, and $X_{max}$ and $X_{min}$ are the maximum and minimum values for each column of descriptors X, respectively.
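The normalization step can be sketched in Python as follows. This is a minimal illustration of min-max scaling, (X_i − X_min)/(X_max − X_min), applied column by column; the descriptor names and values are hypothetical. Skipping constant columns is an assumption made here to avoid division by zero and does not describe how the DTC Lab tool works.

```python
def min_max_normalize(table):
    """Scale every descriptor column to the range [0, 1].

    `table` maps a descriptor name to its list of values (one per compound).
    Constant columns are skipped because (max - min) would be zero.
    """
    scaled = {}
    for name, values in table.items():
        lo, hi = min(values), max(values)
        if hi == lo:
            continue  # assumption: a constant descriptor carries no information
        scaled[name] = [(v - lo) / (hi - lo) for v in values]
    return scaled


# Hypothetical descriptor table: three compounds, three descriptors.
raw = {
    "desc_a": [1.0, 3.0, 5.0],
    "desc_b": [0.2, 0.2, 0.2],
    "desc_c": [-4.0, 0.0, 4.0],
}
normalized = min_max_normalize(raw)
```

After scaling, every retained descriptor spans the same range, so no descriptor dominates the regression merely because of its units.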

Model generation and validation
In order to generate a good QSAR model, the pre-treated dataset was divided into a training set and a test set in the ratio 7:3 by means of the data division software of DTC Lab [20]. The model was built using the training set, employing the Genetic Function Approximation and Multiple Linear Regression (GFA-MLR) method in the Materials Studio software. The test set was then used to validate the built model [21]. The fitness of the generated model was assessed using the lack of fit (LOF) [22], as in Eq. (3).
(3) $LOF = \frac{SEE}{\left(1 - \frac{C + d \times P}{M}\right)^{2}}$
where SEE is the standard error of estimation, C is the number of terms in the model, d is a user-defined smoothing parameter, P is the total number of descriptors in the model and M is the number of compounds in the training set. SEE can be expressed as:
(4) $SEE = \sqrt{\frac{\sum (Y_{exp} - Y_{pred})^{2}}{M - P - 1}}$
where $Y_{exp}$ and $Y_{pred}$ are the experimental activity and the predicted activity in the training set, respectively [22].
The squared correlation coefficient ($R^{2}$) measures the agreement between the predicted and experimental activities. The closer $R^{2}$ is to 1, the better the model fits the data. $R^{2}$ is expressed as:
(5) $R^{2} = 1 - \frac{\sum (Y_{exp} - Y_{pred})^{2}}{\sum (Y_{exp} - \bar{Y}_{training})^{2}}$
where $Y_{exp}$, $Y_{pred}$ and $\bar{Y}_{training}$ are, respectively, the experimental activity, the predicted activity and the mean experimental activity of the samples in the training set. The validity of the model cannot be based on $R^{2}$ alone, since $R^{2}$ never decreases as descriptors are added; the adjusted $R^{2}$ therefore gives a more reliable measure. The adjusted $R^{2}$ is given by:
(6) $R_{adj}^{2} = 1 - \frac{(1 - R^{2})(n - 1)}{n - d - 1}$
where d here is the number of descriptors in the model and n is the number of training set compounds.
The predictive power of the model is usually determined by the cross-validation coefficient ($Q_{cv}^{2}$) and the external validation test ($R_{test}^{2}$), as expressed in Eqs. (7) and (8), respectively.
(7) $Q_{cv}^{2} = 1 - \frac{\sum (Y_{pred} - Y_{exp})^{2}}{\sum (Y_{exp} - \bar{Y}_{training})^{2}}$
(8) $R_{test}^{2} = 1 - \frac{\sum (Y_{pred_{test}} - Y_{exp_{test}})^{2}}{\sum (Y_{exp_{test}} - \bar{Y}_{training})^{2}}$
where, in Eq. (7), $Y_{pred}$ is the cross-validated prediction for a training set compound and, in Eq. (8), $Y_{pred_{test}}$ is the predicted activity and $Y_{exp_{test}}$ is the experimental activity of the test set, while $\bar{Y}_{training}$ is the mean activity of the training set [21].
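The validation statistics described above can be sketched in Python. This is an illustrative implementation under stated assumptions, not the Materials Studio procedure: it uses the standard forms R² = 1 − SS_res/SS_tot, adjusted R² = 1 − (1 − R²)(n − 1)/(n − d − 1), SEE with n − P − 1 degrees of freedom, and the Friedman-type LOF = SEE/(1 − (C + d·P)/M)². All function names and data values are hypothetical.

```python
import math


def r_squared(y_exp, y_pred, y_mean=None):
    """1 - SS_res / SS_tot; y_mean defaults to the mean of y_exp."""
    if y_mean is None:
        y_mean = sum(y_exp) / len(y_exp)
    ss_res = sum((e - p) ** 2 for e, p in zip(y_exp, y_pred))
    ss_tot = sum((e - y_mean) ** 2 for e in y_exp)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n, d):
    """Penalize R^2 for the number of descriptors d given n training compounds."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - d - 1)


def standard_error_of_estimation(y_exp, y_pred, n_descriptors):
    """Square root of SS_res divided by (n - P - 1)."""
    ss_res = sum((e - p) ** 2 for e, p in zip(y_exp, y_pred))
    return math.sqrt(ss_res / (len(y_exp) - n_descriptors - 1))


def lack_of_fit(see, c, d, p, m):
    """LOF = SEE / (1 - (C + d * P) / M) ** 2."""
    return see / (1.0 - (c + d * p) / m) ** 2


# Hypothetical activities, for illustration only.
y_train_exp = [1.0, 2.0, 3.0, 4.0]
y_train_pred = [1.1, 1.9, 3.2, 3.8]
y_test_exp = [2.0, 3.0]
y_test_pred = [2.1, 2.9]

train_mean = sum(y_train_exp) / len(y_train_exp)
r2_train = r_squared(y_train_exp, y_train_pred)
r2_adj = adjusted_r_squared(r2_train, n=len(y_train_exp), d=1)
# The external R^2 compares test residuals with deviations from the TRAINING mean.
r2_test = r_squared(y_test_exp, y_test_pred, y_mean=train_mean)
see = standard_error_of_estimation(y_train_exp, y_train_pred, n_descriptors=1)
lof = lack_of_fit(see, c=2, d=0.5, p=1, m=len(y_train_exp))
```

The cross-validated coefficient can be obtained with the same `r_squared` function by passing predictions made for each training compound by a model refitted without that compound.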

Y-randomization
Y-randomization is a validation test in which new models are built from datasets whose activity values have been randomly shuffled, in order to confirm that the original model was not obtained by chance. For a good model, the Y-randomization coefficient ($cR_{p}^{2}$) must be greater than 0.5, and it is expressed as:
(9) $cR_{p}^{2} = R \times \sqrt{R^{2} - (R_{r})^{2}}$
where $cR_{p}^{2}$ is the Y-randomization coefficient, R is the correlation coefficient of the original model and $R_{r}$ is the average 'R' of the random models [19].
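The Y-randomization coefficient can be sketched in Python as below, assuming the commonly used form cR²p = R × sqrt(R² − R_r²), where R_r is the average R of the random models. The helper `shuffled_activities` only shows how a randomized response vector could be produced; refitting a model on it is outside this sketch. All names and numbers are hypothetical.

```python
import math
import random


def shuffled_activities(y, seed):
    """Return a shuffled copy of the activity vector (descriptors stay fixed)."""
    y_shuffled = list(y)
    random.Random(seed).shuffle(y_shuffled)
    return y_shuffled


def c_r2_p(r2_model, r_random):
    """cR2p = R * sqrt(R^2 - R_r^2), with R_r the mean R of the random models."""
    r_r = sum(r_random) / len(r_random)
    if r2_model < r_r ** 2:
        raise ValueError("random models fit better than the original model")
    return math.sqrt(r2_model) * math.sqrt(r2_model - r_r ** 2)


# Hypothetical values: R^2 of the original model and R of three random models.
coefficient = c_r2_p(0.81, [0.2, 0.3, 0.4])
model_passes = coefficient > 0.5
```

A value well above 0.5 indicates that models built on scrambled activities fit much worse than the original model.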

Applicability domain (AD)
The applicability domain is a theoretical region of the chemical space that is defined by the model descriptors, the model response and the nature of the training set. The leverage approach was employed to identify the compounds that lie within the AD [23]; any compound that lies outside the AD is treated as an outlier. The leverage is calculated using Eq. (10).
(10) $l_{i} = X_{i}(X^{T}X)^{-1}X_{i}^{T}$
where $l_{i}$ is the leverage of each compound, $X_{i}$ is the descriptor row-vector of the query compound i, and X is the (m × n) descriptor matrix of the training set compounds used in building the model. The critical value ($l^{*}$) is defined by Eq. (11).
(11) $l^{*} = \frac{3(p + 1)}{n}$
where p is the number of descriptors in the model and n is the number of objects used to develop the model.
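The leverage calculation can be sketched in pure Python, assuming the usual definitions l_i = X_i (XᵀX)⁻¹ X_iᵀ and l* = 3(p + 1)/n. The small descriptor matrix is hypothetical, and the value n = 25 in the last line is an assumption chosen for illustration; the text does not state the training set size.

```python
def solve_linear(a, b):
    """Solve a z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def leverage(x_train, x_query):
    """Leverage of one query row: x (X^T X)^-1 x^T."""
    k = len(x_query)
    xtx = [[sum(row[i] * row[j] for row in x_train) for j in range(k)]
           for i in range(k)]
    z = solve_linear(xtx, list(x_query))  # z = (X^T X)^-1 x^T
    return sum(xi * zi for xi, zi in zip(x_query, z))


def critical_leverage(p, n):
    """Warning leverage l* = 3 (p + 1) / n."""
    return 3.0 * (p + 1) / n


# Hypothetical training matrix: three compounds, two descriptors.
x_train = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
train_leverages = [leverage(x_train, row) for row in x_train]
l_star = critical_leverage(p=4, n=25)  # n = 25 is an assumed training set size
```

A compound whose leverage exceeds `l_star` would be flagged as lying outside the applicability domain.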

Mean effect (ME) and variance inflation factor (VIF)
The mean effect is used to elucidate the relative importance of each descriptor in the model, while the VIF is used to assess the multicollinearity among the descriptors in the model. A VIF value of 1 indicates no intercorrelation among the descriptors, whereas a value above 10 indicates an unstable model. The ME and VIF are calculated using Eqs. (12) and (13), respectively.
(12) $ME_{j} = \frac{B_{j}\sum_{i=1}^{n} D_{ij}}{\sum_{j=1}^{m}\left(B_{j}\sum_{i=1}^{n} D_{ij}\right)}$
where $B_{j}$ is the coefficient of descriptor j in the model, $D_{ij}$ is the value of descriptor j in the data matrix for training set compound i, and m and n are, respectively, the number of descriptors in the model and the number of training set compounds.
(13) $VIF = \frac{1}{1 - R^{2}}$
where $R^{2}$ is the multiple regression correlation coefficient obtained by regressing one descriptor against the other descriptors in the model [24].
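The two quantities can be sketched in Python as follows, assuming ME_j = B_j ΣD_ij / Σ_j(B_j ΣD_ij) and VIF = 1/(1 − R²), where R² comes from regressing one descriptor on the others. The coefficients and descriptor values are hypothetical, and the R² passed to `vif` is supplied directly rather than computed from a regression.

```python
def mean_effects(coefficients, descriptor_rows):
    """Mean effect of each descriptor; the values sum to 1 by construction."""
    m = len(coefficients)
    column_sums = [sum(row[j] for row in descriptor_rows) for j in range(m)]
    weighted = [b * s for b, s in zip(coefficients, column_sums)]
    total = sum(weighted)
    return [w / total for w in weighted]


def vif(r2_j):
    """Variance inflation factor from the R^2 of descriptor j vs. the others."""
    return 1.0 / (1.0 - r2_j)


# Hypothetical model: two descriptors, two training compounds.
me_values = mean_effects([2.0, -1.0], [[1.0, 1.0], [2.0, 3.0]])
vif_uncorrelated = vif(0.0)
vif_moderate = vif(0.5)
```

The sign of a mean effect shows whether increasing the descriptor raises or lowers the predicted activity, and its magnitude shows the relative weight of that descriptor.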

Molecular design
An in silico template-based design approach was employed to design new compounds with enhanced activity against breast cancer. This method has been used frequently to screen and model compounds with improved activity by relating the experimental activities of the compounds to their structures [25]. Hence, the compound with the highest activity was defined as the template for designing new compounds with enhanced activities.

Results
All the tables and figures that describe the outcome of the built model and the designed compounds are presented in this section.

Discussion
All thirty-four compounds used in this study were first sketched in ChemDraw to obtain their 2D structures, which were then imported into the Spartan 14 software to obtain their optimized 3D structures. The molecular descriptors of the optimized structures were calculated with the PaDEL descriptor software, and the resulting dataset was normalized and pre-treated. A total of 1874 molecular descriptors, which encode the important features of the structures, were calculated. The 2D structures and activities of the studied compounds are presented in Table 1.
The Genetic Function Approximation (GFA) was used to generate four models. The first of the four was selected as the optimum model since it best satisfies the minimum criteria for a good QSAR model, reported in Table 2. Table 3 displays the validation parameters for the generated models. Table 4 presents the Y-randomization test used to affirm the strength of the model. This test was carried out on the training set by keeping the independent variables (descriptors) constant and randomizing the dependent variable (activity). The low values of R, $R^{2}$ and $Q^{2}$ for the randomized models indicate the robustness of the generated model, and the coefficient of Y-randomization ($cR_{p}^{2}$ = 0.8200) confirmed that the generated model was not obtained by chance.
Table 5 displays the correlation matrix, VIF and ME of the four descriptors used to build the model. The low values of the Pearson correlation coefficients indicate that there is no significant correlation between the descriptors, which means that each descriptor contributes different information to the model. The multicollinearity among the descriptors was assessed with the Variance Inflation Factor (VIF); since the VIF values were all less than 2, the descriptors in the model were appropriately selected and the model is statistically satisfactory [24]. Meanwhile, the descriptor maxHBd, with the highest positive ME value, has the greatest influence on the predicted activity; as such, it was made the focal point when designing new, enhanced compounds. The descriptor maxHBd denotes the maximum E-state for (strong) hydrogen bond donors.
Descriptive analysis was carried out to confirm that the dataset was well divided into the training set and the test set. Table 6 shows that the maximum, minimum and standard deviation values of the training and test sets are very close, suggesting no significant difference between them. As a result, we deduce that the training set is representative of the test set, which confirms the suitability of the Kennard–Stone method employed in the data division. Table 7 presents the details of the descriptors used to build the model. The first two descriptors are 2D and the last two are 3D. Of the model equations generated by the Materials Studio software, Model 1 was the best when compared against the standard validation parameters for a good QSAR model in Table 2. The difference between the predicted activity and the reported activity is the residual activity, which is presented in Table 8. The low residual values indicate that the predicted activities are close to the experimental activities, which accounts for the high predictive power of the model.
Figures 1 and 2 show the plots of experimental activity against predicted activity for the training and test sets, respectively. The $R^{2}$ values of the two plots are satisfactory when compared with the recommended $R^{2}$ value of a good QSAR model reported in Table 2. The plot of standardized residuals versus experimental activity in Fig. 3 was used to check for any systematic error in the built model. The model was found to be free of systematic error since all its standardized residuals lie within ±2 units. Figure 4 shows the Williams plot, which helps to identify compounds that are either influential or outliers. Four compounds were found to be outliers because their leverage values were greater than the critical leverage ($l^{*}$ = 0.6), and those compounds were not considered when designing new anti-breast cancer agents.
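The Kennard–Stone division mentioned above can be illustrated with a minimal Python sketch of the textbook algorithm: start from the two most distant compounds in descriptor space, then repeatedly add the compound whose nearest already-selected neighbour is farthest away. This is an assumption about the general method and not a description of the DTC Lab implementation; the function name and data points are hypothetical.

```python
import math


def kennard_stone(points, n_train):
    """Return the indices of n_train points chosen by the Kennard-Stone rule."""
    n = len(points)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and the number of points")
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    # Seed the training set with the two most distant points.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_train:
        # Choose the candidate that is farthest from its nearest selected point.
        best = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical one-descriptor dataset with four compounds.
points = [[0.0], [1.0], [2.0], [10.0]]
train_idx = kennard_stone(points, n_train=3)
test_idx = [k for k in range(len(points)) if k not in train_idx]
```

Because the training set is picked to span the descriptor space, the compounds left for the test set tend to fall inside the region covered by the training set.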
In order to design more potent anti-breast cancer compounds, compound 10 (Fig. 5), with the highest reported activity (4.0458), was adopted as the template. The most influential descriptor, maxHBd (maximum E-state for hydrogen bond donors), with a mean effect of 0.8382, was investigated. To increase the hydrogen bond donor E-state, H-bond acceptors and strongly electronegative atoms (F, O and N) were attached at appropriate positions, which led to the design of six new compounds with enhanced 50% growth inhibitory activity, as displayed in Table 9.

Conclusion
This research has built a QSAR model with high predictive power using the descriptors maxHBd, GATS8c, TDB10p and RNCS. The Williams plot identified four compounds (outliers) that should not be considered for further computational study. The validation parameters of the model, as discussed above, all passed the minimum recommendations for a valid QSAR model. The descriptor maxHBd, with a positive mean effect value of 0.8382, was found to have the greatest influence on the optimum model and was the focal point in modifying the template to design six new compounds. Three of the six designed compounds were found to have pGI50 values (4.2118, 4.1688 and 4.2504) greater than those of the template and the rest of the designed compounds. In conclusion, the research aim was achieved, and the results of this work would serve as first-hand information to pharmaceutical chemists, pharmacists and pharmacologists in the course of producing new drugs against breast cancer.