Data collection
The dataset used in this work was collected from the literature [16], where the activities were reported as fifty percent growth inhibition (GI50) concentrations in mmol L−1. These reported inhibitory activities were converted to a logarithmic scale (pGI50) to obtain a well-defined range, using Eq. (1) shown below.
$${\text{pGI}}_{50} = - \log_{10} ({\text{GI}}_{50} \times 10^{ - 3} )$$
(1)
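As a worked illustration of Eq. (1), the short Python sketch below converts a few assumed GI50 values (in mmol L−1, not taken from the dataset of [16]) to pGI50.

```python
import numpy as np

# Assumed GI50 values in mmol L-1 (illustrative only, not from [16])
gi50_mmol = np.array([0.05, 0.12, 1.30])

# Eq. (1): convert mmol L-1 to mol L-1 (x 10^-3) and take the negative log10
pgi50 = -np.log10(gi50_mmol * 1e-3)
print(pgi50)  # 0.05 mmol L-1 -> pGI50 of about 4.30
```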
Compound sketching, optimization and descriptor calculation
The two-dimensional (2D) structures of the compounds were sketched using ChemDraw software version 12.0.2 [17] and imported into Spartan 14 V.1.1.4 software to obtain the optimized three-dimensional (3D) conformers at the Density Functional Theory (DFT) level, using the B3LYP functional with the 6-31G* basis set [18]. The optimized compounds were converted from Spartan format to SD file format and then imported into the PaDEL software to calculate the molecular descriptors used in the models.
Dataset normalization and pre-treatment
To give the descriptors an equal chance of influencing the model, the descriptor values were normalized using Eq. (2) [19]. The normalized data were then pre-treated using the data pre-treatment software obtained from the Drug Theoretics and Cheminformatics Laboratory (DTC Lab) to remove empty columns and uninformative descriptors [20].
$$X = \frac{{X_{i} - X_{\min } }}{{X_{\max } - X_{\min } }}$$
(2)
where \(X_{i}\) is the value of descriptor X for a given molecule, and \(X_{\max }\) and \(X_{\min }\) are the maximum and minimum values of that descriptor column, respectively.
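A minimal Python sketch of the column-wise min-max normalization in Eq. (2); the descriptor names and values are assumed for illustration, and the pre-treatment step mimics the removal of constant (uninformative) columns.

```python
import pandas as pd

# Illustrative descriptor matrix; the column names are hypothetical
descriptors = pd.DataFrame({
    "ATS0m": [12.4, 15.1, 9.8, 20.3],
    "nHBDon": [1, 3, 0, 2],
    "constCol": [7.0, 7.0, 7.0, 7.0],
})

# Pre-treatment step: drop constant columns, which carry no information
ranges = descriptors.max() - descriptors.min()
descriptors = descriptors.loc[:, ranges > 0]

# Eq. (2): scale every remaining descriptor column to the [0, 1] range
normalized = (descriptors - descriptors.min()) / (descriptors.max() - descriptors.min())
print(normalized)
```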
Model generation and validation
In order to generate a good QSAR model, the pre-treated dataset was divided into training and test sets in a 7:3 ratio by means of the data division software of DTC Lab [20]. The model was built on the training set using the Genetic Function Approximation-Multiple Linear Regression (GFA-MLR) method implemented in Material Studio, and the test set was then used to validate the built model [21]. The fitness of the generated model was assessed using the lack of fit (LOF) [22], as in Eq. (3).
$${\text{LOF}} = \frac{{{\text{SEE}}}}{{\left( {1 - \frac{C + d*P}{M}} \right)^{2} }}$$
(3)
where SEE is the Standard Error of Estimation, C is the number of terms in the model, d is a user-defined smoothing parameter, P is the number of descriptors in the model and M is the number of compounds in the training set. SEE can be expressed as:
$${\text{SEE}} = \sqrt {\frac{{\sum \left( {Y_{{{\text{exp}}}} - Y_{{{\text{pred}}}} } \right)^{2} }}{N - P - 1}}$$
(4)
where \(Y_{{{\text{exp}}}}\) and \(Y_{{{\text{pred}}}}\) are the experimental and predicted activities in the training set, N is the number of training compounds and P is the number of descriptors in the model [22].
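The sketch below computes SEE (Eq. 4) and LOF (Eq. 3) for a fitted model; the activity values, the number of model terms, and the smoothing parameter d = 0.5 are assumed for illustration only.

```python
import numpy as np

def see(y_exp, y_pred, n_descriptors):
    """Standard error of estimation, Eq. (4)."""
    n = len(y_exp)
    residual_ss = np.sum((np.asarray(y_exp) - np.asarray(y_pred)) ** 2)
    return np.sqrt(residual_ss / (n - n_descriptors - 1))

def lack_of_fit(y_exp, y_pred, n_terms, n_descriptors, d=0.5):
    """Friedman's lack of fit, Eq. (3); d is the user-defined smoothing parameter."""
    m = len(y_exp)
    return see(y_exp, y_pred, n_descriptors) / (1 - (n_terms + d * n_descriptors) / m) ** 2

# Hypothetical training-set activities and model predictions
y_exp = [4.30, 4.92, 5.10, 4.55, 5.30, 4.80]
y_pred = [4.41, 4.85, 5.02, 4.60, 5.18, 4.95]
print(lack_of_fit(y_exp, y_pred, n_terms=4, n_descriptors=3))
```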
The squared correlation coefficient (R2) is a validation metric that measures the agreement between the predicted and experimental activities; the closer R2 is to 1, the more robust the model. R2 is expressed as:
$$R^{2} = 1 - \left[ {\frac{{\sum (Y_{{{\text{exp}}}} - Y_{{{\text{pred}}}} )^{2} }}{{\sum (Y_{{{\text{exp}}}} - \overline{Y}_{{{\text{training}}}} )^{2} }}} \right]$$
(5)
where \(Y_{{{\text{exp}}}}\), \(Y_{{{\text{pred}}}}\) and \(\overline{Y}_{{{\text{training}}}}\) are, respectively, the experimental activity, the predicted activity, and the mean experimental activity of the samples in the training set. The validity of the model cannot be judged on R2 alone, because R2 increases as more descriptors are added to the model; the adjusted R2, which penalizes this increase, gives a more reliable measure. The adjusted R2 is given by:
$$R_{{{\text{adj}}}}^{2} = \frac{{(n - 1)R^{2} - d}}{n - d - 1}$$
(6)
where d is the number of descriptors in the model and n is the number of training set compounds. The predictive power of the model is usually determined by the cross-validation coefficient \((Q_{{{\text{cv}}}}^{2})\) and the external validation test, as expressed in Eqs. (7) and (8) respectively.
$$Q_{{{\text{cv}}}}^{2} = 1 - \left[ {\frac{{\sum (Y_{{{\text{exp}}}} - Y_{{{\text{pred}}}} )^{2} }}{{\sum (Y_{{{\text{exp}}}} - \overline{Y}_{{{\text{training}}}} )^{2} }}} \right]$$
(7)
$$R_{{{\text{test}}}}^{2} = 1 - \left[ {\frac{{\sum (Y_{{{\text{pred}}_{{{\text{test}}}} }} - Y_{{{\text{exp}}_{{{\text{test}}}} }} )^{2} }}{{\sum (Y_{{{\text{pred}}_{{{\text{test}}}} }} - \overline{Y}_{{{\text{training}}}} )^{2} }}} \right]$$
(8)
where \(Y_{{{\text{pred}}_{{{\text{test}}}} }}\) and \(Y_{{{\text{exp}}_{{{\text{test}}}} }}\) are the predicted and experimental activities of the test set compounds, and \(\overline{Y}_{{{\text{training}}}}\) is the mean activity of the training set [21].
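The validation statistics of Eqs. (5)-(8) can be computed as sketched below; for Eq. (7) the predictions are assumed to come from a leave-one-out cross-validation run, and all activity values shown are hypothetical.

```python
import numpy as np

def r_squared(y_exp, y_pred, y_train_mean):
    """Eqs. (5) and (7): squared correlation relative to the training-set mean."""
    y_exp, y_pred = np.asarray(y_exp), np.asarray(y_pred)
    return 1 - np.sum((y_exp - y_pred) ** 2) / np.sum((y_exp - y_train_mean) ** 2)

def adjusted_r_squared(r2, n, d):
    """Eq. (6): penalizes R^2 for the number of descriptors d."""
    return ((n - 1) * r2 - d) / (n - d - 1)

def r_squared_test(y_exp_test, y_pred_test, y_train_mean):
    """Eq. (8): external validation on the test set."""
    y_exp_test, y_pred_test = np.asarray(y_exp_test), np.asarray(y_pred_test)
    return 1 - np.sum((y_pred_test - y_exp_test) ** 2) / np.sum((y_pred_test - y_train_mean) ** 2)

# Hypothetical training and test activities
y_train_exp = [4.30, 4.92, 5.10, 4.55, 5.30]
y_train_pred = [4.41, 4.85, 5.02, 4.60, 5.18]
y_mean = np.mean(y_train_exp)

r2 = r_squared(y_train_exp, y_train_pred, y_mean)
print(r2, adjusted_r_squared(r2, n=len(y_train_exp), d=2))
print(r_squared_test([4.70, 5.05], [4.62, 5.15], y_mean))
```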
Y-randomization
Y-randomization is a validation test performed to confirm that the model is not the result of chance correlation. The activity values of the training set are randomly scrambled and new models are generated from the bogus datasets; these random models are expected to perform poorly. For a good model, the Y-randomization coefficient (\({\text{cR}}_{{\text{p}}}^{2}\)) must be greater than 0.5, and it is expressed as:
$${\text{cR}}_{{\text{p}}}^{2} = R\sqrt {R^{2} - (R_{{\text{r}}} )^{2} }$$
(9)
where \({\text{cR}}_{{\text{p}}}^{2}\) is the Y-randomization coefficient and \(R_{{\text{r}}}\) is the average correlation coefficient of the random models [19].
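A minimal Y-randomization sketch, in which an ordinary least-squares refit on scrambled activities stands in for the GFA-MLR random models; the descriptor matrix and activities are simulated for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Simulated training descriptors (rows = compounds) and activities
X = rng.normal(size=(20, 3))
y = X @ np.array([0.8, -0.5, 0.3]) + rng.normal(scale=0.1, size=20)

def pearson_r(model, X, y):
    """Correlation between model predictions and observed activities."""
    return np.corrcoef(model.predict(X), y)[0, 1]

r = pearson_r(LinearRegression().fit(X, y), X, y)  # R of the real model
r_random = np.mean([
    pearson_r(LinearRegression().fit(X, y_s), X, y_s)
    for y_s in (rng.permutation(y) for _ in range(50))
])

# Eq. (9): cR_p^2 should exceed 0.5 for a model that is not due to chance
c_rp2 = r * np.sqrt(r ** 2 - r_random ** 2)
print(c_rp2)
```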
Applicability domain (AD)
The applicability domain is the theoretical region of chemical space defined by the model descriptors, the modelled response and the nature of the training set. The leverage approach was employed to assess whether the data fall within the AD [23]; any compound lying outside the AD is treated as an outlier. The leverage of each compound is calculated using Eq. (10).
$$l_{i} = X_{i} (X^{{\text{T}}} X)^{ - 1} X_{i}^{{\text{T}}}$$
(10)
where \(l_{i}\) is the leverage of each compound, \(X_{i}\) is the descriptor row-vector of the query compound i, and X is the (m × n) descriptor matrix of the training set compounds used in building the model. The critical value (l*) is defined by Eq. (11).
$$l^{*} = 3\frac{p + 1}{n}$$
(11)
where p is the number of descriptors in the model and n is the number of objects used to develop the model.
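A sketch of the leverage calculation of Eqs. (10) and (11), assuming a hypothetical training descriptor matrix; compounds whose leverage exceeds the warning value l* would fall outside the AD.

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.normal(size=(25, 3))  # hypothetical n x p training descriptor matrix

# (X^T X)^-1 is computed once from the training descriptor matrix
xtx_inv = np.linalg.inv(X_train.T @ X_train)

def leverage(x_row):
    """Eq. (10): leverage of a single compound's descriptor row-vector."""
    return float(x_row @ xtx_inv @ x_row.T)

n, p = X_train.shape
h_star = 3 * (p + 1) / n  # Eq. (11): warning leverage l*

leverages = np.array([leverage(x) for x in X_train])
print(h_star, np.where(leverages > h_star)[0])  # indices of compounds outside the AD
```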
Mean effect (ME) and variance inflation factor (VIF)
The mean effect is used to elucidate the relative importance of each descriptor in the model, while the VIF is used to assess collinearity between the descriptors. A VIF value of 1 indicates no collinearity among the descriptors, whereas a value above 10 indicates severe multicollinearity and an unreliable model. The ME and VIF are calculated using Eqs. (12) and (13) respectively.
$${\text{ME}} = \frac{{B_{j} \mathop \sum \nolimits_{i}^{n} D_{j} }}{{\mathop \sum \nolimits_{j}^{m} \left( {B_{j} \mathop \sum \nolimits_{i}^{n} D_{j} } \right)}}$$
(12)
where \(B_{j}\) is the coefficient of descriptor j in the model, \(D_{j}\) is the value of that descriptor in the data matrix for each compound in the training set, and m and n are, respectively, the number of descriptors in the model and the number of molecules in the training set.
$${\text{VIF}} = \frac{1}{{1 - R^{2} }}$$
(13)
where R2 is the coefficient of determination obtained by regressing each descriptor against all the other descriptors in the model [24].
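A sketch of the mean effect (Eq. 12) and VIF (Eq. 13) calculations; the descriptor matrix and model coefficients are assumed, and each VIF is obtained by regressing one descriptor on the remaining ones.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
D = rng.normal(size=(25, 3)) + 5.0      # hypothetical training descriptor matrix
coeffs = np.array([0.42, -0.10, 0.75])  # hypothetical model coefficients B_j

# Eq. (12): mean effect of each descriptor in the model
column_sums = D.sum(axis=0)
mean_effect = coeffs * column_sums / np.sum(coeffs * column_sums)

# Eq. (13): VIF of descriptor j from regressing it on the other descriptors
def vif(D, j):
    others = np.delete(D, j, axis=1)
    r2 = LinearRegression().fit(others, D[:, j]).score(others, D[:, j])
    return 1.0 / (1.0 - r2)

print(mean_effect)
print([round(vif(D, j), 2) for j in range(D.shape[1])])
```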
Molecular design
An in-silico, template-based design approach was employed to design new compounds with enhanced activity against breast cancer. This method has frequently been used to screen and design compounds with improved activity by relating the experimental activities of the compounds to their structures [25]. Accordingly, the compound with the highest activity was selected as the template for designing new compounds with enhanced activities.