Chemometrics and machine learning

We develop and apply chemometric and machine learning methods to address real problems in chemistry, toxicology, pharmacology and environmental sciences. Specific research interests in multivariate modelling include neural networks, variable selection, data fusion, ranking methods, supervised classification, correlation and information measures, and multicriteria decision making.
In this framework, several new methods have been proposed in the scientific literature: supervised pattern recognition and classification (CAIMAN, N3, BNN), neural networks (K-CM), variable reduction and selection (W-VSP, Reshaped Sequential Replacement), unsupervised data analysis (MADS, MOLMAP approach for 3D analytical data), similarity and distance measures (new similarity coefficients for binary data, Locally-centred Mahalanobis distance, higher-order similarity measures), correlation measures (Canonical Measure of Correlation) and multicriteria decision making (Weighted Power-Weakness Ratio).
On the applied side, applications of machine learning and chemometrics range from analytical profiles and signals (mainly related to environmental and food matrices) to molecular modelling (QSAR and virtual screening).
Visit this list for an overview of scientific publications related to new theoretical proposals and applications in chemometrics and machine learning.
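As a small illustration of what a similarity coefficient for binary data computes, the sketch below implements the classical Jaccard-Tanimoto coefficient in plain Python. It is a textbook measure, not one of the new coefficients proposed by the group; the function name and the convention used for two all-zero vectors are assumptions made for this example.

```python
def jaccard_tanimoto(x, y):
    """Jaccard-Tanimoto similarity between two equal-length binary vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # a: features present in both vectors
    a = sum(1 for xi, yi in zip(x, y) if xi and yi)
    # b: features present only in x; c: features present only in y
    b = sum(1 for xi, yi in zip(x, y) if xi and not yi)
    c = sum(1 for xi, yi in zip(x, y) if yi and not xi)
    if a + b + c == 0:
        # Assumption for this sketch: two all-zero vectors count as identical.
        return 1.0
    return a / (a + b + c)
```

For example, `jaccard_tanimoto([1, 1, 0, 0], [1, 0, 1, 0])` shares one feature out of three that occur in either vector, giving a similarity of 1/3.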

Molecular Descriptors

Molecular descriptors capture diverse parts of the structural information of molecules and underpin many contemporary computer-assisted toxicological and chemical applications. Since the beginning, Milano Chemometrics has studied and developed new theoretically-based molecular descriptors, such as WHIM (Weighted Holistic Invariant Molecular descriptors), G-WHIM (Grid-Weighted Holistic Invariant Molecular descriptors) and GETAWAY (GEometry, Topology and Atoms-Weighted AssemblY) descriptors, and has evaluated their ability to model different physico-chemical, biological and environmental responses. The DRAGON software was originally developed to calculate molecular descriptors.
Moreover, the second edition of the Handbook of Molecular Descriptors (Molecular Descriptors for Chemoinformatics by Roberto Todeschini and Viviana Consonni) has been published by Wiley-VCH. It is an encyclopedic collection of molecular descriptors from their origins onwards. About 3300 definitions, presented in alphabetical order, allow not only rapid consultation but also structured study of the algorithms, meanings and tables of the molecular descriptors, QSAR strategies and other related topics.
In this framework, the MOLE db – Molecular Descriptors Data Base has been released. This is a free online database comprising 1124 molecular descriptors calculated for 234773 molecules of the NCI database.

QSAR, QSPR and chemical modelling

QSAR models are currently regarded as a scientifically reliable tool for predicting and classifying properties of untested chemicals. QSARs are based on the assumption that the structure of a molecule (for example, its geometric, steric and electronic properties) must contain the features responsible for its physical, chemical and biological properties, and on the ability to capture these features in one or more numerical descriptors. Milano Chemometrics has been involved in several projects related to the proposal and use of QSAR for the REACH registration of chemicals, such as the study of the relationships between the molecular structures of dyes and their toxicological properties, and the use of in-silico models to develop a new, safe, multifunctional accelerator curative molecule that can replace thiourea-based accelerators in the vulcanisation process.
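The core QSAR idea, a numerical descriptor related to a measured property, can be sketched in its simplest form as an ordinary least-squares line fitted on a single descriptor. This is only a toy illustration under stated assumptions: the function names are hypothetical, the numbers in the usage note are invented for the example, and real QSAR models typically use many descriptors and more careful validation.

```python
def fit_one_descriptor_model(descriptor, response):
    """Ordinary least-squares fit of response = slope * descriptor + intercept."""
    n = len(descriptor)
    if n != len(response) or n < 2:
        raise ValueError("need two or more paired observations")
    mean_x = sum(descriptor) / n
    mean_y = sum(response) / n
    # Sum of squares of the descriptor and cross-product with the response
    sxx = sum((x - mean_x) ** 2 for x in descriptor)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(descriptor, response))
    if sxx == 0:
        raise ValueError("descriptor is constant; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, descriptor_value):
    """Predict the property of an untested molecule from its descriptor value."""
    slope, intercept = model
    return slope * descriptor_value + intercept
```

With the invented training values `descriptor = [1.0, 2.0, 3.0, 4.0]` and `response = [3.0, 5.0, 7.0, 9.0]`, the fitted model has slope 2 and intercept 1, so `predict(model, 5.0)` gives 11.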
Milano Chemometrics has been involved in the development of new QSAR models addressed to the prediction of several properties (such as bioaccumulation, biodegradation and acute toxicity), which have been proposed in the literature. Research is also devoted to evaluating new and existing strategies to define the Applicability Domain of QSAR models, that is, the chemical domain where QSAR predictions can be assumed to be reliable. Finally, consensus modelling and data fusion of QSAR predictions are among the research topics considered.
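One of the simplest ways to make the Applicability Domain concept concrete is a range-based (bounding box) check: a new molecule is considered inside the domain only if each of its descriptor values falls within the range observed in the training set. The sketch below assumes this generic approach and uses hypothetical function names; it is not the specific strategy implemented in the group's software.

```python
def fit_bounding_box(training_descriptors):
    """Return the (min, max) range of each descriptor over the training set.

    training_descriptors: list of rows, one row of descriptor values per molecule.
    """
    if not training_descriptors:
        raise ValueError("training set is empty")
    columns = list(zip(*training_descriptors))
    return [(min(col), max(col)) for col in columns]


def in_domain(box, molecule_descriptors):
    """True if every descriptor value lies within its training range."""
    if len(box) != len(molecule_descriptors):
        raise ValueError("descriptor count does not match the training set")
    return all(lo <= v <= hi for (lo, hi), v in zip(box, molecule_descriptors))
```

With an invented two-descriptor training set `[[0.1, 10.0], [0.5, 12.0], [0.3, 15.0]]`, a molecule described by `[0.2, 11.0]` falls inside the box, while `[0.9, 11.0]` falls outside because its first descriptor exceeds the training maximum. Range-based checks ignore correlations between descriptors, which is one reason distance-based and other strategies are also studied.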


We develop and distribute, free of charge, software and toolboxes to calculate multivariate models (such as the PCA toolbox, the classification toolbox, and the Kohonen and CPANN toolbox), to assess the Applicability Domain of QSAR models and to perform virtual screening, as well as KNIME workflows. Moreover, benchmark QSAR datasets are available for download. See the download page for further details.