
Access to molecular databases

Data mining is an important step in preparing and/or analyzing screening data. ChemBioFrance offers a portal to databases developed and maintained in-house, enabling simple queries on your molecules or targets of interest.



These databases are managed individually by cheminformatics platforms, not directly by ChemBioFrance. Some can be downloaded; others must be requested from their authors. Details are given below:





Structure, bioactive metabolites and PK/PD data on FDA-approved drugs



Kinase inhibitors in clinical development

Bioinfo DB

Commercially-available drug-like compounds


'Druggable' binding sites of PDB proteins


Structure of protein-protein interfaces and their inhibitors


Structure and activity of protein-protein interface modulators


Peptides from prokaryotic genomes


Required Information






Douguet, D. (2018) Data Sets Representative of the Structures and Experimental Properties of FDA-Approved Drugs. ACS Med Chem Lett, 9: 204-209

Carles F, Bourg S, Meyer C, Bonnet P (2018) PKIDB: A Curated, Annotated and Updated Database of Protein Kinase Inhibitors in Clinical Trials. Molecules, 23, E908

Desaphy J, Bret G, Rognan D, Kellenberger E. (2015) sc-PDB: a 3D-database of ligandable binding sites--10 years on. Nucleic Acids Res., 43, D399-404

Basse MJ, Betzi S, Morelli X, Roche P. (2016) 2P2Idb v2: update of a structural database dedicated to orthosteric modulation of protein-protein interactions. Database (Oxford), 2016: baw007.

Labbé CM, Kuenemann MA, Zarzycka B, Vriend G, Nicolaes GA, Lagorce D, Miteva MA, Villoutreix BO, Sperandio O. (2016) iPPI-DB: an online database of modulators of protein-protein interactions., Nucleic Acids Res., 44, D542-547.

Pupin M, Esmaeel Q, Flissi A, Dufresne Y, Jacques P, Leclère V (2016) Norine: A powerful resource for novel nonribosomal peptide discovery., Synth Syst Biotechnol, 1:89-94

Rey J, Deschavanne P, Tuffery P. (2014) BactPepDB: a database of predicted peptides from a exhaustive survey of complete prokaryote genomes Database (Oxford). 2014:bau106.

Project request
Virtual screening of compound libraries

Virtual screening of compound libraries is a powerful tool, enabling a low-cost selection of commercially available compounds (10-10000) for experimental validation. Many properties can be used as filters to retrieve potentially interesting hits: predicted binding mode or affinity for a target protein, similarity to a known active compound, or specific physicochemical properties. ChemBioFrance has access to a catalogue of ca. 5 million commercial compounds immediately available for in silico or experimental screening.





Two types of in silico screening are possible, depending on current knowledge:

·        Protein structure-based screening: After preparation of the 3D structure (X-ray coordinates, homology model), the ChemBioFrance library is either docked to a binding site of interest or analyzed for its complementarity to a protein-ligand interaction pharmacophore. Hits are selected according to the user's needs (e.g. presence of key interactions with key residues, binding free energy, ligand efficiency). A list of commercially available hits (catalog id, supplier, price) is transmitted to the client for purchase.

·        Ligand-based screening: The compound library is screened for 2D/3D similarity to one or several known actives using QSAR/QSPR models or SOM/GTM maps. A list of hits is given to the client for purchase and experimental validation.
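The ligand-based route can be illustrated with a minimal similarity search. The sketch below is not the ChemBioFrance protocol: it assumes each compound is already encoded as a set of "on" fingerprint bits, and the compound identifiers, bit sets and the 0.5 threshold are hypothetical.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) coefficient between two sets of fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_by_similarity(query, library, threshold=0.5):
    """Return (compound id, similarity) pairs above the threshold, best first."""
    scored = [(cid, tanimoto(query, bits)) for cid, bits in library.items()]
    hits = [(cid, score) for cid, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical fingerprints: indices of the bits set for each compound.
known_active = frozenset({1, 4, 7, 9, 12})
library = {
    "CPD-001": frozenset({1, 4, 7, 9, 13}),
    "CPD-002": frozenset({2, 3, 5}),
    "CPD-003": frozenset({1, 4, 7, 9, 12}),
}

hits = rank_by_similarity(known_active, library)
for compound_id, score in hits:
    print(f"{compound_id}\t{score:.2f}")
```

In practice the fingerprints would come from a cheminformatics toolkit, and the threshold would be tuned to the descriptor type and the project.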


Required information

Name and/or structure of the target (PDB code, UniProt accession number)

Name and structures of known actives



The software used is either developed in-house or commercially available.



Docking: Surflex-Dock, PLANTS, IChem

Pharmacophore search: LigandScout, Biovia, IChem

Similarity search: Pipeline Pilot, ROCS



Rivat, C. et al. Inhibition of neuronal FLT3 receptor tyrosine kinase alleviates peripheral neuropathic pain in mice. Nature Commun 2018, 9:1042

Da Silva, F. et al. IChem: A Versatile Toolkit for Detecting, Comparing, and Predicting Protein-Ligand Interactions. ChemMedChem 2018, 13:507-510

da Silva Figueiredo Celestino Gomes, P. et al. Ranking docking poses by graph matching of protein-ligand interactions: lessons learned from the D3R Grand Challenge 2. J Comput Aided Mol Des. 2018, 32:75-87.

Slynko, I. et al. Docking pose selection by interaction pattern graph similarity: application to the D3R grand challenge 2015.  J Comput Aided Mol Des. 2016, 30:669-683.

Ruggiu, F. et al. ISIDA Property-Labelled Fragment Descriptors. Mol Inf, 2010, 29, 855-868

Klimenko, K. et al. Chemical Space Mapping and Structure-Activity Analysis of the ChEMBL Antiviral Compound Set. J Chem Inf Model, 2016, 56, 1438-1454

Gaspar, H. et al. GTM-Based QSAR Models and Their Applicability Domains. Mol Inf, 2015, 34 (6-7), 348-356  

Lin, A. et al. Mapping of the Available Chemical Space versus the Chemical Universe of Lead-Like Compounds. ChemMedChem, 2018, 13(6), 540-554

Project request
Characterization and optimization of interfering peptides.

Biologics represent a promising alternative to low-molecular-weight compounds for the development of new drugs. Among them, peptides form a specific class of compounds that are involved in cellular signaling and trafficking, and that can exhibit antibiotic activity or modulate protein-protein interactions. Recent progress has been made in controlling their bioavailability (resistance to enzymatic degradation), biodistribution (various administration routes, targeting of intracellular proteins or nucleic acids, targeting of specific cell lines) and production cost. More than 60 peptides are currently in clinical development. ChemBioFrance offers the possibility to identify, characterize and optimize interfering peptides, i.e. peptides able to modulate a specific protein-protein interaction.


·        Identification of interfering peptides

o   PEP-scan: requires the purified protein to be targeted, an antibody and the sequence of the partner protein. Costs and timelines depend on the project.

o   In silico: requires the 3D structure (or a model) of the interacting partners. Costs and timelines depend on the project.

·        Characterization of the peptide-target interaction.

This approach requires the structure of the target protein and the sequence of the interfering peptides. Timelines are ca. 1 month for a soluble target and to be defined on a case-by-case basis for a membrane receptor.

·        Optimization. Concerns peptides longer than 6 amino acids:

o   Peptide analogs by N-ter and C-ter deletion

o   Peptide mutants in order to stabilize the peptide-target interaction

o   Head-to-tail cyclisation.

o   Cellular internalization: coupling of the peptide to a cell-penetrating sequence. Proof of concept of internalization can be obtained by fluorescence spectroscopy. Costs and timelines depend on the project.


Required information

Sequence, structural and mechanistic knowledge of the protein to target.



In-house developed software and protocols.



Bruzzoni-Giovanelli et al., Interfering peptides targeting protein-protein interactions: the next generation of drugs?, Drug Discov. Today, 2018, 23:272.

Quignot et al., InterEvDock2: an expanded server for protein docking using evolutionary and biological information from homology models and multimeric inputs., Nucleic Acids Res., 2018, 46:W408.

de Vries et al., The pepATTRACT web server for blind, large-scale peptide-protein docking., Nucleic Acids Res., 2017, 45:W361.

Lamiable et al., PEP-FOLD3: faster de novo structure prediction for linear peptides in solution and in complex., Nucleic Acids Res., 2016, 44:W449.

Saladin et al., PEP-SiteFinder: a tool for the blind identification of peptide binding sites on protein surfaces.,  Nucleic Acids Res., 2014, 42:W221.

Thévenet et al., PEP-FOLD: an updated de novo structure prediction server for both linear and disulfide bonded cyclic peptides., Nucleic Acids Res., 2012, 40:W288.

Project request
Design and Analysis of Focused Libraries (e.g. Protein-Protein Interaction Inhibitors)

 Despite the increasing number of protein-protein interaction (PPI) inhibitors, the success rate of experimental PPI screening remains low, notably because of the inadequacy of screening collections. ChemBioFrance has undertaken a major effort in designing compound libraries specifically focused on protein-protein interfaces.




Molecular properties of the PPI inhibitors stored in the 2P2Idb and iPPI-DB databases are used, through machine learning models, to guide the design of tailored PPI inhibitor libraries [1-8].

The in-house machine learning methods 2P2I HUNTER and PPI-HitProfiler have been applied to compound libraries from two suppliers (Molport and Ambinter) registering 6.3 and 5.7 million compounds, respectively. Merging the two sources and filtering out PAINS, aggregators and frequent hitters, then filtering on ADMET parameters, led to a unique set of 10,314 compounds that have been plated and are currently available for in silico or experimental screening.
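The merge-then-filter logic described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the structure key (for instance a canonical identifier), and the precomputed boolean flags `ppi_like`, `pains` and `admet_ok` are hypothetical stand-ins for the outputs of the actual models and filters.

```python
def merge_unique(*catalogues):
    """Merge supplier catalogues, keeping the first record seen for each structure key."""
    merged = {}
    for catalogue in catalogues:
        for record in catalogue:
            merged.setdefault(record["key"], record)
    return list(merged.values())


def apply_filters(records, filters):
    """Apply the filters in order; a record survives only if every filter accepts it."""
    kept = records
    for keep in filters:
        kept = [record for record in kept if keep(record)]
    return kept


# Hypothetical records; the flags stand in for model and filter outputs.
supplier_a = [
    {"key": "K1", "id": "A-1", "ppi_like": True, "pains": False, "admet_ok": True},
    {"key": "K2", "id": "A-2", "ppi_like": True, "pains": True, "admet_ok": True},
]
supplier_b = [
    {"key": "K1", "id": "B-9", "ppi_like": True, "pains": False, "admet_ok": True},  # same structure as A-1
    {"key": "K3", "id": "B-7", "ppi_like": True, "pains": False, "admet_ok": False},
    {"key": "K4", "id": "B-8", "ppi_like": True, "pains": False, "admet_ok": True},
]

filters = [
    lambda r: r["ppi_like"],   # predicted PPI modulator
    lambda r: not r["pains"],  # no interference alert
    lambda r: r["admet_ok"],   # acceptable ADMET profile
]

selected = apply_filters(merge_unique(supplier_a, supplier_b), filters)
print([record["id"] for record in selected])
```

Applying the cheapest filters first keeps the more expensive property calculations to a smaller set.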


Customizing a PPI-focused library according to particular needs is feasible including:

·        the development of a machine learning model on proprietary data (3D structures, screening data)

·        the selection of predicted PPI modulators among a proprietary compound collection. Hits are selected in agreement with the client (e.g. presence of necessary functional groups, 2D or 3D similarity to known actives). A list of commercially available compounds (catalog number, supplier, and price) can be delivered if necessary.


Required Information

Structure of user compounds

Compound library to be filtered for PPI likeness (in sdf format)



In-house developed databases and software



1.            Basse, M.J., et al., 2P2Idb v2: update of a structural database dedicated to orthosteric modulation of protein-protein interactions. Database (Oxford), 2016. 2016: baw007.

2.            Basse, M.J., et al., 2P2Idb: a structural database dedicated to orthosteric modulation of protein-protein interactions. Nucleic Acids Res, 2013. 41(Database issue): p. D824-7.

3.            Hamon, V., et al., 2P2I HUNTER: a tool for filtering orthosteric protein-protein interaction modulators via a dedicated support vector machine. J R Soc Interface, 2014. 11(90): p. 20130860.

4.            Hamon, V., et al., 2P2Ichem: focused chemical libraries dedicated to orthosteric modulation of protein-protein interactions. MedChemComm, 2013. 4(5): p. 797-809.

5.            Bosc, N. et al. Privileged Substructures to Modulate Protein-Protein Interactions. J Chem Inf Model. 2017 Oct 23;57(10):2448-2462. doi: 10.1021/acs.jcim.7b00435.

6.            Reynès, C et al.  Designing focused chemical libraries enriched in protein-protein interaction inhibitors using machine-learning methods. PLoS Comput Biol. 2010 Mar 5;6(3):e1000695. 

7.            Labbé, C.M. et al.  iPPI-DB: an online database of modulators of protein-protein interactions. Nucleic Acids Res. 2016 Jan 4;44(D1):D542-7.

8.            Labbé, C.M. et al.  iPPI-DB: a manually curated and interactive database of small non-peptide inhibitors of protein-protein interactions. Drug Discov Today. 2013 Oct;18(19-20):958-68.






Project request
QSAR modelling of biological activity

Optimizing pharmacokinetic and ADMET properties is a key step in the hit-to-lead optimization stage. ChemBioFrance offers the possibility to calculate or predict any of these important properties.






Diverse QSAR and machine learning methods are available to predict the following properties:

·        physicochemical: aqueous solubility, logP, logD, pKa, polar surface area

·        structural: Lipinski rule-of-5 violations, QED (quantitative estimate of drug-likeness), synthetic accessibility

·        absorption: diffusion coefficient, membrane permeation (Caco2, MDCK), BBB crossing, Pgp inhibition/activation, OATP1B1 inhibition

·        distribution: plasma protein binding

·        metabolism: sites of metabolism by CYPs, induction/inhibition of major CYPs, intrinsic hepatic clearance, UGT substrates

·        toxicity: maximum recommended therapeutic dose, endocrine toxicity, skin toxicity, respiratory toxicity, cardiac toxicity (hERG binding), chromosomal aberrations, phospholipidosis, reproductive toxicity, hepatic toxicity (Ser_AlkPhos, Ser_GGT, Ser_LDH, Ser_AST, and Ser_ALT)
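As a simple example of one structural property in the list above, the number of Lipinski rule-of-5 violations can be counted from four precomputed descriptors. The sketch assumes the descriptors are already available; the `Descriptors` class and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors of one compound (hypothetical container)."""
    mol_weight: float   # molecular weight in Da
    logp: float         # octanol/water partition coefficient
    h_donors: int       # hydrogen-bond donors
    h_acceptors: int    # hydrogen-bond acceptors


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four rule-of-5 criteria are violated."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ])


drug_like = Descriptors(mol_weight=350.4, logp=2.1, h_donors=2, h_acceptors=5)
large_lipophilic = Descriptors(mol_weight=720.9, logp=6.3, h_donors=4, h_acceptors=12)

print(lipinski_violations(drug_like))         # no criterion violated
print(lipinski_violations(large_lipophilic))  # weight, logP and acceptors violated
```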


Required information

Name and 2D structure of the compound(s)



The software used is either commercial (Pipeline Pilot, ADMET Predictor) or developed in-house.



Muller C, Pekthong D, Alexandre E, Marcou G, Horvath D, Richert L, Varnek A. Prediction of drug induced liver injury using molecular and biological descriptors. Comb Chem High Throughput Screen. 2015;18(3):315-22

Project request
Searching for unwanted chemical motifs

ADMET filtering of compound libraries enables the removal or annotation of compounds with potential liabilities with respect to clinical development or the identification of a new molecular probe. This analysis can be done after a virtual or experimental screen in order to gain knowledge before or after a hit-to-lead optimization stage. Many types of filters have been implemented and can be fine-tuned according to the user's needs.






Main filters are applied to:

·        Physicochemical properties: rule-of-five

·        Toxicity alerts (toxicophores) and rules (GSK 4/400, Pfizer 3/75)

·        PAINS: pan assay interference compounds (known to affect biological screenings)

·        Estimation of drug-likeness (QED): combination of several physchem properties and structural alerts, enabling a quantitative estimate of drug-likeness.



·        The compound library is first set up in order to standardize chemical structures.

·        Filtering consists of computing several of the properties described above, followed by an iterative process that removes compounds matching the queries and structural alerts. Compounds passing the filters and those rejected are stored separately in tables and structure files.
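The accept/reject split described above can be sketched with the two named rules. The example assumes the usual reading of these rules (GSK 4/400: logP below 4 and molecular weight below 400 are favorable; Pfizer 3/75: logP above 3 together with a polar surface area below 75 is a risk flag). Rejecting on any flag is an illustrative policy, and the compound records are hypothetical.

```python
def alerts(compound):
    """Return the names of the rules that flag this compound."""
    reasons = []
    if not (compound["logp"] < 4 and compound["mw"] < 400):
        reasons.append("GSK 4/400")
    if compound["logp"] > 3 and compound["tpsa"] < 75:
        reasons.append("Pfizer 3/75")
    return reasons


def split_library(compounds):
    """Separate compounds passing all rules from those flagged by at least one."""
    accepted, rejected = [], []
    for compound in compounds:
        reasons = alerts(compound)
        if reasons:
            rejected.append((compound["id"], reasons))
        else:
            accepted.append(compound["id"])
    return accepted, rejected


# Hypothetical precomputed properties (mw in Da, tpsa in square angstroms).
library = [
    {"id": "CPD-A", "mw": 320.0, "logp": 2.5, "tpsa": 90.0},
    {"id": "CPD-B", "mw": 450.0, "logp": 4.6, "tpsa": 60.0},
    {"id": "CPD-C", "mw": 380.0, "logp": 3.5, "tpsa": 50.0},
]

accepted, rejected = split_library(library)
print("accepted:", accepted)
print("rejected:", rejected)
```

Keeping the reasons alongside each rejected compound makes it possible to annotate rather than discard, as the user prefers.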


Required Information

2D structures or SMILES strings of the compounds under study.



Our protocol relies on free software and components of commercially available code (e.g. ChemAxon).



Lagorce et al., FAF-Drugs4: free ADME-tox filtering computations for chemical biology and early stages drug discovery, Bioinformatics, 2017, 33:3658

Lagorce et al., Pan-assay interference compounds (PAINS) that may not be too painful for chemical biology projects., Drug Discov. Today, 2017, 22:1131.

Lagorce et al., Computational analysis of calculated physicochemical and ADMET properties of protein-protein interaction inhibitors., Sci. Rep., 2017, 7:46277.

Lagorce et al., FAF-Drugs3: a web server for compound property calculation and chemical library design., NAR 2015 Jul 1;43(W1):W200-7

Project request
Target identification and profiling

Identifying the main target of a phenotypic screen remains a tricky endeavour. Moreover, it might be interesting to know secondary off-targets of any bioactive molecule. ChemBioFrance offers an in silico approach to target prediction from the simple knowledge of a ligand structure.





A target library of 4,500 proteins of pharmaceutical interest (GPCRs, nuclear receptors, ion channels, kinases, proteases, etc.) is screened with a proprietary method (Profiler) (1) that uses machine learning algorithms (support vector machines, random forests) specific to each investigated target, built from current knowledge of the target and its known ligands. For each compound to be investigated, Profiler returns a list of potential targets with predicted inhibition constants. When applied to 189 clinical candidates, the method recovered the main target within a short list (usually 15-20 targets) in 87% of cases. Profiler has also been applied to the identification of secondary targets, with subsequent experimental validation (1).
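The short-list evaluation mentioned above can be illustrated as follows. This is not Profiler itself: the sketch assumes that some model has already produced a predicted affinity (here a pKi-like score, higher meaning stronger binding) for every target, and all names and values are hypothetical.

```python
def shortlist(predicted_scores, size=15):
    """Return the `size` targets with the highest predicted affinity score."""
    ranked = sorted(predicted_scores.items(), key=lambda item: item[1], reverse=True)
    return [target for target, _ in ranked[:size]]


def recovery_rate(predictions, main_targets, size=15):
    """Fraction of compounds whose known main target appears in their short list."""
    found = sum(
        main_targets[compound] in shortlist(scores, size)
        for compound, scores in predictions.items()
    )
    return found / len(predictions)


# Hypothetical predicted scores for two compounds against three targets.
predictions = {
    "drug1": {"T1": 8.2, "T2": 6.1, "T3": 5.0},
    "drug2": {"T1": 5.5, "T2": 6.0, "T3": 7.9},
}
main_targets = {"drug1": "T1", "drug2": "T1"}

rate = recovery_rate(predictions, main_targets, size=2)
print(f"main target recovered for {rate:.0%} of compounds")
```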

By exploiting several public kinase inhibitor profiling datasets, we have developed robust chemogenomic statistical models, also called proteochemometric (PCM) models, to predict the selectivity profile of novel kinase inhibitors (2). The tool uses 2D and 3D molecular descriptors and takes into account the different conformations, active and inactive, of protein kinases. Three different machine learning algorithms were evaluated: Naïve Bayes (NB), Support Vector Machines (SVM) and Random Forest (RF).
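The defining feature of a proteochemometric model is that each training row describes a ligand-protein pair rather than a ligand alone. The sketch below shows only that input construction, under the assumption that ligand and kinase descriptors are simple numeric vectors; the descriptor values, kinase names and conformation labels are hypothetical, and no learning algorithm is included.

```python
def pcm_features(ligand_descriptors, kinase_descriptors):
    """Concatenate ligand and protein descriptors into one feature vector."""
    return list(ligand_descriptors) + list(kinase_descriptors)


# Hypothetical descriptor vectors.
ligands = {"inhibitor_1": [0.3, 1.0, 0.0]}
kinases = {
    ("KIN_A", "active"): [1.0, 0.0, 0.7],
    ("KIN_A", "inactive"): [0.0, 1.0, 0.4],
}

# One row per (ligand, kinase, conformation) combination.
rows = {
    (ligand, kinase, state): pcm_features(ligand_desc, kinase_desc)
    for ligand, ligand_desc in ligands.items()
    for (kinase, state), kinase_desc in kinases.items()
}

for key, features in rows.items():
    print(key, features)
```

Rows of this kind, paired with measured activities, could then be passed to any of the three algorithm families named above.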


Required information

Name and 2D structure of the compound(s) to profile



Profiler is software developed at the Laboratoire d'Innovation Thérapeutique (LIT, UMR 7200 CNRS-Université de Strasbourg).


PCM is a tool developed at the Institut de Chimie Organique et Analytique (ICOA, UMR 7311 CNRS-Université d'Orléans).



(1)    Meslamani, J., Bhajun, R., Martz, F. and Rognan, D. (2013). Computational profiling of bioactive compounds using a target-dependent composite workflow. J. Chem. Inf. Model., 53, 2322-2333.


(2)    Bosc, N., Wroblowski, B., Meyer, C. and Bonnet P. (2017) Prediction of Protein Kinase-Ligand Interactions through 2.5D Kinochemometrics. J. Chem. Inf. Model., 57, 93-101.

Project request
Predicting physicochemical and biological properties by QSAR

Physicochemical and/or biological properties can be predicted by QSAR methods, including machine learning and deep learning. The QSAR approach relies on describing molecular structures by a set of data called molecular descriptors. These molecular descriptors are then linked to any property of interest through a mathematical model whose parameters may be derived by learning algorithms. Deep learning can be seen as an evolution of classical QSAR techniques in two ways: (i) it tries to dispense with molecular descriptors by working directly on molecular graphs, and (ii) it targets complex properties such as spectra or images. In all cases, models are prepared and validated according to rigorous protocols aimed at estimating model performance and applicability domain. Deep learning methods can also directly generate new chemical structures with the desired properties.



This offer delivers a mathematical model for estimating a property (physicochemical, biological) for a given compound. The model is characterized by its performance on known data. Predicted data are given with a confidence interval. Each query is regarded with respect to the applicability domain of the model.
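A deliberately small example of such a deliverable is given below: a one-descriptor linear model with an error band and an applicability-domain check. It is a sketch under simplifying assumptions, not the modelling protocol used by the platforms. The band of plus or minus two residual standard errors is a rough stand-in for a proper confidence interval, the applicability domain is reduced to the descriptor range of the training set, and the training data are invented.

```python
import math


def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept, with residual standard error."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    std_error = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return slope, intercept, std_error


def predict(model, domain, x_new):
    """Return the prediction, a rough error band, and whether x_new is in the domain."""
    slope, intercept, std_error = model
    value = slope * x_new + intercept
    band = (value - 2 * std_error, value + 2 * std_error)
    in_domain = domain[0] <= x_new <= domain[1]
    return value, band, in_domain


# Invented training set: one descriptor value and one measured property per compound.
descriptor = [1.0, 2.0, 3.0, 4.0]
measured = [2.1, 3.9, 6.2, 7.8]

model = fit_line(descriptor, measured)
domain = (min(descriptor), max(descriptor))

for query in (2.5, 10.0):
    value, band, in_domain = predict(model, domain, query)
    print(f"x={query}: {value:.2f} [{band[0]:.2f}, {band[1]:.2f}] in domain: {in_domain}")
```

A query outside the training range still receives a number, but the domain flag signals that the prediction should not be trusted.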

The model can be delivered as an ensemble of data describing how to use it, including a report, a notice to compute molecular descriptors, mathematical equations used to compute the property, the applicability domain and the confidence interval.

Another possibility is to use a known model for screening a compound library for desired properties.

Where necessary, we can deliver software with a user interface for applying the model and writing a formatted report. The interface can be a simple command line or a web interface, hosted at ChemBioFrance or delivered to the client.


Required information

Name, structure and property data of molecules of interest



The software used is either free or developed within ChemBioFrance.


Machine Learning: R, WEKA

Deep learning: Keras Python library



A. Varnek and I. Baskin, Machine Learning Methods for Property Prediction in Chemoinformatics: Quo Vadis?, J. Chem. Inf. Model. 2012, 52, 1413-1437

F. Ruggiu et al. ISIDA Property-Labelled Fragment Descriptors Mol Inf, 2010, 29, 855 - 868

A. Varnek et al. ISIDA - Platform for virtual screening based on fragment and pharmacophoric descriptors, Current Computer-Aided Drug Design, 2008, 4 (3), 191-198

H. Gaspar et al. GTM-Based QSAR Models and Their Applicability Domains. Mol Inf, 2015, 34 (6-7), 348-356  

Project request
Visualization and Analysis of Chemical Data

Project request
Fragment Optimization (hit-to-lead)

Over the past few decades, hit identification has been greatly facilitated by advances in high-throughput and fragment-based screenings. One major hurdle remaining in drug discovery is process automation of hit-to-lead (H2L) optimization. Computational chemistry or molecular modeling can play an important role during this H2L stage by both suggesting putative optimizations and decreasing the number of compounds to be experimentally synthesized and evaluated. However, it is also crucial to consider the feasibility of organically synthesizing these virtually designed compounds. Furthermore, the generated molecules should have reasonable physicochemical properties and be medicinally relevant.

We have developed a highly automated integrated strategy (DOTS) for H2L optimization of fragments identified through experimental screening. It combines computational (chemoinformatics, molecular modeling and virtual screening) and experimental (parallel organic synthesis, in vitro evaluation of the compounds) approaches that are partially or totally automated to accelerate the process.



The DOTS protocol [1,2] is divided into several steps:

i/ the binding mode of the initial fragment is identified (usually by solving the X-ray structure of the complex with its biological target).

ii/ a diverse focused-chemical library is designed by coupling an activated form of the initial fragment with a collection of commercially available building blocks. The chemical reactions are selected among a set of robust and well-accepted organic reactions commonly used in medicinal chemistry [3]. All compounds within the virtual library can be efficiently synthesized in one or two steps.

iii/ S4MPLE, a molecular-modeling tool that relies on a Lamarckian genetic algorithm for conformational sampling, is then used for virtual screening [4-6]. This stage is performed under restraints to maintain the binding mode of the initial fragment, following the generic hit-growing paradigm. The relative binding energy of each ligand is estimated by computing the energy difference between the best pose in the bound state and the best pose in the free state, using the AMBER/GAFF force field. Compounds are ranked according to this computed energy, and a representative set from the top of the list is selected.

iv/ the corresponding building blocks are purchased and the prioritized compounds are synthesized in parallel using an automated robotic platform (Accelerator Synthetizer SLT100).

v/ the newly synthesized compounds are evaluated in vitro using a Labcyte Access/Echo robotic platform.
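The ranking in step iii can be sketched as follows. The example assumes that the best-pose energies of the bound and free states have already been computed and are expressed in the same units; the candidate names and energy values are hypothetical, and the selection of a representative (diverse) subset is reduced here to taking the top entries.

```python
def relative_binding_energy(e_bound, e_free):
    """Energy difference between the best bound pose and the best free pose."""
    return e_bound - e_free


def prioritize(candidates, top_n):
    """Rank candidates by relative binding energy (most negative first)."""
    scored = [
        (name, relative_binding_energy(e_bound, e_free))
        for name, (e_bound, e_free) in candidates.items()
    ]
    return sorted(scored, key=lambda item: item[1])[:top_n]


# Hypothetical (bound, free) energies for three grown fragments.
candidates = {
    "fragment+BB1": (-120.0, -95.0),
    "fragment+BB2": (-110.0, -100.0),
    "fragment+BB3": (-130.0, -96.5),
}

top = prioritize(candidates, top_n=2)
for name, energy in top:
    print(f"{name}\t{energy:.1f}")
```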


Required Information

Structure of the fragment to be optimized.



All in silico procedures are developed in house.



 1. Hoffer L, Voitovich YV, Raux B, Carrasco K, Muller C, Fedorov AY, Derviaux C, Amouric A, Betzi S, Horvath D, et al.: Integrated Strategy for Lead Optimization Based on Fragment Growing: The Diversity-Oriented-Target-Focused-Synthesis Approach. J Med Chem 2018, 61:5719-5732.

2. Hoffer L, Muller C, Roche P, Morelli X: Chemistry-Driven Hit-To-Lead Optimization Guided by Structure-Based Approaches. Mol Inform 2018.

3. Hartenfeller M, Eberle M, Meier P, Nieto-Oberhuber C, Altmann KH, Schneider G, Jacoby E, Renner S: A collection of robust organic synthesis reactions for in silico molecule design. J Chem Inf Model 2011, 51:3093-3098.

4. Hoffer L, Chira C, Marcou G, Varnek A, Horvath D: S4MPLE--Sampler for Multiple Protein-Ligand Entities: Methodology and Rigid-Site Docking Benchmarking. Molecules 2015, 20:8997-9028.

5. Hoffer L, Renaud JP, Horvath D: In silico fragment-based drug discovery: setup and validation of a fragment-to-lead computational protocol using S4MPLE. J Chem Inf Model 2013, 53:836-851.

6. Hoffer L, Horvath D: S4MPLE--sampler for multiple protein-ligand entities: simultaneous docking of several entities. J Chem Inf Model 2013, 53:88-102.

Project request
Prediction of the residence time (RT) of compounds


The residence time of a compound is becoming an important parameter in the drug design process. It corresponds to the time the compound spends bound to its target, and this duration is strongly related to its pharmacological effect. Indeed, a long residence time seems to be linked to an extended pharmacological effect but may also result in increased toxicity. The residence time of a compound, which is inversely proportional to its dissociation rate constant (koff), could constitute a relevant indicator of its in vivo efficacy.
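For orientation, the relation between koff and residence time can be written in a few lines. The sketch assumes the common convention that residence time equals 1/koff and that dissociation follows first-order kinetics, so the half-life of the complex is ln(2)/koff; the rate constant used is an arbitrary example.

```python
import math


def residence_time(k_off):
    """Residence time in seconds, taken as 1 / k_off (k_off in 1/s)."""
    return 1.0 / k_off


def dissociation_half_life(k_off):
    """Half-life of the complex in seconds, assuming first-order dissociation."""
    return math.log(2) / k_off


k_off = 1e-3  # arbitrary example value, in 1/s
print(f"residence time: {residence_time(k_off):.0f} s")
print(f"half-life: {dissociation_half_life(k_off):.0f} s")
```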



Two methodologies have been developed so far to predict the relative residence times of a set of compounds.

·        A method based on steered molecular dynamics (SMD) simulations. In this protocol, a constraint is applied to push the ligand from the binding site to the outside of the protein. This simulation is iterated ten times to provide a statistically relevant estimation of the process. Profiles of the free energy of binding are built from these simulations and the mean of the free energy of dissociation is extracted from these profiles. This free energy of dissociation is used to estimate the residence time of the compound. The set of the ligands associated with their estimated residence times is then provided to the customer.


·        A method based on targeted molecular dynamics (TMD) simulations. In this protocol, a constraint is applied to force the ligand to change its conformation, which ultimately drives the ligand out of the binding site of the protein. This simulation is iterated eleven times. An estimator of the residence time is calculated from the resulting trajectories, based on the constraint forces needed to achieve ligand exit. The set of ligands, each with its residence time estimator value, is then provided to the customer.
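The final aggregation step of the SMD protocol can be sketched as follows. The example assumes that each replica has already yielded one free energy of dissociation per ligand, and that a higher mean value corresponds to a longer relative residence time; the ligand names and values are hypothetical, and three replicas are shown instead of ten for brevity.

```python
from statistics import mean, stdev


def rank_by_dissociation_energy(replica_energies):
    """Average replica values per ligand and rank ligands, highest mean first."""
    summary = {
        ligand: (mean(values), stdev(values))
        for ligand, values in replica_energies.items()
    }
    return sorted(summary.items(), key=lambda item: item[1][0], reverse=True)


# Hypothetical free energies of dissociation from repeated simulations.
replica_energies = {
    "ligand_1": [10.0, 12.0, 11.0],
    "ligand_2": [15.0, 14.0, 16.0],
}

ranking = rank_by_dissociation_energy(replica_energies)
for ligand, (average, spread) in ranking:
    print(f"{ligand}\tmean={average:.1f}\tsd={spread:.1f}")
```

Reporting the spread across replicas alongside the mean shows whether two ligands are genuinely separated or within simulation noise.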

The two methodologies have been validated on three therapeutic targets using datasets containing between 8 and 20 molecules. Excellent correlations with experimental data have been obtained (publications in preparation).

Required information

Name of the target. If no crystallographic structure is available, a model will be built by homology modeling.

2D structures of the molecules for which residence time will be predicted. A small set of reference compounds with known residence times is advised but not compulsory.


The software used is either commercial (Amber) or developed in-house at ChemBioFrance.


Aci-Sèche, S. ; Ziada, S. ; Braka, A. ; Arora, R. ; Bonnet, P. "Advanced molecular dynamics simulation methods for kinase drug discovery", Future Med. Chem. 2016, 8, 545-566.


Project request