Publication Information
+Information about the source
+ Computational Structural Biology (CSB) is the scientific domain concerned with the development of algorithms
+ and software to understand and predict the structure and function of biological macromolecules. This research
+ field is inherently multi-disciplinary. On the experimental side, biology and medicine provide the objects
+ studied, while biophysics and bioinformatics supply experimental data, which are of two main kinds. On the one
+ hand, genome sequencing projects supply protein sequences, and ~200 million sequences have been
+ archived in UniProtKB/TrEMBL, which collects the protein sequences
+ yielded by such projects. On the other hand, structure determination experiments (notably X-ray
+ crystallography, nuclear magnetic resonance, and cryo-electron microscopy) give access to geometric models of
+ molecules – atomic coordinates. Alas, only ~150,000 structures have been solved and deposited in the Protein
+ Data Bank (PDB), a number to be compared against the size of UniProtKB/TrEMBL. With one structure for ~1000 sequences, we hardly know anything
+ about biological functions at the atomic/structural level. Complementing experiments, physical
+ chemistry/chemical physics supply the required models (energies, thermodynamics, etc). More specifically, let us
+ recall that proteins with lock-and-key
+ metaphor for interacting molecules, Biology is based on the interactions stable conformations make with each
+ other. Turning these intuitive notions into quantitative ones requires delving into statistical physics, as
+ macroscopic properties are average properties computed over ensembles of conformations. Developing effective
+ algorithms to perform accurate simulations is especially challenging for two main reasons. The first one is the
+ high dimension of conformational spaces – see tour de force rarely achieved .
The first challenge, sequence-to-structure prediction, aims to infer the possible
+ structure(s) of a protein from its amino acid sequence. While progress has recently been made, using in
+ particular deep learning techniques, the models obtained so far
+ are static and coarse-grained.
The second one is protein function prediction. Given a protein with known structure, i.e. 3D coordinates, the goal is to predict the partners of this protein, in terms of stability
+ and specificity. This understanding is fundamental to biology and medicine, as illustrated by the example of the
+ SARS-CoV-2 virus responsible for the COVID-19 pandemic. To infect a host, the virus first fuses its envelope with
+ the membrane of a target cell, and then injects its genetic material into that cell. Fusion is achieved by a
+ so-called class I fusion protein, also found in other viruses (influenza, SARS-CoV-1, HIV, etc.). The fusion
+ process is highly dynamic, involving large amplitude conformational changes of the molecules. It is
+ poorly understood, which hinders our ability to design therapeutics to block it.
Finally, the third one, large assembly reconstruction, aims at solving (coarse-grain)
+ structures of molecular machines involving tens or even hundreds of subunits. This research vein was promoted
+ about 15 years ago by the work on the nuclear pore complex. It is often
+ referred to as reconstruction by data integration, as it requires combining coarse-grain
+ models (notably from cryo-electron microscopy (cryo-EM) and native mass spectrometry) with atomic models of
+ subunits obtained from X-ray crystallography. Fitting the latter into the former requires exploring the
+ conformation space of subunits, whence the importance of protein dynamics.
As an illustration of these three challenges, consider the problem of designing proteins blocking the entry of
+ SARS-CoV-2 into our cells (Fig. ). The first challenge is illustrated
+ by the problem of predicting the structure of a blocker protein from its sequence of amino acids, a tractable
+ problem here since the mini proteins used comprise only about 50 amino acids (Fig. (A), ). The second
+ challenge is illustrated by the calculation of the binding modes and the binding affinity of the designed
+ proteins for the receptor binding domain (RBD) of SARS-CoV-2 (Fig. (B)). Finally, the last challenge is
+ illustrated by the problem of solving structures of the virus with a cell, to understand how many spikes are
+ involved in the fusion mechanism leading to infection. In that work, the promising
+ designs suggested by modeling have been assessed by an array of wet lab experiments (affinity measurements,
+ circular dichroism for thermal stability assessment, structure resolution by cryo-EM). The hyperstable minibinders identified provide starting points for SARS-CoV-2 therapeutics. We note in passing that although this work is truly remarkable,
+ the designed proteins stem from a template (the bottom helix from ACE2) and are rather
+ small.
To present challenges in structural modeling, let us recall the following ingredients. First, a molecular model
+ with d.o.f.). Second,
+ recall that the potential energy landscape (PEL) is the mapping CHARMM, AMBER, MARTINI, etc. Such PE belong to the realm of
+ molecular mechanics, and implement atomic or coarse-grain models. They may include a solvent model, either
+ explicit or implicit. Their definition requires a significant number of parameters (up to
These PE are usually considered good enough to study non-covalent interactions, our focus, even though they do not cover the modification of chemical bonds. In any case, we take such a function for granted.
+The PEL encodes all structural, thermodynamic, and kinetic properties,
+ which can be obtained by averaging properties of conformations over so-called thermodynamic
+ ensembles. The structure of a macromolecular system requires the characterization of
+ active conformations and important intermediates in functional pathways involving significant basins. In
+ assigning occupation probabilities to these conformations by integrating Boltzmann's distribution, one treats
+ thermodynamics. Finally, transitions between the states, modeled, say, by a master
+ equation (a continuous-time Markov process), correspond to kinetics. Classical simulation
+ methods based on molecular dynamics (MD) and Monte Carlo sampling (MC) are developed in the lineage of the
+ seminal work by the 2013 recipients of the Nobel prize in chemistry (Karplus, Levitt, Warshel), which was
+ awarded “for the development of multiscale models for complex chemical systems”. However,
+ except for highly specialized cases where massive calculations have been used, neither MD nor MC gives access to the aforementioned time
+ scales. In fact, the main limitation of such methods is that they treat structural, thermodynamic and kinetic
+ aspects at once. The absence of specific insights on these three
+ complementary pieces of the puzzle makes it impossible to optimize simulation methods, and results in general in
+ the inability to obtain converged simulations on biologically relevant time-scales.
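To make the thermodynamic step concrete, here is a minimal sketch of how occupation probabilities follow from Boltzmann's distribution. It rests on simplifying assumptions that are not part of the program described above: each basin is collapsed to a single representative energy (so the integration over the basin is ignored), and the energy values, the temperature, and the function names `boltzmann_probabilities` and `ensemble_average` are hypothetical.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol.K)


def boltzmann_probabilities(energies, temperature):
    """Occupation probabilities p_i = exp(-E_i / (k_B T)) / Z of discrete states."""
    beta = 1.0 / (K_B * temperature)
    e_min = min(energies)  # shifting by the minimum avoids overflow in exp
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    partition = sum(weights)
    return [w / partition for w in weights]


def ensemble_average(observable, probabilities):
    """Average of a per-state observable over the ensemble."""
    return sum(o * p for o, p in zip(observable, probabilities))


# Hypothetical energies (kcal/mol) of three basins, each reduced to one state.
energies = [0.0, 0.5, 2.0]
probabilities = boltzmann_probabilities(energies, temperature=300.0)
mean_energy = ensemble_average(energies, probabilities)
print(probabilities, mean_energy)
```

The same probabilities can serve as the stationary distribution of a master equation over the basins, which is where kinetics enters.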
The hardness of structural modeling stems from three intertwined reasons.
+First, PELs of biomolecules usually exhibit a number of critical points exponential in the dimension; fortunately, they enjoy a multi-scale structure. Intuitively, the significant local minima/basins are those
+ which are deep or isolated/wide, two notions which are mathematically
+ captured by the concepts of persistence and prominence. Mathematically, problems are plagued with the curse of
+ dimensionality and measure concentration phenomena. Second, biomolecular processes are inherently multi-scale,
+ with motions spanning i.e.
+ observables, are average properties computed over ensembles of conformations, which calls for a multi-scale
+ statistical treatment both of thermodynamics and kinetics.
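The notion of persistence mentioned above can be illustrated on a toy case. The following sketch computes the persistence of the local minima of a function sampled on a 1D grid, using a sweep by increasing height and a union-find structure. This is only a one-dimensional illustration under our own assumptions (the function name `minima_persistence` and the sample values are hypothetical); actual PELs are high-dimensional and call for dedicated data structures.

```python
def minima_persistence(values):
    """Map each local minimum (by index) of a sampled 1D function to its persistence.

    The persistence of a minimum is the height of the sample at which its basin
    merges into a basin with a lower minimum, minus the value at the minimum.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    parent = [None] * n      # None: sample not yet in the sublevel set
    lowest = list(range(n))  # for a root, index of the minimum of its basin
    persistence = {}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in order:  # sweep the samples by increasing height
        parent[i] = i
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] is not None:
                a, b = find(i), find(j)
                if a == b:
                    continue
                if values[lowest[a]] < values[lowest[b]]:
                    a, b = b, a  # a now denotes the basin with the higher minimum
                if lowest[a] != i:
                    # two genuine basins meet at sample i, which acts as a saddle
                    persistence[lowest[a]] = values[i] - values[lowest[a]]
                parent[a] = b
    persistence[order[0]] = float("inf")  # the global minimum never dies
    return persistence


print(minima_persistence([3, 1, 4, 0, 2.5, 2, 6]))
```

On this sample, the shallow minimum at index 5 gets a small persistence while the minimum at index 1 gets a large one, which is the kind of ranking used to single out significant basins.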
A critical question concerns the validation of models proposed in structural bioinformatics. For all three types of questions of interest (structures, thermodynamics, kinetics), there exist experiments against which the models must be checked, when the experiments can be conducted.
+For structures, the models proposed can readily be compared against experimental results stemming from X-ray
+ crystallography, NMR, or cryo-electron microscopy. For thermodynamics, which we illustrate here with binding
+ affinities, predictions can be compared against measurements provided by calorimetry or surface plasmon
+ resonance. Lastly, kinetic predictions can also be assessed by various experiments such as binding affinity
+ measurements (for the prediction of
Our research program aims to develop a comprehensive set of novel concepts and algorithms to study protein dynamics, based on the modular framework of PEL.
+As noted in the section "Protein dynamics: core CS - maths challenges", the integrated
+ nature of simulation methods such as MD or MC is such that these methods do not in general give access to
+ biologically relevant time scales. The framework of energy landscapes (Fig. ) is much more
+ modular, yet large biomolecular systems remain out of reach.
To make a decisive step towards solving the prediction of protein dynamics, we will serialize the discovery
+ and the exploitation of a PEL. Ideas and
+ concepts from computational geometry/geometric motion planning, machine learning, probabilistic algorithms, and
+ numerical probability will be used to develop two classes of probabilistic algorithms. The first deals with
+ algorithms to discover/sketch PELs i.e. enumerate all significant (persistent or prominent)
+ local minima and their connections across saddles, a difficult task since the number of all local
+ minima/critical points is generally exponential in the dimension. To this end, we will develop a hierarchical
+ data structure coding PELs as well as multi-scale proposals to explore molecular conformations. (Nb: in Monte
+ Carlo methods, a proposal generates a new conformation from an existing one.) The second focuses on methods to
+ exploit/sample PELs i.e. compute so-called densities of states, from which all thermodynamic
+ quantities are given by standard relations.
+ This is a hard problem akin to high-dimensional
+ numerical integration. To solve this problem, we will develop a learning based strategy for the Wang-Landau
+ algorithm, an adaptive Markov chain Monte Carlo (MCMC) algorithm, as
+ well as a generalization of multi-phase Monte Carlo methods for convex/polytope volume calculations, for
+ non-convex strata of PELs.
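For readers unfamiliar with the Wang-Landau algorithm, the following sketch shows the textbook version on a toy system whose density of states is known exactly. It is not the learning based strategy announced above: the system (non-interacting two-state spins, with energy equal to the number of excited spins), the flatness criterion, the halving schedule for the modification factor, and all parameter names are assumptions made for illustration.

```python
import math
import random


def wang_landau(n_spins=8, ln_f_final=1e-3, flatness=0.8, seed=0, max_steps=5_000_000):
    """Estimate ln g(E) for n_spins independent two-state spins, E = number of 1s.

    Returns ln g(E) - ln g(0) for E = 0..n_spins; the exact value is ln C(n_spins, E).
    """
    rng = random.Random(seed)
    spins = [0] * n_spins
    energy = 0
    ln_g = [0.0] * (n_spins + 1)  # running estimate of the log density of states
    hist = [0] * (n_spins + 1)    # visit histogram of the current stage
    ln_f = 1.0                    # log of the modification factor
    step = 0
    while ln_f > ln_f_final and step < max_steps:
        i = rng.randrange(n_spins)  # proposal: flip one spin chosen at random
        new_energy = energy + (1 - 2 * spins[i])
        delta = ln_g[energy] - ln_g[new_energy]
        # accept with probability min(1, g(E_old) / g(E_new))
        if delta >= 0 or rng.random() < math.exp(delta):
            spins[i] = 1 - spins[i]
            energy = new_energy
        ln_g[energy] += ln_f
        hist[energy] += 1
        step += 1
        # when the histogram is flat enough, refine the modification factor
        if step % 1000 == 0 and min(hist) >= flatness * sum(hist) / len(hist):
            ln_f /= 2.0
            hist = [0] * len(hist)
    return [v - ln_g[0] for v in ln_g]


estimate = wang_landau()
exact = [math.log(math.comb(8, k)) for k in range(9)]
print([round(e, 2) for e in estimate])
print([round(e, 2) for e in exact])
```

Once ln g(E) is known, quantities such as the partition function at any temperature follow by summing g(E) exp(-E / kT) over energy levels, which is why the density of states is the target.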
As discussed in the previous Section, the study of PEL and protein dynamics raises difficult algorithmic and mathematical questions. As an illustration, one may consider our recent work on the comparison of high dimensional distributions, statistical tests / two-sample tests, the comparison of clusterings, the complexity study of graph inference problems for low-resolution reconstruction of assemblies, the analysis of partition (or clustering) stability in large networks, and the complexity of the representation of simplicial complexes. Making progress on such questions is fundamental to advance the state of the art on protein dynamics.
+We will continue to work on such questions, motivated by CSB / theoretical biophysics, both in the continuous (geometric) and discrete settings. The developments will be based on a combination of ideas and concepts from computational geometry, machine learning (notably on non-linear dimensionality reduction, the reconstruction of cell complexes, and sampling methods), graph algorithms, probabilistic algorithms, optimization, numerical probability, and also biophysics.
+While our main ambition is to advance the algorithmic foundations of molecular simulation, a major challenge
+ will be to ensure that the theoretical and algorithmic developments will change the fate of applications, as
+ illustrated by our case studies. To foster such a symbiotic relationship between theory, algorithms and
+ simulation, we will pursue high quality software development and integration within the SBL,
+ and will also take the appropriate measures for the software to be widely adopted.
Software development for structural bioinformatics is especially challenging, combining advanced geometric,
+ numerical and combinatorial algorithms, with complex biophysical models for PEL and related
+ thermodynamic/kinetic properties. Specific features of the proteins studied must also be accommodated. About
+ 50 years after the development of force fields and simulation methods (see the 2013 Nobel prize in chemistry),
+ the software implementing such methods has a profound impact on molecular science at large. One can indeed
+ cite packages such as CHARMM, AMBER, gromacs, gmin, MODELLER, Rosetta, VMD, PyMol, etc. On the other hand, these packages are goal oriented, each tackling a (small set
+ of) specific goal(s). In fact, no real modular software design and integration has taken place. As a result,
+ despite the high quality software packages available, inter-operability between algorithmic building blocks
+ has remained very limited.
Predicting the dynamics of large molecular systems requires the integration of advanced algorithmic building
+ blocks / complex software components. To achieve a sufficient level of integration, we undertook the
+ development of the Structural Bioinformatics Library (SBL), a generic C++/python
+ cross-platform library providing software to solve complex problems in structural bioinformatics. For
+ end-users, the SBL provides ready to use, state-of-the-art applications to model
+ macro-molecules and their complexes at various resolutions, and also to store results in perennial and easy to
+ use data formats. For developers, the SBL provides a
+ broad C++/python toolbox with modular design. This hybrid status targeting both
+ end-users and developers stems from an advanced software design involving four software components, namely
+ applications, core algorithms, biophysical models, and modules. This modular design makes it possible to optimize
+ robustness and the performance of individual components, which can then be assembled within a goal oriented
+ application.
Our methods will be validated on various systems for which flexibility operates at various scales. Examples of such systems are antibody-antigen complexes, (viral) polymerases, and (membrane) transporters.
+Even very complex biomolecular systems are deterministic in prescribed conditions (temperature, pH, etc.),
+ demonstrating that despite their high dimensionality, not all d.o.f. are at play at the same
+ time. This insight suggests three classes of systems of particular interest. The first class consists of systems
+ defined from (essentially) rigid blocks whose relative positions change thanks to conformational changes of
+ linkers; a Newton cradle provides an interesting way to envision such a system. We have recently worked on one
+ such system, a membrane protein involved in antibiotic resistance (AcrB). The second class consists of cases where relative
+ positions of subdomains do not significantly change, yet their intrinsic dynamics are significantly altered. A
+ classical illustration is provided by antibodies, whose binding affinity owes to dynamics localized in six
+ specific loops.
+ The third class, consisting of composite cases, will greatly benefit from insights on the first two classes. As
+ an example, we may consider the spikes of the SARS-CoV-2 virus, whose function (performing infection) involves
+ both large amplitude conformational changes and subtle dynamics of the so-called receptor binding domain. We
+ have started to investigate this system, in collaboration with B. Delmas (INRAe).
In ABS, we will investigate systems in these three tiers, together with expert collaborators, in the hope of opening new perspectives in biology and medicine. Along the way, we will also collaborate on selected questions at the interface between CSB and systems biology, as it is now clear that the structural level and the systems level (pathways of interacting molecules) can benefit from one another.
+The main application domain is Computational Structural Biology, as underlined in the Research
+ Program.
In October 2021, Edoardo Sarti joined ABS as Chargé de Recherche de Classe Normale. His
+ expertise spans algorithmic questions about geometrical, functional
+ and evolutionary aspects of biomolecules (latest study: ), as well as
+ the collection and analysis of large sets of molecular structural data. From the very start, E. Sarti has
+ taken part in several research and technical projects of ABS.
See report on the Structural Bioinformatics Library.
+In this work, we introduce Multiple Interface
+ String Alignment (MISA), a visualization tool to display coherently various sequence and structure
+ based statistics at protein-protein interfaces (SSE elements, buried surface area,
Illustrations are provided on the receptor binding domain (RBD) of coronaviruses, in complex with their cognate partner or (neutralizing) antibodies. MISA computed with a minimal number of structures complements and enriches findings previously reported.
+The corresponding package is available from the Structural Bioinformatics Library.
+In December 2019, the Chinese Center for Disease Control reported several cases of severe pneumonia resisting usual treatments in the city of Wuhan. This was the beginning of the COVID-19 pandemic, which caused more than 80 million infection cases and 1.7 million deaths during the year 2020 alone. This major outbreak has given rise to global public health responses as well as an international research effort of unprecedented scope and speed. This scientific mobilization has led to remarkable results, which have enabled a great deal of knowledge to be accumulated in just a few months on this novel pathogen: identification of the virus and of its main proteins, analysis of its origin and its functioning. This basic biological knowledge is a prerequisite for medical advances: designing tests, finding a vaccine or a cure.
+In this document, one year after the beginning of the worldwide spread of the disease, we wish to shed particular light on the contribution of bioinformatics to all this work. Bioinformatics is a discipline at the crossroads of computer science, mathematics and biology that has taken on an inestimable importance in modern biology and medicine. It provides the scientific community with computational models, algorithms and software that are both operational and effective. The discovery and study of the SARS-CoV-2 coronavirus is an emblematic example. The use of bioinformatics methods has been at the heart of essential milestones: from the sequencing of the virus genome and its annotation to the history of its origin, the modeling of interacting biological entities both at the molecular scale and at the network scale, and the study of the host genetic susceptibility. All these studies, as a whole, have made it possible to elucidate the nature and the functioning of the novel pathogen and have greatly contributed to the fight against COVID-19.
+Prioritizing genes for their role in drug sensitivity is an important step in understanding drug mechanisms
+ of action and discovering new molecular targets for co-treatment. In this work , we formalize this problem by considering
+ two sets of genes Genetrank, a method to prioritize
+ the genes in
+ Genetrank uses asymmetric random walks with restarts, absorbing states, and a suitable
+ renormalization scheme. Using novel so-called saturation indices, we show that the conjunction of absorbing
+ states and renormalization yields an exploration of the PPIN which is much more progressive than that afforded
+ by random walks with restarts only. Using MINT as underlying network, we apply Genetrank to
+ a predictive gene signature of cancer cell sensitivity to tumor-necrosis-factor-related apoptosis-inducing
+ ligand (TRAIL), obtained in single cells. Our ranking provides biological insights on drug sensitivity and a
+ gene set considerably enriched in genes regulating TRAIL pharmacodynamics when compared to the most
+ significant differentially expressed genes obtained from a statistical analysis framework alone. We also
+ introduce gene expression radars, a visualization tool to assess all pairwise interactions
+ at a glance.
+ Genetrank is made available in the Structural Bioinformatics Library. It should prove useful for mining gene sets in
+ conjunction with a signaling pathway, whenever other approaches yield relatively large sets of genes.
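As a point of comparison, the baseline that Genetrank builds upon, a plain random walk with restart, can be sketched as follows. This is not Genetrank itself: it has no asymmetry, no absorbing states and no renormalization, and the toy graph, the restart probability and the function name `random_walk_with_restart` are assumptions made for illustration. The sketch also assumes that every node has at least one neighbor.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-12, max_iter=10_000):
    """Stationary visit probabilities of a random walk that, at each step,
    jumps back to a uniformly chosen seed with probability `restart`."""
    nodes = sorted(adjacency)
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {v: restart * p0[v] for v in nodes}
        for u in nodes:
            # the walker leaves u towards each neighbor with equal probability
            share = (1.0 - restart) * p[u] / len(adjacency[u])
            for v in adjacency[u]:
                nxt[v] += share
        if max(abs(nxt[v] - p[v]) for v in nodes) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical interaction network: a path A - B - C - D, with A as the only seed.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = random_walk_with_restart(graph, seeds={"A"})
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Nodes are ranked by decreasing score, so nodes close to the seeds come first; the saturation effect of this baseline is what the absorbing states and the renormalization described above are meant to temper.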
Tripeptide loop closure (TLC) is a standard procedure to reconstruct protein backbone conformations, by solving a polynomial system in a single variable yielding up to 16 real solutions.
+In this work, we first show that multiprecision is required in
+ a TLC solver to guarantee the existence and the accuracy of solutions. We then compare solutions yielded by
+ the TLC solver against tripeptides from the Protein Data Bank. We show that these solutions are geometrically
+ diverse (up to
We anticipate that these insights, coupled to our robust implementation in the Structural Bioinformatics Library, will help understand the properties of TLC reconstructions, with potential applications to the generation of conformations of flexible loops in particular.
+The center of mass of a point set lying on a manifold generalizes the celebrated Euclidean centroid, and is ubiquitous in statistical analysis in non-Euclidean spaces.
+In this work, we give a complete characterization of the weighted
+
Our derivations are of interest in two respects. First, efficient
Computing the volume of a high dimensional polytope is a fundamental problem in geometry, also connected to the calculation of densities of states in statistical physics, and a central building block of such algorithms is the method used to sample a target probability distribution.
+This paper studies Hamiltonian Monte Carlo (HMC) with
+ reflections on the boundary of a domain, providing an enhanced alternative to Hit-and-run (HAR) to sample a
+ target distribution restricted to the polytope. We make three contributions. First, we provide a convergence
+ bound, paving the way to more precise mixing time analysis. Second, we present a robust implementation based
+ on multi-precision arithmetic, a mandatory ingredient to guarantee exact predicates and robust constructions.
+ We however allow controlled failures to happen, introducing the Sweeten Exact Geometric
+ Computing (SEGC) paradigm. Third, we use our HMC random walk to perform H-polytope volume calculations,
+ using it as an alternative to HAR within the volume algorithm by Cousins and Vempala. The systematic tests
+ conducted up to dimension
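To fix ideas on walks with reflections inside a polytope, here is a minimal sketch for the uniform distribution on an H-polytope {x : Ax <= b}. It is a simplified variant and not the implementation discussed above: for a uniform target the potential is constant, so trajectories are straight lines between reflections; we use a unit random direction and a fixed path length; and we rely on double precision floats with a crude clamping, whereas the work described here uses multi-precision arithmetic. The names `reflective_step` and `dot` and the unit cube example are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def reflective_step(x, A, b, length, rng, max_reflections=1000):
    """Move from x along a random direction for a total path `length`,
    reflecting specularly on the facets of the polytope {y : A y <= b}."""
    v = [rng.gauss(0.0, 1.0) for _ in x]
    norm = math.sqrt(dot(v, v))
    v = [c / norm for c in v]
    remaining = length
    for _ in range(max_reflections):
        t_hit, facet = remaining, None
        for k, (a, bk) in enumerate(zip(A, b)):
            speed = dot(a, v)
            if speed > 1e-12:  # moving towards facet k
                # clamp at 0 to absorb rounding errors when x sits on a facet
                t = max((bk - dot(a, x)) / speed, 0.0)
                if t < t_hit:
                    t_hit, facet = t, k
        x = [xi + t_hit * vi for xi, vi in zip(x, v)]
        if facet is None:
            return x  # the whole path length has been travelled
        remaining -= t_hit
        a = A[facet]
        scale = 2.0 * dot(a, v) / dot(a, a)
        v = [vi - scale * ai for vi, ai in zip(v, a)]  # specular reflection
    return x


# Unit cube in dimension 3 as an H-polytope: 0 <= x_i <= 1.
A = [[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]]
b = [1, 0, 1, 0, 1, 0]
rng = random.Random(1)
x = [0.5, 0.5, 0.5]
samples = []
for _ in range(5000):
    x = reflective_step(x, A, b, length=1.0, rng=rng)
    samples.append(x)
print([sum(s[i] for s in samples) / len(samples) for i in range(3)])
```

The clamping and the tolerance on the facet test are exactly the kind of numerical shortcuts that a robust implementation must replace with exact predicates.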
We analyze a generalization of the minimum connectivity inference problem (MCI) that models the computation of low-resolution structures of macro-molecular assemblies, based on data obtained by native mass spectrometry. The generalization studied in this work allows us to consider more refined constraints for the characterization of low resolution models of large assemblies, such as degree constraints (e.g. a protein has a limited number of other proteins in contact).
+More precisely, let
+
+
+ -overlays a hyperedge
+
Given a graph Conflict
+ Coloring consists in deciding whether exists a conflict coloring, that is a coloring in which Conflict Coloring is motivated by computational
+ structural biology problems, high resolution determination of molecular assemblies. The graph represents the
+ subunits and the interaction between them, the colors are the given conformations, and the edges of the
+ bipartite graphs are the incompatible conformations of two subunits.
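The modeling described in this paragraph can be sketched with a brute-force search, useful only on tiny instances. We assume the following reading of the problem: each subunit (node) has a list of candidate conformations (colors), and each interaction (edge) comes with a set of incompatible pairs of conformations; a conflict coloring picks one conformation per subunit without selecting any incompatible pair. The data layout and the function name `find_conflict_coloring` are hypothetical, and the exhaustive search is exponential, in line with the hardness of the general problem.

```python
from itertools import product


def find_conflict_coloring(colors, conflicts):
    """Return one conflict coloring as a dict node -> color, or None.

    colors: dict mapping each node to the list of its candidate colors.
    conflicts: dict mapping an edge (u, v) to the set of forbidden
               pairs (color of u, color of v).
    """
    nodes = list(colors)
    for choice in product(*(colors[n] for n in nodes)):
        assignment = dict(zip(nodes, choice))
        if all((assignment[u], assignment[v]) not in bad
               for (u, v), bad in conflicts.items()):
            return assignment
    return None


# Hypothetical assembly with three subunits and two conformations each.
colors = {"A": [0, 1], "B": [0, 1], "C": [0, 1]}
conflicts = {
    ("A", "B"): {(0, 0), (0, 1)},  # conformation 0 of A fits no conformation of B
    ("B", "C"): {(0, 0), (1, 1)},  # B and C must adopt different conformations
}
print(find_conflict_coloring(colors, conflicts))
```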
In this work, we first establish the complexity dichotomies (polynomial vs Conflict Coloring and its variants. We provide some experiments in which we build
+ instances of Conflict Coloring associated with Voronoi diagrams in the
+ plane, and we then analyze the existence of a solution as a function of the parameters used in our experimental
+ setup.
+
Frédéric Cazals participated in the following program committees:
+
+
+
+
+
PhD thesis:
+Interns:
+Frédéric Cazals participated in the following committees:
+Dorian Mazauric participated in the following committees:
+Dorian Mazauric:
+Frédéric Cazals:
+Dorian Mazauric:
+ +Dorian Mazauric - Fête de la Science 2021:
+Dorian Mazauric - Interventions at Maison de l'Intelligence Artificielle:
+Dorian Mazauric - Cordées de la réussite (coordonné par Université Côte d'Azur):
+Dorian Mazauric - Programme Chiche:
+Dorian Mazauric - Formations:
+Dorian Mazauric - In schools:
+Dorian Mazauric - Internships:
+ Computational Structural Biology (CSB) is the scientific domain concerned with the development of algorithms
+ and software to understand and predict the structure and function of biological macromolecules. This research
+ field is inherently multi-disciplinary. On the experimental side, biology and medicine provide the objects
+ studied, while biophysics and bioinformatics supply experimental data, which are of two main kinds. On the one
+ hand, genome sequencing projects give supply protein sequences, and ~200 millions of sequences have been
+ archived in UniProtKB/TrEMBL – which collects the protein sequences
+ yielded by genome sequencing projects. On the other hand, structure determination experiments (notably X-ray
+ crystallography, nuclear magnetic resonance, and cryo-electron microscopy) give access to geometric models of
+ molecules – atomic coordinates. Alas, only ~150,000 structures have been solved and deposited in the Protein
+ Data Bank (PDB), a number to be compared against the UniProtKB/TrEMBL. With one structure for ~1000 sequences, we hardly know anything
+ about biological functions at the atomic/structural level. Complementing experiments, physical
+ chemistry/chemical physics supply the required models (energies, thermodynamics, etc). More specifically, let us
+ recall that proteins with lock-and-key
+ metaphor for interacting molecules, Biology is based on the interactions stable conformations make with each
+ other. Turning these intuitive notions into quantitative ones requires delving into statistical physics, as
+ macroscopic properties are average properties computed over ensembles of conformations. Developing effective
+ algorithms to perform accurate simulations is especially challenging for two main reasons. The first one is the
+ high dimension of conformational spaces – see tour de force rarely achieved .
The first challenge, sequence-to-structure prediction, aims to infer the possible
+ structure(s) of a protein from its amino acid sequence. While recent progress has been made recently using in
+ particular deep learning techniques , the models obtained so far
+ are static and coarse-grained.
The second one is protein function prediction. Given a protein with known structure i.e. 3D coordinates, the goal is to predict the partners of this protein, in terms of stability
+ and specificity. This understanding is fundamental to biology and medicine, as illustrated by the example of the
+ SARS-CoV-2 virus responsible of the Covid19 pandemic. To infect a host, the virus first fuses its envelope with
+ the membrane of a target cell, and then injects its genetic material into that cell. Fusion is achieved by a
+ so-called class I fusion protein, also found in other viruses (influenza, SARS-CoV-1, HIV, etc). The fusion
+ process is a highly dynamic process involving large amplitude conformational changes of the molecules. It is
+ poorly understood, which hinders our ability to design therapeutics to block it.
Finally, the third one, large assembly reconstruction, aims at solving (coarse-grain)
+ structures of molecular machines involving tens or even hundreds of subunits. This research vein was promoted
+ about 15 years back by the work on the nuclear pore complex . It is often
+ referred to as reconstruction by data integration, as it necessitates to combine coarse-grain
+ models (notably from cryo-electron microscopy (cryo-EM) and native mass spectrometry) with atomic models of
+ subunits obtained from X ray crystallography. Fitting the latter into the former requires exploring the
+ conformation space of subunits, whence the importance of protein dynamics.
As an illustration of these three challenges, consider the problem of designing proteins blocking the entry of
+ SARS-CoV-2 into our cells (Fig. ). The first challenge is illustrated
+ by the problem of predicting the structure of a blocker protein from its sequence of amino-acids – a tractable
+ problem here since the mini proteins used only comprise of the order of 50 amino-acids (Fig. (A), ). The second
+ challenge is illustrated by the calculation of the binding modes and the binding affinity of the designed
+ proteins for the RBD of SARS-CoV-2 (Fig. (B)). Finally, the last challenge is
+ illustrated by the problem of solving structures of the virus with a cell, to understand how many spikes are
+ involved in the fusion mechanism leading to infection. In , the promising
+ designs suggested by modeling have been assessed by an array of wet lab experiments (affinity measurements,
+ circular dichroism for thermal stability assessment, structure resolution by cryo-EM). The hyperstable minibinders identified provide starting points for SARS-CoV-2 therapeutics . We note in passing that this is truly remarkable work, yet,
+ the designed proteins stem from a template (the bottom helix from ACE2), and are rather
+ small.
To present challenges in structural modeling, let us recall the following ingredients. First, a molecular model
+ with d.o.f.). Second,
+ recall that the potential energy landscape (PEL) is the mapping CHARMM, AMBER, MARTINI, etc. Such PE belong to the realm of
+ molecular mechanics, and implement atomic or coarse-grain models. They may embark a solvent model, either
+ explicit or implicit. Their definition requires a significant number of parameters (up to
These PE are usually considered good enough to study non covalent interactions – our focus, even tough they do + not cover the modification of chemical bonds. In any case, we take such a function for granted .
+The PEL codes all structural, thermodynamic, and kinetic properties,
+ which can be obtained by averaging properties of conformations over so-called thermodynamic
+ ensembles. The structure of a macromolecular system requires the characterization of
+ active conformations and important intermediates in functional pathways involving significant basins. In
+ assigning occupation probabilities to these conformations by integrating Boltzmann's distribution, one treats
+ thermodynamics. Finally, transitions between the states, modeled, say, by a master
+ equation (a continuous-time Markov process), correspond to kinetics. Classical simulation
+ methods based on molecular dynamics (MD) and Monte Carlo sampling (MC) are developed in the lineage of the
+ seminal work by the 2013 recipients of the Nobel prize in chemistry (Karplus, Levitt, Warshel), which was
+ awarded “for the development of multiscale models for complex chemical systems”. However,
+ except for highly specialized cases where massive calculations have been used , neither MD nor MC give access to the aforementioned time
+ scales. In fact, the main limitation of such methods is that they treat structural, thermodynamic and kinetic
+ aspects at once . The absence of specific insights on these three
+ complementary pieces of the puzzle makes it impossible to optimize simulation methods, and results in general in
+ the inability to obtain converged simulations on biologically relevant time-scales.
The hardness of structural modeling owes to three intertwined reasons.
+First, PELs of biomolecules usually exhibit a number of critical points exponential in the dimension; fortunately, they enjoy a multi-scale structure. Intuitively, the significant local minima/basins are those
+ which are deep or isolated/wide, two notions which are
+ formalized by the concepts of persistence and prominence. Mathematically, problems are plagued with the curse of
+ dimensionality and measure concentration phenomena. Second, biomolecular processes are inherently multi-scale,
+ with motions spanning i.e.
+ observables, are average properties computed over ensembles of conformations, which calls for a multi-scale
+ statistical treatment both of thermodynamics and kinetics.
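The notion of persistence mentioned above can be illustrated on a toy case. The sketch below computes the sublevel-set persistence of the local minima of a function sampled on a 1D grid: values are swept upward, each local minimum creates a basin, and when two basins merge at a barrier the one with the higher minimum dies. The 1D setting and the function name are assumptions made for illustration; an actual PEL is high dimensional and requires a graph of minima and saddles.

```python
def minima_persistence(values):
    """Return {index of local minimum: persistence} for a 1D sampled function.

    Illustrative 1D sketch only: persistence = barrier height at which the
    basin merges into a deeper one, minus the energy of the minimum.
    """
    n = len(values)
    parent = [None] * n  # None means "not swept yet"

    def key(k):
        return (values[k], k)  # strict order, ties broken by index

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    persistence = {}
    for i in sorted(range(n), key=key):
        parent[i] = i
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] is not None:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # The root of a component is always its lowest point.
                low, high = sorted((ri, rj), key=key)
                if high != i:  # a genuine minimum dies at barrier i
                    persistence[high] = values[i] - values[high]
                parent[high] = low
    persistence[find(0)] = float("inf")  # the global minimum never dies
    return persistence


landscape = [3.0, 1.0, 2.0, 0.0, 4.0, 2.5, 5.0]
result = minima_persistence(landscape)
```

On this example, the minimum at index 1 merges into the global minimum across a barrier of height 2.0, and the minimum at index 5 across a barrier of height 4.0. Filtering minima by a persistence threshold keeps only the significant basins.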
A critical question concerns the validation of models proposed in structural bioinformatics. For all three types of questions of interest (structures, thermodynamics, kinetics), there exist experiments against which the models must be checked, when the experiments can be conducted.
+For structures, the models proposed can readily be compared against experimental results stemming from X-ray
+ crystallography, NMR, or cryo-electron microscopy. For thermodynamics, which we illustrate here with binding
+ affinities, predictions can be compared against measurements provided by calorimetry or surface plasmon
+ resonance. Lastly, kinetic predictions can also be assessed by various experiments such as binding affinity
+ measurements (for the prediction of
Our research program aims to develop a comprehensive set of novel concepts and algorithms to study protein dynamics, based on the modular framework of PEL.
+As noted while discussing Protein dynamics: core CS - maths challenges, the integrated
+ nature of simulation methods such as MD or MC is such that these methods do not in general give access to
+ biologically relevant time scales. The framework of energy landscapes is much more
+ modular, yet large biomolecular systems remain out of reach.
To make a definitive step towards solving the prediction of protein dynamics, we will serialize the discovery
+ and the exploitation of a PEL. Ideas and
+ concepts from computational geometry/geometric motion planning, machine learning, probabilistic algorithms, and
+ numerical probability will be used to develop two classes of probabilistic algorithms. The first deals with
+ algorithms to discover/sketch PELs, i.e., enumerate all significant (persistent or prominent)
+ local minima and their connections across saddles, a difficult task since the number of all local
+ minima/critical points is generally exponential in the dimension. To this end, we will develop a hierarchical
+ data structure coding PELs as well as multi-scale proposals to explore molecular conformations. (Nb: in Monte
+ Carlo methods, a proposal generates a new conformation from an existing one.) The second focuses on methods to
+ exploit/sample PELs, i.e., compute so-called densities of states, from which all thermodynamic
+ quantities are given by standard relations. This is a hard problem akin to high-dimensional
+ numerical integration. To solve this problem, we will develop a learning based strategy for the Wang-Landau
+ algorithm, an adaptive Markov chain Monte Carlo (MCMC) algorithm, as
+ well as a generalization of multi-phase Monte Carlo methods for convex/polytope volume calculations, for
+ non-convex strata of PELs.
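As a point of reference for the density-of-states computation mentioned above, here is a minimal sketch of the textbook Wang-Landau algorithm on a toy system. It is not the learning based strategy announced in this program. The toy system is an assumption chosen because its density of states is known exactly: n binary spins with energy equal to the number of spins set to 1, so that g(E) is a binomial coefficient. Parameter names and default values are illustrative.

```python
import math
import random


def wang_landau(n_spins=8, ln_f_final=1e-3, flatness=0.8, batch=1000, seed=0):
    """Estimate ln g(E) for a toy system (assumption: E = number of spins set to 1)."""
    rng = random.Random(seed)
    spins = [0] * n_spins
    energy = 0
    n_levels = n_spins + 1
    ln_g = [0.0] * n_levels  # running estimate of ln g(E)
    hist = [0] * n_levels    # visit histogram of the current stage
    ln_f = 1.0               # modification factor, halved at each stage
    while ln_f > ln_f_final:
        for _ in range(batch):
            i = rng.randrange(n_spins)
            new_energy = energy + (1 - 2 * spins[i])  # a flip changes E by +1 or -1
            # Accept the flip with probability min(1, g(E) / g(E_new)).
            if math.log(1.0 - rng.random()) < ln_g[energy] - ln_g[new_energy]:
                spins[i] ^= 1
                energy = new_energy
            ln_g[energy] += ln_f  # penalize the visited energy level
            hist[energy] += 1
        # Move to the next stage once the histogram is flat enough.
        if min(hist) > flatness * sum(hist) / n_levels:
            hist = [0] * n_levels
            ln_f /= 2.0
    return [x - ln_g[0] for x in ln_g]  # normalize so that ln g(0) = 0


estimate = wang_landau()
exact = [math.log(math.comb(8, k)) for k in range(9)]
```

Comparing `estimate` with `exact` gives a quick check of convergence. Thermodynamic quantities at any temperature then follow from the estimated g(E) by reweighting with Boltzmann factors.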
As discussed in the previous Section, the study of PEL and protein dynamics raises difficult algorithmic / mathematical questions. As an illustration, one may consider our recent work on the comparison of high dimensional distributions, statistical tests / two-sample tests, the comparison of clusterings, the complexity study of graph inference problems for low-resolution reconstruction of assemblies, the analysis of partition (or clustering) stability in large networks, and the complexity of the representation of simplicial complexes. Making progress on such questions is fundamental to advance the state of the art on protein dynamics.
+We will continue to work on such questions, motivated by CSB / theoretical biophysics, both in the continuous (geometric) and discrete settings. The developments will be based on a combination of ideas and concepts from computational geometry, machine learning (notably on nonlinear dimensionality reduction, the reconstruction of cell complexes, and sampling methods), graph algorithms, probabilistic algorithms, optimization, numerical probability, and also biophysics.
+While our main ambition is to advance the algorithmic foundations of molecular simulation, a major challenge
+ will be to ensure that the theoretical and algorithmic developments will change the fate of applications, as
+ illustrated by our case studies. To foster such a symbiotic relationship between theory, algorithms and
+ simulation, we will pursue high quality software development and integration within the SBL,
+ and will also take the appropriate measures for the software to be widely adopted.
Software development for structural bioinformatics is especially challenging, combining advanced geometric,
+ numerical and combinatorial algorithms, with complex biophysical models for PEL and related
+ thermodynamic/kinetic properties. Specific features of the proteins studied must also be accommodated. About
+ 50 years after the development of force fields and simulation methods (see the 2013 Nobel prize in chemistry),
+ the software implementing such methods has a profound impact on molecular science at large. One can indeed
+ cite packages such as CHARMM, AMBER, gromacs, gmin, MODELLER, Rosetta, VMD, PyMol, etc. On the other hand, these packages are goal oriented, each tackling a (small set
+ of) specific goal(s). In fact, no real modular software design and integration has taken place. As a result,
+ despite the high quality software packages available, inter-operability between algorithmic building blocks
+ has remained very limited.
Predicting the dynamics of large molecular systems requires the integration of advanced algorithmic building
+ blocks / complex software components. To achieve a sufficient level of integration, we undertook the
+ development of the Structural Bioinformatics Library (SBL), a generic C++/python
+ cross-platform library providing software to solve complex problems in structural bioinformatics. For
+ end-users, the SBL provides ready to use, state-of-the-art applications to model
+ macro-molecules and their complexes at various resolutions, and also to store results in perennial and easy to
+ use data formats. For developers, the SBL provides a
+ broad C++/python toolbox with modular design. This hybrid status targeting both
+ end-users and developers stems from an advanced software design involving four software components, namely
+ applications, core algorithms, biophysical models, and modules. This modular design makes it possible to optimize
+ the robustness and the performance of individual components, which can then be assembled within a goal oriented
+ application.
Our methods will be validated on various systems for which flexibility operates at various scales. Examples of such systems are antibody-antigen complexes, (viral) polymerases, and (membrane) transporters.
+Even very complex biomolecular systems are deterministic in prescribed conditions (temperature, pH, etc),
+ demonstrating that despite their high dimensionality, not all d.o.f. are at play at the same
+ time. This insight suggests three classes of systems of particular interest. The first class consists of systems
+ defined from (essentially) rigid blocks whose relative positions change thanks to conformational changes of
+ linkers; a Newton cradle provides an interesting way to envision such a system. We have recently worked on one
+ such system, a membrane protein involved in antibiotic resistance (AcrB). The second class consists of cases where relative
+ positions of subdomains do not significantly change, yet, their intrinsic dynamics are significantly altered. A
+ classical illustration is provided by antibodies, whose binding affinity owes to dynamics localized in six
+ specific loops.
+ The third class, consisting of composite cases, will greatly benefit from insights on the first two classes. As
+ an example, we may consider the spikes of the SARS-CoV-2 virus, whose function (performing infection) involves
+ both large amplitude conformational changes and subtle dynamics of the so-called receptor binding domain. We
+ have started to investigate this system, in collaboration with B. Delmas (INRAe).
In ABS, we will investigate systems in these three tiers, together with expert collaborators, to hopefully open new perspectives in biology and medicine. Along the way, we will also collaborate on selected questions at the interface between CSB and systems biology, as it is now clear that the structural level and the systems level (pathways of interacting molecules) can benefit from one another.
+The main application domain is Computational Structural Biology, as underlined in the Research
+ Program.
In October 2021, Edoardo Sarti joined ABS as Chargé de Recherche de Classe Normale. His
+ expertise spans a diverse set of interests, from algorithmic questions about geometrical, functional
+ and evolutionary aspects of biomolecules, to
+ the collection and analysis of large sets of molecular structural data. From the very start, E. Sarti has
+ taken part in several research and technical projects of ABS.
See report on the Structural Bioinformatics Library.
+In this work, we introduce Multiple Interface
+ String Alignment (MISA), a visualization tool to display coherently various sequence and structure
+ based statistics at protein-protein interfaces (SSE elements, buried surface area,
Illustrations are provided on the receptor binding domain (RBD) of coronaviruses, in complex with their cognate partner or (neutralizing) antibodies. MISAs computed with a minimal number of structures complement and enrich findings previously reported.
+The corresponding package is available from the Structural Bioinformatics Library (
+In December 2019, the Chinese Center for Disease Control reported several cases of severe pneumonia resisting usual treatments in the city of Wuhan. This was the beginning of the COVID-19 pandemic, which caused more than 80 million infection cases and 1.7 million deaths during the year 2020 alone. This major outbreak has given rise to global public health responses as well as an international research effort of unprecedented scope and speed. This scientific mobilization has led to remarkable results, which have enabled a great deal of knowledge to be accumulated in just a few months on this novel pathogen: identification of the virus and of its main proteins, analysis of its origin and its functioning. This basic biological knowledge is essential to medical advances: designing tests, finding a vaccine or a cure.
+In this document, one year after the beginning of the worldwide spread of the disease, we wish to shed particular light on the contribution of bioinformatics in all this work. Bioinformatics is a discipline at the crossroads of computer science, mathematics and biology that has taken on an inestimable importance in modern biology and medicine. It provides the scientific community with computational models, algorithms and software that are both operational and effective. The discovery and study of the SARS-CoV-2 coronavirus is an emblematic example. The use of bioinformatics methods has been at the heart of essential milestones: from the sequencing of the virus genome and its annotation to the history of its origin, the modeling of interacting biological entities both at the molecular scale and at the network scale, and the study of the host genetic susceptibility. All these studies, as a whole, have made it possible to elucidate the nature and the functioning of the novel pathogen and have greatly contributed to the fight against COVID-19.
+Prioritizing genes for their role in drug sensitivity is an important step in understanding drug mechanisms
+ of action and discovering new molecular targets for co-treatment. In this work, we formalize this problem by considering
+ two sets of genes Genetrank, a method to prioritize
+ the genes in
Genetrank uses asymmetric random walks with restarts, absorbing states, and a suitable
+ renormalization scheme. Using novel so-called saturation indices, we show that the conjunction of absorbing
+ states and renormalization yields an exploration of the PPIN which is much more progressive than that afforded
+ by random walks with restarts only. Using MINT as underlying network, we apply Genetrank to
+ a predictive gene signature of cancer cell sensitivity to tumor-necrosis-factor-related apoptosis-inducing
+ ligand (TRAIL), performed in single cells. Our ranking provides biological insights on drug sensitivity and a
+ gene set considerably enriched in genes regulating TRAIL pharmacodynamics when compared to the most
+ significant differentially expressed genes obtained from a statistical analysis framework alone. We also
+ introduce gene expression radars, a visualization tool to assess all pairwise interactions
+ at a glance.
Genetrank is made available in the Structural Bioinformatics Library. It should prove useful for mining gene sets in
+ conjunction with a signaling pathway, whenever other approaches yield relatively large sets of genes.
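For readers unfamiliar with the baseline that Genetrank refines, the following sketch implements a plain random walk with restart on a small undirected graph, by fixed-point iteration. It does not include the asymmetric walks, absorbing states, or renormalization scheme that define Genetrank. The graph, node names, and parameter values are hypothetical, and every node is assumed to have at least one neighbor.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-12, max_iter=10000):
    """Stationary distribution of a random walk restarting uniformly on `seeds`.

    adjacency: dict node -> list of neighbors (assumption: no isolated node).
    """
    nodes = sorted(adjacency)
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        new_p = {v: restart * p0[v] for v in nodes}  # restart mass
        for u in nodes:
            share = (1.0 - restart) * p[u] / len(adjacency[u])
            for v in adjacency[u]:
                new_p[v] += share  # walk mass spread evenly over neighbors
        delta = sum(abs(new_p[v] - p[v]) for v in nodes)
        p = new_p
        if delta < tol:
            break
    return p


# Hypothetical 4-node path graph a - b - c - d, with a single seed node.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = random_walk_with_restart(graph, seeds={"a"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

Nodes are then ranked by their stationary score. In this plain version the score decays quickly with the distance to the seeds, which is the behavior the saturation indices discussed above are meant to quantify.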
Tripeptide loop closure (TLC) is a standard procedure to reconstruct protein backbone conformations, by solving a polynomial system in a single variable yielding up to 16 real solutions.
+In this work, we first show that multiprecision is required in
+ a TLC solver to guarantee the existence and the accuracy of solutions. We then compare solutions yielded by
+ the TLC solver against tripeptides from the Protein Data Bank. We show that these solutions are geometrically
+ diverse (up to
We anticipate that these insights, coupled to our robust implementation in the Structural Bioinformatics Library, will help understand the properties of TLC reconstructions, with potential applications to the generation of conformations of flexible loops in particular.
+The center of mass of a point set lying on a manifold generalizes the celebrated Euclidean centroid, and is ubiquitous in statistical analysis in non-Euclidean spaces.
+In this work, we give a complete characterization of the weighted
+
Our derivations are of interest in two respects. First, efficient
Computing the volume of a high dimensional polytope is a fundamental problem in geometry, also connected to the calculation of densities of states in statistical physics. A central building block of such algorithms is the method used to sample a target probability distribution.
+This paper studies Hamiltonian Monte Carlo (HMC) with
+ reflections on the boundary of a domain, providing an enhanced alternative to Hit-and-run (HAR) to sample a
+ target distribution restricted to the polytope. We make three contributions. First, we provide a convergence
+ bound, paving the way to more precise mixing time analysis. Second, we present a robust implementation based
+ on multi-precision arithmetic, a mandatory ingredient to guarantee exact predicates and robust constructions.
+ We however allow controlled failures to happen, introducing the Sweeten Exact Geometric
+ Computing (SEGC) paradigm. Third, we use our HMC random walk to perform H-polytope volume calculations,
+ using it as an alternative to HAR within the volume algorithm by Cousins and Vempala. The systematic tests
+ conducted up to dimension
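For context, the Hit-and-run (HAR) baseline mentioned above can be sketched in a few lines for a uniform target in a bounded H-polytope {x : Ax <= b}. This is not the HMC walk with reflections studied in the paper, and it uses plain floating point rather than multi-precision arithmetic. The function name, the tolerance on near-zero directions, and the unit-square example are assumptions; the starting point must lie strictly inside the polytope.

```python
import math
import random


def hit_and_run(A, b, x0, n_samples, seed=0):
    """Uniform Hit-and-run in a bounded polytope {x : A x <= b} (floating point sketch)."""
    rng = random.Random(seed)
    x = list(x0)
    dim = len(x)
    samples = []
    for _ in range(n_samples):
        # Draw a direction uniformly on the unit sphere.
        d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in d))
        d = [c / norm for c in d]
        # Intersect the line x + t d with every half-space a.x <= b_i.
        t_min, t_max = -math.inf, math.inf
        for row, b_i in zip(A, b):
            a_d = sum(r * c for r, c in zip(row, d))
            slack = b_i - sum(r * c for r, c in zip(row, x))
            if a_d > 1e-15:
                t_max = min(t_max, slack / a_d)
            elif a_d < -1e-15:
                t_min = max(t_min, slack / a_d)
        # Jump to a uniform point on the chord [t_min, t_max].
        t = rng.uniform(t_min, t_max)
        x = [xi + t * di for xi, di in zip(x, d)]
        samples.append(x)
    return samples


# Unit square written as an H-polytope: x <= 1, -x <= 0, y <= 1, -y <= 0.
A = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b = [1.0, 0.0, 1.0, 0.0]
points = hit_and_run(A, b, [0.5, 0.5], 2000)
mean_x = sum(p[0] for p in points) / len(points)
```

Each step only needs the chord through the current point, which makes HAR simple but slow to mix in elongated polytopes; this is the kind of limitation that motivates alternative walks.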
We analyze a generalization of the minimum connectivity inference problem (MCI) that models the computation of low-resolution structures of macro-molecular assemblies, based on data obtained by native mass spectrometry. The generalization studied in this work allows us to consider more refined constraints for the characterization of low resolution models of large assemblies, such as degree constraints (e.g. a protein has a limited number of other proteins in contact).
+More precisely, let -overlays a hyperedge
+
Given a graph Conflict
+ Coloring consists in deciding whether exists a conflict coloring, that is a coloring in which Conflict Coloring is motivated by computational
+ structural biology problems, high resolution determination of molecular assemblies. The graph represents the
+ subunits and the interaction between them, the colors are the given conformations, and the edges of the
+ bipartite graphs are the incompatible conformations of two subunits.
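The biological reading given above suggests the following small sketch, which searches for a conflict coloring by exhaustive backtracking. It rests on an assumed reading of the problem: each vertex (subunit) picks one color (conformation) from its own list, and for every edge the chosen pair must not be listed as incompatible. The data layout and function name are hypothetical, and the search is exponential in the worst case, so it is only meant for tiny instances.

```python
def find_conflict_coloring(edges, colors, conflicts):
    """Return a dict vertex -> color avoiding all conflicts, or None.

    edges: list of (u, v); colors: dict vertex -> list of candidate colors;
    conflicts: dict (u, v) -> set of incompatible (color_u, color_v) pairs.
    Assumed definition: no edge may carry an incompatible pair.
    """
    vertices = sorted(colors)
    neighbors = {v: [] for v in vertices}
    forbidden = {}
    for u, v in edges:
        pairs = conflicts.get((u, v), set())
        forbidden[(u, v)] = set(pairs)
        forbidden[(v, u)] = {(cv, cu) for cu, cv in pairs}  # reversed orientation
        neighbors[u].append(v)
        neighbors[v].append(u)

    assignment = {}

    def backtrack(k):
        if k == len(vertices):
            return True
        v = vertices[k]
        for c in colors[v]:
            compatible = all(
                (c, assignment[w]) not in forbidden[(v, w)]
                for w in neighbors[v] if w in assignment
            )
            if compatible:
                assignment[v] = c
                if backtrack(k + 1):
                    return True
                del assignment[v]
        return False

    return dict(assignment) if backtrack(0) else None


# Three subunits with two conformations each (hypothetical instance).
edges = [("A", "B"), ("B", "C"), ("A", "C")]
colors = {"A": [0, 1], "B": [0, 1], "C": [0, 1]}
differ = {(0, 0), (1, 1)}  # identical conformations are incompatible
solvable = find_conflict_coloring(
    edges, colors, {("A", "B"): differ, ("B", "C"): differ, ("A", "C"): {(0, 1)}}
)
unsolvable = find_conflict_coloring(
    edges, colors, {("A", "B"): differ, ("B", "C"): differ, ("A", "C"): differ}
)
```

The second instance has no solution because three mutually incompatible subunits cannot be assigned only two conformations.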
In this work, we first establish the complexity dichotomies (polynomial vs Conflict Coloring and its variants. We provide some experiments in which we build
+ instances of Conflict Coloring associated to Voronoi diagrams in the
+ plane, and we then analyse the existence of a solution in relation to the parameters used in our experimental
+ setup.
Frédéric Cazals participated in the following program committees:
+PhD thesis:
+Interns:
+Frédéric Cazals participated in the following committees:
+Dorian Mazauric participated in the following committees:
+Dorian Mazauric:
+Frédéric Cazals:
+Dorian Mazauric:
+Dorian Mazauric - Fête de la Science 2021:
+Dorian Mazauric - Interventions at Maison de l'Intelligence Artificielle:
+Dorian Mazauric - Cordées de la réussite (coordonné par Université Côte d'Azur):
+Dorian Mazauric - Programme Chiche:
+Dorian Mazauric - Formations:
+Dorian Mazauric - In schools:
+Dorian Mazauric - Internships: