|
Open Access
Abstract: The accuracy of the information in the Protein Data Bank (PDB) is of great importance for the myriad downstream applications that make use of protein structural information. Despite best efforts, the occasional introduction of errors is inevitable, especially where the experimental data are of limited resolution. A novel protein structure validation approach based on spotting inconsistencies between the residue contacts and distances observed in a structural model and those computationally predicted by methods such as AlphaFold2 has previously been established. It is particularly well suited to the detection of register errors. Importantly, this new approach is orthogonal to traditional methods based on stereochemistry or map–model agreement, and is resolution independent. Here, thousands of likely register errors are identified by scanning 3–5 Å resolution structures in the PDB. Unlike most methods, this approach yields suggested corrections to the register of affected regions, which are shown, even in a limited implementation, to lead to improved refinement statistics in the vast majority of cases. A few limitations and confounding factors, such as fold-switching proteins, are characterized, but this approach is expected to have broad application in spotting potential issues in current accessions and, through its implementation and distribution in CCP4, in helping to ensure the accuracy of future depositions.
|
Nov 2024
|
|
|
Jon Agirre, Mihaela Atanasova, Haroldas Bagdonas, Charles B. Ballard, Arnaud Basle, James Beilsten-Edmands, Rafael J. Borges, David G. Brown, J. Javier Burgos-Marmol, John M. Berrisford, Paul S. Bond, Iracema Caballero, Lucrezia Catapano, Grzegorz Chojnowski, Atlanta G. Cook, Kevin D. Cowtan, Tristan I. Croll, Judit É. Debreczeni, Nicholas E. Devenish, Eleanor J. Dodson, Tarik R. Drevon, Paul Emsley, Gwyndaf Evans, Phil R. Evans, Maria Fando, James Foadi, Luis Fuentes-Montero, Elspeth F. Garman, Markus Gerstel, Richard J. Gildea, Kaushik Hatti, Maarten L. Hekkelman, Philipp Heuser, Soon Wen Hoh, Michael A. Hough, Huw T. Jenkins, Elisabet Jiménez, Robbie P. Joosten, Ronan M. Keegan, Nicholas Keep, Eugene B. Krissinel, Petr Kolenko, Oleg Kovalevskiy, Victor S. Lamzin, David M. Lawson, Andrey Lebedev, Andrew G. W. Leslie, Bernhard Lohkamp, Fei Long, Martin Maly, Airlie McCoy, Stuart J. McNicholas, Ana Medina, Claudia Millán, James W. Murray, Garib N. Murshudov, Robert A. Nicholls, Martin E. M. Noble, Robert Oeffner, Navraj S. Pannu, James M. Parkhurst, Nicholas Pearce, Joana Pereira, Anastassis Perrakis, Harold R. Powell, Randy J. Read, Daniel J. Rigden, William Rochira, Massimo Sammito, Filomeno Sanchez Rodriguez, George M. Sheldrick, Kathryn L. Shelley, Felix Simkovic, Adam J. Simpkin, Pavol Skubak, Egor Sobolev, Roberto A. Steiner, Kyle Stevenson, Ivo Tews, Jens M. H. Thomas, Andrea Thorn, Josep Triviño Valls, Ville Uski, Isabel Uson, Alexei Vagin, Sameer Velankar, Melanie Vollmar, Helen Walden, David Waterman, Keith S. Wilson, Martyn Winn, Graeme Winter, Marcin Wojdyr, Keitaro Yamashita
Open Access
Abstract: The Collaborative Computational Project No. 4 (CCP4) is a UK-led international collective with a mission to develop, test, distribute and promote software for macromolecular crystallography. The CCP4 suite is a multiplatform collection of programs brought together by familiar execution routines, a set of common libraries and graphical interfaces. The CCP4 suite has experienced several considerable changes since its last reference article, involving new infrastructure, original programs and graphical interfaces. This article, which is intended as a general literature citation for the use of the CCP4 software suite in structure determination, will guide the reader through such transformations, offering a general overview of the new features and outlining future developments. As such, it aims to highlight the individual programs that comprise the suite and to provide the latest references to them for perusal by crystallographers around the world.
|
Jun 2023
|
|
VMXm-Versatile Macromolecular Crystallography microfocus
|
Abstract: The work of this thesis focuses on the development of novel computational methods for the determination of protein structures through macromolecular X-ray crystallography (MX). The main focus of Chapters 2 and 3 is the development of alternative molecular replacement (MR) approaches in cases where no related structure is available as a search model. In Chapter 2, the performance of a library of helical ensembles created by clustering helical segments is explored. A 30% increase in the number of solutions obtained using these search models was observed when compared with the performance recorded for single-model ideal helices. In Chapter 3, SWAMP is presented: a novel pipeline for the solution of structures of transmembrane proteins. SWAMP includes a library of ensembles built by clustering commonly observed packings of transmembrane helical pairs in close contact. The search models in this library are then ranked based on the similarity between their observed residue contacts and the contacts predicted for the unknown structure. Results show that SWAMP is capable of detecting valid search models originating from unrelated solved structures solely exploiting this contact information. In Chapter 4, the main focus of the work presented remains MR, particularly, the importance of experimental data collection and the quality of the obtained diffraction data. Specifically, the relation between data completeness, the distribution of missing reflections and the quality of the maps obtained through MR is studied. For this purpose, a set of new metrics for the distribution of missing reflections in the reciprocal lattice are proposed, and a large-scale study to assess the effects of data incompleteness on MR outcome is carried out. Results revealed low resolution completeness as a major factor affecting the quality of the maps obtained through MR, highlighting the importance of low resolution reflections in the process of MR. 
Overall data completeness, signal-to-noise ratio and search model quality were also observed to determine, in conjunction, the MR outcome. In Chapter 5, new metrics for model validation are presented. These metrics are based on the availability of accurate inter-residue distance predictions, which are compared with the distances observed in the emerging model. These metrics were fed into a support vector machine classifier that was trained to detect model errors based on historical data from the EM Validation Challenges. Possible register errors are analysed further by aligning the predicted contact map with the map inferred from the contacts observed in the model. Regions of the model where the maximum contact overlap is achieved through a sequence register different to that observed in the model are flagged, and the optimal sequence register can then be used to fix the register error. Results suggest that both the detection of model errors and the correction of sequence register errors are possible, even in challenging cases, through the use of the trained classifier in conjunction with the contact map alignment. The approach, implemented in ConKit, thus provides a new tool for protein structure validation that is orthogonal to existing methods. Lastly, in Chapter 6, ConPlot is presented: a web-based application which uses the typically empty space near the residue contact map diagonal to display multiple coloured tracks representing other sequence-based predictions. The integration of these different sources of information enables researchers to easily analyse a variety of data simultaneously.
|
Feb 2023
|
|
|
Open Access
Abstract: Determination of protein structures typically entails building a model that satisfies the collected experimental observations and its deposition in the Protein Data Bank. Experimental limitations can lead to unavoidable uncertainties during the process of model building, which result in the introduction of errors into the deposited model. Many metrics are available for model validation, but most are limited to consideration of the physico-chemical aspects of the model or its match to the experimental data. The latest advances in the field of deep learning have enabled the increasingly accurate prediction of inter-residue distances, an advance which has played a pivotal role in the recent improvements observed in the field of protein ab initio modelling. Here, new validation methods are presented based on the use of these precise inter-residue distance predictions, which are compared with the distances observed in the protein model. Sequence-register errors are particularly clearly detected and the register shifts required for their correction can be reliably determined. The method is available in the ConKit package (https://www.conkit.org).
|
Dec 2022
|
|
I04-Macromolecular Crystallography
|
Olga V. Moroz, Elena Blagova, Andrey A. Lebedev, Filomeno Sanchez Rodriguez, Daniel J. Rigden, Jeppe Wegener Tams, Reinhard Wilting, Jan Kjølhede Vester, Emily Longhi, Gustav Hammerich Hansen, Kristian Bertel Rømer Mørkeberg Krogh, Roland A. Pache, Gideon Davies, Keith S. Wilson
Diamond Proposal Number(s): [18598]
Abstract: β-Galactosidases catalyse the hydrolysis of lactose into galactose and glucose; as an alternative reaction, some β-galactosidases also catalyse the formation of galactooligosaccharides by transglycosylation. Both reactions have industrial importance: lactose hydrolysis is used to produce lactose-free milk, while galactooligosaccharides have been shown to act as prebiotics. For some multi-domain β-galactosidases, the hydrolysis/transglycosylation ratio can be modified by the truncation of carbohydrate-binding modules. Here, an analysis of BbgIII, a multidomain β-galactosidase from Bifidobacterium bifidum, is presented. The X-ray structure of an intact protein corresponding to a gene construct of eight domains has been determined. The use of evolutionary covariance-based predictions made sequence docking in low-resolution areas of the model spectacularly easy, confirming the relevance of this rapidly developing deep-learning-based technique for model building. The structure revealed two alternative orientations of the CBM32 carbohydrate-binding module relative to the GH2 catalytic domain in the six crystallographically independent chains. In one orientation the CBM32 domain covers the entrance to the active site of the enzyme, while in the other orientation the active site is open, suggesting a possible mechanism for switching between the two activities of the enzyme, namely lactose hydrolysis and transgalactosylation. The location of the carbohydrate-binding site of the CBM32 domain on the opposite side of the module to where it comes into contact with the catalytic GH2 domain is consistent with its involvement in adherence to host cells. The role of the CBM32 domain in switching between hydrolysis and transglycosylation modes offers protein-engineering opportunities for selective β-galactosidase modification for industrial purposes in the future.
|
Dec 2021
|
|
|
Open Access
Abstract: We report here an assessment of the model refinement category of the 14th round of Critical Assessment of Structure Prediction (CASP14). As before, predictors submitted up to five ranked refinements, along with associated residue-level error estimates, for targets that had a wide range of starting quality. The ability of groups to accurately rank their submissions and to predict coordinate error varied widely. Overall, only four groups out-performed a “naïve predictor” corresponding to resubmission of the starting model. Among the top groups there are interesting differences of approach and in the spread of improvements seen: some methods are more conservative, others more adventurous. Some targets were “double-barrelled”: for these, predictors were offered a high-quality AlphaFold 2 (AF2)-derived prediction alongside another of lower quality. The AF2-derived models were largely unimprovable, many of their apparent errors being found to reside at domain and, especially, crystal lattice contacts. Refinement is shown to have a mixed impact overall on structure-based function annotation methods to predict nucleic acid binding, spot catalytic sites and dock protein structures.
|
Jul 2021
|
|
|
Open Access
Abstract: Covariance-based predictions of residue contacts and inter-residue distances are an increasingly popular data type in protein bioinformatics. Here we present ConPlot, a web-based application for convenient display and analysis of contact maps and distograms. Integration of predicted contact data with other predictions is often required to facilitate inference of structural features. ConPlot can therefore use the empty space near the contact map diagonal to display multiple coloured tracks representing other sequence-based predictions. Popular file formats are natively read and bespoke data can also be flexibly displayed. This novel visualisation will enable easier interpretation of predicted contact maps.
|
Jan 2021
|
|
|
Open Access
Abstract: The conventional approach in molecular replacement is the use of a related structure as a search model. However, this is not always possible as the availability of such structures can be scarce for poorly characterized families of proteins. In these cases, alternative approaches can be explored, such as the use of small ideal fragments that share high, albeit local, structural similarity with the unknown protein. Earlier versions of AMPLE enabled the trialling of a library of ideal helices, which worked well for largely helical proteins at suitable resolutions. Here, the performance of libraries of helical ensembles created by clustering helical segments is explored. The impacts of different B-factor treatments and different degrees of structural heterogeneity are explored. A 30% increase in the number of solutions obtained by AMPLE was observed when using this new set of ensembles compared with the performance with ideal helices. The boost in performance was notable across three different fold classes: transmembrane, globular and coiled-coil structures. Furthermore, the increased effectiveness of these ensembles was coupled to a reduction in the time required by AMPLE to reach a solution. AMPLE users can now take full advantage of this new library of search models by activating the 'helical ensembles' mode.
|
Oct 2020
|
|