Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SEQRES parsing #576

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open

Conversation

csbrasnett
Copy link
Collaborator

@csbrasnett csbrasnett commented Feb 28, 2024

First go at adding parsing of SEQRES section of PDB files to check the residues present with coordinates in the pdb vs. the SEQRES entry. The check will also do basic sequence alignment to try and advise which residues are missing, and raise a warning to say which ones are.

For the future, would be good to propagate the missing residues in some form for coordinate rebuilding.

Copy link
Member

@pckroon pckroon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a different alignment implementation, since you based your implementation on code that is way stricter licensed than the apache license we use.

vermouth/pdb/nwalign.py Outdated Show resolved Hide resolved
@csbrasnett
Copy link
Collaborator Author

csbrasnett commented Mar 1, 2024

A solution to make everyone happy, remove the NW alignment altogether and just check the seqres directly against the pdb entries. (thanks @Tsjerk for the discussion)

Copy link
Member

@pckroon pckroon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it :)
I do have some questions and comments, but in general I'm a fan.
I also approve of the decision of just ripping out the alignment for now.

Comment on lines +466 to +469
chains = list(list(np.unique(list(mol.nodes[idx]['chain'] for idx in mol)))
for mol in self.molecules)

properties = {a: [] for a in [x for xs in chains for x in xs]}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the very least this needs a comment with what chains looks like.

The line to parse.

"""
self._seqres.append(line)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO it would be better to deal with the syntax of the SEQRES line here, and indeed deal with the semantics in do_seqres

Comment on lines +471 to +473
for line in self._seqres:
chain = line[11]
resnames = line[19:].split()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned earlier, I suggest moving this to the method seqres. This also makes dealing with malformatted seqres lines easier

Comment on lines +485 to +487
# this to make a flat list for each chain in the file
for chain in properties.keys():
properties[chain] = [x for xs in properties[chain] for x in xs]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honest question: do we want a list or str?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keeping this as a list makes element comparison/indexing by resid easier later on

vermouth/pdb/pdb.py Show resolved Hide resolved
vermouth/pdb/pdb.py Show resolved Hide resolved
LOGGER.info("Checking pdb SEQRES entry for missing residues", type="step")
for mol in self.molecules:
resids = np.array([mol.nodes[idx]['resid'] for idx in mol],dtype = int)
chain = list(set([mol.nodes[idx]['chain'] for idx in mol]))[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you select only the first item?

Suggested change
chain = list(set([mol.nodes[idx]['chain'] for idx in mol]))[0]
chain = next(iter(set([mol.nodes[idx]['chain'] for idx in mol])))

Copy link
Collaborator Author

@csbrasnett csbrasnett Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the list should only contain a single item, it could equally say mol.nodes[0]['chain']

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright. Might be worth adding an assert then

Comment on lines +500 to +501
missing_res_nos = np.setdiff1d(np.arange(len(properties[chain]))+1,
np.unique(resids)[np.unique(resids) < len(properties[chain])+1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a comment :)
Does this assume resid numbers are sane?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added. The < in the pdb residues array ensures that any residues that are larger (eg. for solvent molecules attached to the chain) are excluded, if that's what you mean by sanity?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that the residnumbers have to be in order? Do solvents etc always have to be last? I guess the real solution is to do the alignment?

Copy link
Collaborator Author

@csbrasnett csbrasnett Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the resids have to necessarily be in order, they just have to exist. I agree that it's a problem if there are solvent residues listed first though, not sure how to get around that...

I think increasingly there may have to be some kind of alignment included. I've now found a problem if there are eg. expression tags, which will be listed in the seqres, but not actually be part of the 'proper' sequence. For this, either we bring back some kind of alignment, or I parse SEQADV for this purpose. I think the latter should be easy enough to do given it will only be used internally for SEQRES comparison. This is also related to your next comment about checking that the apparently matching resnames are correct. Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your call. I don't enough about the pdb format (and not really time to look it up) to say anything about the SEQADV. I do think the alignment is (going to be) important, but that can be medium-long term. For now I see 2 options: 1) SEQADV; 2) if it doesn't match, complain. You have to see whether that would give too many false positives, in which case you may want to make it an INFO message rather than warning.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why/when would alignment be important? In files that have sequence, it should correspond in a simple, literal way, so can be done without fancy alignment. But those files will also have REMARK 465, i dicating which residues in the sequence are not resolved. For files without sequence, there is nothing to align to, fancy or not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iff the seqres does not match the found residues, where is it wrong? And can you guarantee that all pdb files that have a seqres have remark 465? I don't dare make any assumptions about pdb files, based on my experience so far.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the seqres does not match, someone has deliberately changed the coordinates section, and did not update SEQRES. I guess a deliberate change is meant to be processed. If there is a SEQRES, it's probably good to use it for sanity check, but not using NW or other alignment, because breaks may be assigned wrongly. Use it only for simple verbatim checking, preferably in conjunction with REMARK 465, and if it doesn't match, it makes sense to issue an error; the user can then discard the SEQRES record if that's what was meant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is more about what you do when the seqres doesn't match.

  1. You warn: "Seqres is wrong, good luck!"
  2. You assume Remark 465 is correct and info after checking: "Seqres is wrong, but the deviations are specified in R465"
  3. You warn: "Seqres is wrong and this is not described in R465. Probably, residues X-Z are missing, residue ALA53 was changed to PHE. Sort out your mess."

As user, I would like warning 3 most. But I'm completely fine with just 1 or 2.

Partly this also tied in with how much we trust CHAIN identifiers, but I guess it would be sane to assume they're at least consistent within the pdb file (i.e. between seqres and the coordinate section)

Comment on lines +503 to +507
#find consecutive sequences of missing residues
nums = sorted(set(missing_res_nos))
gaps = [[s, e] for s, e in zip(nums, nums[1:]) if s + 1 < e]
edges = iter(nums[:1] + sum(gaps, []) + nums[-1:])
series = list(zip(edges, edges))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this gets the TODO: alignment?
In addition, do we want to check that the resnames are correct?

ATOM 11 CB SER A 2 25.020 11.833 71.196 1.00 33.13 C
ATOM 12 OG SER A 2 24.049 10.863 71.549 1.00 34.53 O
''', True)))
def test_seqres(caplog, pdbstr, status):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a test without a SEQRES entry, and one where there are enough residues, but they have different resnames

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants