-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SEQRES parsing #576
base: master
Are you sure you want to change the base?
SEQRES parsing #576
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a different alignment implementation, since you based your implementation on code that is way stricter licensed than the apache license we use.
A solution to make everyone happy, remove the NW alignment altogether and just check the seqres directly against the pdb entries. (thanks @Tsjerk for the discussion) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it :)
I do have some questions and comments, but in general I'm a fan.
I also approve of the decision of just ripping out the alignment for now.
chains = list(list(np.unique(list(mol.nodes[idx]['chain'] for idx in mol))) | ||
for mol in self.molecules) | ||
|
||
properties = {a: [] for a in [x for xs in chains for x in xs]} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the very least this needs a comment with what chains
looks like.
The line to parse. | ||
|
||
""" | ||
self._seqres.append(line) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO it would be better to deal with the syntax of the SEQRES line here, and indeed deal with the semantics in do_seqres
for line in self._seqres: | ||
chain = line[11] | ||
resnames = line[19:].split() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned earlier, I suggest moving this to the method seqres
. This also makes dealing with malformatted seqres lines easier
# this to make a flat list for each chain in the file | ||
for chain in properties.keys(): | ||
properties[chain] = [x for xs in properties[chain] for x in xs] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honest question: do we want a list or str?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keeping this as a list makes element comparison/indexing by resid easier later on
LOGGER.info("Checking pdb SEQRES entry for missing residues", type="step") | ||
for mol in self.molecules: | ||
resids = np.array([mol.nodes[idx]['resid'] for idx in mol],dtype = int) | ||
chain = list(set([mol.nodes[idx]['chain'] for idx in mol]))[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you select only the first item?
chain = list(set([mol.nodes[idx]['chain'] for idx in mol]))[0] | |
chain = next(iter(set([mol.nodes[idx]['chain'] for idx in mol]))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the list should only contain a single item, it could equally say mol.nodes[0]['chain']
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright. Might be worth adding an assert
then
missing_res_nos = np.setdiff1d(np.arange(len(properties[chain]))+1, | ||
np.unique(resids)[np.unique(resids) < len(properties[chain])+1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs a comment :)
Does this assume resid numbers are sane?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added. The < in the pdb residues array ensures that any residues that are larger (eg. for solvent molecules attached to the chain) are excluded, if that's what you mean by sanity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that the residnumbers have to be in order? Do solvents etc always have to be last? I guess the real solution is to do the alignment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the resids have to necessarily be in order, they just have to exist. I agree that it's a problem if there are solvent residues listed first though, not sure how to get around that...
I think increasingly there may have to be some kind of alignment included. I've now found a problem if there are eg. expression tags, which will be listed in the seqres, but not actually be part of the 'proper' sequence. For this, either we bring back some kind of alignment, or I parse SEQADV for this purpose. I think the latter should be easy enough to do given it will only be used internally for SEQRES comparison. This is also related to your next comment about checking that the apparently matching resnames are correct. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your call. I don't enough about the pdb format (and not really time to look it up) to say anything about the SEQADV. I do think the alignment is (going to be) important, but that can be medium-long term. For now I see 2 options: 1) SEQADV; 2) if it doesn't match, complain. You have to see whether that would give too many false positives, in which case you may want to make it an INFO message rather than warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why/when would alignment be important? In files that have sequence, it should correspond in a simple, literal way, so can be done without fancy alignment. But those files will also have REMARK 465, i dicating which residues in the sequence are not resolved. For files without sequence, there is nothing to align to, fancy or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iff the seqres does not match the found residues, where is it wrong? And can you guarantee that all pdb files that have a seqres have remark 465? I don't dare make any assumptions about pdb files, based on my experience so far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the seqres does not match, someone has deliberately changed the coordinates section, and did not update SEQRES. I guess a deliberate change is meant to be processed. If there is a SEQRES, it's probably good to use it for sanity check, but not using NW or other alignment, because breaks may be assigned wrongly. Use it only for simple verbatim checking, preferably in conjunction with REMARK 465, and if it doesn't match, it makes sense to issue an error; the user can then discard the SEQRES record if that's what was meant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The question is more about what you do when the seqres doesn't match.
- You warn: "Seqres is wrong, good luck!"
- You assume Remark 465 is correct and info after checking: "Seqres is wrong, but the deviations are specified in R465"
- You warn: "Seqres is wrong and this is not described in R465. Probably, residues X-Z are missing, residue ALA53 was changed to PHE. Sort out your mess."
As user, I would like warning 3 most. But I'm completely fine with just 1 or 2.
Partly this also tied in with how much we trust CHAIN identifiers, but I guess it would be sane to assume they're at least consistent within the pdb file (i.e. between seqres and the coordinate section)
#find consecutive sequences of missing residues | ||
nums = sorted(set(missing_res_nos)) | ||
gaps = [[s, e] for s, e in zip(nums, nums[1:]) if s + 1 < e] | ||
edges = iter(nums[:1] + sum(gaps, []) + nums[-1:]) | ||
series = list(zip(edges, edges)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this gets the TODO: alignment?
In addition, do we want to check that the resnames are correct?
ATOM 11 CB SER A 2 25.020 11.833 71.196 1.00 33.13 C | ||
ATOM 12 OG SER A 2 24.049 10.863 71.549 1.00 34.53 O | ||
''', True))) | ||
def test_seqres(caplog, pdbstr, status): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a test without a SEQRES entry, and one where there are enough residues, but they have different resnames
First go at adding parsing of SEQRES section of PDB files to check the residues present with coordinates in the pdb vs. the SEQRES entry. The check will also do basic sequence alignment to try and advise which residues are missing, and raise a warning to say which ones are.
For the future, would be good to propagate the missing residues in some form for coordinate rebuilding.