SEQRES parsing #576

csbrasnett · 2024-02-28T09:50:31Z

First go at parsing the SEQRES section of PDB files, to check the residues that have coordinates in the PDB against the SEQRES entry. The check also does a basic sequence alignment to try to work out which residues are missing, and raises a warning listing them.

In future, it would be good to propagate the missing residues in some form for coordinate rebuilding.

pckroon

We need a different alignment implementation, since you based yours on code under a much stricter license than the Apache license we use.

vermouth/pdb/nwalign.py

csbrasnett · 2024-03-01T13:22:21Z

A solution to make everyone happy: remove the NW alignment altogether and just check the SEQRES directly against the PDB entries (thanks @Tsjerk for the discussion).

pckroon

I like it :)
I do have some questions and comments, but in general I'm a fan.
I also approve of the decision of just ripping out the alignment for now.

pckroon · 2024-03-01T13:39:04Z

vermouth/pdb/pdb.py

+        chains = list(list(np.unique(list(mol.nodes[idx]['chain'] for idx in mol)))
+                      for mol in self.molecules)
+
+        properties = {a: [] for a in [x for xs in chains for x in xs]}


At the very least this needs a comment with what chains looks like.
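As an illustration of what such a comment could describe, here is a minimal sketch of the shape of `chains` and `properties`. The plain dicts standing in for `self.molecules` are an assumption made for this example; real vermouth molecules are graph objects, and `sorted` over a set replaces `np.unique` only to keep the sketch free of dependencies.

```python
# Hypothetical stand-in for self.molecules: one dict per molecule,
# mapping node index -> node attributes.
molecules = [
    {0: {"chain": "A", "resid": 1}, 1: {"chain": "A", "resid": 2}},
    {0: {"chain": "B", "resid": 1}},
]

# chains holds one sorted list of unique chain IDs per molecule,
# so for the data above it is a list of lists of single-letter strings.
chains = [sorted({attrs["chain"] for attrs in mol.values()}) for mol in molecules]

# properties maps every chain ID found in any molecule to an empty list,
# which is later filled with the SEQRES residue names of that chain.
properties = {chain: [] for mol_chains in chains for chain in mol_chains}

print(chains)
print(properties)
```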

pckroon · 2024-03-01T13:40:16Z

vermouth/pdb/pdb.py

+            The line to parse.
+
+        """
+        self._seqres.append(line)


IMHO it would be better to deal with the syntax of the SEQRES line here, and indeed deal with the semantics in do_seqres

pckroon · 2024-03-01T13:40:56Z

vermouth/pdb/pdb.py

+        for line in self._seqres:
+            chain = line[11]
+            resnames = line[19:].split()


As mentioned earlier, I suggest moving this to the `seqres` method. That also makes it easier to deal with malformed SEQRES lines.
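A rough sketch of what handling the syntax at read time might look like. The function name `parse_seqres_line` and its error messages are hypothetical and not part of the PR; the column positions are the ones used in the diff above (chain ID at index 11, residue names from index 19 onwards).

```python
def parse_seqres_line(line):
    """Split one SEQRES record into (chain, resnames). Syntax only.

    Hypothetical helper; what the residues mean is left to the caller.
    """
    if not line.startswith("SEQRES"):
        raise ValueError(f"Not a SEQRES record: {line!r}")
    if len(line) < 20:
        raise ValueError(f"SEQRES record too short: {line!r}")
    chain = line[11]
    resnames = line[19:].split()
    if not resnames:
        raise ValueError(f"SEQRES record lists no residues: {line!r}")
    return chain, resnames


# Build a well-formed record from its fields rather than counting spaces by hand.
record = "SEQRES {:>3} {} {:>4}  {}".format(1, "A", 4, "MET SER ALA GLY")

# Accumulate one flat list of residue names per chain.
seqres = {}
chain, resnames = parse_seqres_line(record)
seqres.setdefault(chain, []).extend(resnames)
print(seqres)
```

With the parsing done per line, the later check only has to compare per-chain lists and no longer needs a separate flattening step.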

pckroon · 2024-03-01T13:42:56Z

vermouth/pdb/pdb.py

+        # this to make a flat list for each chain in the file
+        for chain in properties.keys():
+            properties[chain] = [x for xs in properties[chain] for x in xs]


Honest question: do we want a list or str?

keeping this as a list makes element comparison/indexing by resid easier later on

pckroon · 2024-03-01T13:44:50Z

vermouth/pdb/pdb.py

+        LOGGER.info("Checking pdb SEQRES entry for missing residues", type="step")
+        for mol in self.molecules:
+            resids = np.array([mol.nodes[idx]['resid'] for idx in mol],dtype = int)
+            chain = list(set([mol.nodes[idx]['chain'] for idx in mol]))[0]


Why do you select only the first item?

Suggested change

-            chain = list(set([mol.nodes[idx]['chain'] for idx in mol]))[0]
+            chain = next(iter(set([mol.nodes[idx]['chain'] for idx in mol])))

the list should only contain a single item, it could equally say mol.nodes[0]['chain']

Alright. Might be worth adding an assert then
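One possible form of that assert, as a sketch. The helper name `single_chain` and the plain dict standing in for `mol.nodes` are assumptions made for illustration only.

```python
def single_chain(mol_nodes):
    """Return the chain ID of a molecule, asserting that there is exactly one.

    mol_nodes is a hypothetical stand-in for mol.nodes: node index -> attributes.
    """
    chains = {attrs["chain"] for attrs in mol_nodes.values()}
    assert len(chains) == 1, f"Expected one chain per molecule, found {sorted(chains)}"
    return next(iter(chains))


nodes = {0: {"chain": "A"}, 1: {"chain": "A"}}
print(single_chain(nodes))
```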

pckroon · 2024-03-01T13:45:52Z

vermouth/pdb/pdb.py

+            missing_res_nos = np.setdiff1d(np.arange(len(properties[chain]))+1,
+                                       np.unique(resids)[np.unique(resids) < len(properties[chain])+1])


Needs a comment :)
Does this assume resid numbers are sane?

added. The < in the pdb residues array ensures that any residues that are larger (eg. for solvent molecules attached to the chain) are excluded, if that's what you mean by sanity?
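To make the logic of that expression concrete, here is a small worked example using plain Python sets instead of `np.setdiff1d`. The residue names and resids are made up for illustration.

```python
# Hypothetical SEQRES entry for one chain (five residues).
seqres = ["MET", "SER", "ALA", "GLY", "LEU"]
# Resids found in the coordinate section; 101 stands for e.g. a solvent
# molecule that carries the same chain ID.
resids = [2, 2, 3, 5, 5, 101]

# Residue numbers the SEQRES entry implies: 1 .. len(seqres).
expected = set(range(1, len(seqres) + 1))
# Resids present in the file, ignoring anything beyond the SEQRES length.
present = {r for r in resids if r <= len(seqres)}

missing_res_nos = sorted(expected - present)
print(missing_res_nos)
```

Note that this only works if the chain is numbered from 1 and any extra residues have resids larger than the SEQRES length, which is the assumption being questioned in this thread.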

Does this mean that the resid numbers have to be in order? Do solvents etc. always have to be last? I guess the real solution is to do the alignment?

I don't think the resids have to necessarily be in order, they just have to exist. I agree that it's a problem if there are solvent residues listed first though, not sure how to get around that...

I think increasingly there may have to be some kind of alignment included. I've now found a problem if there are eg. expression tags, which will be listed in the seqres, but not actually be part of the 'proper' sequence. For this, either we bring back some kind of alignment, or I parse SEQADV for this purpose. I think the latter should be easy enough to do given it will only be used internally for SEQRES comparison. This is also related to your next comment about checking that the apparently matching resnames are correct. Thoughts?

Your call. I don't know enough about the pdb format (and don't really have time to look it up) to say anything about SEQADV. I do think the alignment is (going to be) important, but that can be medium-long term. For now I see 2 options: 1) SEQADV; 2) if it doesn't match, complain. You have to see whether that would give too many false positives, in which case you may want to make it an INFO message rather than a warning.

Why/when would alignment be important? In files that have a sequence, it should correspond in a simple, literal way, so this can be done without fancy alignment. But those files will also have REMARK 465, indicating which residues in the sequence are not resolved. For files without a sequence, there is nothing to align to, fancy or not.

If the seqres does not match the found residues, where is it wrong? And can you guarantee that all pdb files that have a seqres have remark 465? I don't dare make any assumptions about pdb files, based on my experience so far.

If the seqres does not match, someone has deliberately changed the coordinates section, and did not update SEQRES. I guess a deliberate change is meant to be processed. If there is a SEQRES, it's probably good to use it for sanity check, but not using NW or other alignment, because breaks may be assigned wrongly. Use it only for simple verbatim checking, preferably in conjunction with REMARK 465, and if it doesn't match, it makes sense to issue an error; the user can then discard the SEQRES record if that's what was meant.

The question is more about what you do when the seqres doesn't match.

1. You warn: "Seqres is wrong, good luck!"
2. You assume Remark 465 is correct and info after checking: "Seqres is wrong, but the deviations are specified in R465"
3. You warn: "Seqres is wrong and this is not described in R465. Probably, residues X-Z are missing, residue ALA53 was changed to PHE. Sort out your mess."

As a user, I would like warning 3 most. But I'm completely fine with just 1 or 2.
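A minimal sketch of how those three outcomes could be told apart, assuming the residue numbers from the coordinate section and from REMARK 465 have already been collected. The function name `seqres_report`, its arguments and its messages are hypothetical, and the resname comparison mentioned in option 3 is deliberately left out.

```python
def seqres_report(n_seqres, present, remark465=None):
    """Return (level, message) for one chain. Hypothetical helper.

    n_seqres  -- number of residues listed in SEQRES for the chain
    present   -- resids found in the coordinate section
    remark465 -- resids listed as unresolved in REMARK 465, or None if absent
    """
    expected = set(range(1, n_seqres + 1))
    missing = expected - set(present)
    if not missing:
        return "ok", "SEQRES matches the residues found"
    if remark465 is None:
        # Option 1: nothing to cross-check against.
        return "warning", "SEQRES does not match the residues found"
    unexplained = missing - set(remark465)
    if not unexplained:
        # Option 2: every deviation is accounted for by REMARK 465.
        return "info", "SEQRES does not match, but the deviations are listed in REMARK 465"
    # Option 3: say which residues are missing without explanation.
    return "warning", f"Residues {sorted(unexplained)} are missing and not listed in REMARK 465"


print(seqres_report(5, [2, 3, 5], remark465=[1]))
```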

This also partly ties in with how much we trust CHAIN identifiers, but I guess it would be sane to assume they're at least consistent within the pdb file (i.e. between seqres and the coordinate section).

pckroon · 2024-03-01T13:47:31Z

vermouth/pdb/pdb.py

+            #find consecutive sequences of missing residues
+            nums = sorted(set(missing_res_nos))
+            gaps = [[s, e] for s, e in zip(nums, nums[1:]) if s + 1 < e]
+            edges = iter(nums[:1] + sum(gaps, []) + nums[-1:])
+            series = list(zip(edges, edges))


I guess this gets the TODO: alignment?
In addition, do we want to check that the resnames are correct?
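For readers following the snippet in the diff, the same grouping of missing residue numbers into consecutive runs can be written as an explicit loop. This is an equivalent sketch for illustration, not the code in the PR; the function name `missing_runs` is made up.

```python
def missing_runs(missing_res_nos):
    """Group residue numbers into inclusive (start, end) runs of consecutive values."""
    nums = sorted(set(missing_res_nos))
    if not nums:
        return []
    runs = []
    start = prev = nums[0]
    for n in nums[1:]:
        if n != prev + 1:
            # A gap: close the current run and start a new one.
            runs.append((start, prev))
            start = n
        prev = n
    runs.append((start, prev))
    return runs


print(missing_runs([3, 4, 5, 9, 12, 13]))  # [(3, 5), (9, 9), (12, 13)]
```

A single missing residue comes out as a run whose start and end are equal, which is also what the `zip(edges, edges)` construction in the diff produces.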

pckroon · 2024-03-01T13:48:13Z

vermouth/tests/pdb/test_read_pdb.py

+    ATOM     11  CB  SER A   2      25.020  11.833  71.196  1.00 33.13           C  
+    ATOM     12  OG  SER A   2      24.049  10.863  71.549  1.00 34.53           O  
+''', True)))
+def test_seqres(caplog, pdbstr, status):


Add a test without a SEQRES entry, and one where there are enough residues, but they have different resnames

csbrasnett added 4 commits February 26, 2024 11:51

added seqres parsing

6483f5b

corrected seqres check

883db46

added seqres read tests

be8e437

added sequence alginment check

33cba26

csbrasnett requested review from pckroon and fgrunewald February 28, 2024 09:50

pckroon requested changes Feb 28, 2024

csbrasnett added 2 commits February 28, 2024 16:29

fixed and added tests

d6a961c

changed seqres check

1e107e0

pckroon requested changes Mar 1, 2024

Merge branch 'marrink-lab:master' into coord-rebuild

73ae16c


