Can't find my sample names #31

Open
jsaintvanne opened this issue Jan 18, 2023 · 5 comments

Comments

@jsaintvanne

Hi,

I'm quickly trying out some workflows using ramclustR and I can't find my sample names in the output...
After taking a look at the script of the ramclustR function, I can see that there are a lot of result tables containing everything needed (rt, intensity, cluster, etc.), but I can't find the sample names in the resulting MSP file (whereas the row names of the tables are my sample names).

Can someone help me, please?

Thanks a lot!

@cbroeckl
Owner

@jsaintvanne - the spectra are stored in the .msp output file. The spectra that are exported are representative of all the files. While not perfectly accurate, you can picture each individual spectrum as the average spectrum for that compound, taking into account all the data in the dataset. So given that, every spectrum is associated with every sample - it is only the signal intensity that changes, which is stored in the SpecAbund data matrix and exported .csv file.
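To make the layout concrete, here is a minimal sketch of pulling sample names and per-sample intensities back out of an exported abundance table. It is written in Python purely for illustration (RAMClustR itself is R). The CSV text, the `sample` header cell, the sample names, and the helper `read_abundance_table` are all hypothetical; the only things taken from this thread are that rows correspond to samples and that the values are signal intensities per compound.

```python
import csv
import io

# Hypothetical stand-in for the exported abundance .csv:
# one row per sample, one column per cluster/compound.
exported_csv = """sample,C001,C002,C003
sample_A,1520.0,310.5,98.2
blank_1,12.4,0.0,3.1
"""


def read_abundance_table(text):
    """Return {sample_name: {cluster: intensity}} from CSV text."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    clusters = header[1:]  # first column is assumed to hold the sample names
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        table[row[0]] = dict(zip(clusters, (float(v) for v in row[1:])))
    return table


table = read_abundance_table(exported_csv)
sample_names = list(table)
```

Each spectrum in the .msp file would then correspond to one column here, and the sample names only ever appear as row labels of the intensity table.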

@jsaintvanne
Author

Thanks for your really fast answer, @cbroeckl!

Here we work with sample metadata and different conditions, but we analyze everything at the same time, and I thought that ramclustR could differentiate between them. Sorry for the previous question, and now another one: would it be useful to have something that takes the sample conditions as input, so that the conditions can be differentiated and clusters obtained within each condition (these can change a lot between a blank and a standard, for example)?

@cbroeckl
Owner

@jsaintvanne - your sample names come from the xcms object. Generally my approach to sample naming is to use concatenated factors, i.e.

treatment-4hr-rep1
control-2hr-rep3

such that the sample name can be split into separate factors. There is a function to enable the splitting, rc.expand.sample.names. Above, this would split the two sample names into a data frame with three columns when I use '-' as the delimiter.
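The splitting idea itself can be sketched in a few lines. This is a Python illustration of the concept only, not the rc.expand.sample.names implementation or its signature; the function `expand_sample_names` and the column names `treatment`, `time`, and `rep` are assumptions made for the example.

```python
def expand_sample_names(sample_names, delim="-", factor_names=None):
    """Split delimited sample names into one record per sample.

    Hypothetical helper: returns a list of dicts, one column per factor.
    """
    rows = []
    for name in sample_names:
        parts = name.split(delim)
        if factor_names is None:
            # Default column names when none are supplied.
            factor_names = [f"factor{i + 1}" for i in range(len(parts))]
        if len(parts) != len(factor_names):
            raise ValueError(
                f"{name!r} has {len(parts)} factors, expected {len(factor_names)}"
            )
        row = {"sample": name}
        row.update(zip(factor_names, parts))
        rows.append(row)
    return rows


samples = ["treatment-4hr-rep1", "control-2hr-rep3"]
factors = expand_sample_names(samples, delim="-",
                              factor_names=["treatment", "time", "rep"])
```

The consistency check matters in practice: a delimiter that also occurs inside a factor value would silently shift columns otherwise.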

That said, I think that class-specific clustering will be a difficult path forward:

  1. ramclustR's algorithm is dependent on quantitative variation in the feature data. The less variation, the less clear the relationships between features.
  2. The feature grouping behavior is likely to be slightly different in different sample groups, and rectifying those discrepancies is not trivial, i.e. what if feature 1211 (for example) is part of C003 in one group and C008 in another group? That isn't to say that there aren't solutions, but they require a good deal of thought before implementing.
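The discrepancy in point 2 can at least be detected mechanically. The sketch below (Python, for illustration only) compares two per-group clusterings by co-membership rather than by label, since cluster labels from independent runs are not guaranteed to line up. The feature IDs other than 1211, the cluster assignments, and the function names are all hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(assignment, features):
    """Pairs of features that share a cluster in this assignment."""
    return {
        (a, b)
        for a, b in combinations(sorted(features), 2)
        if assignment[a] == assignment[b]
    }


def conflicting_pairs(group_a, group_b):
    """Feature pairs grouped together in one clustering but not the other."""
    shared = group_a.keys() & group_b.keys()
    pairs_a = co_clustered_pairs(group_a, shared)
    pairs_b = co_clustered_pairs(group_b, shared)
    return sorted(pairs_a ^ pairs_b)  # symmetric difference


# Hypothetical assignments: feature 1211 moves from C003 to C008.
group_a = {1209: "C003", 1210: "C003", 1211: "C003", 1300: "C008", 1301: "C008"}
group_b = {1209: "C003", 1210: "C003", 1211: "C008", 1300: "C008", 1301: "C008"}

conflicts = conflicting_pairs(group_a, group_b)
```

Every conflicting pair in this toy case involves feature 1211, which points at the feature to inspect. Deciding which assignment is right is the hard part and is not addressed here.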

An alternative path which may alleviate your concerns is to switch from Pearson's correlation to Spearman's. Rank correlation will be much less prone to the influence of outliers (blanks, for example) than Pearson's. This is enabled in the main ramclustR function via the option cor.method. Pearson's is the default, but you could set cor.method = "spearman".
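A small numeric illustration of why rank correlation helps, written in plain Python rather than R. The intensities are made up: two features that are unrelated across five real samples, plus one blank where both are zero. The functions are generic textbook implementations, not RAMClustR code.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical intensities: five samples, then one blank (last value).
feature_1 = [100, 102, 98, 101, 99, 0]
feature_2 = [50, 49, 51, 48, 52, 0]

r_pearson = pearson(feature_1, feature_2)
r_spearman = spearman(feature_1, feature_2)
```

With these numbers the single blank pulls Pearson's coefficient close to 1, while Spearman's stays near 0, because the blank contributes only one rank position instead of a large leverage point.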

@hechth
Collaborator

hechth commented Jan 24, 2023

@cbroeckl thanks for the explanation - didn't know this about the spearman correlation!

I assume that another option would be to actually run the individual conditions independently and then build networks/find identical or similar features across the groups using spectral matching?
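As a rough sketch of what "spectral matching" between groups could look like, here is a plain cosine similarity between two spectra in Python. Everything in it is an assumption for illustration: spectra are given as lists of (m/z, intensity) pairs, peaks are aligned by rounding m/z to two decimals, and the example peak values are invented. Real matching tools typically use tolerance-based alignment and intensity weighting instead.

```python
from math import sqrt


def bin_spectrum(peaks, decimals=2):
    """Sum intensities of peaks whose m/z rounds to the same value."""
    binned = {}
    for mz, intensity in peaks:
        key = round(mz, decimals)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(peaks_a, peaks_b, decimals=2):
    """Cosine similarity of two spectra after simple m/z binning."""
    a = bin_spectrum(peaks_a, decimals)
    b = bin_spectrum(peaks_b, decimals)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical spectra from two sample groups.
spectrum_group_1 = [(181.07, 1000.0), (163.06, 420.0), (145.05, 90.0)]
spectrum_group_2 = [(181.07, 800.0), (163.06, 350.0), (119.05, 60.0)]
unrelated = [(250.10, 500.0), (232.09, 200.0)]

score_similar = cosine_similarity(spectrum_group_1, spectrum_group_2)
score_unrelated = cosine_similarity(spectrum_group_1, unrelated)
```

A threshold on such a score would decide which per-group clusters count as the same compound, and that choice is exactly where imperfect matches become a problem.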

@cbroeckl
Owner

cbroeckl commented Jan 24, 2023

@hechth - absolutely could be done. A few items to consider:

  1. Are there enough samples in each group for correlation-based clustering to be meaningful? If not, it would be best to develop a peak-shape-based clustering as well. I had actually started down this path, lost steam, and ultimately abandoned it for lack of time to validate it. There is a clear path forward for it though. You can simultaneously use all the similarity metrics (retention time of the feature, correlation, and peak shape) by expanding the existing similarity product score. In theory IMS data could also be incorporated, if available.
  2. If you perform RAMClustR by sample groups and then cluster spectra, how do you deal with feature assignments which are in conflict?
  3. How do you deal with missing spectra in the blanks (NA values are a bit of a nuisance...)?
  4. If you are going to be performing clustering by sample type, would it be best to perform XCMS by sample type as well?
  5. If two spectra from two groups align pretty well but imperfectly, what set of features should be used in the quantitative assignments - only the overlapping features or all features?
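The "expanded similarity product score" from item 1 could be sketched roughly as follows. This is a Python illustration under stated assumptions, not RAMClustR's code: each metric is turned into a 0 to 1 similarity with a Gaussian-style term and the terms are multiplied. The exact functional form, the peak-shape term, the function names, and all sigma values are hypothetical.

```python
from math import exp


def gaussian_similarity(distance, sigma):
    """Map a distance (0 = identical) to a similarity in (0, 1]."""
    return exp(-(distance ** 2) / (2 * sigma ** 2))


def combined_similarity(rt_diff, correlation, shape_correlation,
                        sigma_t=1.0, sigma_r=0.5, sigma_s=0.5):
    """Product of retention-time, correlation, and peak-shape similarity.

    rt_diff: retention time difference between two features
    correlation: intensity correlation across samples (1 = perfect)
    shape_correlation: peak-shape correlation (1 = identical shape)
    The sigma_* defaults are placeholders, not recommended settings.
    """
    return (
        gaussian_similarity(rt_diff, sigma_t)
        * gaussian_similarity(1 - correlation, sigma_r)
        * gaussian_similarity(1 - shape_correlation, sigma_s)
    )


same_compound = combined_similarity(rt_diff=0.2, correlation=0.95, shape_correlation=0.9)
coeluting_only = combined_similarity(rt_diff=0.2, correlation=0.1, shape_correlation=0.3)
```

Because the terms multiply, a pair of features has to look similar on every metric to score highly, so an additional term such as IMS drift time would slot in as one more factor.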


3 participants