Error when running test dataset #3

Open
apdavid opened this issue Sep 2, 2021 · 13 comments

@apdavid

apdavid commented Sep 2, 2021

After setting up the conda environment, I tried running the command

snakemake -p --config datasets="test" --restart-times=0

and then got the following error:

Traceback (most recent call last):
  File "scripts/rijk_zscore.py", line 474, in <module>
    main()
  File "scripts/rijk_zscore.py", line 116, in main
    df = pd.read_parquet(args.parquet,columns=["juncPosR1A","geneR1A_uniq","juncPosR1B","numReads","cell","splice_ann","tissue","compartment","free_annotation","refName_newR1","called","chrR1A","exon_annR1A","exon_annR1B","strand"])
  File "/wynton/home/tjan/adavid/.local/lib/python3.6/site-packages/pandas/io/parquet.py", line 317, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "/wynton/home/tjan/adavid/.local/lib/python3.6/site-packages/pandas/io/parquet.py", line 142, in read
    path, columns=columns, filesystem=fs, **kwargs
  File "/wynton/home/tjan/adavid/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1896, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/wynton/home/tjan/adavid/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1746, in read
    use_threads=use_threads
  File "pyarrow/_dataset.pyx", line 465, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 440, in pyarrow._dataset.Dataset.scanner
  File "pyarrow/_dataset.pyx", line 2946, in pyarrow._dataset.Scanner.from_dataset
  File "pyarrow/_dataset.pyx", line 2854, in pyarrow._dataset._populate_builder
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(splice_ann) in cell: string
chrR1A: string
geneR1A_uniq: string
strand: string
juncPosR1A: int64
juncPosR1B: int64
numReads: int64
called: int64
free_annotation: string
compartment: string
tissue: string
@apdavid
Author

apdavid commented Sep 2, 2021

Is the data missing a variable splice_ann?

@juliaolivieri
Owner

Thanks for pointing these things out. I'm checking up on this and your other comment and I'll reply by the end of the day.

@juliaolivieri
Owner

After looking into this, I think it might be a version issue. Do you know what versions of pandas and pyarrow you're using? I don't get this error with pandas=1.0.4 and pyarrow=0.15.1, but I do with pandas=1.3.0 and pyarrow=5.0.0.

Regardless, I just pushed a change that should make the test work even if you don't change the package versions (no need to change/add columns). Let me know if you continue to have problems.
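
One way to catch this kind of mismatch before `read_parquet` fails is to compare the requested column names with the names stored in the file's schema. The sketch below is only illustrative and is not the change that was pushed: `split_columns` and `read_available_columns` are hypothetical helpers, and it assumes a pyarrow version that provides `pyarrow.parquet.read_schema`.

```python
def split_columns(requested, available):
    """Return (present, missing), both in the order of `requested`."""
    available = set(available)
    present = [c for c in requested if c in available]
    missing = [c for c in requested if c not in available]
    return present, missing


def read_available_columns(parquet_path, requested):
    """Read only the requested columns that actually exist in the file.

    Hypothetical helper: reports absent columns instead of letting
    pyarrow raise ArrowInvalid for them.
    """
    # Imported lazily so split_columns can be used without pandas/pyarrow.
    import pandas as pd
    import pyarrow.parquet as pq

    # The schema is read from the file metadata, without loading the data.
    schema_names = pq.read_schema(parquet_path).names
    present, missing = split_columns(requested, schema_names)
    if missing:
        print("columns not found in", parquet_path, ":", missing)
    return pd.read_parquet(parquet_path, columns=present)
```

With the column list from the traceback above, `split_columns` would place `splice_ann` and any other absent names in `missing`, so they can be reported or filled in afterwards.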

@apdavid
Author

apdavid commented Sep 3, 2021

Thank you for the quick reply. That resolved the first error, but the same rule (rijk_zscore) now throws a different one. This is the *.err output:

100%|██████████| 2/2 [00:00<00:00, 11.30it/s]
Traceback (most recent call last):
  File "scripts/rijk_zscore.py", line 486, in <module>
    main()
  File "scripts/rijk_zscore.py", line 358, in main
    df["cov"] = df["geneR1A_uniq"].map(grouped.apply(lambda x: x['z_A'].cov(x['z_B'])))
  File "/wynton/home/tjan/adavid/.local/lib/python3.6/site-packages/pandas/core/series.py", line 3983, in map
    new_values = super()._map_values(arg, na_action=na_action)
  File "/wynton/home/tjan/adavid/.local/lib/python3.6/site-packages/pandas/core/base.py", line 1118, in _map_values
    mapper, dtype_if_empty=np.float64
  File "/wynton/home/tjan/adavid/.local/lib/python3.6/site-packages/pandas/core/construction.py", line 632, in create_series_with_explicit_dtype
    if is_empty_data(data) and dtype is None:
  File "/wynton/home/tjan/adavid/.local/lib/python3.6/site-packages/pandas/core/construction.py", line 596, in is_empty_data
    is_simple_empty = is_list_like_without_dtype and not data
  File "/wynton/home/tjan/adavid/.local/lib/python3.6/site-packages/pandas/core/generic.py", line 1330, in __nonzero__
    f"The truth value of a {type(self).__name__} is ambiguous. "
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

@apdavid
Author

apdavid commented Sep 4, 2021

Also, I set up a conda environment using the environment.yml file, and when I checked conda list -n spliz_env it appeared that pyarrow=0.15.1 and pandas=1.0.4 were selected and loaded correctly.

@juliaolivieri
Owner

I'm not getting this issue when I run the test data. Looking at your python path in the traceback, it looks like you may be running with your local python rather than conda. Could you double check this? It may be useful to add the following print statement at line 358 of rijk_zscore.py: print("pandas version:", pd.__version__)
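
A slightly fuller version of that check can also show where each package is loaded from, since the traceback paths point at `~/.local` rather than the conda environment. This is a minimal sketch using only the standard library; `loaded_from_user_site` and `report` are hypothetical names, not part of the pipeline.

```python
import importlib
import os
import site
import sys


def loaded_from_user_site(module_file, user_site):
    """True if module_file lives under the per-user site-packages directory."""
    module_file = os.path.abspath(module_file)
    user_site = os.path.abspath(user_site)
    try:
        return os.path.commonpath([module_file, user_site]) == user_site
    except ValueError:
        # Raised on Windows when the two paths are on different drives.
        return False


def report(module_names=("pandas", "pyarrow")):
    """Print the interpreter path plus version and location of each module."""
    print("interpreter:", sys.executable)
    user_site = site.getusersitepackages()
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            print(name, "is not importable")
            continue
        where = getattr(module, "__file__", None) or ""
        note = "(user site)" if where and loaded_from_user_site(where, user_site) else ""
        print(name, getattr(module, "__version__", "unknown"), where, note)
```

Calling `report()` from inside the script that fails shows whether the conda packages or the user-site copies are the ones being imported. Setting the `PYTHONNOUSERSITE` environment variable (or running `python -s`) tells Python to skip the user site directory altogether.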

@apdavid
Author

apdavid commented Sep 7, 2021

Thank you. I removed the local Python packages so that they wouldn't interfere, then cleared and reinstalled the Anaconda environment packages from scratch using the environment.yml.

I also added the line to print the pandas version. It prints 1.0.4, so that looks correct, and the script is using the Anaconda-installed Python 3.6.7. Now I get a different error:

$ cat rijk_zscore_test_0.1_0.0_5.err
Traceback (most recent call last):
  File "scripts/rijk_zscore.py", line 487, in <module>
    main()
  File "scripts/rijk_zscore.py", line 124, in main
    raise RuntimeError("required column '{}' is missing".format(rc))
RuntimeError: required column 'refName_newR1' is missing

@juliaolivieri
Owner

Thanks for your patience with this. We actually have a parallel implementation of the SpliZ in Nextflow: https://github.com/salzmanlab/SpliZ
Can you try the "quick start" steps here and see if you get the same problem?

@apdavid
Author

apdavid commented Sep 10, 2021

Thank you. I actually discovered the Nextflow implementation today and am currently installing it. I'll let you know how it goes.

@apdavid
Author

apdavid commented Sep 10, 2021

I was able to install Nextflow, then created and activated the conda environment from the new environment.yml file. Running the test data generated this error:

N E X T F L O W  ~  version 21.04.0
Pulling salzmanlab/spliz ...
 Already-up-to-date
Launching `salzmanlab/spliz` [festering_torvalds] - revision: 6c708f518b [main]


------------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
  nf-core/spliz v1.0dev
------------------------------------------------------


WARN: Found unexpected parameters:
* --outdir: ./results/test
* --numGenes: null
* --domain: null
- Ignore this warning: params.schema_ignore_params = "outdir,numGenes,domain" 

Core Nextflow options
  revision              : main
  runName               : festering_torvalds
  container             : kaitlinchaung/spliz:dev
  launchDir             :
  workDir               : 
  projectDir            : 
  userName              :
  profile               : standard
  configFiles           :

Input/output options
  dataname              : test
  input_file            : small.pq
  SICILIAN              : true
  pin_S                 : 0.1
  pin_z                 : 0.0
  bounds                : 5
  light                 : false
  svd_type              : normdonor
  grouping_level_1      : tissue
  grouping_level_2      : compartment
  n_perms               : 100
  libraryType           : 10X
  run_analysis          : true

Max job request options
  max_memory            : 800 GB
  max_time              : 10d

Other parameters
  max_multiqc_email_size: 25 MB

------------------------------------------------------
 Only displaying parameters that differ from defaults.
------------------------------------------------------
[-        ] process > NFCORE_SPLIZ:SPLIZ_PIPELINE:SPLIZ:CALC_SPLIZVD         -
[-        ] process > NFCORE_SPLIZ:SPLIZ_PIPELINE:ANALYSIS:PVAL_PERMUTATIONS -


A process input channel evaluates to null -- Invalid declaration `path domain`

 -- Check script '.nextflow/assets/salzmanlab/spliz/./workflows/./../subworkflows/local/analysis.nf' at line: 49 or see '.nextflow.log' file for more details

Nextflow log:

$nextflow log

2021-09-10 16:42:19     -               festering_torvalds      -       6c708f518b      807cca83-2a1e-41d0-add7-9d0eba060db7    nextflow run salzmanlab/spliz -r main -latest -c small.config

@kaitlinchaung

Hi there! I've pushed some changes to the nextflow pipeline and was able to run the test data set. Can you please try running again?

@apdavid
Author

apdavid commented Sep 13, 2021

@kaitlinchaung
Thank you, I was able to run the test data set after you made those changes. I am now going to try running my own data from SICILIAN.

Is there a *.tsv version of the small dataset, just so I can verify that my data is formatted correctly (even though it came straight out of SICILIAN)?

@juliaolivieri @kaitlinchaung
Also, are there any processing or filtering steps that you performed post-SICILIAN and pre-SpliZ that I should be aware of? My data is SS2 data, if that is helpful.

@kaitlinchaung

@apdavid
I've added a small TSV to the Nextflow pipeline.
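
For checking the format, one option is to compare the header of your own SICILIAN output against the header of the small TSV. The following is a rough sketch under that assumption: `read_header` and `compare_headers` are hypothetical helpers, and both files are assumed to be tab-separated with column names on the first line.

```python
import csv


def read_header(tsv_path):
    """Return the column names from the first line of a tab-separated file."""
    with open(tsv_path, newline="") as handle:
        return next(csv.reader(handle, delimiter="\t"), [])


def compare_headers(mine, reference):
    """Return (missing, extra).

    missing: reference columns that are absent from `mine`
    extra:   columns in `mine` that the reference does not have
    """
    mine_set, reference_set = set(mine), set(reference)
    missing = [c for c in reference if c not in mine_set]
    extra = [c for c in mine if c not in reference_set]
    return missing, extra
```

An empty `missing` list only means the column names line up. It does not check data types or values, so it is a first sanity check rather than a full validation.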
