Error when running test dataset #3

apdavid · 2021-09-02T20:47:11Z

After setting up the conda environment, I tried running the command

snakemake -p --config datasets="test" --restart-times=0

and then got the following error:

Traceback (most recent call last):
  File "scripts/rijk_zscore.py", line 474, in <module>
    main()
  File "scripts/rijk_zscore.py", line 116, in main
    df = pd.read_parquet(args.parquet,columns=["juncPosR1A","geneR1A_uniq","juncPosR1B","numReads","cell","splice_ann","tissue","compartment","free_annotation","refName_newR1","called","chrR1A","exon_annR1A","exon_annR1B","strand"])
  File "/wynton/home/tjan/adavid/.local/lib/python3.6/site-packages/pandas/io/parquet.py", line 317, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "/wynton/home/tjan/adavid/.local/lib/python3.6/site-packages/pandas/io/parquet.py", line 142, in read
    path, columns=columns, filesystem=fs, **kwargs
  File "/wynton/home/tjan/adavid/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1896, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/wynton/home/tjan/adavid/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1746, in read
    use_threads=use_threads
  File "pyarrow/_dataset.pyx", line 465, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 440, in pyarrow._dataset.Dataset.scanner
  File "pyarrow/_dataset.pyx", line 2946, in pyarrow._dataset.Scanner.from_dataset
  File "pyarrow/_dataset.pyx", line 2854, in pyarrow._dataset._populate_builder
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(splice_ann) in cell: string
chrR1A: string
geneR1A_uniq: string
strand: string
juncPosR1A: int64
juncPosR1B: int64
numReads: int64
called: int64
free_annotation: string
compartment: string
tissue: string

apdavid · 2021-09-02T20:49:57Z

Is the data missing a variable splice_ann?

juliaolivieri · 2021-09-02T20:51:41Z

Thanks for pointing these things out. I'm checking up on this and your other comment and I'll reply by the end of the day.

juliaolivieri · 2021-09-03T07:51:01Z

After looking into this, I think it might be a version issue. Do you know what versions of pandas and pyarrow you're using? I don't get this error when I use pandas=1.0.4 and pyarrow=0.15.1, but I do when I use pandas=1.3.0 and pyarrow=5.0.0.

Regardless, I just pushed a change that should make the test work even if you don't change the package versions (no need to change/add columns). Let me know if you continue to have problems.
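
To see which of the requested columns a Parquet file actually contains, it is possible to read only its schema rather than the data. The following is a minimal sketch, assuming pyarrow is installed; the helper names `missing_columns` and `parquet_column_names` are illustrative and are not part of the SpliZ scripts.

```python
def missing_columns(requested, available):
    """Return the requested column names that are absent, in request order."""
    available = set(available)
    return [name for name in requested if name not in available]


def parquet_column_names(path):
    """Read only the Parquet schema (no row data) and return its column names."""
    # Imported lazily so that missing_columns() works without pyarrow installed.
    import pyarrow.parquet as pq

    return pq.read_schema(path).names
```

For example, `missing_columns(["juncPosR1A", "splice_ann"], parquet_column_names("small.pq"))` would list whichever of those two columns the file lacks (here `small.pq` stands in for the test input). Printing `pd.__version__` and `pyarrow.__version__` alongside the result helps tell a data problem apart from a version problem.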

apdavid · 2021-09-03T21:03:01Z

Thank you for the quick reply. That resolved the first error, but the same rule, rijk_zscore, is now throwing another error. This is the *.err output:

100%|██████████| 2/2 [00:00<00:00, 11.30it/s]
Traceback (most recent call last):
  File "scripts/rijk_zscore.py", line 486, in <module>
    main()
  File "scripts/rijk_zscore.py", line 358, in main
    df["cov"] = df["geneR1A_uniq"].map(grouped.apply(lambda x: x['z_A'].cov(x['z_B'])))
  File "/wynton/home/tjan/adavid/.local/lib/python3.6/site-packages/pandas/core/series.py", line 3983, in map
    new_values = super()._map_values(arg, na_action=na_action)
  File "/wynton/home/tjan/adavid/.local/lib/python3.6/site-packages/pandas/core/base.py", line 1118, in _map_values
    mapper, dtype_if_empty=np.float64
  File "/wynton/home/tjan/adavid/.local/lib/python3.6/site-packages/pandas/core/construction.py", line 632, in create_series_with_explicit_dtype
    if is_empty_data(data) and dtype is None:
  File "/wynton/home/tjan/adavid/.local/lib/python3.6/site-packages/pandas/core/construction.py", line 596, in is_empty_data
    is_simple_empty = is_list_like_without_dtype and not data
  File "/wynton/home/tjan/adavid/.local/lib/python3.6/site-packages/pandas/core/generic.py", line 1330, in __nonzero__
    f"The truth value of a {type(self).__name__} is ambiguous. "
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

apdavid · 2021-09-04T05:19:44Z

Also I set up a conda environment using the environment.yml file and it appears that the pyarrow=0.15.1 and pandas=1.0.4 versions were selected and loaded correctly when I checked conda list -n spliz_env.

juliaolivieri · 2021-09-04T23:47:41Z

I'm not getting this issue when I run the test data. Looking at your python path in the traceback, it looks like you may be running with your local python rather than conda. Could you double check this? It may be useful to add the following print statement at line 358 of rijk_zscore.py: print("pandas version:", pd.__version__)
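
A slightly fuller diagnostic than the version print is to report which interpreter is running and whether a module was loaded from the per-user site-packages directory (paths such as `~/.local/lib/python3.6/site-packages` in the tracebacks above). This is a sketch using only the standard library; `is_under` and `describe_module` are hypothetical helper names.

```python
import os
import site
import sys


def is_under(path, directory):
    """True if `path` lies inside `directory` (after resolving both)."""
    path = os.path.realpath(path)
    directory = os.path.realpath(directory)
    return os.path.commonpath([path, directory]) == directory


def describe_module(module):
    """Report where a module was imported from and whether that is the user site."""
    location = getattr(module, "__file__", None)
    if location is None:
        return {"file": "<built-in>", "from_user_site": False}
    user_site = site.getusersitepackages()
    return {"file": location, "from_user_site": is_under(location, user_site)}


print("interpreter:", sys.executable)
print("os module:", describe_module(os))
```

To check pandas, run `import pandas as pd; print(describe_module(pd))` inside the activated environment. If `from_user_site` is true, the conda environment's copy is being shadowed; setting the environment variable `PYTHONNOUSERSITE=1` tells Python to skip the user site directory.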

apdavid · 2021-09-07T21:05:04Z

Thank you. I removed the local Python packages so that they wouldn't interfere, then cleared and reinstalled the Anaconda environment packages from scratch using the environment.yml.

I also added the line to print the pandas version and it prints 1.0.4, so that looks correct, and it is using the Anaconda-installed Python 3.6.7.

$ cat rijk_zscore_test_0.1_0.0_5.err
Traceback (most recent call last):
  File "scripts/rijk_zscore.py", line 487, in <module>
    main()
  File "scripts/rijk_zscore.py", line 124, in main
    raise RuntimeError("required column '{}' is missing".format(rc))
RuntimeError: required column 'refName_newR1' is missing

juliaolivieri · 2021-09-10T23:18:34Z

Thanks for your patience with this. We actually have a parallel implementation of the SpliZ in nextflow: https://github.com/salzmanlab/SpliZ
Can you try the "quick start" steps here and see if you get the same problem?

apdavid · 2021-09-10T23:20:08Z

Thank you. I actually discovered the NF implementation today and am currently installing it. I'll let you know how it goes.

apdavid · 2021-09-10T23:47:06Z

I was able to install nextflow, create and activate the conda environment with the new environment.yml file, and run the test data, which generated this error:

N E X T F L O W  ~  version 21.04.0
Pulling salzmanlab/spliz ...
 Already-up-to-date
Launching `salzmanlab/spliz` [festering_torvalds] - revision: 6c708f518b [main]


------------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
  nf-core/spliz v1.0dev
------------------------------------------------------


WARN: Found unexpected parameters:
* --outdir: ./results/test
* --numGenes: null
* --domain: null
- Ignore this warning: params.schema_ignore_params = "outdir,numGenes,domain" 

Core Nextflow options
  revision              : main
  runName               : festering_torvalds
  container             : kaitlinchaung/spliz:dev
  launchDir             :
  workDir               : 
  projectDir            : 
  userName              :
  profile               : standard
  configFiles           :

Input/output options
  dataname              : test
  input_file            : small.pq
  SICILIAN              : true
  pin_S                 : 0.1
  pin_z                 : 0.0
  bounds                : 5
  light                 : false
  svd_type              : normdonor
  grouping_level_1      : tissue
  grouping_level_2      : compartment
  n_perms               : 100
  libraryType           : 10X
  run_analysis          : true

Max job request options
  max_memory            : 800 GB
  max_time              : 10d

Other parameters
  max_multiqc_email_size: 25 MB

------------------------------------------------------
 Only displaying parameters that differ from defaults.
------------------------------------------------------
[-        ] process > NFCORE_SPLIZ:SPLIZ_PIPELINE:SPLIZ:CALC_SPLIZVD         -
[-        ] process > NFCORE_SPLIZ:SPLIZ_PIPELINE:ANALYSIS:PVAL_PERMUTATIONS -


A process input channel evaluates to null -- Invalid declaration `path domain`

 -- Check script '.nextflow/assets/salzmanlab/spliz/./workflows/./../subworkflows/local/analysis.nf' at line: 49 or see '.nextflow.log' file for more details

Nextflow log:

$nextflow log

2021-09-10 16:42:19     -               festering_torvalds      -       6c708f518b      807cca83-2a1e-41d0-add7-9d0eba060db7    nextflow run salzmanlab/spliz -r main -latest -c small.config

kaitlinchaung · 2021-09-13T05:03:54Z

Hi there! I've pushed some changes to the nextflow pipeline and was able to run the test data set. Can you please try running again?

apdavid · 2021-09-13T19:20:56Z

@kaitlinchaung
Thank you. I was able to run the test data set after you made those changes. I am now going to try to run my own data from SICILIAN.

Is there a *.tsv version of the small dataset, just to verify my data is formatted correctly (even though it came straight out of SICILIAN)?

@juliaolivieri @kaitlinchaung
Also, are there any processing or filtering steps that you performed post-SICILIAN and pre-SpliZ that I should be aware of? My data is SS2 data, if that is helpful.

kaitlinchaung · 2021-09-13T19:47:12Z

@apdavid
I've added a small tsv to the nextflow pipeline
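
One way to use such a reference file is to compare its header row against the header of your own TSV. This is a sketch using only the standard library; it checks column names only, not types or values, and the function names and file paths are placeholders.

```python
import csv


def tsv_header(path):
    """Return the column names from the first line of a tab-separated file."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle, delimiter="\t"), [])


def compare_headers(reference, candidate):
    """Split the reference columns into those present in / absent from candidate."""
    candidate_set = set(candidate)
    present = [name for name in reference if name in candidate_set]
    absent = [name for name in reference if name not in candidate_set]
    return present, absent
```

Calling `compare_headers(tsv_header("reference.tsv"), tsv_header("my_data.tsv"))` returns the reference columns found in your file and those missing from it.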
