Adding Xena Data Option #2

gwaybio · 2016-09-26T15:40:43Z

With this option, controlled access is no longer required. The pull request incorporates the download, processing, and usage of publicly available data from the UCSC Xena Database.

also increase modularity

dhimmel · 2016-09-27T14:42:13Z

download_xena_data.py

+if not os.path.exists('data/xena'):
+    os.makedirs('data/xena')
+
+expression = 'https://ndownloader.figshare.com/files/5864859'


I'd recommend a more sophisticated method than hardcoding file IDs, for example something similar to what we're doing in 1.download.ipynb for cognoma/machine-learning, which uses the figshare API to retrieve file info. The benefit is that you specify a single ID, either the article ID to get the latest version or a versioned ID such as a DOI, and then you get all the files.

Nice! I will update
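
A minimal sketch of the approach suggested in this thread, assuming figshare's public v2 API, where a request to `https://api.figshare.com/v2/articles/<article_id>` returns JSON with a `files` list whose entries carry `name` and `download_url`. The helper names and the article ID used in the usage note are hypothetical placeholders, not the ones used in this repository.

```python
import json
import os
from urllib import request

API_ROOT = 'https://api.figshare.com/v2'


def article_url(article_id, version=None):
    """Build the API URL for an article, optionally pinned to a version."""
    url = '{}/articles/{}'.format(API_ROOT, article_id)
    if version is not None:
        url += '/versions/{}'.format(version)
    return url


def file_urls(article):
    """Map each file name in an article record to its download URL."""
    return {f['name']: f['download_url'] for f in article['files']}


def download_article(article_id, directory, version=None):
    """Download every file of an article (requires network access)."""
    with request.urlopen(article_url(article_id, version)) as response:
        article = json.loads(response.read().decode('utf-8'))
    os.makedirs(directory, exist_ok=True)
    for name, url in file_urls(article).items():
        path = os.path.join(directory, name)
        # Skip files that are already on disk
        if not os.path.exists(path):
            request.urlretrieve(url, path)
```

With something like this, a call such as `download_article(1234567, 'data/xena', version=1)` (placeholder ID) would replace the per-file URLs, and pinning `version` keeps the download reproducible.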

dhimmel · 2016-09-27T14:55:45Z

pancancer_classifier.py


 # Subset data
-rnaseq_df.drop('SLC35E2', axis=0, inplace=True)
+if xena:
+    gene_map_dict = dict((str(v), str(k)) for k, v in


See dictionary comprehension.
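
For illustration, here is how a `dict()` call over a generator of tuples compares with the equivalent dictionary comprehension. The iterable in the diff is truncated, so the `id_to_symbol` mapping below is a hypothetical stand-in.

```python
# Hypothetical mapping; the real source of the key/value pairs is
# truncated in the diff, so this is only a stand-in.
id_to_symbol = {7157: 'TP53', 3845: 'KRAS'}

# dict() over a generator of (key, value) tuples, as in the diff
inverted_old = dict((str(v), str(k)) for k, v in id_to_symbol.items())

# Equivalent dictionary comprehension
inverted = {str(v): str(k) for k, v in id_to_symbol.items()}
```

Both forms build the same dictionary; the comprehension just states the key and value expressions directly.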

dhimmel · 2016-09-27T14:57:39Z

pancancer_classifier.py

+    expr_fh = 'data/xena/expression.tsv.bz2'
+    mut_fh = 'data/xena/xena_mutation_table.tsv'
+    clin_fh = 'data/xena/samples.tsv'
+    gene_map = pd.read_table('data/xena/HiSeqV2-gene-map.tsv', index_col=0)


I see the purpose of dataframe indexes sometimes (like when we use pandas for matrices). But here I'm with @hadley -- just make all variables regular columns.
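
A small sketch of the difference, assuming pandas is installed. The two-row table and its column names (`gene_id`, `symbol`) are hypothetical, since the real layout of the gene map file is not shown here.

```python
import io

import pandas as pd

# Hypothetical stand-in for a tab-separated gene map file
tsv = 'gene_id\tsymbol\n7157\tTP53\n3845\tKRAS\n'

# index_col=0 moves the first column into the index
indexed_df = pd.read_csv(io.StringIO(tsv), sep='\t', index_col=0)

# Omitting index_col keeps every variable as a regular column
flat_df = pd.read_csv(io.StringIO(tsv), sep='\t')

# An indexed frame can also be flattened afterwards
restored_df = indexed_df.reset_index()
```

With regular columns, `gene_id` can be selected, merged on, and renamed like any other variable.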

dhimmel · 2016-09-27T15:00:22Z

subset_datasets.py

+
+xena_mut_df = (xena_mut_df.pivot_table(index='#sample', columns='gene',
+               values='mutation', fill_value=0) .astype(bool).astype(int))
+
 # Write subsets to file


header=True is default for pandas.DataFrame.to_csv. Did you mean something else?

Nope, I guess that snuck in

dhimmel · 2016-09-27T15:03:54Z

download_xena_data.py

+if not os.path.exists('data/xena/samples.tsv'):
+    request.urlretrieve(samples, os.path.join('data', 'xena', 'samples.tsv'))
+
+gene_map = 'https://raw.githubusercontent.com/cognoma/cancer-data/master/'\


A bit dangerous to use versioned data links and then an unversioned URL here. Protip: press y on a GitHub page to switch to the versioned URL.

Going to reference cognoma/cancer-data#23 here -- a gene information table that got uploaded to figshare would make your life easier.

Yes, it would make it easier. Thanks for the versioned URL tip!
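
To make the versioned versus unversioned distinction concrete, here is a sketch that builds both kinds of raw URL. The file path in the diff is truncated, so `PATH` and `COMMIT` below are placeholders and the `raw_url` helper is hypothetical.

```python
RAW_ROOT = 'https://raw.githubusercontent.com'


def raw_url(owner, repo, ref, path):
    """Build a raw.githubusercontent.com URL for a file at a given ref."""
    return '/'.join([RAW_ROOT, owner, repo, ref, path])


# Placeholder values: substitute a real commit hash and file path
COMMIT = '0123456789abcdef0123456789abcdef01234567'
PATH = 'path/to/gene-table.tsv'

# Follows the branch, so the content can change between runs
unversioned = raw_url('cognoma', 'cancer-data', 'master', PATH)

# Pinned to one commit, so the content is fixed
versioned = raw_url('cognoma', 'cancer-data', COMMIT, PATH)
```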

dhimmel · 2016-09-27T15:09:20Z

pancancer_classifier.py


 # Generate file names for output
-base_fh = 'tissues_' + args.tissues.replace(',', '_') + '_genes_' + \
-          args.genes.replace(',', '_')
+base_fh = base_add + '_tissues_' + args.tissues.replace(',', '_') + \


I like this approach:

'{}_tissues_{}_genes_{}'.format(
    base_add,
    args.tissues.replace(',', '_'),
    args.genes.replace(',', '_')
)

While not much simpler ATM, it is good for your training with Python 3.6 around the corner and the ability to use the advanced Format Specification Mini-Language.

Python 3.6 will change everything.

Coming soon to a terminal near you.

PEP 498 -- Literal String Interpolation

2016-12-16

wow! That's great, ok thanks. I will adopt this format for ease in transitioning to mini language
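
A short sketch comparing `str.format` with the PEP 498 f-string form discussed in this thread. The values of `base_add` and `args`, and the `auroc_label` example, are hypothetical stand-ins for the script's real variables.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the script's variables and parsed arguments
base_add = 'xena'
args = SimpleNamespace(tissues='BLCA,LUAD', genes='TP53,KRAS')

tissues = args.tissues.replace(',', '_')
genes = args.genes.replace(',', '_')

# str.format, as suggested in the review
base_fh = '{}_tissues_{}_genes_{}'.format(base_add, tissues, genes)

# The same string as a PEP 498 f-string (Python 3.6+)
base_fh_f = f'{base_add}_tissues_{tissues}_genes_{genes}'

# Format specifications work in both forms, e.g. three decimal places
auroc_label = 'auroc_{:.3f}'.format(0.91234)
```

Because the replacement fields use the same syntax in both forms, code written with `str.format` converts to f-strings with little effort.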

gwaybio added 6 commits September 26, 2016 11:34

update readme for xena

1c42d2a

add xena data logic

b4c93c2

download xena data script

25c9c5b

update pipeline script for xena data analysis

0c1c102

add xena flag

b37a3a6

also increase modularity

remove checkbox

6ae051a

gwaybio assigned cgreene and dhimmel Sep 26, 2016

dhimmel suggested changes Sep 27, 2016

gwaybio added 2 commits September 28, 2016 11:09

address review comments

f389940

figshare api to download xena data, update subseting script, and update pancan classifier script for using cognoma data

update pipeline

087de9e

add options to subset xena or synapse data

dhimmel approved these changes Sep 28, 2016

gwaybio merged commit 0766a89 into greenelab:master Oct 13, 2016

gwaybio deleted the xena branch October 13, 2016 13:23

blankenberg pushed a commit to blankenberg/pancancer that referenced this pull request Jan 31, 2020

Merge pull request greenelab#2 from blankenberg/vijay_pan-d

4634048

Add --version flag
