Adding Xena Data Option #2
if not os.path.exists('data/xena'):
    os.makedirs('data/xena')

expression = 'https://ndownloader.figshare.com/files/5864859'
I'd recommend a more sophisticated method than hardcoding file IDs. For example, something similar to what we're doing in 1.download.ipynb for cognoma/machine-learning. This uses the figshare API to retrieve file info. The benefit is that you specify a single ID, either the article ID to get the latest version or a versioned ID such as DOI, and then you get all the files.
Nice! I will update
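For reference, a minimal sketch of the API-based lookup suggested above. It assumes the public figshare v2 article endpoint and that each entry in the response's `files` list carries `name` and `download_url` fields; the article ID, file ID, and helper names (`fetch_article`, `file_urls`) are made up for illustration and are not taken from this pull request.

```python
import json
from urllib import request

# Assumed endpoint layout for public figshare articles (API v2).
FIGSHARE_API = 'https://api.figshare.com/v2/articles/{}'


def fetch_article(article_id):
    """Fetch public article metadata from the figshare API (needs network)."""
    with request.urlopen(FIGSHARE_API.format(article_id)) as response:
        return json.loads(response.read().decode('utf-8'))


def file_urls(article):
    """Map each file name in an article record to its download URL."""
    return {f['name']: f['download_url'] for f in article.get('files', [])}


# Offline illustration with a hand-written record shaped like an API response.
# The article ID and file ID below are placeholders, not real identifiers.
example_article = {
    'id': 1234567,
    'files': [
        {'name': 'expression.tsv.bz2',
         'download_url': 'https://ndownloader.figshare.com/files/0000001'},
    ],
}
urls = file_urls(example_article)
```

With a real article ID, `file_urls(fetch_article(article_id))` would replace the hardcoded per-file links with a single identifier.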
# Subset data
rnaseq_df.drop('SLC35E2', axis=0, inplace=True)
if xena:
    gene_map_dict = dict((str(v), str(k)) for k, v in
expr_fh = 'data/xena/expression.tsv.bz2'
mut_fh = 'data/xena/xena_mutation_table.tsv'
clin_fh = 'data/xena/samples.tsv'
gene_map = pd.read_table('data/xena/HiSeqV2-gene-map.tsv', index_col=0)
I see the purpose of dataframe indexes sometimes (like when we use pandas for matrices). But here I'm with @hadley -- just make all variables regular columns.
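A small sketch of what "all variables as regular columns" could look like for a gene map. The column names (`entrez_gene_id`, `symbol`) and the inline table are assumptions for illustration; the actual layout of HiSeqV2-gene-map.tsv is not shown in this review.

```python
import io

import pandas as pd

# Hypothetical two-column gene map; the real file's columns may differ.
tsv = io.StringIO('entrez_gene_id\tsymbol\n7157\tTP53\n3845\tKRAS\n')

# No index_col: every variable stays a regular, named column.
gene_map_df = pd.read_csv(tsv, sep='\t', dtype=str)

# Build the lookup from two explicit columns instead of index/value pairs.
gene_map_dict = dict(zip(gene_map_df['entrez_gene_id'], gene_map_df['symbol']))
```

Naming both columns at the point of use makes the direction of the mapping explicit, which is easy to lose when one side lives in the index.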
xena_mut_df = (xena_mut_df.pivot_table(index='#sample', columns='gene',
                                       values='mutation', fill_value=0)
               .astype(bool).astype(int))

# Write subsets to file
header=True is default for pandas.DataFrame.to_csv. Did you mean something else?
Nope, I guess that snuck in
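A quick check of the default discussed here, using a toy DataFrame (the column names and values are made up): passing `header=True` produces the same output as omitting it, so the argument can be dropped.

```python
import pandas as pd

# Toy table standing in for one of the subset files.
df = pd.DataFrame({'sample_id': ['s1', 's2'], 'TP53': [1, 0]})

default_out = df.to_csv(sep='\t', index=False)
explicit_out = df.to_csv(sep='\t', index=False, header=True)
no_header_out = df.to_csv(sep='\t', index=False, header=False)
```

Only `header=False` (or a list of replacement names) changes the written file.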
if not os.path.exists('data/xena/samples.tsv'):
    request.urlretrieve(samples, os.path.join('data', 'xena', 'samples.tsv'))

gene_map = 'https://raw.githubusercontent.com/cognoma/cancer-data/master/'\
A bit dangerous to use versioned data links and then an unversioned URL here. Protip: press y on a GitHub page to switch to the versioned URL.
Going to reference cognoma/cancer-data#23 here -- a gene information table that got uploaded to figshare would make your life easier.
Yes, it would make it easier. Thanks for the versioned URL tip!
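To illustrate the versioned-URL idea, here is a sketch that pins a raw.githubusercontent.com link to a specific commit. The helper name `pin_raw_url`, the file path, and the commit hash are all hypothetical placeholders; the real path in this pull request is truncated above and is not reproduced here. The sketch also assumes the branch name contains no slashes.

```python
from urllib.parse import urlsplit, urlunsplit


def pin_raw_url(url, ref):
    """Swap the branch in a raw.githubusercontent.com URL for another ref."""
    parts = urlsplit(url)
    if parts.netloc != 'raw.githubusercontent.com':
        raise ValueError('expected a raw.githubusercontent.com URL')
    # Path layout: /<owner>/<repo>/<ref>/<path to file>
    owner, repo, _old_ref, file_path = parts.path.lstrip('/').split('/', 3)
    new_path = '/' + '/'.join([owner, repo, ref, file_path])
    return urlunsplit(
        (parts.scheme, parts.netloc, new_path, parts.query, parts.fragment))


unversioned = ('https://raw.githubusercontent.com/cognoma/cancer-data/master/'
               'path/to/file.tsv')  # hypothetical path
commit = '0123456789abcdef0123456789abcdef01234567'  # placeholder, not a real commit
versioned = pin_raw_url(unversioned, commit)
```

A URL pinned to a commit keeps returning the same bytes even after `master` moves, which matches the behavior of the versioned figshare links.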
# Generate file names for output
base_fh = 'tissues_' + args.tissues.replace(',', '_') + '_genes_' + \
    args.genes.replace(',', '_')
base_fh = base_add + '_tissues_' + args.tissues.replace(',', '_') + \
I like this approach:
'{}_tissues_{}_genes_{}'.format(
    base_add,
    args.tissues.replace(',', '_'),
    args.genes.replace(',', '_')
)
While not much simpler ATM, it is good for your training with Python 3.6 around the corner and the ability to use the advanced Format Specification Mini-Language.
Python 3.6 will change everything.
Coming soon to a terminal near you.
PEP 498 -- Literal String Interpolation
2016-12-16
wow! That's great, ok thanks. I will adopt this format for ease in transitioning to mini language
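For comparison, a sketch of the same file-name construction written with `str.format` and with a PEP 498 f-string (Python 3.6+). The values of `args` and `base_add` are hypothetical stand-ins for the script's parsed arguments.

```python
from argparse import Namespace

# Hypothetical stand-ins for the script's parsed command-line arguments.
args = Namespace(tissues='BLCA,LUAD', genes='TP53,KRAS')
base_add = 'xena'

tissues = args.tissues.replace(',', '_')
genes = args.genes.replace(',', '_')

# str.format, as suggested in the review.
with_format = '{}_tissues_{}_genes_{}'.format(base_add, tissues, genes)

# Equivalent f-string; expressions are evaluated inline.
with_fstring = f'{base_add}_tissues_{tissues}_genes_{genes}'

# The format specification mini-language works in both styles.
padded = f'{7:03d}'
```

Both styles share the same format specification syntax, so moving from `str.format` to f-strings later should be mostly mechanical.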
figshare api to download xena data, update subsetting script, and update pancan classifier script for using cognoma data
add options to subset xena or synapse data
# This is the 1st commit message: remove checkpoint call from tp53 viz remove checkpoint and convert to pdf update copy burden analysis add two additional notebook scripts for tp53 analysis # This is the commit message greenelab#2: remove old figure generation script
Add --version flag
With this option, controlled access is no longer required. The pull request incorporates the download, processing, and usage of publicly available data from the UCSC Xena Database.