Adding Xena Data Option #2

Merged: 8 commits merged on Oct 13, 2016

Conversation

Collaborator

@gwaybio gwaybio commented Sep 26, 2016

With this option, controlled access is no longer required. The pull request incorporates the download, processing, and usage of publicly available data from the UCSC Xena Database.

if not os.path.exists('data/xena'):
    os.makedirs('data/xena')

expression = 'https://ndownloader.figshare.com/files/5864859'

I'd recommend a more sophisticated method than hardcoding file IDs, for example something similar to what we're doing in 1.download.ipynb for cognoma/machine-learning, which uses the figshare API to retrieve file info. The benefit is that you specify a single ID, either the article ID to get the latest version or a versioned ID such as a DOI, and then you get all the files.
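A minimal sketch of that idea, assuming the public figshare v2 API returns article metadata as JSON with a `files` list whose entries carry `name` and `download_url` fields. The helper names (`article_url`, `file_urls`, `fetch_article`) are hypothetical and are not taken from 1.download.ipynb.

```python
import json
from urllib import request

# Assumed base endpoint of the public figshare v2 API.
FIGSHARE_API = 'https://api.figshare.com/v2/articles/{}'


def article_url(article_id, version=None):
    """Build the metadata URL for an article, optionally pinned to a version."""
    url = FIGSHARE_API.format(article_id)
    if version is not None:
        url += '/versions/{}'.format(version)
    return url


def file_urls(article):
    """Map each file name in the article metadata to its download URL."""
    return {f['name']: f['download_url'] for f in article['files']}


def fetch_article(article_id, version=None):
    """Download and parse the article metadata (requires network access)."""
    with request.urlopen(article_url(article_id, version)) as response:
        return json.loads(response.read().decode('utf-8'))
```

With something like this, the download script would only need to store one article ID (and ideally a version), then loop over `file_urls(fetch_article(article_id, version))` instead of listing each file URL by hand.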

Collaborator Author

Nice! I will update


# Subset data
rnaseq_df.drop('SLC35E2', axis=0, inplace=True)
if xena:
    gene_map_dict = dict((str(v), str(k)) for k, v in

expr_fh = 'data/xena/expression.tsv.bz2'
mut_fh = 'data/xena/xena_mutation_table.tsv'
clin_fh = 'data/xena/samples.tsv'
gene_map = pd.read_table('data/xena/HiSeqV2-gene-map.tsv', index_col=0)

I see the purpose of dataframe indexes sometimes (like when we use pandas for matrices). But here I'm with @hadley -- just make all variables regular columns.
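A small sketch of what keeping every variable as a regular column could look like. The column names (`entrez_gene_id`, `symbol`) and the inline table are assumptions for illustration; the real HiSeqV2-gene-map.tsv layout may differ.

```python
import io

import pandas as pd

# Hypothetical two-column gene map standing in for the real file.
tsv = io.StringIO('entrez_gene_id\tsymbol\n7157\tTP53\n3845\tKRAS\n')

# No index_col: both variables stay regular, named columns.
gene_map = pd.read_csv(tsv, sep='\t', dtype=str)

# Build the symbol -> ID lookup by naming both columns explicitly.
gene_map_dict = dict(zip(gene_map['symbol'], gene_map['entrez_gene_id']))
```

Naming both columns makes the direction of the mapping visible at the call site, which is harder to see when one of them is hidden in the index.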


xena_mut_df = (xena_mut_df.pivot_table(index='#sample', columns='gene',
                                       values='mutation', fill_value=0)
               .astype(bool).astype(int))

# Write subsets to file

header=True is default for pandas.DataFrame.to_csv. Did you mean something else?

Collaborator Author

Nope, I guess that snuck in

if not os.path.exists('data/xena/samples.tsv'):
    request.urlretrieve(samples, os.path.join('data', 'xena', 'samples.tsv'))

gene_map = 'https://raw.githubusercontent.com/cognoma/cancer-data/master/'\

A bit dangerous to use versioned data links and then an unversioned URL here. Protip: press y on a GitHub page to switch to the versioned URL.

Going to reference cognoma/cancer-data#23 here -- a gene information table that got uploaded to figshare would make your life easier.
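To illustrate the versioned versus unversioned distinction, here is a hedged sketch that builds raw GitHub URLs from a ref. The file path and commit SHA below are placeholders, not the values used in this pull request, and the helper names are hypothetical.

```python
import re

RAW_TEMPLATE = 'https://raw.githubusercontent.com/{repo}/{ref}/{path}'


def raw_url(repo, ref, path):
    """Build a raw.githubusercontent.com URL for a file at a given ref."""
    return RAW_TEMPLATE.format(repo=repo, ref=ref, path=path)


def is_pinned(ref):
    """Return True if ref looks like a full 40-character commit SHA."""
    return re.fullmatch(r'[0-9a-f]{40}', ref) is not None


# Placeholder values for illustration only.
REPO = 'cognoma/cancer-data'
PATH = 'path/to/gene-map.tsv'                        # hypothetical path
COMMIT = '0123456789abcdef0123456789abcdef01234567'  # placeholder SHA

unversioned = raw_url(REPO, 'master', PATH)  # contents can change over time
versioned = raw_url(REPO, COMMIT, PATH)      # contents fixed at that commit
```

A check such as `is_pinned` could be used to refuse downloads from branch names, so that every input to the analysis is tied to a fixed revision.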

Collaborator Author

Yes, it would make it easier. Thanks for the versioned URL tip!


# Generate file names for output
base_fh = 'tissues_' + args.tissues.replace(',', '_') + '_genes_' + \
args.genes.replace(',', '_')
base_fh = base_add + '_tissues_' + args.tissues.replace(',', '_') + \

I like this approach:

'{}_tissues_{}_genes_{}'.format(
    base_add,
    args.tissues.replace(',', '_'),
    args.genes.replace(',', '_')
)

While not much simpler at the moment, it is good practice with Python 3.6 around the corner, and it gives you access to the advanced Format Specification Mini-Language.
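A short sketch of the `str.format` approach together with a few Format Specification Mini-Language features. The values of `base_add`, `tissues`, and `genes` are made up for illustration and stand in for the script's real arguments.

```python
# Hypothetical stand-ins for base_add, args.tissues and args.genes.
base_add = 'xena'
tissues = 'BLCA,LUAD'
genes = 'TP53,KRAS'

# Positional fields are filled in order, as in the suggestion above.
base_fh = '{}_tissues_{}_genes_{}'.format(
    base_add,
    tissues.replace(',', '_'),
    genes.replace(',', '_')
)

# The mini-language goes after a colon inside the braces.
rounded = '{:.3f}'.format(0.87654)  # fixed-point, three decimals
padded = '{:>8}'.format('TP53')     # right-align in a width of 8
zero_filled = '{:04d}'.format(42)   # zero-pad an integer to width 4
```

The same specifiers carry over unchanged to other formatting styles, so learning them here is not wasted effort.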

@dhimmel dhimmel Sep 27, 2016

Python 3.6 will change everything.

Coming soon to a terminal near you.

PEP 498 -- Literal String Interpolation

2016-12-16
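For comparison, a sketch of how the same file name might look with PEP 498 f-strings, which require Python 3.6 or later. The variable values are again hypothetical stand-ins for the script's arguments.

```python
# Hypothetical stand-ins for base_add, args.tissues and args.genes.
base_add = 'xena'
tissues = 'BLCA,LUAD'
genes = 'TP53,KRAS'

# Precompute the pieces so the f-string expressions stay simple.
tissue_part = tissues.replace(',', '_')
gene_part = genes.replace(',', '_')

# Names are interpolated directly from the enclosing scope.
base_fh = f'{base_add}_tissues_{tissue_part}_genes_{gene_part}'

# Format specifiers work the same way as with str.format.
score = f'{0.87654:.3f}'
```

The replacements are computed beforehand because, before Python 3.12, an expression inside an f-string cannot reuse the quote character that delimits the string.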

Collaborator Author

Wow! That's great, OK, thanks. I will adopt this format to ease the transition to the mini-language.

figshare api to download xena data, update subseting script, and update pancan classifier script for using cognoma data
add options to subset xena or synapse data
@gwaybio gwaybio merged commit 0766a89 into greenelab:master Oct 13, 2016
@gwaybio gwaybio deleted the xena branch October 13, 2016 13:23
gwaybio added a commit to gwaybio/pancancer that referenced this pull request Oct 10, 2018
# This is the 1st commit message:

remove checkpoint call from tp53 viz

remove checkpoint and convert to pdf

update copy burden analysis

add two additional notebook scripts for tp53 analysis

# This is the commit message greenelab#2:

remove old figure generation script
blankenberg pushed a commit to blankenberg/pancancer that referenced this pull request Jan 31, 2020