Accessing GTEx V7/V8 raw RNA-Seq data in the cloud

Hi all, since I posted an answer on Biostars about how to access the raw RNA-Seq GTEx data (The Genotype-Tissue Expression Consortium), I have fielded questions from researchers at many institutions asking for more information. Disclaimer: I am not part of the Sequence Read Archive staff or the National Library of Medicine, but I seem to be one of the first people to be granted access to GTEx through dbGaP after the data was moved to the AWS and GCP cloud providers in late 2018/early 2019. The following is my experience. There is beta software, called fusera, that you must use to access and transfer these special, access-protected files, and the official documentation is also in beta. So here is everything I know and how I did it.

dbGaP access

You must have a successful data access request through dbGaP. Only PIs can apply. My PI then named me as an authorized downloader. You will need the access key file, which ends in “.ngc”.


Save it to a secure location only you have access to for later use on the cloud machine.

Selecting which samples to download

I was interested in getting only the GTEx brain region samples; at the time of writing there were 1,049 of them from many donors. Go to the SRA Run Selector [the link searches for 'GTEx'; it will take a minute to load] and follow these instructions, copied from the draft documentation, courtesy of Adam Stine at SRA/NLM (thank you). The SRA Run Selector was recently updated, but here are instructions for the classic version.

Data files located on a cloud provider can be found using the DATASTORE provider, location, and filetype in the SRA Run Selector. For most data sets, the files are only accessible to compute instances using the same DATASTORE provider and DATASTORE region that the cloud storage bucket is hosted in.

To choose a list of runs available on cloud storage:

  1. On the “My Projects” page click “run selector” to open a window to select data included in your project. Run Selector may take a minute or more to open for projects with a large data set.
  2. In the Run Selector “Facets” menu (left side of the screen), click the checkbox next to “DATASTORE provider.”
  3. In the DATASTORE provider menu that appears, check the box of the desired cloud provider. (gs = Google Cloud Storage, s3 = Amazon Simple Storage Service)
  4. Note the DATASTORE region(s) the files are stored in. You will need to use a compute instance that is compatible with the region. (Generally by using the same region as the data is stored in.)
  5. Use any additional facets to reduce the list of runs to the specific data you would like to work with in the cloud.
  6. Either select the runs individually using check boxes for the runs or click the green + button to add all the runs filtered by the current facets to your “Selected” list.
  7. Click the “RunInfo Table” button on the Selected row to download a comma-separated table of the selected runs. Save this file either directly to your cloud storage or to your computer for later upload to your cloud compute/storage.

I searched for ‘GTEx’ then filtered by DATASTORE location and data type. I wanted only AWS location and RNA-Seq data.

[Screenshot: SRA Run Selector filtered by DATASTORE provider and data type]

Download this RunInfo Table and Accession List. Then filter the RunInfo Table down to the sample type you want. What you need are the SRR numbers: one SRR accession per sequencing sample. In Excel, I also filtered by molecular_data_type (‘RNA Seq (NGS)’) and made sure the runs are in s3 and are non-tumor. Create a text file with one SRR per line for the filtered list of interest. You should then sum up the file sizes from the MBytes column so you have an estimate of how much storage you need, which will be in the terabytes range.
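If you prefer the command line to Excel, here is a rough sketch of the same steps. The file name RunInfo_filtered.csv and the column positions are assumptions; check your own header first, and be aware that quoted commas in some columns can shift the field index.

head -1 RunInfo_filtered.csv | tr ',' '\n' | cat -n                                     # find which column is MBytes
awk -F',' 'NR > 1 {sum += $8} END {printf "%.2f TB\n", sum/1024/1024}' RunInfo_filtered.csv   # assuming MBytes is column 8
cut -d',' -f1 RunInfo_filtered.csv | tail -n +2 > accessions.txt                        # one SRR per line, assuming Run is the first column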

Setting up the AWS EC2

My Center (The National Center for Advancing Translational Sciences) has an AWS account for the Informatics group, and I am privileged to be able to spin up as many EC2 instances as I need. Here is a generic tutorial for launching a Linux virtual machine on AWS. Please note, I explicitly specified in the data access request that I was going to work with this data on AWS and on the NIH Biowulf HPC. You cannot store the raw sequencing data anywhere other than the locations you are authorized for. Billing? I can't comment much about billing estimates because I am not in charge of it; that is handled by my department. I tried estimating the cost for just my EC2 instance, but could not figure it out in the AWS Cost Explorer. I think you might need an active "Tag" label on your instance to track only its costs.

You will need to launch your EC2 instance in the exact same region the data is in, which for GTEx is us-east-1 (N. Virginia). I created an m4.4xlarge machine with two 16 TB attached volumes. It has 16 vCPUs and 64 GB RAM and runs Amazon Linux. The total storage size of the files I accessed was 3.9 TB for the CRAMs; in BAM format it is 6.3 TB. You may need more or less depending on how many samples you plan to access and work with.

When you create an EC2 instance, you get an access key ".pem" file instead of a password. The permissions on this file should be 600.
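For example, using the same placeholder key file name as the ssh command below:

chmod 600 AWS_keyfile.pem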

ssh -i "AWS_keyfile.pem" ec2-user@ec2-###.compute-1.amazonaws.com

Run sudo yum update when you first log in. Then I located my extra volumes on the filesystem, formatted them, and mounted them to a folder I made so they are available for storage.

sudo su
lsblk                                         # identify the attached volumes (device names like xvdb/xvdc may differ)
mkfs -t xfs /dev/xvdb                         # format the extra volume (this erases anything on it)
mkdir -p /home/ec2-user/ec2_volume/
mount /dev/xvdb /home/ec2-user/ec2_volume/    # repeat mkfs/mount for the second attached volume if you have one
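A quick df -h afterwards confirms the volume is mounted and shows the available space:

df -h /home/ec2-user/ec2_volume/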

Copy your dbGaP .ngc keyfile and accessions list to your workspace on the machine.
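For example, from your local machine (the key file, project file, and hostname below follow the placeholders used elsewhere in this post, and /home/ec2-user/Analysis/ is just the workspace path I use later for sracp):

ssh -i "AWS_keyfile.pem" ec2-user@ec2-###.compute-1.amazonaws.com "mkdir -p /home/ec2-user/Analysis"
scp -i AWS_keyfile.pem prj_##.ngc accessions.txt ec2-user@ec2-###.compute-1.amazonaws.com:/home/ec2-user/Analysis/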

Install tools

I am combining instructions here from the SRA documentation and from GitHub. This software was developed by MITRE and SRA. Here is the quick install for these tools, from the GitHub repository:

bash <(curl -fsSL https://raw.githubusercontent.com/mitre/fusera/master/install.sh)   # run the fusera/sracp install script
yum install -y fuse.x86_64                                                            # FUSE is required for mounting
chmod +x fusera
chmod +x sracp

I added these tools to my $PATH so they can be called by name instead of by the full path to the executable.
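One way to do that, assuming the binaries ended up in your current directory as the chmod commands above suggest:

sudo mv fusera sracp /usr/local/bin/    # /usr/local/bin is typically already on $PATH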

I also installed sratoolkit.

wget -q https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.9.4/sratoolkit.2.9.4-centos_linux64.tar.gz
tar xzf sratoolkit.2.9.4-centos_linux64.tar.gz
rm -f sratoolkit.2.9.4-centos_linux64.tar.gz
mv ./sratoolkit.2.9.4-centos_linux64 /opt
echo -e "export PATH=/opt/sratoolkit.2.9.4-centos_linux64/bin:\$PATH" > /etc/profile.d/sratoolkit.sh
chmod 755 /etc/profile.d/sratoolkit.sh

Install anything else you need, if you intend to do the computation/analysis on this EC2 instance as well.

Mount the accessions

Fusera mounts the data as if it were a temporary USB stick plugged in. You can't work directly on those files, and they have very restrictive permissions for security. You mount, then copy the data to your EC2 storage. I mounted to a new folder called mount_data. This step only takes a minute.

fusera mount -t prj_##.ngc -a accessions.txt -f "cram,crai" /home/ec2-user/mount_data > output.log  2>&1 &
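Before copying anything, it is worth a quick sanity check that the mount worked (the exact layout may vary, but the accessions should appear under the mount point and the log should be free of errors):

ls /home/ec2-user/mount_data | head    # the mounted accessions
tail output.log                        # any errors from fusera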

Copy the data to storage

I used sracp to copy the mounted data. Sracp simply asks for a list of accessions, your access key file, the file type (only CRAM was available), and where to put it:

sracp -a Accessions.txt -t keyfile.ngc -f "cram,crai" /outputlocation

Since sracp does NOT support parallelism and cannot pick up where it left off if the connection is interrupted (the way rsync can), I split my accessions list into 16 chunks of about 66 samples each. Rsync does not work for GTEx data, by the way. I ran these in 16 screen sessions because AWS will terminate your interactive command line session after roughly an hour of inactivity. Here is my R script to chunk up and run these sracp commands, with my specifics X-ed out.

# Split up an accessions list into 16 parts and print the sracp commands to run in parallel.

library(data.table)

# One SRR accession per line, no header
accessions <- fread('accessions.txt', header = FALSE)

# Split a vector x into n roughly equal chunks
chunk2 <- function(x, n) split(x, cut(seq_along(x), n, labels = FALSE))

chunks <- chunk2(accessions$V1, 16)

# Write each chunk to its own accession list file
dir.create('batches', showWarnings = FALSE)
for (i in seq_along(chunks)){
  chunk <- data.table('chunk' = chunks[[i]])
  fwrite(chunk, paste0('batches/Chunk_', i, '.txt'), col.names = FALSE, row.names = FALSE, quote = FALSE)
}

# Print one detached screen + sracp command per chunk
for (i in seq_along(chunks)){
  cat(paste0('screen -dm bash -c \'sracp -a /home/ec2-user/Analysis/batches/Chunk_', i, '.txt -t /home/ec2-user/Analysis/prj_##.ngc -f "cram,crai" /home/ec2-user/Analysis/CRAMS ; exec sh\' ', '\n\n'))
}

# Basic syntax for running a detached screen session:
# screen -dm bash -c 'sleep 5; exec sh'

# Full sracp syntax:
# screen -dm bash -c 'sracp -a /home/ec2-user/Analysis/batches/Chunk_16.txt -t /home/ec2-user/Analysis/prj_##.ngc -f "cram,crai" /home/ec2-user/Analysis/CRAMS ; exec sh'

The last loop with cat prints the 16 screen commands to run.

“That’s it, so easy!”

Just kidding. This took a few months of trying to get right. At this point I had a working copy of the authorized files in my EC2 storage space. I was then able to securely transfer the CRAM files to a secure location on Biowulf, where I proceeded with analysis. I used a screen session on Helix and an scp -i AWS_keyfile.pem style command; the transfer took about two days.
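For illustration, the command looked roughly like this, run from inside a screen session on Helix (the destination path is a placeholder; the source path matches the sracp output directory above):

scp -i AWS_keyfile.pem -r ec2-user@ec2-###.compute-1.amazonaws.com:/home/ec2-user/Analysis/CRAMS /data/$USER/GTEx_CRAMS/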

In the future I may add a section here or a new post with how I converted CRAM -> BAM -> FASTQ in order to realign these to hg38. I ended up needing 16 TB of space for this project with so many file conversions.
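Until then, here is a minimal sketch of that conversion with samtools, assuming paired-end reads and that you have the reference FASTA the CRAMs were aligned against (all file names here are placeholders):

samtools view -b -T reference.fa -o sample.bam sample.cram                # CRAM -> BAM (needs the original reference)
samtools sort -n -o sample.namesorted.bam sample.bam                      # name-sort so read mates are adjacent
samtools fastq -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -0 /dev/null -s /dev/null -n sample.namesorted.bam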

I am actively analyzing the brain RNA-Seq samples I downloaded now and it will be written up into a proper publication when we are ready. I hope this helps some people and thanks for reading. -CM

 

 
