
Process raw sequencing files

Get data

Data is stored on the RDS under ‘raw data’

Transfer to the HPC for processing

Extract the compressed files (and optionally run an md5 checksum to confirm that the transfer was successful)
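
For example, the transfer and checksum verification might look like this (host and paths are placeholders, not the actual RDS location):

# lang: sh

# copy the archive from the RDS to the HPC (placeholder host and path)
rsync -av user@rds.example.ac.uk:/path/to/raw_data/161006_Run35_fastq.tar.gz .

# verify the transfer if a checksum file is shipped alongside the data
md5sum -c 161006_Run35_fastq.tar.gz.md5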

Then move on to processing with Nextflow. A tutorial can be found here:

https://github.com/dgrfs/NFCore_RNASeq_Cardiff_University_Hawk

# lang: sh

tar -xvf 161006_Run35_fastq.tar.gz

Add sequencing run ID to the files

  • Run 35: AH2572BGXY

  • Run 36: AH33YJBGXY

# lang: sh

# run inside the Run 35 fastq directory
for file in *.fastq.gz; do
    mv "$file" "${file%.fastq.gz}_AH2572BGXY.fastq.gz"
done

# run inside the Run 36 fastq directory
for file in *.fastq.gz; do
    mv "$file" "${file%.fastq.gz}_AH33YJBGXY.fastq.gz"
done
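
To sanity-check the renaming, you can first do a dry run that only prints the intended names (Run 35 shown; the same pattern applies to Run 36):

# lang: sh

# dry run: print old -> new names without moving anything
for file in *.fastq.gz; do
    echo "$file -> ${file%.fastq.gz}_AH2572BGXY.fastq.gz"
done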

Create samplesheet

# lang: sh
wget https://zenodo.org/records/12790718/files/create_samplesheet.R
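# if the renv environment doesn't exist yet, one way to create it
# (channel and version pinning are assumptions):
#   mamba create -n renv -c conda-forge r-tidyverse=2.0.0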

mamba activate renv
# requires R environment with tidyverse_2.0.0
R --vanilla < create_samplesheet.R
mamba deactivate
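
For reference, nf-core/rnaseq expects the samplesheet to be a CSV with sample, fastq_1, fastq_2, and strandedness columns; a minimal illustrative row (the sample and file names below are made up):

# lang: sh

cat << 'EOF' > samplesheet_example.csv
sample,fastq_1,fastq_2,strandedness
SAMPLE1,SAMPLE1_R1_AH2572BGXY.fastq.gz,SAMPLE1_R2_AH2572BGXY.fastq.gz,auto
EOF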
Download references and run the pipeline

# lang: sh

mkdir ref
cd ref

cat << 'EOF' > get-refs.sh
#!/bin/bash
VERSION=108
wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/homo_sapiens/Homo_sapiens.GRCh38.$VERSION.gtf.gz
EOF

bash get-refs.sh
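
# optional sanity check: confirm the downloads are intact gzip files
gzip -t Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz Homo_sapiens.GRCh38.108.gtf.gz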

# back to working directory
cd ..

# get config
wget https://zenodo.org/records/12790718/files/adapted.scw.config

# get params
wget https://zenodo.org/records/12790718/files/params.yaml
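
# params.yaml holds the remaining pipeline options; an illustrative (assumed)
# subset might look like:
#   input: samplesheet.csv
#   outdir: results
#   aligner: hisat2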

# set up screen/tmux to run in background
module load tmux
tmux
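# detach from the tmux session with Ctrl-b then d; reattach later with `tmux attach`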

# load dependencies
module load singularity-ce/3.11.4
module load nextflow/22.10.6

nextflow run nf-core/rnaseq \
    -r 3.12.0 \
    -params-file params.yaml \
    -profile singularity \
    -c adapted.scw.config \
    --fasta $PWD/ref/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz \
    --gtf $PWD/ref/Homo_sapiens.GRCh38.108.gtf.gz 
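
If the run is interrupted (for example by a node time-out or a dropped session), Nextflow can usually continue from cached results: re-run the same command with -resume appended.

# lang: sh

# same command as above, plus -resume to reuse cached work
nextflow run nf-core/rnaseq \
    -r 3.12.0 \
    -params-file params.yaml \
    -profile singularity \
    -c adapted.scw.config \
    --fasta $PWD/ref/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz \
    --gtf $PWD/ref/Homo_sapiens.GRCh38.108.gtf.gz \
    -resume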
  

MultiQC

MultiQC Report - Run 35

MultiQC Report - Run 36

Get subread scripts

# lang: sh

mkdir scripts
cd scripts
wget https://zenodo.org/records/12790718/files/slurm_subread.sh
wget https://zenodo.org/records/12790718/files/subread.R

Make sure to change /path/to/workspace to your actual workspace path in both the SLURM script and the R file.

Run subread to generate a count table.
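
A submission might look like this (assuming the standard sbatch workflow on the cluster):

# lang: sh

# submit the featureCounts job; monitor it with squeue -u $USER
sbatch scripts/slurm_subread.sh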

There are many options for generating a count table. I am most familiar with Rsubread's featureCounts, applied manually to the BAM files once the pipeline has finished the HISAT2 alignment step. nf-core/rnaseq does run featureCounts itself, but I prefer to build my own count table with code I understand, so I can be certain which reference annotation was used. A more streamlined approach would be to use the featureCounts output produced by nf-core, or the star_salmon alignment route.
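
As a minimal sketch of that manual approach (output paths and the paired-end flag are assumptions; the actual logic lives in subread.R above):

# lang: sh

# run Rsubread::featureCounts over the pipeline's HISAT2 BAMs (assumed locations)
Rscript -e '
  library(Rsubread)
  bams <- list.files("results/hisat2", pattern = "\\.bam$", full.names = TRUE)
  fc <- featureCounts(files = bams,
                      annot.ext = "ref/Homo_sapiens.GRCh38.108.gtf",  # gunzip the GTF first if needed
                      isGTFAnnotationFile = TRUE,
                      isPairedEnd = TRUE)  # assumption: paired-end libraries
  write.csv(fc$counts, "counts.csv")
'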