M. capitata RNAseq QC, Alignment, Assembly Bioinformatic Pipeline
Data uploaded and analyzed on the URI HPC bluewaves server
The following document contains the bioinformatic pipeline used for cleaning, aligning and assembling our raw RNA sequences. The most recent scripts based on this pipeline are available on the project repository. Note that the scripts are numbered in order of execution.
Project overview
Bioinformatic tools used in analysis:
Quality check: FastQC, MultiQC
Quality trimming: Fastp
Alignment to reference genome: HISAT2
Preparation of alignment for assembly: SAMtools
Transcript assembly and quantification: StringTie
Prepare work space
- Upload raw reads and reference genome to server
- Assess that your files have all uploaded correctly
- Prepare your working directory
- Install all necessary programs
Upload raw reads and reference genome to server
This is done with the scp, or “secure copy”, Linux command. scp securely transfers files between a local host and a remote host, or between two remote hosts, using SSH for authentication.
++Secure Copy (scp) Options++:
- -P - Identifies the port number
- -r - Recursively copy entire directories
scp -r -P xxxx <path_to_raw_reads> echille@kitt.uri.edu:<path_to_storage>
scp -P xxxx <path_to_reference> echille@kitt.uri.edu:<path_to_storage>
Check to make sure you have all of your files, and that they all follow the same naming convention. There should be 96 fastq.gz files. First we will look at our list of files in our read storage directory, and then we will count the number of fastq.gz files.
ls
ls *.fastq.gz | wc -l
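The naming-convention check described above can also be scripted. The following is a minimal sketch, assuming the `<sample>_R1_001.fastq.gz` / `<sample>_R2_001.fastq.gz` pattern used throughout this pipeline; the temporary directory and its empty files are stand-ins for the real read storage directory.

```shell
# Sketch: count fastq.gz files and confirm each forward (R1) file has a reverse (R2) mate.
# demo_dir and its empty files are hypothetical stand-ins for the read storage directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/119_R1_001.fastq.gz" "$demo_dir/119_R2_001.fastq.gz" \
      "$demo_dir/120_R1_001.fastq.gz"   # sample 120 deliberately lacks its R2 mate

# Count only files that end in .fastq.gz
n_files=$(find "$demo_dir" -maxdepth 1 -name '*.fastq.gz' | wc -l | tr -d ' ')
echo "fastq.gz files found: $n_files"

# For every R1 file, build the expected R2 name and record it if it is absent
missing=""
for r1 in "$demo_dir"/*_R1_001.fastq.gz; do
    r2="${r1%_R1_001.fastq.gz}_R2_001.fastq.gz"
    if [ ! -e "$r2" ]; then
        missing="$missing $(basename "$r2")"
    fi
done
echo "missing mates:${missing:- none}"
```

Pointing `demo_dir` at the real storage directory (and dropping the `touch` line) would run the same check on the uploaded reads.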
Assess that your files have all uploaded correctly
Check that the files uploaded correctly using the md5sum command. First store the MD5 checksums in a file, then verify the files against that checksum file.
md5sum *.fastq.gz > raw_checksum.md5
md5sum -c raw_checksum.md5
- ++Md5 Output++:
All files “OK”
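As a variation on the check above, the checksum file can be generated before the transfer and verified afterwards, so that a file altered in transit is reported as FAILED. This is only a sketch under that assumption: the two temporary directories stand in for the local and remote hosts, `cp` stands in for `scp`, and GNU `md5sum` is assumed to be available.

```shell
# Sketch: checksum before "transfer", verify after. All paths here are hypothetical.
src=$(mktemp -d)   # stand-in for the local host
dst=$(mktemp -d)   # stand-in for the remote host
printf 'ACGT\n' > "$src/demo.fastq"

# 1. Record the checksum on the source side
( cd "$src" && md5sum demo.fastq > raw_checksum.md5 )

# 2. "Transfer" the data together with the checksum file (cp stands in for scp)
cp "$src/demo.fastq" "$src/raw_checksum.md5" "$dst/"

# 3. Verify on the destination side; an intact copy reports OK
( cd "$dst" && md5sum -c raw_checksum.md5 ) && intact_status=OK || intact_status=FAILED

# 4. Simulate a corrupted transfer; md5sum -c now reports FAILED
printf 'ACGA\n' > "$dst/demo.fastq"
( cd "$dst" && md5sum -c raw_checksum.md5 ) && corrupt_status=OK || corrupt_status=FAILED

echo "intact copy: $intact_status, corrupted copy: $corrupt_status"
```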
Check number of reads per file
zgrep -c "@GWNJ" *.fastq.gz
- ++Raw Read Counts++:
Forward raw read | Count | Reverse raw read | Count |
---|---|---|---|
119_R1_001.fastq.gz | 21430810 | 119_R2_001.fastq.gz | 21430810 |
120_R1_001.fastq.gz | 22587714 | 120_R2_001.fastq.gz | 22587714 |
121_R1_001.fastq.gz | 19342285 | 121_R2_001.fastq.gz | 19342285 |
127_R1_001.fastq.gz | 23494302 | 127_R2_001.fastq.gz | 23494302 |
128_R1_001.fastq.gz | 19003021 | 128_R2_001.fastq.gz | 19003021 |
129_R1_001.fastq.gz | 18364685 | 129_R2_001.fastq.gz | 18364685 |
130_R1_001.fastq.gz | 19545789 | 130_R2_001.fastq.gz | 19545789 |
131_R1_001.fastq.gz | 18473442 | 131_R2_001.fastq.gz | 18473442 |
132_R1_001.fastq.gz | 18818657 | 132_R2_001.fastq.gz | 18818657 |
133_R1_001.fastq.gz | 20820329 | 133_R2_001.fastq.gz | 20820329 |
134_R1_001.fastq.gz | 19009948 | 134_R2_001.fastq.gz | 19009948 |
153_R1_001.fastq.gz | 18739012 | 153_R2_001.fastq.gz | 18739012 |
154_R1_001.fastq.gz | 25865443 | 154_R2_001.fastq.gz | 25865443 |
155_R1_001.fastq.gz | 23884925 | 155_R2_001.fastq.gz | 23884925 |
156_R1_001.fastq.gz | 23346656 | 156_R2_001.fastq.gz | 23346656 |
157_R1_001.fastq.gz | 19122552 | 157_R2_001.fastq.gz | 19122552 |
158_R1_001.fastq.gz | 19082407 | 158_R2_001.fastq.gz | 19082407 |
159_R1_001.fastq.gz | 19524641 | 159_R2_001.fastq.gz | 19524641 |
160_R1_001.fastq.gz | 19606641 | 160_R2_001.fastq.gz | 19606641 |
162_R1_001.fastq.gz | 19809873 | 162_R2_001.fastq.gz | 19809873 |
163_R1_001.fastq.gz | 17708842 | 163_R2_001.fastq.gz | 17708842 |
164_R1_001.fastq.gz | 18134442 | 164_R2_001.fastq.gz | 18134442 |
165_R1_001.fastq.gz | 22428255 | 165_R2_001.fastq.gz | 22428255 |
166_R1_001.fastq.gz | 19475406 | 166_R2_001.fastq.gz | 19475406 |
167_R1_001.fastq.gz | 19437286 | 167_R2_001.fastq.gz | 19437286 |
168_R1_001.fastq.gz | 20184280 | 168_R2_001.fastq.gz | 20184280 |
169_R1_001.fastq.gz | 16229966 | 169_R2_001.fastq.gz | 16229966 |
179_R1_001.fastq.gz | 17619983 | 179_R2_001.fastq.gz | 17619983 |
180_R1_001.fastq.gz | 16093732 | 180_R2_001.fastq.gz | 16093732 |
181_R1_001.fastq.gz | 15181783 | 181_R2_001.fastq.gz | 15181783 |
182_R1_001.fastq.gz | 23523998 | 182_R2_001.fastq.gz | 23523998 |
183_R1_001.fastq.gz | 15687664 | 183_R2_001.fastq.gz | 15687664 |
184_R1_001.fastq.gz | 15938817 | 184_R2_001.fastq.gz | 15938817 |
185_R1_001.fastq.gz | 16001863 | 185_R2_001.fastq.gz | 16001863 |
186_R1_001.fastq.gz | 17647767 | 186_R2_001.fastq.gz | 17647767 |
212_R1_001.fastq.gz | 18557355 | 212_R2_001.fastq.gz | 18557355 |
215_R1_001.fastq.gz | 17131176 | 215_R2_001.fastq.gz | 17131176 |
218_R1_001.fastq.gz | 20737823 | 218_R2_001.fastq.gz | 20737823 |
221_R1_001.fastq.gz | 18375097 | 221_R2_001.fastq.gz | 18375097 |
359_R1_001.fastq.gz | 19316079 | 359_R2_001.fastq.gz | 19316079 |
361_R1_001.fastq.gz | 19161682 | 361_R2_001.fastq.gz | 19161682 |
363_R1_001.fastq.gz | 23704031 | 363_R2_001.fastq.gz | 23704031 |
365_R1_001.fastq.gz | 20178446 | 365_R2_001.fastq.gz | 20178446 |
367_R1_001.fastq.gz | 18859459 | 367_R2_001.fastq.gz | 18859459 |
371_R1_001.fastq.gz | 19914348 | 371_R2_001.fastq.gz | 19914348 |
373_R1_001.fastq.gz | 16196572 | 373_R2_001.fastq.gz | 16196572 |
375_R1_001.fastq.gz | 17074897 | 375_R2_001.fastq.gz | 17074897 |
379_R1_001.fastq.gz | 18041698 | 379_R2_001.fastq.gz | 18041698 |
1101_R1_001.fastq.gz | 21636831 | 1101_R2_001.fastq.gz | 21636831 |
1548_R1_001.fastq.gz | 17651137 | 1548_R2_001.fastq.gz | 17651137 |
1628_R1_001.fastq.gz | 21647915 | 1628_R2_001.fastq.gz | 21647915 |
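Counting header lines by their instrument prefix works for these data, but a quality line can in principle begin with the same characters. A prefix-independent cross-check is to divide the number of lines by four, since each FASTQ record spans four lines. The sketch below uses a tiny synthetic file (not real data) in which the second read's quality string happens to be `@GWNJ`.

```shell
# Sketch: two ways to count reads in a gzipped FASTQ file. The demo file is synthetic.
demo=$(mktemp -d)
printf '@GWNJ:1\nACGT\n+\nIIII\n@GWNJ:2\nTTGAC\n+\n@GWNJ\n' | gzip > "$demo/demo_R1_001.fastq.gz"

# Method 1: count lines containing the header prefix (the approach used in this pipeline)
grep_count=$(gzip -dc "$demo/demo_R1_001.fastq.gz" | grep -c '@GWNJ')

# Method 2: count all lines and divide by 4 (one record = header, sequence, +, quality)
n_lines=$(gzip -dc "$demo/demo_R1_001.fastq.gz" | wc -l | tr -d ' ')
n_reads=$((n_lines / 4))

echo "prefix count: $grep_count, lines/4 count: $n_reads"
```

Here the prefix count is inflated by the quality line, while the line-based count gives the true number of reads. If both methods agree on the real files, the counts in the table can be trusted.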
Prepare your working directory
Create your working directory. Within your working directory make subdirectories for scripts, data, and output. Enter the data directory and make a subdirectory to place raw reads and reference files.
mkdir mcap2019
cd mcap2019
mkdir scripts
mkdir output
mkdir data
cd data
mkdir raw
mkdir ref
Create symbolic links to raw reads and reference sequence
ln -s <path_to_raw_reads>/*.fastq.gz ./raw/
ln -s <path_to_reference> ./ref/
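The same directory layout can be created in one step with `mkdir -p`, which makes any missing parent directories. A minimal sketch follows; the temporary `base` directory is a hypothetical stand-in for wherever mcap2019 should live.

```shell
# Sketch: build the whole working-directory tree at once.
base=$(mktemp -d)   # hypothetical parent directory
mkdir -p "$base/mcap2019/scripts" \
         "$base/mcap2019/output" \
         "$base/mcap2019/data/raw" \
         "$base/mcap2019/data/ref"

# List the directories that now exist, relative to the project root
layout=$(cd "$base/mcap2019" && find . -type d | sort)
n_dirs=$(printf '%s\n' "$layout" | wc -l | tr -d ' ')
printf '%s\n' "$layout"
```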
Install all necessary programs
Install programs within your conda environment, when possible.
Create and activate a conda environment. Must have miniconda installed.
conda create -n mcap2019
conda activate mcap2019
Install all necessary programs within your conda environment
conda install fastqc
conda install multiqc
conda install fastp
conda install hisat2
conda install samtools
The most recent version of StringTie (v2.1.0) is not yet available on Bioconda. The version that conda installs (v2.0) has errors when run with the ‘-e’ option, which we need for the StringTie step below, so we will install StringTie outside of the conda environment. The following commands install the latest version and test the binary. The build took only about 3 minutes.
git clone https://github.com/gpertea/stringtie
cd stringtie
make release
make test
Quality control and read trimming
- Initial quality check of raw reads
- Quality-trimming of reads
- Post-trimming quality check of reads
Initial quality check of raw reads
FastQC is a bioinformatic tool that generates sequence quality information for your reads. MultiQC aggregates the FastQC analysis logs and summarizes the results in an HTML report.
Run FastQC in the raw directory.
fastqc ./*.fastq.gz
Make a data subdirectory for your raw fastqc results and move FastQC results into there. Then compile MultiQC report.
mkdir ../fastqc_raw
cd ../fastqc_raw
mv ../raw/*fastqc* ./
multiqc ./
Move the MultiQC report to output directory. Then, from the local host securely copy the MultiQC report to a local directory.
mv ./multiqc_report.html ../../output/multiqc_report_raw.html
scp -P xxxx echille@kitt.uri.edu:<path_to_output>/multiqc_report_raw.html /Users/user/<path_to_local_directory>
- ++Output++:
All raw sequences 150 bp
Quality-trimming of reads
To clean our reads we will be using a program called FastP, a tool designed to provide fast all-in-one preprocessing for FastQ files.
++Goals of quality trimming++:
- Remove adapters
- Remove low-quality reads
- Remove reads with high abundance of unknown bases
Make a subdirectory for cleaned reads within the data directory.
mkdir cleaned_reads
++FastP Arguments/Options Used++:
- --in1 - Path to forward read input
- --in2 - Path to reverse read input
- --out1 - Path to forward read output
- --out2 - Path to reverse read output
- --failed_out - Specify file to store reads that fail filters
- --qualified_quality_phred - Phred quality >= -q is qualified (20)
- --unqualified_percent_limit - % of bases allowed to be unqualified (10)
- --length_required - Set required sequence length (100)
- --detect_adapter_for_pe - Adapters can be trimmed by overlap analysis alone; however, --detect_adapter_for_pe usually gives slightly cleaner output, at the cost of a slightly slower run time
- --cut_right - Move a sliding window from front to tail. Use --cut_right_window_size to set the window size (5) and --cut_right_mean_quality to set the mean quality threshold (20).
- --html - The html format report file name
sh -c 'for file in "119" "120" "121" "127" "128" "129" "130" "131" "132" "133" "134" "153" "154" "155" "156" "157" "158" "159" "160" "162" "163" "164" "165" "166" "167" "168" "169" "179" "180" "181" "182" "183" "184" "185" "186" "212" "215" "218" "221" "359" "361" "363" "365" "367" "371" "373" "375" "379"
do
fastp --in1 ${file}_R1_001.fastq.gz --in2 ${file}_R2_001.fastq.gz --out1 ../cleaned_reads/${file}_R1_001.clean.fastq.gz --out2 ../cleaned_reads/${file}_R2_001.clean.fastq.gz --failed_out ../cleaned_reads/${file}_failed.txt --qualified_quality_phred 20 --unqualified_percent_limit 10 --length_required 100 --detect_adapter_for_pe --cut_right --cut_right_window_size 5 --cut_right_mean_quality 20
done'
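Instead of typing every sample ID by hand, the IDs can be derived from the R1 file names, which avoids accidentally leaving a sample out of the list. The sketch below is a dry run under that assumption: it only prints the fastp command for each sample rather than executing it, and the temporary directory with empty files is a hypothetical stand-in for the raw directory.

```shell
# Sketch: derive sample IDs from file names and print (not run) one fastp command per sample.
raw=$(mktemp -d)   # hypothetical stand-in for the raw read directory
touch "$raw/119_R1_001.fastq.gz"  "$raw/119_R2_001.fastq.gz" \
      "$raw/1101_R1_001.fastq.gz" "$raw/1101_R2_001.fastq.gz"

n_cmds=0
for r1 in "$raw"/*_R1_001.fastq.gz; do
    # Strip the directory and the R1 suffix to leave only the sample ID
    sample=$(basename "$r1" _R1_001.fastq.gz)
    echo "fastp --in1 ${sample}_R1_001.fastq.gz --in2 ${sample}_R2_001.fastq.gz" \
         "--out1 ../cleaned_reads/${sample}_R1_001.clean.fastq.gz" \
         "--out2 ../cleaned_reads/${sample}_R2_001.clean.fastq.gz" \
         "--failed_out ../cleaned_reads/${sample}_failed.txt" \
         "--qualified_quality_phred 20 --unqualified_percent_limit 10 --length_required 100" \
         "--detect_adapter_for_pe --cut_right --cut_right_window_size 5 --cut_right_mean_quality 20"
    n_cmds=$((n_cmds + 1))
done
echo "commands generated: $n_cmds"
```

Once the printed commands look right, removing the `echo` and its surrounding quotes would run fastp itself.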
Post-trimming quality check of reads
Now that we’ve trimmed the adapters, low-quality reads, and reads with many unknown bases, we will check our sequence quality again. First, we will check the clean read counts, and then run FastQC again to examine our trimmed sequence lengths, our GC and adapter content, and our Phred quality scores.
Check the clean read count.
zgrep -c "@GWNJ" *.fastq.gz
- ++Clean Read Counts++:
Forward clean read | Count | Reverse clean read | Count |
---|---|---|---|
119_R1_001.clean.fastq.gz | 15735471 | 119_R2_001.clean.fastq.gz | 15735471 |
120_R1_001.clean.fastq.gz | 16746988 | 120_R2_001.clean.fastq.gz | 16746988 |
121_R1_001.clean.fastq.gz | 14738013 | 121_R2_001.clean.fastq.gz | 14738013 |
127_R1_001.clean.fastq.gz | 16571299 | 127_R2_001.clean.fastq.gz | 16571299 |
128_R1_001.clean.fastq.gz | 13859037 | 128_R2_001.clean.fastq.gz | 13859037 |
129_R1_001.clean.fastq.gz | 13206196 | 129_R2_001.clean.fastq.gz | 13206196 |
130_R1_001.clean.fastq.gz | 14162329 | 130_R2_001.clean.fastq.gz | 14162329 |
131_R1_001.clean.fastq.gz | 13293825 | 131_R2_001.clean.fastq.gz | 13293825 |
132_R1_001.clean.fastq.gz | 13809914 | 132_R2_001.clean.fastq.gz | 13809914 |
133_R1_001.clean.fastq.gz | 14505821 | 133_R2_001.clean.fastq.gz | 14505821 |
134_R1_001.clean.fastq.gz | 13596112 | 134_R2_001.clean.fastq.gz | 13596112 |
153_R1_001.clean.fastq.gz | 13186807 | 153_R2_001.clean.fastq.gz | 13186807 |
154_R1_001.clean.fastq.gz | 18637363 | 154_R2_001.clean.fastq.gz | 18637363 |
155_R1_001.clean.fastq.gz | 16751962 | 155_R2_001.clean.fastq.gz | 16751962 |
156_R1_001.clean.fastq.gz | 16717897 | 156_R2_001.clean.fastq.gz | 16717897 |
157_R1_001.clean.fastq.gz | 13851434 | 157_R2_001.clean.fastq.gz | 13851434 |
158_R1_001.clean.fastq.gz | 13354481 | 158_R2_001.clean.fastq.gz | 13354481 |
159_R1_001.clean.fastq.gz | 14457629 | 159_R2_001.clean.fastq.gz | 14457629 |
160_R1_001.clean.fastq.gz | 14489484 | 160_R2_001.clean.fastq.gz | 14489484 |
162_R1_001.clean.fastq.gz | 14700108 | 162_R2_001.clean.fastq.gz | 14700108 |
163_R1_001.clean.fastq.gz | 13185051 | 163_R2_001.clean.fastq.gz | 13185051 |
164_R1_001.clean.fastq.gz | 12773001 | 164_R2_001.clean.fastq.gz | 12773001 |
165_R1_001.clean.fastq.gz | 16091087 | 165_R2_001.clean.fastq.gz | 16091087 |
166_R1_001.clean.fastq.gz | 14487569 | 166_R2_001.clean.fastq.gz | 14487569 |
167_R1_001.clean.fastq.gz | 14479586 | 167_R2_001.clean.fastq.gz | 14479586 |
168_R1_001.clean.fastq.gz | 14473684 | 168_R2_001.clean.fastq.gz | 14473684 |
169_R1_001.clean.fastq.gz | 11824893 | 169_R2_001.clean.fastq.gz | 11824893 |
179_R1_001.clean.fastq.gz | 12913165 | 179_R2_001.clean.fastq.gz | 12913165 |
180_R1_001.clean.fastq.gz | 11919642 | 180_R2_001.clean.fastq.gz | 11919642 |
181_R1_001.clean.fastq.gz | 11187477 | 181_R2_001.clean.fastq.gz | 11187477 |
182_R1_001.clean.fastq.gz | 17048929 | 182_R2_001.clean.fastq.gz | 17048929 |
183_R1_001.clean.fastq.gz | 11526429 | 183_R2_001.clean.fastq.gz | 11526429 |
184_R1_001.clean.fastq.gz | 12030136 | 184_R2_001.clean.fastq.gz | 12030136 |
185_R1_001.clean.fastq.gz | 11744187 | 185_R2_001.clean.fastq.gz | 11744187 |
186_R1_001.clean.fastq.gz | 13327721 | 186_R2_001.clean.fastq.gz | 13327721 |
212_R1_001.clean.fastq.gz | 13858209 | 212_R2_001.clean.fastq.gz | 13858209 |
215_R1_001.clean.fastq.gz | 12766725 | 215_R2_001.clean.fastq.gz | 12766725 |
218_R1_001.clean.fastq.gz | 15156983 | 218_R2_001.clean.fastq.gz | 15156983 |
221_R1_001.clean.fastq.gz | 13847948 | 221_R2_001.clean.fastq.gz | 13847948 |
359_R1_001.clean.fastq.gz | 14742426 | 359_R2_001.clean.fastq.gz | 14742426 |
361_R1_001.clean.fastq.gz | 14274364 | 361_R2_001.clean.fastq.gz | 14274364 |
363_R1_001.clean.fastq.gz | 17167695 | 363_R2_001.clean.fastq.gz | 17167695 |
365_R1_001.clean.fastq.gz | 15092858 | 365_R2_001.clean.fastq.gz | 15092858 |
367_R1_001.clean.fastq.gz | 13882376 | 367_R2_001.clean.fastq.gz | 13882376 |
371_R1_001.clean.fastq.gz | 14862977 | 371_R2_001.clean.fastq.gz | 14862977 |
373_R1_001.clean.fastq.gz | 11650596 | 373_R2_001.clean.fastq.gz | 11650596 |
375_R1_001.clean.fastq.gz | 12509060 | 375_R2_001.clean.fastq.gz | 12509060 |
379_R1_001.clean.fastq.gz | 13669693 | 379_R2_001.clean.fastq.gz | 13669693 |
1101_R1_001.clean.fastq.gz | 16961235 | 1101_R2_001.clean.fastq.gz | 16961235 |
1548_R1_001.clean.fastq.gz | 13570117 | 1548_R2_001.clean.fastq.gz | 13570117 |
1628_R1_001.clean.fastq.gz | 16640766 | 1628_R2_001.clean.fastq.gz | 16640766 |
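To put the raw and clean counts side by side, the fraction of read pairs that survived trimming can be computed per sample. As a worked example, the sketch below uses the two counts reported for sample 119 in the tables in this document; the variable names are only illustrative.

```shell
# Sketch: percentage of read pairs retained after trimming, for sample 119.
raw_reads=21430810     # raw count for 119 from the Raw Read Counts table
clean_reads=15735471   # clean count for 119 from the Clean Read Counts table

# awk handles the floating-point division; LC_ALL=C keeps "." as the decimal separator
pct_kept=$(LC_ALL=C awk -v r="$raw_reads" -v c="$clean_reads" \
    'BEGIN { printf "%.2f", 100 * c / r }')

echo "Sample 119: ${pct_kept}% of read pairs kept"
```

For sample 119 this gives 73.42%, meaning that roughly a quarter of the read pairs were removed by the filters.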
Run FastQC on clean reads
fastqc ./*.fastq.gz
Make a data subdirectory for your clean fastqc results and move FastQC results into there. Then compile MultiQC report.
mkdir ../fastqc_clean
cd ../fastqc_clean
mv ../cleaned_reads/*fastqc* ./
multiqc ./
Move the MultiQC report to output directory. Then, from the local host securely copy the MultiQC report to a local directory.
mv ./multiqc_report.html ../../output/multiqc_report_clean.html
scp -P xxxx echille@kitt.uri.edu:<path_to_output>/multiqc_report_clean.html /Users/user/<path_to_local_directory>
- ++Output++:
Alignment of clean reads to reference genome
HISAT2 is a fast and sensitive alignment program for mapping next-generation DNA and RNA sequencing reads to a reference genome.
- Index the reference genome
- Alignment of clean reads to the reference genome
Create a subdirectory within data for HISAT2 and symbolically link to your clean fastq files.
mkdir hisat2
cd hisat2
ln -s ../cleaned_reads/*fastq* ./
Index the reference genome
Index the reference genome in the reference directory.
++HISAT2-build Alignment Arguments Used++:
- <reference_in> - Name of reference files
- <ht2_base> - Basename of index files to write
- -f - Reference file is a FASTA file
cd ref
hisat2-build -f ../ref/Mcap.genome_assembly.fa ./Mcap_ref
Alignment of clean reads to the reference genome
Align your reads to the index files. We will do this by writing a script we will call McapHISAT2.sh. This script will also take the output SAM files from our HISAT2 alignment and convert them into the sorted BAM files that are the necessary input for our assembly tool, StringTie. We do this by calling SAMtools in our script.
++HISAT2 Alignment Arguments Used++:
- -x - Basename of index files to read
- -1 - List of forward sequence files
- -2 - List of reverse sequence files
- -S - Name of output files
- -q - Input files are in FASTQ format
- -p - Number of processors
- --rf - Expected mate orientation for paired-end alignment (mate 1 reverse, mate 2 forward). In HISAT2, library strandedness is specified separately with --rna-strandness.
- --dta - Adds the XS tag to indicate the genomic strand that produced the RNA from which the read was sequenced. As noted by StringTie… “be sure to run HISAT2 with the --dta option for alignment, or your results will suffer.”
++SAMtools Arguments Used++:
- -@ - Number of threads
- -o - Output file
nano McapHISAT2.sh
#!/bin/bash
# Specify working directory
F=/home/echille/mcap2019/data/hisat2
# Align paired-end reads.
# array1 holds only the R1 files; the sed call in the loop derives the matching R2 name.
# Each SAM file contains both forward and reverse reads.
array1=($(ls $F/*_R1_001.clean.fastq.gz))
# For each sample: align with HISAT2, sort the SAM into a BAM file
# (StringTie takes a sorted file as input), then remove the SAM file, which is no longer needed.
for i in "${array1[@]}"; do
    hisat2 -p 8 --rf --dta -q -x Mcap_ref -1 "${i}" -2 "$(echo "${i}" | sed s/_R1/_R2/)" -S "${i}.sam"
    samtools sort -@ 8 -o "${i}.bam" "${i}.sam"
    echo "${i}_bam"
    rm "${i}.sam"
    echo "HISAT2 PE ${i}" $(date)
done
Now, make the file executable by the user (you) and run the script.
chmod u+x McapHISAT2.sh
./McapHISAT2.sh
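The script derives each R2 file name from its R1 name with sed, and names its outputs by appending to the full input name. The sketch below shows the same derivation with shell parameter expansion, which only touches the file-name suffix rather than the first `_R1` anywhere in the path. The shorter BAM name is purely hypothetical: the StringTie script later in this document expects the `*_R1_001.clean.fastq.gz.bam` names, so its file pattern would need to change if such a name were adopted.

```shell
# Sketch: derive the R2 name and a sample ID from an R1 path without calling sed.
r1="/home/echille/mcap2019/data/hisat2/119_R1_001.clean.fastq.gz"   # example path in this pipeline's layout

# Remove the R1 suffix from the end of the path, then append the R2 suffix
r2="${r1%_R1_001.clean.fastq.gz}_R2_001.clean.fastq.gz"

# Sample ID alone: strip the directory and the suffix
sample=$(basename "$r1" _R1_001.clean.fastq.gz)

# Hypothetical shorter output name (NOT the name the scripts in this document use)
short_bam="${sample}.bam"

echo "R2 file: $r2"
echo "sample:  $sample"
echo "BAM:     $short_bam"
```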
Now we’ve got some sorted BAM files that can be used in our assembly!!
Assemble aligned reads and quantify transcripts
StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts.
- Reference-guided assembly with novel transcript discovery
- Merge output GTF files and assess the assembly performance
- Compilation of GTF-files into gene and transcript count matrices
Reference-guided assembly with novel transcript discovery
First, create and enter the StringTie directory. Then create a symbolic link to our reference annotation file, and symbolically link our BAM files into a dedicated directory inside the stringtie directory. This is where our output GTF files will live too.
mkdir ../stringtie
cd ../stringtie
ln -s ../ref/Mcap.GFFannotation.gff ./
mkdir BAM
cd BAM
ln -s ../../hisat2/*.bam ./
cd ../
Create the StringTie reference-guided assembly script, McapStringTie-assembly.sh, inside of the StringTie program directory.
++StringTie Arguments Used++:
- -p - Specify number of processors
- --rf - Reads are stranded
- -e - Limit the estimation and output of transcripts to only those that match the reference (in this case, the annotation file given with -G)
- -G - Specify annotation file
- -o - Name of output file
cd stringtie
nano ./McapStringTie-assembly.sh
#!/bin/bash
# Specify working directory
F=/home/echille/mcap2019/data/stringtie
# StringTie reference-guided assembly
# The BAM file names carry R1 because of the naming convention in the previous script. However, these BAM files contain both forward and reverse reads.
array1=($(ls $F/BAM/*_R1_001.clean.fastq.gz.bam))
for i in "${array1[@]}"; do
    # Each output GTF is written next to its BAM file in $F/BAM
    ./stringtie -p 8 --rf -e -G "$F/Mcap.GFFannotation.gff" -o "${i}.gtf" "${i}"
    echo "StringTie-assembly-to-ref ${i}" $(date)
done
Now, make the file executable by the user and run the script.
chmod u+x McapStringTie-assembly.sh
./McapStringTie-assembly.sh
Assess the performance of the assembly
Gffcompare is a tool that can compare, merge, annotate, and estimate the accuracy of GFF/GTF files when compared with a reference annotation.
Using the StringTie merge mode, merge the assembly-generated GTF files to assess how well the predicted transcripts track to the reference annotation file. This step requires the TXT file mergelist.txt, which lists the names of all files to be merged. Make sure mergelist.txt is in the StringTie program directory.
++StringTie Arguments Used++:
- --merge - Distinct from the assembly usage mode used above, in the merge mode, StringTie takes as input a list of GTF/GFF files and merges/assembles these transcripts into a non-redundant set of transcripts.
- -p - Specify number of processors
- -G - Specify reference annotation file. With this option, StringTie assembles the transfrags from the input GTF files together with the reference transcripts
- -o - Name of output file
- mergelist.txt - File listing all filenames to be merged. Include full paths.
./stringtie --merge -p 8 -G ../Mcap.GFFannotation.gff -o ../stringtie_merged.gtf mergelist.txt
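The merge step needs mergelist.txt to contain one GTF path per line. One way to build such a file is to loop over the GTF files and write out each full path, as in this sketch. The temporary directory and the empty GTF files are hypothetical stand-ins for the directory that holds the assembly output.

```shell
# Sketch: write the full path of every GTF file into mergelist.txt, one per line.
gtf_dir=$(mktemp -d)   # hypothetical stand-in for the directory holding the GTF files
touch "$gtf_dir/119_R1_001.clean.fastq.gz.bam.gtf" \
      "$gtf_dir/120_R1_001.clean.fastq.gz.bam.gtf"

# mktemp -d returns an absolute path, so each entry written below is a full path
for gtf in "$gtf_dir"/*.gtf; do
    printf '%s\n' "$gtf"
done > "$gtf_dir/mergelist.txt"

n_listed=$(wc -l < "$gtf_dir/mergelist.txt" | tr -d ' ')
first_entry=$(head -n 1 "$gtf_dir/mergelist.txt")
echo "GTF files listed: $n_listed"
```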
Now we can use the program gffcompare to compare the merged GTF to our reference annotation.
++Gffcompare Arguments Used++:
- -r - Specify reference annotation file
- -G - Compare all the transcripts in our input file, stringtie_merged.gtf
- -o - Prefix of all output files
gffcompare -r ../Mcap.GFFannotation.gff -G -o ../merged ../stringtie_merged.gtf
Some of the output files you will see are…
- merged.stats
- merged.tracking
- merged.annotated.gtf
- merged.stringtie_merged.gtf.refmap
- merged.loci
- merged.stringtie_merged.gtf.tmap
Move all of the gffcompare output files to the output directory. We are most interested in the files merged.annotated.gtf and merged.stats. The file merged.annotated.gtf tells you how well the predicted transcripts track to the reference annotation file, and merged.stats shows the sensitivity and precision statistics and the total number of each feature type (genes, exons, transcripts). Then, from the local host, securely copy merged.stats to a local directory. Unfortunately, merged.annotated.gtf is too big to store locally, but we can view it remotely.
scp -P xxxx echille@kitt.uri.edu:<path_to_output>/merged.stats /Users/user/<path_to_local_directory>
Compilation of GTF-files into gene and transcript count matrices
The StringTie program includes a script, prepDE.py, that compiles your assembly files into gene and transcript count matrices. This script requires as input a list of sample names and their full file paths, sample_list.txt. This file will live in the StringTie program directory.
Go back into your stringtie directory (the one I should have named assembly). Run prepDE.py to merge the assembled files together into a DESeq2-friendly version.
++StringTie prepDE.py Arguments Used++:
- -i - Specify that input is a TXT file
- -g - Name of the output gene count file; the default name is gene_count_matrix.csv
- -t - Name of the output transcript count file; the default name is transcript_count_matrix.csv
./prepDE.py -g ../gene_count_matrix.csv -i ./sample_list.txt
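The sample_list.txt file passed with -i holds a sample name and the path to that sample's GTF file on each line. A minimal sketch of generating it from the GTF file names follows, assuming the naming convention produced by the scripts in this document; the temporary directory and empty files are hypothetical stand-ins for the real assembly output.

```shell
# Sketch: build sample_list.txt with "<sample_id> <path_to_gtf>" on each line.
list_dir=$(mktemp -d)   # hypothetical stand-in for the directory holding the GTF files
touch "$list_dir/119_R1_001.clean.fastq.gz.bam.gtf" \
      "$list_dir/120_R1_001.clean.fastq.gz.bam.gtf"

: > "$list_dir/sample_list.txt"   # start from an empty file
for gtf in "$list_dir"/*.gtf; do
    # Sample ID = file name without directory and without the long suffix
    sample_id=$(basename "$gtf" _R1_001.clean.fastq.gz.bam.gtf)
    printf '%s %s\n' "$sample_id" "$gtf" >> "$list_dir/sample_list.txt"
done

first_line=$(head -n 1 "$list_dir/sample_list.txt")
n_samples=$(wc -l < "$list_dir/sample_list.txt" | tr -d ' ')
cat "$list_dir/sample_list.txt"
```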
Finally, move your count matrices into the output directory and securely copy them to your local directory, from your local host.
cd ../
mv ./*.csv ../../output
scp -P xxxx echille@kitt.uri.edu:<path_to_output>/*.csv /Users/user/<path_to_local_directory>