I will comment on one of the steps that researchers from the gEuvadis project performed on their RNA-seq data. They sequenced about 1000 HapMap individuals and did a lot of analyses trying to decipher transcriptome variability among populations. I think they did a great job with the data, keeping in mind that samples were sequenced in different countries and they had to make sure all of them were comparable. I have to say that the methods section of the paper is quite complete, and they also have a wiki that works like a report of the project. This is a good start, but just let me say that they should publish the code as well (I could not find it).
For those of you who do not play with RNA-seq data: one of the most problematic steps in the analysis is the correction for technical biases, or confounding factors. We are at the beginning of the story nowadays, and only a few tools try to do this. One of them is PEER, which "removes" factors that generate "noise" in the dataset.
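To give an intuition of what "removing factors" means: PEER learns hidden factors from the expression matrix itself and returns the residuals after regressing them out. Conceptually this resembles regressing out the top principal components (PEER additionally handles known covariates and uncertainty, which this ignores). A minimal numpy sketch of that idea, with toy data; this is not the PEER API, and all names here are mine:

```python
import numpy as np

def remove_hidden_factors(expr, n_factors):
    """Regress the top principal components out of a samples x genes
    expression matrix -- a rough stand-in for removing PEER's learned
    hidden factors."""
    centered = expr - expr.mean(axis=0)              # center each gene
    # SVD: columns of u are orthonormal sample-level "factors"
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    factors = u[:, :n_factors]                       # top hidden factors
    fitted = factors @ (factors.T @ centered)        # projection onto factors
    return centered - fitted                         # residual expression

# toy example: 10 samples x 5 genes driven by one strong batch-like factor
rng = np.random.default_rng(0)
batch = rng.normal(size=(10, 1))
expr = batch @ rng.normal(size=(1, 5)) + 0.1 * rng.normal(size=(10, 5))
residuals = remove_hidden_factors(expr, n_factors=1)
```

On this toy matrix the residual variance drops sharply once the single dominant factor is removed, which is exactly the behavior you have to watch: if a factor captures real biological signal instead of noise, regressing it out flattens your data.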
gEuvadis used PEER to "normalize" the RNA-seq data, removing batch effects, GC content effects… and then performed cis-QTL analysis. As always when you apply a novel method, you should say why, and normally the answer is "because it is better". Well, let's define "better". They said that "better" is when you detect more cis-QTLs. I just have to say that I disagree. Normally "better" is when you know some ground truth, compare against it, and get something close to that truth. I know that many times this is impossible, but simply saying "the correction worked" because you got more results… for me it is not enough.
I downloaded the TSI BAM files, one of the gEuvadis datasets, counted the reads for each gene using the HTSeq Python script, and transformed to log2 with DESeq2. After that I applied PEER. (See the main code on GitHub; I will add the DESeq2 code as well.) I just wanted to know what happens to the data, because I did not find any figure or sentence telling me how the data changed after PEER correction. I ran it with default parameters, since there was no more information in the paper or the wiki. To my surprise, I saw how a distribution of values ranging from 0 to 14 was converted into a distribution of values from -0.01 to 0.01!! (figure 1). And a single factor was responsible for flattening the whole distribution (figure 2). I understand now why, in their methods, they said that they added the mean gene expression to the residuals after PEER correction (but this is not a normal step in the PEER pipeline according to its paper).
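The two transformations discussed above can be sketched in a few lines. This is only an illustration with toy numbers, not the paper's exact pipeline: the counts would come from htseq-count, the log2 transform here is the simple log2(count + 1) rather than whatever DESeq2 normalization they applied, and `residuals` stands in for PEER's output:

```python
import numpy as np

# samples x genes matrix of raw read counts (toy stand-in for
# htseq-count output on the TSI BAM files)
counts = np.array([[120, 0, 33],
                   [ 98, 4, 40],
                   [150, 1, 25]], dtype=float)

# simple log2 transform; the paper's DESeq2 normalization may differ
log_expr = np.log2(counts + 1.0)

# Suppose `residuals` is what PEER returns after removing its factors
# (here just the mean-centered values as a placeholder). The gEuvadis
# methods add the per-gene mean expression back afterwards -- a step
# that is not part of the standard PEER pipeline:
gene_means = log_expr.mean(axis=0)
residuals = log_expr - gene_means        # placeholder for PEER's output
corrected = residuals + gene_means       # mean added back, per gene
```

The add-back step restores the original scale of each gene, which is presumably why they needed it once PEER had collapsed the distribution toward zero.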
I don't know, but to me this is very weird, and I think they should look more into it. Because even assuming that the correction is right, adding the mean back just because the distribution changed so much is not at all justified. At the very least it would be good to explain that in the methods section, since it is a Nature paper, and as you all know, Nature is a journal where only the best papers are published…
Although there may be some differences in my pipeline, because I could not guess exactly what input they used, I think the results would be pretty similar. I will try directly with the count data to compare results in my next post.
For now, just be very careful if you decide to integrate this into your pipeline. The gEuvadis researchers are aware of this, and they recommend studying your data first to see whether you can apply the same strategy. Following that advice, I think they could have done more to check whether this strategy is correct for their own data as well.