I have a conference coming up in two weeks, the American Society of Human Genetics (ASHG), where I will be presenting a technical poster in the bioinformatics section. The code behind the graphs will be on my github here. For this post I’d like to focus on my ggplot code rather than the science, so it can help a broader audience of data scientists and biologists. You can download the whole poster here: ASHG17_Malley.
This is the main figure. Some jargon: SV = structural variant; EDC = epidermal differentiation complex on chromosome 1; CNV = copy number variation; bp = base pair; Platypus = the Platypus variant caller program.
This figure went through several revisions and was originally split with the first box (EDC) larger and separate from the other three. I put all four regions together with the same Y axis scale but different X axis locations, since they are different genomic locations. The regions are of the same length as noted, though. Each point is a genetic variant, and each colored stripe below is a kind of structural variant found at that position. You can see that where there are denser SVs in the EDC (dark green and blue) there is a dip in mean reads.
R packages used
- Illumina HiSeq 30X whole-genome sequencing of 799 European American individuals, which I cannot share publically unfortunately 😦
- Platypus variant call file quality data using the above
- Database of Genomic Variants data tracks pulled from the UCSC Genome Browser (the Table Browser)
How I made it
Each of the scatterplots is pretty straightforward; I set point size, opacity, margins, theme, limits, and text annotations. The
splitInfoCol() function is a custom one I wrote you can find on the full github R code.
il4r <- fread("ADRN_il4ra_799_platypus_maxvar8_minreads5.QC.vcf", header=T, skip=51L, stringsAsFactors = F)
il4r <- splitInfoCol(il4r)
il4r.plot <- ggplot(data=il4r) + geom_point(aes(x=as.numeric(il4r$POS), y=as.numeric(il4r$TC)), size=1, alpha=0.3 ) + theme_bw()+
annotate(size=4, "text", hjust=1, vjust=1, label="Variants: 17428\nMean reads: 29990", x=28185184, y=50000, color="black")+
annotate("text", size=4, vjust=1, hjust=0, label="chr16:26516211-28185184", x=26516211, y=50000)+
ylim(0, 50000)+ xlim(26516211, 28185184)+
theme(text = element_text(size=15), legend.position="none", axis.title.x=element_blank(), axis.text.x = element_text(margin=margin(0,0,0,0,"pt"), hjust=0), axis.title.y=element_blank())
There are 4 of these similar code chunks. Overplotting is where data points are so dense they show up on top of each other and tend to make a dark mess of points. I tried to lessen this a bit by setting opacity (aka alpha) and with a smaller-than-default point size.
For the stripes, I initially had 2 bare-bones black and white tracks of information for segmental duplications and CpG islands. I realized it wasn’t showing the complexity of kinds of SVs for the region, and that maybe CpG is irrelevant for what I am trying to convey with this figure–which is in part that the EDC has more SVs than the other three regions. I found an incredible single database, the DGV as mentioned, that has 12 kinds of SVs and their start, stop positions. Then I made 4 ggplot objects for just these data:
structvar.il4r <- fread("IL4R-CNVindels-hg19Tables.txt", header=T, stringsAsFactors = F) # Obtained from UCSC table browser schema "DGV Struct Var / hg19.dgvMerged"
structvar.il4r.plot <- ggplot(data=structvar.il4r) + geom_segment(aes(x=structvar.il4r[,chromStart], xend=structvar.il4r[,chromEnd], y=0, yend=0, color=factor(structvar.il4r[,varType])), size=50, alpha=0.15)+
theme(text = element_text(size=15), axis.ticks.y=element_blank(), axis.text.y=element_blank(), axis.line.y = element_blank(), panel.grid.major.y=element_blank(), panel.grid.minor.y = element_blank(), axis.title.y = element_blank(), axis.title.x = element_blank(), legend.position="none", plot.margin = unit(c(-1.5,0,0.5,0), "lines"))+
coord_cartesian(xlim=c(26516211, 28185184)) + ylim(0,1)+ scale_color_viridis(discrete = T)
This piece is a bit intense, so let me break it down. What is plotted are
geom_segment()s with opacity and where color = SV type.
Plot.margin() has a negative 1.5 for top, which magically pushes these striped plots vertically to cover the X axes of the scatterplots, once it is all added together.
Coord_cartesian() was a new function to me, as I did not want to exclude data on the left and right X axis edges. It zooms into the data, rather than chops off extremes as the ggplot standalone
ylim() functions do. I do notice there is some whitespace for the SV plots, past the end of the scatterplot data, but I don’t mind enough to try to change it further.
The discrete color palette is the default viridis scale from the viridis package, which is FANTASTIC because it is color-blind friendly and perceptually consistent across the color scale. That means each color on the scale has equal importance, visually. It is so important for honestly conveying data with color. Further discussion is found on CRAN. I needed the extremes to not be too light, and the midtones to have noticeable shifts in hue since there are 12 separate SV categories.
My one criticism with what I did here is that 12 is maybe too many discrete categories to interpret, especially with opacity. I may need to explain this to poster visitors, so, I would say the figure is not a perfect communication. The bottom line is still interpretable — to compare SV density across regions.
I attempted to dynamically include the legend, but it would not play nice with
theme_bw(). It made horrifyingly large color squares and teeny tiny text:
So I simply graphed the legend alone and edited it manually in PowerPoint.
Graphing the pieces together
I tried several packages for multi-plotting: gridExtra, the ggbio
tracks() function, and cowplot. Cowplot won. It was able to graph all pieces together without extra or irregularly placed margins.
cowplot::plot_grid(edc.plot, structvar.plot, il4r.plot, structvar.il4r.plot, ifng.plot, structvar.ifng.plot, stat6.plot, structvar.stat6.plot, rel_heights=c(4,1,4,1,4,1,4,1), ncol = 1, align = "v")
That is 4 scatterplots and 4 SV striped plots, with rel_heights being a vector of relative plot heights. They are vertically aligned in one column. I tweaked the rel heights till I thought there was enough plot-space for the SVs to be legible but not overpowering. Science aside, my favorite part of the figure is this section of IFNG with an unusual light green color not seen in the other regions. It is an area with many novel sequence insertions relative to the hg19 reference genome.
I’m proud of the result! Ok, that is all I have to say. Thanks for reading!