Examining DeepVariant
To better understand what DeepVariant is learning from its training data, we used a set of simple clustering and visualization methods to summarize the information captured in the model's high-dimensional internal representations. In partnership with collaborators on the Google Genomics team, we first loaded examples into the Integrative Genomics Viewer (IGV), a widely used tool for inspecting genomes and sequencing data. Then, we applied Uniform Manifold Approximation and Projection (UMAP) to the embeddings of the model's mixed5 max-pooling layer, which sits roughly in the middle of the network and contains a mix of low- and high-level features. This projection makes any emerging structure easy to inspect visually. We used different colors to represent known sequencing attributes of the input data (e.g., low-quality sequence reads and regions that are hard to map uniquely in the genome), as well as a combined attribute formed from combinations of the basic attributes' values.
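As a rough illustration of this step, the sketch below projects intermediate-layer embeddings with UMAP and colors the result by one attribute. The arrays `embeddings` and `is_low_quality` are stand-ins for the real mixed5 activations and per-example attribute labels, which are not shown here.

```python
# Minimal sketch: project intermediate-layer embeddings with UMAP and color
# the 2D result by a known attribute of each example. `embeddings` and
# `is_low_quality` are placeholders for real activations and labels.
import numpy as np
import umap
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 512))    # stand-in for mixed5 activations
is_low_quality = rng.random(1000) < 0.2      # stand-in attribute labels

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0)
coords = reducer.fit_transform(embeddings)   # (n_examples, 2)

plt.scatter(coords[:, 0], coords[:, 1], c=is_low_quality, cmap="coolwarm", s=4)
plt.title("UMAP of mixed5 embeddings, colored by a sequencing attribute")
plt.show()
```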
The structures that emerged reveal that some of the attributes' values map close to each other, naturally forming clusters. We observed that these "natural clusters" form at different levels across model layers, and are at times "forgotten" as the network processes the input further. This suggests that different types of information about the input DNA reads matter at different depths of the network.
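One hypothetical way to make this layer-by-layer observation concrete (our own illustration, not part of the original analysis) is to score how well a known attribute separates the embeddings at each depth, e.g., with a silhouette score; `embeddings_by_layer` and `labels` below are placeholders.

```python
# Hypothetical check: a higher silhouette score for the attribute labels at a
# given layer suggests that layer still "remembers" the attribute.
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)        # stand-in attribute labels
embeddings_by_layer = {                      # placeholder per-layer activations
    name: rng.normal(size=(500, 128)) for name in ["mixed3", "mixed5", "mixed7"]
}

for name, emb in embeddings_by_layer.items():
    score = silhouette_score(emb, labels)
    print(f"{name}: silhouette w.r.t. attribute = {score:.3f}")
```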
Based on this first look, we then used additional clustering methods in the hope of "discovering" previously unknown attributes (clusters). We began by applying k-means clustering to find 10 clusters. K-means is a simple clustering algorithm that groups data points by proximity in vector space, without using labels that might indicate similarity. This produced visible separation between the major clusters, some of which were much more populous than others. To gain control over the sizes of the resulting clusters, we then clustered hierarchically by running k-means multiple times: first a 3-cluster k-means, then a second round of k-means within each of the three resulting clusters, with the number of second-round clusters chosen according to the shape and size of each first-round cluster. A sketch of this two-round procedure follows.
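In the sketch below, `X` stands in for the real embeddings, and the size-based heuristic for choosing the second-round cluster count is our own illustration, since the exact rule used is not specified here.

```python
# Sketch of the two-round clustering described above: a first 3-cluster
# k-means, then a second k-means within each first-round cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 64))              # stand-in embeddings

first = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

final_labels = np.empty(len(X), dtype=int)
next_id = 0
for c in range(3):
    idx = np.where(first.labels_ == c)[0]
    k2 = max(2, len(idx) // 300)             # illustrative heuristic: larger
                                             # clusters get more subclusters
    sub = KMeans(n_clusters=k2, n_init=10, random_state=0).fit(X[idx])
    final_labels[idx] = sub.labels_ + next_id
    next_id += k2

print(f"{next_id} clusters total")
```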