The Walk of Genes

After creating my Walk of PI, based on this famous paper, I thought about creating a similar visualization, but this time using genes. This has been done before, using different algorithms. Here, I’ll create a “walk of genes” using D3.js and a simple JavaScript algorithm.

If you are unfamiliar with the “walk of PI” or are too lazy to click the link, this is the basic concept: in the middle of the blank canvas sits one pixel. This pixel moves according to the sequence of nucleotides of a given gene: if the nucleotide is A, it moves up. If the nucleotide is T, it moves down. G makes the pixel move right and, finally, C makes it move left. So, by reading the entire nucleotide sequence of a given gene, we create a path on the canvas. This is the “walk” of that gene.

I chose the genes for four famous human proteins: hemoglobin, collagen, myosin and albumin. To reduce the size of the walk, only the exons are analysed; that is, the nucleotide sequence provided here is not the actual gene, but the sequence that produces the functional mRNA. Use your mouse or trackpad to zoom and drag the path, and click “reset zoom” to, well, reset zoom. Hover over the path to check each nucleotide and its position. When the walk finishes, click “draw again” to restart the walk.

1. Hemoglobin, α1 chain

Hemoglobin is an intracellular protein with a quaternary structure, which in adults has two α chains and two β chains. The α chain gene, called HBA1, has 824 nucleotides (counting only the exons). This is its walk:

2. Collagen, type I

Collagen is the most abundant protein in the human body. Actually, there are several different proteins called “collagen”. This is the walk of Type I Collagen, whose gene (COL1A1) has 6728 nucleotides (counting only the exons).

3. Myosin II

Myosin is a motor protein. In its quaternary structure, myosin is composed of a pair of myosin heavy chains (MYH) and two pairs of nonidentical light chains. This is the walk of myosin’s heavy chain, whose gene (MYH1) has 1881 nucleotides (counting only the exons).

4. Serum Albumin

Serum albumin is the main plasma protein. The serum albumin gene, called ALB, has 2264 nucleotides (counting only the exons). This is its walk:


Additional information

This is how I made it: just like in the "Walk of PI", I first loaded a string with the gene sequence and then created an array with all the letters of the gene:

var gene = data.geneName;
var digitsofgenes = gene.split("");

Then I created a rule describing how the pixels should move: A is up, G is right, T is down and C is left. This is pretty much arbitrary, but I kept the Watson-Crick pairs on opposite sides. Given x1, x2, y1 and y2 for the lines, this is the rule:

if (digitsofgenes[i] === "A") {
	y2 = y1 - 1; x2 = x1; // up
} else if (digitsofgenes[i] === "T") {
	y2 = y1 + 1; x2 = x1; // down
} else if (digitsofgenes[i] === "G") {
	x2 = x1 + 1; y2 = y1; // right
} else if (digitsofgenes[i] === "C") {
	x2 = x1 - 1; y2 = y1; // left
} else {
	return; // any other character ends the step
}

After each line is appended, its final coordinates become the initial coordinates of the next line:

x1 = x2;
y1 = y2;
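
Putting the pieces above together, the walk can be sketched as one small function that turns a sequence into a list of line segments. This is only an illustration, not the code behind the visualizations: the function name geneWalk, the shape of the returned objects and the choice to skip unknown characters (instead of returning, as in the rule above) are all assumptions of mine. It uses no D3, so it runs on its own.

```javascript
// Hypothetical helper: the name and return shape are assumptions for illustration.
function geneWalk(sequence) {
  // Offset for each nucleotide; y grows downward, as on an SVG canvas.
  const steps = {
    A: [0, -1], // up
    T: [0, 1],  // down
    G: [1, 0],  // right
    C: [-1, 0], // left
  };

  let x1 = 0;
  let y1 = 0;
  const segments = [];

  for (const base of sequence.toUpperCase()) {
    const step = steps[base];
    if (!step) continue; // assumption: skip anything that is not A, T, G or C

    const x2 = x1 + step[0];
    const y2 = y1 + step[1];
    segments.push({ base, x1, y1, x2, y2 });

    // The end of this segment is the start of the next one.
    x1 = x2;
    y1 = y2;
  }

  return segments;
}

// "ATGC" goes up, down, right and left, ending where it started.
const demoWalk = geneWalk("ATGC");
console.log(demoWalk);
```

Each object in the returned array holds the four coordinates of one line, so it could be handed to something like svg.append("line") in D3, one segment at a time.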

Also, each “move” (that is, each new nucleotide) changes the colour of the line according to an RGB rule: red is up, blue is down-left and green is down-right. Using the variables “r”, “g” and “b”:

var color = d3.rgb(r, g, b);

My next project is creating a version of this walk in which the user is able to type (or paste) any given gene sequence.