Update databases in a load balancing cluster using a metaphor

28 JUL 09

I previously mentioned that I am responsible for a database system. In this post I want to write about the problems I encountered in keeping those databases updated, and how a metaphor helped me design and implement the update procedure.

The architecture consists of three physical machines. One (Host A) hosts three virtual machines (VMs): one holding the load balancer (Apache plus Pound), one housing an operational database used for communication between components, user profiles, etc., and one dedicated to statistical evaluation, which is irrelevant for this consideration.

The two other physical machines (Host B and Host C) each host three VMs holding the database and the web application. A user request is received by the load balancer and distributed to one of those six satellites by Pound's smart algorithm.

[Figure: components of the load balancing cluster]

The idea

My idea was to add one additional VM (Reference Machine) - an exact clone of a satellite. This VM would not be part of the load balancing and thus would not execute any user queries. Its only purpose would be to update itself as soon as new data is available. In that case it would perform the necessary updates and indexing of data. When finished, it would check that it still works properly by running test queries and a full Selenium test suite. If any error happened, the machine would report it to the task monitor and the update procedure would be stopped until the system administrator fixes the problem.

In case everything went fine, it would tell the satellites to simply copy any changed files from its file structure. Since the indexing of full texts takes quite some time, I save a lot by simply copying as opposed to running the same update procedures on each satellite.

The problems

A couple of problems need to be sorted out, however. Firstly, while a satellite copies database or index files it is no longer able to serve queries; well, it would try, but eventually crash. An easy way to solve that problem is to stop Apache a minute before copying. That way the load balancer recognizes that this machine no longer responds and stops sending requests to it.

Secondly, if every machine started copying at the same time, the database would be down for the time needed to copy. One would instantly think of chaining the copy procedures so that one satellite only starts copying after the preceding one has finished. This approach would result in five machines answering user requests at any time. Granted, this would solve the inaccessibility of the service, but at the same time we would introduce inconsistency in search results, because a user would get different results depending on which satellite she hits: an already updated one or one that is still waiting for its turn to start copying.

Inconsistency is one of the biggest sins in our business - no, wait - in anyone's business. No matter where, we have to avoid it like the plague (I had to sacrifice an entire paragraph for this irrefutable statement *g*).

The solution

To solve this problem I divided the satellites into two groups. When the big copying begins, one group of satellites takes itself out of the load balancing by shutting down its web servers. The other group still happily serves queries for users hitting the system. After the first group has finished copying, all its members perform self tests to guarantee full functionality; if all pass, they start their web servers. Since the second group likewise stops Apache prior to copying, the load balancer would - at one moment - send requests only to satellites already updated.

Admittedly, there is still a small overlap between the moment the first group has finished and started Apache and the moment the second group shuts down its web servers. Indeed I have roughly 2 minutes where both groups - the updated ones and the "old" ones - answer queries, but I consider this short phase of inconsistency inevitable if I want to guarantee zero downtime.

Another drawback with this approach: during the time those update procedures are running the system's performance is halved. For that reason I checked my web logs to find a proper time window with low database usage.
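Finding such a window can be sketched in a few lines of PHP. This is only an illustration, not the script I actually used: it assumes Apache's common/combined log format with a timestamp like [28/Jul/2009:03:15:42 +0200], and the function names requestsPerHour and quietestWindow are made up for this example.

```php
<?php
// Count requests per hour of day from Apache access-log lines.
// Assumption: common/combined log format with a [dd/Mon/yyyy:HH:mm:ss zone] timestamp.
function requestsPerHour(array $logLines): array
{
    $counts = array_fill(0, 24, 0);
    foreach ($logLines as $line) {
        // Capture the hour, e.g. "03" from [28/Jul/2009:03:15:42 +0200]
        if (preg_match('/\[\d{2}\/[A-Za-z]{3}\/\d{4}:(\d{2}):\d{2}:\d{2} /', $line, $m)) {
            $counts[(int) $m[1]]++;
        }
    }
    return $counts;
}

// Return the start hour of the quietest window of $length consecutive hours.
// The window may wrap around midnight; ties go to the earliest start hour.
function quietestWindow(array $counts, int $length): int
{
    $bestStart = 0;
    $bestTotal = PHP_INT_MAX;
    for ($start = 0; $start < 24; $start++) {
        $total = 0;
        for ($i = 0; $i < $length; $i++) {
            $total += $counts[($start + $i) % 24];
        }
        if ($total < $bestTotal) {
            $bestTotal = $total;
            $bestStart = $start;
        }
    }
    return $bestStart;
}
```

Feeding it the lines of an access log (for instance via file() on a log path of your choice) and asking for a two-hour window gives a candidate start hour for the update run. Lines that do not match the timestamp pattern are simply ignored.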

A positive side effect, however: a group starts its web servers only if all of its members pass their tests; otherwise the update procedure is halted. This keeps e.g. defective data updates or file corruption during copying from propagating to all satellites. In such a case my system runs at half power until I fix the problem, yet it is still healthy.

Now this was the idea, but how to actually implement it? First, I was looking for an appropriate metaphor to help grasp the concept and its dependencies, especially for the lucky or poor guy who takes over that system.
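The group logic can be illustrated with a small simulation. It is a sketch under assumptions, not my production code: the Satellite class and its methods are hypothetical stand-ins for stopping and starting Apache, copying files and running the self tests, and the roughly two-minute overlap between the groups is not modelled.

```php
<?php
// Hypothetical stand-in for a satellite VM. Real code would stop/start
// Apache and copy files; here we only track state.
class Satellite
{
    public $name;
    public $serving = true;  // true while the web server answers the load balancer
    public $version = 1;     // data version currently on disk
    private $healthy;        // simulated outcome of the self test

    public function __construct(string $name, bool $healthy = true)
    {
        $this->name = $name;
        $this->healthy = $healthy;
    }

    public function stopWebServer(): void { $this->serving = false; }
    public function startWebServer(): void { $this->serving = true; }
    public function copyFromReference(int $version): void { $this->version = $version; }
    public function selfTest(): bool { return $this->healthy; }
}

// Update one group; bring it back online only if every member passes its self test.
function updateGroup(array $group, int $newVersion): bool
{
    foreach ($group as $s) { $s->stopWebServer(); }
    foreach ($group as $s) { $s->copyFromReference($newVersion); }
    foreach ($group as $s) {
        if (!$s->selfTest()) {
            return false; // the whole group stays offline
        }
    }
    foreach ($group as $s) { $s->startWebServer(); }
    return true;
}

// Update the groups one after the other; halt at the first failing group
// so that later groups keep serving the old data.
function rollOut(array $groups, int $newVersion): bool
{
    foreach ($groups as $group) {
        if (!updateGroup($group, $newVersion)) {
            return false;
        }
    }
    return true;
}
```

If one satellite of the first group fails its self test, rollOut() returns false, the first group stays offline, and the second group keeps serving the old data untouched: exactly the "half power but healthy" state described above.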

The metaphor

It took me some time but I finally came up with a teacher-scholar pattern.

In a convent school boys and girls are taught separately by a teacher. This teacher prepares himself to hold a lesson; meanwhile the children are chatting in the schoolyard. When he's finished he calls the boys into the classroom and equips them with knowledge. After the lesson an examination is conducted, following which the boys leave the classroom, return to the schoolyard and tell the girls they are next. The girls get into the classroom and the teacher holds the very same lesson for the girls and also finishes with an examination.

This metaphor helped me to visualize the big picture of the procedure. In particular I could map certain tasks performed by my components to actions done by protagonists of the convent school. For example, "chatting" boys or girls in the schoolyard would - in my model - correspond to satellites serving queries sent by the load balancer. When the satellites shut down their web servers to prepare for copying, this translates to "stop chatting".

Here is a list of some other analogies I came up with:

Metaphor | Actual activity
Teacher | Reference machine
Boys and girls | The two groups of satellites
Subject matter | New data to be updated, e.g. new records, new full text documents
Preparation of a lesson | The database of the reference machine is updated with new records and full text documents. Indexes are created.
Lesson | Process of copying database files from the reference machine to the satellites
check the lectureship | Check if the machine is registered as reference machine
prepare for a lesson | Set a flag (in the operational database) signalling that the reference machine is in the course of performing updates
check new subject matter | Check if any new data (e.g. new records, full texts) was made available by inputting staff
prepare the subject matter | Process new data (e.g. update the database, create indexes, move full text documents)
proof read the preparation | Perform tests on the updated system
file the preparation | Back up the database files and any modified application files such as stats on new updates
call scholars | Set a flag to signal the backend machines that the database update has finished and they can copy new files
finish the preparation | Unset the flag indicating that the machine is currently doing updates
write a report | The reference machine updates the task monitor with the log file and status of the update procedure
get in the classroom | Increment the number of boys or girls that are ready to copy
last one to enter | Check if the machine was the last one of its group to be ready for copying
leave the classroom | Decrement the number of boys or girls that are still copying
have a last cigarette | Wait for five minutes
get a coffee | Wait for ten minutes

One flaw of this metaphor is that the boys and girls are addicted to nicotine and caffeine. But at least I made sure that all of them are of legal age :)
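The "get in the classroom", "last one to enter" and "leave the classroom" entries boil down to a shared counter per group. Here is a minimal sketch of that idea, assuming a single counter per group; the Classroom class is made up for illustration and keeps the counter in memory, whereas in the real system it would live in the operational database and be updated atomically.

```php
<?php
// In-memory stand-in for a per-group counter (assumption: the real counter
// is stored in the operational database and updated atomically).
class Classroom
{
    private $groupSize;
    private $inside = 0; // scholars currently in the classroom

    public function __construct(int $groupSize)
    {
        $this->groupSize = $groupSize;
    }

    // "get in the classroom": increment the counter.
    // Returns true for the "last one to enter", i.e. when the whole group is ready.
    public function getIn(): bool
    {
        $this->inside++;
        return $this->inside === $this->groupSize;
    }

    // "leave the classroom": decrement the counter.
    // Returns true once nobody of the group is still copying.
    public function leave(): bool
    {
        $this->inside--;
        return $this->inside === 0;
    }
}
```

With a group of three, the third call to getIn() returns true, which is the cue that the whole group is out of the load balancing and copying can begin; the third call to leave() tells the last scholar that the group has finished.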

Below you find the code for the controller of the reference machine (teacher) that manages the preparation of the lesson. As you can see those analogies translate well into methods and the controller tells a nice little story:

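function prepareLesson() {

	// Each step is named after its analogy and handed to c() by name.
	$t = $this;
	$t->c( 'checkLectureship' );           // is this machine the reference machine?
	$t->c( 'checkIfStillHoldingLesson' );
	$t->c( 'prepareForLesson' );           // set the "update in progress" flag
	foreach( $t->lsSubject as $t->scurSubject ) {
		$t->c( 'checkNewSubjectMatter' );  // any new data for this subject?
		if( $t->bNewMaterial ) {
			$t->c( 'prepareSubjectMatter' );  // update database, create indexes
			$t->c( 'proofReadPreparation' );  // run tests on the updated system
		}
	}
	if( $t->bNewLessonNeeded ) {
		$t->c( 'filePreparation' );        // back up database and modified files
		$t->c( 'callScholars' );           // signal satellites that they can copy
		$t->c( 'handOutBooks' );
		$t->c( 'holdLesson' );
	}
	$t->c( 'finishPreparation' );          // unset the "update in progress" flag
	$t->c( 'writeReport' );                // send log and status to the task monitor
	$t->c( 'closeNotebook' );
}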
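The helper c() is not shown in this post. One plausible reading, and it is only an assumption for illustration, is a small dispatcher that looks up the step by name, logs it and calls it. The StoryController and DemoTeacher classes below are hypothetical, and the three empty step methods merely stand in for the real ones.

```php
<?php
// Hypothetical base class: c() runs a step (method) by name and logs it.
abstract class StoryController
{
    public $log = [];

    protected function c(string $step): void
    {
        if (!method_exists($this, $step)) {
            throw new BadMethodCallException("Unknown step: $step");
        }
        $this->log[] = $step; // record the story as it is told
        $this->$step();       // dynamic call of the method named in $step
    }
}

// Tiny demo controller with three placeholder steps.
class DemoTeacher extends StoryController
{
    public function prepareLesson(): void
    {
        $this->c('checkLectureship');
        $this->c('prepareForLesson');
        $this->c('writeReport');
    }

    protected function checkLectureship(): void {}
    protected function prepareForLesson(): void {}
    protected function writeReport(): void {}
}
```

Routing every step through one method gives a single place for logging, timing or error handling, and the log then reads like the story the controller tells.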

It is not easy to find a metaphor that properly draws analogies with your system, but it pays off as it helps to visualize complex procedures that deal with a number of components and/or are time-dependent. But above all it is just great fun! Try it out! However, be warned: it is one of those brain-teasers that keep you from falling asleep.

Because this post is already getting too long, I wrote a follow-up post to explain the implementation in more detail...
