
Monday, May 18, 2009

Build Farm Part 3: The core query

Returning to the topic of the Build Farm, recall that the core mission of the Build Farm is to determine the best job that a machine can run. We managed to distill that mission into a single query; this query is executed when a machine asks for a job to do, and the point of the query is to fetch the "best" available job. There may be no jobs currently available to run, or there may be several that the machine could run, and we want to find the "best".

Here's the query:


select wq.id, p.name as project_name, p.depot_root, p.label_name,
       i.sync_at_start, wq.label_override,
       i.operation, wq.item_id, i.max_duration,
       wq.detail, i.status_token, i.working_dir, wq.group_inst_id,
       i.failure_token, i.min_reqd_tests
from apt_work_queue wq join apt_item_desc i on wq.item_id = i.id
     left outer join apt_projects p on wq.project_id = p.id
where wq.started_time is null
and wq.item_id not in ( select distinct item_id from apt_item_cap_map
    where capability_id not in
        (select capability_id
         from apt_machine_desc, apt_machine_cap_map
         where apt_machine_desc.id = apt_machine_cap_map.machine_id
         and apt_machine_desc.name = 'MACHINE-NAME' ))
and wq.project_id not in ( select distinct project_id from apt_project_cap_map
    where capability_id not in
        (select capability_id
         from apt_machine_desc, apt_machine_cap_map
         where apt_machine_desc.id = apt_machine_cap_map.machine_id
         and apt_machine_desc.name = 'MACHINE-NAME' ))
and (wq.machine_id is null or
     wq.machine_id = (select id from apt_machine_desc
                      where name = 'MACHINE-NAME'))
order by wq.priority desc


Wow, that's a mouthful! Let's try to break it down into smaller parts.

Firstly, for understanding the query, it doesn't really matter what particular columns are retrieved, so let's just collapse the columns together, and call them "STUFF":


select STUFF
from apt_work_queue wq join apt_item_desc i on wq.item_id = i.id
     left outer join apt_projects p on wq.project_id = p.id
where wq.started_time is null
and wq.item_id not in ( select distinct item_id from apt_item_cap_map
    where capability_id not in
        (select capability_id
         from apt_machine_desc, apt_machine_cap_map
         where apt_machine_desc.id = apt_machine_cap_map.machine_id
         and apt_machine_desc.name = 'MACHINE-NAME' ))
and wq.project_id not in ( select distinct project_id from apt_project_cap_map
    where capability_id not in
        (select capability_id
         from apt_machine_desc, apt_machine_cap_map
         where apt_machine_desc.id = apt_machine_cap_map.machine_id
         and apt_machine_desc.name = 'MACHINE-NAME' ))
and (wq.machine_id is null or
     wq.machine_id = (select id from apt_machine_desc
                      where name = 'MACHINE-NAME'))
order by wq.priority desc


Secondly, the condition wq.started_time is null means "job hasn't started yet", and the condition:

and (wq.machine_id is null or
wq.machine_id=(select id from apt_machine_desc
where name='MACHINE-NAME'))


says "the job either doesn't care what machine it runs on, or it has specifically requested this machine". Let's abbreviate that to "and ok-to-run-on-this-machine".

Thirdly, there's a sub-query which occurs twice in this larger query. This sub-query says: "get this machine's capabilities":


select capability_id from apt_machine_desc, apt_machine_cap_map
where apt_machine_desc.id = apt_machine_cap_map.machine_id and
apt_machine_desc.name = 'MACHINE-NAME'
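
To make this concrete, here's a hypothetical slice of the two tables the sub-query joins. Only the table and column names come from the query above; the machine name, ids, and capability labels are invented for illustration:

-- invented sample rows, for illustration only
insert into apt_machine_desc (id, name) values (7, 'build-win-03');
insert into apt_machine_cap_map (machine_id, capability_id) values (7, 12);  -- OS=Windows
insert into apt_machine_cap_map (machine_id, capability_id) values (7, 15);  -- Java=JDK 1.5
insert into apt_machine_cap_map (machine_id, capability_id) values (7, 21);  -- DBMS=Oracle 10g
-- with that data, the sub-query (machine name filled in) returns 12, 15, 21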


Each capability_id for a particular machine describes some bit of its configuration in a way which is important to the Build Farm scheduler. So, for example, one of our build machines might have a capability for "OS=Windows", a capability for "Java=JDK 1.5", and a capability for "DBMS=Oracle 10g", meaning that this machine is running Windows and has JDK 1.5 and Oracle 10g installed (and so can run jobs which require those resources). So we can replace this sub-query with "this machine's capabilities", and the query now looks like this:



select STUFF
from apt_work_queue wq join apt_item_desc i on wq.item_id = i.id
left outer join apt_projects p on wq.project_id = p.id
where job-hasn't-started-yet
and wq.item_id not in ( select distinct item_id from apt_item_cap_map
where capability_id not in (this-machine's-capabilities))
and wq.project_id not in ( select distinct project_id from apt_project_cap_map
where capability_id not in (this-machine's-capabilities))
and ok-to-run-on-this-machine
order by wq.priority desc


That's good, the query is starting to get a bit easier to read. Let's keep going.

The first remaining complex condition in the WHERE clause:


and wq.item_id not in ( select distinct item_id from apt_item_cap_map
where capability_id not in (this-machine's-capabilities))


means "job is not one that requires a capability this machine doesn't have", and the other, similar, condition on the job's project means "job's project is not one that requires a capability this machine doesn't have."

So now we have:


select STUFF
from apt_work_queue wq join apt_item_desc i on wq.item_id = i.id
left outer join apt_projects p on wq.project_id = p.id
where job-hasn't-started-yet
and job-doesn't-require-a-capability-this-machine-doesn't-have
and job's-project-doesn't-require-a-capability-this-machine-doesn't-have
and ok-to-run-on-this-machine
order by wq.priority desc


That's much better. Breaking down the "NOT IN ... NOT IN ..." double negatives into English hopefully makes them easier to understand, but they are still double negatives, unfortunately, which means you have to read the query forwards and backwards and let it bounce around in your head for a while.
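
Incidentally, if the double negative really grates, the item condition can be turned around with NOT EXISTS. This is just an equivalent sketch in the same shorthand as above (and it assumes item_id is never null); it is not the form the Build Farm actually runs:

-- "there is no capability required by this job that the machine lacks"
and not exists ( select 1 from apt_item_cap_map icm
                 where icm.item_id = wq.item_id
                 and icm.capability_id not in (this-machine's-capabilities))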

Lastly, the ORDER BY clause arranges all the legitimate candidate jobs in order by priority, descending, so that the first row in the result is the best job for this machine to process.

Hopefully the query makes sense now!

Monday, May 11, 2009

Design your Database

When we started designing the Build Farm implementation, one thing we agreed on almost immediately was that we wanted to explicitly and thoroughly design the database schema. We both had had frustrating encounters with the modern trend toward high-level Object-Relational Mapping systems, in which it seems that the typical approach is to ignore the underlying database, and spend all your energy designing the object layer.

I'm not trying to say that there is no point to designing the object layer; object modeling is quite important and necessary. But it seems as though, more and more, people don't want to talk about the database at all!
  • They don't want to talk about data types, or candidate keys, or normalization,
  • They don't want to talk about check constraints, or referential integrity, or update propagation,
  • They don't want to talk about index definitions, or access patterns, or query planning, or transaction design,
  • They don't want to talk about data lifetimes, or versioning, or auditing, or security control.
They don't seem to want to talk about the database at all, which I think is an unfortunate turn of events. Sure, modern database systems are complex and intricate, and have behaviors that take thorough and detailed study. But they are also extraordinarily powerful pieces of software, with amazing capabilities, and rich, mature functionality. They can be used well, or they can be used poorly.

In the Build Farm, we started our overall design by discussing some of the basic goals and requirements. Then, the next step we took was to talk about the data that we wanted to track. We discussed the data at a fairly abstract level, then we sat down at a keyboard and typed in a bunch of DDL: CREATE TABLE, CREATE INDEX, etc. This forced us to get specific about table names, and column names, and data types, and inter-table relationships, and so forth. Then we loaded up a little bit of sample data into those tables (by hand, using SQL Squirrel), and practiced writing some queries.
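
The post doesn't reproduce that DDL, but to give a flavor of it, here is a sketch of what two of the tables that appear in the scheduling query might look like. The table and column names come from the query in Part 3; the data types, keys, and index are my guesses, written for Apache Derby (the database we use):

create table apt_machine_desc (
    id    int not null generated always as identity,
    name  varchar(64) not null,
    primary key (id),
    unique (name)
);

create table apt_machine_cap_map (
    machine_id     int not null references apt_machine_desc (id),
    capability_id  int not null,
    primary key (machine_id, capability_id)
);

-- speed up the "which machines provide capability X" direction
create index apt_machine_cap_map_cap_idx on apt_machine_cap_map (capability_id);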

This is what people used to call "building the Data Dictionary". I don't know if anybody talks about a data dictionary anymore; I guess I'm old-fashioned.

We certainly made some mistakes when we implemented the database schema for the Build Farm. We've fixed some of those mistakes, and we'll need to fix more of them in the future.

But I firmly believe that the decision to design the database first was a major reason why the Build Farm's implementation was a success.

Friday, May 8, 2009

Build Farm Part 2: The core mission

The core mission of the Build Farm is:
  • Accurately assign the right work to the right machines,
  • While providing access to historical results for analysis.
You can make a perfectly reasonable argument that both these sub-missions are equally important, but it's the first part of the mission that I want to talk about here: as jobs queue up on the Build Farm, and as machines become available to do work, which job should be assigned to which machine?

Efficiently executing work is an often-discussed topic; you can find it covered under topics such as "Workload Management", "Resource Management", etc. It tends to be the province of software such as DBMSs, Application Servers, and Operating Systems. In other words, doing this well is a really hard problem.

We did not want to write an operating system.

So, we made several simplifying assumptions:
  • We can afford to over-provision our Build Farm. Machine resources are cheap enough nowadays, and our overall problem is small enough, that we can "throw hardware at the problem".
  • We don't have to be perfect. We have to do a decent job of scheduling work, but we can afford to make the occasional sub-par decision, so long as in general all the necessary work is getting done in a timely fashion.
We decided upon a fairly simple design, based on two basic concepts: Capabilities, and Priorities.

Priorities are just what you think they are: each job in the Build Farm is assigned a priority, which is just an integer, and jobs with a higher priority are scheduled in preference to jobs with a lower priority.

Capabilities are an abstract concept which enables the Build Farm to perform match-making between waiting jobs and available machines. Jobs require certain capabilities, and machines provide certain capabilities; for a job to be scheduled on a machine, the job's required capabilities must be a subset of the machine's capabilities.

So, for example, a particular job might specify that it requires the capabilities:
  • Windows
  • JDK 1.5
  • JBoss 4.3
  • DB2 9.5
Trying to run this job on a Linux machine, or on a machine which only has Oracle installed, would be pointless. So each machine specifies the software which it has available as a similar set of capabilities, and the Build Farm can use this information to decide if this machine is capable of handling this job.
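
To sketch what that match-making check looks like in SQL, using the table names from the scheduling query shown in Part 3 above (the item id and machine name here are just placeholders):

-- count the capabilities required by job item 42 that machine 'MACHINE-NAME' lacks;
-- zero missing capabilities means the machine is eligible to run the job
select count(*) as missing_capabilities
from apt_item_cap_map icm
where icm.item_id = 42
and icm.capability_id not in
    (select mcm.capability_id
     from apt_machine_desc md, apt_machine_cap_map mcm
     where md.id = mcm.machine_id
     and md.name = 'MACHINE-NAME')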

The bottom line is that we re-defined the Build Farm's core job-scheduling requirement to be:
  • When a machine is available to do work, ask it to do the highest-priority job in the queue which this machine is capable of executing.
When a programming problem is too hard, re-define it so that you have a problem you can actually solve.

Tuesday, May 5, 2009

File scheme URLs

I'm through with file scheme URLs.

An URL, as of course you know, is a Uniform Resource Locator. The notion is that, given an URL, client software (such as your web browser) can figure out how to retrieve that resource for your use. In the Build Farm, as is true in many modern web applications, the user interface is full of URLs, as there are many resources that we wish to display.

In the Build Farm, many of the resources to which we provide access are files:
  • build logs
  • test output
  • server logs
These files are recorded by individual build machines during job execution, and then are saved on the master file servers. The Build Farm UI then wants to make these files available, so I need to show links (URLs) to these files.

For a time, I tried to construct these links using file scheme URLs. URLs can use a variety of schemes. The "scheme" is the part of the URL at the start, which describes the underlying technique that is to be used to access the resource. Among the common schemes are:
  • http:// URLs, which are the most common, and which indicate that HyperText Transfer Protocol (HTTP) requests are to be used to access this resource.
  • https:// URLs, which are the secured variant of http URLs.
  • mailto: URLs, which are used for constructing links that initiate the sending of an email request.
  • file: URLs, which can be used to specify a file.
When you first have a look at file scheme URLs, they don't look too bad:

file://host/path
However, have a very close look at that middle paragraph in the description on Wikipedia:


On MS Windows systems, the normal colon (:) after a device letter has sometimes been replaced by a vertical bar (|) in file URLs. For example, to refer to file FOO.BAR in the top level directory of the C disk, the URL file:///C|/FOO.BAR was used. This reflected the original URL syntax, which made the colon a reserved character in a path part. For network shares, add an additional two slashes. For example, \\host\share\dir\file.txt, becomes file:////host/share/dir/file.txt.
Trying to figure out these details of how to handle device letters and network file shares is where the wheels start to come off the wagon. As I said, I tried, for a time, to make file scheme URLs work for network share references to build logs and other goodies stored on our file servers. But I could never get it to work reliably across different versions of different browsers. Firefox seemed to behave differently than Internet Explorer, and different versions of the various browsers seemed to behave differently. And the more I searched around on the net, the more confusing it got. There were seemingly authoritative articles suggesting use of 3 slashes, 4 slashes, even 5 slashes. In the bug reports, people seemed to be flailing, even using 20 or more slashes!

It was terribly frustrating, but in the end I realized that I didn't need to care: the files I was trying to link to all lived on file servers that my web server already had access to. So I simply changed my web server so that it was willing to serve those static files via HTTP (you have to fiddle with allowLinking=true in the Context definition), and defined a simple set of symbolic links in my webapps/ROOT directory, creating a simple HTTP namespace for accessing the important files on the file server. Then I replaced all my nasty code which tried to build portable file scheme URLs with simple code that built plain HTTP scheme URLs.
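
The post doesn't name the web server, but the allowLinking attribute and the webapps/ROOT layout suggest Tomcat; if so, the Context definition change it refers to amounts to a single attribute (shown here as a guess, not the actual configuration):

<!-- let the server follow the symbolic links under webapps/ROOT when serving static files -->
<Context allowLinking="true" />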

And I was happy! So, like I said, I'm through with file scheme URLs.

Monday, May 4, 2009

Build Farm Part 1: Concepts and Goals

At work, I spend a lot of time building and maintaining a tool we call the Build Farm.

The Build Farm is internally designed and developed, but it bears a lot of resemblance to tools such as Tinderbox or BuildBot, together with some aspects of tools such as CruiseControl, and even some aspects of test case management systems. I looked at several such tools when we were starting out, but our system is wholly home-grown and coded from scratch by us; this is both good and bad. It's good because we're intimately familiar with how and why it behaves as it does; it's bad because we probably could have taken an existing system and re-used it.

Anyway, the Build Farm's primary goals are:
  1. To manage and make effective use of a large (for us) pool of machines dedicated to building and testing our software. We currently have about 80 machines in the Build Farm; there are about 60 Windows machines, about 10 Linux machines, and about 10 Solaris machines. Work is scheduled both automatically and on-demand.
  2. To preserve and assist with the analysis of the results of building and testing the software. Build problems must be identified and resolved; test problems must be analyzed and investigated.
  3. To enable access to these tasks by users across the company, in all our worldwide offices, at any time 24/7.
Architecturally, the Build Farm consists of several primary pieces:
  1. Persistent storage for information about builds, tests, work requests, machines, schedules, results, etc. Our storage includes a large file server which serves both NFS (Linux, Solaris) clients and SMB (Windows) clients (via Samba), and is used to store build logs, test output, and other build-related data such as application server trace logs, test databases, etc. Our storage also includes a relational database for structured data; we use Apache Derby.
  2. A web server which provides a simple UI for editing build configurations and for viewing build and test reports. The UI allows a variety of searching, as analyzing historical information and looking for trends and patterns is a very common use.
  3. A client program which is installed on each build machine. We try to keep the client program fairly simple, and concentrate most of the logic on the server. The client knows how to request work, how to spawn subprocesses to perform work, and how to capture output and route it to the file server for the central UI to analyse.
  4. The web server also provides a SOAP interface for use by the build machines. The most important and most frequently-issued SOAP request is: "get next job to perform".
  5. A source code control system which the build machines can use to fetch the code to build and test. We use Perforce.
I'll talk more, and in more detail, about the Build Farm in future postings. I've been thrilled with the success of this tool, and I'm quite grateful to the other members of my team for their ideas, their support, and their willingness to tolerate the ups-and-downs of an internally-developed tool like this.