Cons: A Software Construction System - The Perl Journal, Spring 1998

Bob Sidebotham

Understandably, many Perl programmers don't like having to program in low-level languages like C and C++. Even more irritating is writing the Makefiles necessary to compile and install those programs with the Unix make utility. Worse still is being saddled with maintaining someone else's Makefiles.

You might have reached an uneasy truce with make, and you might even have doubts about the statements above. Reading this article will likely shatter your illusions. You have been warned!

In this article I introduce Cons, a Perl alternative to make. First, we'll see why it's necessary.

Scripting

Perhaps the most egregious aspect of make is the make language itself. (In fairness, make has been around for a long time, predating even Perl. The standard for scripting in the Unix community at the time of make's emergence was the Bourne shell, and make represented a considerable advance over the then-common practice of using shell scripts to build software.) As a scripting vehicle, Makefiles are really just glorified collections of macros with no flow-of-control primitives, no complex substitution rules, no variable scoping, and poor control over import/export of environment variables and command arguments. If you want proof that make is inadequate, just look at the plethora of utilities that people use to avoid writing their own Makefiles, like MakeMaker and imake. I could provide a long list of the shortcomings of the make language, but I'd be surprised if any Perl programmer needs much persuasion.

Make Doesn't Do The Right Thing

Software construction systems like make need to know not only how to build the final target, but when particular components need to be rebuilt. You might think that make does this well. You'd be wrong.

In typical C-based software systems, an executable program depends on libraries, those libraries depend on object files, those object files depend on C programs, and those C programs depend on C header files. Managing these dependencies is trickier than it might seem; even systems that automate dependency generation seldom get it right.

One approach is to use the C compiler to find dependencies. However, this won't find dependencies for intermediate files, such as interface files generated by a separate interface definition utility. This approach also suffers because the entire system has to be built from scratch before it's possible to determine the dependencies.

Global dependencies ("make depend") are typically mishandled by make. The process is usually slow and cumbersome, and thus often avoided by impatient users. If a dependency is wrong, or if the dependency calculations are circumvented (for example, by building a lone subdirectory without first building its dependencies), you're already a goner, because once a file is built - whether correctly or incorrectly - its modification time is updated. Now it's more up-to-date than its dependencies and no amount of after-the-fact analysis will reveal this.

Partly to address this problem, and partly to get around the need to recompile too much of the tree, many Makefiles or make-like programs let you touch all of the files controlled by the Makefile, causing them to be presumed up-to-date, while files that depend upon them are presumed not to be up-to-date. (touch is a Unix command that sets the last modification time of a file to the current moment.) This is a contrived and error-prone practice because a touched file will not be rebuilt, regardless of whether it needs to be.

Problems result when a subset of modules are built in a particular way. Perhaps debugging is enabled in a subset of modules, but then must be removed in the final target. In this situation there is no reliable way to determine which files need to be rebuilt. A great burden falls on the developer who must remember exactly what he or she has done. A similar problem occurs from changes to rules that affect the way derived files are built. Attempts are occasionally made to deal with this by having derived files depend upon their Makefiles, but this is a heavyweight solution which invites the unsafe use of touch or other hacks to prevent unwanted rebuilds.

The result is that particular build trees become less certain to have been built correctly - in the right order, with the right parameters, by the right compilers, and with the right environment. When the developer has lost confidence in the tree, often after much wasted time chasing build-related problems masquerading as bugs, a make clean must be performed. The complete build from scratch wastes additional time.

Problems also arise from implicit rules, wildcards, and vpaths (lists of alternate directories to be searched for targets or dependencies). The unrestricted import of environment variables also causes confusion, with builds working correctly for one user but not another. And it's not unusual for a Makefile invoked manually to behave differently than when it's invoked by another Makefile.

Using file modification times as the sole determinant of whether a file is up-to-date is perilous. This causes problems in distributed systems if time is not carefully synchronized over all hosts. Library release management can also expose the fragility of this approach; if users have performed builds prior to the release date of the library, but after the library was built, make can be fooled into thinking that the users' builds are up-to-date. The central problem with the use of file modification times, however, is that a file's modification time says when the file was built, but not how the file was built. This is insufficient information for determining whether it's current.
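To make the weakness concrete, here is a minimal Perl sketch of a timestamp-only rule. It is not make's actual code: the names stale_by_mtime and write_file are hypothetical, and the rule modeled is simply "a target is stale if it is missing or older than one of its dependencies".

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical model of a timestamp-only rule: a target is stale when it is
# missing or older than at least one of its dependencies.
sub stale_by_mtime {
    my ($target, @deps) = @_;
    return 1 unless -e $target;
    my $target_mtime = (stat $target)[9];
    for my $dep (@deps) {
        return 1 if (stat $dep)[9] > $target_mtime;
    }
    return 0;
}

# Write a file and force its modification time to a known value.
sub write_file {
    my ($path, $text, $mtime) = @_;
    open my $fh, '>', $path or die "cannot write $path: $!";
    print {$fh} $text;
    close $fh or die "cannot close $path: $!";
    utime $mtime, $mtime, $path or die "cannot set time on $path: $!";
}

my $dir = tempdir(CLEANUP => 1);
my ($src, $obj) = ("$dir/foo.c", "$dir/foo.o");

write_file($src, "int main(void) { return 0; }\n", 1_000);
write_file($obj, "object code\n",                   2_000);
my $stale_before = stale_by_mtime($obj, $src);

# Clobber the object file with junk. Its timestamp only moves forward,
# so the timestamp rule still considers it current.
write_file($obj, "not an object file\n", 3_000);
my $stale_after = stale_by_mtime($obj, $src);

print "before clobbering: ", ($stale_before ? "stale" : "current"), "\n";
print "after clobbering:  ", ($stale_after  ? "stale" : "current"), "\n";
```

Both lines report "current": the timestamp says when foo.o was written, not how, so the damaged file passes the check.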

Build Sequencing

Another difficulty with make is build sequencing. If you manage to get dependencies correct, make is pretty good about sequencing the build steps within each directory. However, between directories, the picture is murkier. In large systems, a common approach is to build the whole system in several passes. The first pass might install some include files; the second might build some tools and install them; a third pass might use these tools to build a set of interface files from interface description files; and so on.

Setting up the appropriate sequencing in a maintainable and understandable manner can be quite a challenge, typically resulting in complexity and inflexibility. Because of the perceived performance hit of multiple passes through trees of Makefiles, a full recursive build is often circumvented, compromising the build for the reasons cited earlier.

Variant Builds

make offers little support for parallel variant builds. You might wish to build a system for multiple target platforms or architectures, or maintain separate versions of libraries - one with debugging symbols and one without. It's also desirable to be able to change the way the system is built temporarily; once again, make is no help.

Complexity

Perhaps the most serious issue with make is complexity. This arises from many sources, most of all from the make language itself. Dependencies are hard to maintain, because the dependency information is frequently written in at least two places: once in the Makefile and once, implicitly, in the source file (e.g. #include declarations). Automated dependency add-ons can be difficult to use, unreliable, and difficult to understand. Build sequencing can require extra baggage in each Makefile. Recursive invocations of make in multiple passes can be hard to understand and maintain. Implicit rules, vpaths, and wildcards all increase the cognitive load on the developer or maintainer.

Listing 1: What's Make

Enter Cons

My relationship with make got off to a bad start back in the early 1980s, and went downhill from there. A couple of years ago, my frustration level peaking, I looked at alternatives to make such as Odin, Shape Tools, and Jam. Of these, only Jam was close to what I wanted. It had a small footprint, didn't try to do too much, and was fast. But none of them was as flexible as I wanted. Eventually I realized that I didn't have to be a slave to any of these systems; I could roll my own with Perl.

The result is Cons, so called because it can be used to cons-struct software. Besides, all the other building words like bake, make, and build were already taken. Cons addresses all of the issues raised above and provides superior solutions in a very simple and user-friendly package. Strong claims? Read on.

Cons scripts are Perl scripts

The key to building a powerful construction system turned out to be Perl. Not only is Cons written in Perl - it is a single 1500-line program - but all user-supplied construction scripts are also written in Perl. Here's an example of a trivial script:

  $cons = new cons();
  Program $cons "foo", "foo.c";

The first line creates a construction environment, a Perl object that encapsulates how to construct derived files. A construction environment has a set of methods, such as the Program() method shown above, and a set of key/value pairs used by the methods. If we place this code in a file called Construct, making sure that the source file foo.c is in the same directory, then we build foo like this:

  % cons foo 
  cc -c foo.c -o foo.o 
  cc -o foo foo.o

The default implementation of Program() never builds a program in one step. It easily could, but there didn't seem to be any great need for it, so it doesn't.

We can elaborate this script to enable debugging and use the GNU C compiler, gcc, like so:

  $cons = new cons( 
    CC     => 'gcc', 
    CFLAGS => '-g'
  );

  Program $cons "foo", "foo.c";

Then recompile it like this:

  % cons foo 
  gcc -g -c foo.c -o foo.o 
  gcc -o foo foo.o

Since we have Perl scripting at our fingertips, we can get fancier and control compile-time options via command-line arguments. Here's a somewhat trivial example of how you might do this:

  $opt = "-O$ARG{opt}" if $ARG{opt}; 
  $dbg = "-g" unless $opt; 

  $cons = new cons( 
    CC     => 'gcc', 
    CFLAGS => "$dbg $opt"
  );

Any arguments with embedded equal signs are passed to the top-level Construct file via the %ARG hash. Here's the result:

  % cons foo opt=4 
  gcc -O4 -c foo.c -o foo.o 
  gcc -o foo foo.o

If we rebuild again without opt=4, we get:

  % cons foo 
  gcc -g -c foo.c -o foo.o 
  gcc -o foo foo.o

This tiny example hints at the flexibility Perl provides Cons scripts. With Perl, scripts can do exactly what you want them to do in the most straightforward possible manner. It's especially useful for adapting to multiple machine architectures or environments. There's never any need for pre-processors like imake: everything can be done directly within Perl. All the things that make Perl a great language also make it a good choice for a software construction tool.
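As a rough illustration of adapting to different platforms, the sketch below picks construction variables in ordinary Perl. The helper name construction_vars, the platform test, and the particular compilers and flags are assumptions made for the example; only $^O is a Perl built-in. The call to new cons() is left in a comment so that the block runs without Cons installed.

```perl
use strict;
use warnings;

# Hypothetical helper: choose construction variables from the platform name
# and the command-line arguments. Compiler names and flags are illustrative.
sub construction_vars {
    my ($os, %arg) = @_;
    my %vars = (CC => 'cc', CFLAGS => '-g');
    $vars{CC}     = 'gcc'         if $os eq 'linux';
    $vars{CFLAGS} = "-O$arg{opt}" if $arg{opt};
    return %vars;
}

# In a Construct file the arguments come from Cons's %ARG hash;
# here a stand-in hash is used instead.
my %args = (opt => 2);
my %vars = construction_vars($^O, %args);

# A Construct file would then continue with something like:
#   $cons = new cons(%vars);
#   Program $cons "foo", "foo.c";
print "$_ => $vars{$_}\n" for sort keys %vars;
```

Because the decision is plain Perl, it can be as simple or as elaborate as the project needs, with no separate pre-processing step.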

Cons Does The Right Thing

It doesn't matter who builds the system, what has been built before, or what has changed: Cons will always build the system correctly. Period. Several techniques help ensure this. Cons performs global automatic dependency analysis, and considers everything that goes into the file and how all those pieces were built.

With Cons, the golden rule for dependency handling is that, ideally, every dependency is written down exactly once. If a C file #includes a header file, there's your dependency. In our previous example, it turns out that foo.c includes the file foo.h. Note that this dependency is automatically recognized by Cons:

  % cons foo 
  cons: "foo" is up-to-date. 

  % touch foo.h 

  % cons foo 
  gcc -g -c foo.c -o foo.o 
  gcc -o foo foo.o

In the following example, we see that the file foo.o depends not only on the files foo.c and foo.h, but also on the actual commands that were used to produce it. Notice what happens if we change the optimization level:

  % cons foo opt=1 
  gcc -O1 -c foo.c -o foo.o 
  gcc -o foo foo.o 

  % cons foo opt=1 
  cons: "foo" is up-to-date. 

  % cons foo 
  gcc -g -c foo.c -o foo.o 
  gcc -o foo foo.o 

  % cons foo 
  cons: "foo" is up-to-date.

As long as we leave the optimization level alone, Cons will not recompile. As soon as the optimization level is changed, foo is rebuilt.

Notice what happens if the target file is modified outside of the construction system since the last Cons-directed build:

  % cons foo 
  cons: "foo" is up-to-date. 

  % cp /etc/passwd foo 

  % cons foo 
  gcc -o foo foo.o

If we'd been using make, foo would have been considered up-to-date because it was modified more recently than its dependencies.

Now consider what happens if foo.o is modified behind Cons's back:

  % cons foo 
  cons: "foo" is up-to-date. 

  % cp /etc/passwd foo.o 
  
  % cons foo 
  gcc -c foo.c -o foo.o 
  cons: "foo" is up-to-date.

This one is, perhaps, surprising: Cons detects that foo.o needs to be rebuilt. After doing this, however, Cons does not relink foo because it discovers that foo was previously linked with an identical version of foo.o. It's worth observing that make, in the same situation, would have attempted to rebuild foo from the corrupted foo.o: since the object file was newer than foo.c, make would have made no attempt to rebuild foo.o first.

Explicit and Implicit Dependencies

How can Cons be this smart? If we go back to the construction script, we see that the crux of the script is:

  Program $cons "foo", "foo.c";

The Program() method expects a target file name and a set of object files. When it's given a file without the standard object extension (e.g. .o, identified by SUFOBJ in the $cons object), it implicitly invokes the Objects() method. So this script is equivalent to:

  Program $cons "foo", "foo.o"; 
  Objects $cons "foo.c";

The Program() method sets up an explicit dependency of foo on foo.o. The Objects() method knows, for the specified environment $cons, how to derive a corresponding object file for each file mentioned, and uses the file's extension to determine precisely how to do this. Cons uses SUFOBJ in the $cons environment to determine the resulting file extension. It sets up an additional explicit dependency of foo.o on foo.c and associates a builder object for C files with foo.o as the output and foo.c as the input. This builder object, in turn, recognizes that foo.c may reference other sources, and it provides a scanner, responsible for scanning foo.c for #include lines, and, recursively, scanning any include files that are included by foo.c. The files returned by this scanner are implicit dependencies of foo.o.
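The scanner idea can be sketched in a few lines of Perl. This is not the scanner that ships with Cons: the function name scan_includes is hypothetical, only quoted #include lines are recognized, and every file is assumed to live in one directory, whereas a real scanner would also search an include path.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical scanner: return the files reached through #include "..." lines,
# following nested includes and visiting each included file once.
sub scan_includes {
    my ($dir, $file, $seen) = @_;
    $seen ||= {};
    my @found;
    open my $fh, '<', "$dir/$file" or return @found;   # unreadable: no deps
    my @lines = <$fh>;
    close $fh;
    for my $line (@lines) {
        next unless $line =~ /^\s*#\s*include\s*"([^"]+)"/;
        my $inc = $1;
        next if $seen->{$inc}++;
        push @found, $inc, scan_includes($dir, $inc, $seen);
    }
    return @found;
}

# Set up a tiny source tree: foo.c includes foo.h, which includes util.h.
my $dir = tempdir(CLEANUP => 1);
my %files = (
    'foo.c'  => qq{#include "foo.h"\nint main(void) { return 0; }\n},
    'foo.h'  => qq{#include "util.h"\n},
    'util.h' => qq{/* no further includes */\n},
);
for my $name (keys %files) {
    open my $out, '>', "$dir/$name" or die "cannot write $name: $!";
    print {$out} $files{$name};
    close $out or die "cannot close $name: $!";
}

my @deps = scan_includes($dir, 'foo.c');
print "implicit dependencies of foo.o: @deps\n";
```

The list comes back as foo.h followed by util.h, which is the sense in which the implicit dependencies are discovered from the sources rather than written down a second time.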

When the user types cons foo, Cons descends the dependency tree, starting from the file foo. Cons determines that foo depends on foo.o, and so determines how to build that. The request to build foo.o causes the dependency scanner for foo.c to be invoked, returning the list of additional, dynamically determined dependencies of foo.o. In this case, as we know, at least foo.h must be returned.

Up to this point, make would have done a similar analysis, but with some key differences. In both systems, the dependency of foo on foo.o is listed explicitly. In contrast to make, in Cons the dependency of foo.o on foo.c is explicit as well. The user might not remember to include the dependency in his script, but it is explicitly defined by the script because the Objects() method is invoked, and it doesn't care what files happen to be sitting in the directory at that moment. In Cons, if foo.c doesn't exist, an error will occur if foo.o is needed; in make, a missing foo.c does not necessarily mean the build will fail: it might choose another, probably inappropriate, way to build it. That might sound unlikely, but it's happened to me. In make, of course, the dependency of foo.o on foo.h is not determined automatically and must be listed explicitly by other means.

MD5 Cryptographic Signatures

Once the dependencies are determined, make decides what to rebuild based solely on the dependency hierarchy and the relative modification times of the files involved. Cons abandons relative modification times and introduces, instead, cryptographic file signatures. Cons uses the MD5 algorithm (see Listing 2) to compress information about the file into a unique signature. Filesystems commonly do not provide any way of adding annotations to files, so the signatures are recorded separately. Cons stores the signatures in a .consign file, one for each directory which contains files derived under the control of Cons. In my directory for the example we've been working with, the .consign file contains the following lines:

  foo.o:867019197 e08563e22a97ea75de433949381ebdea
  foo:867019197 c9cda7211523e4b17aeb8d7850790f47

The first number of each entry is a validation timestamp, and the second number is the file's signature: a 128-bit MD5 checksum. An entry is valid if its validation timestamp matches the corresponding file's modification time, as returned by the Perl stat() function.
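For readers who want to poke at these files, here is a small Perl sketch that parses entries of the form shown above and applies the validity rule. The pattern is inferred only from the two sample lines, so treat the format as an assumption; parse_consign and entry_is_valid are hypothetical names, not part of Cons.

```perl
use strict;
use warnings;

# Parse .consign lines of the form "name:timestamp signature".
# Returns a hash of name => { time => ..., sig => ... }.
sub parse_consign {
    my (@lines) = @_;
    my %entries;
    for my $line (@lines) {
        next unless $line =~ /^(.+):(\d+)\s+([0-9a-f]{32})\s*$/;
        $entries{$1} = { time => $2, sig => $3 };
    }
    return %entries;
}

# An entry is valid only if its timestamp equals the file's modification time.
sub entry_is_valid {
    my ($entry, $mtime) = @_;
    return defined $entry && defined $mtime && $entry->{time} == $mtime;
}

my %entries = parse_consign(
    "foo.o:867019197 e08563e22a97ea75de433949381ebdea",
    "foo:867019197 c9cda7211523e4b17aeb8d7850790f47",
);

# In real use the second argument would come from (stat $file)[9];
# fixed numbers are used here so the sketch runs anywhere.
print "foo.o valid: ", (entry_is_valid($entries{'foo.o'}, 867019197) ? "yes" : "no"), "\n";
print "foo valid:   ", (entry_is_valid($entries{'foo'},   867019200) ? "yes" : "no"), "\n";
```

The first check succeeds because the times agree; the second fails because the file's time has moved, which is what happens when a file is modified outside of Cons.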

Notice what happens to the signatures if we recompile foo at a different optimization level:

  % cons foo opt=2 
  gcc -O2 -c foo.c -o foo.o 
  gcc -o foo foo.o 

  % cat .consign 
  foo.o:867038502 a5ee7b74ca613efe91f193f1e2d3b27b
  foo:867038503 a9e138331c31b8c5f41f95cfdd89569a

Both the validation timestamps and the signatures have changed. If we recompile without optimization:

  % cons foo 
  gcc -g -c foo.c -o foo.o 
  gcc -o foo foo.o 

  % cat .consign 
  foo.o:867038521 e08563e22a97ea75de433949381ebdea   
  foo:867038521 c9cda7211523e4b17aeb8d7850790f47

We get the original signatures back, but with new validation stamps, representing the times the files were most recently derived.

Finally, if we touch an input file that is a dependency, the signature changes to something completely different:

  % touch foo.c 

  % cons foo 
  gcc -g -c foo.c -o foo.o 
  gcc -o foo foo.o 

  % cat .consign 
  foo.o:867038743 3c0cb0d5d95abc4d35d352e536b78655
  foo:867038743 33d6543621b4b1cdc51c15f0e72c9f41

Cons dumps everything it knows about a file into the signature. The signature thus reflects the way a file was built and all dependencies of the file. The signatures of the dependencies, in turn, include the command lines that were used to build them, and their dependencies. For source files that are not derived and for which no pre-existing signature exists, Cons uses the last modification time of the file. Unlike make, however, Cons is sensitive to any change in the file's modification time, either forwards or backwards.
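The following Perl sketch shows one way such a signature could be composed with the Digest::MD5 module. The exact layout Cons feeds to MD5 is not described here, so the scheme below (command line joined with dependency signatures) is an assumption, and derived_signature and source_signature are hypothetical names.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical scheme: hash the build command together with the signatures
# of all dependencies. Cons's real layout may differ.
sub derived_signature {
    my ($command, @dep_sigs) = @_;
    return md5_hex(join "\0", $command, @dep_sigs);
}

# Source files have no build command, so the sketch hashes the file name
# and its modification time instead.
sub source_signature {
    my ($name, $mtime) = @_;
    return md5_hex("$name\0$mtime");
}

my $foo_c = source_signature('foo.c', 867019000);
my $foo_h = source_signature('foo.h', 867019000);

my $debug     = derived_signature('gcc -g -c foo.c -o foo.o',  $foo_c, $foo_h);
my $optimized = derived_signature('gcc -O2 -c foo.c -o foo.o', $foo_c, $foo_h);
my $again     = derived_signature('gcc -g -c foo.c -o foo.o',  $foo_c, $foo_h);

# Touching foo.c changes its mtime, hence its signature, hence foo.o's.
my $touched = derived_signature('gcc -g -c foo.c -o foo.o',
                                source_signature('foo.c', 867038700), $foo_h);

print "flags changed:  ", ($debug ne $optimized ? "new signature" : "same"), "\n";
print "flags restored: ", ($debug eq $again     ? "original signature" : "different"), "\n";
print "source touched: ", ($debug ne $touched   ? "new signature" : "same"), "\n";
```

The three cases mirror the transcripts above: changing the flags yields a new signature, restoring them yields the original one, and touching a source file yields something different again.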

The algorithm used to determine whether a file must be rebuilt is simple. First, determine the file's signature. This is a recursive operation that will build, if necessary, any dependencies of the file. Second, compare the file's signature with that stored in the appropriate .consign file. If no valid signature can be found, or if the modification time of the file does not match the validation timestamp in the .consign entry, assume the file is out-of-date. If the signature does not match, the file is also assumed to be out-of-date. To rebuild the file, Cons uses whatever rules have been specified in the construction script. If the rebuild is successful, a new .consign entry is created accordingly.
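The decision itself can be written as a short Perl function. This is a sketch of the steps as described, not code taken from Cons; the name rebuild_reason and the shape of the entry hash are assumptions for illustration.

```perl
use strict;
use warnings;

# Decide whether a derived file must be rebuilt.
#   $computed - signature computed from the command and dependencies
#   $entry    - stored entry as { time => ..., sig => ... }, or undef
#   $mtime    - current modification time of the file, or undef if missing
# Returns a reason string, or the empty string if the file is up-to-date.
sub rebuild_reason {
    my ($computed, $entry, $mtime) = @_;
    return 'no stored signature' unless defined $entry;
    return 'timestamp mismatch'  if !defined $mtime || $entry->{time} != $mtime;
    return 'signature differs'   if $entry->{sig} ne $computed;
    return '';
}

my $entry = { time => 867019197, sig => 'e08563e22a97ea75de433949381ebdea' };
my @cases = (
    [ 'unchanged',        $entry->{sig},                      $entry, 867019197 ],
    [ 'file overwritten', $entry->{sig},                      $entry, 867020000 ],
    [ 'flags changed',    'a5ee7b74ca613efe91f193f1e2d3b27b', $entry, 867019197 ],
    [ 'never built',      $entry->{sig},                      undef,  undef     ],
);

for my $case (@cases) {
    my ($label, @args) = @$case;
    my $reason = rebuild_reason(@args);
    printf "%-16s %s\n", $label, $reason ? "rebuild ($reason)" : "up-to-date";
}
```

Only the first case is left alone. The "file overwritten" case corresponds to the cp /etc/passwd experiments: the stored signature is intact, but the timestamp no longer vouches for the file on disk.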

A further benefit of file signatures is that Cons never has to remove files after an interrupted build! Since Cons only updates the signature after the file is built, there is no way that a newly modified file can have a valid signature if Cons is interrupted before completion of the processing step which creates the file.

Automatic, Global Sequencing of Builds

Because Cons does full and accurate dependency analysis, and because it's global across the entire build, it can use this information to take full control of the build sequencing. This sequencing is evident in the above examples, and is equivalent to what you would expect from make if you had a full set of dependencies. With Cons, however, this extends to larger, multi-directory builds.

A build is organized by creating a hierarchy of build scripts, starting with the top-level Construct file, and including multiple Conscript files from various directories. These scripts are tied together by the Cons-provided functions Build(), Export(), and Import(). In make, subsidiary scripts are evaluated in a separate invocation. With Cons, only a single invocation is needed: Cons executes all of the scripts in the tree as part of the same Perl program. Note that the exact set of scripts read is under the control of Cons, and thus might change to accommodate different building arrangements.

To illustrate this, consider a simple system with a top-level directory and beneath this a directory called 'src' containing all the source code for the build. This directory is organized hierarchically. We'll consider just a small part of a system, with subdirectories 'tpj' and 'app'. The first directory contains a library of code used by the application in the 'app' directory. Rather than teach the application the location of the library's source directory, we'll instead arrange for the library directory to install its products in known locations in a shared hierarchy called 'build'. Files will be installed in 'build/include', 'build/lib', and 'build/bin'. (In Cons, any pathname starting with # is considered a top-relative name. The name #build can be passed to subsidiary scripts and will always refer to the 'build' directory at the top of the tree. This is analogous to using / to specify an absolute pathname.) The Construct file might look like this:

  # Define the various build directories 
  $build   = "#build";
  $LIB     = "$build/lib"; 
  $INCLUDE = "$build/include"; 
  $BIN     = "$build/bin"; 

  # Define an environment for building programs 
  # linked against libtpj.a. 
  $CONS = new cons( CC => 'gcc', 
                  LIBS => '-ltpj', 
               LIBPATH => $LIB, 
               CPPPATH => $INCLUDE); 

  # Export the environment and build directories to 
  # our subsidiary scripts. 
  Export qw( CONS LIB INCLUDE BIN ); 

  # Read and execute the subsidiary Conscript files.
  for ("app", "tpj") { Build "src/$_/Conscript"; }

The file 'src/tpj/Conscript' looks like this:

  # Import these variables from our invoking script.
  Import qw( CONS INCLUDE LIB ); 

  # Arrange to build the library from these source files
  Library $CONS "libtpj.a", qw( tpj.c jon.c ); 

  # Arrange to install the library in the $LIB directory
  Install $CONS $LIB, "libtpj.a"; 

  # Arrange to install the header file in $INCLUDE. 
  Install $CONS $INCLUDE, "tpj.h";

And the file 'src/app/Conscript' looks like this:

  # Import these variables from our invoking script
  Import qw( CONS BIN ); 

  # Arrange to build the "app" program. 
  Program $CONS "app", "app.c"; 

  # Arrange to install the app program in $BIN. 
  Install $CONS $BIN, "app";

These scripts result in a build tree: For every derivable file in the tree, Cons records the explicit dependencies and a method for dynamically determining the implicit dependencies, if any. Cons also determines a method for actually deriving the file should this be necessary. In most cases, this comes down to one or more commands to pass to the shell.

We can ask Cons for a list of products in the build tree:

 
  % cons -p . 
  build/bin/app 
  build/include/tpj.h 
  build/lib/libtpj.a 
  src/app/app 
  src/app/app.o 
  src/tpj/jon.o 
  src/tpj/libtpj.a 
  src/tpj/tpj.o

Or how they're built:

  % cons -pa . 
  build/bin/app: 
  ... Install src/app/app as build/bin/app 
  build/include/tpj.h: 
  ... Install src/tpj/tpj.h as build/include/tpj.h
  build/lib/libtpj.a: 
  ... Install src/tpj/libtpj.a as build/lib/libtpj.a
  src/app/app: 
  ... gcc -o src/app/app src/app/app.o -Lbuild/lib -ltpj 
  src/app/app.o: 
  ... gcc -Ibuild/include -c src/app/app.c -o src/app/app.o
  src/tpj/jon.o: 
  ... gcc -Ibuild/include -c src/tpj/jon.c -o src/tpj/jon.o
  src/tpj/libtpj.a: 
  ... ar r src/tpj/libtpj.a src/tpj/tpj.o src/tpj/jon.o 
  ... ranlib src/tpj/libtpj.a 
  src/tpj/tpj.o: 
  ... gcc -Ibuild/include -c src/tpj/tpj.c -o src/tpj/tpj.o

Or which build scripts define them:

  % cons -pw . 
  build/bin/app: cons::Install in "src/app/Conscript", line 8
  build/include/tpj.h: cons::Install in "src/tpj/Conscript", line 11
  build/lib/libtpj.a: cons::Install in "src/tpj/Conscript", line 8
  src/app/app: cons::Program in "src/app/Conscript", line 5
  src/app/app.o: cons::Program in "src/app/Conscript", line 5
  src/tpj/jon.o: cons::Library in "src/tpj/Conscript", line 5
  src/tpj/libtpj.a: cons::Library in "src/tpj/Conscript", line 5
  src/tpj/tpj.o: cons::Library in "src/tpj/Conscript", line 5

So how is the build sequenced? It's very simple: Cons processes the target arguments, one by one, and looks them up in the build tree. For each argument, it first determines the dependencies (which might involve other recursive build steps), and recursively builds any dependencies whose stored signatures do not match their computed signatures. It then builds the target argument itself, if necessary. This process is sufficient to build any target, with any dependencies, at any location in the tree.
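The sequencing part of this process amounts to a depth-first walk of the dependency graph. The Perl sketch below shows only the ordering, not the signature checks. The dependency table is hand-written from the example tree; in particular, the edge from src/app/app to build/lib/libtpj.a is an assumption about how the -ltpj library would be tracked, header dependencies are left out, and build_order is a hypothetical name.

```perl
use strict;
use warnings;

# Explicit dependencies from the example tree. The link from src/app/app to
# build/lib/libtpj.a is assumed; header dependencies are omitted for brevity.
my %deps = (
    'build/bin/app'      => ['src/app/app'],
    'src/app/app'        => ['src/app/app.o', 'build/lib/libtpj.a'],
    'src/app/app.o'      => ['src/app/app.c'],
    'build/lib/libtpj.a' => ['src/tpj/libtpj.a'],
    'src/tpj/libtpj.a'   => ['src/tpj/tpj.o', 'src/tpj/jon.o'],
    'src/tpj/tpj.o'      => ['src/tpj/tpj.c'],
    'src/tpj/jon.o'      => ['src/tpj/jon.c'],
);

# Depth-first walk: visit every dependency before the target itself, and
# visit each node once. Appends derived files to @$order in build order.
sub build_order {
    my ($target, $deps, $done, $order) = @_;
    return if $done->{$target}++;
    for my $dep (@{ $deps->{$target} || [] }) {
        build_order($dep, $deps, $done, $order);
    }
    push @$order, $target if exists $deps->{$target};   # skip plain sources
    return;
}

my @order;
build_order('build/bin/app', \%deps, {}, \@order);
print "$_\n" for @order;
```

Asking for build/bin/app alone is enough to pull in the library, its object files, and the installed copy in the right order, even though those files are described by a different Conscript file.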

Keep it Simple, Stupid!

The best feature of Cons is its use of Perl as a scripting language. Automatic dependency analysis, cryptographic file signatures, detailed control of the command execution environment, and explicit construction directives further simplify construction scripts and ensure that Cons reliably builds software. Construction environments encapsulate and localize knowledge of how to build target files. Automatic build sequencing makes writing and building scripts dead easy. Parallel, variant builds are especially easy to arrange.

Cons is effective. It is used in a number of projects with several thousand source files. The "from scratch" build time of large projects has decreased significantly whenever Cons was employed. More importantly, these projects no longer suffer from what used to be daily build-related problems: the builds are now consistently correct.

All of this is accomplished with a single portable program in a simple and easy to use package. But you don't have to take my word for it; try it out! Cons is freely available on a CPAN site near you at http://www.perl.com/CPAN/authors/Bob_Sidebotham.

_ _END_ _

Bob Sidebotham is the author of Cons.
