Eleanor McHugh (feyeleanor) wrote,

C and code readability

Today I'm going to admit to a shameful secret: well, shameful amongst hackers anyway. Even though I've been using C for twenty years, including lots of snazzy embedded work, I loathe working with it. This is the point where a language bigot would usually inject comments about this or that specific feature of C that they think should be changed, but it's not the language features which bug me - it's the code that gets written with them in the real world.

Conceptually C is beautifully simple and in its heyday it was a so-so procedural language which just happened to be an excellent bridge to the low-level world, much like BCPL before it, and like BCPL its appeal came from the ease with which it could be used for systems programming. Compared to Algol and the other mainstream languages of the early 70's it had a free-wheeling, lightweight feel that made it easy to port to new architectures and the fact that UNIX was written in it added a twist of glamour the others lacked. Never underestimate the part played by glamour in setting fashions in the software world: we all want to believe we could write the next great operating system.

Anyway, anything you'd consider writing in Assembler for a specific processor family could be written in C for portability of sorts (assuming you stuck closely to the K&R view of what constituted valid C) with the same kind of development gains that today's dynamic scripting languages offer over C++ and Java. In fact Small C (a C subset) gave birth to several neat scripting systems in its heyday. Anything that saves on development time is a huge win commercially, so yay for C.

Now even amongst the deeper geeks who read this journal I suspect that most have only a passing familiarity with Assembler coding and the low-level workings of the hardware they play with. It's just not fashionable for the young and gifted to be able to spout opcode mnemonics in daily conversation the way it once was, and even having spent a decade of my life working heavily in Assembler I can see why: saying anything useful takes a long time and demands careful attention to detail. Frankly it's tedious. And then there's the problem of whether anyone else will understand what's been said, let alone what the author intended to say.

So for low-level system work, C was an absolute godsend. It empowered a generation of hackers to write games and operating systems and all kinds of weird and wonderful applications. By the 1980's the only real commercial competition it had was Pascal, the strange haunt of students and process engineers. Many respected computer scientists insisted that Pascal was a vastly superior language, and in many abstract respects it is, but that's no different to the Betamax versus VHS debate. What matters with any technology is popularity and Pascal never got past the fussy prohibitions against goto statements (and yes, it does have goto if you want a guilty pleasure), slightly more verbose syntax, and the sense that your parents would approve of it. Like all natural rebels, hackers love to stick it to The Man - preferably with as little personal effort as possible - so Pascal got relegated to academia and C bestrode the world like a colossus of procedural efficiency. And, like all monocultures, it promptly exploded into a heterogeneity of subtly-incompatible implementations. Well, that happened pretty much from the start, but I don't want to ruin a good story.

So what does this walk down memory lane have to do with today's subject? I guess the fact that choosing C over Pascal, and hence a world of free-wheeling code over fussiness, had other consequences. Say what you like about Pascal, I've very rarely encountered a Pascal program that was difficult to read, either on screen or on the printed page. The latter may seem a strange thing to care about, considering that most modern developers only interact with code on a screen, but my experience is that code quality is closely related to whether or not it gets offline reviews. Code that gets reviewed on paper is usually the code that scans well when you're reading it. Readable code should also aim to be self-documenting - one of the other properties that Pascal programs often display over their C counterparts.

Now don't get me wrong, I'm not advocating a switch to Pascal. The years I spent working with it and its even-more-fussy daughter Modula-2 were full of little annoyances that put me off the whole Wirthian world view permanently. But for all I dislike the language, I applaud the Pascal folks for favouring readable code over the write-only habits that pervade the C community.

And this brings me to the first of my grievances with the majority of C codebases that I encounter: like Assembler before them, they're generally written with the tersest of labeling. If a C programmer can get away with using a variable name like p1 then they will, and chances are that the accompanying comments won't explain what p1 is a contraction for. Intuition will probably explain its purpose but even experienced coders can find themselves scratching their heads for hours when they run into such usages. That alone makes them bad enough, but the signal they send to novice C coders is that this is the way to write professional code. It isn't.
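To make the complaint concrete, here's a minimal sketch of the same routine written twice. Both functions, and every name in them, are invented purely for illustration and aren't drawn from any real codebase: the first uses the p1 school of labeling, the second spells out what each value is for.

```c
#include <stddef.h>

/* Terse labeling: legal C, but what are p1, n and t supposed to be? */
double f(const double *p1, size_t n, double t)
{
    size_t i;
    double s = 0.0;

    for (i = 0; i < n; i++) {
        if (p1[i] > t) {
            s += p1[i];
        }
    }
    return s;
}

/* Identical logic, with hypothetical names that state the intent. */
double sum_readings_above_threshold(const double *readings,
                                    size_t reading_count,
                                    double threshold)
{
    size_t index;
    double total = 0.0;

    for (index = 0; index < reading_count; index++) {
        if (readings[index] > threshold) {
            total += readings[index];
        }
    }
    return total;
}
```

The compiler treats the two identically. The difference is that a call such as sum_readings_above_threshold(samples, sample_count, 2.5) explains itself at the call site, whereas f(p1, n, t) sends the reader off to find the definition.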

I have a huge bugbear over this kind of incidental and institutionalised laziness because it's the wrong kind of laziness. It ignores the fact that source code exists to be read by other developers - that's what 99% of the people dealing with it during its life will be concerned about, including the original authors. Terse names may save on typing today, but when it comes to maintenance of a codebase you want every clue you can possibly get about the intent of the original authors, and you want to be able to make changes with minimal investment of labour. Basically you want to be able to just read the damn thing and make the necessary changes to fix a bug, optimise performance or slip in new functionality in an easily maintainable manner. And the first things you're looking for when doing this? Meaningful identifiers.

Yes, good comments help, as does well-written design documentation, but they're abstractions that distract from the code narrative. Design documentation in particular should be looked on as a fallback for when architectural changes are required, rather than something that needs to go into turgid detail. For anything more trivial the codebase alone should be enough to allow a skilled coder to get on with their job.

Readability also matters whether a codebase will belong to one person for its entire lifespan or will pass through many hands. Why? Because there's a practical limit to how many lines of code any one person can keep in their head at one time, and most significant projects will fall outside that limit. Every time we reuse a meaningful identifier in different places where the same concept is used we decrease the effective size of the codebase conceptually, and that means more functionality can be remembered in detail. Keep using meaningless identifiers and you lose that benefit, reducing the amount of code you can slip comfortably back into. Now that's the kind of abstraction that aids the narrative process, and hence makes for a much more productive developer.

Those who think I'm unfairly singling out C here are to some extent right. I see the same problems in codebases developed in other languages (yes, I do spend a lot of my free time reading other people's code) but rarely is it as all-encompassing as it is in the C/C++ community. And that's before we even consider Charles Simonyi's Hungarian naming convention, which by hijacking the first few characters of an identifier name to show type information adds more unnecessary line noise: there's already a type declaration with that information and it really isn't necessary to show it beyond that unless you're writing thousand-line functions. You are? I hope they're heavily-optimised state machines or numerical methods, because otherwise you should probably be considering a complete rewrite or a change of career.
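As a rough illustration of the line-noise point, here's a small sketch with the same function written in both styles. The function names are hypothetical, and the prefixes (psz, ch, cch, c, i) follow the commonly seen Windows-style usage rather than any one authoritative rulebook.

```c
#include <stddef.h>
#include <string.h>

/* Hungarian-style names: each prefix repeats what the declaration says. */
size_t CountMatchesHungarian(const char *pszText, char chTarget)
{
    size_t cchText = strlen(pszText);
    size_t cMatches = 0;
    size_t iChar;

    for (iChar = 0; iChar < cchText; iChar++) {
        if (pszText[iChar] == chTarget) {
            cMatches++;
        }
    }
    return cMatches;
}

/* Plain names: the types are still one glance away in the declarations. */
size_t count_matches(const char *text, char target)
{
    size_t length = strlen(text);
    size_t matches = 0;
    size_t position;

    for (position = 0; position < length; position++) {
        if (text[position] == target) {
            matches++;
        }
    }
    return matches;
}
```

In a function this short the declarations are never more than a few lines from their uses, so the prefixes add characters without adding information.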

As you'll probably have guessed by this point, I take a very narrative approach to code design. The code you write is a dialogue between human and machine, and too many programs read more like a chat log from an IRC session than they do a carefully considered debate. Narrative matters, damn it!

Which leads nicely to the other practice that really bugs me: inconsistent formatting. This is pretty much a universal problem in open source projects, probably thanks to the limited resources available to most code maintainers and the way in which patches are often contributed, but it's almost as common in commercial codebases where more than one hand has been involved.

Now there are probably as many acceptable code formats for C as there are C programmers and that's just one of those things we have to deal with and move on from. But just because different programmers favour different formatting conventions doesn't mean that individual codebases need to reflect that: if you're playing with someone else's codebase you should follow their conventions, and if they don't have conventions you should suggest some and explain why they're desirable: readability.

Take the case of indenting as a mechanism for indicating blocks of nested scope. The Python community considers this so important to writing clean code that it's a mandatory practice - the indentation alone indicates to the parser that code is explicitly part of a nested scope, and anyone reading a Python program can instantly see the structure at a glance.

Contrast that to the many C codebases where one or more of the authors decided indentation was an unnecessary waste of typing effort as curly braces already delimit nested scopes. The braces are handy (just like Pascal's begin...end constructs are) because they clearly delineate that a sequence of statements belongs together, but without clear indentation it's all too easy for the eye to miss them when scanning through. The same goes for the habit of not leaving space between if, for, while, do or switch keywords and the opening bracket of their condition. Indenting and spacing are really simple habits to get into, and after a while they become instinctual which makes them cost-free, but too many programmers consider them a waste of time. They are focused on the immediate gain of getting the code written instead of viewing the longer-term cost.
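Here's a small sketch of the difference, using a made-up function written twice. The layout in the second version is just one reasonable convention among many and isn't meant as the one true style.

```c
/* Legal C, but the nesting is invisible and the keywords hug their brackets. */
int count_even_cramped(const int *values, int count)
{
int i, evens = 0;
for(i = 0; i < count; i++){
if(values[i] % 2 == 0){
evens++;
}
}
return evens;
}

/* The same statements, indented by scope, with a space after for and if. */
int count_even(const int *values, int count)
{
    int index;
    int evens = 0;

    for (index = 0; index < count; index++) {
        if (values[index] % 2 == 0) {
            evens++;
        }
    }
    return evens;
}
```

Both compile to the same thing, but only the second lets the eye see at a glance that the increment sits two scopes deep.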

When encountering these habits in proprietary codebases I often hear the excuse that commercial pressures make code readability less important than hitting a particular deadline. This isn't just restricted to the C world: I've heard the same excuses on Rails projects, VB projects and Java projects, and they stem from a misunderstanding of how to rapidly develop robust code. Hitting a deadline with one bug every hundred lines of code may satisfy a manager who only sees a list of requirements nominally satisfied, but the whole mindset associated with this approach leads to ill-conceived code that's barely fit for purpose. Perhaps it could be forgiven if the next step after hitting the deadline was to polish and refactor the code, but shops that work this way never give that time and so a succession of equally shoddy systems is developed, each a maintenance nightmare. No wonder so much software is rewritten from scratch when an upgrade is needed.

I don't have a ready answer to this problem, save to encourage all senior developers to make a stand and insist on code cleanliness. I always have, and whilst a couple of times it's cost me jobs, more often it's been accepted as part of the process of producing a good quality software system - and that after all is what commercial development is supposed to be about, and why senior developers can command the salaries they do.

The best codebase I ever worked on commercially was approaching a million lines of code: C, Assembler and Visual Basic across two very different operating systems and hardware platforms. It was beautiful and I loved working on it. Nice clean code, meaningful identifier names, good indentation. Four or five people had worked on it during its lifetime and there were areas that were algorithmically too complex to grasp at the first sitting, but very little documentation was required to explain the architecture or what was going on in any particular area. Maintenance was a dream, which helped us produce heavily customised versions for different clients and resolve the vast majority of fresh bugs during initial deployment. Our customers were much happier as a result and despite being a very small company we offered a technically much superior product to all of our much larger competitors and had an enviable reputation. A testament to the philosophy of small and agile versus large and lumbering.

In fact I've noticed in small software houses that concentrate on code quality, whether as a deliberate policy or as a result of the developers choosing to work that way, that the end products have tended to be on the leading edge of both usability and technical capability within their market sector. I don't believe this to be a coincidence. It all comes back to attention to detail, simplicity of design and using the codebase as the documentation: not just peppering it with DOC comments, but literally writing it so that it scans well for the eye and imparts meaning to the reader.

Both developers and managers need to start thinking of their codebases as vital assets to be cared for across generations, rather than as something to be quickly hacked together and then kicked out the door. Deadlines are rarely as critical as the cult of management likes to believe, as Apple routinely demonstrates, but that's a rant for another day. What's important here is that you get out many times over what you put in, because maintenance costs dominate overall software lifecycle costs.

Crossing out of the commercial world, the lack of code consistency and readability is both baffling and at the same time understandable. The raison d'être of open source is so that the code is freely accessible to anyone who wants to download and read it, and consequently you'd think that accessibility would be a key design criterion. Unfortunately the real world often gets in the way of that aspiration.

Many open source projects have large numbers of contributors submitting patches in a hodge-podge of styles, include code drawn from other open source projects and have often accreted several generations of functionality on top of the original design - all without the human resources for code auditing and review. Perhaps the time has come for projects to include code cleaning phases as a regular part of their lifecycle the way that bug bashes have become the norm. Sure it's not glamorous work, but I suspect a lot of open source teams would be amazed at the cruft which becomes apparent once reading the code becomes effort-free. I know I often am...
