Archive for the ‘Computing’ Category

Tips for learning regular expressions

Monday, January 14th, 2008

Here are a few realizations that helped me the most when I was learning regular expressions.

1. Regular expressions aren’t trivial. If you think they’re trivial, but you can’t get them to work, then you feel stupid. They’re not trivial, but they’re not that hard either. They just take some study.

2. Regular expressions are not command line wild cards. They contain some of the same symbols but they don’t mean the same thing. They’re just similar enough to cause confusion.

3. Regular expressions are a little programming language. Regular expressions are usually contained inside another programming language, like JavaScript or PowerShell. Think of the expressions as little bits of a foreign language, like a French quotation inside English prose. Don’t expect rules from the outside language to have any relation to the rules inside, any more than you’d expect English grammar to apply inside that French quote.

4. Character classes are a little sub-language within regular expressions. Character classes are their own little world. Once you realize that and don’t expect the usual rules for regular expressions outside character classes to apply, you can see that they’re not very complicated, just different. Failure to realize that they are different is a major source of bugs.
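Points 2 and 4 can be made concrete with a short sketch. This is illustrative Python (the post mentions JavaScript and PowerShell, but the same distinctions hold in any regex flavor); the file names and patterns are made up for the example.

```python
import fnmatch
import re

# Point 2: a shell wildcard is not a regex.
# In a wildcard, * alone means "any run of characters."
assert fnmatch.fnmatch("report.txt", "*.txt")

# In a regex, * means "zero or more of the preceding item,"
# so the wildcard "*.txt" must be written ".*\.txt" instead.
assert re.fullmatch(r".*\.txt", "report.txt")

# Point 4: inside a character class the rules change.
assert re.fullmatch(r"[A-Za-z]+", "hello")   # - denotes a range here
assert re.fullmatch(r"[^0-9]+", "abc")       # leading ^ negates the class
assert re.fullmatch(r"[.]", ".")             # . is a literal dot here...
assert not re.fullmatch(r"[.]", "x")         # ...not "any character"
```

The last two assertions are the kind of thing point 4 warns about: `.` matches anything outside a character class but only a literal period inside one.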

Once you’re ready to dive into regular expressions, read Jeffrey Friedl’s book. It’s by far the best book on the subject. Read the first few chapters carefully, but then flip the pages quickly when he goes off into NFA engines and all that.

Million dollar cutoff for software technique

Friday, January 11th, 2008

I recently listened to a podcast interview with Rob Page from Zope. At one point he talked about having SQL statements in your code vs. having accessor classes, and how as your code gets bigger there’s more need for OO design. No surprise. But then he said something interesting: if your project is smaller than $1M, straight SQL is OK; over $1M, you need accessors.
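The two styles in question might look like this. A minimal sketch in Python with SQLite; the `users` table and the `UserStore` class are invented for the illustration, not anything from the podcast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('Ada')")

# Small-project style: SQL written inline, right where it's needed.
row = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()
print(row[0])  # Ada

# Larger-project style: an accessor class hides the SQL, so callers
# never see table or column names and the schema can change in one place.
class UserStore:
    def __init__(self, conn):
        self.conn = conn

    def name_of(self, user_id):
        row = self.conn.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        return row[0] if row else None

store = UserStore(conn)
print(store.name_of(1))  # Ada
```

The first style is less code and perfectly readable at small scale; the second pays off once many call sites depend on the same schema.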

I don’t know whether I agree with the $1M cutoff, but I agree that there is a cutoff somewhere. I appreciate that Page was willing to take a stab at where the cutoff is. Also, I found it interesting that he measured size by dollars rather than, for example, lines of code. I’d like to see more pundits qualify their recommendations as a function of project budget.

Almost all advice on software engineering is about scaling up: bigger code bases, more users, etc. No one talks about the problem of scaling down. The implicit assumption is that you should concentrate on scaling up because scaling down is easy. I disagree. Over the last few years I’ve managed hundreds of small software projects, and I know that scaling down presents unique challenges. Maybe “scaling down” is the wrong term. My projects have scaled up in a different way: more projects rather than bigger projects. 

One challenge of small projects is that they may become large projects; the opposite never happens. Sometimes the transition is so gradual that the project becomes too big for its infrastructure before anyone notices. Having some rule like the $1M cutoff could serve as a prompt for reflection along the way: hey, now that we’re a $1M project, maybe we should refactor now to avoid a complete rewrite down the road.

Complementary validation

Thursday, January 10th, 2008

Edsger Dijkstra quipped that software testing can only prove the existence of bugs, not the absence of bugs. His research focused on formal techniques for proving the correctness of software, with the implicit assumption that proofs are infallible. But proofs are written by humans, just as software is, and are also subject to error. Donald Knuth had this in mind when he said “Beware of bugs in the above code; I have only proved it correct, not tried it.” The way to make progress is to shift from thinking about the possibility of error to thinking about the probability of error.

Testing software cannot prove the impossibility of bugs, but it can increase your confidence that there are no bugs, or at least lower your estimate of the probability of running into a bug. And while proofs can contain errors, they’re generally less error-prone than source code. (See a recent discussion by Mark Dominus about how reliable proofs have been.) At any rate, people tend to make different kinds of errors when proving theorems than when writing software. If software passes tests and has a formal proof of correctness, it’s more likely to be correct. And if theoretical results are accompanied by numerical demonstrations, they’re more believable.
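A toy illustration of pairing a proof with a numerical demonstration: the closed form 1 + 2 + … + n = n(n+1)/2 has a standard proof by induction, and a quick loop checks it against direct summation. The check proves nothing the induction didn’t, but it would catch a transcription slip in the formula, which is exactly the kind of independent error mode described above.

```python
# Numerically verify the proved identity 1 + 2 + ... + n = n(n+1)/2
# for a range of n. A proof error and a coding error are unlikely
# to coincide, so agreement raises confidence in both.
for n in range(1, 1000):
    assert sum(range(1, n + 1)) == n * (n + 1) // 2
```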

Leslie Lamport wrote an article entitled How to Write a Proof where he addresses the problem of errors in proofs and recommends a pattern of writing proofs which increases the probability of the proof being valid. Interestingly, his proofs resemble programs. And while Lamport is urging people to make proofs more like programs, the literate programming folks are urging us to write programs that are more like prose. Both are advocating complementary modes of validation, adding machine-like validation to prosaic proofs and adding prosaic explanations to machine instructions.

Moore’s law and software bloat

Wednesday, January 9th, 2008

I ran across an article recently comparing the performance of a 1986 Mac and a 2007 PC. Of course the new machine would totally blow away the old one on a number crunching benchmark, but when it comes to the most mundane benchmarks — time to boot, launch Microsoft Word, open a file, do a search and replace, etc. — the old Mac pulls ahead slightly. Software bloat has increased at roughly the same rate as Moore’s law, making a new machine with new software no better than an old machine with old software in some respects.

The comparisons in the article resonate with my experience. I expect administrative tasks to be quick and number crunching to be slow, and so I’m continually surprised how long routine tasks take and how quickly numerical software runs.