Éibhear/Gibiris

As a designer of back-end IT systems, I regard error management and error reporting as something to consider at the start, rather than at the end.

Some years ago1, I designed a file-handling system, where we identified a little over 100 different error scenarios to manage.

The system acted as a file-movement interface between our internal system and an externally-hosted service. Between taking files off the end of one pipe and placing them onto another pipe, there was a need to perform rigorous integrity-checks on the files, and then specific transformations on the contents. Quite a number of potential points of failure arose, and if any one were to occur, we wanted to make sure we knew which one, so that the response from the support team would be appropriate.

Management of error conditions was fairly easy. Depending on the error, we needed to know whether the user was to be informed (the user being the business admins of the service within the organisation, and the answer being "sometimes"), whether the support staff were to be informed ("always") and where to place the file (and its associated artefacts). Being clear to all concerned on what errors we were dealing with was the interesting challenge.

As is normal, we devised a look-up file which would contain the following information on each potential error:

  • A code uniquely identifying the error;
  • A directive on what to do with the file when the error presented (record the error and continue processing the file, abort processing that file, abort processing of all files, etc.);
  • A message describing the error, in English;
  • A flag for whether to inform the user of the error; and
  • A flag for whether to inform the IT Support department ("Y" in all the identified scenarios at this time, but potentially "N" for new, future errors to be managed).

The code we developed would identify the error by its error code, and the look-up file would be used to determine the response. In the notification e-mail, the error would be expressed using the human-language message, and not the internal error code. However, the log files recording the processing activity would, for the sake of brevity, record the error code, and not the English message.
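
To make that concrete, here is a minimal sketch in Python of how such a look-up might be loaded and used. It is not the code from the original system: the CSV layout, the column names, the directive values (ABORT_FILE, CONTINUE), the wording of the messages, the helper functions, and the choice of which errors the user hears about are all assumptions made for illustration.

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical look-up file contents. The columns mirror the five items
# listed above; the names and values are assumptions, not the real file.
LOOKUP_CSV = """code,directive,message,inform_user,inform_support
IN_FILE_DECOMPRESSING,ABORT_FILE,The inbound file could not be decompressed.,Y,Y
IN_FILE_INVALID_SIG,ABORT_FILE,The cryptographic signature of the inbound file is invalid.,N,Y
OUT_FILE_LINE_COUNT,CONTINUE,The line count of the outbound file could not be determined.,N,Y
"""


@dataclass(frozen=True)
class ErrorEntry:
    code: str             # unique, readable error code
    directive: str        # what to do with the file
    message: str          # English description, used in e-mails
    inform_user: bool     # notify the business admins?
    inform_support: bool  # notify IT Support?


def load_lookup(text: str) -> dict[str, ErrorEntry]:
    """Parse the look-up text into a dictionary keyed by error code."""
    entries = {}
    for row in csv.DictReader(io.StringIO(text)):
        entries[row["code"]] = ErrorEntry(
            code=row["code"],
            directive=row["directive"],
            message=row["message"],
            inform_user=row["inform_user"] == "Y",
            inform_support=row["inform_support"] == "Y",
        )
    return entries


def log_line(entry: ErrorEntry) -> str:
    """Log files record the short code, not the English message."""
    return f"ERROR {entry.code}"


def notification_body(entry: ErrorEntry) -> str:
    """E-mails carry the English message, not the code."""
    return entry.message


def recipients(entry: ErrorEntry) -> list[str]:
    """Decide who receives the notification e-mail for this error."""
    result = []
    if entry.inform_support:
        result.append("support")
    if entry.inform_user:
        result.append("user")
    return result
```

With this sample data, the log line for an invalid signature would carry only IN_FILE_INVALID_SIG, while the e-mail would carry only the English sentence, and it would go to the support staff alone.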

For what follows in this post, it's important to re-state the following: the notifications sent to the user and the IT Support department would not contain the error code, only the message. If, in designing the system, we have done our jobs properly, the e-mail notification should be sufficient to inform the reader of the specific error condition, and in the vast majority of cases, the appropriate response would naturally follow without further investigation. The population of log files was deemed prudent, though, in case something completely unexpected arises: but they should not need to be consulted except when all other options have been exhausted.

As a help to those reviewing the log file, I determined that the error code itself should attempt to be readable. Therefore we devised a format that had two benefits: it would allow those reading the code to guess quickly what the nature of the error was, and it would allow for the easy addition of new codes should new error scenarios arise. The format is a little like:

[IN|OUT]_<OBJ>_<ERROR>

The first part says whether we're dealing with a file coming in to our internal system, or heading out from it; the second part says what the file type is (i.e. a payload file, or one of the accompanying integrity-affirming files); and the third says which error has been triggered. Thus:

  • IN_FILE_DECOMPRESSING tells us that there was an error decompressing the inbound file;
  • IN_FILE_INVALID_SIG tells us that the cryptographic signature for the inbound file is invalid;
  • OUT_FILE_LINE_COUNT declares that we could not determine the line-count of the outbound file we're processing;

and so on.
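
A code in this format can also be pulled apart mechanically, which is part of what makes it easy to extend. The following Python sketch shows one way that might be done. It assumes that the object part is a single token with no underscores (as in FILE) and that everything after it is the error part; the names CODE_PATTERN, ParsedCode and parse_error_code are hypothetical.

```python
import re
from typing import NamedTuple

# Direction, then a single-token object, then one or more
# underscore-separated tokens naming the error.
# Assumption: the object part never contains an underscore.
CODE_PATTERN = re.compile(r"(IN|OUT)_([A-Z0-9]+)_([A-Z0-9]+(?:_[A-Z0-9]+)*)")


class ParsedCode(NamedTuple):
    direction: str  # "IN" or "OUT"
    obj: str        # the file type, e.g. "FILE"
    error: str      # the error that was triggered


def parse_error_code(code: str) -> ParsedCode:
    """Split a readable error code into its three parts."""
    match = CODE_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"not a valid error code: {code!r}")
    return ParsedCode(*match.groups())
```

Under those assumptions, IN_FILE_INVALID_SIG splits into the direction IN, the object FILE and the error INVALID_SIG, while something like ORA-600 is rejected with a ValueError.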

Fast-forward to a few weeks before we go live, and we present our system-nearing-test-completion to some of the IT support staff. This is so that they are familiar with how the system was intended to work.

Over a number of sessions, we presented the major features of the system, broken into sessions on the implementation of the business requirements and sessions on the non-functional implementations. One of the latter sessions was on error handling.

A comment during the session had me quite puzzled. One of the attendees decried the format of the error codes, claiming that there were too many elements to it. The expressed complaint was that the errors didn't come with "simple" alpha-numeric codes that they could learn off.

I hemmmmmed.

I hawwwwed.

I agreed that the comment was an interesting one.

I also suggested that at this late stage in the project, going back to devise and implement such a scheme would be costly, and would introduce unacceptable risk to the project and its post-implementation support. But I would keep it in mind.

And I did keep it in mind. I worked hard to figure this one out, for a long time. Eventually, I think I arrived at the core point.

Consider this: Oracle DBAs know what an ORA-600 error2 or an ORA-12154 error3 is. Someone, somewhere (surely?!), knows what to do when the "Excel found unreadable content in…" error that MS Excel often throws up is presented. However, the users don't, and that's because they don't have access to the documentation that tells them what's going on. If our error codes were, in themselves, explicative as regards what the problem is or was, then the user (or, someone new to the support team!) might be overly empowered to resolve the matter him- or herself.

Yes.

So, we need the error subsystem to use an obscure coding so that those who respond often must learn the codes, and their meanings, as part of their jobs, but those who encounter them rarely must reference others for help. Also, wouldn't it be cool for two people to speak with each other using these arcane codes in front of a user not-so-familiar with them?

I have a scheme that could be fool-proof: the ADICEC, pronounced "Adi-ssek".

The "Arbitrarily-Devised, Intentionally Complex, Error Code" can be an attribute of the error, prepared purely for the IT department to maintain a separation from the user, building a false dependency between them.

Here's how to build an ADICEC:

  • It must be long.
  • It must contain elements that are utterly pointless. For example:
    • A client code (because you never know when the department head is going to attempt to "monetise" the system by selling it to others)4;
    • An instance code (because you never know how many instances of this single-requirement-system there will be);
    • A date-time stamp (because we always want to know when these codes have been devised).

      Alternatively, you could use the change-tracker ticket id with which the error code was introduced…

    • A product code (because … never mind)
  • (Finally) a randomly-generated, but seemingly-sequential, alphanumeric value, uniquely identifying the error, but bearing no reference at all to what the error is about.
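
For anyone tempted, the recipe above might be sketched in Python as follows. Everything specific here is an assumption for the sake of illustration: the function name build_adicecs, the default client, instance and product codes, the order of the elements, the dash separators, and the use of a six-digit hexadecimal number as the "seemingly-sequential" value.

```python
import random
from datetime import datetime, timezone


def build_adicecs(readable_codes, client="CL01", instance="I001",
                  product="FMX", seed=None):
    """Map each readable error code to a long, uninformative ADICEC.

    All default values are made up for illustration.
    """
    rng = random.Random(seed)
    # A date-time stamp recording when the codes were devised.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # Sampling without replacement keeps the values unique; zero-padded
    # hex makes them look like sequence numbers, which they are not.
    serials = rng.sample(range(16 ** 6), len(readable_codes))
    return {
        code: f"{client}-{instance}-{stamp}-{product}-{serial:06X}"
        for code, serial in zip(readable_codes, serials)
    }
```

The resulting mapping would live in a file of its own, so that the department can consult it and the user cannot.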

Now, don't be put off by this. We will continue to use a more accessible error code for exception handling in our source code. The ADICEC will be external to the source, and will only be used for the purpose of inflating a department's sense of importance.

So much for K.I.S.S.

Footnotes:

1

Many years ago! I now feel it's time to publish this story!

2

internal error

3

TNS alias look-up error

4

This was a system for a financial services organisation to meet its specific requirements, and was never going to be sold as a product because of the uniqueness of those requirements.

