What is it even like to work on safety-critical software as a developer?

January 1, 2021

When I talk to people about what I do and mention that the software I'm working on is a medical device¹, I often get one of two reactions: either people are fascinated by the impact we can make for patients, or they think that we must be drowning in bureaucracy and are super slow to make any change or ship any new features. Most of the time it's a bit of both. But what is different from working on less safety-critical software? And is there really that much paperwork?

Let's take a look at some of the more notable aspects. Please bear in mind, though, that for each of them you could easily write multiple in-depth blog posts. And I probably will. This is at best an introduction.

Tests and documentation are taken seriously

This probably comes as a surprise to exactly no one. When you have to be confident in how your software behaves, the first step is to know how it should behave. This is surprisingly hard once you reach a certain level of complexity. So you have to write it down. You want to be able to look up the answer to any question about behavior without having to ask another human. There are of course a lot of things to be said about how, where, when - and to which level of detail you really want to go with the descriptions - but these are topics for blog posts of their own.

Once you know how your software should behave, you can start making sure that it actually does so. So you test it. These days that means using automated tests wherever possible, which results in a lot of unit, integration and end-to-end tests that are run continuously.
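
To make this a bit more concrete, here is a minimal sketch of what a test tied to a documented behavior could look like. Everything in it is made up for illustration: the requirement ID "REQ-042", the dose limit and the function validate_dose are assumptions of mine and not taken from any real product.

```python
import unittest

# Hypothetical documented behavior (requirement "REQ-042", invented for this sketch):
# "A dose must be greater than zero and must not exceed the configured maximum."
MAX_DOSE_MG = 500


def validate_dose(dose_mg: float) -> bool:
    """Return True if the dose is within the documented limits."""
    return 0 < dose_mg <= MAX_DOSE_MG


class TestReq042DoseLimit(unittest.TestCase):
    """The class name carries the requirement ID, so each test can be traced back to the spec."""

    def test_dose_at_limit_is_accepted(self):
        self.assertTrue(validate_dose(500))

    def test_dose_above_limit_is_rejected(self):
        self.assertFalse(validate_dose(500.1))

    def test_non_positive_dose_is_rejected(self):
        self.assertFalse(validate_dose(0))
```

Saved to a file, this can be run with python -m unittest. The point is less the test itself than the link back to the written-down behavior: when the documented limit changes, it is obvious which tests have to change with it.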

This may seem like a lot of fairly annoying work to some people, sure. But once you achieve a high level of documentation and testing you can really rely on and build upon them. This makes doing the same for new features both easier and more enjoyable. It definitely is a nice feeling to only have to add documentation and tests for what changed, instead of first having to catch up with a year's worth of missing documentation before you can explain what actually changed.

You think about and spec features before implementing them

What I mean by this is that, similar to how you write the tests first in pure TDD², you want to write the documentation of new features before actually starting to implement things. At first glance this seems like a detail that is too small to be relevant for an introductory post. But it does change the way you work in ways that are noticeable every day.

The idea behind it is, roughly, that once you start changing the concept of what you are currently doing, you suddenly have to keep the documentation, the already existing bits of the implementation and the still to be done bits of the implementation in sync. You also have to keep everyone involved updated about the newest state of the concept. This can get chaotic quickly if you're not very careful - and relying on being careful is something you always want to avoid wherever you can.

So instead you first make a concept, document it and try to anticipate as many aspects as you can. In an ideal world there will be no open questions during implementation. In reality that is of course not achievable. But with practice, you can get this to a point where you can regularly build fairly big features that just end up working as planned.

For your day-to-day work this means that, instead of doing both things at the same time, you basically switch back and forth between phases where you write documentation for future features and phases where you implement the current ones.

Gradual rollout of features is rarely a useful option

A common practice in most modern software development is to build a new feature and then gradually enable it for larger and larger percentages of your user base. One of the reasons for this is that if something goes wrong in production it doesn’t impact as many users and it is also hopefully easier to fix.
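
For readers who haven't seen this in practice, a percentage rollout is often built on a stable hash of the user ID. The following is only a rough sketch of that general idea; the function names and the choice of SHA-256 are my own assumptions and not how any particular feature-flag tool works.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in the range 0..99 for a given feature."""
    # Hashing feature and user together gives each feature its own distribution.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(user_id: str, feature: str, rollout_percent: int) -> bool:
    """Enable the feature for roughly rollout_percent percent of all users."""
    return rollout_bucket(user_id, feature) < rollout_percent
```

Because the bucket only depends on the user and the feature, a user who has the feature at 10% keeps it when the percentage is raised to 50%. That stability is what makes the rollout gradual rather than random.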

Of course it would be highly problematic to pick some patients to essentially beta-test a new feature. Basically it would mean conducting a clinical study - with all the time and effort this requires. Since you need to be sure anyway that your next version is safe and works as intended, you usually want to roll it out as quickly as you can to reduce the length of your overall feedback cycle.

One place where rolling releases do make sense is for operational reasons. For example, there might be releases where you know that some manual upgrade steps are needed or that the customer needs additional training in advance. In those cases, spreading this work out over time probably makes sense.

You can still be flexible and release regularly

You probably already noticed this while reading the previous sections. But I think it's super important to give this aspect its own heading, because it's so common to think otherwise. None of the things I mentioned above say anything about how often you can release things or how big a new feature you can tackle at once.

It is absolutely fine to do small features that build upon each other. You can and should react to new learnings from talking to your users, and iterate on your software. Working on a medical device doesn't have to mean releasing at a glacial pace and being detached from reality in the time between. In fact I would even argue that all the normal benefits of doing releases often and iterating on software also very much apply to safety-critical software.

Besides the objective benefits of releasing regularly, for me personally it's also a big motivation to know that the things we built are out there helping people. I think this is the same for a lot of other people, too - no matter if they are designers, developers, project managers or in any other role.

It's really hard to get to rapid releases of single features though

One of the core aspects that makes continuous releases and delivery possible is extensive automation of most testing activities, combined with gradual rollouts. As we already saw above, gradual rollouts can't really take on the role of beta-testing, so we need to compensate for this loss. Also, in safety-critical software, as the name already suggests, you have a much lower tolerance for bugs, one that is at the moment still very hard to achieve without having humans in the loop somewhere to fill in the gaps.

These two aspects together make it a real challenge - at least with the currently available tools - to get to a point where automated testing gives you the confidence needed across all the aspects of building software. This doesn't mean it's not achievable but for most teams out there it's currently not reality.

Also, it is probably hard to deny that some steps of releasing a new version of medical device software include what you could call paperwork. I don't want to sound like paperwork doesn't exist at all; there is some of it in several corners that cannot be automated away.

So where does this leave us?

Mostly, how we work at my current job is not all that different from the way an organized team would work on any "normal" software, even in our highly regulated environment. But then where is the paperwork for some companies coming from? The reasons are of course hard to know from the outside. But from what I’ve seen and heard so far, one big source of paperwork seems to be all types of verifications: approvals of feature specs, and testing of finished features or of releases. A second one seems to be keeping the references between different documents up-to-date.

But it doesn’t have to be like that. You don't have to drown everyone in paperwork. You can also take all the great capabilities and tools that we now have at our disposal and apply them to the challenges involved in building software that really, really has to work. And this can actually be a lot of fun.

ps: I hope you enjoyed this first post on here. I will continue to write about building safety-critical software using modern software engineering approaches - or describe tools that I think would make doing so way easier if they existed. So if you’re interested in the details on one of the topics I touched on above, I invite you to come back in the future for more.

pps: Also feel free to write to me on Twitter @reddish_flo if you disagree with something or have questions or ideas.

¹ Very roughly, a medical device is any device (including software) that in some way impacts patients with the goal of helping them (see the actual regulation text, Art. 2, for the precise definition). Medical devices are strictly regulated to make sure that they don’t do more harm than good.
² Test-Driven Development is a method for writing code that emphasises writing tests before writing the implementation, with the goal of improving the amount and quality of tests.

Reddish Florian