Kotlin Coroutines, SupervisorJob, Async, Exceptions and Cancellation

The Kotlin Coroutines framework allows you to write concurrent code that looks almost sequential. However, in some cases, it can be tricky to get this code “right” and even trickier to reason about its correctness.

In this article I’ll deal with the intricacies of the async coroutine builder, the SupervisorJob construct and the cancellation mechanics of the “structured concurrency” paradigm.

Coroutines Failure Propagation

Consider the following class:

import android.util.Log
import kotlinx.coroutines.*

class TestUseCase {

    fun test() {
        val scope = CoroutineScope(Job() + Dispatchers.IO)
        scope.launch {
            repeat(2) { i ->
                // doWork(0) throws, so the first await() rethrows that exception
                val result = async { doWork(i) }.await()
                log("work result: ${result.toString()}")
            }
        }
    }
    
    // Simulates a unit of work that fails on the first iteration only
    private suspend fun doWork(iteration: Int): Int {
        delay(100)
        log("working on $iteration")
        return if (iteration == 0) throw WorkException("work failed") else 0
    }

    private fun log(msg: String) {
        Log.i("TestUseCase", msg)
    }

    private class WorkException(msg: String) : RuntimeException(msg)

}

I know that async isn’t required in this specific case. However, I intentionally wanted to use async because that’s where things get very complex and that’s the behavior I want to explore in this article.

As you can see, the call to doWork(Int) (and the subsequent await()) during the first iteration of the repeat loop throws an exception. Since I don’t catch that exception, it crashes the application. That’s not good, obviously.

To avoid the crash, I need to wrap the await() call in a try-catch block:

fun test1() {
    val scope = CoroutineScope(Job() + Dispatchers.IO)
    scope.launch {
        repeat(2) { i ->
            try {
                val result = async { doWork(i) }.await()
                log("work result: ${result.toString()}")
            } catch (t: Throwable) {
                log("caught exception: $t")
            }
        }
    }
}

Unfortunately and surprisingly, even though I explicitly catch the exception, the failure is still propagated to the parent Job. The parent Job then fails, and the app still crashes.
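
To observe this outside of Android, here is a minimal plain-JVM sketch. It is not part of the original test class: the CoroutineExceptionHandler, the Outcome holder and the function name catchAroundAwait are assumptions I introduce only to record where the exception ends up.

```kotlin
import kotlinx.coroutines.*
import java.util.concurrent.atomic.AtomicReference

class WorkException(msg: String) : RuntimeException(msg)

// Hypothetical holder for what the sketch observes.
data class Outcome(
    val caughtAroundAwait: Boolean,
    val handlerSaw: Throwable?,
    val parentCancelled: Boolean
)

fun catchAroundAwait(): Outcome = runBlocking {
    val handlerSaw = AtomicReference<Throwable?>(null)
    // The handler is an observation aid (my assumption), not a fix.
    val handler = CoroutineExceptionHandler { _, e -> handlerSaw.set(e) }
    val scope = CoroutineScope(Job() + Dispatchers.Default + handler)
    var caught = false
    val parent = scope.launch {
        try {
            async<Int> {
                delay(100)
                throw WorkException("work failed")
            }.await()
        } catch (t: Throwable) {
            caught = true // the exception is caught here...
        }
    }
    parent.join()
    // ...but the parent coroutine has been failed by its async child anyway.
    Outcome(caught, handlerSaw.get(), parent.isCancelled)
}

fun main() {
    println(catchAroundAwait())
}
```

In this sketch the catch block runs, yet the launched coroutine still ends up cancelled and its failure is still reported as uncaught, which is what crashes an Android app when no handler is installed.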

That’s one of the most bizarre and counter-intuitive aspects of the Coroutines framework. Others brought this issue up soon after “structured concurrency” landed in Coroutines in 0.26.0, so, if you want to see the bigger picture, read this discussion and this one too.

Using SupervisorJob to Avoid Crashes

Since I can’t stop the exception thrown in async from propagating “up the stack”, I need to instruct the “upper” Job to remain active even if one of its children fails. For that purpose, the Coroutines framework provides a special type of Job: SupervisorJob.

My first attempt looked like this:

fun test2() {
    val scope = CoroutineScope(SupervisorJob() + Dispatchers.IO)
    scope.launch {
        repeat(2) { i ->
            try {
                val result = async { doWork(i) }.await()
                log("work result: ${result.toString()}")
            } catch (t: Throwable) {
                log("caught exception: $t")
            }
        }
    }
}

The intuition behind this approach is something like: “I don’t mind if all the Jobs in the hierarchy fail, but I want this cascading failure to stop at the Scope level”. Unfortunately, the Coroutines framework doesn’t work according to my naive intuition, and just making the Scope’s Job a supervisor doesn’t resolve the crash problem.

This approach doesn’t work for two reasons:

  • SupervisorJob’s “magic” applies only to its immediate children
  • Uncaught exceptions from launch are propagated to the exception handler regardless of whether the parent is a supervisor or not
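
Both points can be checked with a small plain-JVM sketch. The handler, the SupervisorOutcome holder and the function name supervisorAtScopeLevel are hypothetical names that I add for observation only, and a nested launch stands in for async as the failing grandchild.

```kotlin
import kotlinx.coroutines.*
import java.util.concurrent.atomic.AtomicReference

class WorkException(msg: String) : RuntimeException(msg)

// Hypothetical holder for what the sketch observes.
data class SupervisorOutcome(
    val childCancelled: Boolean,
    val handlerSawWorkException: Boolean,
    val scopeStillActive: Boolean
)

fun supervisorAtScopeLevel(): SupervisorOutcome = runBlocking {
    val seen = AtomicReference<Throwable?>(null)
    val handler = CoroutineExceptionHandler { _, e -> seen.set(e) }
    val scope = CoroutineScope(SupervisorJob() + Dispatchers.Default + handler)
    val child = scope.launch {
        // Grandchild of the SupervisorJob: its failure cancels `child` as usual,
        // because supervision covers only the supervisor's immediate children.
        launch { throw WorkException("work failed") }
        delay(1_000)
    }
    child.join()
    val outcome = SupervisorOutcome(
        childCancelled = child.isCancelled,
        // The failed `child` still reports its exception as uncaught.
        handlerSawWorkException = seen.get() is WorkException,
        // The supervisor itself survives the failure of its direct child.
        scopeStillActive = scope.isActive
    )
    scope.cancel()
    outcome
}

fun main() {
    println(supervisorAtScopeLevel())
}
```

The supervisor does keep the Scope alive, but that doesn’t help: the coroutine started with launch is still failed by its own child, and its exception still reaches the uncaught exception handler.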

Alright, let’s move SupervisorJob one level down the hierarchy to make sure that it’s the “parent” of async:

fun test3() {
    val scope = CoroutineScope(Job() + Dispatchers.IO)
    scope.launch(SupervisorJob()) {
        repeat(2) { i ->
            try {
                val result = async { doWork(i) }.await()
                log("work result: ${result.toString()}")
            } catch (t: Throwable) {
                log("caught exception: $t")
            }
        }
    }
}

Surprisingly, even after this change, the behavior is the same and the application crashes.

The reason this solution doesn’t work is “simple”: the Job object that I pass as an argument to launch isn’t going to be the Job associated with that coroutine, but its parent. This tricky point is discussed in this article by Roman Elizarov.

So, even though I’d intuitively expect the SupervisorJob that I pass into launch to become the parent of async, in practice it’ll become its grandparent. And since SupervisorJob’s “magic” applies only to its immediate children, it doesn’t help me to prevent the crash in this configuration.
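
The parent relationship is easy to verify directly. The following is a minimal sketch; the function name jobArgumentBecomesParent is hypothetical, and it inspects the hierarchy through the public Job.children property.

```kotlin
import kotlinx.coroutines.*

// Returns (launched Job is the passed Job, launched Job is a child of the passed Job).
fun jobArgumentBecomesParent(): Pair<Boolean, Boolean> {
    val scope = CoroutineScope(Job() + Dispatchers.Default)
    val supervisor = SupervisorJob()
    val launched = scope.launch(supervisor) { delay(1_000) }

    // launch creates a new Job for the coroutine...
    val isSameJob = launched === supervisor
    // ...and attaches it as a child of the Job found in the context argument.
    val isChildOfSupervisor = supervisor.children.contains(launched)

    launched.cancel()
    return isSameJob to isChildOfSupervisor
}

fun main() {
    val (isSameJob, isChild) = jobArgumentBecomesParent()
    println("same job: $isSameJob, child of the passed job: $isChild")
}
```

Any coroutine started inside that launch block is therefore a grandchild of the SupervisorJob that was passed in.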

The solution is evident, then: let’s move SupervisorJob even deeper down the hierarchy:

fun test4() {
    val scope = CoroutineScope(Job() + Dispatchers.IO)
    scope.launch {
        repeat(2) { i ->
            try {
                // the standalone SupervisorJob is now the direct parent of async
                val result = async(SupervisorJob()) { doWork(i) }.await()
                log("work result: ${result.toString()}")
            } catch (t: Throwable) {
                log("caught exception: $t")
            }
        }
    }
}

Now, finally, the app won’t crash on the first failure of doWork(Int) and I’ll get to see the result of the second invocation.

It’s important to note that even though the failure of async won’t be propagated “up the stack”, the exception will still be thrown. Therefore, it’s still mandatory to have try-catch around await() to handle this exception.
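
A runnable plain-JVM variant of this configuration, as a sketch: the events list and the function name standaloneSupervisor are hypothetical, and println replaces Log. It catches WorkException specifically, to show that await() rethrows the original failure.

```kotlin
import kotlinx.coroutines.*

class WorkException(msg: String) : RuntimeException(msg)

suspend fun doWork(iteration: Int): Int {
    delay(100)
    return if (iteration == 0) throw WorkException("work failed") else 0
}

fun standaloneSupervisor(): List<String> = runBlocking {
    val events = mutableListOf<String>()
    val scope = CoroutineScope(Job() + Dispatchers.Default)
    val outer = scope.launch {
        repeat(2) { i ->
            try {
                // SupervisorJob is the direct parent of async, so the failure
                // is not propagated to the enclosing launch.
                val result = async(SupervisorJob()) { doWork(i) }.await()
                events += "work result: $result"
            } catch (e: WorkException) {
                // await() still rethrows, so the try-catch remains necessary.
                events += "caught exception: ${e.message}"
            }
        }
    }
    outer.join()
    events + "outer cancelled: ${outer.isCancelled}"
}

fun main() {
    standaloneSupervisor().forEach(::println)
}
```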

Cancellation with SupervisorJob

The implementation in the test4() method from the previous section doesn’t crash the app, but it has one downside: by overriding the parent of async with a standalone SupervisorJob, I broke “structured concurrency” cancellation mechanics.

Structured concurrency in Coroutines assumes a hierarchy of Jobs in the form of a tree, with Scope’s Job as a root, and coroutines’ Jobs as nodes. Then, cancellation of any Job in that tree is propagated down to its sub-tree automatically. However, since I made the parent of async a standalone SupervisorJob, cancellation of the upper-level Job or the entire Scope won’t affect this logic anymore.

The following code demonstrates this issue:

fun test4WithCancellation() {
    val scope = CoroutineScope(Job() + Dispatchers.IO)
    val job = scope.launch {
        repeat(2) { i ->
            try {
                val result = async(SupervisorJob()) { doWork(i) }.await()
                log("work result: ${result.toString()}")
            } catch (t: Throwable) {
                log("caught exception: $t")
            }
        }
    }

    scope.launch {
        delay(50)
        job.cancel()
    }
}

If you actually run the above code, then, despite the cancellation, you’ll see log output from within both invocations of doWork(). That’s not good, unless you intentionally want to break “structured concurrency” in this case.
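
The same effect can be reduced to a single flag. This is a sketch under my own assumptions: the names workSurvivesCancellation, started and finished are hypothetical, and a CompletableDeferred replaces the fixed 50 ms delay so that cancellation always happens while the work is in flight.

```kotlin
import kotlinx.coroutines.*
import java.util.concurrent.atomic.AtomicBoolean

// Cancels the outer coroutine while the async work is running and reports
// whether that work nevertheless ran to the end.
fun workSurvivesCancellation(): Boolean = runBlocking {
    val finished = AtomicBoolean(false)
    val started = CompletableDeferred<Unit>()
    val scope = CoroutineScope(Job() + Dispatchers.Default)
    val outer = scope.launch {
        async(SupervisorJob()) {
            started.complete(Unit)
            delay(200)
            finished.set(true)
        }.await()
    }
    started.await()        // the work is in flight now
    outer.cancelAndJoin()  // cancels only the outer coroutine
    delay(400)             // give the detached work time to finish
    finished.get()
}

fun main() {
    println("work finished despite cancellation: ${workSurvivesCancellation()}")
}
```

Since the async coroutine hangs off a standalone SupervisorJob, cancelling the outer Job never reaches it.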

To fix the cancellation mechanism and still keep SupervisorJob’s functionality, I can use the following trick:

fun test5() {
    val scope = CoroutineScope(Job() + Dispatchers.IO)
    scope.launch {
        repeat(2) { i ->
            try {
                // make the current coroutine's Job the parent of the new SupervisorJob
                val result = async(SupervisorJob(coroutineContext[Job])) { doWork(i) }.await()
                log("work result: ${result.toString()}")
            } catch (t: Throwable) {
                log("caught exception: $t")
            }
        }
    }
}

Here I basically manually restore the hierarchy of Jobs by grabbing the existing “parent” Job from CoroutineContext and making it the parent of SupervisorJob.
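
Here is a sketch that checks two things about this trick; all function and variable names in it are hypothetical. The first function verifies that cancellation reaches the async work again. The second one probes a caveat: as far as I understand the library, a SupervisorJob created this way has no body and stays active until somebody completes or cancels it, so the enclosing coroutine keeps waiting for this child.

```kotlin
import kotlinx.coroutines.*
import java.util.concurrent.atomic.AtomicBoolean

// Cancels the outer coroutine mid-flight and reports whether the work still finished.
fun workSurvivesCancellation(): Boolean = runBlocking {
    val finished = AtomicBoolean(false)
    val started = CompletableDeferred<Unit>()
    val scope = CoroutineScope(Job() + Dispatchers.Default)
    val outer = scope.launch {
        async(SupervisorJob(coroutineContext[Job])) {
            started.complete(Unit)
            delay(200)
            finished.set(true)
        }.await()
    }
    started.await()
    outer.cancelAndJoin() // propagates: outer -> SupervisorJob -> async
    delay(400)
    finished.get()
}

// Without any cancellation: does the outer coroutine complete by itself?
fun outerCompletesOnItsOwn(): Boolean = runBlocking {
    val scope = CoroutineScope(Job() + Dispatchers.Default)
    val outer = scope.launch {
        async(SupervisorJob(coroutineContext[Job])) { 0 }.await()
    }
    // join() only returns once all children, including the SupervisorJob, are done.
    val completed = withTimeoutOrNull(500) { outer.join() } != null
    scope.cancel() // clean up the SupervisorJob that is still active
    completed
}

fun main() {
    println("work finished despite cancellation: ${workSurvivesCancellation()}")
    println("outer completed on its own: ${outerCompletesOnItsOwn()}")
}
```

If the second check reports false in your setup as well, the manually linked SupervisorJob has to be completed explicitly once the work is done.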

However, the better and more idiomatic alternative would be this:

fun test6() {
    val scope = CoroutineScope(Job() + Dispatchers.IO)
    scope.launch {
        supervisorScope {
            repeat(2) { i ->
                try {
                    val result = async { doWork(i) }.await()
                    log("work result: ${result.toString()}")
                } catch (t: Throwable) {
                    log("caught exception: $t")
                }
            }
        }
    }
}

In this case, I wrap the execution in a full-blown scope which also provides supervision. Since this scope is the direct parent of the async block, supervision will actually work here.
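
To check both properties of supervisorScope, failure isolation and intact cancellation, here is a plain-JVM sketch. The function names isolatesFailure and propagatesCancellation are hypothetical, and the inline async bodies stand in for doWork.

```kotlin
import kotlinx.coroutines.*
import java.util.concurrent.atomic.AtomicBoolean

class WorkException(msg: String) : RuntimeException(msg)

// Failure isolation: the first async fails, the second one still runs,
// and the outer coroutine completes normally.
fun isolatesFailure(): List<String> = runBlocking {
    val events = mutableListOf<String>()
    val scope = CoroutineScope(Job() + Dispatchers.Default)
    val outer = scope.launch {
        supervisorScope {
            repeat(2) { i ->
                try {
                    val result = async {
                        delay(50)
                        if (i == 0) throw WorkException("work failed") else i
                    }.await()
                    events += "result $result"
                } catch (e: WorkException) {
                    events += "caught ${e.message}"
                }
            }
        }
    }
    val completed = withTimeoutOrNull(2_000) { outer.join() } != null
    events + "completed=$completed cancelled=${outer.isCancelled}"
}

// Cancellation: cancelling the outer coroutine stops the work inside supervisorScope.
fun propagatesCancellation(): Boolean = runBlocking {
    val finished = AtomicBoolean(false)
    val started = CompletableDeferred<Unit>()
    val scope = CoroutineScope(Job() + Dispatchers.Default)
    val outer = scope.launch {
        supervisorScope {
            async {
                started.complete(Unit)
                delay(200)
                finished.set(true)
            }.await()
        }
    }
    started.await()
    outer.cancelAndJoin()
    delay(400)
    !finished.get()
}

fun main() {
    isolatesFailure().forEach(::println)
    println("cancellation propagated: ${propagatesCancellation()}")
}
```

Unlike a manually created SupervisorJob, supervisorScope completes by itself once its children are done, so the outer coroutine can finish too.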

Conclusion

I’ve said it before and I’ll say it again: Coroutines is the most complex concurrency framework I’ve ever worked with. I’ve already spent weeks learning it, but every time I think I’ve finally got it all figured out, there is a new gotcha waiting around the corner.

The motivation for this specific post came from a response to my earlier post where I argued that concurrency frameworks are overrated. One of the readers decided to provide an alternative implementation using Coroutines, and that implementation used SupervisorJob incorrectly. I didn’t notice this bug. It was one of the readers, Roman, who spotted it and also provided code examples that I modified for the purpose of this article. Big shoutout to Roman!

What really concerns me is that this gotcha of wrapping async with try-catch is the very first thing that Dmytro Danylyk described in his outstanding article about Coroutines. I had read Dmytro’s article at least three times before I saw the aforementioned buggy code, but I still didn’t notice the bug. Maybe I’m stupid, or my brain has a special limitation in the context of Coroutines. However, the more probable explanation seems to be that Coroutines is just too complex for a general-purpose concurrency framework. So, be careful if you use this framework in your applications, especially if you work on high-reliability systems.

I hope that this will be the last time I get caught off-guard by Coroutines (yeah, sure) and I hope that this write-up (which, to be honest, I wrote mainly to explain things to myself) was helpful to you as well. As usual, you can leave your comments and questions below.


6 comments on "Kotlin Coroutines, SupervisorJob, Async, Exceptions and Cancellation"

  1. It seems to me that coroutines are a great way to write (and especially read) asynchronous code, but the complexities of error handling in structured concurrency are their Achilles’ heel. It really is a mess – or maybe I just have not found a clear explanation of it yet.

  2. Hi Vasily, I like the article very much, really. I have only one comment: “by overriding the parent of async with a standalone SupervisorJob, I broke “structured concurrency” cancellation mechanics.” I don’t think that you’re breaking anything here, you’re just using the cancellation definition for Job and SupervisorJob. I mean, you don’t break anything if it works as it is intended to work, and this is the case: the SupervisorJob cancellation mechanics are the ones that you’re experiencing by applying SupervisorJob, no others, so you’re not breaking anything.
    Congrats for the article!

    • Hey Julian,
      I see what you mean. You’re basically saying that if it works as expected, I shouldn’t call it “a break”. However, in this case, I think it’s appropriate.
      Coroutines provide support for structured concurrency, but these are still two different concepts. Therefore, you can use coroutines without structured concurrency (though it requires more work). In this case, by breaking the chain of Jobs, I indeed broke the predicate of structured concurrency in that area of my code. Therefore, I still think my phrasing is accurate.
      Would you agree with this explanation?

      • Hey, thanks for your answer!
        I agree with this explanation, I mean, maybe I assume that speaking about structured concurrency you were speaking about Kotlin coroutines, but that’s not the case. Thanks again!

  3. I replaced async(SupervisorJob()) with async(Job()) and still get the same behaviour: the crash is handled. Can anyone explain?

    • Job() creates an instance, which is not linked to the rest of the jobs tree. Which means that there is nothing to propagate an error to. Job(coroutineContext[Job]) is the “proper” way.
