An interesting bug caused by .NET’s TaskCompletionSource

Introduction

I recently got bit by an interesting bug that I thought was worth writing up. Like all good bugs, it happened because I was fiddling with things I wasn’t familiar with, and as such it became a learning experience.

The context requires some explanation.

I’ve recently been working on a toy MOO in C#, unoriginally called MooSharp. For those unaware, MOOs are an offshoot of MUDs ('multi-user dungeons'), an old style of text-based game which was basically multiplayer interactive fiction. You type commands into a parser to move between rooms, pick up swords and Slay Monsters or whatever.

The project was an excuse for me to try out the Actor pattern in a real app. This is the pattern where you conceptualise your program as a network of small isolated nodes which work by sending messages to each other, in the style of Erlang. Akka.NET is a famous implementation of this in the .NET space, but it felt too heavy for me; I wanted to see how much I could do by myself with some AI assistance.

The bug

After a while, I noticed some unusual behaviour.

My world had three or four rooms seeded from a JSON file on launch. You could move between them just fine with commands like move atrium or go closet.

But - whenever you went into one specific room, the entire game would lock up. The text interface would hang indefinitely and you wouldn’t get pushed any further updates from the engine. A hard refresh of the browser (this was running in Blazor, a web framework) would reset things back to normal, but the bug was completely reproducible and consistent. What on earth?

Naturally, my first thought was that something about the room was causing an exception to be thrown. But there was nothing particularly special about its data. I changed each of its fields independently to some obviously-safe value, and the bug still reproduced. No logs in STDOUT or exceptions thrown, no matter where I put my try-catches.

The only clue I could figure out was this. The room which broke things was the final room specified in the seed .json file.

That is, the file looked like this:

{
  "Rooms": [
    {
      "Name": "atrium",
      "Description": "A beautiful atrium.",
      "Slug": "atrium",
      "ConnectedRooms": ["side-room"]
    },
    {
      "Name": "side-room",
      "Description": "A small side-room for drinking coffee.",
      "Slug": "side-room",
      "ConnectedRooms": ["atrium", "closet"]
    },
    {
      "Name": "closet",
      "Description": "A cramped and dark closet.",
      "Slug": "closet",
      "ConnectedRooms": ["side-room", "dark-world"]
    }
  ]
}

Going into the 'closet' room always broke the game. I confirmed my theory by adding a new dummy room to the end of the file. Entering the closet now worked perfectly fine. But when I then tried to enter the room I’d just added - the final one in the file - it broke.

To call this suspicious would be an understatement. It wasn’t data and it wasn’t a one-in-a-thousand Heisenbug. There was clearly a definite mechanical cause here.

I tore apart the code which seeded the world from the JSON file. Was it setting the exits incorrectly for the last item in the list, causing some kind of secret stack overflow or infinite recursion? I wrote checks for that, and no, it was setting up exits OK. What about the room IDs/slugs? Was one of them broken somehow? But that wouldn’t explain why it was always the last item in the list. I tried shrinking the list down to just one room, and as expected, the game hung immediately.

I’m not ashamed to say that I admitted defeat here for several weeks. This was just a fun toy project, and the bug was giving me very few clues to work with - no logs, no exceptions, just an absence of behaviour. I had a creeping suspicion that it was due to some of the concurrency stuff I’d done in my implementation of the Actor model, but no matter how many times I stepped through things with a debugger, I couldn’t spot anything that felt wrong.

I put the project aside, for a time.

A crack in the slab

I’ve been using AI to write code and debug code for a long time now, and as you might expect, I threw every frontier model I could at this problem. The codebase was small enough that it was trivial to markdown-ify it and paste it into whatever LLM I wanted, but unfortunately, none of them made real progress in figuring out what was happening. To be sure, they came up with many extremely plausible-sounding solutions, suggesting locks and semaphores a-plenty. But their spells, when cast, failed.

Things finally changed with the release of Gemini 3.0 Pro Preview a few weeks ago.

After hearing the wonder stories from other people online, I decided to dust off the codebase and take another crack at things. Gemini, unsurprisingly, failed to identify the cause of the bug when I simply gave it the code. But this time I was more determined; I felt there surely had to be a way to make it easier to diagnose.

Following Gemini’s suggestions, I added boatloads more logging and instrumentation to the app, including hardcoded Debug.WriteLine statements in case my logger was getting stifled. I also manually gave it the action output log from the game itself.

After one or two rounds of this, Gemini abruptly and quite arbitrarily stopped in its tracks. It said it now understood the bug and gave me a one-line fix. I was quite skeptical, of course, but a strange feeling blossomed in my chest. Could this actually be it? Had the computer seen something I’d failed to, all this time?

I made the suggested change, and started the game. And moved from one room to the next. And the next. And then, I typed the command to move to the final room. And hit enter.

And it worked.

The crime scene

Let’s now see the gory details of where my original mistake was.

Below is a copy of the file where the bug lived, sans logging and error handling: take a look and see if the error stands out to you. This is the living heart of the Actor implementation in the project. My players, objects and (importantly) my rooms all inherited from this class. It encapsulated the following behaviour:

You can send messages to this thing.
It will process its messages sequentially forever.
You can get responses from your messages to it.

TState below would be something like RoomDto or ObjectDto.

using System.Threading.Channels;
using Microsoft.Extensions.Logging;

namespace MooSharp;

public interface IActorMessage<in T>
{
    /// The context is the state object that the actor protects.
    Task Process(T context);
}

public abstract class Actor<TState> where TState : class
{
    private readonly Channel<IActorMessage<TState>> _mailbox;
    private readonly ILogger _logger;
    protected readonly TState State;

    protected Actor(TState state, ILoggerFactory loggerFactory)
    {
        State = state;

        // Needed by Post below; one logger per concrete actor type.
        _logger = loggerFactory.CreateLogger(GetType());

        _mailbox = Channel.CreateBounded<IActorMessage<TState>>(100);

        // Start the long-running task that processes messages.
        Task.Run(ProcessMailboxAsync);
    }

    // The main loop for the actor. It runs forever, processing one message at a time.
    private async Task ProcessMailboxAsync()
    {

        await foreach (var message in _mailbox.Reader.ReadAllAsync())
        {
            await message.Process(State);
        }
    }

    public void Post(IActorMessage<TState> message)
    {
        var posted = _mailbox.Writer.TryWrite(message);

        if (!posted)
        {
            _logger.LogWarning("Failed to post message to mailbox");
        }
    }

    public Task<TResult> Ask<TResult>(IRequestMessage<TState, TResult> message)
    {
        Post(message);

        return message.GetResponseAsync();
    }

    public async Task<TResult> QueryAsync<TResult>(Func<TState, TResult> func)
    {
        var message = new RequestMessage<TState, TResult>(state => Task.FromResult(func(state)));

        return await Ask(message);
    }

    public override string? ToString() => State.ToString();
}

/// A message that just performs an action and doesn't return anything.
public class ActionMessage<T> : IActorMessage<T>
{
    private readonly Func<T, Task> _action;
    public ActionMessage(Func<T, Task> action) => _action = action;
    public async Task Process(T context) => await _action(context);
}

// An interface for messages that need to return a value.
public interface IRequestMessage<TState, TResult> : IActorMessage<TState>
{
    Task<TResult> GetResponseAsync();
}

// The implementation uses a TaskCompletionSource to bridge the async gap.
public class RequestMessage<TState, TResult> : IRequestMessage<TState, TResult> where TState : class
{
    private readonly TaskCompletionSource<TResult> _tcs = new();
    private readonly Func<TState, Task<TResult>> _request;

    public RequestMessage(Func<TState, Task<TResult>> request) => _request = request;

    public async Task Process(TState context)
    {
        try
        {
            var result = await _request(context);
            _tcs.SetResult(result);
        }
        catch (Exception ex)
        {
            _tcs.SetException(ex);
        }
    }

    public Task<TResult> GetResponseAsync() => _tcs.Task;
}

I will now reveal the bug. It’s here:

private readonly TaskCompletionSource<TResult> _tcs = new();

The fix was changing it to this:

private readonly TaskCompletionSource<TResult> _tcs = new(TaskCreationOptions.RunContinuationsAsynchronously);

If you have dealt with TaskCompletionSources before, your eyebrows are probably raising right now, though maybe with a slight measure of confusion.

For those unfamiliar, TaskCompletionSource (TCS) is a fairly low-level part of the async/await machinery in C#. The language represents asynchronous operations through the Task type, and while these are usually created for you with helper methods like Task.FromResult or the async machinery itself, you sometimes have to manage them manually when interfacing with older asynchrony patterns.

That’s where TCS comes in. It owns a Task, but lets you arbitrarily say when it’s completed, failed or cancelled. I was using it in the example above to decouple the processing of a message by an Actor from the availability of the message’s result to consumers. I could .SetResult the task in my Process method, and then expose the task itself in GetResponseAsync() so external code could simply await it.
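
To make that concrete, here is a minimal sketch of a TCS turning a callback-style API into something you can await. FetchGreeting is a hypothetical stand-in for an 'older asynchrony pattern'; it is not part of MooSharp or of .NET.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical callback-style API (an assumption, for illustration only):
// it does its work elsewhere and reports back through a callback.
static void FetchGreeting(Action<string> onDone)
{
    Task.Run(() => onDone("hello"));
}

// The TCS owns a Task<string> that stays incomplete until we say otherwise.
var tcs = new TaskCompletionSource<string>(
    TaskCreationOptions.RunContinuationsAsynchronously);

// Complete the task by hand when the callback fires.
FetchGreeting(result => tcs.SetResult(result));

// Consumers simply await the task like any other.
string greeting = await tcs.Task;
Console.WriteLine(greeting); // prints "hello"
```

This has the same shape as RequestMessage above: one side calls SetResult, the other side awaits the Task.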

OK, that’s great and all. But why was my app hanging?

The reason it took me so long to find this bug is because the answer to that question was not in this file at all. To fully understand the bug, we must go to the source of all crimes - Program.cs.

The other half of the puzzle

My Program.cs looked like this:

var builder = WebApplication.CreateBuilder(args);

// ... other stuff ...

var app = builder.Build();

// ... other stuff ...

// The 'world' is my type containing all my rooms, objects and other actors.
var world = app.Services.GetRequiredService<World>();

// Seed the world from the JSON file.
await world.InitializeAsync();

app.Run();

Nothing immediately untoward here.

I won’t share the implementation of world.InitializeAsync, because it’s quite verbose and uninteresting. It loaded the JSON file and then mapped the resulting barebones DTOs into the Actor types I showed previously. That is, we were creating and starting our Actor mailboxes here. To set up exits between them, they naturally sent messages to each other:

                await currentRoomActor.Ask(new RequestMessage<Room, bool>(roomState =>
                {
                    foreach (var exit in exits)
                    {
                        roomState.Exits.Add(exit.Key, exit.Value);
                    }
                    return Task.FromResult(true);
                }));

You now have all the pieces of the puzzle to understand why the app was hanging.

Here is a final clue - changing the last line of Program.cs from app.Run() to await app.RunAsync() also fixed the bug, independently of my change to the TaskCompletionSource line.

await world.InitializeAsync();

await app.RunAsync();

The real answer

N.B. My knowledge here is not perfect. That’s why I wrote the bug in the first place!

By default, TaskCompletionSource runs continuations synchronously.

This means that when some code is awaiting its task and the task gets completed, the code after the await is not scheduled onto another thread. It runs right there, inline, on whichever thread called SetResult. Why? As a performance optimisation. In normal code, this saves (as I understand it) the overhead of queueing the continuation up as a separate piece of work.

So what thread is the continuation going to run on? Well, the TCS is completed when an Actor processes a message from its mailbox. So it’s the Actor’s thread, the one running ProcessMailboxAsync (strictly speaking, whichever thread-pool thread is running that loop at the time).
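
The inline behaviour can be observed directly. The following is a minimal sketch, assuming a plain console app with no SynchronizationContext; the dedicated thread is only a stand-in for the Actor's mailbox loop, and the variable names are made up for this example.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Default options: continuations are allowed to run inline inside SetResult.
var tcs = new TaskCompletionSource<int>();

int completerThreadId = -1;

// Stand-in for the Actor's mailbox loop: a thread that completes the TCS.
var completer = new Thread(() =>
{
    Thread.Sleep(200); // give the await below time to register its continuation
    completerThreadId = Environment.CurrentManagedThreadId;
    tcs.SetResult(42); // the code after the await runs inside this call
});
completer.Start();

await tcs.Task;

// We are now past the await. Which thread are we on?
int continuationThreadId = Environment.CurrentManagedThreadId;

// True means the continuation ran on the thread that called SetResult.
Console.WriteLine(continuationThreadId == completerThreadId);
```

Under those assumptions this should print True: the rest of the program is now running on the completer thread, not on the thread that started it.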

When we were cycling through rooms in world.InitializeAsync, we posted messages to each of them to set up their exits.

When we did this, the thread of execution 'hopped' onto the thread of the first room’s Actor. More specifically: when the Actor called _tcs.SetResult(), because of the default synchronous behaviour, the rest of InitializeAsync (the continuation) executed immediately on the Actor’s thread, inside that SetResult call. It moved to the second room, and the same thing happened again. Another hop, this time onto the second Actor’s thread. It repeated this chain until it reached the final room.

What then?

Well, then we get to here:

await world.InitializeAsync();

app.Run();

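app.Run() is a blocking method: it starts the web host and then blocks the calling thread until the app shuts down. The calling thread, in this case, is the mailbox thread for the last room, which has now been kidnapped to work for the app’s Kestrel server. The poor room’s mailbox thread is trapped there until app.Run completes.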

This has the following immediate implications:

The web app itself will run completely normally. After all, it has a thread working for it!
The last room won’t respond to any messages at all. Its thread for listening to messages is busy elsewhere.
Because it’s never going to respond, whoever sent it the message will never get an answer.
If you’re trying to move into the room, you’ll never get the 'OK, you can move into me' signal.
You will be stuck in limbo forever. The game hangs.
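
The whole chain of events can be reproduced in a few dozen lines. This is a minimal sketch rather than the MooSharp code: the channel of Action delegates stands in for an Actor's mailbox, FakeProgramAsync stands in for Program.cs, and a ManualResetEventSlim stands in for app.Run(). All of those names are assumptions made for the example.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

var mailbox = Channel.CreateUnbounded<Action>();
var shutdown = new ManualResetEventSlim(false);

// Stand-in for ProcessMailboxAsync: one loop, one message at a time.
_ = Task.Run(async () =>
{
    await foreach (var message in mailbox.Reader.ReadAllAsync())
    {
        message();
    }
});

// Stand-in for Program.cs.
async Task FakeProgramAsync()
{
    var tcs = new TaskCompletionSource<bool>(); // default options, as in the bug
    mailbox.Writer.TryWrite(() =>
    {
        Thread.Sleep(100);   // let the await below register first
        tcs.SetResult(true); // the rest of this method runs inside this call
    });
    await tcs.Task;

    // Stand-in for app.Run(): blocks whichever thread got us here.
    // Bounded to five seconds so the demo always exits.
    shutdown.Wait(TimeSpan.FromSeconds(5));
}

_ = FakeProgramAsync();

// A later message to the same mailbox, like a player entering the room.
var second = new TaskCompletionSource<bool>(
    TaskCreationOptions.RunContinuationsAsynchronously);
mailbox.Writer.TryWrite(() => second.SetResult(true));

var winner = await Task.WhenAny(second.Task, Task.Delay(1000));
bool hung = winner != second.Task;
Console.WriteLine(hung ? "second message never processed" : "second message processed");

shutdown.Set(); // release the mailbox thread
```

When the continuation runs inline, the mailbox loop is stuck inside the first message, so the second message sits in the channel unanswered until the stand-in for app.Run() returns.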

With this in mind, you can likely understand why the fixes worked.

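Specifying TaskCreationOptions.RunContinuationsAsynchronously forces the TCS to schedule continuations asynchronously (in practice, queued to the thread pool) instead of running them inline inside SetResult. The Actor’s thread gets straight back to its mailbox, which completely avoids the thread-hijacking.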
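
A small sketch of the difference, again with made-up names: the awaiter below blocks as soon as it resumes (a stand-in for app.Run()), yet the thread calling SetResult is not held up by it.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

var tcs = new TaskCompletionSource<bool>(
    TaskCreationOptions.RunContinuationsAsynchronously);
var release = new ManualResetEventSlim(false);

// An awaiter that blocks as soon as it resumes.
async Task BlockingAwaiterAsync()
{
    await tcs.Task;
    release.Wait(TimeSpan.FromSeconds(5)); // stand-in for app.Run()
}

var awaiter = BlockingAwaiterAsync();

var stopwatch = Stopwatch.StartNew();
tcs.SetResult(true); // only queues the continuation; does not run it here
stopwatch.Stop();

bool returnedPromptly = stopwatch.ElapsedMilliseconds < 1000;
Console.WriteLine(returnedPromptly); // prints "True"

release.Set(); // let the blocked continuation finish
await awaiter;
```

With the default options, the same SetResult call would not return until the blocking wait had finished.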

await app.RunAsync was not a 'real fix'. The await here let the mailbox thread bubble up the stack, finish its work and get back to processing the actor’s messages. But this is a band-aid over the real problem; introducing some other blocking/synchronous call after world.InitializeAsync would’ve brought the bug back in force.

(Note: this doesn’t mean that we shouldn’t be using await app.RunAsync - we should. It’s strictly better than app.Run. I used the synchronous version originally purely out of carelessness.)
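
The band-aid can be sketched in the same style. This is a hedged illustration with hypothetical names: the TCS keeps its buggy default options, but the continuation awaits (a stand-in for await app.RunAsync()) instead of blocking, so the mailbox thread gets handed back.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

var mailbox = Channel.CreateUnbounded<Action>();
var shutdown = new TaskCompletionSource(
    TaskCreationOptions.RunContinuationsAsynchronously);

// Stand-in for ProcessMailboxAsync.
_ = Task.Run(async () =>
{
    await foreach (var message in mailbox.Reader.ReadAllAsync())
    {
        message();
    }
});

// Stand-in for Program.cs, using the awaiting variant.
async Task FakeProgramAsync()
{
    var tcs = new TaskCompletionSource<bool>(); // still the default options
    mailbox.Writer.TryWrite(() =>
    {
        Thread.Sleep(100);   // let the await below register first
        tcs.SetResult(true); // continuation still runs inline here...
    });
    await tcs.Task;
    await shutdown.Task;     // ...but it yields here instead of blocking
}

var fakeProgram = FakeProgramAsync();

// A later message to the same mailbox.
var second = new TaskCompletionSource<bool>(
    TaskCreationOptions.RunContinuationsAsynchronously);
mailbox.Writer.TryWrite(() => second.SetResult(true));

var winner = await Task.WhenAny(second.Task, Task.Delay(2000));
bool processed = winner == second.Task;
Console.WriteLine(processed ? "second message processed" : "second message never processed");

shutdown.SetResult(); // let the fake program finish
await fakeProgram;
```

The continuation still hops onto the mailbox thread, it just does not stay there. Any blocking call placed before the second await would bring the hang back.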

Lessons learned

When Gemini finally pointed out the bug here, I was elated. I’d mulled this nightmare over in my mind for weeks, and getting a real, definitive answer felt impossible. It was the best kind of bug-fix - one where I learned something interesting and concrete that I could carry into the future.

My main lesson was simple. Treat TaskCompletionSource with respect!

There are some types which always demand additional scrutiny when they’re used. Anything that implements IDisposable. Many kinds of Stream. HttpClient.

I foolishly did not consider TCS as one of these types. I hacked together a pattern that seemed to work and walked away self-satisfied. I was peering one layer of abstraction deeper than I usually do, and toying with something low-level; that is not unsafe by itself, but it warrants reading the documentation deeply, or at least searching 'taskcompletionsource things to avoid' on Google.

I’m not going to be too hard on myself, however. I think this was a legitimately tricky bug, for a few reasons.

The genesis was a type exhibiting unusual behaviour when in its default configuration (no RunContinuationsAsynchronously in the constructor) for the sake of a performance optimisation.
There’s no compelling reason prima facie to think the TCS would run continuations synchronously. It makes sense in retrospect, but it’s something you 'just have to know'.
The bug wasn’t caused by the TCS in isolation. It was its interaction with the blocking app.Run() call.
Even then, the bug wouldn’t have happened in the same way for most normal blocking calls. The game only hung because app.Run hijacks the thread in perpetuity.
The behaviour only showing up for the last room in the JSON file was a hell of a red herring. I think looking in the world-seeding logic, and not Actor.cs, would be anyone’s natural response.

Ultimately, getting caught in beartraps like these is how you learn the thorny details of an ecosystem. Better to find them in toy projects than in production.

Of course, it was not really me who found the bug. The ultimate credit goes to Gemini: I doubt I would’ve ever figured this out by myself, unless I randomly read up on TCS at some point. The fact that the model was able to reason what happened just from the code and the logs legitimately gobsmacked me when it happened. It didn’t just display a piece of fairly niche .NET knowledge (how many developers have never even seen TaskCompletionSource in the wild?), but integrated that with its knowledge of the overall codebase to build a coherent and ultimately correct hypothesis.

People can denigrate AI for ultimately being based on statistics, rather than principled reasoning. But behold the dark truth: the easiest way to hit the bullseye is to throw a thousand darts at the board.