[Post-Mortem] Room Database Migration Gone Wrong during Android App Update

When I started using Room database for App Pause, I knew it was just a matter of time before I messed up the migration one day. I hit the milestone this week after releasing v0.11.0.

I am writing this post-mortem of the incident so that I can review this next year, along with all the other mess ups that I am sure will happen. We need to share our failures just as much as we share our wins.

Problem and its Discovery

My wife calls me to tell: “Hey! Your app isn’t working anymore.” I run to her to check what’s happening. She shows me the screen with a button: “Use for 0 minutes”.

Yup, that was a bug 🐞 I went into the app and found that all values are either 0, false or null for per-app settings, even though she doesn’t use that feature. I immediately knew that my update had gone wrong. I must have made some error in the migration logic that corrupted her data.

Got really heart broken. Just that morning I was celebrating hitting 100 WAU on Threads, and now this! I already have a high uninstall rate and this was just gonna make things worse.

Reducing the blast radius

After mopping for 5 minutes, I started to think if there was any way to stop the roll-out for the new version. Even though I released the new version 12 hours ago, not everyone received it yet.

I remember reading about halting roll-out in one of Google Play’s recent updates, so I searched for it. I found the button on console and pressed it.

Halting reduced the blast radius, but didn’t change the fact that there were damage. 26 devices had received the update already. While it was bad, at least rest of the 140+ devices were now safe.

Root Cause

I started digging into the code to figure out where I went wrong. So here is what I was doing:

@Entity(
    tableName = "per_app_settings",
    ...
)
data class PerAppSettings(
    @PrimaryKey(autoGenerate = true)
    val id: Long = 0,
    ...
    @Embedded
    val appConfig: AppConfiguration? = null
)

data class AppConfiguration(
    val waitingPeriodInSeconds: Int,
    val gracePeriodInMinutes: Int,
    val appUsage: AppUsageSection, <--- New
)

In the recent update, I added a field to the embedded object appConfig called appUsage. I first altered the table to add a column for this new field, and then populated it with a default value (“BARCHART”) since it’s meant to be non-null.

The latter part was the bug and purely my fault for not understanding how embedded fields worked in Room database. I thought since appUsage is non-null, I must put non-null value in it. Wrong!

The database has no concept of AppConfiguration object or AppUsageSection. In it’s eye, appUsage is just a string column which can be null. The Room database uses converters to convert the raw fields into the kotlin objects as required.

Since I used AppConfiguration as nullable embedded field, Room behaves as following:

Flatten all the fields from embedded class into the table.
If EVERY field of embedded class is null, then treat AppConfiguration as null and don’t convert it.
If ANY of the field is non-null, treat it as non-null and convert it to AppConfiguration object.

By updating every appUsage value to BARCHART, I inadvertently ended up marking the embedded field non-null, and when Room tried to convert it to an object, it converted the null into their corresponding default values (0 for integer, false for boolean and null for string).

Fix and Recovery

Once I understood the root cause, the fix was simple.

Don’t update appUsage field for ALL rows.
Just update it for rows that already have non-null AppConfiguration. That’s detectable by querying for non-null grace-period.

But this just fixes the migration for those who hasn’t migrated yet. What about the 26 people whose data got corrupted during migration? I needed a way to fix their broken state.

For them, I ended up bumping my database version again to v6 and wrote a version-5-to-6 migration logic. As part of this, I queried all rows that have grace-period column set to null, but appUsage field set to non-null. These should never happen ideally and indicate a corrupted row. For these I set all fields to null again.

Then I tested the flow few more times and shipped the update (after a 24 hours review period).

After thought

This brings me to the question: what can I do to avoid such a scenario in future?

Few things I can improve:

This bug impacted per-app configuration, even if user didn’t use the feature. Very few user uses this feature, yet everyone ended up suffering from the bug. This shouldn’t be the case. I have decided to make per-app configuration a toggle switch and for those who have it turned off, in future, they won’t be impacted by any bug in per-app configuration.
Even as the author, it took me a while to realize that the pause screen was showing config from per-app configuration. The detection should be simpler. On pause screen, I should show some icon to differentiate configuration from per-app vs dashboard.
It takes 3 clicks to reach per-app configuration. I should make that easier.
I had no way to communicate to the impacted users that I am aware about the bug, fix is coming and there is a workaround. I am gonna explore if there is any way to push message to my app’s home screen from Firebase (to apps on a particular version).
I learnt a lot from this mistake. That’s the only consolation I can give to myself at the moment. I wouldn’t be surprised if I lose some users due to this issue and that sucks. Took me a long time to hit 100 WAU.

Oh well. Mistakes are part of the journey. Let’s just hope I don’t repeat the same mistake again.