
Unconventional Deadlock Fix Inspired by XCTest's "Expectation"

by Misha K., June 12th, 2023

Too Long; Didn't Read

Firebase Crashlytics has a feature that detects a crash during app startup. The crash info will hopefully be sent to the server before the crash happens again. The name of that feature is "urgent mode." In my case, Firebase was interrupting the launch process by invoking the function `regenerateInstallIDIfNeededWithBlock`.


Have you ever wondered what to do when the very tool you use to predict and handle crashes, Firebase Crashlytics, encounters a problem itself? You might think it's an impasse, but don't worry: we will do some detective work in this post. I came across a peculiar deadlock within Firebase Crashlytics' urgent mode. After some deep digging, I found an unexpected yet efficient solution, drawing inspiration from an unlikely place: XCTest's "expectation" implementation.

The Final Frontier

Let's begin by explaining what "urgent" mode is. We spend countless hours testing and fixing bugs before deployment. But then the worst happens, and your app crashes upon launch! Not a single request gets sent to inform you about the incident. So how do you find the reason for the crash if you cannot reproduce it?

Firebase Crashlytics comes to the rescue! It has a feature that detects a crash during app startup. If that happens, Crashlytics pauses initialization on the Main thread on the next launch, so the crash info will hopefully be sent to the server before the crash happens again. The name of that feature is "urgent mode."

Discovering The Culprit

Let's jump back to the issue at hand. I observed my app was taking an unusually long time to launch. To dig into this, I used lldb to pause my app and examined the issue in detail. As I went through the stack, it didn't take long to spot the culprit: Firebase Crashlytics was interrupting the launch process.

The function regenerateInstallIDIfNeededWithBlock had appeared on the Main thread. This was odd because if you use a symbolic breakpoint, you'll notice that regenerateInstallIDIfNeededWithBlock is normally invoked from a background thread, not the Main thread. This unusual shift was a clear red flag that the expected process flow was off.

Use the source, Luke

Now, let's unravel this deadlock situation. A close examination reveals that regenerateInstallID is preceded by prepareAndSubmitReport, which is itself preceded by processExistingActiveReportPath.

Let's dive into processExistingActiveReportPath to understand it better.

- (void)processExistingActiveReportPath:(NSString *)path
                    dataCollectionToken:(FIRCLSDataCollectionToken *)dataCollectionToken
                               asUrgent:(BOOL)urgent {
  FIRCLSInternalReport *report = [FIRCLSInternalReport reportWithPath:path];

  if (![report hasAnyEvents]) {
    // call is scheduled to the background queue
    [self.operationQueue addOperationWithBlock:^{
      [self.fileManager removeItemAtPath:path];
    }];

    return;
  }

  if (urgent && [dataCollectionToken isValid]) {
    // called from the Main thread
    [self.reportUploader prepareAndSubmitReport:report
                            dataCollectionToken:dataCollectionToken
                                       asUrgent:urgent
                                 withProcessing:YES];
    return;
  }

  // ... (the non-urgent path is omitted here)
}

The "urgent" parameter determines whether the code will run in the background or on the Main thread. Submitting a report from the Main thread seems like expected behavior.

But why does it halt?

regenerateInstallID is waiting for the semaphore to be signaled, which should happen when [self.installations installationIDWithCompletion] completes. The implementation of regenerateInstallID looks like this (for the sake of brevity, the code is simplified):

- (void)regenerateInstallID {
  dispatch_semaphore_t semaphore = dispatch_semaphore_create(0);

  // This runs Completion async, so wait a reasonable amount of time for it to finish.
  [self.installations
      installationIDWithCompletion:^(void) {
        dispatch_semaphore_signal(semaphore);
      }];

  intptr_t result = dispatch_semaphore_wait(
      semaphore, dispatch_time(DISPATCH_TIME_NOW, FIRCLSInstallationsWaitTime));
}

To figure out why the completion does not fire, I dug down the chain of calls made by installationIDWithCompletion and did not find any path that could skip the completion.

The real issue revealed itself when I noticed the completion was wrapped in a FBLPromise.then {} block. This block is dispatched asynchronously on the promise's default dispatch queue, which is the main queue, as shown here:

@implementation FBLPromise (ThenAdditions)

- (FBLPromise *)then:(FBLPromiseThenWorkBlock)work {
  // Where defaultDispatchQueue is gFBLPromiseDefaultDispatchQueue by default
  return [self onQueue:FBLPromise.defaultDispatchQueue then:work];
}

@end

static dispatch_queue_t gFBLPromiseDefaultDispatchQueue;

+ (void)initialize {
  if (self == [FBLPromise class]) {
    gFBLPromiseDefaultDispatchQueue = dispatch_get_main_queue();
  }
}

So, the deadlock essentially boils down to this: the Main thread is blocked on a semaphore, waiting for the completion handler to signal it, but the completion handler is itself stuck in the main queue, waiting for the Main thread to become free and run the block submitted with dispatch_async. This circular dependency was causing our app launch to stall.
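
The same circular wait can be reproduced outside of Firebase. Here is a minimal Swift sketch, assuming it runs as a plain command-line script whose top-level code executes on the main thread; the `Flag` class and the `waitOnMainWithSemaphore` function are hypothetical names used only for this illustration and are not part of the SDK:

```swift
import Foundation

// Hypothetical helper: a tiny mutable box shared with the dispatched block.
final class Flag: @unchecked Sendable { var value = false }

// Blocks the calling thread on a semaphore that can only be signaled
// by a block queued on the main queue.
func waitOnMainWithSemaphore(timeout: TimeInterval) -> (timedOut: Bool, callbackRan: Bool) {
    let semaphore = DispatchSemaphore(value: 0)
    let callbackRan = Flag()

    // The "completion handler": it needs the main thread to run.
    DispatchQueue.main.async {
        callbackRan.value = true
        semaphore.signal()
    }

    // The main thread is parked here, so the block above cannot start.
    let result = semaphore.wait(timeout: .now() + timeout)
    return (result == .timedOut, callbackRan.value)
}

// Called from the main thread: the wait ends by timeout, not by a signal.
let outcome = waitOnMainWithSemaphore(timeout: 0.2)
print(outcome)
```

When called from the main thread, the wait can only end by timing out, because the block never gets a chance to run while the thread is parked on the semaphore. That is the stall observed during launch.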

Searching for the optimal solution

So, what options are we left with?

We could pass a queue to the promise if we wait for completion on the Main thread. However, this approach would require proposing a new interface to FBLPromise.
We could alter the default queue for all promises. This, however, is a risky move that would affect every call in the SDK.

With my preference for containing bug fixes in their local context to avoid introducing new bugs, I chose not to tweak FBLPromise. Instead, I looked for a solution that would be minimal and confined to this particular case.

If only we could execute an async callback on the Main thread while simultaneously waiting on it... Sounds familiar? Well, it should! We do have this capability in XCTest via waitForExpectations.

Here's an example:

// This test will pass
func testExample() throws {
	let testExpectation = expectation(description: "")
	DispatchQueue.main.asyncAfter(deadline: .now() + 0.5) {
		testExpectation.fulfill()
	}
	assert(Thread.isMainThread == true)
	waitForExpectations(timeout: .infinity)
}

Intrigued, I delved deeper into the XCTest framework's source code to understand how it does that trick.

Here's the related piece of code:

func primitiveWait(using runLoop: RunLoop, duration timeout: TimeInterval) {
	let timeIntervalToRun = min(0.1, timeout)

	runLoop.run(mode: .default, before: Date(timeIntervalSinceNow: timeIntervalToRun))
}

Surprisingly, I discovered we could handle dispatched callbacks on the current thread using a nested RunLoop spinner. This seemed like a promising way out of our deadlock.
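
To make the idea concrete outside of XCTest, here is a minimal Swift sketch of such a spinner, assuming it runs on the main thread of a command-line script. The names `Flag`, `spinRunLoop`, and `demo` are hypothetical and exist only for this illustration; the 0.1-second slice mirrors the value in the XCTest snippet above:

```swift
import Foundation

// Hypothetical helper: a tiny mutable box shared with the dispatched block.
final class Flag: @unchecked Sendable { var value = false }

// Runs the current run loop in short slices until `isDone` returns true
// or the timeout elapses. Returns whether the condition was met.
func spinRunLoop(timeout: TimeInterval, until isDone: () -> Bool) -> Bool {
    let deadline = Date(timeIntervalSinceNow: timeout)
    while !isDone() && Date() < deadline {
        // Never run past the overall deadline.
        let slice = min(deadline, Date(timeIntervalSinceNow: 0.1))
        _ = RunLoop.current.run(mode: .default, before: slice)
    }
    return isDone()
}

func demo() -> Bool {
    let completed = Flag()
    // The callback is queued on the main queue, the same queue we wait on.
    DispatchQueue.main.asyncAfter(deadline: .now() + 0.1) {
        completed.value = true
    }
    // Spinning the run loop lets the main queue drain while we wait.
    return spinRunLoop(timeout: 2.0) { completed.value }
}

let finishedInTime = demo()
print(finishedInTime)
```

Unlike a semaphore wait, each pass through the run loop gives the main queue a chance to execute pending blocks, so the callback can run on the very thread that is waiting for it.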

The Fix

To address this deadlock, I adjusted the code to spin the run loop instead of waiting on the semaphore when running on the Main thread. This lets blocks submitted to the main queue with dispatch_async execute while we wait, so the completion handler can run instead of being blocked.

- (void)regenerateInstallID {
  dispatch_semaphore_t semaphore = nil;
  // Set by the completion handler; __block makes it writable inside the block.
  __block BOOL completed = NO;

  BOOL isMainThread = NSThread.isMainThread;
  if (!isMainThread) {
    semaphore = dispatch_semaphore_create(0);
  }

  [self.installations
      installationIDWithCompletion:^(void) {
        NSAssert(NSThread.isMainThread, @"We expect to get a completion on the main thread");
        completed = YES;
        if (!isMainThread) {
          dispatch_semaphore_signal(semaphore);
        }
      }];

  intptr_t result = 0;
  if (isMainThread) {
    // Spin the main run loop so the completion can run while we wait.
    NSDate *deadline = [NSDate
        dateWithTimeIntervalSinceNow:(NSTimeInterval)FIRCLSInstallationsWaitTime / NSEC_PER_SEC];
    while (!completed) {
      NSDate *now = [NSDate date];
      if ([now timeIntervalSinceDate:deadline] > 0) {
        break;  // timed out
      }
      [[NSRunLoop mainRunLoop] runMode:NSDefaultRunLoopMode beforeDate:deadline];
    }
    if (!completed) {
      result = -1;
    }
  } else {  // not on the main thread: a blocking wait is safe
    result = dispatch_semaphore_wait(
        semaphore, dispatch_time(DISPATCH_TIME_NOW, FIRCLSInstallationsWaitTime));
  }
}

Although the proposed solution worked, the maintainers of the Firebase SDK found an even more elegant and streamlined one: calling regenerateInstallID was not required at all. The most straightforward fix turned out to be the most effective, sidestepping the need for complex solutions. It is a good reminder to keep refining our solutions with simplicity and efficiency in mind.

Final Thoughts

Understanding and preventing deadlocks is key to keeping your app responsive. Tools like run loops, locks, and semaphores can help manage tasks across multiple threads, but they can also make things complex and cause deadlocks if not used correctly. When using these tools, it's important to avoid potential issues like race conditions and deadlocks. Keep your code simple, make sure to always balance semaphore waits with signals, and try not to hold locks during lengthy tasks. Applying these concepts correctly can help your app stay responsive and provide a smooth user experience.
