Debugging A Race Condition

tl;dr: I had to implement resource locking on the fly after realizing a bug was caused by a rare race condition.

The Background

The Notice system relies on four main components: the web interface, a data store, a lookup process, and a sender process. The lookup process is the most complex part. It is responsible for finding unsent notifications, assembling the notification + ad combination for each of the current subscribers, and placing the results in the send queue.

The Breakdown

Ultimately it came down to the lookup process firing every 10 seconds, with the initial implementation not locking notifications that were in the process of being queued. We ran into some… less than stellar situations: if a channel had enough subscribers, queueing took longer than 10 seconds, and when the process re-fired it would find the same notification (which was still being queued in a different thread) and begin re-queueing it. The result was subscribers receiving duplicate notifications.

The obvious fix was to implement resource locking, and we no longer have the duplicate notification issue. A stressful day, but at least a fix was found.
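
For the curious, here is a rough sketch of the idea in Python. It is not the actual Notice code: ClaimRegistry, try_claim, release, and lookup_pass are names made up for illustration, and it assumes every lookup pass runs as a thread inside one process. The point it shows is that "is this notification already being queued?" and "mark it as being queued" have to happen as one atomic step, so a pass that fires while an earlier one is still working skips the notification instead of re-queueing it.

```python
import threading


class ClaimRegistry:
    """Hypothetical helper: tracks notification IDs that a pass has claimed."""

    def __init__(self):
        self._lock = threading.Lock()
        self._claimed = set()

    def try_claim(self, notification_id):
        # The check and the add happen under one lock, so two passes
        # can never both see the same ID as unclaimed.
        with self._lock:
            if notification_id in self._claimed:
                return False
            self._claimed.add(notification_id)
            return True

    def release(self, notification_id):
        # Assumed to be called only once the notification is no longer
        # reported as unsent (or when it needs to be retried).
        with self._lock:
            self._claimed.discard(notification_id)


def lookup_pass(unsent_ids, claims, send_queue, subscribers):
    """One firing of the lookup process (illustrative shape only)."""
    for notification_id in unsent_ids:
        if not claims.try_claim(notification_id):
            continue  # another pass is already queueing this one
        for subscriber in subscribers:
            send_queue.append((notification_id, subscriber))
        # The claim is deliberately kept after queueing in this sketch,
        # so later passes that still see the ID as unsent skip it.
```

With this in place, several overlapping passes over the same unsent notifications produce one queue entry per notification and subscriber. If the lookup ran in separate processes or on separate machines, an in-memory lock like this would not be enough, and the claim would have to live somewhere shared, such as the data store.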