Grab Switches from SQS and Redis to Temporal for Its Subscription Platform

Grab rebuilt the architecture of GrabUnlimited, its subscription platform serving millions of users, on Temporal. The company reports an improved user experience and an 80% reduction in production incidents. The new architecture improved robustness and scalability, addressing a range of issues with the previous solution.

GrabUnlimited is Grab’s subscription program, offering benefits to members who pay a monthly fee. The platform that powers GrabUnlimited implements two primary flows: enrolling members in the program and renewing their memberships. As the subscriber base grew, the company started experiencing problems and system outages. Operational metrics showed users being blocked by corrupted membership states in the database, memberships not being renewed automatically, and users not receiving benefits after renewal.

Michel Parreno, engineering manager, and Theodore Felix Leo, lead engineer at Grab, describe issues experienced with the initial architecture:

“With the initial triumph and significant surge in subscriber count by over 1000% from January 2022 to June 2023 […] the architecture that supported GrabUnlimited was starting to show signs of strain. Common subscriber concerns such as not receiving their membership benefits, along with developer issues marked by an increase in service outages highlighted the system’s low resiliency. The culprit? A backend service that, while functional, was not built to efficiently manage the complexities of a rapidly scaling membership model.”

The original architecture of GrabUnlimited relied on Amazon RDS for data storage, SQS for messaging, and Redis for caching and distributed locking. It adopted the state machine pattern to track the membership state. As the number of subscribers kept growing, the daily cron job that retrieved memberships due for renewal became slower. The team had to resort to splitting the job into batches and, ultimately, to vertically scaling the database. Furthermore, due to the limitations of a 5-minute Redis lock, the renewal process could result in a corrupted membership state.
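
The article does not spell out how the 5-minute lock led to corruption. One plausible failure mode, sketched here as an assumption rather than as Grab's actual code, is a lock whose time-to-live expires while the renewal it protects is still running. The TtlLock class and the worker names are hypothetical, and an in-memory object with an explicit clock value stands in for Redis:

```python
class TtlLock:
    """In-memory stand-in for a lock that expires after a TTL (hypothetical, not a Redis client)."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.owner = None
        self.expires_at = 0.0

    def acquire(self, owner, now):
        # The lock is free if nobody holds it or the previous holder's TTL has run out.
        if self.owner is None or now >= self.expires_at:
            self.owner = owner
            self.expires_at = now + self.ttl
            return True
        return False


lock = TtlLock(ttl_seconds=300)  # 5-minute TTL, as mentioned in the article

# Worker A starts a renewal at t=0 and takes the lock.
a_got_lock = lock.acquire("worker-a", now=0)

# Assume worker A is still busy at t=301 (for example, a slow upstream call).
# Its TTL has expired, so worker B can take the same lock and start a second renewal.
b_got_lock = lock.acquire("worker-b", now=301)

# Both workers now believe they own the membership record.
print(a_got_lock and b_got_lock)
```

In this sketch nothing tells worker A that it lost the lock, so both workers can write to the same membership record.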

The original architecture of GrabUnlimited (Source: Grab’s Engineering Blog)

The solution also suffered from resiliency issues, where outages of upstream services, combined with SQS retries without exponential backoff, could lead to overloading of services as they came back online. Lastly, the subscription service within the initial architecture became overly complex with the growing number of state transitions. It lacked idempotency guarantees, resulting in the double awarding of benefits when the process was retried.
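
The double-awarding problem can be illustrated with a generic idempotency-key guard. This is a common technique for retried operations, not the approach the article attributes to Grab, and the BenefitLedger class, its fields, and the sample identifiers are all hypothetical:

```python
class BenefitLedger:
    """Records awarded benefits, keyed so that a retried award becomes a no-op (illustrative only)."""

    def __init__(self):
        self.awards = {}  # idempotency key -> benefit

    def award(self, member_id, period, benefit):
        # The (member, billing period) pair acts as the idempotency key.
        key = (member_id, period)
        if key in self.awards:
            return False  # already awarded; a retry must not award again
        self.awards[key] = benefit
        return True


ledger = BenefitLedger()
first = ledger.award("member-42", "2023-06", "free-delivery")
retry = ledger.award("member-42", "2023-06", "free-delivery")  # message redelivered
print(first, retry, len(ledger.awards))
```

Without the key check, the redelivered message would record the benefit a second time, which is the behavior the article describes in the original service.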

After considering the problems with the original architecture and their impact on user experience and operational overhead, the team began seeking solutions. Instead of refactoring the existing system, engineers opted to replace the existing architecture with a new one, based on Temporal, an open-source workflow orchestration engine that another team at Grab had adopted previously.

The team assessed Temporal across many areas, including scalability, reliability, performance, and security, as well as development effort, cost, and testability. The new architecture benefited from many built-in features of Temporal, such as infinite retries, exponential backoff, rate limiting, and observability.

The new architecture of GrabUnlimited using Temporal (Source: Grab’s Engineering Blog)

In the new architecture, the daily cron job was replaced by a Timer, which allowed the renewal process to be spread throughout the day based on each user’s subscription time, greatly improving scalability. Previous concurrency challenges were addressed by leveraging Temporal’s built-in workflow-execution guarantees: operations that must be mutually exclusive run as workflows sharing the same workflow ID, which Temporal does not allow to run concurrently.
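
The idea of a per-member timer, as opposed to one daily batch, can be sketched with plain date arithmetic. This is a simplified illustration that does not use the Temporal SDK. The fixed 30-day billing period and the function names are assumptions, since the article only states that members pay a monthly fee:

```python
from datetime import datetime, timedelta, timezone

BILLING_PERIOD = timedelta(days=30)  # assumption: fixed-length period for simplicity


def next_renewal(subscribed_at, now):
    """Return the first renewal instant strictly after `now`."""
    periods_done = (now - subscribed_at) // BILLING_PERIOD  # whole periods elapsed
    return subscribed_at + (periods_done + 1) * BILLING_PERIOD


def timer_delay(subscribed_at, now):
    """How long a per-member timer would sleep before the next renewal."""
    return next_renewal(subscribed_at, now) - now


now = datetime(2023, 2, 1, 0, 0, tzinfo=timezone.utc)
subscribed_at = datetime(2023, 1, 10, 9, 15, tzinfo=timezone.utc)

print(next_renewal(subscribed_at, now))  # 2023-02-09 09:15:00+00:00
print(timer_delay(subscribed_at, now))   # 8 days, 9:15:00
```

Because each delay derives from the member's own subscription time, renewals fall at different moments of the day instead of arriving as a single batch.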

The team also took advantage of Temporal’s resilience mechanisms, like infinite retry and exponential backoff, to configure appropriate retry policies protecting external services from getting overwhelmed in case of brief availability problems. Idempotency issues in the previous architecture were addressed by designing workflow steps to handle distinct responsibilities and utilizing Temporal for error handling and sequencing workflow execution.
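
To show why exponential backoff protects a recovering service, the sketch below computes the wait before each retry. The parameter names loosely mirror the fields of a Temporal retry policy (initial interval, backoff coefficient, maximum interval), but the function is a standalone illustration and the values are assumptions, not Grab's configuration:

```python
def backoff_schedule(initial_interval, backoff_coefficient, maximum_interval, attempts):
    """Return the delay in seconds before each of the first `attempts` retries."""
    delays = []
    delay = initial_interval
    for _ in range(attempts):
        # Each delay grows by the coefficient but never exceeds the cap.
        delays.append(min(delay, maximum_interval))
        delay *= backoff_coefficient
    return delays


schedule = backoff_schedule(
    initial_interval=1.0,
    backoff_coefficient=2.0,
    maximum_interval=60.0,
    attempts=8,
)
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

With a fixed retry interval, every failed call would return at the same pace. Here the gap between attempts widens until it reaches the cap, which gives a struggling dependency progressively more room to recover.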

The GrabUnlimited team learned a great deal while adopting Temporal and had to adjust their approach to system design to leverage Temporal’s functionality fully. Engineers highlighted that the switch allowed them to focus on important aspects of the product platform, rather than spending time implementing basic building blocks. Despite the challenges in adopting the new technology, the team believes that “the learning curve pays off”.
