The Scenario
The API serves hundreds of thousands of highly engaged users and receives up to 8,000 requests per second during peak times. These users are passionate about our content, so timely data updates are crucial to avoid negative feedback on social media. An improved version of the API, built on Google Cloud, was released to serve both old and new data.
However, an unexpected challenge surfaced: usage and engagement surpassed expectations, and the 99th percentile latency of the data API climbed to a concerning 7 seconds. Fortunately, peak usage occurred on weekends, while the dedicated users remained active throughout the week. That provided some breathing room, but it still left a tight deadline of five days to resolve the issue. We turned to Redis as a potential solution, since its feature set looked promising for the problem at hand.
The Challenge
Redis was considered a potential solution because of its reputation for delivering document sets in sub-millisecond timeframes, which seemed like the perfect fit for our needs.
Upon closer examination, it became clear that while Redis excelled in speed, it had certain drawbacks. To put it in perspective, let’s use an analogy. Think of a Bugatti Chiron Supersport, an impressive engineering marvel that can get you from Belgium to Portugal in just 4 hours. Amazing, right? But there are downsides. It can’t accommodate your entire family, you’ll likely accumulate numerous speeding tickets, and it consumes fuel (and money) at an alarming rate.
Sub-millisecond speeds weren’t actually needed for the API. Instead, the primary requirements were:
- The ability to store and retrieve cached data with a 99th percentile latency below 500 milliseconds.
- The capability to scale seamlessly without compromising the first requirement, effectively handling increasing demands.
- Quick invalidation of selected entries, ideally below 5 seconds, without impacting the first two requirements.
Why Redis wasn’t the right pick
Redis delivers cached data with remarkable speed, so the first requirement was never in question. However, the team lacked the experience to predict how Redis would perform under increasing demand. This presented a challenge, but not an insurmountable one.
The real difficulty arose with the third requirement. None of the team members had extensive expertise in using Redis beyond basic caching, and its more advanced querying capabilities had never been explored in depth. While Redis showed promise, it ultimately didn’t align well with our specific needs given that limited familiarity with its advanced features.
Firestore Got Our Back
Firestore proved to be a great solution for the needs. It met the requirements with reading times below 30 milliseconds and writes below 100 milliseconds. Its querying capabilities were an advantage. The recent addition of Time to Live (TTL) functionality in Firestore was a welcome bonus, further enhancing its appeal. Being a fully managed service by Google, Firestore alleviated any operational concerns. It handled operations and scaling seamlessly, allowing focus on the core tasks.
A service was developed to streamline the process: it standardises requests and checks whether the data already exists in Firestore. If not, it retrieves the data from the source, parses the response to extract relevant tags, and caches the request under those tags. Hooks keep the cache consistently fresh, triggering whenever the original data changes. These hooks invoke the invalidator, which parses the changes to extract the relevant tags, queries Firestore, and invalidates the corresponding cached entries. Entries are flagged as invalid rather than deleted immediately, which preserves correctness even during periods of high concurrency. Firestore proved to be a capable contender, offering the right combination of performance, versatility, and seamless management for the API.
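To make the flow concrete, the read path might look something like this minimal Go sketch. The `cache` collection name, the field names (`body`, `tags`, `valid`, `expiresAt`), and the one-hour TTL are illustrative assumptions, not the exact production schema, and the origin fetch and request-to-key standardisation are abstracted away:

```go
package cacheproxy

import (
	"context"
	"time"

	"cloud.google.com/go/firestore"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// entry is an illustrative cache document; the field names are assumptions.
type entry struct {
	Body      []byte    `firestore:"body"`
	Tags      []string  `firestore:"tags"`
	Valid     bool      `firestore:"valid"`
	ExpiresAt time.Time `firestore:"expiresAt"` // field targeted by the Firestore TTL policy
}

// Lookup returns the cached body for a standardised request key, or fetches
// it from the origin on a miss and caches it together with its tags.
func Lookup(ctx context.Context, fs *firestore.Client, key string,
	fetchOrigin func(context.Context) ([]byte, []string, error)) ([]byte, error) {

	doc, err := fs.Collection("cache").Doc(key).Get(ctx)
	switch {
	case err == nil:
		var e entry
		if derr := doc.DataTo(&e); derr == nil && e.Valid && time.Now().Before(e.ExpiresAt) {
			return e.Body, nil // cache hit
		}
	case status.Code(err) != codes.NotFound:
		return nil, err // a real Firestore error, not just a miss
	}

	// Cache miss (or stale/invalidated entry): go to the origin and re-cache.
	body, tags, err := fetchOrigin(ctx)
	if err != nil {
		return nil, err
	}
	_, err = fs.Collection("cache").Doc(key).Set(ctx, entry{
		Body:      body,
		Tags:      tags,
		Valid:     true,
		ExpiresAt: time.Now().Add(1 * time.Hour), // illustrative TTL
	})
	return body, err
}
```

Because each cached entry carries the tags of the data it was built from, the invalidator can later find every affected entry with a single tag query.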
Details On The Implementation
The service was implemented in Go and deployed on Cloud Run. It consists of two endpoints, the proxy and the invalidator, each deployed as a separate Cloud Run service: one handles the proxy exclusively, the other is dedicated to real-time invalidation. This separation gives each service a homogeneous workload, which simplifies scaling decisions.
Firestore’s triggers establish the connection between data changes and the invalidator. Whenever the original data changes, Firestore triggers the invalidator and the cache invalidation process begins. This setup keeps the proxy and the invalidation process from interfering with each other. With the gears in motion, the service was ready to take on the challenge.
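The invalidation side could be sketched roughly as below, assuming the trigger handler has already decoded the change event and extracted its tags (the real handler would do that decoding first). Entries are flagged invalid rather than deleted, matching the approach described above:

```go
package cacheproxy

import (
	"context"

	"cloud.google.com/go/firestore"
	"google.golang.org/api/iterator"
)

// InvalidateByTags flags every cached entry that carries any of the given
// tags. Entries are marked invalid rather than deleted, so concurrent
// readers never observe a half-removed document.
func InvalidateByTags(ctx context.Context, fs *firestore.Client, tags []string) error {
	// Firestore limits how many values an array-contains-any query accepts
	// (historically 10), so larger tag sets are split into chunks.
	const chunkSize = 10
	for start := 0; start < len(tags); start += chunkSize {
		end := start + chunkSize
		if end > len(tags) {
			end = len(tags)
		}
		iter := fs.Collection("cache").
			Where("tags", "array-contains-any", tags[start:end]).
			Documents(ctx)
		for {
			doc, err := iter.Next()
			if err == iterator.Done {
				break
			}
			if err != nil {
				return err
			}
			if _, err := doc.Ref.Update(ctx, []firestore.Update{
				{Path: "valid", Value: false},
			}); err != nil {
				return err
			}
		}
	}
	return nil
}
```

Flagging instead of deleting mirrors the correctness argument above: a concurrent proxy read either sees a valid entry or falls back to the origin, never a partially removed one.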
Results
| Metric | Before Cache | After Cache | Improvement |
| --- | --- | --- | --- |
| Throughput | 1675 req/s | 5993 req/s | 4318 req/s |
| Percentile 99 | 9.96 s | 477 ms | 9.483 s |
| Percentile 95 | 2.52 s | 57 ms | 2.463 s |
| Percentile 50 | 220 ms | 22 ms | 198 ms |
| Cache Hits | 0% | *85% | 85% |
| Service Instances | 606 | **200 | 400 |
| Cache Instances | 0 | 20 | 20 |
| Invalidator Instances | 0 | 2 | 2 |
| Total Memory | 606 GB | 203.5 GB | 402.5 GB |
** The main service is fundamentally heavy; without restructuring it, the instance count cannot be lowered any further.
During peak weekend usage, caching proved essential to the system’s capacity. Most cache invalidations complete within a second, with occasional cases exceeding 5 seconds; these longer durations typically occur when a large number of documents are streamed and invalidated simultaneously.
From a cost perspective, caching has been a game-changer. Previously, the service read an average of 6 documents for single-resource endpoints and approximately 150 documents for multiple-resource endpoints. On a cache hit, a single read is all it takes: at an 85% hit rate, a multiple-resource request drops from roughly 150 reads to about 0.85 × 1 + 0.15 × 150 ≈ 23 reads on average, a significant cost reduction while maintaining efficient operations.
Conclusion
Firestore is a versatile caching solution for REST services or websites. It offers scalability, flexibility, and performance that meets most Service Level Objectives (SLOs). As a fully managed service, it handles all the management tasks, allowing you to focus on your core business. With Firestore, you only pay for the resources you use, ensuring cost-effectiveness.
Whether you want to optimise your caching, gain more flexibility, or prefer a managed cache experience, Firestore empowers your caching efforts. It unlocks the full potential of your applications and websites with its powerful capabilities.
Looking for a managed cache solution, more flexibility and ready to pay only for the resources you utilise? Contact us now to explore the possibilities.
Future Applications
Enhance your server infrastructure with upcoming advancements that deliver remarkable performance improvements:
- Service Cache Integration:
Integrate the cache directly into each server, eliminating the need for a separate cache server. This reduces latency, lowers costs, and leverages Firestore’s batch-get capabilities for efficient API aggregation. It ensures optimal performance, resembling cache hits or direct access to the origin server.
- Advanced Rule Systems:
Utilise sophisticated rule systems to handle complex data conditions. Configure rules that adapt and branch out based on multiple fields, providing unmatched flexibility and accommodating intricate requirements.
- Fragmented Speed:
Accelerate large-scale invalidations by fragmenting the document stream into multiple concurrent streams (a sketch follows this list).
- Precise Indexing:
Fine-tune your indexing strategy to achieve quick updates, boosting your system’s overall performance.
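As a rough sketch of the Fragmented Speed idea, the result stream of a tag query could be fanned out to a pool of workers so that entries are flagged invalid concurrently rather than one at a time. The worker count, collection, and field names carry over the illustrative assumptions from the earlier sketches:

```go
package cacheproxy

import (
	"context"

	"cloud.google.com/go/firestore"
	"golang.org/x/sync/errgroup"
	"google.golang.org/api/iterator"
)

// invalidateConcurrently streams the documents matching a tag and fans the
// updates out over a fixed number of workers instead of flagging them one
// by one.
func invalidateConcurrently(ctx context.Context, fs *firestore.Client, tag string, workers int) error {
	refs := make(chan *firestore.DocumentRef)
	g, ctx := errgroup.WithContext(ctx)

	// Producer: stream matching cache entries from Firestore.
	g.Go(func() error {
		defer close(refs)
		iter := fs.Collection("cache").Where("tags", "array-contains", tag).Documents(ctx)
		for {
			doc, err := iter.Next()
			if err == iterator.Done {
				return nil
			}
			if err != nil {
				return err
			}
			select {
			case refs <- doc.Ref:
			case <-ctx.Done():
				return ctx.Err()
			}
		}
	})

	// Consumers: flag entries invalid concurrently.
	for i := 0; i < workers; i++ {
		g.Go(func() error {
			for ref := range refs {
				if _, err := ref.Update(ctx, []firestore.Update{{Path: "valid", Value: false}}); err != nil {
					return err
				}
			}
			return nil
		})
	}
	return g.Wait()
}
```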
Wondering how to use your data to make better decisions within your business? 🚀
Read the 4 Expert Insights on common pain points and challenges when looking for ROI using Data