tancop 15 hours ago
> The approval gates came out, replaced by a sequenced market-group rollout (test -> eu-0 -> eu-1 -> eu-2) that uses the smaller regions as an alarm buffer before the critical eu-2 region.
This is interesting. These names don't look like AWS regions. Are they geographic divisions, or groups randomly assigned to users? Or are they based on something like total spend over the last N months, to make sure high-value customers don't quit?
If users are assigned to a stable group, some of them would experience far more issues than others. I would do it with a random subset of AZs or individual accounts that's different every time. Not sure which one is the case here.
cjbooms 10 hours ago
What we call "Market Groups" are an artifact of how we deploy our Product Read API. Any deployment can serve Product-Offer data (a product with price and stock information from a merchant in a country) for all of our 29 European markets. But rather than have a deployment per country, or one deployment for all countries, we choose to group countries into 3 "Market Groups" named eu-N. Our edge load balancer inspects HTTP headers to determine which Market Group deployment should receive the request, and we can dynamically adjust this edge routing as needed. The idea was to have some resiliency and to be able to isolate, say, a massive sales campaign or bot attack in one country so that it does not impact other countries. But it has mostly been useful for this staged deployment, where we can deploy and test changes first in a group of our lower-monetary-value countries, such as Ireland, and then move on to higher-value country groups, such as Germany/Switzerland.
So not customers, traffic from countries.
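A minimal sketch of the header-based routing and staged rollout described above. Only the group names and the rollout order (test -> eu-0 -> eu-1 -> eu-2) come from the thread; the header name `X-Market-Country`, the country-to-group assignments, and the `deploy`/`healthy` callbacks are hypothetical and chosen purely for illustration.

```python
from typing import Callable, List, Mapping, Optional

# Rollout order quoted in the thread; the smaller groups go first.
ROLLOUT_ORDER = ["test", "eu-0", "eu-1", "eu-2"]

# Hypothetical assignment of countries to Market Groups (illustration only).
MARKET_GROUPS = {"IE": "eu-0", "DE": "eu-2", "CH": "eu-2"}


def route(headers: Mapping[str, str]) -> Optional[str]:
    """Pick a Market Group from a (hypothetical) country header."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    country = normalised.get("x-market-country", "").strip().upper()
    return MARKET_GROUPS.get(country)  # None means no group is known


def staged_rollout(deploy: Callable[[str], None],
                   healthy: Callable[[str], bool]) -> List[str]:
    """Deploy stage by stage and stop at the first unhealthy stage."""
    completed = []
    for stage in ROLLOUT_ORDER:
        deploy(stage)
        if not healthy(stage):
            break  # later, higher-value groups are never touched
        completed.append(stage)
    return completed
```

If the `healthy` check fails for eu-1, the loop stops there and eu-2 is never deployed, which is the "alarm buffer" effect the quoted passage describes.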
cjbooms 1 day ago
Hi all. Blog author here, happy to take any questions you might have. It's a bit of a long one, but I put a lot of effort into the story arc and readability, so it's (hopefully) an easy top-to-bottom read that takes you on our journey. It covers some advanced distributed computing topics, such as replicating the exact same hash ring as our existing load balancer in the client application (JVM), what we implemented to fade in new pods slowly so their caches get a chance to warm, and how we attempted to migrate to Availability Zone (AZ) aware routing to save on AWS's inter-AZ transfer fees.
Hope you enjoy it!
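For readers unfamiliar with hash rings, here is a generic consistent-hash ring sketch. It is not the article's implementation and does not reproduce the existing load balancer's algorithm: the SHA-256 hash, the `pod#i` point naming, and the 100 points per pod are placeholder assumptions. Matching a real load balancer would require copying its exact hash function and point layout.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


def _hash(value: str) -> int:
    # Stand-in hash; a real port must match the load balancer's hash exactly.
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with several virtual points per pod."""

    def __init__(self, pods: Iterable[str], points_per_pod: int = 100) -> None:
        ring: List[Tuple[int, str]] = []
        for pod in pods:
            for i in range(points_per_pod):
                ring.append((_hash(f"{pod}#{i}"), pod))
        ring.sort()
        self._ring = ring
        self._hashes = [point for point, _ in ring]

    def lookup(self, key: str) -> str:
        """Return the pod owning the first ring point at or after the key."""
        if not self._ring:
            raise LookupError("ring has no pods")
        # Wrap around to the first point when the key hashes past the last one.
        idx = bisect.bisect_left(self._hashes, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The property that matters for cache affinity is that removing a pod only remaps the keys that pod owned; every other key keeps its owner.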
charleshn 22 hours ago
Thanks for the article.
I have two questions/comments:
1. The N-ring fade-in is quite neat. I guess that, without the constraint of hash parity, rendezvous hashing [0] could have been an elegant approach, since it has support for weights (and generally better statistical properties than consistent hashing based on rings).
2. You mention still having the fallback of your existing load balancer. Is this a temporary thing during rollout, or do you intend to keep it long-term? Asking because I generally tend to steer clear of fallbacks in distributed systems, as they introduce bimodality and metastable failures [1] [2].
[0] https://en.wikipedia.org/wiki/Rendezvous_hashing
[1] https://builder.aws.com/content/3EuS9Sakq7L3VLQIF3qzfMfke1Y/...
[2] https://brooker.co.za/blog/2021/05/24/metastable.html
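A minimal sketch of the weighted rendezvous hashing idea raised in point 1, using the standard `-w / ln(h)` score. The pod names, the SHA-256 hash, and the idea of giving a warming pod a small weight are assumptions for illustration; nothing here is taken from the article's implementation.

```python
import hashlib
import math
from typing import Dict, List


def _unit_hash(key: str, pod: str) -> float:
    """Hash (pod, key) to a float strictly between 0 and 1."""
    digest = hashlib.sha256(f"{pod}|{key}".encode()).digest()
    n = int.from_bytes(digest[:8], "big") >> 11  # keep 53 bits
    return (n + 1) / (2**53 + 1)


def ranked_pods(key: str, weights: Dict[str, float]) -> List[str]:
    """Pods ordered by weighted rendezvous score, best first."""
    def score(pod: str) -> float:
        # Standard weighted form: -w / ln(h), with h uniform in (0, 1).
        return -weights[pod] / math.log(_unit_hash(key, pod))

    live = [pod for pod, weight in weights.items() if weight > 0]
    return sorted(live, key=score, reverse=True)


def owner(key: str, weights: Dict[str, float]) -> str:
    """The highest-scoring pod owns the key."""
    return ranked_pods(key, weights)[0]
```

Under these assumptions, a fade-in could raise a new pod's weight from a small value toward 1.0: each increase only moves keys onto that pod and never shuffles keys between the other pods. The rest of the ranking also gives a stable per-key spill-over order, since each pod's score is independent of which other pods are present.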
1. I was not aware of rendezvous hashing, very interesting. Yes, we had an implementation to reverse engineer, and cache parity was a priority. RH would be an elegant approach to fade-in alright. I wonder whether it could also provide consistent spill-over, so that cache affinity is preserved when spilling over to N pods, or whether that would break if we fed occupancy metrics into the RH algorithm.
2. Yes, this was primarily temporary during rollout, and also a bit of a sales pitch to the owning team that this was a two-way door. Totally agree, and we will likely take it out now that everything has been running for a couple of months with complete stability. Right now, we are protected by our in-house Safe Deployments setup, where our CI/CD system versions all ConfigMaps when a new version is deployed. A v1 deployment gets my-config-map-v1, and a v2 gets my-config-map-v2. So re-enabling Skipper would require a blue-green deployment, where traffic is gradually switched back onto Skipper over a 30-minute window for each stack. No big-bang fallback to trigger a cascading failure.
Thanks for all the links, added to reading list.
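A small sketch of the two mechanisms mentioned in point 2: per-version ConfigMap names and a gradual traffic switch. The `my-config-map-vN` naming and the 30-minute window come from the comment; the function names and the linear ramp are assumptions, since the thread does not say how the traffic share changes within the window.

```python
def versioned_config_map(base_name: str, version: int) -> str:
    """Name of the ConfigMap that belongs to one deployment version."""
    return f"{base_name}-v{version}"


def traffic_share(elapsed_seconds: float,
                  window_seconds: float = 30 * 60) -> float:
    """Fraction of traffic on the new stack, assuming a linear ramp."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    # Clamp so the share stays within [0.0, 1.0] outside the window.
    return min(1.0, max(0.0, elapsed_seconds / window_seconds))
```

Because each version reads its own ConfigMap, flipping a setting means rolling out a new version through the ramp rather than changing behaviour for all traffic at once.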