feat: Add a new blog post about being smart around scaling
1 parent e37b9e1, commit 18d2469
Showing 1 changed file with 147 additions and 0 deletions.
@@ -0,0 +1,147 @@
---
title: Illusions as a Service
description: |
  The typical tech sales pitch you'll see today, especially in the age of AI,
  reads more like a magician's prelude than the matter-of-fact boxes your
  grandfather's (or grandmother's) drill came in. Let's talk a bit about
  reality, business decisions, and understanding what you're buying in a world
  where sellers control the narrative.
date: 2024-06-01 03:00:00
permalinkPattern: :year/:month/:day/:slug/
categories:
- business
- technology
tags:
- meta-commentary
---

# Illusions as a Service
The world today strikes a particularly jarring contrast with the one I grew up
in. Buzzwords and gimmicks have taken center stage in an environment where a
"cheap" alternative can be found for thruppence on the quart (or however that
monetary system worked).

To compete in a sea of identical products, companies pitch differentiating
features, many of which seem to have been crafted through a haphazard "Cards
Against Humanity" approach where every second card is the word "AI". Before that
it was "Blockchain", and before that "The Cloud".

Thing is, for all that these products will try to convince you that they have
the singular ability to magically solve whatever problem you face, the simple
reality is that the laws of physics cannot be negotiated with.

There is no magic: you cannot beat the Shannon compression limit, you can't
move information faster than the speed of light, and when the network
partitions you must choose between consistency and availability.

<!-- more -->

Okay, that's a pretty heavy way to start things, but there is a bright side:
knowing this allows us to evaluate options much more accurately and make
(reasonably) informed decisions.

::: warning
This post takes an intentionally excessive position on the topics it covers.
It's probably not going to give you a good idea about what technology to choose,
but hopefully it gives you some ideas about how to think about the technology
you choose.
:::

## The illusion of choice
Humans are gullible (and liable to get irritated if you point this out...whoops).
Marketing and sales departments know this all too well, and one of the fun
wrenches they like to use to percussively encourage you to part with your money
involves giving you the choice of anything you could possibly want, so long as
it's one of the things they're selling.

> Any colour the customer wants, as long as it's black - Henry Ford

A great example is that in all likelihood you've never been asked if you'd like
a smartphone; you were just asked if you wanted Android or iOS. You're not asked
if you'd like a managed database that costs several times what it would to run it
yourself; you're just asked if you'd like DynamoDB, Cosmos DB, Azure PostgreSQL,
Cloud Spanner, RDS, or a wide range of other options that all promise to be the
most cost-effective way to spend lots of money.

## Pick a card, any card
Everyone and their dog is offering to sell you their own take on foundational services.
You've got the hyperscalers who could probably solve your problem, but they could also
offer you at least three different ways to solve your problem, all of which compete with
one another, and all of which claim to be the right solution (except for the one with the
best margins, which probably has a bunch of research papers explaining how it defies the
laws of physics and beats the CAP theorem - all so that the sales team can impress your executives).

Silicon Valley startups, not to be outdone by "The Big Cos" (as the kids are calling them these days),
are tripping over their shoelaces trying to sell you their products.
In most cases, they'll be offering some smart way to run Open Source software, with a specific
pain point that once hurt a founder polished into profitability (or at the very least,
the illusion of it).

And then there's Open Source, the good old "free" option that has had teams asking themselves
"how hard can it be?" for decades (to which the answer is obviously "Ah sure, it'll be grand!").
You will, of course, want to hire a full team to run this for you - and don't make the mistake
of hiring the wrong team (I'll let you figure out whether that is SRE, DevOps, DBA, or a
Software team running their own DBs - good luck, I'm sure you'll figure it out).

## Letting truth get in the way of a good story
As I said at the start, there is no such thing as magic, and anything that sounds too good to
be true probably is. Everything has trade-offs: the laws of physics mean that as you move things
further apart you will invariably introduce performance constraints, and the Universal Scaling Law
means that if you care about ordering you're going to see retrograde scaling beyond a certain size.
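To put a shape on "retrograde scaling", here is a minimal sketch of the Universal Scaling Law in its usual form, $C(N) = N / (1 + \alpha (N - 1) + \beta N (N - 1))$. The function names and the `ALPHA` and `BETA` coefficients are made up for illustration and are not measurements from any real system.

```python
import math


def usl_throughput(n: int, alpha: float, beta: float) -> float:
    """Relative throughput of n nodes under the Universal Scaling Law.

    alpha models contention (work that has to be serialised), beta models
    coherency (the cost of nodes agreeing with each other, e.g. on ordering).
    """
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))


def usl_peak(alpha: float, beta: float) -> float:
    """Node count where throughput peaks; past it, adding nodes hurts."""
    return math.sqrt((1 - alpha) / beta)


# Hypothetical coefficients, chosen purely for illustration.
ALPHA, BETA = 0.05, 0.001

if __name__ == "__main__":
    print(f"throughput peaks near N = {usl_peak(ALPHA, BETA):.1f}")
    for n in (1, 8, 31, 64, 128):
        print(f"N = {n:>3}: {usl_throughput(n, ALPHA, BETA):.2f}x a single node")
```

With any non-zero coherency cost the curve eventually bends downwards, which is the "beyond a certain size" part of the claim. Where that happens depends entirely on coefficients you would have to measure for your own workload.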

There is some amazing software out there, and many of the options in the wild will work for many
of the problems you're likely to face - but don't let the sales team spin you a merry yarn; instead,
get your hamster running and do some thinking.

When it comes to databases, you'll be amazed how much you can do with an in-memory implementation
and some snapshotting logic. Care a bit more about consistency and data quality? SQLite will gladly
let you solve most problems you can solve on a single node without skipping a beat. Take a copy of
the `db` once in a while for backup purposes and you're probably going to be set.
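"Take a copy of the `db`" can be as small as the sketch below, which uses the `Connection.backup()` method from Python's standard `sqlite3` module. The `backup_sqlite` helper, the file names, and the `posts` table are hypothetical and exist only for the demo.

```python
import sqlite3
import tempfile
from pathlib import Path


def backup_sqlite(source: Path, destination: Path) -> None:
    """Copy a (possibly in-use) SQLite database file to `destination`."""
    src = sqlite3.connect(source)
    dst = sqlite3.connect(destination)
    try:
        # Connection.backup() uses SQLite's online backup API, so the copy
        # is consistent even if other connections are using the source.
        src.backup(dst)
    finally:
        dst.close()
        src.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        live = Path(tmp) / "blog.db"          # hypothetical file names
        copy = Path(tmp) / "blog-backup.db"

        conn = sqlite3.connect(live)
        conn.execute("CREATE TABLE posts (title TEXT)")
        conn.execute("INSERT INTO posts VALUES ('Illusions as a Service')")
        conn.commit()
        conn.close()

        backup_sqlite(live, copy)

        restored = sqlite3.connect(copy)
        print(restored.execute("SELECT title FROM posts").fetchall())
        restored.close()
```

Run something like this from a scheduled job, ship the copy to another machine, and you have most of a backup strategy. How often "once in a while" should be depends on how much data you can stand to lose.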

And if, God forbid, you need multiple clients accessing the database at the same time, grab PostgreSQL.
It is rock solid, can solve most problems you need a database for, is incredibly fast, and if you run it on
a single node with backups (and maybe a log-streaming replica for point-in-time recovery) you'll be just
fine for everything most products will ever need.

## Struggle-snuggling with dragons
Of course, if you know anything about how you're *meant* to run reliable systems, everything I've just told
you makes me sound like an idiot. "This clown thinks that running SQLite is better than Cloud Spanner for my use case?! Hah!
What do they know?" So okay, let's play it out: you've read the recipe books; you must have a replication factor of
at least N+2 to support planned downtime and unplanned outages without impacting availability. You must distribute
that capacity across multiple failure domains (AZs, regions, maybe both) so that a major outage doesn't impact your
blog, which gets 10 views by `User-Agent: *Bot` per month, 150 from your home IP, and nothing else. You must shard your
data across multiple smaller instances because "horizontal scaling", and you'll need infrastructure to manage automated
fail-over and leader-election (so grab yourself a Raft or some ZooKeepers to wrangle this mess).

Before you know it, you're managing a fleet of systems, fighting with the laws of physics, and have so many
interdependent systems with different failure modes that even though a single machine has an estimated 99.95%
uptime per year, you're probably getting 99.5% reliability (or about 4 hours of downtime a month) because of unexpected
issues which cascade and require manual intervention.
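The arithmetic behind those percentages is easy to check. The sketch below converts availability into downtime and multiplies availabilities together for components that all have to be up at once. It assumes independent failures and treats every component as a hard dependency, which is pessimistic. The count of ten components is invented for illustration, and the function names are hypothetical.

```python
import math

HOURS_PER_MONTH = 730  # average month: 8760 hours / 12


def downtime_hours_per_month(availability: float) -> float:
    """Expected monthly downtime for a given availability (0.995 == 99.5%)."""
    return (1 - availability) * HOURS_PER_MONTH


def serial_availability(*components: float) -> float:
    """Availability of a system that needs every component up at once,
    assuming the components fail independently of each other."""
    return math.prod(components)


if __name__ == "__main__":
    single = 0.9995
    fleet = serial_availability(*[single] * 10)  # ten hypothetical hard dependencies

    print(f"one machine:    {single:.2%}, {downtime_hours_per_month(single):.2f} h/month down")
    print(f"ten in a chain: {fleet:.2%}, {downtime_hours_per_month(fleet):.2f} h/month down")
```

Under those assumptions, ten interdependent components at 99.95% each land at roughly 99.5% overall, which works out to a little over three and a half hours of downtime a month. Redundancy is meant to push this the other way, but only for the failure modes it was actually designed to handle.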

At which point, your local hyperscaler leans over and says "Hey there kid, want to try some serverless? I've got the
good stuff!" and you reckon "Why not? They probably know how to do this better than me and they have a 99.999% SLA."
And so you live happily ever after...

Well, except for the major outage last month, or the fact that there are these weird timeouts you get from time to time,
or the crazy cloud bill, and the network egress fees! Meanwhile, your little home lab machine has been running SQLite
happily for the last 536 days without downtime, doing its job without a word of complaint.

## Wrapping up
There's no small amount of snark in this post, but there is a very serious core consideration that led to it. When we
think about reliability we immediately jump to thoughts of system design patterns which can allow us to reduce blast
radius, layer redundancy, and give ourselves the best shot of weathering a bad day.

But complexity is its own form of reliability challenge, and ignoring it is a really bad idea :tm: (especially at the
kind of scale where the compounding probabilities of failure introduced by the additional nodes required for scaling are
far outweighed by the additional risk introduced by adding more components).

Conceptually, we can think of this in terms of a reliability cost from architectural complexity (which has a negative
scaling factor associated with it but a large constant cost) and the reliability cost of scaling (which has a low
up-front cost but a positive scaling factor).

$$ risk = risk_{architecture} + N (risk_{scale} - benefit_{architecture}) $$

At small scale (where $N$ is small), keeping your $risk_{architecture}$ low matters a lot more than when $N$ is large,
where the $benefit_{architecture}$ plays a much larger role. So be honest with yourself: are you really large enough to
warrant a complex system, or would you be better served by working on automating that backup system you've been meaning
to get to?
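One way to answer that question honestly is to plug rough numbers into the formula above and see where the two lines cross. The sketch below does that for a simple and a complex architecture. Every number in it is invented and in arbitrary units, and the `Architecture` class and function names are hypothetical, so treat it as a way of thinking, not a calculator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Architecture:
    risk_architecture: float     # constant risk added by the design's complexity
    benefit_architecture: float  # risk removed per unit of scale


def total_risk(arch: Architecture, n: float, risk_scale: float) -> float:
    """risk = risk_architecture + N * (risk_scale - benefit_architecture)"""
    return arch.risk_architecture + n * (risk_scale - arch.benefit_architecture)


def break_even_scale(simple: Architecture, complex_: Architecture) -> float:
    """The N at which both architectures carry the same total risk."""
    return (complex_.risk_architecture - simple.risk_architecture) / (
        complex_.benefit_architecture - simple.benefit_architecture
    )


# Invented numbers, arbitrary units.
RISK_SCALE = 0.5
SINGLE_NODE = Architecture(risk_architecture=1.0, benefit_architecture=0.0)
DISTRIBUTED = Architecture(risk_architecture=10.0, benefit_architecture=0.25)

if __name__ == "__main__":
    print(f"break-even at N = {break_even_scale(SINGLE_NODE, DISTRIBUTED):.0f}")
    for n in (10, 100):
        simple = total_risk(SINGLE_NODE, n, RISK_SCALE)
        complex_ = total_risk(DISTRIBUTED, n, RISK_SCALE)
        print(f"N = {n:>3}: single node {simple:.1f}, distributed {complex_:.1f}")
```

Below the break-even point the boring single node carries less total risk, and above it the complex design starts paying for itself. The hard part is estimating your own numbers without letting a sales deck do it for you.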