DDIA Book #1: Metrics. I was wrong, you are wrong…

Just in Chapter #1 of this book, I've realized I have been capturing the wrong metrics my entire life.

So I'll go section by section and start each one with a quick review of a core metric I should have captured.

RAW NOTES: (TBD when I finish this book)

Reliability

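The book defines reliability as the capacity of a system to continue operating correctly even when things go wrong, AKA tolerate and recover from faults, both predicted and unexpected. (The book separates a fault, one component misbehaving, from a failure, the whole system no longer serving its users.)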

Note: I am still confused about unexpected failures.

Key Points:

Understanding and measuring metrics is pointless if you don't know the Mean Time To Failure (MTTF).
Fault tolerance: creating systems that stay resilient to various types of faults, no matter what type of error shows up, expected or not.
Implementation of RAID configurations and hot-swappable CPUs to make hardware reliable (although I've never worked at this low level, it sounds interesting).
From the beginning, implement multi-machine redundancy for applications requiring HA (high availability); no single-server setups.
In terms of Agile (yikes, another book that talks about Agile), the only important thing was that the system should be able to roll back and upgrade without challenges in less than a day (avoiding lengthy processes).
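
To put a number on the MTTF point, here is a small sketch of the arithmetic behind the book's disk example (hard disks with an MTTF of roughly 10 to 50 years, in a cluster of 10,000 disks). The function name and the simplification of independent disks with a constant failure rate are my own assumptions, not something the book spells out.

```python
def expected_daily_failures(num_disks: int, mttf_years: float) -> float:
    """Expected disk failures per day.

    Assumption (mine): disks fail independently at a constant rate,
    so each disk fails on average once every `mttf_years` years.
    """
    return num_disks / (mttf_years * 365)


if __name__ == "__main__":
    for mttf in (10, 50):
        rate = expected_daily_failures(10_000, mttf)
        print(f"MTTF {mttf} years -> ~{rate:.2f} failures/day")
    # MTTF 10 years -> ~2.74 failures/day
    # MTTF 50 years -> ~0.55 failures/day
```

So at that scale you land somewhere between half a disk and three disks dying per day, which is why a fault has to be treated as routine and not as a surprise.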

The Only Strategy:

Ensuring that the system has telemetry as a core component for monitoring.

Scalability

It is the ability of a system to handle increased load without degrading performance or availability. The keyword here is performance: I saw it at my previous job, where with more scale the old services became really, really slow.

Key Metrics:

Load parameters (requests per second (RPS), ratio of reads to writes, active users (MAU/DAU), cache hit rate).
Throughput and latency (distinguishing between service time and the time seen by users).

Observations:

The best way to understand performance is to look at the distribution of values: the median and the percentiles (e.g., 95th, 99th) are what you want to track when improving response times.
The significance of tail latencies and how they impact user experience, particularly for valuable customers. So you don't measure, let's say, a single device that is on 3G; you take a cohort of devices and measure the longest request in that small cohort.
The concept of "tail latency amplification": this is when a request depends on several other requests, and one slow call delays the end response, sometimes making the user time out (I wonder if this is easier to capture for client teams).
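
The observations above can be sketched in a few lines. This is only an illustration: the sample response times are made up, the function names are mine, and the percentile uses the simple nearest-rank method (real monitoring tools often use approximations). The second function assumes backend calls are slow independently of each other, which is a simplification.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that
    at least p% of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def prob_slow_request(p_slow_backend, fanout):
    """Chance that at least one of `fanout` backend calls is slow,
    assuming each call is slow independently with p_slow_backend."""
    return 1 - (1 - p_slow_backend) ** fanout


if __name__ == "__main__":
    # Made-up response times in milliseconds, with one outlier.
    times_ms = [12, 15, 11, 14, 13, 16, 12, 15, 900, 14]
    print("mean:", sum(times_ms) / len(times_ms))  # 102.2
    print("p50: ", percentile(times_ms, 50))       # 14
    print("p95: ", percentile(times_ms, 95))       # 900
    print("p99: ", percentile(times_ms, 99))       # 900

    # If 1% of backend calls are slow and one user request
    # fans out to 100 backend calls:
    print(f"{prob_slow_request(0.01, 100):.2f}")   # 0.63
```

The mean (102.2 ms) describes nobody's experience here: the typical user waits 14 ms and the unlucky one waits 900 ms, which only the high percentiles show. The last line is the amplification part: even when only 1% of backend calls are slow, a request that needs 100 of them has roughly a 63% chance of hitting at least one slow call.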

Strategies:

Scalable architectures must be designed around the specific application: its operation volumes, data complexity, and response time requirements.

Maintainability

I guess a good book on maintainability was the Clean Architecture book, where they go into detail on how to write code that is easy to maintain, but this book touches on:

Goals:

The goal is to simplify the ops workload by providing a good default experience. If your system does notification approvals, it should be 99% good at that; other things can fail.
Make the system changeable (Uncle Bob has a good definition of this) so it can adapt over time with minimal effort.

Summary

Reliability is correct operation in the middle of a fault.
Scalability is maintaining performance under increased loads.
Maintainability focuses on easing the work of engineering.

Very interesting, a bit general. I guess with the following chapters, I will form an opinion on this.