One of the challenging things about working on performance problems is their fluid, unpredictable nature.
A system can be completely stable with a certain configuration and workload; it may be handling a high volume of requests, using the available resources very efficiently.
But a slight perturbation in that workload, or a tiny increase in the volume, or even the presence of some other unrelated activity that competes unexpectedly for resources, can turn a well-operating system into a completely malfunctioning mess.
Performance behaviors have these "knees in the curve": places in a performance profile where the slope of the graph (its first derivative) changes dramatically.
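One way to see such a knee is the textbook single-server queueing model (M/M/1), where mean response time is R = S / (1 - rho) for service time S and utilization rho. The sketch below is only an illustration under that model's assumptions (random arrivals, exponentially distributed service times); it is not a model of any particular system, and the function name and the 10 ms service time are invented for the example.

```python
def mean_response_time(service_time, utilization):
    """Mean response time of an M/M/1 queue: R = S / (1 - rho).

    service_time: average time to serve one request, in seconds.
    utilization:  fraction of server capacity in use, 0 <= rho < 1.
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


if __name__ == "__main__":
    service_time = 0.010  # assumed: 10 ms per request
    for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
        r = mean_response_time(service_time, rho)
        print(f"utilization {rho:.2f}: mean response {r * 1000:.0f} ms")
```

In this model, going from 50% to 90% utilization multiplies the mean response time by five, and going from 90% to 99% multiplies it by ten again. A small increase in volume near the knee costs far more than the same increase did at moderate load.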
Worse, these knees can be very hard to diagnose and adjust, because systems often don't exhibit them under synthetic, controlled workloads. Your existing performance benchmarks are probably mature and predictable; over the years, your system has become "grooved in" to these workloads, and doesn't exhibit these performance peculiarities under laboratory conditions.
So you are forced to diagnose and debug these performance anomalies under live conditions, in the field, under pressure, with voices raised and tempers flaring.
One lesson is to keep your diagnosis tools and skills ready at all times: have lots of monitoring points in your application, capture lots of information, and be familiar with the available tools for diagnosing that information.
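As a rough idea of what a monitoring point can look like, here is a minimal sketch of an in-process timer. The names `monitored`, `summary`, and `_samples` are hypothetical, invented for this example; a real application would more likely feed an existing metrics library, and would cap or rotate the stored samples rather than let the lists grow without bound.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-process store: operation name -> list of durations (seconds).
# Unbounded on purpose to keep the sketch short; cap it in real use.
_samples = defaultdict(list)


@contextmanager
def monitored(name):
    """Record how long the enclosed block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        _samples[name].append(time.perf_counter() - start)


def summary(name):
    """Return count, max, and an approximate 99th percentile for one operation."""
    data = sorted(_samples.get(name, []))
    if not data:
        return None
    p99_index = min(len(data) - 1, int(0.99 * len(data)))
    return {"count": len(data), "max": data[-1], "p99": data[p99_index]}


if __name__ == "__main__":
    for _ in range(5):
        with monitored("handle_request"):
            time.sleep(0.001)  # stand-in for real work
    print(summary("handle_request"))
```

Tracking the maximum and a high percentile, not just the average, matters here: the early signs of a knee tend to show up in the slowest requests first.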
But surely there are better ways to build software so it doesn't fall over in cases like these? Where is the body of research that talks about how to design algorithms and systems that decay gracefully under overload conditions?
Several years ago, I remember that people were talking about "chaos theory", claiming that it would provide techniques for understanding systems that exhibited large responses to small changes in input.
These performance knees seem to present a worthwhile venue for applying such theory, but I don't recall having seen actual results in this area.
Am I just looking in the wrong places?