[Note: Lately, I’ve been talking a lot about fitness for purpose and fitness criteria. Other than David Anderson and a few others, though, not much material exists — at least not applied in the software-delivery space — to point people to for further reading. So I’m jotting down some ideas here in the hopes of furthering the discussion and understanding.]
- The first step in improving is understanding what makes the service you provide fit for its purpose.
- Fitness is always defined externally, typically by the customer
- Fitness for purpose has two components: a product component and a service-delivery component
- Fitness criteria are metrics that enable us to evaluate whether our service delivery and/or product is fit for purpose
- Of the two major categories of metrics, fitness criteria are primary, whereas health or improvement metrics are derivative
- Examples of service delivery fitness criteria are delivery time, throughput and predictability
Fitness for purpose is an evaluation of how well a product or service fulfills a customer’s desires based on the organization’s goals or reason for existence. In short, it is the ability of an organization or team to fulfill its mission. The notion derives from the manufacturing industry, which purportedly assesses a product against its stated purpose. That purpose may be determined by the manufacturer or, according to marketing departments, by the needs of customers. David Anderson emphasizes that
Fitness is always defined externally. It is customers and other stakeholders such as governments or regulatory authorities that define what fitness means.
Fitness criteria then are metrics that enable us to evaluate whether our product, service or service delivery is “fit for purpose” in the eyes of a customer from a given market segment. As Anderson notes, fitness criteria metrics are effectively the Key Performance Indicators (KPIs) for each market segment, and as such are direct metrics.
As Anderson explains,
Every business or every unit of a business should know and understand its purpose … What exactly are they in business to do? And it isn’t simply to make money. If they simply wanted to make money they’d be investors and not business owners. They would spend their time managing investment portfolios and not leading a small tribe of believers who want to make something or serve someone. So why does the firm or business unit exist? If we know that we can start to explore what represents “fitness for purpose.”
For me, fitness is something that, like user stories, can be understood at varying levels of granularity. Organizations have fitness for their purpose (“are we fit to pursue this line of business?”), and teams (in particular, small software-delivery teams) also have fitness for their purpose (“are we fit to deliver this work in the way the customer expects?”).
Therefore, the first step in improving is understanding what makes the service you provide fit for its purpose. Fitness for purpose is simply an evaluation of how well an organization or team delivers what it is in the business of (its purpose). Modern knowledge-worker organizations like Asynchrony often focus on concerns like product development or technical practices, sometimes overlooking service-delivery excellence. But service delivery is a major reason why our customers choose us. That’s why we attempt to understand and define each project team’s purpose and fitness for that purpose at the project kickoff in a conversation with our customer representatives.
Two Components of Fitness
Fitness for purpose has two components: a product component and a service-delivery component. That is, the customer for your delivery team considers the product that you are building (the what) — did you build the right thing? — as well as the way in which you deliver it (the how) — how reliable were you when you said you’d deliver it? How long did it take you to deliver it? We have useful feedback mechanisms for learning about the fitness of the products we build (e.g., demos/showcases, usage analytics), but how do we learn about the fitness of our service delivery? That’s the service-delivery review feedback loop, which I will write about later.
Fitness criteria are metrics that enable us to evaluate whether our service delivery is “fit for purpose” in the eyes of a customer from a given market segment. These are usually related to but not limited to delivery time (end-to-end duration), predictability and, for certain domains, safety or regulatory concerns. When we explore and establish expectation levels for each criterion, we discover fitness-criteria thresholds. They represent the “good enough” level, the point where performance is satisfactory. For example, our customer may expect us to deliver user stories within some reasonable time frame, so we could say that for user stories, our delivery-time expectation is that 85% of the time we complete them within 10 days. We might have a different expectation for urgent changes, like production bug fixes.
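To make the 10-day, 85% expectation concrete, here is a minimal Python sketch of how a team might check recent delivery times against such a threshold. The function name `meets_threshold` and the sample numbers are made up for illustration; they are not from any real project or tool.

```python
def meets_threshold(delivery_days, target_days=10, target_share=0.85):
    """Return (share_within_target, is_fit) for a list of delivery times in days."""
    if not delivery_days:
        raise ValueError("need at least one completed item")
    # Count the items delivered within the agreed number of days.
    within = sum(1 for d in delivery_days if d <= target_days)
    share = within / len(delivery_days)
    return share, share >= target_share


# Hypothetical delivery times (in days) for recently completed user stories.
sample = [3, 5, 8, 12, 4, 9, 7, 10, 6, 15]
share, is_fit = meets_threshold(sample)
print(f"{share:.0%} delivered within target; fit for purpose: {is_fit}")
```

A different class of work, such as urgent production fixes, would simply get its own `target_days` and `target_share`.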
Although fitness-criteria categories are often common (nearly everyone cares about delivery time and predictability, for instance), the actual thresholds for them are not. While some thresholds are shared by many customers, the differences in what people want and expect allow us to define market segments and understand different business risks. Fitness criteria should be our Key Performance Indicators (KPIs), and teams should use those thresholds to drive improvements and evolutionary change.
Who Defines Fitness?
As opposed to team-health metrics, like happiness or pair switches, fitness and fitness criteria are always defined externally: Customers and other stakeholders define what fitness means. That means you cannot ask the delivery team to define its fitness. They cannot know because they are not the ones buying their service or product. We should be asking customers “What would make you choose this service? What would make you come back again? What would encourage you to recommend it to others?”
These are a team’s fitness criteria and these are the criteria by which Asynchrony should be measuring the effectiveness of our teams’ service delivery. Then we’ll be improving toward the goal, the greater fitness for our purpose, both as an organization and as individual delivery teams. By integrating fitness-for-purpose thinking into everything we do, we will create an evolutionary capability that will help us sense changes in market needs and wants and what those different market segments value. As a result, Asynchrony will continue to thrive and survive in the midst of our growth and growing market complexity.
Difference Between Fitness Metrics and Health Metrics
| Fitness Metric | Health Metric |
|---|---|
| Metric that enables us to evaluate whether our product, service or service delivery is “fit for purpose” in the eyes of a customer from a given market segment. Effectively comprise the Key Performance Indicators (KPIs) for each market segment. | Metric that guides an improvement initiative or indicates the general health of your business, business or product unit or service delivery capability. |
| Examples: delivery time, functional quality, predictability, net fitness score | Examples: flow efficiency, velocity, percent complete and accurate, WIP |
| Customer-oriented and derived | Team-oriented and derived |
A Food Example
I like to use food for examples (also to eat). Is a restaurant in the product or service-delivery business? That’s a trick question, of course: The answer is “both.” As a customer, you care about the meal (product) but also about the way you have it provided (service delivery). And those always vary depending on what you want: If you want cheap and fast, like a burger and fries at McDonald’s, you may have a lower expectation for the product (sorry, Ronald) but a higher one for delivery speed. Conversely, if you’re out for fine dining, you expect the food to be of a higher quality and are willing to tolerate a longer delivery time. However, you have some thresholds of service even for four-star restaurants: For example, if you have a reservation, you expect to be seated within minutes of your arrival. And you expect a server to take your order in a timely way. If you don’t have a reservation, the maitre d’ or hostess will perhaps quote you an expected wait time; if it’s unacceptable, you’ll go elsewhere. If it’s acceptable but they don’t seat you in that time, you are dissatisfied. The service delivery was not fit for its purpose, which is to say the reason why you chose to eat there.
A Software-Delivery Example
The restaurant experience is actually not too dissimilar from software delivery. The customer expects software (product) but also expects it on certain terms or within certain thresholds (service delivery). A team works hard to deliver the right features and demonstrates them at some frequency; at the demo, the team likely will explicitly ask “is this what you wanted?” What’s often missing is the “are these the terms on which you wanted it?” Whether in the demo or a separate meeting, we need to also review service delivery. This is where we look at whether our service meets expectations: Did we deliver enough? Reliably enough? Respond to urgent needs quickly enough? The good news is that we can quantitatively manage the answers to these questions. Using delivery times, we can assess whether the throughput is within a tolerance. One team used a probabilistic forecast and found that their throughput was not likely to help them reach their deadline in time. Conversely, another realized that they were delivering too fast and could stand to reallocate people to other efforts. Also, for instance, when we set up delivery-time expectations (some people call these SLAs), like delivering standard-urgency work at a 10-day, 85% target, we can then make decisions based on data rather than feelings or intuition (which have their place in some decisions but not others). These expectations needn’t be perfect or “right” to begin; set them and begin reviewing them to see if they are satisfactory.
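As a rough illustration of the probabilistic forecast mentioned above, here is a small Monte Carlo sketch in Python. It assumes the team has a history of weekly throughput counts and resamples from it to estimate the chance of finishing a backlog by a deadline. The function name, the sample history and the backlog size are all hypothetical, and this is only one simple way to do such a forecast, not necessarily the one that team used.

```python
import random


def chance_of_finishing(weekly_throughput, backlog_size, weeks_left,
                        trials=10_000, seed=0):
    """Estimate the probability of completing backlog_size items in weeks_left weeks.

    Each trial draws one past weekly throughput value at random (with
    replacement) for every remaining week and sums the draws.
    """
    if not weekly_throughput:
        raise ValueError("need at least one week of throughput history")
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    hits = 0
    for _ in range(trials):
        done = sum(rng.choice(weekly_throughput) for _ in range(weeks_left))
        if done >= backlog_size:
            hits += 1
    return hits / trials


# Hypothetical history: items finished in each of the last eight weeks.
history = [4, 6, 3, 5, 7, 4, 5, 6]
p = chance_of_finishing(history, backlog_size=40, weeks_left=8)
print(f"Estimated chance of hitting the deadline: {p:.0%}")
```

A low estimate is a prompt to talk with the customer about scope or dates; a very high one may mean people could be reallocated to other efforts.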
Having an explicit review of fitness criteria, especially for service-delivery fitness, is a vital feedback loop for improving. Rather than having the customer walk away dissatisfied for some unknown reason, we can proactively ask and manage those expectations and improve upon them. Often these are the unstated criteria that ultimately define the relationship and create (or erode) trust; discover them and quantitatively manage them.
If you’re using kanban in your environment, you probably use a cumulative-flow diagram. It’s a handy tool to track kanban metrics like cycle time and to quickly see bottlenecks. In addition to all the kanban goodness it gives you, it can also double as a timeline that you can use in your retrospectives.
Whether you use a physical version posted on your kanban board (like my current team does) or an electronic one, you can annotate dates with important events, such as when:
- A team member joins or leaves
- An unexpected technical problem surfaces, like a major refactoring or bug
- The team decides to change something about its kanban, like increasing a WIP limit
- The team makes a positive change, like switching pairs more often
It’s pretty easy to do, especially if you have a physical chart that you update during your standup meeting.
Then, when you have a retrospective, bring the diagram along to help you remember what happened during the period over which you’re retrospecting. If you’re anything like the teams I’ve been on and like me, you have a hard time remembering what happened beyond yesterday, so it’s handy to have a reference. Having this time-based information will help you make more objective decisions about how to improve, since you won’t be guessing so much as to why your cycle time lengthened over the last week, or why you decided to decrease a WIP limit a month ago.
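If your diagram is electronic, the annotations can be as simple as a list of dated notes. The following Python sketch shows one way to pull out the events that fall inside the period a retrospective covers. The `Annotation` class, the `events_in_period` function and the sample events are invented for this example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Annotation:
    """A dated note attached to the cumulative-flow diagram."""
    day: date
    note: str


def events_in_period(annotations, start, end):
    """Return the annotations dated from start to end (inclusive), oldest first."""
    return sorted((a for a in annotations if start <= a.day <= end),
                  key=lambda a: a.day)


# Hypothetical annotations of the kind listed above.
annotations = [
    Annotation(date(2010, 3, 12), "Raised WIP limit on 'In Progress'"),
    Annotation(date(2010, 2, 20), "New team member joined"),
    Annotation(date(2010, 3, 3), "Major refactoring surfaced"),
]

for event in events_in_period(annotations, date(2010, 3, 1), date(2010, 3, 14)):
    print(event.day.isoformat(), event.note)
```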
At the 2010 Agile Grenoble conference, Alexandre Boutin and Emmanuel Etasse presented Behavior Driven Metrics: Even numbers can be agile. I liked it so much that, with Alexandre’s permission, I translated it from French to English and re-presented it today to a few people at Asynchrony. If you’re interested, you can view it for yourself (all instances of unclarity and mistranslation are mine and not the original authors’).
Among the feedback I heard and observations I made today:
- Positive behavior isn’t the same as benefits (from that positive behavior). For instance, if your metric is “number of calories consumed per day,” eating foods with fewer calories is the positive behavior, while losing weight is a benefit (and not a behavior, per se).
- It’s important to understand why you would use a particular metric.
- The metric is not the same thing as the means of gathering and displaying data for the metric. For example, a metric is “number of pair switches in a time period,” and a pairing chart (or in my parlance, “pairamid”) is merely one way of implementing it.
- In the spirit of one of the core concepts — metrics are not certainty — many metrics are best understood in conjunction with another metric (to give context, etc.), especially for guidance in knowing when to start and stop using them and what an optimal level is. For example, we talked about pairing and pair switching. Deciding when to track pair switching and determining the optimal number of pair switches per day is perhaps better understood by looking at some other metrics, like product quality, code ownership, knowledge siloing, etc. When the team starts realizing the benefits of switching pairs more often, or when their marginal utility of the benefit is small, that can inform the team’s decision about when to stop measuring, or at least to have an idea of what is best.
I like Mike Cohn’s testing pyramid as a guideline for test allocation, and I’ve mentioned it to a couple of teams around here (for more, read Patrick Wilson-Welsh’s blog and/or see his Agile 2008 presentation). Lately, on one team, we’ve been very earnest about writing Selenium tests for UATs, even doing some ATDD. But we’re seeing what many have seen: Selenium tests are often (necessarily) long and slow, and occasionally brittle. I asked the team what our “testing pyramid” would look like, and, notionally, it’s something like this:
^^^^^^^^^^^^ (GUI/system)
^^^^^^^^^^ (functional)
^^^^^^^^^^^^^^^^^ (unit)

^^^^^^^^^^ (GUI/system)
^^^^ (integration)
^^^^^^^^^^ (functional)
^^^^^^^^^^^^^^^^^ (unit)
I went ahead and grabbed the actual test numbers from the build(s) — simply the total number of assertions in each of the test levels (unit, functional, integration and UI) — and generated a chart in Excel:
I posted it in our war room; we’ll see what kind of conversation it sparks. It looks like we need to continue moving toward increasing the ratio of webrat (integration) tests to Selenium (UI) tests, as well as upping our base level of units. Using the actual data also corrected my anecdotal assumption that we had a lot more unit tests than we do (see my sketch in my previous post).
UPDATE: I ran the numbers for another team. Here’s their chart:
This team has some UI tests written in Watir, but they don’t run them (so they’re useless). All of their integration tests are Webrat; apparently, these can be run as Selenium tests, but the team isn’t doing that (yet). This team has the fundamentals down well — more unit tests than functional, more functional than integration. We’ll see how they expand the upper levels of their triangle in the coming weeks.
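For anyone who wants to run the same comparison without Excel, here is a minimal Python sketch. It takes assertion counts per test level and reports each level's share of the total, plus whether the counts shrink from the base upward (more unit than functional, more functional than integration, and so on). The function name and the counts are hypothetical; they are not the real numbers from either team.

```python
def pyramid_report(counts, levels=("unit", "functional", "integration", "ui")):
    """Return (shares, is_pyramid) for assertion counts keyed by test level.

    levels is ordered from the base of the pyramid to its tip.
    """
    total = sum(counts[level] for level in levels)
    if total == 0:
        raise ValueError("no assertions counted")
    shares = {level: counts[level] / total for level in levels}
    # A pyramid shape means each level has at least as many assertions as the one above it.
    is_pyramid = all(counts[lower] >= counts[upper]
                     for lower, upper in zip(levels, levels[1:]))
    return shares, is_pyramid


# Hypothetical assertion counts pulled from a build.
counts = {"unit": 420, "functional": 180, "integration": 60, "ui": 90}
shares, is_pyramid = pyramid_report(counts)
for level, share in shares.items():
    print(f"{level:<12} {share:.0%}")
print("Pyramid-shaped:", is_pyramid)
```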