Single-Magnitude Story Sizing (Rather than Same-Sizing)

A practice exists in the agile community of “same-sizing” (sometimes called “single-sizing”) user stories. I tend to cringe when I hear this because it’s often accompanied by questionable guidance that it’s “required” for kanban or continuous flow or NoEstimates (it’s not). But at the recent Lean Agile Scotland conference, I heard an interesting and helpful clarification of at least one person’s version of this practice that switched on a light for me.


Lyndsay Prewer gave a talk on reimagining agile ceremonies in which he touched on “evolving estimation.” He noted that his teams had started following Pawel Brodzinski’s three-choice approach of “estimating”:
  • 1
  • TFB: Too F-ing Big
  • NFC: No F-ing Clue
Sounded good: I like Pawel’s cheeky and simple heuristic. But I’ve seen teams interpret this as meaning all stories need to be a “one,” either in terms of relative sizing (doing only the “ones” in a Fibonacci estimation session, for example*) or absolute time (doing only stories that would take one day). But here’s where it got interesting: Lyndsay then explained that, for his teams, the “single size” (the “one” in Pawel’s approach) meant that the team expected the story to take between two and 10 days to finish (10 days being the total duration of the two-week sprint). This single size, then, wasn’t same-sizing at all: It was single-magnitude sizing. Lyndsay gave me hope that perhaps not everyone who says they’re doing “same-sizing” is actually trying to make upfront guesses about uniform effort and delivery duration.


At the very least, though, it gives us some language to clarify things with: Single-magnitude sizing is indeed a salutary practice, insofar as it accommodates the impossibility of guessing delivery time as well as the need for predictability. A few additional points of guidance and clarification:
  • Saying that any story seems likely to finish within a range of time (good use of time) is different from saying the specific time that it will take (fool’s errand). We’re not saying that Story X is a five (or whatever) but simply saying it’s likely to take 10 days (or 20, etc.) or less. Mike Cohn stated this well when he wrote: “Try to keep most estimates, or at least the most important estimates within about one order of magnitude, such as from 1-10. There are studies that have shown humans are pretty good across one order of magnitude, but beyond that, we are pretty bad.” But Cohn goes too far, in my opinion, with his next advice, which refers to the Fibonacci sequence: “That’s why in the agile estimating method of Planning Poker, most of the cards are between 1-13. We can estimate pretty well in that range.” Data is showing that we can’t (upfront estimates have little correlation with actual delivery times). Playing poker is gambling!
  • The reason we care about sizing within a single magnitude is that it helps us satisfy the assumptions behind Little’s Law, which makes forecasting more reliable.
  • The data should inform the sizing, not the other way around. Rather than starting with the boundary of a sprint, I would start with the actual data of observed delivery times and work backward from there. For instance, if a team finds that stories are sometimes taking up to 15 days to complete, forcing them into a two-week (10-day) sprint cycle is only going to drive counterproductive behavior.
  • If you are using this approach for sprints, then I would make the sprint duration at least twice the highest number in the range. For example, if you’re observing stories to take between two and 10 days, I would make the sprint 20 days long, because you need to accommodate the possibility of starting one of those 10-day stories late in the sprint, which would jeopardize the sprint goal (remember, you don’t really know which stories are going to take 10 days because effort — even if estimated perfectly — is only one of many sources of variation). That’s if you care that everything gets finished in a sprint time box. (And if you don’t, then what are sprint boundaries really doing for you?)
  • It’s not necessary to have every story fit that magnitude range. Here’s where percentile levels on scatterplot charts come in handy. You might choose to accommodate some outliers by using the 85th percentile as your upper range. In the example scatterplot below, we see that the 85th percentile gives a range of delivery times between three and 17 days.

    Delivery-time (aka Cycle Time) scatterplot chart from
  • Sizing stories in this way is different from estimating them further, as in assigning time or story points.
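For reference, here is Little’s Law as it is usually stated for delivery flow. The wording of the three terms is my own labeling, not something taken from Lyndsay’s talk, and the relationship holds only for averages measured over a period in which the system is reasonably stable:

```latex
\[
\text{average delivery time} \;=\; \frac{\text{average work in progress}}{\text{average throughput}}
\]
```

As a purely illustrative reading with made-up numbers: a team that averages 10 stories in progress and finishes an average of one story per day would see an average delivery time of about 10 days.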
So let’s single-magnitude-size our stories. Since the “one” in Pawel’s approach should not refer to relative size (compared with, say, other Fibonacci numbers) or to a unit of time, it’s probably better referred to as something other than a number, like a color. I’ll propose green, since it indicates “go,” as in good enough to go with. So here’s my simplified magnitude-sizing proposal:
  • Green: seems to be within the 85th-percentile range of previous work
  • Red: something other than that
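To make the “green” range concrete, here is a minimal Python sketch of how a team might derive it from its own history. Everything in it is an assumption for illustration: the function names are hypothetical, the nearest-rank percentile method is just one common choice, and the delivery times are made-up sample values chosen to echo the three-to-17-day example above, not real team data.

```python
def percentile_nearest_rank(values, pct):
    """Return the smallest observed value that at least pct% of values are <=.

    pct is a whole-number percentage from 1 to 100 (nearest-rank method).
    """
    if not values:
        raise ValueError("need at least one observed delivery time")
    if not 0 < pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def green_range(delivery_days, pct=85):
    """Return (shortest observed time, pct-th percentile time) in days."""
    return min(delivery_days), percentile_nearest_rank(delivery_days, pct)


# Hypothetical delivery times, in days, for 20 finished stories.
history = [3, 3, 4, 4, 5, 5, 6, 6, 7, 8, 8, 9, 10, 11, 12, 14, 17, 19, 24, 31]

low, high = green_range(history)
print(f"Green: seems likely to finish within {low} to {high} days")
```

Whether a given story “seems to be” within that range is still a judgment call for the team. The code only supplies the range that the judgment refers to.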


*It’s described by at least one person (though I’ve heard it said this way many times): “The idea, at least at high level, is very simple: slice down your tasks until they are all more or less of the same size (1 story point), then your final estimation is just a matter of summing the total number of stories.”



How Many Runs Would Man City’s Seven Goals Have Been?

After Manchester City scored seven goals in their Oct. 14 match against Stoke City, my first reaction was: Wow, they’re playing some beautiful, unselfish soccer. Being also a baseball fan, my second reaction was: That’s a load of goals — how many runs would that equate to in baseball?

To find out, I used the same technique that we can use for understanding the performance and predictability of our knowledge-work systems, such as software delivery.

First, let’s look at the distribution of goals per team in soccer. Since the new English Premier League season has only just begun, I’ll use the data from 2016-17, the most recent complete season of play:

From this we can then start to understand the likelihood of a seven-goal outburst by a single team. For instance, with 246 occurrences in a total of 760 outcomes, a goal total of one is the most likely, at 32.4%. Seven goals happened only once last year, making it 0.1% likely.
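The arithmetic behind those likelihoods is just a relative frequency. Here is a small Python sketch of it; the function name is hypothetical, and the only inputs are the counts quoted above (246 one-goal outcomes, one seven-goal outcome, 760 outcomes in total).

```python
def likelihood_pct(occurrences, total_outcomes):
    """Share of all outcomes with a given score, as a percentage to one decimal."""
    if total_outcomes <= 0:
        raise ValueError("total_outcomes must be positive")
    return round(100 * occurrences / total_outcomes, 1)


TOTAL_OUTCOMES = 760  # one outcome per team per match, as counted in the text

one_goal = likelihood_pct(246, TOTAL_OUTCOMES)
seven_goals = likelihood_pct(1, TOTAL_OUTCOMES)
print(f"One goal: {one_goal}%  Seven goals: {seven_goals}%")
```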

We can do the same for baseball. Let’s look at the runs scored per team for the entire 2017 regular season, which recently concluded:

(That 23-run game was when the Washington Nationals beat the Mets by a landslide on Apr. 30.)

To compare these outliers, we could use something like an average with standard deviations away from that. But the data from both the EPL and MLB are not normally distributed, which renders that approach inappropriate. Instead, we’ll use percentiles. Why? As Dan Vacanti writes in When Will It Be Done?:

Percentiles are not skewed by outliers. One of the great disadvantages of a mean and standard deviation approach (other than the false assumption of normally distributed data) is that both of those statistics are heavily influenced by outliers.

A percentile is simply a level that contains a certain percentage of data points. For instance, if I looked at the Premier League data at roughly the 61st percentile (the “one goal” column), that would mean that about 61% of our outcomes were teams that scored one goal or fewer: the sum of the percentages for zero goals (28.2%) and one goal (32.4%), or 60.6%. We could even draw a curve that shows those numbers:
From the Premier League data, we see that the seven-goal outcome doesn’t happen until the 100th percentile, which makes sense because it was the highest-scoring outcome! We have to go all the way to the 100th percentile of possible outcomes to arrive at seven goals.
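The running total behind such a curve can be sketched in a few lines of Python. The function name is hypothetical, and the input is deliberately partial: it contains only the two percentages quoted above, not the full season’s distribution.

```python
from itertools import accumulate


def cumulative_percentages(percent_by_score):
    """Map each score to the percentage of outcomes at or below that score."""
    scores = sorted(percent_by_score)
    running = accumulate(percent_by_score[score] for score in scores)
    return {score: round(total, 1) for score, total in zip(scores, running)}


# Only the two figures quoted in the text: zero goals and one goal.
epl_partial = {0: 28.2, 1: 32.4}

curve = cumulative_percentages(epl_partial)
print(curve)
```

Reading the result, the entry for one goal is the share of outcomes in which a team scored one goal or fewer.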
So where is the 100th percentile for baseball? Naturally, it will be the highest-scoring run total of the season:
Now we have our answer! Seven goals, at least from recent data from the English Premier League, is equivalent to 23 runs in Major League Baseball.
Okay, so maybe that wasn’t all that interesting, since all we did was take the top outcome from each league. But using the same approach, we could develop a reference table for all of the scoring outcomes.
Percentile   0%    60%   80%   90%    98%     99%     100%
MLB runs     0-4   5-6   7-8   9-11   12-14   15-22   23
EPL goals    0     1     2     3      4       5-6     7
Reading the table, you can make statements like:
  • In 60% of MLB and EPL games, a team scores six or fewer runs and one or fewer goals, respectively.
  • Seven or eight runs (or fewer) in baseball occurs at about the same frequency as two (or fewer) goals in soccer.
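If you wanted to look up these equivalents programmatically, a sketch might look like the following Python. The ranges are copied from the reference table above; the function names and the tuple layout are my own assumptions.

```python
# Score ranges per percentile column, copied from the reference table.
# Each tuple is (lowest score, highest score) for that column.
MLB_RUNS = [(0, 4), (5, 6), (7, 8), (9, 11), (12, 14), (15, 22), (23, 23)]
EPL_GOALS = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 6), (7, 7)]


def _lookup(score, source, target):
    """Find the column whose source range holds score; return the target range."""
    for (low, high), equivalent in zip(source, target):
        if low <= score <= high:
            return equivalent
    raise ValueError(f"score {score} is outside the table")


def equivalent_runs(goals):
    """MLB run range in the same percentile column as an EPL goal total."""
    return _lookup(goals, EPL_GOALS, MLB_RUNS)


def equivalent_goals(runs):
    """EPL goal range in the same percentile column as an MLB run total."""
    return _lookup(runs, MLB_RUNS, EPL_GOALS)


print(equivalent_runs(7))    # seven goals
print(equivalent_goals(10))  # ten runs
```

The lookup is only as good as the table: it reflects one EPL season and one MLB season, so a score beyond either league’s observed maximum raises an error rather than guessing.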
We can apply this same approach to our delivery-time data in software delivery, because, like these professional sports, the data is not normally distributed. In fact the distribution of both leagues probably looks a lot like your team’s (graph it and see!). In knowledge work, as in this little exercise, we’re also trying to determine the probability of a single outcome happening, as in when we ask the question: “When might I expect this user story to be finished?” We can answer that question, and then plan, using percentiles, just like we did with the sports scores, like: “We have a 90% confidence that we’ll complete any given next user story in 11 days or fewer.” And like the sports scores, the longer the range in the “tail” the farther it pushes out our highest confidence intervals.

So the next time someone asks you about the likelihood of your favorite sports team — whatever the sport — scoring a certain number, you’ll know what to do — just as you will in your own team when someone asks when to expect a single piece of work to be finished.

Special thanks to Dan Vacanti for the insights from his recent book, When Will It Be Done?