This story begins about a year after I joined a financial institution. The company was going through another restructure, involving the reorganisation of teams and the reprioritisation of work, which in large part affected IT teams.
Teams were formed around different initiatives and systems. However, there was a subset of systems considered ‘legacy systems’, which weren’t factored into the restructure. People were allocated to these teams through negotiation between middle-layer managers, without consultation with the people who would be most affected. Yikes!
As a result, there were people who weren’t allocated to any team, who were then put together to form the team that I would be coaching and supporting. This team was named the Technical Optimisation of Applications team (or ToA for short).
The ToA team was responsible for looking after nine different legacy systems spread across three product owners (POs). It consisted of seven developers, two dedicated testers, and myself as the Scrum Master. The developers and testers had varying levels of understanding of the nine different systems, meaning we needed to co-create a plan to upskill.
It’s worth mentioning that, originally, this team was called the Technical Maintenance for Applications team. Now, language is an extremely important tool in setting expectations. Using the word ‘maintenance’ set the expectation that the team would only be used to clean up and apply fixes. I wanted the team to adopt a mindset more centred on improvement. Hence, the word ‘optimisation’ was chosen. Quite an improvement on ‘maintenance’, right?
The first couple of weeks working with the team were particularly rough. Had I known what I know now, I would have gathered the team to co-create our purpose and develop a working agreement within the team, enabling them to get to know each other as people, colleagues and team members.
Right off the bat, there was a perception that the legacy systems the team was looking after were going to be decommissioned. This was repeatedly voiced to the team directly by middle-layer managers, but with no plan or schedule in sight.
One thing that stuck in my mind about this was a conversation I had with a manager: “Do you know what we call your team? We call it the ‘Graveyard Team’. You know, like where applications go to die.” My response to him was: “If we are indeed the ‘Graveyard Team’ and you have work that needs to be done on these dying systems, why would we prioritise it to be done?” Followed by radio silence!
Three POs were assigned to nine systems, and two out of the three had some previous experience as POs. However, they were all disengaged because the legacy systems they were working on weren’t considered interesting or new.
Other teams tried passing work they didn’t want along to the ToA team, so we quickly realised we needed a better working mechanism in place—one that incorporated a backlog, and was understood and adhered to by all.
Over 400 backlog items (yes, 400, that’s not a typo) which were up to five years old had built up over the years across the nine systems that the team were responsible for. We knew we had to find a way to reduce these. In addition to this obstacle, the time that it took to deploy changes across systems ranged from three to six months, which impacted our ability to respond quickly to change.
The team was also tasked with tackling any priority issues. The financial institution had a system where ‘priority 1’ and ‘priority 2’ issues needed to be dealt with immediately. P1s wouldn’t happen very often, but when they did, they were all-hands-on-deck situations. P2s were similar, but had a different service-level agreement. P3s and P4s were put into the backlog.
Taking the obstacles above into account, as a team, we created the following objectives and key results:
We split the OKRs into three phases, each for 90 days, and dove right in...
The whole team agreed that the most important thing to work on immediately was reducing the number of issues in the backlog. The first thing we did was delete any issues that were over four years old, thereby testing our hypothesis that if we had deleted something important, someone would come back and let us know.
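The culling rule above is simple enough to sketch. The snippet below is a minimal illustration (not the team's actual tooling, and the issue keys and dates are made up): given a list of backlog items with creation dates, it flags everything older than four years as a deletion candidate.

```python
from datetime import date, timedelta

def stale_issues(issues, today, max_age_years=4):
    """Return issues older than max_age_years: candidates for deletion
    under the 'if it mattered, someone will ask for it back' hypothesis."""
    cutoff = today - timedelta(days=365 * max_age_years)
    return [i for i in issues if i["created"] < cutoff]

# Hypothetical backlog entries across two of the nine systems
backlog = [
    {"key": "SYS1-101", "created": date(2014, 3, 1)},   # nearly five years old
    {"key": "SYS2-450", "created": date(2018, 6, 15)},  # recent enough to keep
]
to_delete = stale_issues(backlog, today=date(2019, 1, 1))
# only SYS1-101 is flagged
```

In a real Jira instance the same filter is a one-line JQL search (something like `created <= -208w`), but the principle is identical: pick an age threshold the team agrees on and apply it mechanically rather than debating items one by one.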
The second thing we did was look at the remaining list to understand which of the nine systems had the most long-standing issues. We also looked at the frequency of deployments into production to see if we could ascertain which of the nine systems had the most activity, and therefore the most urgency.
Once the team understood this, they were able to prioritise which systems to focus on first. For each of the systems, we took the list of issues to the POs to reprioritise, to ensure they were still valuable.
With the POs armed with this information, the team implemented a 30-minute, time-boxed triage meeting at the start of each sprint, with all the system POs and the entire team in attendance. Each PO would present their priority items and make the case to the other POs for why their request was more important than the others, as the team had limited capacity. If we ran out of time to discuss the agreed-upon items, it was understood by the team that this meant one of two things:
a) The PO and the team were not clear on what needed to be done for the work item
b) Large pieces of work needed to be broken down into smaller parts
We decided as a team that in each two-week sprint we would have the capacity to work on eight items. The rule of thumb for each item was that it couldn’t take more than three days to develop and test. If the team decided that a task was larger than a three-day effort, it was the PO’s responsibility to break it down further.
For items that took less than three days, the PO needed to explain why this work item was important for them. The team was able to ask questions, discuss, and when this back-and-forth was completed everyone could vote as to whether or not this piece of work should go into the next sprint. Democracy can work wonders in these situations.
The voting process was simple, thumbs up 👍 or thumbs down 👎 but the votes needed to be unanimous. For example, if everyone except one person voted to do a piece of work, then it was either up to the people who voted 👍 to convince the person who voted 👎 otherwise, or vice versa.
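The unanimity rule is worth spelling out, because it is stricter than a simple majority. As a toy sketch (the names are invented), the outcome of a vote falls into one of three buckets: unanimous 👍 accepts the item, unanimous 👎 rejects it, and anything in between sends the room back into discussion, naming the dissenters so the conversation has somewhere to start:

```python
def sprint_decision(votes):
    """Apply the team's unanimous-vote rule to a triage item.

    votes: dict mapping team member -> True (thumbs up) or False (thumbs down)
    Returns 'accepted', 'rejected', or 'discuss: <dissenting minority>'.
    """
    down = [name for name, v in votes.items() if not v]
    if not down:
        return "accepted"                      # unanimous thumbs up
    if len(down) == len(votes):
        return "rejected"                      # unanimous thumbs down
    # Split vote: keep talking until one side convinces the other
    minority = down if len(down) <= len(votes) - len(down) else \
               [n for n in votes if n not in down]
    return "discuss: " + ", ".join(minority)

sprint_decision({"Ana": True, "Ben": True, "Caz": True})    # "accepted"
sprint_decision({"Ana": True, "Ben": False, "Caz": True})   # "discuss: Ben"
```

The point of forcing unanimity wasn’t the vote itself but the conversation it triggered: a single 👎 meant the item wasn’t yet understood or justified well enough.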
The team had daily scrums to discuss what they did the previous day, what they were planning to do on the current day, any issues/blockers they had, and whether they needed help.
At the end of each sprint the team had a sprint review with POs and stakeholders, evaluating the work that was done so they could provide input and direction. As they discussed these completed work items, they also discussed which release these changes would go into, so the POs could gain visibility of non-functional requirements like change awareness and comms to the affected customers.
The team would then have their own sprint retrospectives, which were quite standard, with the one exception that we would ask extra questions: “What’s one thing you learned this sprint?” and “What’s one thing that you would like to learn next sprint?”. This encouraged a learning mindset, and facilitated discussion around which team members could pair in the following week to build capability.
Every now and then, P1 and P2 incidents on the systems we looked after would occur. All current work would stop, the team would go into incident mode, and swarm to resolve the issue. It didn’t matter if you knew the system or not, as having visibility and context of why the team needed to stop working and solve the incident raised understanding of the system as a whole.
Using all the aforementioned mechanics, the team was able to reduce the number of work items in the backlogs of nine different systems to zero in three months, all while building their ability to manage new work items generated by POs.
Using Jira, we were able to determine that—before working this way—the average age of work items from creation to production was about 12 months. By the time I left the organisation, the average age was 14-30 days 🙌
Once the team reduced the number of backlog items to zero across systems and started delivering business outcomes more effectively, the POs started generating new work items for the systems. These new work items would go through the same process as before, using the triage method.
The team was now delivering consistently on around six to eight work items per sprint, and discussing improvements to make their lives easier (e.g. testing and deployments). The team, along with the POs, agreed to reduce the number of work item slots to six and use the remaining two slots as system improvement slots.
The team started to generate their own system improvement ideas into the backlog. These ideas were discussed in the same triage meeting, using the same process, ensuring that the team and the POs understood why it was important and what precisely needed to be done.
A rule was also put in place where no improvement work would be started unless all the requested work items in the six slots were completed first. This provided an indication of whether we were effective as a team in completing the work in the six slots, or if we needed to adjust the work limit.
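The gating rule reduces to a single comparison, which is worth making explicit because it doubles as the team's capacity health-check. A tiny sketch (slot counts taken from the text; the function name is mine):

```python
def can_start_improvement(requested_done, requested_total=6):
    """Improvement slots unlock only once all six requested-work
    slots for the sprint are complete. If this rarely returns True,
    the six-slot limit itself probably needs adjusting."""
    return requested_done >= requested_total

can_start_improvement(5)  # False: requested work still in flight
can_start_improvement(6)  # True: the two improvement slots open up
```

The nice property of this rule is that it makes the trade-off visible: improvement time is earned by finishing the committed work, and a sprint that never earns it is a signal about the work limit, not a failure of the team.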
Improvement, that’s the key word here. Once the team had the time, the people and the morale to better their processes and understand their capacity, everything changed for the better. Simply surviving, keeping your head above water—especially in a development environment—is not conducive to good business or happy teams.
Wrapping up, a few bullet points that you might incorporate into your own projects and workflows...
What did we discover/learn along the way?
A few rules of thumb