This story begins about a year after I joined a financial institution. The company was going through another restructure which involved building collaborative cross-functional teams, and the reprioritisation of work, which in large part affected IT teams.
Teams were formed around different initiatives and systems. However, there were a subset of systems that were considered ‘legacy systems’, which weren’t factored into the restructure. People were allocated to these teams through negotiation between middle layer managers without consultation with the people who would be most affected. Yikes!
As a result, there were people who weren’t allocated into any teams, who were then put together to form the team that I would be coaching and supporting. This team was deemed the Technical Optimisation of Applications team (or ToA for short).
The ToA team was responsible for looking after nine different legacy systems spread across three product owners (POs). It consisted of seven developers, two dedicated testers, and myself as the Scrum Master. The developers and testers had varying levels of understanding of the nine different systems, meaning we needed to co-create a plan to upskill.
It’s worth mentioning that, originally, this team was called the Technical Maintenance for Applications team. Now, language is an extremely important tool in setting expectations. Using the word ‘maintenance’ gave the expectation that the team would only be used to clean up and apply fixes. I wanted the team to adopt a mindset more centred on improvement. Hence, the word optimisation was chosen. Quite an improvement on “maintenance” right?
The first couple of weeks working with the team were particularly rough. Had I known what I know now, I would have gathered the team to co-create our purpose and develop a working agreement within the team, enabling them to get to know each other as people, colleagues and team members.
Right off the bat, there was a perception that the legacy systems the team was looking after were going to be decommissioned. This was repeatedly voiced to the team directly by middle layer managers, but with no plan/schedule in sight.
One thing that stuck in my mind about this was a conversation I had with a manager: “Do you know what we call your team? We call it the ‘Graveyard Team.’ You know, like where applications go to die’”. My response to him was: “If we are indeed the ‘Graveyard Team’ and you have work that needs to be done on these dying systems, why would we prioritise it to be done?” Followed by radio silence!
Three POs were assigned to nine systems, and two out of the three had some previous experience as POs. However, they were all disengaged because the legacy systems they were working on weren’t considered interesting or new.
Other teams tried passing work they didn’t want along to the ToA team, so we quickly realised we needed a better working mechanism in place—one that incorporated a backlog, and was understood and adhered to by all.
Over 400 backlog items (yes, 400, that’s not a typo) which were up to five years old had built up over the years across the nine systems that the team were responsible for. We knew we had to find a way to reduce these. In addition to this obstacle, the time that it took to deploy changes across systems ranged from three to six months, which impacted our ability to respond quickly to change.
The team was also tasked with tackling any priority issues. The financial institution had a system where ‘priority 1’ and ‘priority 2’ issues needed to be dealt with immediately. P1’s wouldn’t happen very often, but when they did, they were all-hands-on-deck situations. P2’s were similar, but had a different service-level agreement. P3’s and P4’s were put into the backlog.
Taking the obstacles above into account, as a team, we created the following objectives and key results:
- We want to understand and resolve P1 and P2 incidents effectively and quickly
- We want to deploy value-add changes into production effectively with no issues
- We want to allow enough capacity within the team to make improvements
- We want to increase our capabilities across the nine systems
We split the OKRs into three phases, each for 90 days, and dove right in...
PHASE 1: Reduce the number of backlog items to zero across all systems within three months
The whole team agreed that the most important thing to work on immediately was reducing the number of issues in the backlog. The first thing we did was delete any issues that were over four years old, therein testing our hypothesis that if we had deleted something important someone would come back and let us know.
The second thing we did was look at the remaining list to understand which of the nine systems had the most long-standing issues. We also looked at the frequency of the change of deployments into production to see if we could ascertain which of the nine systems had the most activity, and therefore the most urgency.
Once the team understood this, they were able to prioritise which systems to focus on first. For each of the systems, we took the list of issues to the POs to reprioritise, to ensure they were still valuable.
PHASE 2: Deliver valuable business outcomes more effectively
With the POs armed with this information, the team implemented a 30-minute, time-boxed triage meeting at the start of each sprint, with all the system POs and the entire team in attendance. Each PO would discuss their priority items and discuss with the other POs why their request was more important than the others, as the team had limited capacity. If we ran out of time to discuss the agreed upon items, it was understood by the team that this meant one of two things:
a) The PO and the team were not clear on what needed to be done for the work item
b) Large pieces of work needed to be broken down into smaller parts
We decided as a team that in each two-week sprint they would have the capacity to work on eight items. The rule of thumb for each item was that it couldn’t take more than three days to develop and test. If the team decided that the task was larger than a three-day effort, it was the POs responsibility to break it down further.
For items that took less than three days, the PO needed to explain why this work item was important for them. The team was able to ask questions, discuss, and when this back-and-forth was completed everyone could vote as to whether or not this piece of work should go into the next sprint. Democracy can work wonders in these situations.
The voting process was simple, thumbs up 👍 or thumbs down 👎 but the votes needed to be unanimous. For example, if everyone except one person voted to do a piece of work, then it was either up to the people who voted 👍 to convince the person who voted 👎 otherwise, or vice versa.
The team had daily scrums to discuss what they did the previous day, what they were planning to do on the current day, any issues/blockers they had, and whether they needed help.
At the end of each sprint the team had a sprint review with POs and stakeholders, evaluating the work that was done so they could provide input and direction. As they discussed these completed work items, they also discussed which release these changes would go into, so the POs could gain visibility of non-functional requirements like change awareness and comms to the affected customers.
The team would then have their own sprint retrospectives, which were quite standard, with the one exception that we would ask extra questions: “What’s one thing you learned this sprint?” and “What’s one thing that you would like to learn next sprint?”. This encouraged a learning mindset, and facilitated discussion around which team members could pair in the following week to build capability.
Every now and then, P1 and P2 incidents on the systems we looked after would occur. All current work would stop, the team would go into incident mode, and swarm to resolve the issue. It didn’t matter if you knew the system or not, as having visibility and context of why the team needed to stop working and solve the incident raised understanding of the system as a whole.
Using all the aforementioned mechanics, the team was able to reduce the number of work items in the backlog of nine different systems to zero in three months, all while building their abilities to manage new work items generated by POs.
Using Jira, we were able to determine that—before working this way—the average age of work items from creation to production was about 12 months. By the time I left the organisation, the average age was 14-30 days 🙌
PHASE 3: Enhancements to our systems
Once the team reduced the number of backlog items to zero across systems and started delivering business outcomes more effectively, the POs started generating new work items for the systems. These new work items would go through the same process as before, using the triage method.
The team was now delivering consistently on around six to eight work items per sprint, and discussing improvements to make their lives easier (e.g. testing and deployments). The team, along with the POs, agreed to reduce the number of work item slots to six and use the remaining two slots as system improvement slots.
The team started to generate their own system improvement ideas into the backlog. These ideas were discussed in the same triage meeting, using the same process, ensuring that the team and the POs understood why it was important and what precisely needed to be done.
A rule was also put in place where no improvement work would be started unless all the requested work items in the six slots were completed first. This provided an indication of whether we were effective as a team in completing the work in the six slots, or if we needed to adjust the work limit.
Improvement, that’s the key word here. Once the team had the time, the people and the morale to better their processes and understand their capacity, everything changed for the better. Simply surviving, keeping your head above the water—especially in a development environment—is not conducive to good business or happy teams.
Wrapping up, a few bullet points that you might incorporate into your own projects and workflows...
What did we discover/learn along the way?
- We didn’t fully realise until six months in that we were a fully-funded team with the freedom to work on whatever improvements we wanted.
- There was a lot of manual work entering issues and defects from our service management systems to Jira. To resolve this bottleneck, we developed an integration component that synchronises with Service Manager, so that any issue that gets logged there and assigned to our team will automatically generate a Jira card and place it into our backlog.
- We discovered that, without a clear reason to work on a system, just having a developer explain the system to another dev and tester was not effective. Our goal here became actually learning the system from doing the work, rather than simply being told how the system functions. Out of this evolved a working agreement that any piece of work needed to have one primary developer (someone who knows the system in and out) to act in a consulting role, and another dev and another tester to work together on how to solve the issue.
- It was important for us as a team (including the PO) to understand how we wanted to engage with each other, and what events and processes we wanted to follow. By defining and agreeing to terms within our own team we were able to back each other up when other managers sought to push work onto the team.
- Eventually the POs that had their work done for a few consecutive weeks would acknowledge this pattern, and would negotiate with the other POs to allow other pieces of work to be done.
A few rules of thumb
- Pair together, and have a primary person on a system teaching another team member. This improves capabilities across systems over time. Fully realising this goal means no longer having any single point dependencies in your team.
- Fix the amount of available work item slots per sprint (eight slots) to increase transparency about workload. Fully realising this goal means your PO understands or can challenge the number of available slots.
- Only allow the team to work on enhancements after all outstanding issues are resolved. This allows you to focus more effectively on improvements. The desired endpoint here is having zero issues in your backlog, and the capacity to work on new requests and enhancements.
- Refuse any work from POs larger than three days worth of effort, request that they break it down further into smaller pieces.