How to save $70M USD with Chaos Management

Matan Cohen
Published in Wix Engineering
5 min read · Oct 29, 2022

By https://unsplash.com/@brett_jordan

Abstract

Managing requires constant improvisation. You have to think creatively to keep your team members in sync with each other’s projects and agendas. At the same time, you want to ensure that no one in your organization is the sole holder of significant knowledge or builds job security on it.

Here, I would like to share with you one of my management strategies — Chaos Management.

Chaos Engineering

One of the teams I lead at Wix is responsible for creating the Chaos platform.

The Chaos Engineering concept was introduced by Netflix to test production resilience by intentionally breaking production and making sure that both you and your dependencies are resilient and can self-heal.

Wikipedia describes the concept as follows:

“In software development, a given software system’s ability to tolerate failures while still ensuring adequate quality of service — often generalized as resiliency — is typically specified as a requirement. However, development teams often fail to meet this requirement due to factors such as short deadlines or a lack of knowledge of the field. Chaos engineering is a technique to meet the resilience requirement.

Chaos engineering can be used to achieve resilience against infrastructure failures, network failures, and application failures.”

As an example, you might want to take down one of your databases or microservices and make sure that everything that depends on it can handle the outage (through retry mechanisms, exponential backoff, etc.).
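To make the retry idea concrete, here is a minimal Python sketch of a retry with exponential backoff. It is an illustration only, not the Wix Chaos platform's code: the function name `call_with_retry`, its parameters, and the choice of `ConnectionError` as the retryable failure are all assumptions.

```python
import random
import time


def call_with_retry(operation, max_attempts=5, base_delay=0.1, max_delay=2.0,
                    sleep=time.sleep):
    """Call operation(); on ConnectionError, back off exponentially and retry.

    Hypothetical helper: the names and default values are assumptions.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # The delay ceiling doubles on every attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps many clients from retrying at the same moment.
            sleep(random.uniform(0, delay))
```

A chaos experiment that takes the dependency down for a short while is one way to check that callers wrapped like this actually recover instead of failing on the first error.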

It sounds great, but how does it relate to management?

Chaos management

Borrowing from chaos engineering, I thought it would be helpful to bring some of its concepts into management.

There are many ways to create “Chaos management”: take a long vacation, rotate team members once a quarter to a different team for a week, swap the agenda, etc. This time I want to focus on long vacations and the power they give us as managers.

In short, I encourage my team members and team leaders to take long vacations, around 3–4 weeks, in order to throw things off balance and create chaos. In my opinion, a short vacation does not lead to chaos, since usually nothing happens in such a short period of time; chaos begins around two to three weeks into the vacation. There is always a production incident that the rest of the team must resolve on their own. In addition, they will become experts in fields that they would otherwise not touch, or did not even know existed.

Long vacations also prevent your team members from creating “job security”.

Preventing job security

As managers, how often have you asked yourself, “What would I do if this person leaves tomorrow?” All of you have, probably. What are you doing about it?

An employee with a field of expertise that only they understand is one of the most problematic situations for a company. Such employees may also hold key positions that depend on specialized knowledge of particular technologies. My experience as a manager has taught me that they don’t take long vacations because “nobody can handle all the knowledge”. They will even work during their short vacations. They may be very competent, but no one else can really assess that.

If you enforce a long vacation policy in your organization and ask your employees to transfer knowledge to other members of the team, you solve one of the most fundamental issues a company faces. The employee must pass critical knowledge on to the rest of the team, and the team must then troubleshoot production issues without them.

Sometimes an employee who holds a key position can cost a lot, as much as $70 million.

Prevent $70M fraud

I want to tell you the story of Eti Alon.

In April 2002, Alon admitted to stealing over $70M USD over five years while she was the deputy head of the Trade Bank. Alon had transferred money from customers’ accounts to her brother, Ofer Maximov, who had accumulated approximately $25M in gambling debts to organized crime groups. Alon was convicted in 2003, and her brother was sentenced to 15 years in prison.

On April 25, 2002, Eti Alon walked into the fraud squad headquarters and confessed that she had stolen a quarter of a billion shekels from the bank’s customers over five years.

Alon was able to go undetected largely because she refused to take vacations for years, even when she was sick. Had someone replaced her, even briefly, they would probably have noticed that something was wrong and informed their superiors. As a result, vacation regulations were tightened to prevent bank employees from refusing time off in order to hide nefarious activities.

Bank managers now take “Eti Alon vacations”: long vacations that every bank manager must take to prevent such extreme cases in the future.

In a very radical way, this story shows the benefits of long vacations and how badly an organization can suffer when they are not taken.

Even in the infrastructure field, you should rotate your infrastructure (EC2 instances, keys, etc.) often for many reasons, among them avoiding security breaches. “Chaos” is needed once again.
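As a rough illustration of the rotation idea, the Python sketch below flags keys that are older than a chosen maximum age. It does not call any cloud API. The function name `keys_due_for_rotation`, the input mapping of key names to creation times, and the 90-day limit in the usage note are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def keys_due_for_rotation(created_at_by_key, max_age, now=None):
    """Return the names of keys older than max_age, sorted alphabetically.

    Hypothetical helper: created_at_by_key maps a key name to the
    timezone-aware datetime at which that key was created.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        name
        for name, created_at in created_at_by_key.items()
        if now - created_at > max_age
    )
```

Calling it with `max_age=timedelta(days=90)` on a schedule would give a list of keys to rotate next. The right limit depends on your own security policy.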

Chaos is coming. Be prepared.

Preparation is key to a successful chaos experiment. Start with a small blast radius to see whether the impact can be controlled easily without major consequences. Test there first, then widen the blast radius step by step until you reach production with automatically scheduled failures. You should also prepare everyone in advance: it is imperative that you notify all the parties who interact with the system that you intend to break the dependency.
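The steps above can be sketched as a small Python loop: notify everyone first, then widen the blast radius one stage at a time and stop as soon as a stage fails to recover. This is only a sketch under assumptions. The function `run_staged_experiment` and the callbacks `inject_failure`, `is_healthy`, and `notify` are hypothetical placeholders for whatever tooling you actually use.

```python
def run_staged_experiment(stages, inject_failure, is_healthy, notify):
    """Run a chaos experiment stage by stage, smallest blast radius first.

    Hypothetical sketch: stages is an ordered list of stage names, and the
    three callbacks stand in for your real tooling.
    Returns the list of stages that recovered successfully.
    """
    # Tell every party that interacts with the system before breaking anything.
    notify(f"Chaos experiment planned for stages: {', '.join(stages)}")

    completed = []
    for stage in stages:
        inject_failure(stage)
        if not is_healthy(stage):
            # Do not widen the blast radius past a stage that did not recover.
            notify(f"Aborting: {stage} did not recover")
            break
        completed.append(stage)
    return completed
```

For example, calling it with the stages `["dev", "staging", "production"]` reaches production only if both earlier stages recovered on their own.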

Chaos management is no different. Make sure the other teams are prepared and that there are no blind spots (it will never be 100%, which is ok).

For example, someone who schedules a vacation could start preparing a few months ahead: pass on their knowledge, let others be “on call” in that area, leave production issues to other team members, and so on. This widens the blast radius gradually while keeping it under control.

Last but not least, let them completely disconnect from your workplace communication apps (Slack, or any other you use). Disconnecting has a psychological effect, and both the individual and the team must commit to it for this to succeed.

Don’t be afraid of chaos, embrace it.

Many teams and organizations find it tough to allow such vacations, and for a variety of reasons it is sometimes challenging to require everyone to take vacations that long. Regardless, I believe it is something worth striving for in order to have an organization that works well.
