Part 3: How we delivered V-platform, the Infrastructure Platform at VTS

Pavel Susloparov
Published in Building VTS
Aug 7, 2022 · 7 min read


Since its inception, VTS had used Heroku as its cloud provider and hosted all of its applications there. Heroku was a great choice at that stage, as it helped us run our applications in a cloud environment with minimal infrastructure management overhead. However, with rapid growth in customers, product lines, and geographical locations, there came a point where we needed additional capabilities to accelerate product development at scale. In 2021, we made the bold choice to embark on a cloud infrastructure transformation journey to support that growth. This journey had two phases:

  1. We built a state-of-the-art infrastructure platform on AWS and named it the V-platform.
  2. We worked with several teams to onboard and migrate their applications from Heroku to AWS (the V-platform).

We set ambitious goals of completing our journey in less than a year without halting feature development and without clients feeling any impact from the migration.

The previous articles in this series describe the technical details of the centralized tooling (V-platform) we developed in-house and why we decided to migrate from Heroku to AWS. This article describes the details of running the project itself, one of the biggest and most complicated that VTS has undertaken to date. Read on to learn how we achieved this!

The Project Journey

Our journey to migrate cloud providers began in January 2021. After an in-depth analysis of our technical requirements and a detailed review of the leading cloud providers in the market, we decided to switch to a different vendor. For reasons outlined in a previous blog post, we decided to move forward with AWS.

At this stage, we did not have a fully formed infrastructure team at VTS. However, a small group of engineers worked on putting together a high-level migration plan. VTS also brought on a migration partner, Mission Cloud Services, to offer support in building the new platform and migrating applications. We officially kicked off this project in March 2021.

We then assembled a dedicated group of internal engineers to run the project, and the team grew to three members at the start of Q2 2021. At this point, the constant challenge was defining the scope, since a lot of information, knowledge, and context is missing at the beginning of a project. To ensure that we were focusing on the right set of problems, we leveraged Mission Cloud’s Change Request Procedure and dedicated technical support to change direction or re-focus priorities as needed.

However, while we could keep the project moving forward, we quickly realized that a group of three internal engineers was insufficient to complete the project within the ambitious timelines. We took two approaches to manage this: we brought in subject matter experts (SMEs) from application developer teams, and we aggressively hired more dedicated infrastructure engineers to support the team with the migration and beyond. We welcomed three more engineers to the central team over the next six months and managed the onboarding of new team members and the distribution of knowledge among them well. In Q3, we completed the building blocks for V-platform and started planning the migration of the existing infrastructure and applications. We also trained the SMEs on the infrastructure we had set up so far.

The early involvement of product teams was a tremendous help in building, testing, and executing migration plans, and it also provided an educational opportunity for the SMEs. In addition, we involved the central QA team early on while planning the application migrations. They provided valuable recommendations and were irreplaceable allies in testing the completeness and quality of the migrated applications.

We constantly communicated with Engineering Managers (EMs) and VTS technical leadership about the status of the project and their teams’ involvement in it. Staying connected helped us ensure there were no adverse effects on product deliverables.

We decided to complete the migrations in late Q4 (early to mid December). We chose this time because traffic is quieter then, giving us time to monitor the new infrastructure over the holidays. By Q1 2022, we were ready to receive higher traffic from all our clients on our applications running on AWS.

Culture change

Before this migration, a central team managed the infrastructure on Heroku. This team operated as a service team. They received tickets from other engineering teams and handled all infrastructure, performance, and availability-related problems. The team used Kanban to manage their work, and their success metric was the number of incoming tickets compared with the number of completed tickets.

The ratio of application developers to infrastructure engineers at the time was 80:3, or roughly 27 developers per infrastructure engineer. This was not a sustainable working model, especially given the company’s growth plans. Considering this bottleneck, we decided to make a fundamental change.

The change came with the team’s transformation to the SRE (Site Reliability Engineering) model, where a central infrastructure team builds self-service tools and provides best practices to the rest of the organization. In particular, we focused on bringing about the following mindset changes:

  • We decided to invest in long-term infrastructure solutions as opposed to short-term fixes.
  • We removed the metric of incoming vs. completed tickets for the team.
  • We fostered the belief that infrastructure is a shared responsibility of all engineers in the company.
  • We inspired engineers to develop the skills to build technology solutions with AWS SaaS support.

This mindset shift changed the team’s focus from short-term fixes to long-term solutions that would benefit the company for many years. It also helped reinforce the mindset that infrastructure is everyone’s responsibility.

Learning and Training

Changing infrastructure providers is a massive undertaking for any engineering organization. It comes with the price of educating all engineers on how to follow new processes and use new tools. Our approach was to:

  • Write project documentation that answers the questions why, what, who, and when, and outlines requirements.
  • Communicate milestones and progress through all-hands meetings, email, and Slack.
  • Write technical documentation for the infrastructure components.
  • Write technical documentation for the infrastructure and application migration sequence (runbooks).
  • Conduct presentations about AWS SaaS offerings.

Earlier in the project, the infrastructure team tried to balance the actual migration effort with training the organization on AWS. We tried to communicate the new cloud platform offerings through presentations and hands-on training. The problem was that application engineers either did not have AWS access or did not have time to try out the AWS offerings alongside their other work commitments. In addition, we quickly realized that the infrastructure team did not have enough time or people to run training for the entire organization.

Our response to this feedback was to outsource education to A Cloud Guru and shift the infrastructure team’s focus to the continuous planning and execution of the migration project. A Cloud Guru provided on-demand test AWS accounts and step-by-step instructions on navigating AWS. Leadership supported training the engineering organization by requiring that all engineers complete the AWS Certified Cloud Practitioner training track and by giving them dedicated time to do so. This was a great help in familiarizing the engineering organization with the primary AWS offerings.

For in-depth training, we conducted a series of DevOps Katas for the group of engineers who monitor production environments and provide operational support for applications in production. These sessions covered everyday operations that engineers can use to troubleshoot and maintain their applications running on AWS infrastructure in production.

Planning Details

The success of a migration is entirely dependent on pre-planning. For us, planning focused on three areas:

  • A high-level project milestones plan with frequent updates and refinement
  • Constant creation and modification of a Work Breakdown Structure (WBS) for each deliverable
  • Migration runbooks

High-level project milestones were shared through different channels in varying degrees of detail so that each audience received the level of granularity it needed.

WBS creation required constant grooming within the team and fast execution of proofs of concept to validate our assumptions.

Migration runbooks were written, reviewed, and executed multiple times to ensure flawless execution of the plan on cutover day. We used Google Sheets to develop the runbooks. The format was simple, with one section per phase and the same columns in each:

Tasks one week before cutover | Assignee | Notes | Follow Up
Tasks one day before cutover | Assignee | Notes | Follow Up
Tasks during the cutover | Assignee | Notes | Follow Up
Tasks after cutover | Assignee | Notes | Follow Up

We ran the runbook execution with an Incident Management approach. We had an Incident Commander (IC) and Responders (the people who execute steps). The IC was responsible for overall coordination and communication, while the Responders executed the tasks assigned to them. We conducted multiple refinement sessions for each application migration and performed several dry runs in lower environments.
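To make the runbook format and the IC/Responder split more concrete, here is a minimal Python sketch. It is only an illustration and not the tooling we used (our runbooks lived in Google Sheets). The class names, the `execute` method, the application name, the people, and the task descriptions are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    """Runbook sections, listed in the order they are executed."""
    WEEK_BEFORE = "Tasks one week before cutover"
    DAY_BEFORE = "Tasks one day before cutover"
    DURING = "Tasks during the cutover"
    AFTER = "Tasks after cutover"


@dataclass
class Task:
    phase: Phase
    description: str
    assignee: str        # the Responder who executes this step
    notes: str = ""
    follow_up: str = ""
    done: bool = False


@dataclass
class Runbook:
    application: str
    incident_commander: str  # coordinates and communicates; executes no steps
    tasks: list[Task] = field(default_factory=list)

    def by_phase(self, phase: Phase) -> list[Task]:
        return [t for t in self.tasks if t.phase is phase]

    def responders(self) -> set[str]:
        return {t.assignee for t in self.tasks}

    def execute(self, dry_run: bool = True) -> list[str]:
        """Walk the phases in order and return a log of the steps taken.

        A dry run only logs the steps; a real run also marks them done.
        """
        mode = "DRY RUN" if dry_run else "CUTOVER"
        log = []
        for phase in Phase:  # Enum iteration follows definition order
            for task in self.by_phase(phase):
                log.append(
                    f"[{mode}] {phase.value}: {task.assignee} -> {task.description}"
                )
                if not dry_run:
                    task.done = True
        return log


# Hypothetical example data, entered out of order on purpose.
runbook = Runbook(
    application="example-app",
    incident_commander="alice",
    tasks=[
        Task(Phase.DURING, "Switch DNS to the new load balancer", "bob"),
        Task(Phase.WEEK_BEFORE, "Announce the maintenance window", "carol"),
        Task(Phase.AFTER, "Watch error rates", "bob", follow_up="Review dashboards"),
    ],
)

for line in runbook.execute(dry_run=True):
    print(line)
```

Keeping the phase on every task means the sheet can be filled in any order and still be executed in sequence, and the dry-run flag mirrors the rehearsals in lower environments: the same steps are walked through without being marked as completed.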

Finally, we instituted an on-call rotation for the infrastructure team so that a primary responder could be reached easily in case of an outage and help troubleshoot and mitigate any issues.

This meticulous preparation resulted in extremely smooth cutovers, with no significant incidents on the final cutover day or the subsequent days for any application at VTS. We finished migrating the final VTS application on December 11th, 2021, giving us three weeks of quiet time to observe the applications before the start of the new year.

Credits

The migration to AWS was a truly remarkable project, made possible by the talented VTS infrastructure team, AWS vendor support from Mission Cloud, application developers, and QA.

Special credit goes to the leadership team for staying the course and trusting their people to execute this huge project.

Thank you all for this remarkable achievement!

I hope you enjoyed reading this. Keep an eye out for more articles on engineering and infrastructure at VTS.
