Introduction: Setting the Scene
In the vast landscape of one retail giant’s cloud computing infrastructure, where data flows like the lifeblood of the organisation, there exists a critical system: a stock inventory data pipeline. The pipeline procures vital datasets from a datalake and processes them to deliver valuable insights on stock inventory movements.
With plans for the demerger of this organisation into two separate entities well under way, the mammoth task of cloning and migrating existing data pipelines and workloads to a new account – Project Apollo – began in earnest, to ensure both entities could operate independently without disruption.
Project Apollo involved the migration of more than 40 workloads – the stock inventory data pipeline was one among them. This is the story of how we migrated this pipeline.
A map of the Apollo migration: can you spot our pipeline circled in red?
The pipeline’s work begins just downstream of the datalake, pulling the relevant datasets (stored as Parquet files) into an S3 bucket. From here, it traverses the Ingestion layer, copying these datasets into another S3 bucket within the processing account. The Compute layer, a combination of AWS Glue jobs and Lambdas orchestrated by Step Functions, processes these datasets. For frequently changing datasets, a Kinesis stream-based process performs real-time analysis. Finally, the processed outputs are stored in S3 and RDS Postgres, ready to be published to downstream consumers via FTP or an S3 bucket.
Overview of the data pipeline
This system’s deployment follows a Gitflow workflow, utilising AWS CodeBuild and CodePipeline to automate deployments across Development, UAT, and Production environments. Despite the apparent simplicity of deploying the pipeline using Infrastructure as Code (IaC) via CloudFormation, the reality of migrating this system to a new account was far from straightforward.
Our adventure begins here.
The Problems We Faced
As we embarked on the migration, we encountered several formidable challenges:
1. Environment Drift:
The production environment did not accurately reflect the CloudFormation templates, leading to a drift between the actual state and the IaC configuration. This disparity reduced confidence in the CI/CD pipeline, as the team feared deployments could break the production environment. The same issue plagued the UAT and dev environments.
2. Code Divergence:
Long-lived dev and UAT branches had significant differences compared to the prod branch. Due to lengthy release cycles, changes were batched in UAT awaiting QA by a separate testing team, often taking weeks to finalise. The thought of having to build separate deployment pipelines to cater to each long-lived branch filled us with dread, and put our project targets at risk.
3. Cyclic Dependencies:
The six CloudFormation stacks had cyclic dependencies, primarily due to the prevalent use of `Export` and `ImportValue` references. This made deploying the stacks from scratch in a new account almost impossible.
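To make the cycle concrete, here is a minimal sketch of the pattern we kept running into. The stack and export names are illustrative, not the real ones: each stack exports a value the other imports, so neither can be created first in a fresh account.

```yaml
# stack-a.yaml (illustrative names): exports its bucket name, but also
# imports a role ARN exported by stack-b -- a circular dependency.
Resources:
  IngestBucket:
    Type: AWS::S3::Bucket
  IngestJob:
    Type: AWS::Glue::Job
    Properties:
      Role: !ImportValue StackB-GlueRoleArn        # <- needs stack-b to exist
      Command:
        Name: glueetl
        ScriptLocation: !Sub s3://${IngestBucket}/scripts/ingest.py
Outputs:
  IngestBucketName:
    Value: !Ref IngestBucket
    Export:
      Name: StackA-IngestBucketName                # <- but stack-b imports this

# stack-b.yaml (not shown) does the mirror image: it exports
# StackB-GlueRoleArn while importing StackA-IngestBucketName.
# In the old account both stacks already existed, masking the cycle;
# deploying from scratch in the new account exposed it immediately.
```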
The Solution: Overcoming the Obstacles
To overcome these challenges, we implemented a series of strategic solutions:
1. Repository and Stack Reorganisation:
We restructured the repository and CloudFormation stacks into four logical groupings:
- Foundational Stack: For persistent resources, secrets, and rarely changing resources.
- Compute Stack: For processing resources.
- Publish Stack: For publication layer resources.
- FTP-Service Stack: For FTP-related resources.
We replaced `Export` and `ImportValue` references with SSM Parameters, reducing inter-stack dependencies and simplifying deployments.
Before the re-organisation…
After the re-organisation
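The replacement pattern looks roughly like this (parameter paths are hypothetical): the foundational stack writes the value into SSM Parameter Store, and consuming stacks resolve it at deploy time instead of hard-importing an export.

```yaml
# foundational stack: create the bucket and publish its name to SSM
Resources:
  IngestBucket:
    Type: AWS::S3::Bucket
  IngestBucketNameParam:
    Type: AWS::SSM::Parameter
    Properties:
      Name: /stock-pipeline/ingest-bucket-name     # hypothetical path
      Type: String
      Value: !Ref IngestBucket

# compute stack: resolve the SSM value at deploy time -- no ImportValue,
# so the stacks are no longer hard-coupled and can be deployed independently
Parameters:
  IngestBucketName:
    Type: AWS::SSM::Parameter::Value<String>
    Default: /stock-pipeline/ingest-bucket-name
```

Unlike `Export`/`ImportValue`, SSM-backed parameters do not block updates to the producing stack, which is exactly what made incremental deployment into the new account possible.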
2. Consolidation of Branches:
We consolidated changes from long-lived dev, uat, and prod branches into unified stack templates. Feature flags were introduced to enable or disable changes across different environments. Environment-specific parameters were managed using Mappings and SSM parameters, and the `cfn-lint` tool provided confidence in the configurations.
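A simplified sketch of the feature-flag pattern, using `Mappings` for per-environment values and a `Condition` to gate a resource (the flag and resource names here are invented for illustration):

```yaml
Parameters:
  Environment:
    Type: String
    AllowedValues: [dev, uat, prod]

Mappings:
  FeatureFlags:
    dev:  { EnableRealtimeAnalysis: "true" }
    uat:  { EnableRealtimeAnalysis: "true" }
    prod: { EnableRealtimeAnalysis: "false" }   # flip when QA signs off

Conditions:
  RealtimeAnalysisEnabled: !Equals
    - !FindInMap [FeatureFlags, !Ref Environment, EnableRealtimeAnalysis]
    - "true"

Resources:
  AnalysisStream:
    Type: AWS::Kinesis::Stream
    Condition: RealtimeAnalysisEnabled          # only created where enabled
    Properties:
      ShardCount: 1
```

With this in place, a single template serves all three environments, and promoting a change is a one-line mapping edit rather than a branch merge.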
3. Migration to Buildkite:
We migrated the deployment flow from CodeBuild/CodePipeline to Buildkite. Buildkite’s trunk-based deployment flow and cleaner UI suited the team’s needs, offering easy build status visibility and integration with existing workloads.
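As a rough illustration of the trunk-based flow, a Buildkite pipeline can gate each environment behind a manual `block` step. The script path and step labels below are hypothetical:

```yaml
# .buildkite/pipeline.yml (illustrative): every commit to main lints the
# templates, deploys to dev, then waits for manual promotion onwards.
steps:
  - label: "Lint templates"
    command: cfn-lint templates/*.yaml

  - label: "Deploy dev"
    command: ./scripts/deploy.sh dev     # hypothetical deploy script
    branches: main

  - block: "Promote to UAT"
    branches: main

  - label: "Deploy uat"
    command: ./scripts/deploy.sh uat
    branches: main

  - block: "Promote to prod"
    branches: main

  - label: "Deploy prod"
    command: ./scripts/deploy.sh prod
    branches: main
```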
The Value We Brought to the Team
Our efforts culminated not only in a successful deployment of the pipeline in the new account, but also in significant improvements:
1. Improved Deployment Management:
By aligning the deployment process with the team’s standard practices and reducing cognitive load, we made maintaining the IaC configuration more manageable.
The deployment pipeline(s) before migration
The deployment pipeline after migration
As part of a more streamlined and organised deployment workflow, we also took the opportunity to standardise resource tagging, to ensure more accurate cost reporting for this workload.
2. Trust Reestablished:
The reorganisation reestablished trust in the CI/CD pipeline, enabling the team to deploy changes with confidence.
3. Reduced Context Switching:
Moving from a Gitflow to a trunk-based workflow reduced the need for context switching between branches, streamlining the development process.
4. Feature Flag Implementation:
Introducing feature flags allowed code to be shipped to production safely, enabling timely deployments and a holistic testing strategy without waiting for full feature completion.
Key Lessons Learned
Our adventure taught us several valuable lessons:
1. Avoid Stack Exports:
Using stack `Exports` coupled with `ImportValue` references can create complex dependencies. SSM Parameters offer a more flexible alternative, but stay alert to SSM Parameter drift (values changed outside of IaC) and implement strategies to mitigate it.
2. Organise Stack Resources Logically:
Group persistent resources separately and organise stacks based on change cycles to simplify management of CloudFormation stacks.
3. Engage the Team:
Bringing the team along on the journey is crucial. Address their concerns, help them understand the rationale behind changes, and involve them in the process. Emphasise and raise awareness of the benefits of new practices, such as feature flags, to gain buy-in. And most importantly, foster collaboration.
As our Project Manager Renee Mortlock remarked, this was key to the delivery of such a complex migration.
‘Everyone on this team added value – they positively supported each other, all worked hard and were the definition of collaboration. Working as one team with the client included was what made this delivery a success.’
4. Shift Mindsets:
Changing long-held practices requires patience and support from both ground-level team members and top-level management. Focus on easy wins to build confidence and ensure key figures support your initiatives.
Conclusion
Our journey through the migration of the data pipeline was fraught with challenges, but with perseverance, strategic planning, and teamwork, we transformed a troubled deployment workflow into one that’s robust and reliable. This undertaking not only improved our deployment process but also strengthened the team’s ability to navigate future changes with confidence.
Keep in mind that achieving a working deployment in the new environments was just the first step of our migration! But it was a crucial one that set us up to tackle the next big challenge – achieving a successful run of the data pipeline.
But that is a story for another blog…