A couple of months later, we have made some progress, and the end vision is slightly different:
- We (or at least I) have come to realize that the local (Docker) setup is a very important piece of this puzzle, and it is missing from the original diagram. Ian has been working hard on transitioning the whole pipeline to use the same Docker containers, so this is at least in progress. From our call yesterday, it sounds like that is its own domain, with many features still to be built to make it fast and predictable.
- Dev3 has a PR up which will reduce its dependence on the CDN, making it more predictable and less error-prone. This is a work in progress, so merging the PR does not necessarily mean the process will be complete.
- Test3 is up, which has already been a huge help, both a) for quickly verifying breaking changes in dev against what was working before, and b) as a backup remote backend when dev3 goes down. Also, the product / QA team has already started testing against my proposed requirements rubric to give some idea of stoplight status as the project progresses. Next steps here are to push to test regularly, record a smoke test and image number, and promote passing images to demo.
- Demo is up, looks fairly recent, and is pre-populated with data. I think this is a great improvement over the last few months, when it had been stale since about March.
It seems like we are starting to have the infrastructure pieces in place, and now the goal is to fine-tune the process so that we are shipping constantly. To do that, the next steps are:
- Dial in the local Docker setup so that FE devs are never blocked and have consistent build environments with no external dependencies to run as sandboxes. This will help isolate network and package load issues, and allow more disruptive test configurations (for example, making many changes to the database).
- Focus on integration testing, both with the QA test-3refapp and by bringing Cypress CI tests to GitHub Actions once we merge monorepos. A good example of how we have tests that don’t catch the right things yet: the patient-registration app has broken several times in the last few months even when all CI tests were green. Combining the two testing methods above will let us catch the bugs that matter so that registration doesn’t keep breaking, which will ultimately allow us to develop faster.
- Test regularly on test3, record the results, and then promote successful candidates to demo. Using the process will help us get better at it, and will hopefully bring much-needed transparency as to the state of O3.
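To make the test3-to-demo loop a bit more concrete, here is a rough sketch of what "record a smoke test and image number, then promote passing images" could look like as data. Everything in it is hypothetical: the names `SmokeTestRecord` and `PromotionLog`, the example image numbers, and the assumption that image numbers only ever increase are mine for illustration, not something our pipeline does today.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class SmokeTestRecord:
    """One smoke-test run against test3 (hypothetical structure)."""
    image_number: int  # assumed to increase with every new build
    passed: bool
    notes: str = ""


@dataclass
class PromotionLog:
    """Keeps smoke-test results and picks the image to promote to demo."""
    records: List[SmokeTestRecord] = field(default_factory=list)

    def record(self, image_number: int, passed: bool, notes: str = "") -> None:
        # Store the result so the state of each image stays visible to everyone.
        self.records.append(SmokeTestRecord(image_number, passed, notes))

    def latest_passing(self) -> Optional[int]:
        # Highest image number whose smoke test passed, or None if none passed.
        passing = [r.image_number for r in self.records if r.passed]
        return max(passing) if passing else None

    def promote_candidate(self, current_demo_image: Optional[int]) -> Optional[int]:
        # Only suggest a promotion when a passing image is newer than demo.
        candidate = self.latest_passing()
        if candidate is None:
            return None
        if current_demo_image is not None and candidate <= current_demo_image:
            return None
        return candidate


if __name__ == "__main__":
    log = PromotionLog()
    log.record(101, passed=True)
    log.record(102, passed=False, notes="patient registration fails to load")
    log.record(103, passed=True)
    print(log.promote_candidate(current_demo_image=101))
```

The point of keeping failures in the log alongside passes is the transparency goal above: anyone can see which images were tried, which failed and why, and which one demo should be running.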
I may have put this into a nice bullet point list, but there is a ton of work to do here, both in setting up the infrastructure and in the ongoing maintenance once it is running. DevOps affects many, many stakeholders, all of whom need the CI pipeline to be highly available; when it is not, teams can be totally blocked, costing hundreds of engineer-hours. It is critical that we have a dedicated resource here.