@sunbiz discovered it himself!
If CI (or any other automated test) were able to detect bugs in upgrades (involving DB metadata changes) I would be impressed.
I would hope & expect CI could detect if the application cannot be deployed (like the bug @sunbiz discovered in TRUNK-4773). When the change was applied to master, I would expect CI’s attempt to build & deploy to fail with that commit.
A post was split to a new topic: On improving testing upgrades for releases
CI definitely would’ve have caught this, another reason now why we really need to set up the CI plans sooner than later for testing installation and upgrade scripts on mysql and maria DB
This is a very illustrative example of something.
As Paul says, if we really never backport new features, this bug would not have happened in 1.11.4. As Burke and Mark point out, this was not an actual incompatibility, but could/should have been caught in testing.
My personal opinion is that we should remain flexible about backporting small new features, and improve our automated testing processes. This is especially relevant for features that cannot easily be implemented in a module (e.g. an addition to a core table, like this one).
Let’s look at the counterfactual in this example. In real life, it was implemented and backported in June. If instead we had been sticklers and not backported it, then the Bahmni team would still not have this feature, and indeed it would not be scheduled for release until Platform 2.0, scheduled for December. And to get this they would have to also simultaneously take the upgrades to Spring and Hibernate, removal of all deprecated methods, and removal of the legacy UI. And I presume they wouldn’t do this until most of the reference application modules had been upgraded to handle this, e.g. not until March. [I will fork this topic to talk about forks.]
Some of that complexity is because Platform 2.0 is going to be a big upgrade. But even assuming they would get to use the feature in December/January rather than March, adding 6-7 months to the timeline of getting a small optional column added is kind of thing that leads Mark to say:
I’ll add that I myself basically stopped committing to openmrs-core at the end of 2012, because no feature I put there would ever be ready in time for me to use it in an implementation. From github:
I’m afraid of having more devs who are attached to implementations end up thinking the same way that Mark and I did.
That said, I can see the argument that if OpenMRS truly behaves like a platform, we must be iron-fisted about never backporting data model changes. Does someone want to argue that point? I’d honestly like to know where people’s opinions fall regarding how OpenMRS should balance this tradeoff.
While thinking about backporting changes, improving releases, better testing etc., we should not lose sight that we need to withdraw this release. Delete the tag from git and any other places (maven repo maybe ?) where this has percolated to limit the damage. And think about how to fix this issue. I think @wyclif suggestion of another liquibase file that will be run on startup irrespective of login is a fine approach, that I have seen many projects take. That or another solution needs to be implemented soon, since this is a blocker and will prevent the refapp 2.3 release. 2 days with a known bad release should not become a week!!
FWIW, In the past when we had broken releases, we never really removed them, just posted a newer one and made appropriate documentation.
I think we should focus on fixing the issue for now, does anyone have anymore ideas on how to fix this?
Offhand, if I’m understanding the problem correctly, two solutions could be:
- What Saptarshi said (i.e. some liquibase run at startup before the spring context is loaded)
- Don’t use the regular API to authenticate the user who is logging in to an upgrade (to avoid the hibernate mappings), but write this as direct SQL queries instead.
To add to your second solution Darius, i wonder if the same might apply for the daemon user if there are database changes to be run, seems to me like we would need to avoid the API too.
So, we’re starting up the OpenMRS API and using it to run liquibase changes? Given that the API is built to run against the updated data model, this sounds like a problem. How hard would it be to run updates before firing up the API?
After looking into this a bit, i think this is the expected behavior for a couple of reasons:
- Before closing the ticket, i had tested on the master branch and the setup wizard appeared as expected.
- Looking at the stack trace on the ticket, the failure happens on
WebDaemon.startOpenmrs(WebDaemon.java:64)which is after
- I have confirmed by doing an upgrade from 1.11.0 to 1.11.4 and had the changesets run to completion with the upgrade successful.
So i suspect @sunbiz may have “auto_update_database” set to true in his runtime properties file. Which we are aware of as per:
If the above is not so, then i need a deeper dive into this!
To be clear, you are saying that upgrading to 1.11.4 actually does work? (At least it usually does, as long as you do not have auto_update_database=true.)
Yes that is exactly what am saying!
Then, a few things:
- I guess it’s okay to proceed with the Reference Application 2.3 release based on Platform 1.11.4 (as long as @sunbiz confirms he has auto-update=true, and he didn’t hit some other bug)
- In that case we need to flag in the documentation of Reference Application 2.3 and Platform 1.11.4 that you cannot upgrade with auto-update=true
- We need to prioritize TRUNK-4726 ! Rafal curated this ticket and marked it as Priority = Must on June 3 (and also “intro” a bit later), however in the 3+ months since then it hasn’t been touched. (Unfortunately Platform 1.11.4 was released during the 9-day span where this ticket incorrectly lost its 1.11.4 fixVersion.)
- I think it’s misleading to have a ticket in openmrs-core that is marked with priority=Must that goes 3+ months without being worked on. “Must” is weaker than “Blocker” but to me it conveys more than “whenever someone picks this up”. @dkayiwa, @wyclif, @raff, and others, could you all review the process for curating and prioritizing tickets, and see whether/how it needs to be modified to ensure that marking a ticket as higher priority actually has some impact?
I took sometime to reproduce and deep dive into the cause of this, it turns out that Daniel is correct that you should be able to upgrade to 1.11.4 if you have allow_auto_update set to false which is the default but it will fail if it is set to true. By the way, this feature worked in the past i.e run updates when allow_auto_update set to true but has been broken since 1.11.0, when it is set to true updates are automatically run by Context.startup(). The bug is caused by this commit/fix for TRUNK-344, which as seen here is fetching a GP through the API before OpenMRS is started (VERY UNACCEPTABLE) , of course the authorization advice around this service method tries to do its work but when it calls Context.getAuthenticatedUser() it fails as this method tries to load the daemon user, because the updates haven’t been run yet, you get the error @sunbiz reported.
The fix is it to move the line that fetches the GP through the API to happen after the call to Context.startup() since the updates at that point have already been run even when allot_auto_update is set to true OR use SQL to fetch it if it needs to be used before the service context is up. This also makes me think that it is ironic that we provide a mechanism to notify modules that openmrs has been started so they can be able to set certain configurations but in core we don’t have a well designed way to guide DEVs to know the correct place to add code like this which initializes certain things on startup to avoid similar bugs in the future.
With that said, this shouldn’t block the 2.3 release given that the bug has actually been around since 1.11.0
Please also make sure that any work required to fix things up is properly ticketed and prioritized.
I’ve updated the ticket description to say what needs to be done
Looks like there was this duplicate ticket reported even earlier: