Using AI Coding Agents and LLM-generated Code

Hi all! I wanted to start this post to get some community discussion going on AI-generated code, as we’re seeing more contributions that are clearly made with AI. The intention here is not to ban AI, but to figure out how we as a community can take advantage of the growing capabilities of these agents to the benefit of the projects we’re building.

AI coding agents are very capable of generating code—sometimes even excellent and efficient code—but they also have important caveats and limitations. For example:

  1. AI agents tend to be “better” when using common languages and frameworks in common ways. Much of OpenMRS’s technical stack is built with common languages and frameworks (TypeScript + React, Spring + Java), but they are not necessarily used in common ways.
  2. AI agents are currently quite bad at understanding the context of a modular application developed across numerous repos, the way OpenMRS is. That is, AI agents will tend to pick up on resources and patterns that already exist in a given repo, but will not know how to leverage resources and patterns outside the current repo. For example, an agent will be completely unaware of APIs available in another module, in OpenMRS Core, or even in existing dependencies. That is to say, AI agents often have a worse understanding of the OpenMRS code base than your average IDE.

In addition to these technical limitations of AI agents (and at least the second issue I outlined is likely to improve in the next 3-4 months), there are non-technical considerations as well.

With AI tools, the role of a developer changes a bit, but I don’t think the fundamental responsibilities we all have as developers change much. As developers, we have a responsibility to create code that solves real problems in a way that can be understood and maintained by others. Part of the way of doing this is ensuring that your code adheres to our standards and conventions (backend, frontend). But beyond that, you are also responsible for reviewing the output of the tools you use to ensure that they conform to those standards, conventions, and the common patterns we use across the code base.

This brings up the main issue with using AI coding agents to contribute to an open source project—pretty much any open source project: LLMs tend to confidently produce an answer for the instructions you give them, but it may not be the “right” one. That is what LLMs are trained to do. They generally won’t stop to ask questions like:

  • Is this feature I’m working on well-specified enough to build?
  • Is the place I’m telling the agent to build this code the right place to build it?
  • How are similar features built in other OpenMRS modules?

In my experience, the best results with AI coding agents are produced not by simply prompting the agent with the problem or challenge you are trying to solve, but by doing at least some of the intellectual work of software development up front: having a clear understanding of what the problem is, analyzing why and where the issue occurs, coming up with a reproduction of the issue (ideally as some kind of unit test or tests), and looking at similar solutions to the problem.

Working with AI agents can, in some ways, feel like a higher-level role: your main responsibility may no longer be writing the code but reviewing it, ensuring that it’s not just a solution but a good solution, one that is understandable, maintainable, and won’t easily regress.


Concretely, we’ve been getting a lot of submissions that are LLM-written or largely LLM-written, and this has raised a few issues. Primarily, we rely on community members to review PRs as well as write them. Code review can be a time-consuming process, which is why we have a checklist of minimal things you can do to make your code easier to review. That checklist is just a bare minimum; you should be doing more to ensure your code meets quality standards. Making your PRs as clear, concise, and readable as possible is the best strategy to ensure your PR is reviewed and merged—regardless of whether you use an AI coding agent or not.

Long PRs and PRs with many changes irrelevant to their stated purpose are less likely to be reviewed or accepted. When you open a PR, you want it to look like something that could be merged, and the best way to do that is to familiarize yourself with the PRs we already merge and make yours look as much like them as possible.

As a last piece of advice: you should almost certainly be hand-editing the code even if you’re using a coding agent to generate most of it. Agents have a tendency to write code that is plausible but not exactly right; there may be a better way to express things, or things may just need to be slightly tweaked.


I don’t want to leave this as just a scolding post, so I’ll point to how I think this can be done well. For example, here is a PR I wrote that was primarily coded by Claude Code. Creating that PR took a series of conversations; I wrote something like 70 individual messages: an initial prompt; course corrections on various things the agent was doing; checks that we were working through the plan (outlined separately) in a reasonable order; and detailed instructions on a few really minor technical points (e.g., how to correctly add a new dependency and how to update dependencies; Claude consistently forgot that after adding a new dependency, it needed to re-run yarn install for the dependency to be installed).

FWIW, here was my initial prompt (yes, ideally some of this would be in a CLAUDE.md or AGENT.md file):

Inside the notes folder, there’s a document called “UNIT_TESTING_PLAN.md”. We’re going to start working on that. Since the state-management component is kind of core to a lot of the other key functionality (e.g., auth data and extensions are stored in state-management stores), that seems like the logical place to start.

For testing, we’re using vitest with appropriate mocks. Since there are no tests in esm-state currently, we’ll need to start by adding the vitest library. We’re currently using ^3.1.4 and for compatibility with the rest of the packages in the monorepo, we’ll want to use exactly that string. We’re using yarn for package management and yarn workspaces for monorepo organization. Monorepo-wide tasks tend to be run with turbo, but each package’s tests should be executable using, e.g., yarn workspace @openmrs/esm-state test.

We’ll also need to add at least a test script to the packages without tests and a vitest.config.ts file, which should be as minimal as possible. While some use of a setupFile is fine (which should be named setup-tests.ts and live at the same level as package.json), I prefer that we import test code explicitly in the tests themselves.

Please read through the document and these notes and then start setting up the project. Ideally at the end of the setup phase we’ll have a single minimal test to prove the configuration works. Use the esm-expression-evaluator package as a model and feel free to ask questions for any points that need to be clarified. I prefer we clear things up before writing code, even though code can be rewritten.

You might notice:

  1. I had a Markdown file already describing a plan for how to proceed
  2. While the final PR ended up upgrading to Vitest 4, we initially worked with Vitest 3 (I later added code coverage support, and it was simpler to do that while migrating to Vitest 4)
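For readers unfamiliar with the setup the prompt describes, a minimal vitest.config.ts along those lines might look roughly like this. This is a sketch only, not the actual file from the PR, and the environment setting is an assumption:

```typescript
// Sketch of a minimal vitest.config.ts, per the preferences in the prompt:
// keep the config as small as possible, with no setup file, and import any
// test helpers explicitly in the tests themselves.
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    // 'node' is an assumption here; a browser-oriented package might use 'jsdom'.
    environment: 'node',
  },
});
```

Paired with a test script in the package’s package.json, tests can then be run via, e.g., yarn workspace @openmrs/esm-state test, as the prompt describes.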

One of the things this PR added tests for was the shallowEqual() function from @openmrs/esm-utils, which is meant to mimic React’s shallow-equals comparison. This actually caused the agent a lot of difficulty, because it assumed the implementation was something like:

export function shallowEqual(a: any, b: any) {
  return a === b;
}

Rather than the actual implementation, which is a bit more complex.
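For illustration, a React-style shallow comparison usually looks more like the following. This is a sketch of the general technique only, not the actual @openmrs/esm-utils implementation; the function is renamed to make that clear:

```typescript
// Sketch of a typical React-style shallow comparison.
// NOT the actual @openmrs/esm-utils implementation.
export function shallowEqualSketch(a: unknown, b: unknown): boolean {
  // Object.is matches SameValue semantics: NaN equals NaN, +0 differs from -0.
  if (Object.is(a, b)) {
    return true;
  }

  // Anything that isn't a non-null object can only be equal by identity (above).
  if (typeof a !== 'object' || a === null || typeof b !== 'object' || b === null) {
    return false;
  }

  const keysA = Object.keys(a);
  const keysB = Object.keys(b);

  if (keysA.length !== keysB.length) {
    return false;
  }

  // Each own key must exist on both objects with the same value, one level
  // deep; nested objects are compared by reference only.
  return keysA.every(
    (key) =>
      Object.prototype.hasOwnProperty.call(b, key) &&
      Object.is((a as Record<string, unknown>)[key], (b as Record<string, unknown>)[key]),
  );
}
```

The key differences from the agent’s guess: Object.is handles NaN and signed zero, and own enumerable keys are compared one level deep rather than relying on a single ===.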


Now for the more interactive part of this:

  • What standards should we, as a community, have for LLM-written code?
  • How can we best ensure that LLM-written code doesn’t become a pain to review and so is actually meaningfully contributing towards the goals of OpenMRS?
  • Are there considerations I’ve missed? Things I’ve gotten wrong here?
13 Likes

Really appreciate the timeliness and thoughtfulness of this post, Ian!

7 Likes

People Often Confuse Vibe Coding and AI-Assisted Engineering

Vibe coding is a process where you use AI (or even your own quick sketches) to bring ideas to life without worrying about architecture, robustness, or long-term maintenance. The developer is not fully sure about the implementation details — the goal is just to explore possibilities quickly.

In AI-assisted engineering, vibe coding is only a subset. Here, the developer still generates ideas and scaffolds rapidly, but then goes into the code, understands what’s written, evaluates trade-offs, and makes the necessary refinements. The developer is not blind to the internals — and the effectiveness of this depends heavily on the person’s level of expertise. You may find something entirely new during code review discussions, especially when senior engineers weigh in.

And honestly, this isn’t new.

Back in 2019 (pre-LLMs), were all pull requests perfect? Of course not. Software engineering has always been about iterative improvement. Repos evolve every day.

The only difference now is the speed of evolution. Code can be generated 10x faster — developers don’t need to manually craft every line. But the review, understanding, and architectural thinking still matter just as much.

The “threat” is often overstated. If a PR solves a real problem, maintainers will still take time to review it. The merge button doesn’t get clicked automatically.

In fact, during a recent online conference, we discussed how to accelerate the review process using powerful code-assist tools (yes, LLMs) — even something as simple as generating a high-quality PR summary can save maintainers time.

Don’t think of it as “AI code.” Think of it as if an intern wrote it: Question it → Evaluate it → Improve it.

That mindset removes the hype and keeps engineering grounded.

If you want to understand more about AI-assisted engineering, I recommend this podcast: https://www.youtube.com/watch?v=dHIppEqwi0g&t=1035s

Check out this PR where AI and humans work together to bring an idea to life.

1 Like

The point here is not about perfection. We all open pull requests that get improved with reviews. We improve what we understand.

Frankly speaking, as a maintainer, I will never waste time reviewing an AI-generated pull request whose author does not understand what is in it, however much they claim it solves a real problem.

The very first step in accelerating the review process is to ensure that you are in charge of the LLM’s code output. A high-quality summary (usually also generated by the LLM) does not help when you do not understand the contents of the pull request.

As long as the intern does not understand it, when I try to evaluate and question it, they will just run to the LLM for answers. That is what I will not waste my time improving.

Engineering that is grounded will always involve learning to evaluate the LLM’s non-deterministic output. Tools are good and a must to use, but when misused, we both lose: the maintainer loses time reviewing what they should simply have closed, and the pull request author loses an opportunity to learn and grow into a better developer.

5 Likes

“The point here is not about perfection. We all open pull requests that get improved with reviews. We improve what we understand.”

Answer: I was not claiming perfection either. Yes, we make pull requests to improve what we understand; I was pointing out that even we human beings tend to make mistakes…

Frankly speaking, as a maintainer, I will never waste time reviewing an AI-generated pull request whose author does not understand what is in it, however much they claim it solves a real problem.

Answer: That is what I am saying too. If the developer understands what the AI has generated, I don’t see any problem; that is the reason I clearly made a distinction between vibe coding and AI-assisted engineering. But if I were a maintainer, I would still need to check what the PR is about.

The very first step in accelerating the review process is to ensure that you are in charge of the LLM’s code output. A high-quality summary (usually also generated by the LLM) does not help when you do not understand the contents of the pull request.

Answer: That is exactly what AI-assisted engineering is about.

As long as the intern does not understand it, when I try to evaluate and question it, they will just run to the LLM for answers. That is what I will not waste my time improving.

Answer: Then the intern is a pure vibe coder; I am talking about AI-assisted engineering.

Engineering that is grounded will always involve learning to evaluate the LLM’s non-deterministic output. Tools are good and a must to use, but when misused, we both lose: the maintainer loses time reviewing what they should simply have closed, and the pull request author loses an opportunity to learn and grow into a better developer.

Answer: Yup, this is actually what I am talking about; I already made that distinction.

It’s great to see that as a maintainer you are concerned about the person making the pull request, but not every maintainer is as great as you, @dkayiwa; many only check whether it makes a positive impact on their project. That’s how the world runs…

I am also concerned about people who think that vibe coding is pure engineering; that is the reason I linked the podcast that gave me these insights.

I feel like we are assuming every PR has the same difficulty. Contributors should know when to use AI effectively, depending on the task at hand. We don’t need that level of thinking for a README update :joy:

Every PR is unique lol.

1 Like

Hi,

I am afraid there will be a lot of friction and/or heavy gatekeeping, and unreviewed contributions that would otherwise help advance the community’s goals.

I would suggest we publish a wiki page, in addition to this thread, that explains whether AI/LLM use is allowed and to what extent, what metadata to include, and the review path. This reduces friction for contributors and avoids ambiguous PRs.

OpenInfra uses this approach here; I’m not sure whether it’s feasible or useful in our community.

1 Like

@jayasanka what do you think about all this in relation to GSoC?

I’m not exactly sure what issues you think having an AI policy would resolve. I’m not aware of any PRs that have been rejected for being AI-written. To me, the biggest issue is that we have more contributors willing to submit code than we have contributors willing to review it, and too many PRs are submitted for review in a state where they are not ready for review.

1 Like

Earlier today at the community showcase we demoed a PR-reviewing agent that could help both committer and reviewer. It has great potential to increase review counts and cut down on review time if fed the right context.

The only challenges, which we’re working on, are:

  1. Experimenting with self-hosted models so we can mitigate the costs that come with vendor lock-in.

  2. Giving the agent the capability to understand the context of any ticket whose PR it’s reviewing, so that the final review is as close to a human review as possible.

1 Like

We’ve experimented with code review agents before and found the results not just unhelpful, but actually slowing down the code review process. If there’s something that’s workable, that’s great, but I don’t think we should just assume “AI can fix this.” As I said in the body of the main post, AI agents struggle with code bases as large and fragmented as OpenMRS.

1 Like

Well, the idea is not to surrender PR review completely to agents, but to utilize them as assistants to cut down the total number of minutes or hours we’d spend doing a review. The human reviewer should make the final call, IMO.

But since you’ve already experimented and found that the process slowed down, perhaps it’s not worth investing more time.

Thanks for this, @ibacher. I love the discussion. I believe having such discussions is a step toward establishing best practices for using AI agents for development in the community.

I would like to suggest the following as standards for contributors

  1. Every PR should/must include:

    • Problem Definition & Context: What specific issue is being addressed and why it matters. This should align with the intent of the ticket being worked on.

    • Acceptance Criteria Verification: For ticketed work, explicitly demonstrate how each criterion has been met with screenshots, test results, or working demos.

    • Implementation Rationale: Why the chosen approach is better than the alternatives, particularly when deviating from established OpenMRS patterns. Perhaps clear commit messages for each change would help here.

    • Testing Evidence: Show that the solution works before reviewers dive into the code.

    This shifts the burden of proof to the contributor and makes reviewers’ jobs easier: they can verify the solution works before examining implementation details. Without these, contributors may be informed that their work will not be reviewed until the requirements are satisfied.

  2. Have contributors write tests first when possible: create reproduction cases or unit tests that define success. This can help contributors clearly understand the problem being solved; once they know the problem, they can develop toward solving it.

  3. Build shared knowledge around AI best practices by documenting effective prompting strategies, common pitfalls specific to OpenMRS architecture, and example workflows that have worked well (like the vitest PR example). Building on our conference discussions in Uganda, we could create a community resource that helps everyone leverage AI tools more effectively.

  4. Create guidance on “AI-friendly” vs “AI-hard” tasks within OpenMRS. Some scenarios (like boilerplate generation, test creation, and documentation) work well with AI, while others (architectural decisions, cross-module integrations) require more human judgment. Documentation highlighting where AI excels versus where it struggles would help contributors make better tool choices and avoid common pitfalls like over-engineering and verbose code.

  5. We could also look into standardizing on AI tooling (with caveats). While I’m hesitant about mandating specific tools, there could be value in the community having a “recommended” AI agent (like Claude or GitHub Copilot) that we optimize documentation for. If we go this route, we could provide AGENT.md files in key repos with OpenMRS-specific context and patterns. However, we’d need to carefully consider:

    • Security implications of shared configuration files (recent research shows they can be exploited as attack vectors)

    • Keeping such files optional and supplementary, not required

2 Likes

Thanks @ibacher

I think you have outlined some of the issues, like PR/MR state, and more that I think @suubi7 has hinted at, that would require policy direction or guidance.

1 Like

I’m not saying this isn’t something worth pursuing, just that we need an appropriate pilot to show how to do this correctly. Quickly dropping in a coding agent doesn’t add a lot of value, but maybe that’s because we lack sufficient tooling? What I’d love to see is someone build out a repo where a code review agent could be shown to provide useful and meaningful reviews. I just meant to caveat that doing so requires more than hooking up a tool; we’d need some actual engineering of how we configure a repository to use this tooling correctly.

2 Likes

Oh yeah, I think it would be great to feature this on our next community AI call.

1 Like

True! But I think an official SKILL.md could still help improve the quality of AI‑generated contributions. This is a basic one I’ve come up with, though the scope is slightly different, and you can see some of the experiments in the same repo.