Principles of AI Coding in Bioinformatics Analyses
Don’t over-engineer for hypothetical edge cases
Traditional software engineering often emphasizes handling many possible user behaviors gracefully. In bioinformatics, the situation is different: an analysis is usually built for a specific dataset, a specific biological question, and a single operator.
So instead of designing for every hypothetical edge case, prioritize strict validation of the assumptions your analysis depends on. If those assumptions are violated, the pipeline should stop and report the issue clearly. Many “edge cases” in bioinformatics are actually signs of data quality problems, metadata inconsistencies, or incorrect preprocessing, and they should be investigated rather than silently worked around.
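As a minimal sketch of what strict assumption checking might look like, the function below stops as soon as the count matrix and the sample metadata disagree. The function name, the argument layout, and the required columns (condition, batch) are hypothetical and only for illustration; adapt them to whatever your analysis actually assumes.

```python
def validate_inputs(count_samples, metadata, required_columns=("condition", "batch")):
    """Stop early if the count matrix and the sample metadata disagree.

    count_samples: list of sample IDs taken from the count matrix columns.
    metadata: dict mapping sample ID -> dict of metadata fields.
    All names here are illustrative assumptions, not a fixed schema.
    """
    if len(set(count_samples)) != len(count_samples):
        raise ValueError("Duplicate sample IDs in the count matrix")

    missing = sorted(set(count_samples) - set(metadata))
    extra = sorted(set(metadata) - set(count_samples))
    if missing or extra:
        raise ValueError(
            f"Sample ID mismatch: missing from metadata={missing}, "
            f"not in counts={extra}"
        )

    # Every sample must carry a non-empty value for each required field.
    for sample_id, row in metadata.items():
        for column in required_columns:
            if not row.get(column):
                raise ValueError(
                    f"Sample {sample_id!r} has no value for {column!r}"
                )
```

Calling this at the top of a step means a mislabeled or missing sample halts the run with a message that names the offending IDs, instead of being dropped or imputed further downstream.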
Avoid silent analytical fallbacks; throw errors explicitly
AI agents often tend to add fallback mechanisms when a primary method fails. This is a poor fit for the bioinformatics stack, where we typically want the primary (best-practice) method to work. If it fails, first fix the code or the data so that the primary method runs, rather than falling back to a secondary method.
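One way to make this concrete is a small wrapper that runs the primary method and, on failure, raises an error with context instead of trying something else. This is only a sketch: PrimaryMethodError and run_primary_method are hypothetical names, and the commented-out anti-pattern is the kind of code to watch for in AI-generated scripts.

```python
class PrimaryMethodError(RuntimeError):
    """The chosen best-practice method could not run."""


def run_primary_method(method, data, step_name):
    """Run `method` on `data`; on failure, stop with context.

    Anti-pattern this replaces (a silent fallback):
        try:
            result = primary(data)
        except Exception:
            result = secondary(data)   # results now come from another method
    """
    try:
        return method(data)
    except Exception as exc:
        name = getattr(method, "__name__", repr(method))
        # Chain the original exception so the real cause stays visible.
        raise PrimaryMethodError(
            f"Step {step_name!r}: primary method {name!r} failed ({exc}). "
            "Fix the input or the code; no secondary method will be tried."
        ) from exc
```

The original exception is preserved through `raise ... from exc`, so the traceback still shows what went wrong inside the primary method.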
Use a specialized bioinformatics agent to guide your research narrative and direction
The importance of system prompts is well known. General-purpose coding agents (such as Claude Code, Codex, and OpenCode) are useful for implementation, but they are not always good at choosing appropriate methods, identifying caveats, or deciding what biological questions should come next. Domain-specific bioinformatics agents are often better suited for analysis planning, interpretation, and strategic direction.
You can set up a good system prompt using Prompt Optimizer. Or, if you are using OpenCode, you can run opencode agent create to create an agent.
Encourage AI to do web search
Web search is particularly valuable in two situations. The first is deciding which methods, parameters, and best practices to use for a given analytical step, and what the potential caveats of that step are. The second is writing biological narratives and choosing downstream directions, both of which require consulting external literature.
The available web search tools vary depending on the agent tool. Currently, a search tool I find quite useful is GuDaStudio/GrokSearch at grok-with-tavily, which is an MCP server.
Review each completed step critically
You’ve successfully run an analysis step, generated result files, and encountered no errors, so everything appears fine. But hold off on celebrating. As mentioned earlier, AI might add fallbacks or implicit error handling without your awareness. Worse still, if your task is sufficiently complex, AI may skip over certain steps and produce low-quality placeholder results instead. Opening a separate session to have AI review the code and results from this step typically uncovers significant issues and improvement opportunities.
While the AI may flag many existing problems or areas needing improvement, some of these suggestions might be trivial. You will need to discern which issues genuinely require attention and refinement.
Try to make each step granular and parallelizable when useful
Each bioinformatics analysis step should be as “atomized” as possible, meaning each step should focus on completing a single small objective or task. This approach has two benefits. First, it allows AI to focus on the current small objective, one step at a time, making it less prone to errors. Second, it enables caching of steps that have already run successfully, reducing the time needed to re-run steps during debugging. This is especially useful when you’re using workflow managers (targets, Snakemake, Nextflow, etc.).
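To illustrate the caching benefit, here is a rough sketch of a single-purpose step plus a tiny runner that skips the step when its output is already newer than its input. It is a simplified stand-in for what workflow managers do far more robustly; run_step, filter_low_counts, the tab-separated layout, and the min_total threshold are all hypothetical.

```python
from pathlib import Path


def run_step(step, input_path, output_path):
    """Run one small step unless its output is already up to date.

    Returns True if the step ran, False if the cached output was reused.
    """
    input_path, output_path = Path(input_path), Path(output_path)
    if not input_path.exists():
        raise FileNotFoundError(f"Missing input for {step.__name__}: {input_path}")
    # Reuse the output only if it is at least as new as the input.
    if output_path.exists() and output_path.stat().st_mtime >= input_path.stat().st_mtime:
        return False
    step(input_path, output_path)
    if not output_path.exists():
        raise RuntimeError(f"{step.__name__} finished without writing {output_path}")
    return True


def filter_low_counts(input_path, output_path, min_total=10):
    """Single objective: keep genes whose total count is at least min_total.

    Assumes a tab-separated file with a header row and the gene ID
    in the first column (an illustrative format, not a standard).
    """
    lines = Path(input_path).read_text().splitlines()
    header, rows = lines[0], lines[1:]
    kept = [r for r in rows if sum(int(x) for x in r.split("\t")[1:]) >= min_total]
    Path(output_path).write_text("\n".join([header] + kept) + "\n")
```

Because the step does exactly one thing and writes exactly one file, a failed downstream step can be debugged and re-run without repeating the filtering.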
Optimize for reproducibility, not just task completion
AI should generate workflows that can be rerun reliably, not just code that works once. This means data should never be edited by hand but always manipulated by scripts. Additionally, pay specific attention to non-deterministic steps (e.g., clustering in scRNA-seq), which can produce different results even with unchanged code. One solution is to set explicit random seeds in the code.
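A minimal sketch of seeding, using only Python's standard library: the random step gets its own seeded generator, and the input is sorted first so that the result does not depend on the order in which cell IDs happen to arrive. The function name subsample_cells is hypothetical. Many analysis libraries expose an equivalent control, for example a random_state argument in scikit-learn estimators.

```python
import random


def subsample_cells(cell_ids, n, seed):
    """Draw n cell IDs reproducibly.

    A dedicated generator seeded with `seed` keeps this step independent
    of any other random calls elsewhere in the script.
    """
    rng = random.Random(seed)
    # Sort first: the same set of IDs gives the same sample
    # regardless of input order.
    return rng.sample(sorted(cell_ids), n)
```

Passing the seed as an explicit argument, rather than hard-coding it inside the function, also makes it easy to record in the step's parameters or log.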