Unraveled Strands

Tag: chatgpt

Beyond Coding: Giving Claude a Seat at the Meeting Table
I recently came across the following article: “It’s Hard to Use AI as a Team. These 3 Practices Can Help.” The authors argue that while most organizations expect AI to automatically improve teamwork, the opposite often happens—without intentional integration, AI can actually narrow participation and shift ownership away from the team.

Specifically, the article offers three tips for working more effectively with AI as a team:

1. Engage with AI as a team. Instead of siloed interaction, the whole group should interact with the AI together.

2. Use AI in flexible roles. Move beyond passive tasks and give the AI active, rotating responsibilities.

3. Keep ownership with the group. Ensure everyone participates in maintaining and vetting the AI interaction so the team remains

Our Experiment

Our workspace is highly innovative and provides opportunities to experiment at work. I believe that experimentation is a key to building a superteam.

Taking advice from the article, we extended our use of AI from solo productivity to treating it as a literal member of the team during our design reviews.

Since many of our teammates are remote, we facilitated this by having the presenter start a Claude Code instance and present it side-by-side during the presentation. To make Claude a truly effective teammate, we ensured it had full context of our environment:
- Local Repository Context. Claude runs directly in the repository where the code changes are being made.
- Integrated MCPs. It is equipped with JIRA, Confluence, and GitLab integrations.
- Global Code Search. It has integrated access to Sourcegraph for quick searches across the broader codebase.
- Real-Time Research. It has full web access to pull in external documentation or context.
This setup gave the AI a visual presence in the meeting, making it a shared resource for everyone on the call rather than a hidden tool for a single developer. Here is how we applied those core practices and some insights into how they turned Claude into an effective teammate.

The results were excellent with real tangible outcomes (see below).


1. Engage with AI as a Team

One of the biggest mistakes in AI adoption is “siloed” use. For remote teams, this is even more dangerous as it can fragment the conversation. In our meeting, we didn’t have one person “driving” Claude in secret. Because it was shared side-by-side, everyone could see the prompts and the logic as it unfolded.

We started by introducing every member of the team to the AI: three software engineers and one bioinformatics engineer (who is also a primary user of the platform we were discussing).

Why this mattered

By telling Claude who was asking each question, the AI was able to adjust its technical depth and context. It spoke to the bioinformatics engineer about user-facing impact and to the developers about concurrency and cache manipulation. Because the output was visible to the whole remote group, it acted as a “living record” that kept everyone aligned.

2. Use AI in Flexible Roles

While the original article describes using AI in different behavioral personas (like a “Challenger” or “Customer”), we adapted this by giving our AI teammate multiple functional roles that shifted throughout the hour based on its capabilities:
- The Verifier. Claude had access to our specs via a Confluence MCP and the code base. During the meeting, we constantly asked it to verify if our verbal proposals were actually in line with the written spec and the existing codebase.
- The Technician. When deep technical questions arose—like how to handle cache manipulation or complex concurrency—Claude provided immediate suggestions based on our actual code.
- The Scrum Master. We had pre-created several Jira tickets. As the discussion evolved, we asked Claude to update those tickets in real-time to ensure they captured the latest consensus and technical requirements.
We are also looking for opportunities in the future to test out more of those behavioral personas mentioned in the research article to see how they might push our team’s critical thinking even further.

3. Keep Ownership with the Group

“Ownership” here doesn’t mean a single person is in charge of the AI. Instead, it means everyone participates in maintaining the interaction. The “quiet reading” phase is a staple of our reviews—we spent 15 minutes at the start reading the design and listing our own comments so that our own judgment remained the primary driver.

We are already looking to extend this sense of collective maintenance to other tools. For instance, our Ops bots are maintained and used collectively on Ona.

The Tangible Outcomes

By the end of a typical 60-minute review, the “teammate” approach yields results that used to take hours/days of follow-up work. Our immediate outcomes now include:

1. An updated Confluence spec page. The spec is automatically updated to contain all discussed points and factors, ensuring no “institutional knowledge” is lost.

2. Synced Jira tickets and epics. All relevant tickets are updated or created during the meeting, reflecting the latest consensus.

3. Minimal follow-ups. Because the AI helped us resolve technical questions and document decisions in real time, there are no more follow-up tasks required after the meeting. We leave the room with the work actually done.

The “On-Ramp” Challenge

It’s important to note that this isn’t always easy or automatic. We found that it is very natural to just jump into a discussion and start a meeting without the AI.

We often had moments where, 20 minutes in, we’d realize: “Hey, we should have started the Claude instance earlier.” We’d be sitting there with technical questions or spec uncertainties that we could have checked immediately if we had been using our “teammate” from the start. Building the habit of bringing the AI into the room before the questions arise is a significant part of the learning curve.

The Takeaway

Using AI as a teammate isn’t about letting the machine do the work. It’s about leveraging its ability to process vast amounts of context (code, specs, tickets) to elevate the human conversation.

The most surprising result? Design reviews are fun now. There’s a new kind of energy in the room when you can resolve a technical debate in seconds or verify a spec on the fly. I find myself constantly looking for new opportunities to bring this “teammate” into other parts of our workflow.

When everyone in the room—including the AI—knows their role and the context of their peers, the “unraveled strands” of a complex system start to untangle much faster.

What do you think? Have you tried bringing an AI into your team meetings?

Footnote – The industry is rapidly moving toward making AI a native part of the video call experience. Technologies like Google Beam (formerly Project Starline) are pushing this further with 3D volumetric video and AI personas like “Sophie” that can interact in real time. Our experiment with sharing a Claude instance side-by-side is a low-friction version of this future: bringing the AI out of the private chat window and into the shared team space.
May 31, 2026
Field Report – AgentCon Silicon Valley (Part 2)
The Conference

AgentCon Silicon Valley is a free, one-day, in-person conference for developers building with AI agents. This post covers my personal experience of the event and my thoughts about it. It is the second of a two-part series; see my first post at https://unraveledstrands.com/2026/05/10/agentcon-silicon-valley-2026-part-1/. A major theme of the event was sharing the skills needed to be an agent boss: a builder who builds tools and frameworks to delegate to and control agents.

In this post, I will focus on the talks from the afternoon session of the conference.
- Lessons from a No-Code Library – Drew Breunig
- Securing Coding Agents: Sandboxes, Guardrails, and Real-World Attacks – Dan Ndombe
- From One-Shot to Agentic: Optimizing Shop Intelligence with DSPy – Kshetrajna Raghavan
- GitHub Agentic Workflows – Peli de Halleux
- Client side Web AI Agents for the agentic internet of the future – Jason Mayes
Lessons from a No-Code Library – Drew Breunig

This talk was probably my favorite of the conference. It addressed the challenges of Spec-Driven Development (SDD) with AI coding—a workflow that is now practiced everywhere in the industry in one form or another. In my team, we use Confluence and Jira as the source of truth, with spec files guiding the agent during implementation.

However, as many others have found, the difficulty is that agents generate code faster than we can review it. Since specs are never 100% perfect, an agent will inevitably make assumptions to fill in the gaps. Without a feedback loop, these “silent decisions” are never documented or tested. This is exactly why code review subagents and concepts like “ultrareview” have become so necessary.

The Framework: The Spec-Driven Development Triangle

Drew Breunig explored this problem while developing whenword, a library containing almost no manual code—only specs and tests. The implementation was left entirely to an agent, which often
results in “spec drift” as the code evolves away from the original documentation.

To manage this, Drew introduced the Spec-Driven Development Triangle, which balances Spec, Test, and Code. He uses an LLM as a judge to compare the final implementation against the original spec. The model identifies exactly where the agent filled in gaps or deviated from the requirements, flagging those points for the developer to review.

This approach closely mirrors Behavior-Driven Development (BDD). Combining BDD to define the ground truth with an LLM judge to verify the implementation is a practical way to maintain oversight.

Image from https://www.dbreunig.com/2026/03/04/the-spec-driven-development-triangle.html

The Tool: plumb

To automate this feedback loop, Drew built plumb. The tool integrates directly into the development workflow via git hooks—specifically pre-commit and post-commit hooks.

When a developer attempts to commit code, the pre-commit hook triggers an LLM analysis of the staged changes and the
agent’s conversation history. It identifies any “silent decisions” made during implementation and blocks the commit if there are undocumented changes. Once the developer approves the findings, plumb automatically syncs those decisions back into the specs and tests, ensuring the documentation remains an accurate reflection of the software.
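To make the shape of that loop concrete, here is a minimal sketch of a pre-commit gate in Python. It is not plumb's implementation: the `Decision` record, `find_undocumented`, `pre_commit_gate`, and the stubbed judge are all hypothetical names, and the LLM call is replaced by a plain function so the flow can be run offline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Decision:
    """One choice the agent made that the spec did not dictate (hypothetical shape)."""
    summary: str
    approved: bool = False


def find_undocumented(decisions: List[Decision], spec_text: str) -> List[Decision]:
    """Return decisions that are neither approved nor mentioned in the spec."""
    spec_lower = spec_text.lower()
    return [
        d for d in decisions
        if not d.approved and d.summary.lower() not in spec_lower
    ]


def pre_commit_gate(
    staged_diff: str,
    spec_text: str,
    judge: Callable[[str, str], List[Decision]],
) -> int:
    """Return a hook exit code: 0 lets the commit through, 1 blocks it."""
    pending = find_undocumented(judge(staged_diff, spec_text), spec_text)
    for decision in pending:
        print(f"undocumented decision: {decision.summary}")
    return 1 if pending else 0


def stub_judge(diff: str, spec: str) -> List[Decision]:
    """Stand-in for the LLM judge (assumption: a real one would read diff, chat log, and spec)."""
    found = []
    if "round(" in diff:
        found.append(Decision("durations are rounded to the nearest minute"))
    return found
```

In a real hook, the script would presumably feed in the output of `git diff --cached` and exit with the returned code, so a non-zero result stops the commit until the decision is approved or written into the spec.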

I find it really inspiring how Drew took the philosophical lessons from his whenword experiment and translated them into a pragmatic, usable tool that solves a real engineering bottleneck.

Securing Coding Agents: Sandboxes, Guardrails, and Real-World Attacks – Dan Ndombe

Dan Ndombe is from Docker, and he was there to talk about Docker sandboxes. Providing a secure environment is definitely a very important aspect of agentic workflows and workloads. You are giving the agent autonomy to drive, but there must be guardrails and a safety net. This is similar to my talk at the 2026 Background Agent Summit, where I spoke about using Ona as an environment for agents to run background implementation.

Docker sandboxes are MicroVM-isolated environments. By using a hardware-level hypervisor, each agent gets its own dedicated Linux kernel. This is in contrast to regular Docker containers, which share the host’s kernel, so a “jailbroken” agent could theoretically escape to the host machine via a kernel exploit.

Features Dan highlighted include:
- Hypervisor Isolation: Each sandbox runs in a lightweight MicroVM (using Apple Hypervisor, Windows WHP, or KVM),
  isolating the agent from the host processes entirely.
- Network Guardrails: All egress is routed through a proxy that enforces strict domain allow-lists, preventing agents
  from exfiltrating secrets.
- Private Docker Daemons: The sandboxes include their own private Docker engine, allowing agents to run docker build
  or docker compose (Docker-in-Docker) without needing dangerous “privileged” access to the host.
I was most interested in the egress proxy. For long-running tasks and agents operating ‘in the wild,’ preventing the agent from having direct access to secrets stored in environment variables or accessible files is going to be extremely relevant and important.
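The allow-list idea behind the egress proxy can be sketched in a few lines of Python. This is only an illustration of the check, not how Docker implements it; `ALLOWED_DOMAINS` and `is_egress_allowed` are hypothetical names, and the domains are placeholders.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list; a real sandbox policy would be configured elsewhere.
ALLOWED_DOMAINS = {"pypi.org", "files.pythonhosted.org", "github.com"}


def is_egress_allowed(url: str, allowed=ALLOWED_DOMAINS) -> bool:
    """Allow a request only if its host is an allowed domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    # Match on a dot boundary so "notgithub.com" does not pass as "github.com".
    return any(host == domain or host.endswith("." + domain) for domain in allowed)
```

Matching on the dot boundary is the important detail: a plain suffix or substring check would let look-alike hosts through.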

Dan Ndombe’s core message provided a powerful summary of the “Agent Boss” era: “An AI agent is only as safe as we want it to be.” It isn’t about taking away the agent’s tools, but about ensuring those tools are used within a secure, kernel-level boundary that the human ‘Boss’ controls.

From One-Shot to Agentic: Optimizing Shop Intelligence with DSPy – Kshetrajna Raghavan

Kshetrajna Raghavan, a Principal Engineer at Shopify, delivered what I found to be the best case study for scaling agents in production. He shared the journey of “Shop Intelligence”—Shopify’s system for extracting structured data from millions of highly customized, non-standard merchant stores.

His talk perfectly illustrated a lesson that I have learned and am applying at work: being an Agent Boss requires intense pragmatism.

Given enough attempts, compute, and liberty, an advanced AI could probably solve the majority of the technical issues
we face. But at what cost? If a task costs more to execute than the value it generates, it ceases to be a practical solution. Are we building Lamborghinis to do the job of a tow truck?

Image from https://cleantechnica.com/2023/12/24/texas-dps-is-less-amused-with-the-model-y-superheavy-than-we-are/

Shopify’s journey highlighted this exact tension:
1. The “One-Shot” Wall
  Initially, Shopify used single, large API calls to OpenAI models to analyze store content. While this worked for
  simple cases, it was unsustainable. Not only was the cost of processing millions of stores astronomical, but a single
  prompt couldn’t handle the messy reality of diverse store layouts.
2. Moving to the ReAct Loop
  The first major shift was moving from a static “blob” of text to an autonomous agent using a ReAct (Reasoning and
  Action) loop. Instead of guessing based on a single snapshot, the agent could explore the store—deciding which pages
  to visit and which data points were missing.
3. The Swarm of Sub-Agents
  To manage the complexity of exploration, they broke the “do-it-all” agent into a swarm of specialized sub-agents. One
  agent might focus on brand identity, while another focuses on product categorization. This modularity doubled their
  precision, but relying on third-party APIs for this many distinct agent interactions was still prohibitively
  expensive.
4. Self-Hosting and Compiling with DSPy
  The final, most impressive step was using DSPy to programmatically optimize the entire pipeline so they could bring it
  in-house. Instead of manually tuning prompts for OpenAI, they treated their workflows as code that could be compiled
  against specific metrics. This optimization allowed them to move away from third-party APIs entirely. By spinning up their own H100 clusters and
  leveraging self-hosted Qwen models, the economics of the system fundamentally changed. The results were incredible:
- 75x Cost Reduction: The combination of an optimized agentic architecture and self-hosted models drove the cost down
  by a factor of 75, while actually improving data quality.
- Universal Coverage: This efficiency allowed Shopify to scale the system to analyze every single store on the
  platform, something that would have been financially impossible on a per-token API model.
Raghavan’s core message was that “architecture compounds.”
Success didn’t come from a single breakthrough, but from the steady evolution from one-shot API calls to a self-hosted swarm of sub-agents, all programmatically optimized for the real world.
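For readers who have not seen a ReAct loop, the pattern from step 2 can be sketched as below. This is a toy, not Shopify's code and not DSPy: the pages, the `visit` tool, and `scripted_policy` are made up, and the policy is a hard-coded function standing in for the model's reasoning step.

```python
from typing import Callable, Dict, Tuple

# A toy "store": page name -> page text. Purely illustrative data.
PAGES = {
    "home": "Welcome to Acme Outdoor. See our about page.",
    "about": "Acme Outdoor sells camping gear.",
}


def visit(page: str) -> str:
    """Tool: return the text of a page, or a not-found marker."""
    return PAGES.get(page, "404")


def scripted_policy(facts: Dict[str, str], last_observation: str) -> Tuple[str, str]:
    """Stand-in for the model: decide the next action from what is known so far."""
    if "category" in facts:
        return ("finish", "")
    if "about page" in last_observation:
        return ("visit", "about")
    return ("visit", "home")


def react_loop(policy: Callable, max_steps: int = 5) -> Dict[str, str]:
    """Alternate between deciding (reason) and calling a tool (act)."""
    facts: Dict[str, str] = {}
    observation = ""
    for _ in range(max_steps):
        action, arg = policy(facts, observation)
        if action == "finish":
            break
        observation = visit(arg)
        # Record a fact once the observation contains the data we are missing.
        if "camping gear" in observation:
            facts["category"] = "camping gear"
    return facts
```

The `max_steps` cap is the pragmatic part: every extra step is another model call, which is exactly the cost pressure the talk described.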

I am also applying this approach in my own work. We started using agentic workflows for basic debugging and have since built on those guardrails and architecture to allow agents to handle more complex tasks, like background agents performing code implementation and self-improvement. While we haven’t found a need to self-host models yet, switching to cheaper models (like MiniMax or AWS Nova Lite) for basic tasks has helped manage costs and made the system much more practical for everyday use.

GitHub Agentic Workflows – Peli de Halleux

I find GitHub Agentic Workflows to be an easy and practical way for engineering teams already using GitHub to adopt agentic workloads. This is especially true for large enterprises, which often have more restrictions and compliance requirements to follow. Applying AI in a more controlled environment like a CI/CD process should also have less severe consequences when things go sour, as opposed to a customer-facing agent.

The Agent Factory in Action

Peli described a setup that feels a bit wild but makes total sense: AI writing software that writes AI software. In Peli’s Agent Factory, they’ve stopped trying to build one giant, monolithic agent. Instead, they have over 100 little “bite-sized” workflows doing highly specific jobs.

Of course, when you have over 100 agents running around a repo, no human can read all their outputs. Peli’s solution to this is to just add more agents. They use “meta-agents” whose entire job is to watch the other agents and make sure they are behaving.

screenshot from https://github.github.com/gh-aw/blog/2026-01-12-welcome-to-pelis-agent-factory/

Something that ties perfectly back to Drew Breunig’s talk: GH AW has moved to a spec-only contribution model. This is the whenword experiment playing out at an enterprise scale. If you want to contribute to the Agent Factory, you don’t write code. You write a Markdown file (a spec) that describes exactly what you want the agent to do. The system then “compiles” that natural language into a secure GitHub Action.

It’s an interesting look at the future. Seeing a giant like GitHub bet so heavily on spec-driven contributions makes
me think this isn’t just a neat trick—it’s probably the only way we are going to safely manage these systems at scale.

I managed to speak with Peli after his talk to get some extra tips and tricks for improving agentic CI/CD workflows. He recommended checking out qmd (https://github.com/tobi/qmd) as a code wiki. We also briefly discussed how to avoid reward hacking in the CI/CD process, a conversation that eventually inspired me to write my post, “Notes on Autonomous Development – Prevent AI Reward Hacking In Gitlab CI/CD.”

Specs vs. Code for Security

This also made me think a lot about security. In a normal open-source or enterprise repo, you have to read every line of code to make sure someone hasn’t introduced a vulnerability. But with spec-driven development, you’re just auditing the intent. Because the actual implementation is generated within strict, pre-defined guardrails, a spec-driven contribution might actually be way more secure than a human-written one.

Client side Web AI Agents for the agentic internet of the future – Jason Mayes

Find out more https://github.com/jasonmayes/WebAIAgent

I caught the final session of the conference, which focused on client-side Web AI agents. The speaker used flight booking as an example. Finding a flight usually requires navigating rigid, static filters—picking dates, checkboxes, and price sliders. The demo showed how a user could instead use voice to surface flights, with the UI changing to fit the user’s intent. This shift to client-side LLMs highlights a few distinct points for running AI in production:
- Latency: Processing the prompt locally on the device’s hardware removes the round-trip delay to cloud APIs.
- Task-Appropriate Models: As the Shopify talk noted, you don’t need a state-of-the-art model for every single task. A webpage doesn’t require a massive, generalized model; it just needs one capable of mapping user intent to specific local functions.
- Production Economics: Moving inference to the client device removes the cloud infrastructure costs of running agents at scale.
This architecture could change how we approach the specificity paradox in web design. Currently, developers and designers spend time building custom user experiences for every edge case, trying to predict what a user might do next. With native cognitive capability in the browser, websites could simply expose their tools and data structures via protocols like WebMCP, allowing a local model to parse the page and handle the specific workflow a user requests in that moment.

This also points to a practical reason behind Google’s open-source strategy with Gemma 4. If lightweight models are going to run natively within the browser environment (like Chrome’s built-in AI architecture), the model weights must live on local consumer devices. Making Gemma open-weight aligns with a framework where rendering and agent orchestration happen entirely on the client side.

Closing thoughts

AgentCon was timely and highly relevant. It’s clear that being an “Agent Boss” will be a mandatory skillset in the AI
era. This means taking responsibility for the agent’s environment—whether that’s a secure sandbox, a CI/CD pipeline,
or a web browser. Ultimately, our success will be defined by how well we provide the right context and guardrails to
turn autonomous actions back into human-led intent.
May 24, 2026
Field Report – AgentCon Silicon Valley 2026 (Part 1)
AgentCon Silicon Valley is a free, one-day, in-person conference for developers building with AI agents.

One of the perks of living in the Bay Area is that every week there is a tech conference worth going to. Last week, I attended AgentCon Silicon Valley at the Computer History Museum in Mountain View, California, which, by the way, is a fantastic venue. Because there was so much great content to digest, I’m breaking my report into two parts to stay within a reasonable “context window” for an article.

The Conference

AgentCon 2026 took place on May 4th (yes, there were lots of Star Wars references). It is a small to mid-size conference with two to three concurrent tracks. The key sponsors were:

The conference was great. The majority (if not all) of the speakers were engineers and developers, and the content was all very applicable to my daily work. The full schedule can be found at the event’s page.

Altogether I went to 9 talks. I am very glad that there were always seats available and that attendance was pretty evenly distributed between concurrent talks.

These are the talks I attended:

Part 1 (this entry, morning session)
- Will The Real Autonomous Agent Please Stand Up – Patrick Chanezon, Dona Sarkar
- Your agent needs a sandbox, not a desert – Samuel Colvin
- How to Build Auditable Agents Using Context Graphs – Nyah Macklin
Part 2 (afternoon session)
- Agents Don’t Know What They Don’t Know – Rob Zuber
- Lessons from a No-Code Library – Drew Breunig
- Securing Coding Agents: Sandboxes, Guardrails, and Real-World Attacks – Dan Ndombe
- From One-Shot to Agentic: Optimizing Shop Intelligence with DSPy – Kshetrajna Raghavan
- GitHub Agentic Workflows – Peli de Halleux
- Client side Web AI Agents for the agentic internet of the future – Jason Mayes
Will The Real Autonomous Agent Please Stand Up – Patrick Chanezon, Dona Sarkar

The session was presented by two excellent speakers, and I would have loved to hear them speak longer.

To open AgentCon, Dona Sarkar shared two key concepts that framed the rest of the day. She spoke about the evolution of becoming an ‘Agent Boss’ and, at the same time, reminded us to check if we’re building real AI innovation or simply
settling for ‘faster horses.’ Building faster horses is fine if that frees us up to work on transformative AI. These two thoughts really resonated throughout the sessions.

“If I had asked people what they wanted, they would have said faster horses.” – Not Henry Ford.

Dona also explored the various form factors AI will manifest in. Her talk was funny, entertaining, and highly informative.

Patrick Chanezon

Unfortunately, due to a work matter, I missed the first half of his talk. However, in the portion I caught, he spoke about the shifting roles of ICs and Managers in the agentic evolution. Doubling down on the theme of “Agent Bosses,” he explained that an IC’s success now depends on how effectively they can oversee their agents. He also referenced a key ACM article suggesting that as AI automates entry-level tasks, the industry must adopt a “preceptor” model to ensure junior developers still gain the critical judgment needed to become the next generation of senior engineers. To be successful in this new landscape, you will need the following skills:

(taken from https://www.youtube.com/watch?v=0HI3OIi-YJY)

Your agent needs a sandbox, not a desert – Samuel Colvin

Samuel Colvin (the creator of Pydantic) introduced Monty, a Rust-based Python interpreter designed for safe agentic use. Unlike CPython, it operates on a “deny-all” security model, starting with zero capabilities. Because it is an interpreter, it boasts microsecond startup times. It’s an ideal tool for basic Python tasks, text manipulation, and math within a secure environment.

How to Build Auditable Agents Using Context Graphs – Nyah Macklin

Nyah discussed leveraging Neo4j to build context and memory graphs. This was one of my favorite talks because it was immediately applicable to my current projects. I am so happy to see Neo4j releasing these tools as completely open-source rather than “fauxpen” source.

If you are serious about becoming an “Agent Boss,” a scalable, distributed context and memory system is a must. Research (such as the CommGPT paper) and practical application consistently show that you can achieve significantly better performance by providing a robust Knowledge Graph and RAG system rather than relying solely on fine-tuning.

Key Takeaway: The primary advantage of GraphRAG over traditional vector-based RAG is its inherent ability to map and understand complex, interconnected relationships within data. In her talk, she mentioned that while neo4j-labs/agent-memory uses semantic search for core retrieval, it leverages Knowledge Graph structures for organization, deduplication, and context assembly.
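The split between "semantic search for retrieval" and "graph for context assembly" can be illustrated with a toy example. This sketch does not use Neo4j or the agent-memory API; the embeddings, edges, and function names are all invented for illustration, and a real system would get vectors from an embedding model and relationships from a graph database.

```python
import math
from typing import Dict, List, Tuple

# Toy embeddings and edges (assumptions, not real data).
EMBEDDINGS: Dict[str, List[float]] = {
    "cache service": [0.9, 0.1, 0.0],
    "billing api": [0.1, 0.9, 0.0],
    "redis cluster": [0.7, 0.2, 0.1],
}
EDGES: List[Tuple[str, str, str]] = [
    ("cache service", "DEPENDS_ON", "redis cluster"),
    ("billing api", "CALLS", "cache service"),
]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: List[float], k: int = 1) -> List[str]:
    """Step 1, semantic search: pick the k entities closest to the query."""
    ranked = sorted(EMBEDDINGS, key=lambda name: cosine(query_vec, EMBEDDINGS[name]), reverse=True)
    return ranked[:k]


def assemble_context(seeds: List[str]) -> List[str]:
    """Step 2, graph expansion: collect one-hop relationships, deduplicated."""
    facts: List[str] = []
    for source, relation, target in EDGES:
        if source in seeds or target in seeds:
            fact = f"{source} {relation} {target}"
            if fact not in facts:
                facts.append(fact)
    return facts
```

Vector search alone would return only the closest entity; the graph step is what surfaces what depends on it and what it depends on.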

Closing Thoughts on Part 1

The morning sessions at AgentCon made one thing clear: we are moving past the “AI as a chatbot” phase and into the “AI as a workforce” era. Becoming an Agent Boss isn’t just a catchy phrase; it’s a fundamental shift in how we think about code, memory, and security. Whether it was the security of Rust-based sandboxes or the structural power of Knowledge Graphs, the bar for “real innovation” is being set higher every day.

I’m still processing the implications for the future of the engineering profession, but the transition toward an Agentic SDLC is clearly well underway.

Coming Next in Part 2

In the next entry, I’ll dive into the remaining six talks from the afternoon tracks, focusing on the practical “how-to” of securing and optimizing agents.

Here’s what I’ll be covering:
- Agents Don’t Know What They Don’t Know: Handling uncertainty with Rob Zuber.
- Lessons from a No-Code Library: Drew Breunig on simplifying complexity.
- Securing Coding Agents: A deep dive into guardrails and real-world attacks.
- Optimizing Shop Intelligence: Using DSPy to move beyond one-shot prompts.
- GitHub Agentic Workflows: How the industry giants are orchestrating agents.
- Client-side Web AI Agents: The future of the agentic internet.
Stay tuned—Part 2 will be live shortly!
May 10, 2026
Notes on Autonomous Development – Prevent AI Reward Hacking In Gitlab CI/CD
“When a measure becomes a target, it ceases to be a good measure.” – Goodhart’s Law

I frequently run autonomous agents in the background to handle development tasks. I use a Trunk-Based Development model where the agents can merge into the trunk via auto-merge when the CI/CD pipeline passes. However, I have observed agents attempting to “reward hack”: skipping tests or bypassing checks to satisfy their goal of completing a task.

To ensure the integrity of my CI/CD pipeline, I have implemented the following guardrails.

Enforcing “Pipeline Must Succeed”

This is a baseline requirement, but it is insufficient on its own. There is nothing stopping an agent from removing or editing a test suite to ensure the pipeline passes, thereby triggering an undeserved merge.

Utilizing CODEOWNERS

The CODEOWNERS file assigns ownership to specific files or directories. By combining this with branch protection rules, you can ensure that any changes to critical files require manual approval from a human owner.

It is vital to include self-protection rules. You must prevent the agent from modifying the CODEOWNERS file itself, as well as any related CI configuration files. If an agent can edit your CI YAML, it can simply “silence” the test steps. Inside your CODEOWNERS file, you should add:
  CODEOWNERS @your-username
  very_important_tests.py @your-username
  .gitlab-ci.yml @your-username
  # Include any other nested CI includes or scripts
  ci/scripts/* @your-username
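As a complementary check, a small script can flag when an agent's change set touches one of these protected paths. This is a sketch under assumptions: `PROTECTED_PATTERNS` and `protected_changes` are hypothetical names, and Python's `fnmatch` only approximates GitLab's CODEOWNERS pattern matching, which has its own rules.

```python
from fnmatch import fnmatch
from typing import Iterable, List

# Mirrors the CODEOWNERS entries above. fnmatch is an approximation of
# GitLab's matching, not a reimplementation of it.
PROTECTED_PATTERNS = [
    "CODEOWNERS",
    "very_important_tests.py",
    ".gitlab-ci.yml",
    "ci/scripts/*",
]


def protected_changes(changed_files: Iterable[str], patterns=PROTECTED_PATTERNS) -> List[str]:
    """Return the changed paths that fall under a protected pattern."""
    return [
        path for path in changed_files
        if any(fnmatch(path, pattern) for pattern in patterns)
    ]
```

A non-empty result is the signal that a human owner needs to look at the change before it merges.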
Using Pipeline Execution Policies

If we look at the industries with the highest stakes for software engineering (e.g., aviation and space), external verification becomes very important. Relying on a single repository to guard itself is starting to feel like a loop that’s too easy to break. It’s either insufficient or just really inefficient to manage.

GitLab’s Pipeline Execution Policies allow teams to enforce mandatory, immutable CI/CD jobs across specific projects. These policies ensure that critical validation gates cannot be bypassed or modified by an autonomous agent, as the configuration lives outside the agent’s reach.

Furthermore, pipeline execution policy jobs can be assigned to one of the two reserved stages:
- .pipeline-policy-pre: Runs at the very beginning of the pipeline (before the .pre stage). This is ideal for security scans or IaC (Infrastructure as Code) validation to prevent unwanted code from executing.
- .pipeline-policy-post: Runs at the very end (after the .post stage). This is the place for integration tests, ensuring test coverage levels are maintained, and preventing “spec drift.”
Other Mechanisms and Conclusion

There are several other tools to enhance CI/CD verification that are worth exploring:
- External Status Checks: Requiring a “green light” from an external service.
- Webhooks: Triggering secondary validation layers.
- Scan Result Policies: Blocking merges if new vulnerabilities are detected.
- Push Rules: Prohibiting specific file changes or naming conventions.
Software development is evolving, and our CI/CD practices must evolve with it. We have moved from simple “build and test” routines to a world where we are governing autonomous intelligence. It is a challenging, yet incredibly exciting time to be a software engineer.
May 9, 2026
AI, Code, and Verification: A Simple Trick for Accurate Results
TLDR
- LLMs can be terrible at math or at generating responses that require precision.
- A simple rule is to ask the LLM to generate code to do the math instead of using its answer. This can be achieved with a simple prompt like:
  When asked to do any calculations or conversions, always generate code and run it instead of generating a response immediately
Hallucination

It’s a known problem that AIs “hallucinate,” especially when you need a precise answer – like doing math or counting.

This was famously exposed when earlier generation LLMs got stumped by ‘gotcha’ questions like, “How many ‘r’s are in strawberry?”, which showed they weren’t really thinking. While most advanced models today have now learned to answer that question correctly, this isn’t necessarily because they’ve learned to reason, but because they have been specifically trained or prompted to patch that obvious flaw.

Taken from https://www.reddit.com/r/singularity/comments/1enqk04/how_many_rs_in_strawberry_why_is_this_a_very/

While this shows progress, it also reveals that their accuracy can be a result of targeted training rather than innate computational ability.

This exact issue resurfaced for me with a more practical, real-world problem – and this is what I am doing now to prevent it!

Feeling Lazy

I was debugging an issue in MongoDB and had a seemingly simple task: convert a MongoDB ObjectId, 6616b9157bac1647326e11e1, into a human-readable timestamp.

For those who are unfamiliar with MongoDB ObjectIds, or who have been using MongoDB without being aware of this: a MongoDB ObjectId is a 12-byte value that includes a 4-byte timestamp in its initial segment. This timestamp represents the number of seconds that have passed since the Unix epoch (January 1, 1970). (see docs)

The Hallucination

And… it wasn’t just an answer—ChatGPT delivered it with the full swagger of a lead engineer who’s 100% sure of themselves. It laid out the whole thing step-by-step, explaining the ID format, how it pulled the timestamp, and all that.

The correct answer should have been 2025-07-09T06:01:39.000Z

The timestamp it gave me seemed legit at first since it was the right day. But something felt off; the time seemed to be off by a few hours. Thank goodness I listened to that little voice in my head and ran the conversion myself. Sure enough, ChatGPT was wrong!

Not Just ChatGPT

Curious, I tried the same prompt with Grok, Gemini, and Claude. The results were a mixed bag of confidently incorrect answers. This experience was a stark reminder that while the most obvious flaws are being patched, the underlying weakness in performing novel, precise conversions still persists.

The Better Approach: Ask for the Code, Not the Answer

This brings me to the core lesson I learned from this: instead of asking an LLM for the final answer, ask it to write code to produce the answer. My experience with Cursor was a perfect example. While the answer in its chat was wrong, it also provided a code snippet.

Always ask for code!

That code was the correct path. This approach plays to the AI’s strengths, shifting the task from a weak point (calculation) to a strong point (code generation). Ideally, the model would then execute that code in a sandboxed environment to provide a verified result.
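For reference, the conversion itself is only a few lines. The sketch below uses only the Python standard library; `objectid_to_datetime` is a name chosen here, not something from the chat output. Note that the ID as printed in this post decodes to a date in April 2024 rather than the timestamp quoted earlier, so the printed ID may not be the one from the original debugging session; the method is the same either way.

```python
from datetime import datetime, timezone


def objectid_to_datetime(object_id: str) -> datetime:
    """Decode the creation time embedded in a MongoDB ObjectId hex string."""
    if len(object_id) != 24:
        raise ValueError("an ObjectId is 24 hex characters (12 bytes)")
    # The first 4 bytes (8 hex characters) are big-endian seconds since the Unix epoch.
    seconds = int(object_id[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


print(objectid_to_datetime("6616b9157bac1647326e11e1").isoformat())
```

If the MongoDB driver is already installed, the `generation_time` attribute on `bson.ObjectId` gives the same value without any manual parsing.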

That’s right!

A Simple Rule

Here’s a simple rule: if it involves math or a conversion, always ask the LLM to write code.

Here is a short example of how to do that with a simple prompt:

When asked to do any calculations or conversions, always generate code and run it instead of generating a response immediately.

This too works for counting “R”s =)
July 9, 2025