Tag: chatgpt

  • Field Report – AgentCon Silicon Valley (Part 2)

    Field Report – AgentCon Silicon Valley (Part 2)

    The Conference

    AgentCon Silicon Valley is a free, one-day, in-person conference for developers building with AI agents. This post covers my personal experience of the event and my thoughts about it. It is the second part of a two-part series; see my first post at https://unraveledstrands.com/2026/05/10/agentcon-silicon-valley-2026-part-1/. A major theme of the event was sharing the skills needed to be an agent boss: a builder who creates tools and frameworks to delegate to and control agents.

    In this post, I will focus on the talks from the afternoon session of the conference.

    • Lessons from a No-Code Library – Drew Breunig
    • Securing Coding Agents: Sandboxes, Guardrails, and Real-World Attacks – Dan Ndombe
    • From One-Shot to Agentic: Optimizing Shop Intelligence with DSPy – Kshetrajna Raghavan
    • GitHub Agentic Workflows – Peli de Halleux
    • Client side Web AI Agents for the agentic internet of the future – Jason Mayes

    Lessons from a No-Code Library – Drew Breunig

    This talk was probably my favorite of the conference. It addressed the challenges of Spec-Driven Development (SDD) with AI coding—a workflow that is now practiced everywhere in the industry in one form or another. In my team, we use Confluence and Jira as the source of truth, with spec files guiding the agent during implementation.

    However, as many others have found, the difficulty is that agents generate code faster than we can review it. Since specs are never 100% perfect, an agent will inevitably make assumptions to fill in the gaps. Without a feedback loop, these “silent decisions” are never documented or tested. This is exactly why code review subagents and concepts like “ultrareview” have become so necessary.

    The Framework: The Spec-Driven Development Triangle

    Drew Breunig explored this problem while developing whenword, a library containing almost no manual code—only specs and tests. The implementation was left entirely to an agent, which often
    results in “spec drift” as the code evolves away from the original documentation.

    To manage this, Drew introduced the Spec-Driven Development Triangle, which balances Spec, Test, and Code. He uses an LLM as a judge to compare the final implementation against the original spec. The model identifies exactly where the agent filled in gaps or deviated from the requirements, flagging those points for the developer to review.

    This approach closely resembles Behavior-Driven Development (BDD). Combining BDD to define the ground truth with an LLM judge to verify the implementation is a practical way to maintain oversight.

    Image from https://www.dbreunig.com/2026/03/04/the-spec-driven-development-triangle.html

    The Tool: plumb

    To automate this feedback loop, Drew built plumb. The tool integrates directly into the development workflow via git hooks—specifically pre-commit and post-commit hooks.

    When a developer attempts to commit code, the pre-commit hook triggers an LLM analysis of the staged changes and the agent’s conversation history. It identifies any “silent decisions” made during implementation and blocks the commit if there are undocumented changes. Once the developer approves the findings, plumb automatically syncs those decisions back into the specs and tests, ensuring the documentation remains an accurate reflection of the software.
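    The gate described above can be sketched in a few lines of Python. This is only an illustration of the idea, not plumb's actual implementation: the Decision class and both function names are hypothetical, and the LLM analysis is replaced by a hard-coded list.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    """One implementation choice the agent made that the spec did not cover."""
    summary: str
    approved: bool = False


def commit_allowed(decisions: list[Decision]) -> bool:
    """Block the commit while any detected decision is still unreviewed."""
    return all(d.approved for d in decisions)


def sync_into_spec(spec: str, decisions: list[Decision]) -> str:
    """Append approved decisions to the spec so it matches the code again."""
    approved = [f"- {d.summary}" for d in decisions if d.approved]
    if not approved:
        return spec
    return spec.rstrip("\n") + "\n\nDecisions:\n" + "\n".join(approved) + "\n"


# In a real hook these would come from an LLM pass over the staged diff and
# the agent's conversation history; here they are hard-coded for illustration.
found = [Decision("Relative dates round down to whole days")]
print(commit_allowed(found))  # False

found[0].approved = True
print(commit_allowed(found))  # True
print(sync_into_spec("# Spec\nFormat relative dates.", found))
```

    The point of the sketch is the ordering: nothing lands in the repository until a human has looked at each gap the agent filled, and only approved decisions flow back into the spec.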

    I find it really inspiring how Drew took the philosophical lessons from his whenword experiment and translated them into a pragmatic, usable tool that solves a real engineering bottleneck.

    Securing Coding Agents: Sandboxes, Guardrails, and Real-World Attacks – Dan Ndombe

    Dan Ndombe is from Docker, and he was there to talk about Docker sandboxes. Providing a secure environment is definitely an important aspect of agentic workflows and workloads: you are giving the agent autonomy to drive, but there must be guardrails and a safety net. This is similar to my talk at the 2026 Background Agent Summit, where I spoke about using Ona as an environment for agents to run background implementation.

    Docker sandboxes are MicroVM-isolated environments. By using a hardware-level hypervisor, each agent gets its own dedicated Linux kernel. This is in contrast to regular Docker containers, which share the host’s kernel, so a “jailbroken” agent could theoretically escape to the host machine via a kernel exploit.

    Features Dan highlighted include:

    • Hypervisor Isolation: Each sandbox runs in a lightweight MicroVM (using Apple Hypervisor, Windows WHP, or KVM),
      isolating the agent from the host processes entirely.
    • Network Guardrails: All egress is routed through a proxy that enforces strict domain allow-lists, preventing agents
      from exfiltrating secrets.
    • Private Docker Daemons: The sandboxes include their own private Docker engine, allowing agents to run docker build
      or docker compose (Docker-in-Docker) without needing dangerous “privileged” access to the host.
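
    To illustrate the network guardrail, here is a minimal Python sketch of a domain allow-list check of the kind an egress proxy might apply. It is not Docker's implementation: the allow-list contents and the egress_allowed function are assumptions made for the example.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list; a real policy would come from sandbox configuration.
ALLOWED_DOMAINS = {"github.com", "pypi.org"}


def egress_allowed(url: str, allowed: set[str] = ALLOWED_DOMAINS) -> bool:
    """Return True only if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    return any(host == d or host.endswith("." + d) for d in allowed)


print(egress_allowed("https://api.github.com/repos"))        # True
print(egress_allowed("https://github.com.evil.example/x"))   # False
print(egress_allowed("https://attacker.example/?k=SECRET"))  # False
```

    Matching on the full host name (or a dot-prefixed suffix) matters: a plain substring check would wave through look-alike hosts such as the second example.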

    I was most interested in the egress proxy. For long-running tasks and agents operating ‘in the wild,’ preventing the agent from having direct access to secrets stored in environment variables or accessible files is going to be extremely important.

    Dan Ndombe’s core message provided a powerful summary of the “Agent Boss” era: “An AI agent is only as safe as we want it to be.” It isn’t about taking away the agent’s tools, but about ensuring those tools are used within a secure, kernel-level boundary that the human ‘Boss’ controls.

    From One-Shot to Agentic: Optimizing Shop Intelligence with DSPy – Kshetrajna Raghavan

    Kshetrajna Raghavan, a Principal Engineer at Shopify, delivered what I found to be the best case study for scaling agents in production. He shared the journey of “Shop Intelligence”—Shopify’s system for extracting structured data from millions of highly customized, non-standard merchant stores.

    His talk perfectly illustrated a lesson that I have learned and am applying at work: being an Agent Boss requires intense pragmatism.

    Given enough attempts, compute, and liberty, an advanced AI could probably solve the majority of the technical issues
    we face. But at what cost? If a task costs more to execute than the value it generates, it ceases to be a practical solution. Are we building Lamborghinis to do the job of a tow truck?

    Image from https://cleantechnica.com/2023/12/24/texas-dps-is-less-amused-with-the-model-y-superheavy-than-we-are/

    Shopify’s journey highlighted this exact tension:

    1. The “One-Shot” Wall
      Initially, Shopify used single, large API calls to OpenAI models to analyze store content. While this worked for
      simple cases, it was unsustainable. Not only was the cost of processing millions of stores astronomical, but a single
      prompt couldn’t handle the messy reality of diverse store layouts.
    2. Moving to the ReAct Loop
      The first major shift was moving from a static “blob” of text to an autonomous agent using a ReAct (Reasoning and
      Action) loop. Instead of guessing based on a single snapshot, the agent could explore the store—deciding which pages
      to visit and which data points were missing.
    3. The Swarm of Sub-Agents
      To manage the complexity of exploration, they broke the “do-it-all” agent into a swarm of specialized sub-agents. One
      agent might focus on brand identity, while another focuses on product categorization. This modularity doubled their
      precision, but relying on third-party APIs for this many distinct agent interactions was still prohibitively
      expensive.
    4. Self-Hosting and Compiling with DSPy
      The final, most impressive step was using DSPy to programmatically optimize the entire pipeline so they could bring it
      in-house. Instead of manually tuning prompts for OpenAI, they treated their workflows as code that could be compiled
      against specific metrics. This optimization allowed them to move away from third-party APIs entirely. By spinning up their own H100 clusters and
      leveraging self-hosted Qwen models, the economics of the system fundamentally changed. The results were incredible:
    • 75x Cost Reduction: The combination of an optimized agentic architecture and self-hosted models drove the cost down
      by a factor of 75, while actually improving data quality.
    • Universal Coverage: This efficiency allowed Shopify to scale the system to analyze every single store on the
      platform, something that would have been financially impossible on a per-token API model.
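
    The ReAct loop from step 2 can be sketched without any model at all. In this minimal Python example the "reasoning" step is a scripted function standing in for the LLM, and the store pages are made up; it shows only the shape of the loop (reason, act, observe, repeat), not Shopify's system or DSPy's API.

```python
from typing import Callable

# Hypothetical store pages standing in for a real merchant site.
PAGES = {
    "/": "Welcome to Acme Outdoors. See /about and /products.",
    "/about": "Acme Outdoors sells camping gear.",
    "/products": "Tents, stoves, sleeping bags.",
}


def fetch(path: str) -> str:
    """Tool: return the text of one page (empty string if it does not exist)."""
    return PAGES.get(path, "")


def scripted_policy(notes: dict[str, str]) -> tuple[str, str]:
    """Stand-in for the LLM 'reason' step: pick the next action from what is known."""
    for path in ("/", "/about", "/products"):
        if path not in notes:
            return ("fetch", path)
    return ("finish", "")


def react(policy: Callable[[dict[str, str]], tuple[str, str]],
          max_steps: int = 10) -> dict[str, str]:
    """Alternate reasoning (policy) and acting (tool call) until the policy finishes."""
    notes: dict[str, str] = {}
    for _ in range(max_steps):
        action, arg = policy(notes)
        if action == "finish":
            break
        notes[arg] = fetch(arg)  # the observation feeds the next reasoning step
    return notes


print(sorted(react(scripted_policy)))  # ['/', '/about', '/products']
```

    The max_steps cap is where the cost tension shows up: every extra iteration is another model call, so the budget for exploration has to be an explicit design decision.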

    Raghavan’s core message was that “architecture compounds.”
    Success didn’t come from a single breakthrough, but from the steady evolution from one-shot API calls to a self-hosted swarm of sub-agents, all programmatically optimized for the real world.

    I am also applying this approach in my own work. We started using agentic workflows for basic debugging and have since built on those guardrails and architecture to allow agents to handle more complex tasks, like background agents performing code implementation and self-improvement. While we haven’t found a need to self-host models yet, switching to cheaper models (like MiniMax or AWS Nova Lite) for basic tasks has helped manage costs and made the system much more practical for everyday use.

    GitHub Agentic Workflows – Peli de Halleux

    I find GitHub Agentic Workflows to be an easy and practical way for engineering teams already using GitHub to adopt agentic workloads. This is especially true for large enterprises, which often have more restrictions and compliance requirements to follow. Also, applying AI in a more controlled environment such as a CI/CD process should have less severe consequences when things go sour than a customer-facing agent would.

    The Agent Factory in Action

    Peli described a setup that feels a bit wild but makes total sense: AI writing software that writes AI software. In Peli’s Agent Factory, they’ve stopped trying to build one giant, monolithic agent. Instead, they have over 100 little “bite-sized” workflows doing highly specific jobs.

    Of course, when you have over 100 agents running around a repo, no human can read all their outputs. Peli’s solution to this is to just add more agents. They use “meta-agents” whose entire job is to watch the other agents and make sure they are behaving.

    screenshot from https://github.github.com/gh-aw/blog/2026-01-12-welcome-to-pelis-agent-factory/

    Something that ties back perfectly to Drew Breunig’s talk: GH AW has moved to a spec-only contribution model. This is the whenword experiment playing out at an enterprise scale. If you want to contribute to the Agent Factory, you don’t write code. You write a Markdown file, a spec, that describes exactly what you want the agent to do. The system then “compiles” that natural language into a secure GitHub Action.

    It’s an interesting look at the future. Seeing a giant like GitHub bet so heavily on spec-driven contributions makes
    me think this isn’t just a neat trick—it’s probably the only way we are going to safely manage these systems at scale.

    I managed to speak with Peli after his talk to get some extra tips and tricks for improving agentic CI/CD workflows. He recommended checking out qmd (https://github.com/tobi/qmd) as a code wiki. We also briefly discussed how to avoid reward hacking in the CI/CD process—a conversation that eventually inspired me to write my post on

    Specs vs. Code for Security

    This also made me think a lot about security. In a normal open-source or enterprise repo, you have to read every line of code to make sure someone hasn’t introduced a vulnerability. But with spec-driven development, you’re just auditing the intent. Because the actual implementation is generated within strict, pre-defined guardrails, a spec-driven contribution might actually be way more secure than a human-written one.

    Client side Web AI Agents for the agentic internet of the future – Jason Mayes

    Find out more https://github.com/jasonmayes/WebAIAgent

    I caught the final session of the conference, which focused on client-side Web AI agents. The speaker used flight booking as an example. Finding a flight usually requires navigating rigid, static filters—picking dates, checkboxes, and price sliders. The demo showed how a user could instead use voice to surface flights, with the UI changing to fit the user’s intent. This shift to client-side LLMs highlights a few distinct points for running AI in production:

    • Latency: Processing the prompt locally on the device’s hardware removes the round-trip delay to cloud APIs.
    • Task-Appropriate Models: As the Shopify talk noted, you don’t need a state-of-the-art model for every single task. A webpage doesn’t require a massive, generalized model; it just needs one capable of mapping user intent to specific local functions.
    • Production Economics: Moving inference to the client device removes the cloud infrastructure costs of running agents at scale.

    This architecture could change how we approach the specificity paradox in web design. Currently, developers and designers spend time building custom user experiences for every edge case, trying to predict what a user might do next. With native cognitive capability in the browser, websites could simply expose their tools and data structures via protocols like WebMCP, allowing a local model to parse the page and handle the specific workflow a user requests in that moment. This also points to a practical reason behind Google’s open-source strategy with Gemma 4. If lightweight models are going to run natively within the browser environment (like Chrome’s built-in AI architecture), the model weights must live on local consumer devices. Making Gemma open-weight aligns with a framework where rendering and agent orchestration happen entirely on the client side.

    Closing thoughts

    AgentCon was timely and highly relevant. It’s clear that being an “Agent Boss” will be a mandatory skillset in the AI
    era. This means taking responsibility for the agent’s environment—whether that’s a secure sandbox, a CI/CD pipeline,
    or a web browser. Ultimately, our success will be defined by how well we provide the right context and guardrails to
    turn autonomous actions back into human-led intent.

  • Field Report – AgentCon Silicon Valley 2026 (Part 1)

    Field Report – AgentCon Silicon Valley 2026 (Part 1)

    AgentCon Silicon Valley is a free, one-day, in-person conference for developers building with AI agents.

    One of the perks of living in the Bay Area is that every week there is a tech conference worth going to. Last week, I attended AgentCon Silicon Valley at the Computer History Museum in Mountain View, California, which, by the way, is a fantastic venue. Because there was so much great content to digest, I’m breaking my report into two parts to stay within a reasonable “context window” for an article.

    The Conference

    AgentCon 2026 took place on May 4th (yes, there were lots of Star Wars references). It is a small to mid-size conference with two to three concurrent tracks. The key sponsors were:

    The conference was great. The majority (if not all) of the speakers were engineers and developers, and the content was all very applicable to my daily work. The full schedule can be found on the event’s page.

    Altogether I went to 9 talks. I am very glad that there were always seats available and that attendance was pretty evenly distributed between concurrent talks.

    These are the talks I attended:

    Part 1 (this entry, morning session)

    • Will The Real Autonomous Agent Please Stand Up – Patrick Chanezon, Dona Sarkar
    • Your agent needs a sandbox, not a desert – Samuel Colvin
    • How to Build Auditable Agents Using Context Graphs – Nyah Macklin

    Part 2 (afternoon session)

    • Agents Don’t Know What They Don’t Know – Rob Zuber
    • Lessons from a No-Code Library – Drew Breunig
    • Securing Coding Agents: Sandboxes, Guardrails, and Real-World Attacks – Dan Ndombe
    • From One-Shot to Agentic: Optimizing Shop Intelligence with DSPy – Kshetrajna Raghavan
    • GitHub Agentic Workflows – Peli de Halleux
    • Client side Web AI Agents for the agentic internet of the future – Jason Mayes

    Will The Real Autonomous Agent Please Stand Up – Patrick Chanezon, Dona Sarkar

    The session was delivered by two excellent speakers, and I would have loved to hear them speak longer.

    To open AgentCon, Dona Sarkar shared two key concepts that framed the rest of the day. She spoke about the evolution of becoming an ‘Agent Boss’ and, at the same time, reminded us to check if we’re building real AI innovation or simply settling for ‘faster horses.’ Building faster horses is fine if that frees us up to work on transformative AI. These two thoughts really resonated throughout the sessions.

    “If I had asked people what they wanted, they would have said faster horses.” – Not Henry Ford.

    Dona also explored the various form factors AI will manifest in. Her talk was funny, entertaining, and highly informative.

    Patrick Chanezon

    Unfortunately, due to a work matter, I missed the first half of his talk. However, in the portion I caught, he spoke about the shifting roles of ICs and Managers in the agentic evolution. Doubling down on the theme of “Agent Bosses,” he explained that an IC’s success now depends on how effectively they can oversee their agents. He also referenced a key ACM article suggesting that as AI automates entry-level tasks, the industry must adopt a “preceptor” model to ensure junior developers still gain the critical judgment needed to become the next generation of senior engineers. To be successful in this new landscape, you will need the following skills:

    (taken from https://www.youtube.com/watch?v=0HI3OIi-YJY)

    Your agent needs a sandbox, not a desert – Samuel Colvin

    Samuel Colvin (the creator of Pydantic) introduced Monty, a Rust-based Python interpreter designed for safe agentic use. Unlike CPython, it operates on a “deny-all” security model, starting with zero capabilities. Because it is an interpreter, it boasts microsecond startup times. It’s an ideal tool for basic Python tasks, text manipulation, and math within a secure environment.

    How to Build Auditable Agents Using Context Graphs – Nyah Macklin

    Nyah discussed leveraging Neo4j to build context and memory graphs. This was one of my favorite talks because it was immediately applicable to my current projects. I am so happy to see Neo4j releasing these tools as completely open-source rather than “fauxpen” source.

    If you are serious about becoming an “Agent Boss,” a scalable, distributed context and memory system is a must. Research (such as the CommGPT paper) and practical application consistently show that you can achieve significantly better performance by providing a robust Knowledge Graph and RAG system rather than relying solely on fine-tuning.

    Key Takeaway: The primary advantage of GraphRAG over traditional vector-based RAG is its inherent ability to map and understand complex, interconnected relationships within data. In her talk, she mentioned that while neo4j-labs/agent-memory uses semantic search for core retrieval, it leverages Knowledge Graph structures for organization, deduplication, and context assembly.
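
    To show what "context assembly" over relationships can look like, here is a toy Python sketch that walks a tiny in-memory graph and collects the facts within a few hops of a starting entity. The graph contents and the related_facts function are invented for illustration; this is not the Neo4j or agent-memory API.

```python
from collections import deque

# Hypothetical toy graph: entity -> list of (relation, entity) edges.
GRAPH = {
    "order-42": [("placed_by", "alice"), ("contains", "tent")],
    "alice": [("member_of", "team-ops")],
    "tent": [("supplied_by", "acme")],
}


def related_facts(start: str, max_hops: int = 2) -> list[str]:
    """Breadth-first walk that collects relationship facts within max_hops of start."""
    facts: list[str] = []
    seen = {start}  # deduplication: each entity is expanded at most once
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for relation, target in GRAPH.get(node, []):
            facts.append(f"{node} {relation} {target}")
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return facts


for fact in related_facts("order-42"):
    print(fact)
```

    A pure vector search would have to hope that "acme" happens to be textually similar to "order-42"; following edges finds the supplier because the relationship is stored explicitly.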

    Closing Thoughts on Part 1

    The morning sessions at AgentCon made one thing clear: we are moving past the “AI as a chatbot” phase and into the “AI as a workforce” era. Becoming an Agent Boss isn’t just a catchy phrase; it’s a fundamental shift in how we think about code, memory, and security. Whether it was the security of Rust-based sandboxes or the structural power of Knowledge Graphs, the bar for “real innovation” is being set higher every day.

    I’m still processing the implications for the future of the engineering profession, but the transition toward an Agentic SDLC is clearly well underway.

    Coming Next in Part 2

    In the next entry, I’ll dive into the remaining six talks from the afternoon tracks, focusing on the practical “how-to” of securing and optimizing agents.

    Here’s what I’ll be covering:

    • Agents Don’t Know What They Don’t Know: Handling uncertainty with Rob Zuber.
    • Lessons from a No-Code Library: Drew Breunig on simplifying complexity.
    • Securing Coding Agents: A deep dive into guardrails and real-world attacks.
    • Optimizing Shop Intelligence: Using DSPy to move beyond one-shot prompts.
    • GitHub Agentic Workflows: How the industry giants are orchestrating agents.
    • Client-side Web AI Agents: The future of the agentic internet.

    Stay tuned—Part 2 will be live shortly!

  • Notes on Autonomous Development – Prevent AI Reward Hacking In Gitlab CI/CD

    Notes on Autonomous Development – Prevent AI Reward Hacking In Gitlab CI/CD

    “When a measure becomes a target, it ceases to be a good measure.” – Goodhart’s Law

    I frequently run autonomous agents in the background to handle development tasks. I am using a Trunk-Based Development model where the agents are able to merge to the trunk via auto-merge when the CI/CD pipeline passes. However, I have observed agents attempting to “reward hack”: skipping tests or bypassing checks to satisfy their goal of completing a task.

    To ensure the integrity of my CI/CD pipeline, I have implemented the following guardrails.

    Enforcing “Pipeline Must Succeed”

    This is a baseline requirement, but it is insufficient on its own. There is nothing stopping an agent from removing or editing a test suite to ensure the pipeline passes, thereby triggering an undeserved merge.

    Utilizing CODEOWNERS

    The CODEOWNERS file assigns ownership to specific files or directories. By combining this with branch protection rules, you can ensure that any changes to critical files require manual approval from a human owner.

    It is vital to include self-protection rules. You must prevent the agent from modifying the CODEOWNERS file itself, as well as any related CI configuration files. If an agent can edit your CI YAML, it can simply “silence” the test steps. Inside your CODEOWNERS file, you should add:

    CODEOWNERS @your-username
    very_important_tests.py @your-username
    .gitlab-ci.yml @your-username
    # Include any other nested CI includes or scripts
    ci/scripts/* @your-username
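
    As a complementary check, a small script can flag merge requests that touch the protected paths before a human even looks at them. The sketch below is an assumption-laden example: the changed-file list is hard-coded, protected_changes is a made-up helper, and fnmatch only approximates CODEOWNERS pattern matching (its * also crosses directory separators).

```python
from fnmatch import fnmatch

# Patterns mirroring the CODEOWNERS entries above.
PROTECTED = ["CODEOWNERS", "very_important_tests.py", ".gitlab-ci.yml", "ci/scripts/*"]


def protected_changes(changed_files: list[str]) -> list[str]:
    """Return the changed paths that match a protected pattern."""
    return [f for f in changed_files if any(fnmatch(f, p) for p in PROTECTED)]


# Hypothetical list of files touched by a merge request.
changed = ["src/app.py", ".gitlab-ci.yml", "ci/scripts/deploy.sh"]
print(protected_changes(changed))  # ['.gitlab-ci.yml', 'ci/scripts/deploy.sh']
```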

    Using Pipeline Execution Policies

    If we look at the industries with the highest stakes for software engineering (e.g., aviation and space), external verification becomes very important. Relying on a single repository to guard itself is starting to feel like a loop that’s too easy to break. It’s either insufficient or just really inefficient to manage.

    GitLab’s Pipeline Execution Policies allow teams to enforce mandatory, immutable CI/CD jobs across specific projects. These policies ensure that critical validation gates cannot be bypassed or modified by an autonomous agent, as the configuration lives outside the agent’s reach.

    Furthermore, pipeline execution policy jobs can be assigned to one of the two reserved stages:

    • .pipeline-policy-pre: Runs at the very beginning of the pipeline (before the .pre stage). This is ideal for security scans or IaC (Infrastructure as Code) validation to prevent unwanted code from executing.
    • .pipeline-policy-post: Runs at the very end (after the .post stage). This is the place for integration tests, ensuring test coverage levels are maintained, and preventing “spec drift.”
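
    One check that fits the .pipeline-policy-post stage is a guard against silently disappearing tests. The Python sketch below compares two JUnit-style reports and fails if the merge request runs fewer tests than the baseline. The report layout, the inline XML, and the function names are assumptions for illustration; how the baseline report is obtained is left out.

```python
import xml.etree.ElementTree as ET


def count_tests(junit_xml: str) -> int:
    """Count <testcase> elements that actually ran (not skipped) in a JUnit report."""
    root = ET.fromstring(junit_xml)
    return sum(1 for case in root.iter("testcase") if case.find("skipped") is None)


def gate(baseline_xml: str, current_xml: str) -> bool:
    """Pass only if the merge request runs at least as many tests as the baseline."""
    return count_tests(current_xml) >= count_tests(baseline_xml)


# Hypothetical reports: the agent "fixed" the pipeline by skipping a failing test.
baseline = "<testsuite><testcase name='a'/><testcase name='b'/></testsuite>"
current = (
    "<testsuite><testcase name='a'/>"
    "<testcase name='b'><skipped/></testcase></testsuite>"
)
print(gate(baseline, current))  # False
```

    Because the job is defined in the policy rather than in the repository, the agent cannot edit this check away the way it could edit a test file.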

    Other Mechanisms and Conclusion

    There are several other tools to enhance CI/CD verification that are worth exploring:

    • External Status Checks: Requiring a “green light” from an external service.
    • Webhooks: Triggering secondary validation layers.
    • Scan Result Policies: Blocking merges if new vulnerabilities are detected.
    • Push Rules: Prohibiting specific file changes or naming conventions.

    Software development is evolving, and our CI/CD practices must evolve with it. We have moved from simple “build and test” routines to a world where we are governing autonomous intelligence. It is a challenging, yet incredibly exciting time to be a software engineer.

  • AI, Code, and Verification: A Simple Trick for Accurate Results

    AI, Code, and Verification: A Simple Trick for Accurate Results

    TLDR

    • LLMs can be terrible at math or at generating responses that require precision.
    • A simple rule is to ask the LLM to generate code to do the math instead of using its answer. This can be achieved with a simple prompt like –
      When asked to do any calculations or conversions, always generate code and run it instead of generating a response immediately

    Hallucination

    It’s a known problem that AIs “hallucinate,” especially when you need a precise answer – like doing math or counting.

    This was famously exposed when earlier generation LLMs got stumped by ‘gotcha’ questions like, “How many ‘r’s are in strawberry?”, which showed they weren’t really thinking. While most advanced models today have now learned to answer that question correctly, this isn’t necessarily because they’ve learned to reason, but because they have been specifically trained or prompted to patch that obvious flaw.

    Taken from https://www.reddit.com/r/singularity/comments/1enqk04/how_many_rs_in_strawberry_why_is_this_a_very/


    While this shows progress, it also reveals that their accuracy can be a result of targeted training rather than innate computational ability.

    This exact issue resurfaced for me with a more practical, real-world problem – and this is what I am doing now to prevent it!

    Feeling Lazy

    I was debugging an issue in MongoDB and had a seemingly simple task: convert a MongoDB ObjectId, 6616b9157bac1647326e11e1, into a human-readable timestamp.

    For those who are unfamiliar with MongoDB ObjectIds, or who have been using MongoDB without being aware of this: a MongoDB ObjectId is a 12-byte value that includes a 4-byte timestamp in its initial segment. This timestamp represents the number of seconds that have passed since the Unix epoch (January 1, 1970). (see docs)
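
    The conversion itself takes only a few lines. Here is a minimal Python sketch using just the standard library; objectid_timestamp is a name chosen for this example, and it relies only on the layout described above (the first 4 bytes, i.e. the first 8 hex characters, are seconds since the epoch).

```python
from datetime import datetime, timezone


def objectid_timestamp(object_id: str) -> datetime:
    """Decode the creation time stored in the first 4 bytes of a MongoDB ObjectId."""
    if len(object_id) != 24:
        raise ValueError("an ObjectId is 24 hexadecimal characters")
    seconds = int(object_id[:8], 16)  # first 8 hex chars = 4-byte big-endian seconds
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


print(objectid_timestamp("6616b9157bac1647326e11e1").isoformat())
```

    If PyMongo is installed, bson.ObjectId exposes the same value through its generation_time property, so there is no need to hand-roll this in application code.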

    The Hallucination

    And… it wasn’t just an answer—ChatGPT delivered it with the full swagger of a lead engineer who’s 100% sure of themselves. It laid out the whole thing step-by-step, explaining the ID format, how it pulled the timestamp, and all that.

    The correct answer should have been 2024-04-10T16:06:45.000Z

    The timestamp it gave me seemed legit at first since it was the right day. But something felt off; the time seemed to be off by a few hours. Thank goodness I listened to that little voice in my head and ran the conversion myself. Sure enough, ChatGPT was wrong!

    Not Just ChatGPT

    Curious, I tried the same prompt with Grok, Gemini, and Claude. The results were a mixed bag of confidently incorrect answers. This experience was a stark reminder that while the most obvious flaws are being patched, the underlying weakness in performing novel, precise conversions still persists.


    The Better Approach: Ask for the Code, Not the Answer

    This brings me to the core lesson I learned from this: instead of asking an LLM for the final answer, ask it to write code to produce the answer. My experience with Cursor was a perfect example. While the answer in its chat was wrong, it also provided a code snippet.

    Always ask for code!

    That code was the correct path. This approach plays to the AI’s strengths, shifting the task from a weak point (calculation) to a strong point (code generation). Ideally, the model would then execute that code in a sandboxed environment to provide a verified result.

    That’s right!

    A Simple Rule

    Here’s a simple rule: if it involves math or a conversion, always ask the LLM to write code.

    Here is a short example of how to do that with a simple prompt –

    When asked to do any calculations or conversions, always generate code and run it instead of generating a response immediately.

    This too works for counting “R”s =)
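
    For the record, this is the kind of code the rule should produce for the strawberry question. It is a trivial Python sketch, and count_letter is just a name picked for the example.

```python
def count_letter(text: str, letter: str) -> int:
    """Count occurrences of a single letter, ignoring case."""
    return text.lower().count(letter.lower())


print(count_letter("strawberry", "R"))  # 3
```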